Wine Quality Exploration by Alona Varshal

Having worked previously on wines, I chose the red and white wines data sets. After removing the variable “X” from each, I added a variable “type” (1 for red wines, 0 for white wines) and combined the two using the rbind() command. The new data set, wines, has 13 features, including type and quality, which were both converted from integer to factor variables. The goal of this exploratory analysis is to determine which features best separate the different quality values of the wines.
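The combining step can be sketched roughly as follows. The two small data frames are made-up stand-ins for the CSV files (the real ones would be loaded with read.csv()), and the white-wine values are illustrative only.

```r
# Toy stand-ins for the red and white wine files (values are illustrative,
# not taken from the real data sets).
red <- data.frame(X = 1:3,
                  alcohol = c(9.4, 9.8, 9.8),
                  quality = c(5L, 5L, 5L))
white <- data.frame(X = 1:2,
                    alcohol = c(8.8, 9.5),
                    quality = c(6L, 6L))

# Drop the row-index column "X" from each data set.
red$X <- NULL
white$X <- NULL

# Flag the wine type before stacking: 1 = red, 0 = white.
red$type <- 1
white$type <- 0

# Stack the rows; rbind() needs the same column names in both data frames.
wines <- rbind(red, white)

# Store type as a factor rather than a number.
wines$type <- factor(wines$type)

str(wines)
```

Adding the type column before calling rbind() matters, because once the rows are stacked there is no other column that records which file each row came from.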

Univariate Plots Section

## [1] 6497   14

There are 13 actual variables. The fourteenth, quality.f, was created from the variable quality as an ordered factor. Two of the 13 variables (type and quality) are output variables.

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"             
## [13] "type"                 "quality.f"

## 'data.frame':    6497 obs. of  14 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ type                : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ quality.f           : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...

## [1] "3" "4" "5" "6" "7" "8" "9"

## [1] "0" "1"
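As a minimal sketch of how an ordered factor like quality.f can be built, the following uses the first ten quality scores from the str() output above. Fixing the levels to 3 through 9 is an assumption made here to match the levels printed above.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Fix the level set to 3..9 so that scores missing from a subset
# still keep their place in the ordering.
quality.f <- factor(quality, levels = 3:9, ordered = TRUE)

levels(quality.f)            # the seven levels, as character strings
as.integer(quality.f)        # position of each score within the levels
quality.f[4] > quality.f[1]  # ordered factors support comparisons
```

The integer codes are positions within the levels, not the scores themselves, which is why a score of 5 shows up as 3 in the str() output.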

Most of the data come from white wines (1599 red wine and 4898 white wine observations). Features are physicochemical tests of wines. The following are summaries for each variable (except type):

Fixed acidity is measured as gram tartaric acid per liter of wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.400   7.000   7.215   7.700  15.900

Volatile acidity is the amount in grams of acetic acid per liter of wine. High levels of this compound contribute to an unpleasant taste in wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2300  0.2900  0.3397  0.4000  1.5800

Citric acid, measured in g/L of the substance, adds “freshness” and flavor to wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2500  0.3100  0.3186  0.3900  1.6600

Residual sugar, measured in g/L, is the sugar left over after fermentation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   3.000   5.443   8.100  65.800

Chlorides are measured by the amount of sodium chloride per liter.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03800 0.04700 0.05603 0.06500 0.61100

Free sulfur dioxide, the portion of sulfur dioxide not bound to other compounds, is measured in mg per liter.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   17.00   29.00   30.53   41.00  289.00

Total sulfur dioxide is the sum of the free and bound sulfur dioxide. Sulfur dioxide is used in wine to prevent microbial growth and oxidation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     6.0    77.0   118.0   115.7   156.0   440.0

Density, measured in g/mL, is related to residual sugar and alcohol content.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9923  0.9949  0.9947  0.9970  1.0390

Most wines are between pH 3-4.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.110   3.210   3.219   3.320   4.010

Sulphates, measured by the amount of potassium sulphate per liter, are also used as an antimicrobial agent in wine, contributing to the amount of sulfur dioxide in wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4300  0.5100  0.5313  0.6000  2.0000

Alcohol is measured by % volume.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.30   10.49   11.30   14.90

Quality is an output variable: a score given by a human test panel, with possible values from 0 to 10, 10 being the best.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.818   6.000   9.000
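Summaries like the ones above can be produced for several columns at once. The sketch below uses three columns of the first ten red-wine rows from the str() output earlier, so its numbers will not match the full-data summaries.

```r
# First ten red-wine rows from the str() output, three columns only.
wines10 <- data.frame(
  fixed.acidity  = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  alcohol        = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# One six-number summary per column, collected into a matrix
# with one column per variable.
summaries <- sapply(wines10, summary)
print(summaries)
```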

White wines and red wines have different distributions in some of the variables. Because of this, it was necessary to analyze the reds separately from the whites.

Some variables have normal distribution and some don’t. It is surprising to see that in some cases, the variable has normal distribution in red wines but not in white wines (for example, chlorides). Citric acid is where reds and whites are most different in terms of distribution.

For some variables, transformation created a more normal distribution but for other variables, transformation didn’t change the distribution.

Residual sugar is not normally distributed. Transformation using log10 yields something like a bimodal distribution for white wines.
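To illustrate what a log10 transformation does to a right-skewed variable, here is a small sketch on made-up values (evenly spaced quantiles of a log-normal distribution, not the real residual sugar data). It shows only the mechanics and will not reproduce the bimodal shape seen in the white wines.

```r
# Made-up right-skewed values standing in for residual.sugar:
# 500 evenly spaced quantiles of a log-normal distribution.
sugar <- qlnorm(ppoints(500), meanlog = 1, sdlog = 1)

# On the raw scale the long right tail pulls the mean above the median.
c(mean = mean(sugar), median = median(sugar))

# log10 compresses the tail, so the mean and median nearly coincide.
log_sugar <- log10(sugar)
c(mean = mean(log_sugar), median = median(log_sugar))

# Bin counts on both scales; call plot(h_raw) or plot(h_log) to draw them.
h_raw <- hist(sugar, breaks = 30, plot = FALSE)
h_log <- hist(log_sugar, breaks = 30, plot = FALSE)
```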

Transformation of chlorides variable for the white wines made it a little more normally distributed.

Free sulfur dioxide for red wine does not look like a normal curve, and transformation didn’t result in a normal distribution.

Transformation for the total sulfur dioxide distribution was needed for red wines though it wasn’t needed for the white wines.

Calculating the ratio of free to total sulfur dioxide created a feature that is normally distributed.
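The ratio feature is a simple element-wise division. A minimal sketch, using the first ten red-wine rows from the str() output earlier (the variable name so2.ratio is chosen here for illustration):

```r
# Free and total sulfur dioxide (mg/L) for the first ten red-wine rows.
free.so2  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total.so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Fraction of the total sulfur dioxide that is free.
so2.ratio <- free.so2 / total.so2
round(so2.ratio, 2)
```

Because the free portion is part of the total, the ratio stays between 0 and 1, which removes much of the scale difference between wines with high and low total sulfur dioxide.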

Transformation of density didn’t change the distribution for all wines.

The pH of all wines has a normal distribution.

Transformation of sulphates created a little better distribution.

Alcohol distribution isn’t normal and transformation didn’t do much to improve the plot though red wines’ distribution seemed to be bimodal.

Boxplots of individual features

These plots show the spread of the data in another way, and show that all features have outliers (not all variables are shown).
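The points a boxplot draws as outliers can also be listed directly. The sketch below applies boxplot.stats() with its default rule (points more than 1.5 times the hinge spread beyond the hinges) to the residual sugar values of the first ten red-wine rows from the str() output. Whether the plots above used this default is an assumption.

```r
# Residual sugar (g/L) for the first ten red-wine rows.
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# boxplot.stats() returns the five numbers behind the box and whiskers,
# plus the points that fall outside the whiskers.
bs <- boxplot.stats(sugar)
bs$stats     # lower whisker, lower hinge, median, upper hinge, upper whisker

outliers <- bs$out
outliers
```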

Univariate Analysis

What is the structure of your dataset?

The resulting “wines” dataset has 6497 observations and 13 variables (14 including the ordered factor quality.f) after creating it from the red wine and the white wine data sets downloaded from the Udacity project site.

What is/are the main feature(s) of interest in your dataset?

The main features are quality, type, and probably alcohol. Quality was based on a human test panel. It can be seen from the histograms that most wines were classified as 5, 6 or 7. Only a few made the 8 and 9 classifications (only five 9’s, none of them reds). Thirty wines received the worst observed score, 3. None of the wines were classified as 0, 1, 2 or 10. From the boxplots, all features of wines have outliers. After removing some of the outliers in the plots, we can see a better distribution of the features in all of the wines, and that red wines usually have a different distribution from white wines.
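Counts like these come from a cross-tabulation of type against quality. A minimal sketch on made-up toy scores (not the real data), with type coded 1 for red and 0 for white as in this report:

```r
# Made-up toy scores; quality levels cover the full 0-10 scale
# so that unused scores show up as zero counts.
toy <- data.frame(
  type    = factor(c(1, 1, 1, 0, 0, 0, 0)),
  quality = factor(c(5, 6, 3, 6, 9, 7, 6), levels = 0:10)
)

# Rows are wine types, columns are quality scores.
counts <- table(toy$type, toy$quality)
counts

# Scores that no wine received.
unused <- colnames(counts)[colSums(counts) == 0]
unused
```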

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

It makes sense to think that the levels of all components of a wine can determine its quality. Therefore, I think that aside from alcohol, citric acid, volatile acidity (acetic acid content), sulphates and sulfur dioxide levels will contribute to the quality of wines. Some of the features in the data set are related, as will be seen in the bivariate section, so picking only one variable from each group of related variables might be a good idea.

Did you create any new variables from existing variables in the dataset?

To create some of the plots above, I made an ordered factor variable out of “quality”. I also created the ratio of free sulfur dioxide to total sulfur dioxide; its distribution is different from those of the individual features, as shown in the histogram for “free to total SO2 ratio” above. One can also calculate a ratio of citric acid to fixed acidity, but when I tried this, I obtained no new information.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Many of the features didn’t have a normal distribution, and transforming them created distributions that approach the normal curve, though not completely. Some didn’t change at all.

Fixed acidity has a long tail, so a transformation was applied. The resulting histogram is more normally distributed.

Volatile acidity is skewed to the right (its mean is above its median), and log10 transformation revealed the bimodal character of the distribution.

Residual sugar is not normally distributed. Transformation using log10 yielded something like a bimodal distribution.

Chlorides also don’t look normally distributed. Transformation made it look better, but it also revealed a bimodal distribution.

Free sulfur dioxide and total sulfur dioxide had non-normal distributions, and transformation didn’t help. But the ratio of the two has a normal distribution.

Transformation of density also didn’t make the distribution better.

Alcohol distribution isn’t normal and transformation didn’t change the distribution.

Regarding tidying, the data have no missing values, so no rows had to be removed or imputed. All I did was combine two csv files, create an ordered factor variable, and add the variable “type”.

Bivariate Plots Section

Correlation matrix for all wines (Spearman)

Correlation matrix for red wines (Spearman)

Correlation matrix for white wines (Spearman)

For red wines, quality is correlated to volatile acidity, sulphates and alcohol. For white wines, quality is correlated to chlorides, density and alcohol.
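A Spearman correlation matrix can be computed in base R with cor(). The exact call behind the matrices above is not shown in this report, so the following is only a sketch of one way to do it, using four columns of the first ten red-wine rows from the str() output. With so few rows the coefficients are not representative of the full data.

```r
# First ten red-wine rows from the str() output, four columns only.
red10 <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Spearman's rho works on ranks, so it suits an ordinal score like quality
# and is less sensitive to outliers than Pearson's r.
rho <- cor(red10, method = "spearman")
round(rho, 2)
```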

There are also correlations existing between some of the input variables.

For example, density and alcohol have a negative correlation. This makes sense: alcohol is less dense than water, so a mixture with more alcohol is expected to have a lower density than one with less alcohol. Density is also related to residual sugar. The more sugar there is, the denser the mixture.

pH would expectedly be correlated with fixed acidity, volatile acidity and citric acid. So are total and free sulfur dioxide.

So choosing the best features that help the most in classifying wines would be a good idea.

But it is surprising to me that the correlations are sometimes different for the two types of wine.

The following plots explore the correlation of quality and some of the input variables.

Red wine quality showed a negative correlation with volatile acidity while there was no clear relationship obtained for white wines.

Chlorides also show a negative correlation with quality for both wines.

Red wines show a positive correlation in terms of sulphates and quality while white wines do not.

Wine quality is influenced by alcohol content, but the plots below show that this is true only for wines with qualities of 5 and above. Perhaps some other factor or factors mask the effect of alcohol on quality. It might be good to look at observations with quality values of 3 to 5 and see which feature dominates.

Density follows almost the same behavior (though inversely) as alcohol for the reasons I have mentioned before.

The following shows in detail the correlations among the features.

Citric acid is undoubtedly correlated to pH:

Residual sugar understandably increases density:

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Initially, I found it hard to identify correlations when all types of wine were analyzed together. Learning how to create the correlation matrix above made it easier to pinpoint which input variables can contribute to the classification of wine quality. These are alcohol, volatile acidity, chlorides and density.

But since the two types of wine behave differently in several features, I also did the analysis separately. Using a correlation matrix again, I was able to find which input variables correlate with quality in each type. As mentioned above, quality correlates with volatile acidity, sulphates and alcohol for red wines, and with chlorides, density and alcohol for white wines. These have the highest correlation coefficients (Spearman method).
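One way to compare the two types side by side is to split the data by type and compute Spearman's rho between quality and each input. The sketch below uses made-up toy rows (not the real wines), so the resulting numbers carry no meaning beyond showing the shape of the output.

```r
# Made-up toy data: type coded 1 = red, 0 = white, two inputs, one score.
toy <- data.frame(
  type      = factor(c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)),
  alcohol   = c(9.4, 9.8, 10.0, 10.5, 11.2, 8.8, 9.5, 10.1, 11.0, 12.4),
  chlorides = c(0.076, 0.098, 0.065, 0.071, 0.069,
                0.045, 0.049, 0.050, 0.058, 0.032),
  quality   = c(5, 5, 6, 6, 7, 6, 5, 6, 7, 7)
)

inputs <- c("alcohol", "chlorides")

# For each wine type, correlate every input column with quality.
rho_by_type <- sapply(split(toy, toy$type), function(d) {
  sapply(d[inputs], cor, y = d$quality, method = "spearman")
})

# Rows are inputs, columns are the type codes "0" and "1".
round(rho_by_type, 2)
```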

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The correlation matrix shows which input variables are correlated to another input variable. It is really surprising for me to find that sometimes, a pair of input variables may be correlated in red wines and not in white wines and vice versa. I plotted the ones that are correlated in both wines, except for the citric acid vs. fixed acidity, above.

What was the strongest relationship you found?

The strongest relationship between the output variable (quality) and an input variable is with alcohol content, in both wines.

Among input variables, the strongest pairs are density and alcohol; free and total sulfur dioxide; and density and residual sugar. The following pairs are correlated to a lesser extent than the pairs just mentioned:

pH and fixed acidity
citric acid and fixed acidity
chlorides and density
citric acid and volatile acidity (for red wines only)
citric acid and pH (reds only)
sulphates and volatile acidity (reds only)
alcohol and residual sugar (whites only)
total sulfur dioxide and density (whites only)
alcohol and chlorides (whites only)
alcohol and total sulfur dioxide (whites only)

Multivariate Plots Section

Scatter plots of alcohol vs. other input variables in red and white wines

Classification of all wine by quality:

Analyzing all wines using alcohol, chlorides and volatile acidity:

(Using density did not create a better plot.)

Classification of red wines by quality:

From the correlation matrix above, red wine quality is influenced more by volatile.acidity, alcohol and sulphates.

Red wines are more clearly separated than white wines by these variables.

Combining all three variables to classify the quality of red wines only:

Since volatile acidity is related to pH (in theory):

Classification of white wines by quality:

To see how white wine quality is influenced by chlorides, density and alcohol, I plot the following.

It seems like white wines are not as properly separated by these variables.

Since residual sugar is highly correlated with density in white wines:

Another attempt using residual sugar/chlorides ratio:

The paper from which the data came (Cortez et al.) mentions that sulphates contributed the most to the classification (using support vector machines):

It looked like this has a better effect in separating the qualities of white wines.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Of all the input variables, alcohol content and volatile acidity and total sulfur dioxide probably best separate red wines from white wines, judging from the scatter plots above. pH and residual sugar definitely cannot determine whether a wine is red or white.

For red wines, using the ratio of the variables that had the highest correlation coefficient with quality increased the separation of the qualities of red wines. It was harder for white wines. The correlation of density with residual sugar helped by swapping the density with residual sugar in the classification of white wines. Other swaps can be done in a similar fashion to see if the qualities of wine can be separated in the two-dimensional plot. Maybe if using a three-dimensional plot, a better classification can be obtained. Separation is easier to view by just looking at the colors of quality values of 5, 6, and 7. Perhaps because these were most represented in the data, they were better classified in the plot.
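A ratio feature can be checked quickly by comparing its median across quality values before plotting it. The sketch below builds a residual sugar to chlorides ratio on the first ten red-wine rows from the str() output. The column name sugar.chlorides is chosen here for illustration, and ten rows are far too few to draw conclusions from.

```r
# First ten red-wine rows from the str() output, three columns only.
red10 <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  chlorides      = c(0.076, 0.098, 0.092, 0.075, 0.076,
                     0.075, 0.069, 0.065, 0.073, 0.071),
  quality        = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Derived feature: grams of residual sugar per gram of sodium chloride.
red10$sugar.chlorides <- red10$residual.sugar / red10$chlorides

# Median of the ratio within each quality value; clearly different medians
# suggest the feature may help separate the qualities in a plot.
med_by_quality <- tapply(red10$sugar.chlorides, red10$quality, median)
med_by_quality
```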

Were there any interesting or surprising interactions between features?

It is surprising that residual sugar, which did not correlate strongly with quality, still helped to classify the qualities of white wines. It is also interesting that swapping a feature for one it correlates with seems to improve the classification.

Final Plots and Summary

Plot One

Description One

In the wines data set, most of the wines fall under the quality values of 5, 6, and 7. There are only a few 9’s and all are white wines.

Plot Two

Description Two

Some of the input variables are correlated to each other such that using one or the other in the final analysis in classifying wine qualities can be done. For the plot above, since density is correlated to residual sugar, one can use either residual sugar or density in the final multivariate analysis of wine quality.

Plot Three

Description Three

Features that can separate qualities of wines are alcohol content, chlorides, and volatile acidity. Features that distinguish red wines from white wines are alcohol and total sulfur dioxide or volatile acidity. To classify red wines by quality, the features that contribute most are alcohol, volatile acidity and sulphates. To classify white wines by quality, the features that contribute most are alcohol, chlorides and density or residual sugar.

Reflection

At first, I chose just the red wine data set, but then thought, why not analyze it together with the white wine. So I obtained the white wine data set, but to my disappointment, I had to resort to analyzing them separately, because the two types of wines are influenced by different factors. In the end, however, I was still able to obtain a multivariate plot that seems to classify all the wines.

It is really by trial and error that I found the best features that can separate the different qualities of wine in a plot. Since there are many features available, it was hard to choose, but by reading a little bit about the features and wine, my decisions were made a little easier, though I still did a lot of trial and error.

Suggestion:

Constructing 3-D plots might make it easier to see the separation among the observations in this project. I’m not sure how to produce them, but machine learning techniques would probably be better suited to analyzing this type of data.

References

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.

Archived: in R, how do I append two data files? https://kb.iu.edu/d/bcrr

Factor variables http://statistics.ats.ucla.edu/stat/r/modules/factor_variables.htm

Adding and removing columns from a data frame http://www.cookbook-r.com/Manipulating_data/Adding_and_removing_columns_from_a_data_frame/

Exploratory data analysis and data pre-processing: https://onlinecourses.science.psu.edu/stat857/print/book/export/html/224

Practical Winery & Vineyard Journal (Jan/Feb 2009): http://www.practicalwinery.com/janfeb09/page2.htm

Exploratory Data Analysis on Wine Quality by Bilal Mahmood https://rpubs.com/Bilal_Mahmood/EDA

Wine Quality Analysis: http://rstudio-pubs-static.s3.amazonaws.com/24803_abbae17a5e154b259f6f9225da6dade0.html

Correlation matrix http://www.cookbook-r.com/Graphs/Correlation_matrix/

An introduction to corrplot package https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html

Diamonds exploration by Chris Saden: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/diamondsExample.html

Knitr with R Markdown http://kbroman.org/knitr_knutshell/pages/Rmarkdown.html