Wine Quality Exploration by Alona Varshal

Having worked with wines before, I chose the red and white wine data sets. I combined the two using the rbind() command after removing the variable “X” and adding a variable “type” with a value of 1 for red wines and 0 for white wines. The new data set, wines, has 13 features, including type and quality, both of which were converted from integer to factor variables. The goal of this exploratory analysis is to determine which features best separate the different quality scores of the wines.
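
The combination step described above can be sketched roughly as follows. This is a minimal illustration, not the report's actual code: the two small data frames are hypothetical stand-ins for the red and white source tables (whose file names are not given here), and only a couple of columns are shown.

```r
# Toy stand-ins for the two source tables (hypothetical values).
reds <- data.frame(X = 1:3,
                   alcohol = c(9.4, 9.8, 9.8),
                   quality = c(5L, 5L, 6L))
whites <- data.frame(X = 1:4,
                     alcohol = c(8.8, 9.5, 10.1, 9.9),
                     quality = c(6L, 6L, 6L, 7L))

# Drop the row-index column "X" from each table.
reds$X <- NULL
whites$X <- NULL

# Tag each table with its wine type: 1 = red, 0 = white.
reds$type <- 1L
whites$type <- 0L

# Stack the rows; rbind() requires matching column names.
wines <- rbind(reds, whites)

# Convert type from integer to factor.
wines$type <- factor(wines$type)

dim(wines)          # rows and columns of the combined table
table(wines$type)   # number of observations per type
```

Adding the type column before stacking keeps the label attached to each row, so it does not depend on the row order after rbind().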

Univariate Plots Section

## [1] 6497   14

There are 13 actual variables. The 14th, quality.f, was created from the variable quality to make it an ordered factor variable. Two of the 13 variables (type and quality) are output variables.

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"             
## [13] "type"                 "quality.f"
## 'data.frame':    6497 obs. of  14 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ type                : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ quality.f           : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
## [1] "3" "4" "5" "6" "7" "8" "9"
## [1] "0" "1"
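
An ordered factor like quality.f, and the level listing shown above, could be produced along these lines. This is a sketch under assumptions: the quality vector below is made up for illustration and simply covers the scores 3 to 9.

```r
# Hypothetical quality scores standing in for wines$quality.
quality <- c(5L, 5L, 6L, 3L, 7L, 9L, 4L, 8L)

# Build an ordered factor so that 3 < 4 < ... < 9 is preserved
# in comparisons and in plot axes.
quality.f <- factor(quality, levels = sort(unique(quality)), ordered = TRUE)

levels(quality.f)             # the distinct scores, in increasing order
is.ordered(quality.f)         # confirms the factor carries an ordering
quality.f[1] < quality.f[3]   # ordered factors support comparisons
```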

Most of the data come from white wines (1599 red wine and 4898 white wine observations). The features are the results of physicochemical tests on the wines. The following are summaries for each variable (except type):

Fixed acidity is measured in grams of tartaric acid per liter of wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.400   7.000   7.215   7.700  15.900

Volatile acidity is the amount of acetic acid in grams per liter of wine. High levels of this compound contribute to an unpleasant taste.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2300  0.2900  0.3397  0.4000  1.5800

Citric acid, measured in g/L of the substance, adds “freshness” and flavor to wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2500  0.3100  0.3186  0.3900  1.6600

Residual sugar, measured in g/L, is the sugar left over after fermentation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   3.000   5.443   8.100  65.800

Chlorides are measured as grams of sodium chloride per liter.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03800 0.04700 0.05603 0.06500 0.61100

Free sulfur dioxide, the unbound portion of sulfur dioxide, is measured in mg per liter.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   17.00   29.00   30.53   41.00  289.00

Total sulfur dioxide is the sum of the free and bound sulfur dioxide. Sulfur dioxide is used in wine to prevent microbial growth and oxidation.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     6.0    77.0   118.0   115.7   156.0   440.0

Density, measured in g/mL, is related to residual sugar and alcohol content.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9923  0.9949  0.9947  0.9970  1.0390

Most wines have a pH between 3 and 4.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.110   3.210   3.219   3.320   4.010

Sulphates, measured as grams of potassium sulphate per liter, are also used as an antimicrobial agent in wine and contribute to the amount of sulfur dioxide in it.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4300  0.5100  0.5313  0.6000  2.0000

Alcohol is measured as percent by volume.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.30   10.49   11.30   14.90

Quality is an output variable: a score given by a human test panel, with possible values from 0 to 10, 10 being the best.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.818   6.000   9.000
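
Summaries like the ones above can be generated in one pass rather than one call per variable. The sketch below assumes a data frame named wines with the column names used in this report; the values are a small hypothetical sample, not the real data.

```r
# Small hypothetical sample; column names follow the report.
wines <- data.frame(
  alcohol = c(9.4, 9.8, 10.5, 11.3, 12.0),
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.39),
  type    = factor(c(1, 1, 0, 0, 0))
)

# Keep only the numeric columns (this drops the factor "type").
numeric_cols <- wines[sapply(wines, is.numeric)]

# One six-number summary per variable, returned as a named list.
summaries <- lapply(numeric_cols, summary)

summaries$alcohol   # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
```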

White wines and red wines have different distributions in some of the variables. Because of this, it was necessary to analyze the reds separately from the whites.

Some variables are approximately normally distributed and some are not. Surprisingly, in some cases a variable is normally distributed in red wines but not in white wines (chlorides, for example). Citric acid is the variable whose distribution differs most between reds and whites.
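
The report does not say how normality was judged. One possible check, shown here only as an illustration, is a Shapiro-Wilk test run separately for each wine type. The data below are synthetic (one roughly normal group, one right-skewed group) and are not the wine measurements; the names sim and p_values are hypothetical.

```r
set.seed(42)  # reproducible synthetic data, NOT the wine measurements

# Synthetic feature: roughly normal for "red", right-skewed for "white".
sim <- data.frame(
  value = c(rnorm(300, mean = 10, sd = 1),
            rlnorm(300, meanlog = 0, sdlog = 1)),
  type  = factor(rep(c("red", "white"), each = 300))
)

# Shapiro-Wilk p-value per type; a small p-value suggests a departure
# from normality. shapiro.test() accepts between 3 and 5000 values.
p_values <- tapply(sim$value, sim$type,
                   function(x) shapiro.test(x)$p.value)
p_values
```

A histogram per type, for example with hist(), gives the visual counterpart of the same comparison.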

For some variables, transforming the values produced a more nearly normal distribution; for others, the transformation made no noticeable difference.
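
The report does not name the transformation that was used. As an illustration only, the sketch below assumes a log10 transform and measures its effect with sample skewness on synthetic right-skewed data; the skewness helper is written out here and is not taken from the report.

```r
set.seed(1)  # synthetic right-skewed values, not the wine data
x <- rlnorm(1000, meanlog = 1, sdlog = 0.8)

# Sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

skewness(x)          # clearly positive for the raw values
skewness(log10(x))   # much closer to zero after the log transform
```

A variable that is already roughly symmetric would show little change under the same transform, which matches the mixed results described above.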