|
@@ -0,0 +1,863 @@
|
|
|
+#' What Makes A Good Wine?
|
|
|
+#' ========================================================
|
|
|
+#'
|
|
|
+#' In this project, a data set of red wine quality is explored based on its
|
|
|
+#' physicochemical properties. The objective is to find physicochemical properties
|
|
|
+#' that distinguish good quality wine from lower quality ones. An attempt to build
|
|
|
+#' linear model on wine quality is also shown.
|
|
|
+#'
|
|
|
+#' ### Dataset Description
|
|
|
+#' This tidy dataset contains 1,599 red wines with 11 variables on the chemical
|
|
|
+#' properties of the wine. Another variable attributing to the quality of wine is
|
|
|
+#' added; at least 3 wine experts did this rating. The preparation of the dataset
|
|
|
+#' has been described in [this link](https://goo.gl/HVxAzY).
|
|
|
+#'
|
|
|
+## ----global_options, include=FALSE---------------------------------------
|
|
|
+knitr::opts_chunk$set(fig.path='Figs/',
|
|
|
+ echo=FALSE, warning=FALSE, message=FALSE)
|
|
|
+
|
|
|
+#'
|
|
|
+## ----echo=FALSE, message=FALSE, warning=FALSE, packages------------------
|
|
|
+library(ggplot2)
|
|
|
+library(gridExtra)
|
|
|
+library(GGally)
|
|
|
+library(ggthemes)
|
|
|
+library(dplyr)
|
|
|
+library(memisc)
|
|
|
+
|
|
|
+#'
|
|
|
+#' First, the structure of the dataset is explored using ``summary`` and ``str``
|
|
|
+#' functions.
|
|
|
+## ----echo=FALSE, warning=FALSE, message=FALSE, Load_the_Data-------------
|
|
|
+wine <- read.csv("wineQualityReds.csv")
|
|
|
+str(wine)
|
|
|
+summary(wine)
|
|
|
+
|
|
|
+# Setting the theme for plotting.
|
|
|
+# theme_set(theme_minimal(10))
|
|
|
+
|
|
|
+# Converting 'quality' to ordered type.
|
|
|
+wine$quality <- ordered(wine$quality,
|
|
|
+ levels=c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
|
|
|
+# Adding 'total.acidity'.
|
|
|
+wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity
|
|
|
+
|
|
|
+#'
|
|
|
+#' **The following observations are made/confirmed:**
|
|
|
+#'
|
|
|
+#' 1. There are 1599 samples of Red Wine properties and quality values.
|
|
|
+#'
|
|
|
+#' 2. No wine achieves either a terrible (0) or perfect (10) quality score.
|
|
|
+#'
|
|
|
+#' 3. Citric Acid had a minimum of 0.0. No other property values were precisely 0.
|
|
|
+#'
|
|
|
+#' 4. Residual Sugar measurement has a maximum that is nearly 20 times farther
|
|
|
+#' away from the 3rd quartile than the 3rd quartile is from the 1st. There is
|
|
|
+#' a chance of a largely skewed data or that the data has some outliers.
|
|
|
+#'
|
|
|
+#' 5. The 'quality' attribute is originally considered an integer;
|
|
|
+#' I have converted this field into an ordered factor which is much more
|
|
|
+#' a representative of the variable itself.
|
|
|
+#'
|
|
|
+#' 6. There are two attributes related to 'acidity' of wine i.e. 'fixed.acidity'
|
|
|
+#' and 'volatile.acidity'. Hence, a combined acidity variable is added
|
|
|
+#' using ``data$total.acidity <- data$fixed.acidity + data$volatile.acidity``.
|
|
|
+#'
|
|
|
+#' ## Univariate Plots Section
|
|
|
+#' To lead the univariate analysis, I’ve chosen to build a grid of histograms.
|
|
|
+#' These histograms represent the distributions of each variable in the dataset.
|
|
|
+#'
|
|
|
+## ----echo=FALSE, warning=FALSE, message=FALSE, Univariate_Grid_Plot------
|
|
|
+g_base <- ggplot(
|
|
|
+ data = wine,
|
|
|
+ aes(color=I('black'), fill=I('#990000'))
|
|
|
+)
|
|
|
+
|
|
|
+g1 <- g_base +
|
|
|
+ geom_histogram(aes(x = fixed.acidity), binwidth = 0.25) +
|
|
|
+ scale_x_continuous(breaks = seq(4, 16, 2)) +
|
|
|
+ coord_cartesian(xlim = c(4, 16))
|
|
|
+
|
|
|
+g2 <- g_base +
|
|
|
+ geom_histogram(aes(x = volatile.acidity), binwidth = 0.05) +
|
|
|
+ scale_x_continuous(breaks = seq(0, 2, 0.5)) +
|
|
|
+ coord_cartesian(xlim = c(0, 2))
|
|
|
+
|
|
|
+g3 <- g_base +
|
|
|
+ geom_histogram(aes(x = total.acidity), binwidth = 0.25) +
|
|
|
+ scale_x_continuous(breaks = seq(0, 18, 1)) +
|
|
|
+ coord_cartesian(xlim = c(4, 18))
|
|
|
+
|
|
|
+g4 <- g_base +
|
|
|
+ geom_histogram(aes(x = citric.acid), binwidth = 0.05) +
|
|
|
+ scale_x_continuous(breaks = seq(0, 1, 0.2)) +
|
|
|
+ coord_cartesian(xlim = c(0, 1))
|
|
|
+
|
|
|
+g5 <- g_base +
|
|
|
+ geom_histogram(aes(x = residual.sugar), binwidth = 0.5) +
|
|
|
+ scale_x_continuous(breaks = seq(0, 16, 2)) +
|
|
|
+ coord_cartesian(xlim = c(0, 16))
|
|
|
+
|
|
|
+g6 <- g_base +
|
|
|
+ geom_histogram(aes(x = chlorides), binwidth = 0.01) +
|
|
|
+ scale_x_continuous(breaks = seq(0, 0.75, 0.25)) +
|
|
|
+ coord_cartesian(xlim = c(0, 0.75))
|
|
|
+
|
|
|
+g7 <- g_base +
|
|
|
+ geom_histogram(aes(x = free.sulfur.dioxide), binwidth = 2.5) +
|
|
|
+ scale_x_continuous(breaks = seq(0, 75, 25)) +
|
|
|
+ coord_cartesian(xlim = c(0, 75))
|
|
|
+
|
|
|
+g8 <- g_base +
|
|
|
+ geom_histogram(aes(x = total.sulfur.dioxide), binwidth = 10) +
|
|
|
+ scale_x_continuous(breaks = seq(0, 300, 100)) +
|
|
|
+ coord_cartesian(xlim = c(0, 295))
|
|
|
+
|
|
|
+g9 <- g_base +
|
|
|
+ geom_histogram(aes(x = density), binwidth = 0.0005) +
|
|
|
+ scale_x_continuous(breaks = seq(0.99, 1.005, 0.005)) +
|
|
|
+ coord_cartesian(xlim = c(0.99, 1.005))
|
|
|
+
|
|
|
+g10 <- g_base +
|
|
|
+ geom_histogram(aes(x = pH), binwidth = 0.05) +
|
|
|
+ scale_x_continuous(breaks = seq(2.5, 4.5, 0.5)) +
|
|
|
+ coord_cartesian(xlim = c(2.5, 4.5))
|
|
|
+
|
|
|
+g11 <- g_base +
|
|
|
+ geom_histogram(aes(x = sulphates), binwidth = 0.05) +
|
|
|
+ scale_x_continuous(breaks = seq(0, 2, 0.5)) +
|
|
|
+ coord_cartesian(xlim = c(0, 2))
|
|
|
+
|
|
|
+g12 <- g_base +
|
|
|
+ geom_histogram(aes(x = alcohol), binwidth = 0.25) +
|
|
|
+ scale_x_continuous(breaks = seq(8, 15, 2)) +
|
|
|
+ coord_cartesian(xlim = c(8, 15))
|
|
|
+
|
|
|
+grid.arrange(g1, g2, g3, g4, g5, g6,
|
|
|
+ g7, g8, g9, g10, g11, g12, ncol=3)
|
|
|
+
|
|
|
+#'
|
|
|
+#' There are some really interesting variations in the distributions here. Looking
|
|
|
+#' closer at a few of the more interesting ones might prove quite valuable.
|
|
|
+#' Working from top-left to right, selected plots are analysed.
|
|
|
+#'
|
|
|
+## ----echo=FALSE, warning=FALSE, message=FALSE, single_variable_hist------
|
|
|
+base_hist <- ggplot(
|
|
|
+ data = wine,
|
|
|
+ aes(color=I('black'), fill=I('#990000'))
|
|
|
+)
|
|
|
+
|
|
|
+#'
|
|
|
+#' ### Acidity
|
|
|
+## ----echo=FALSE, acidity_plot--------------------------------------------
|
|
|
+ac1 <- base_hist +
|
|
|
+ geom_histogram(aes(x = fixed.acidity), binwidth = 0.25) +
|
|
|
+ scale_x_continuous(breaks = seq(4, 16, 2)) +
|
|
|
+ coord_cartesian(xlim = c(4, 16))
|
|
|
+
|
|
|
+ac2 <- base_hist +
|
|
|
+ geom_histogram(aes(x = volatile.acidity), binwidth = 0.05) +
|
|
|
+ scale_x_continuous(breaks = seq(0, 2, 0.5)) +
|
|
|
+ coord_cartesian(xlim = c(0, 2))
|
|
|
+
|
|
|
+grid.arrange(ac1, ac2, nrow=2)
|
|
|
+
|
|
|
+#'
|
|
|
+#' **Fixed acidity** is determined by aids that do not evaporate easily --
|
|
|
+#' tartaricacid. It contributes to many other attributes, including the taste, pH,
|
|
|
+#' color, and stability to oxidation, i.e., prevent the wine from tasting flat.
|
|
|
+#' On theother hand, **volatile acidity** is responsible for the sour taste in
|
|
|
+#' wine. A very high value can lead to sour tasting wine, a low value can make
|
|
|
+#' the wine seem heavy.
|
|
|
+#' (References: [1](http://waterhouse.ucdavis.edu/whats-in-wine/fixed-acidity),
|
|
|
+#' [2](http://waterhouse.ucdavis.edu/whats-in-wine/volatile-acidity).
|
|
|
+#'
|
|
|
+## ----echo=FALSE, warning=FALSE, message=FALSE, acidity_univariate--------
|
|
|
+ac1 <- base_hist +
|
|
|
+ geom_histogram(aes(x = fixed.acidity), binwidth = 0.25) +
|
|
|
+ scale_x_continuous(breaks = seq(4, 16, 2)) +
|
|
|
+ coord_cartesian(xlim = c(4, 16))
|
|
|
+
|
|
|
+ac2 <- base_hist +
|
|
|
+ geom_histogram(aes(x = volatile.acidity), binwidth = 0.05) +
|
|
|
+ scale_x_continuous(breaks = seq(0, 2, 0.5)) +
|
|
|
+ coord_cartesian(xlim = c(0, 2))
|
|
|
+
|
|
|
+ac3 <- base_hist +
|
|
|
+ geom_histogram(aes(x = total.acidity), binwidth = 0.25) +
|
|
|
+ scale_x_continuous(breaks = seq(0, 18, 2)) +
|
|
|
+ coord_cartesian(xlim = c(0, 18))
|
|
|
+
|
|
|
+grid.arrange(ac1, ac2, ac3, nrow=3)
|
|
|
+
|
|
|
+print("Summary statistics of Fixed Acidity")
|
|
|
+summary(wine$fixed.acidity)
|
|
|
+print("Summary statistics of Volatile Acidity")
|
|
|
+summary(wine$volatile.acidity)
|
|
|
+print("Summary statistics of Total Acidity")
|
|
|
+summary(wine$total.acidity)
|
|
|
+
|
|
|
+#'
|
|
|
+#' Of the wines we have in our dataset, we can see that most have a fixed acidity
|
|
|
+#' of 7.5. The median fixed acidity is 7.9, and the mean is 8.32. There is a
|
|
|
+#' slight skew in the data because a few wines possess a very high fixed acidity.
|
|
|
+#' The median volatile acidity is 0.52 g/dm^3, and the mean is 0.5278 g/dm^3. *It
|
|
|
+#' will be interesting to note which quality of wine is correlated to what level
|
|
|
+#' of acidity in the bivariate section.*
|
|
|
+#'
|
|
|
+#' ### Citric Acid
|
|
|
+#' Citric acid is part of the fixed acid content of most wines. A non-volatile
|
|
|
+#' acid, citric also adds much of the same characteristics as tartaric acid does.
|
|
|
+#' Again, here I would guess most good wines have a balanced amount of citric
|
|
|
+#' acid.
|
|
|
+#'
|
|
|
+## ----echo=FALSE, warning=FALSE, message=FALSE, citric_acid_univariate----
|
|
|
+base_hist +
|
|
|
+ geom_histogram(aes(x = citric.acid), binwidth = 0.05) +
|
|
|
+ scale_x_continuous(breaks = seq(0, 1, 0.2)) +
|
|
|
+ coord_cartesian(xlim = c(0, 1))
|
|
|
+
|
|
|
+print("Summary statistics of Citric Acid")
|
|
|
+summary(wine$citric.acid)
|
|
|
+print('Number of Zero Values')
|
|
|
+table(wine$citric.acid == 0)
|
|
|
+
|
|
|
+#'
|
|
|
+#' There is a very high count of zero in citric acid. To check if this is
|
|
|
+#' genuinely zero or merely a ‘not available’ value. A quick check using table
|
|
|
+#' function shows that there are 132 observations of zero values and no NA value
|
|
|
+#' in reported citric acid concentration. The citric acid concentration could be
|
|
|
+#' too low and insignificant hence was reported as zero.
|
|
|
+#'
|
|
|
+#' As far as content wise the wines have a median citric acid level of
|
|
|
+#' 0.26 g/dm^3, and a mean level of 0.271 g/dm^3.
|
|
|
+#'
|
|
|
+#' ### Sulfur-Dioxide & Sulphates
|
|
|
+#' **Free sulfur dioxide** is the free form of SO2 exists in equilibrium between
|
|
|
+#' molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial
|
|
|
+#' growth and the oxidation of wine. **Sulphates** is a wine additive which can
|
|
|
+#' contribute to sulfur dioxide gas (SO2) levels, which acts as an anti-microbial
|
|
|
+#' moreover, antioxidant -- *overall keeping the wine, fresh*.
|
|
|
+#'
|
|
|
+## ----echo=FALSE, warning=FALSE, message=FALSE, sulfur_univariate---------
|
|
|
+sul1 <- base_hist + geom_histogram(aes(x = free.sulfur.dioxide))
|
|
|
+sul2 <- base_hist + geom_histogram(aes(x = log10(free.sulfur.dioxide)))
|
|
|
+
|
|
|
+sul3 <- base_hist + geom_histogram(aes(x = total.sulfur.dioxide))
|
|
|
+sul4 <- base_hist + geom_histogram(aes(x = log10(total.sulfur.dioxide)))
|
|
|
+
|
|
|
+sul5 <- base_hist + geom_histogram(aes(x = sulphates))
|
|
|
+sul6 <- base_hist + geom_histogram(aes(x = log10(sulphates)))
|
|
|
+
|
|
|
+grid.arrange(sul1, sul2, sul3, sul4, sul5, sul6, nrow=3)
|
|
|
+
|
|
|
+#'
|
|
|
+#' The distributions of all three values are positively skewed with a long tail.
|
|
|
+#' Thelog-transformation results in a normal-behaving distribution for 'total
|
|
|
+#' sulfur dioxide' and 'sulphates'.
|
|
|
+#'
|
|
|
+#' ### Alcohol
|
|
|
+#' Alcohol is what adds that special something that turns rotten grape juice
|
|
|
+#' into a drink many people love. Hence, by intuitive understanding, it should
|
|
|
+#' be crucial in determining the wine quality.
|
|
|
+#'
|
|
|
+## ----echo=FALSE, warning=FALSE, message=FALSE, alcohol_univariate--------
|
|
|
+base_hist +
|
|
|
+ geom_histogram(aes(x = alcohol), binwidth = 0.25) +
|
|
|
+ scale_x_continuous(breaks = seq(8, 15, 2)) +
|
|
|
+ coord_cartesian(xlim = c(8, 15))
|
|
|
+
|
|
|
+print("Summary statistics for alcohol %age.")
|
|
|
+summary(wine$alcohol)
|
|
|
+
|
|
|
+#'
|
|
|
+#' The mean alcohol content for our wines is 10.42%, the median is 10.2%
|
|
|
+#'
|
|
|
+#' ### Quality
|
|
|
+## ----echo=FALSE,warning=FALSE, message=FALSE, quality_univariate---------
|
|
|
+qplot(x=quality, data=wine, geom='bar',
|
|
|
+ fill=I("#990000"),
|
|
|
+ col=I("black"))
|
|
|
+
|
|
|
+print("Summary statistics - Wine Quality.")
|
|
|
+summary(wine$quality)
|
|
|
+
|
|
|
+#'
|
|
|
+#' Overall wine quality, rated on a scale from 1 to 10, has a normal shape and
|
|
|
+#' very few exceptionally high or low-quality ratings.
|
|
|
+#'
|
|
|
+#' It can be seen that the minimum rating is 3 and 8 is the maximum for quality.
|
|
|
+#' Hence, a variable called ‘rating’ is created based on variable quality.
|
|
|
+#'
|
|
|
+#' * 8 to 7 are Rated A.
|
|
|
+#'
|
|
|
+#' * 6 to 5 are Rated B.
|
|
|
+#'
|
|
|
+#' * 3 to 4 are Rated C.
|
|
|
+#'
|
|
|
+## ----echo=FALSE, quality_rating------------------------------------------
|
|
|
+# Dividing the quality into 3 rating levels
|
|
|
+wine$rating <- ifelse(wine$quality < 5, 'C',
|
|
|
+ ifelse(wine$quality < 7, 'B', 'A'))
|
|
|
+
|
|
|
+# Changing it into an ordered factor
|
|
|
+wine$rating <- ordered(wine$rating,
|
|
|
+ levels = c('C', 'B', 'A'))
|
|
|
+
|
|
|
+summary(wine$rating)
|
|
|
+
|
|
|
+qr1 <- ggplot(aes(as.numeric(quality), fill=rating), data=wine) +
|
|
|
+ geom_bar() +
|
|
|
+ ggtitle ("Barchart of Quality with Rating") +
|
|
|
+ scale_x_continuous(breaks=seq(3,8,1)) +
|
|
|
+ xlab("Quality") +
|
|
|
+ theme_pander() + scale_colour_few()
|
|
|
+
|
|
|
+qr2 <- qplot(x=rating, data=wine, geom='bar',
|
|
|
+ fill=I("#990000"),
|
|
|
+ col=I("black")) +
|
|
|
+ xlab("Rating") +
|
|
|
+ ggtitle("Barchart of Rating") +
|
|
|
+ theme_pander()
|
|
|
+
|
|
|
+grid.arrange(qr1, qr2, ncol=2)
|
|
|
+
|
|
|
+#'
|
|
|
+#' The distribution of 'rating' is much higher on the 'B' rating wine as
|
|
|
+#' seen in quality distribution. This is likely to cause overplotting. Therefore,
|
|
|
+#' a comparison of only the 'C' and 'A' wines is done to find distinctive
|
|
|
+#' properties that separate these two. The comparison is made using summary
|
|
|
+#' statistics.
|
|
|
+#'
|
|
|
+## ----echo=FALSE, rating_comparison---------------------------------------
|
|
|
+print("Summary statistics of Wine with Rating 'A'")
|
|
|
+summary(subset(wine, rating=='A'))
|
|
|
+
|
|
|
+print("Summary statistics of Wine with Rating 'C'")
|
|
|
+summary(subset(wine, rating=='C'))
|
|
|
+
|
|
|
+#'
|
|
|
+#' On comparing the *mean statistic* of different attribute for 'A-rated' and
|
|
|
+#' 'C-rated' wines (A → C), the following %age change is noted.
|
|
|
+#'
|
|
|
+#' 1. `fixed.acidity`: mean reduced by 11%.
|
|
|
+#'
|
|
|
+#' 2. `volatile.acidity` - mean increased by 80%.
|
|
|
+#'
|
|
|
+#' 3. `citric.acidity` - mean increased by 117%.
|
|
|
+#'
|
|
|
+#' 4. `sulphates` - mean reduced by 20.3%
|
|
|
+#'
|
|
|
+#' 5. `alcohol` - mean reduced by 12.7%.
|
|
|
+#'
|
|
|
+#' 6. `residualsugar` and `chloride` showed a very low variation.
|
|
|
+#'
|
|
|
+#' These changes are, however, only suitable for estimation of important quality
|
|
|
+#' impacting variables and setting a way for further analysis. No conclusion
|
|
|
+#' can be drawn from it.
|
|
|
+#'
|
|
|
+#' ## Univariate Analysis - Summary
|
|
|
+#'
|
|
|
+#' ### Overview
|
|
|
+#' The red wine dataset features 1599 separate observations, each for a different
|
|
|
+#' red wine sample. As presented, each wine sample is provided as a single row in
|
|
|
+#' the dataset. Due to the nature of how some measurements are gathered, some
|
|
|
+#' values given represent *components* of a measurement total.
|
|
|
+#'
|
|
|
+#' For example, `data.fixed.acidity` and `data.volatile.acidity` are both obtained
|
|
|
+#' via separate measurement techniques, and must be summed to indicate the total
|
|
|
+#' acidity present in a wine sample. For these cases, I supplemented the data
|
|
|
+#' given by computing the total and storing in the data frame with a
|
|
|
+#' `data.total.*` variable.
|
|
|
+#'
|
|
|
+#' ### Features of Interest
|
|
|
+#' An interesting measurement here is the wine `quality`. It is the
|
|
|
+#' subjective measurement of how attractive the wine might be to a consumer. The
|
|
|
+#' goal here will is to try and correlate non-subjective wine properties with its
|
|
|
+#' quality.
|
|
|
+#'
|
|
|
+#' I am curious about a few trends in particular -- **Sulphates vs. Quality** as
|
|
|
+#' low sulphate wine has a reputation for not causing hangovers,
|
|
|
+#' **Acidity vs. Quality** - Given that it impacts many factors like pH,
|
|
|
+#' taste, color, it is compelling to see if it affects the quality.
|
|
|
+#' **Alcohol vs. Quality** - Just an interesting measurement.
|
|
|
+#'
|
|
|
+#' At first, the lack of an *age* metric was surprising since it is commonly
|
|
|
+#' a factor in quick assumptions of wine quality. However, since the actual effect
|
|
|
+#' of wine age is on the wine's measurable chemical properties, its exclusion here
|
|
|
+#' might not be necessary.
|
|
|
+#'
|
|
|
+#' ### Distributions
|
|
|
+#' Many measurements that were clustered close to zero had a positive skew
|
|
|
+#' (you cannot have negative percentages or amounts). Others such as `pH` and
|
|
|
+#' `total.acidity` and `quality` had normal looking distributions.
|
|
|
+#'
|
|
|
+#' The distributions studied in this section were primarily used to identify the
|
|
|
+#' trends in variables present in the dataset. This helps in setting up a track
|
|
|
+#' for moving towards bivariate and multivariate analysis.
|
|
|
+#'
|
|
|
+#' ## Bivariate Plots Section
|
|
|
+#'
|
|
|
+## ----echo=FALSE, message=FALSE, warning=FALSE, correlation_plots---------
|
|
|
+ggcorr(wine,
|
|
|
+ size = 2.2, hjust = 0.8,
|
|
|
+ low = "#4682B4", mid = "white", high = "#E74C3C")
|
|
|
+
|
|
|
+#'
|
|
|
+#' **Observations from the correlation matrix.**
|
|
|
+#'
|
|
|
+#' * Total Acidity is highly correlatable with fixed acidity.
|
|
|
+#'
|
|
|
+#' * pH appears correlatable with acidity, citric acid, chlorides, and residual
|
|
|
+#' sugars.
|
|
|
+#'
|
|
|
+#' * No single property appears to correlate with quality.
|
|
|
+#'
|
|
|
+#' Further, in this section, metrics of interest are evaluated to check their
|
|
|
+#' significance on the wine quality. Moreover, bivariate relationships between
|
|
|
+#' other variables are also studied.
|
|
|
+#'
|
|
|
+#' ### Acidity vs. Rating & Quality
|
|
|
+#'
|
|
|
+## ----echo=FALSE, message=FALSE, warning=FALSE, acidity_rating------------
|
|
|
+
|
|
|
+aq1 <- ggplot(aes(x=rating, y=total.acidity), data = wine) +
|
|
|
+ geom_boxplot(fill = '#ffeeee') +
|
|
|
+ coord_cartesian(ylim=c(0, quantile(wine$total.acidity, 0.99))) +
|
|
|
+ geom_point(stat='summary', fun.y=mean,color='red') +
|
|
|
+ xlab('Rating') + ylab('Total Acidity')
|
|
|
+
|
|
|
+aq2 <- ggplot(aes(x=quality, y=total.acidity), data = wine) +
|
|
|
+ geom_boxplot(fill = '#ffeeee') +
|
|
|
+ coord_cartesian(ylim=c(0, quantile(wine$total.acidity, 0.99))) +
|
|
|
+ geom_point(stat='summary', fun.y=mean, color='red') +
|
|
|
+ xlab('Quality') + ylab('Total Acidity') +
|
|
|
+ geom_jitter(alpha=1/10, color='#990000') +
|
|
|
+ ggtitle("\n")
|
|
|
+
|
|
|
+grid.arrange(aq1, aq2, ncol=1)
|
|
|
+
|
|
|
+#'
|
|
|
+#' The boxplots depicting quality also depicts the distribution
|
|
|
+#' of various wines, and we can again see 5 and 6 quality wines have the most
|
|
|
+#' share. The blue dot is the mean, and the middle line shows the median.
|
|
|
+#'
|
|
|
+#' The box plots show how the acidity decreases as the quality of wine improve.
|
|
|
+#' However, the difference is not very noticeable. Since most wines tend to
|
|
|
+#' maintain a similar acidity level & given the fact that *volatile acidity* is
|
|
|
+#' responsible for the sour taste in wine, hence a density plot of the said
|
|
|
+#' attribute is plotted to investigate the data.
|
|
|
+#'
|
|
|
+## ----echo=FALSE, message=FALSE, warning=FALSE, acidity_quality_rating----
|
|
|
+ggplot(aes(x = volatile.acidity, fill = quality, color = quality),
|
|
|
+ data = wine) +
|
|
|
+ geom_density(alpha=0.08)
|
|
|
+
|
|
|
+#'
|
|
|
+#' Red Wine of `quality` 7 and 8 have their peaks for `volatile.acidity` well
|
|
|
+#' below the 0.4 mark. Wine with `quality` 3 has the pick at the most right
|
|
|
+#' hand side (towards more volatile acidity). This shows that the better quality
|
|
|
+#' wines are lesser sour and in general have lesser acidity.
|
|
|
+#'
|
|
|
+#' ### Alcohol vs. Quality
|
|
|
+#'
|
|
|
+## ----echo=FALSE, message=FALSE, warning=FALSE, alcohol_quality_sugar-----
|
|
|
+qas0 <- ggplot(aes(x=alcohol, y=as.numeric(quality)), data=wine) +
|
|
|
+ geom_jitter(alpha=1/12) +
|
|
|
+ geom_smooth() +
|
|
|
+ ggtitle("Alcohol Content vs. Quality") +
|
|
|
+ ylab("Quality") + xlab("Alcohol")
|
|
|
+
|
|
|
+qas1 <- ggplot(aes(x=alcohol), data=wine) +
|
|
|
+ geom_density(fill=I("#BB0000")) +
|
|
|
+ facet_wrap("quality") +
|
|
|
+ ggtitle("Alcohol Content for \nWine Quality Ratings") +
|
|
|
+ ylab("Density") + xlab("Alcohol")
|
|
|
+
|
|
|
+qas2 <- ggplot(aes(x=residual.sugar, y=alcohol), data=wine) +
|
|
|
+ geom_jitter(alpha=1/12) +
|
|
|
+ geom_smooth() +
|
|
|
+ ggtitle("Alcohol vs. Residual Sugar Content") +
|
|
|
+ ylab("Alcohol") + xlab("Residual Sugar")
|
|
|
+
|
|
|
+grid.arrange(qas1, arrangeGrob(qas0, qas2), ncol=2)
|
|
|
+
|
|
|
+#'
|
|
|
+#' The plot between residual sugar and alcohol content suggests that there is no
|
|
|
+#' erratic relation between sugar and alcohol content, which is surprising as
|
|
|
+#' alcohol is a byproduct of the yeast feeding off of sugar during the
|
|
|
+#' fermentation process. That inference could not be established here.
|
|
|
+#'
|
|
|
+#' Alcohol and quality appear to be somewhat correlatable. Lower quality wines
|
|
|
+#' tend to have lower alcohol content. This can be further studied using boxplots.
|
|
|
+#'
|
|
|
+## ----echo=FALSE, message=FALSE, warning=FALSE----------------------------
|
|
|
+
|
|
|
+quality_groups <- group_by(wine, alcohol)
|
|
|
+
|
|
|
+wine.quality_groups <- summarize(quality_groups,
|
|
|
+ acidity_mean = mean(volatile.acidity),
|
|
|
+ pH_mean = mean(pH),
|
|
|
+ sulphates_mean = mean(sulphates),
|
|
|
+ qmean = mean(as.numeric(quality)),
|
|
|
+ n = n())
|
|
|
+
|
|
|
+wine.quality_groups <- arrange(wine.quality_groups, alcohol)
|
|
|
+
|
|
|
+#'
|
|
|
+## ----echo=FALSE, message=FALSE, warning=FALSE, alcohol_quality-----------
|
|
|
+ggplot(aes(y=alcohol, x=factor(quality)), data = wine) +
|
|
|
+ geom_boxplot(fill = '#ffeeee')+
|
|
|
+ xlab('quality')
|
|
|
+
|
|
|
+#'
|
|
|
+#' The boxplots show an indication that higher quality wines have higher alcohol
|
|
|
+#' content. This trend is shown by all the quality grades from 3 to 8 except
|
|
|
+#' quality grade 5.
|
|
|
+#'
|
|
|
+#' **Does this mean that by adding more alcohol, we'd get better wine?**
|
|
|
+#'
|
|
|
+## ----echo=FALSE, message=FALSE, warning=FALSE----------------------------
|
|
|
+ggplot(aes(alcohol, qmean), data=wine.quality_groups) +
|
|
|
+ geom_smooth() +
|
|
|
+ ylab("Quality Mean") +
|
|
|
+ scale_x_continuous(breaks = seq(0, 15, 0.5)) +
|
|
|
+ xlab("Alcohol %")
|
|
|
+
|
|
|
+#'
|
|
|
+#' The above line plot indicates nearly a linear increase till 13% alcohol
|
|
|
+#' concetration, followed by a steep downwards trend. The graph has to be
|
|
|
+#' smoothened to remove variances and noise.
|
|
|
+#'
|
|
|
+#' ### Sulphates vs. Quality
|
|
|
+#'
|
|
|
+## ----echo=FALSE, message=FALSE, warning=FALSE, sulphates_quality---------
|
|
|
+ggplot(aes(y=sulphates, x=quality), data=wine) +
|
|
|
+ geom_boxplot(fill="#ffeeee")
|
|
|
+
|
|
|
+#'
|
|
|
+#' Good wines have higher sulphates values than bad wines, though the difference
|
|
|
+#' is not that wide.
|
|
|
+#'
|
|
|
+## ----echo=FALSE, message=FALSE, warning=FALSE, sulphates_qplots----------
|
|
|
+sq1 <- ggplot(aes(x=sulphates, y=as.numeric(quality)), data=wine) +
|
|
|
+ geom_jitter(alpha=1/10) +
|
|
|
+ geom_smooth() +
|
|
|
+ xlab("Sulphates") + ylab("Quality") +
|
|
|
+ ggtitle("Sulphates vs. Quality")
|
|
|
+
|
|
|
+sq2 <- ggplot(aes(x=sulphates, y=as.numeric(quality)),
|
|
|
+ data=subset(wine, wine$sulphates < 1)) +
|
|
|
+ geom_jitter(alpha=1/10) +
|
|
|
+ geom_smooth() +
|
|
|
+ xlab("Sulphates") + ylab("Quality") +
|
|
|
+ ggtitle("\nSulphates vs Quality without Outliers")
|
|
|
+
|
|
|
+grid.arrange(sq1, sq2, nrow = 2)
|
|
|
+
|
|
|
+#'
|
|
|
+#' There is a slight trend implying a relationship between sulphates and wine
|
|
|
+#' quality, mainly if extreme sulphate values are ignored, i.e., because
|
|
|
+#' disregarding measurements where sulphates > 1.0 is the same as disregarding
|
|
|
+#' the positive tail of the distribution, keeping just the normal-looking portion.
|
|
|
+#' However, the relationship is mathematically, still weak.
|
|
|
+#'
|
|
|
+#' ## Bivariate Analysis - Summary
|
|
|
+#'
|
|
|
+#' There is no apparent and mathematically strong correlation between any wine
|
|
|
+#' property and the given quality. Alcohol content is a strong contender, but even
|
|
|
+#' so, the correlation was not particularly strong.
|
|
|
+#'
|
|
|
+#' Most properties have roughly normal distributions, with some skew in one tail.
|
|
|
+#' Scatterplot relationships between these properties often showed a slight trend
|
|
|
+#' within the bulk of property values. However, as soon as we leave the
|
|
|
+#' expected range, the trends reverse. For example, Alcohol Content or
|
|
|
+#' Sulphate vs. Quality. The trend is not a definitive one, but it is seen in
|
|
|
+#' different variables.
|
|
|
+#'
|
|
|
+#' Possibly, obtaining an outlier property (say sulphate content) is particularly
|
|
|
+#' challenging to do in the wine making process. Alternatively, there is a change
|
|
|
+#' that the wines that exhibit outlier properties are deliberately of a
|
|
|
+#' non-standard variety. In that case, it could be that wine judges have a harder
|
|
|
+#' time agreeing on a quality rating.
|
|
|
+#'
|
|
|
+#' ## Multivariate Plots Section
|
|
|
+#'
|
|
|
+#' This section includes visualizations that take bivariate analysis a step
|
|
|
+#' further, i.e., understand the earlier patterns better or to strengthen the
|
|
|
+#' arguments that were presented in the previous section.
|
|
|
+#'
|
|
|
+#' ### Alcohol, Volatile Acid & Wine Rating
|
|
|
+#'
|
|
|
+## ----echo=FALSE, message=FALSE, warning=FALSE, alcohol_acid_quality------
|
|
|
+ggplot(wine, aes(x=alcohol, y=volatile.acidity, color=quality)) +
|
|
|
+ geom_jitter(alpha=0.8, position = position_jitter()) +
|
|
|
+ geom_smooth(method="lm", se = FALSE, size=1) +
|
|
|
+ scale_color_brewer(type='seq',
|
|
|
+ guide=guide_legend(title='Quality')) +
|
|
|
+ theme_pander()
|
|
|
+
|
|
|
+#'
|
|
|
+#' Earlier inspections suggested that the volatile acidity and alcohol had high
|
|
|
+#' correlations values of negative and positive. Alcohol seems to vary more than
|
|
|
+#' volatile acidity when we talk about quality, nearly every Rating A wine has
|
|
|
+#' less than 0.6 volatile acidity.
|
|
|
+#'
|
|
|
+#' ### Understanding the Significance of Acidity
|
|
|
+#'
|
|
|
+## ----echo=FALSE, message=FALSE, warning=FALSE, acid_quality--------------
|
|
|
+ggplot(subset(wine, rating=='A'|rating=='C'),
|
|
|
+ aes(x=volatile.acidity, y=citric.acid)) +
|
|
|
+ geom_point() +
|
|
|
+ geom_jitter(position=position_jitter(), aes(color=rating)) +
|
|
|
+ geom_vline(xintercept=c(0.6), linetype='dashed', size=1, color='black') +
|
|
|
+ geom_hline(yintercept=c(0.5), linetype='dashed', size=1, color='black') +
|
|
|
+ scale_x_continuous(breaks = seq(0, 1.6, .1)) +
|
|
|
+ theme_pander() + scale_colour_few()
|
|
|
+
|
|
|
+#'
|
|
|
+#' Nearly every wine has volatile acidity less than 0.8. As discussed earlier the
|
|
|
+#' A rating wines all have volatile.acidity of less than 0.6. For wines with
|
|
|
+#' rating B, the volatile acidity is between 0.4 and 0.8. Some C rating wine have
|
|
|
+#' a volatile acidity value of more than 0.8
|
|
|
+#'
|
|
|
+#' Most A rating wines have citric acid value of 0.25 to 0.75 while the B rating
|
|
|
+#' wines have citric acid value below 0.50.
|
|
|
+#'
|
|
|
+#' ### Understanding the Significance of Sulphates
|
|
|
+#'
|
|
|
+## ----echo=FALSE, message=FALSE, warning=FALSE----------------------------
|
|
|
+ggplot(subset(wine, rating=='A'|rating=='C'), aes(x = alcohol, y = sulphates)) +
|
|
|
+ geom_jitter(position = position_jitter(), aes(color=rating)) +
|
|
|
+ geom_hline(yintercept=c(0.65), linetype='dashed', size=1, color='black') +
|
|
|
+ theme_pander() + scale_colour_few() +
|
|
|
+ scale_y_continuous(breaks = seq(0, 2, .2))
|
|
|
+
|
|
|
+#'
|
|
|
+#' It is incredible to see that nearly all wines lie below 1.0 sulphates level.
|
|
|
+#' Due to overplotting, wines with rating B have been removed. It can be seen
|
|
|
+#' rating A wines mostly have sulphate values between 0.5 and 1 and the best rated
|
|
|
+#' wines have sulphate values between 0.6 and 1. Alcohol has the same values as
|
|
|
+#' seen before.
|
|
|
+#'
|
|
|
+#' ### Density & Sugar
|
|
|
+#'
|
|
|
+## ----echo=FALSE, message=FALSE, warning=FALSE, Multivariate_Plots2-------
|
|
|
+da1 <- ggplot(aes(x=density, y=total.acidity, color=as.numeric(quality)),
|
|
|
+ data=wine) +
|
|
|
+ geom_point(position='jitter') +
|
|
|
+ geom_smooth() +
|
|
|
+ labs(x="Total Acidity", y="Density", color="Quality") +
|
|
|
+ ggtitle("Density vs. Acidity Colored by Wine Quality Ratings")
|
|
|
+
|
|
|
+cs2 <- ggplot(aes(x=residual.sugar, y=density, color=as.numeric(quality)),
|
|
|
+ data=wine) +
|
|
|
+ geom_point(position='jitter') +
|
|
|
+ geom_smooth() +
|
|
|
+ labs(x="Residual Sugar", y="Density", color="Quality") +
|
|
|
+ ggtitle("\nSugar vs. Chlorides colored by Wine Quality Ratings")
|
|
|
+
|
|
|
+grid.arrange(da1, cs2)
|
|
|
+
|
|
|
+#'
|
|
|
+#' Higher quality wines appear to have a slight correlation with higher acidity
|
|
|
+#' across all densities. Moreover, there are abnormally high and low quality wines
|
|
|
+#' coincident with higher-than-usual sugar content.
|
|
|
+#'
|
|
|
+#' ## Multivariate Analysis - Summary
|
|
|
+#' Based on the investigation, it can be said that higher `citric.acid` and
|
|
|
+#' lower `volatile.acidity` contribute towards better wines. Also, better wines
|
|
|
+#' tend to have higher alcohol content.
|
|
|
+#'
|
|
|
+#' There were surprising results with `suplhates` and `alcohol` graphs.
|
|
|
+#' Sulphates had a better correlation with quality than citric acid, still the
|
|
|
+#' distribution was not that distinct between the different quality wines. Further
|
|
|
+#' nearly all wines had a sulphate content of less than 1, irrespective of the
|
|
|
+#' alcohol content; suplhate is a byproduct of fermantation just like
|
|
|
+#' alcohol.
|
|
|
+#'
|
|
|
+#' Based on the analysis presented, it can be noted because wine rating is a
|
|
|
+#' subjective measure, it is why statistical correlation values are not a very
|
|
|
+#' suitable metric to find important factors. This was realized half-way through
|
|
|
+#' the study. The graphs aptly depict that there is a suitable range and it is
|
|
|
+#' some combination of chemical factors that contribute to the flavour of wine.
|
|
|
+#'
|
|
|
+#' ## Final Plots and Summary
|
|
|
+#'
|
|
|
+#' ### Plot One
|
|
|
+#'
|
|
|
+## ----echo=FALSE, message=FALSE, warning=FALSE, plot_2--------------------
|
|
|
+qr1 <- ggplot(aes(as.numeric(quality), fill=rating), data=wine) +
|
|
|
+ geom_bar() +
|
|
|
+ ggtitle ("Barchart of Quality with Rating") +
|
|
|
+ scale_x_continuous(breaks=seq(3,8,1)) +
|
|
|
+ xlab("Quality") +
|
|
|
+ theme_pander() + scale_colour_few()
|
|
|
+
|
|
|
+qr2 <- qplot(x=rating, data=wine, geom='bar',
|
|
|
+ fill=I("#990000"),
|
|
|
+ col=I("black")) +
|
|
|
+ xlab("Rating") +
|
|
|
+ ggtitle("Barchart of Rating") +
|
|
|
+ theme_pander()
|
|
|
+
|
|
|
+grid.arrange(qr1, qr2, ncol=2)
|
|
|
+
|
|
|
+#'
|
|
|
+#' #### Description One
|
|
|
+#' The plot is from the univariate section, which introduced the idea of
|
|
|
+#' this analysis. As in the analysis, there are plenty of visualizations which
|
|
|
+#' only plot data-points from A and C rated wines. A first comparison of only
|
|
|
+#' the 'C' and 'A' wines helped find distinctive properties that separate these
|
|
|
+#' two.
|
|
|
+#'
|
|
|
+#' It also suggests that it is likely that the critics can be highly subjective as
|
|
|
+#' they do not rate any wine with a measure of 1, 2 or 9, 10. With most wines
|
|
|
+#' being mediocre, the wines that had the less popular rating must've caught the
|
|
|
+#' attention of the wine experts, hence, the idea was derived to compare these two
|
|
|
+#' rating classes.
|
|
|
+#'
|
|
|
+#' ### Plot Two
|
|
|
+#'
|
|
|
+## ---- echo=FALSE, warning=FALSE, message=FALSE, plot_1a------------------
|
|
|
+ggplot(aes(x=alcohol), data=wine) +
|
|
|
+ geom_density(fill=I("#BB0000")) +
|
|
|
+ facet_wrap("quality") +
|
|
|
+ ggtitle("Alcohol Content for Wine Quality Ratings") +
|
|
|
+ labs(x="Alcohol [%age]", y="") +
|
|
|
+ theme(plot.title = element_text(face="plain"),
|
|
|
+ axis.title.x = element_text(size=10),
|
|
|
+ axis.title.y = element_text(size=10))
|
|
|
+
|
|
|
+#'
|
|
|
+## ----echo=FALSE, message=FALSE, warning=FALSE, plot_1b-------------------
|
|
|
+fp1 <- ggplot(aes(y=alcohol, x=quality), data = wine)+
|
|
|
+ geom_boxplot() +
|
|
|
+ xlab('Quality') +
|
|
|
+ ylab("Alcohol in % by Volume") +
|
|
|
+ labs(x="Quality", y="Alcohol [%age]") +
|
|
|
+ ggtitle("Boxplot of Alcohol and Quality") +
|
|
|
+ theme(plot.title = element_text(face="plain"),
|
|
|
+ axis.title.x = element_text(size=10),
|
|
|
+ axis.title.y = element_text(size=10))
|
|
|
+
|
|
|
+fp2 <-ggplot(aes(alcohol, qmean), data=wine.quality_groups) +
|
|
|
+ geom_smooth() +
|
|
|
+ scale_x_continuous(breaks = seq(0, 15, 0.5)) +
|
|
|
+ ggtitle("\nLine Plot of Quality Mean & Alcohol Percentage") +
|
|
|
+ labs(x="Alcohol [%age]", y="Quality (Mean)") +
|
|
|
+ theme(plot.title = element_text(face="plain"),
|
|
|
+ axis.title.x = element_text(size=10),
|
|
|
+ axis.title.y = element_text(size=10))
|
|
|
+
|
|
|
+grid.arrange(fp1, fp2)
|
|
|
+
|
|
|
+#'
|
|
|
+#' #### Description Two
|
|
|
+#'
|
|
|
+#' These are plots taken from bivariate analysis section discussing the effect of
|
|
|
+#' alcohol percentage on quality.
|
|
|
+#'
|
|
|
+#' The first visualization was especially appealing to me because of the way that
|
|
|
+#' you can almost see the distribution shift from left to right as wine ratings
|
|
|
+#' increase. Again, just showing a general tendency instead of a substantial
|
|
|
+#' significance in judging wine quality.
|
|
|
+#'
|
|
|
+#' The above boxplots show a steady rise in the level of alcohol. An interesting
|
|
|
+#' trend of a decrement of quality above 13%, alcohol gave way to further analysis
|
|
|
+#' which shows that a general correlation measure might not be suitable for the
|
|
|
+#' study.
|
|
|
+#'
|
|
|
+#' The plot that follows set the basis for which I carried out the complete
|
|
|
+#' analysis. Rather than emphasizing on mathematical correlation measures, the
|
|
|
+#' inferences drawn were based on investigating the visualizations. This felt
|
|
|
+#' suitable due to the subjectivity in the measure of wine quality.
|
|
|
+#'
|
|
|
+#' ### Plot Three
|
|
|
+#'
|
|
|
+## ----echo=FALSE, messages=FALSE, warning=FALSE, plot_3-------------------
|
|
|
+fp3 <- ggplot(subset(wine, rating=='A'|rating=='C'),
|
|
|
+ aes(x = volatile.acidity, y = citric.acid)) +
|
|
|
+ geom_point() +
|
|
|
+ geom_jitter(position=position_jitter(), aes(color=rating)) +
|
|
|
+ geom_vline(xintercept=c(0.6), linetype='dashed', size=1, color='black') +
|
|
|
+ geom_hline(yintercept=c(0.5), linetype='dashed', size=1, color='black') +
|
|
|
+ scale_x_continuous(breaks = seq(0, 1.6, .1)) +
|
|
|
+ theme_pander() + scale_colour_few() +
|
|
|
+ ggtitle("Wine Rating vs. Acids") +
|
|
|
+ labs(x="Volatile Acidity (g/dm^3)", y="Citric Acid (g/dm^3)") +
|
|
|
+ theme(plot.title = element_text(face="plain"),
|
|
|
+ axis.title.x = element_text(size=10),
|
|
|
+ axis.title.y = element_text(size=10),
|
|
|
+ legend.title = element_text(size=10))
|
|
|
+
|
|
|
+fp4 <- ggplot(subset(wine, rating=='A'|rating=='C'),
|
|
|
+ aes(x = alcohol, y = sulphates)) +
|
|
|
+ geom_jitter(position = position_jitter(), aes(color=rating)) +
|
|
|
+ geom_hline(yintercept=c(0.65), linetype='dashed', size=1, color='black') +
|
|
|
+ theme_pander() + scale_colour_few() +
|
|
|
+ scale_y_continuous(breaks = seq(0,2,.2)) +
|
|
|
+ ggtitle("\nSulphates, Alcohol & Wine-Rating") +
|
|
|
+ labs(x="Alcohol [%]", y="Sulphates (g/dm^3)") +
|
|
|
+ theme(plot.title = element_text(face="plain"),
|
|
|
+ axis.title.x = element_text(size=10),
|
|
|
+ axis.title.y = element_text(size=10),
|
|
|
+ legend.title = element_text(size=10))
|
|
|
+
|
|
|
+grid.arrange(fp3, fp4, nrow=2)
|
|
|
+
|
|
|
+#'
|
|
|
+#' #### Description Three
|
|
|
+#' These plots served as finding distinguishing boundaries for given attributes,
|
|
|
+#' i.e., `sulphates`, `citric.acid`, `alcohol`, `volatile.acidity`. The
|
|
|
+#' conclusions drawn from these plots are that sulphates should be high but less
|
|
|
+#' than 1 with an alcohol concentration around 12-13%, along with less (< 0.6)
|
|
|
+#' volatile acidity. It can be viewed nearlyas a depiction of a classification
|
|
|
+#' methodology without application of any machine learning algorithm. Moreover,
|
|
|
+#' these plots strengthened the arguments laid in the earlier analysis of the data.
|
|
|
+#'
|
|
|
+#' ------
|
|
|
+#'
|
|
|
+#' ## Reflection
|
|
|
+#' In this project, I was able to examine relationship between *physicochemical*
|
|
|
+#' properties and identify the key variables that determine red wine quality,
|
|
|
+#' which are alcohol content volatile acidity and sulphate levels.
|
|
|
+#'
|
|
|
+#' The dataset is quite interesting, though limited in large-scale implications.
|
|
|
+#' I believe if this dataset held only one additional variable it would be vastly
|
|
|
+#' more useful to the layman. If *price* were supplied along with this data
|
|
|
+#' one could target the best wines within price categories, and what aspects
|
|
|
+#' correlated to a high performing wine in any price bracket.
|
|
|
+#'
|
|
|
+#' Overall, I was initially surprised by the seemingly dispersed nature of the
|
|
|
+#' wine data. Nothing was immediately correlatable to being an inherent quality
|
|
|
+#' of good wines. However, upon reflection, this is a sensible finding. Wine
|
|
|
+#' making is still something of a science and an art, and if there was one
|
|
|
+#' single property or process that continually yielded high quality wines, the
|
|
|
+#' field wouldn't be what it is.
|
|
|
+#'
|
|
|
+#' According to the study, it can be concluded that the best kind of wines are the
|
|
|
+#' ones with an alcohol concentration of about 13%, with low volatile acidity &
|
|
|
+#' high sulphates level (with an upper cap of 1.0 g/dm^3).
|
|
|
+#'
|
|
|
+#' ### Future Work & Limitations
|
|
|
+#' With my amateurish knowledge of wine-tasting, I tried my best to relate it to
|
|
|
+#' how I would rate a bottle of wine at dining. However, in the future, I would
|
|
|
+#' like to do some research into the winemaking process. Some winemakers might
|
|
|
+#' actively try for some property values or combinations, and be finding those
|
|
|
+#' combinations (of 3 or more properties) might be the key to truly predicting
|
|
|
+#' wine quality. This investigation was not able to find a robust generalized
|
|
|
+#' model that would consistently be able to predict wine quality with any degree
|
|
|
+#' of certainty.
|
|
|
+#'
|
|
|
+#' If I were to continue further into this specific dataset, I would aim to
|
|
|
+#' train a classifier to correctly predict the wine category, in order to better
|
|
|
+#' grasp the minuteness of what makes a good wine.
|
|
|
+#'
|
|
|
+#' Additionally, having the wine type would be helpful for further analysis.
|
|
|
+#' Sommeliers might prefer certain types of wines to have different
|
|
|
+#' properties and behaviors. For example, a Port (as sweet desert wine)
|
|
|
+#' surely is rated differently from a dark and robust abernet Sauvignon,
|
|
|
+#' which is rated differently from a bright and fruity Syrah. Without knowing
|
|
|
+#' the type of wine, it is entirely possible that we are almost literally
|
|
|
+#' comparing apples to oranges and can't find a correlation.
|