What Makes A Good Wine?
========================================================

In this project, a data set of red wine quality is explored based on its
physicochemical properties. The objective is to find physicochemical properties
that distinguish good quality wine from lower quality ones. An attempt to build
a linear model of wine quality is also shown.
### Dataset Description

This tidy dataset contains 1,599 red wines with 11 variables on the chemical
properties of the wine. An additional variable records the quality of each
wine, as rated by at least 3 wine experts. The preparation of the dataset
is described in [this link](https://goo.gl/HVxAzY).
```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.path='Figs/',
                      echo=FALSE, warning=FALSE, message=FALSE)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
library(ggplot2)
library(gridExtra)
library(GGally)
library(ggthemes)
library(dplyr)
library(memisc)
```
First, the structure of the dataset is explored using the `str` and `summary`
functions.
```{r echo=FALSE, warning=FALSE, message=FALSE, Load_the_Data}
wine <- read.csv("wineQualityReds.csv")
str(wine)
summary(wine)

# Setting the theme for plotting.
# theme_set(theme_minimal(10))

# Converting 'quality' to ordered type.
wine$quality <- ordered(wine$quality,
                        levels=c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10))

# Adding 'total.acidity'.
wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity
```
**The following observations are made/confirmed:**

1. There are 1599 samples of red wine properties and quality values.
2. No wine achieves either a terrible (0) or perfect (10) quality score.
3. Citric acid has a minimum of 0.0. No other property takes a value of
   exactly 0.
4. The residual sugar measurement has a maximum that is nearly 20 times farther
   away from the 3rd quartile than the 3rd quartile is from the 1st. The data
   is likely to be heavily skewed or to contain some outliers.
5. The 'quality' attribute is originally stored as an integer;
   I have converted this field into an ordered factor, which is a better
   representation of the variable itself.
6. There are two attributes related to the 'acidity' of wine, i.e. 'fixed.acidity'
   and 'volatile.acidity'. Hence, a combined acidity variable is added
   using `wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity`.
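One caveat about storing quality as an ordered factor is worth a small illustration. The sketch below uses made-up scores (the names `scores`, `q`, `codes` and `values` are hypothetical and not part of the analysis) to show that `as.numeric()` on a factor returns level positions rather than the original scores, which matters here because the levels start at 0.

```r
# Hypothetical scores, not taken from the wine data.
scores <- c(3, 5, 6, 8)
q <- ordered(scores, levels = 0:10)

# as.numeric() on a factor returns the position of each level, not its label.
# Because the levels start at 0, every code is one higher than the score.
codes <- as.numeric(q)

# Going through as.character() recovers the original scores.
values <- as.numeric(as.character(q))

print(codes)   # 4 6 7 9
print(values)  # 3 5 6 8
```

Any plot or summary that needs the numeric score can apply the second form to the ordered factor.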
## Univariate Plots Section

To lead the univariate analysis, I’ve chosen to build a grid of histograms.
These histograms represent the distributions of each variable in the dataset.
```{r echo=FALSE, warning=FALSE, message=FALSE, Univariate_Grid_Plot}
g_base <- ggplot(
  data = wine,
  aes(color=I('black'), fill=I('#990000'))
)

g1 <- g_base +
  geom_histogram(aes(x = fixed.acidity), binwidth = 0.25) +
  scale_x_continuous(breaks = seq(4, 16, 2)) +
  coord_cartesian(xlim = c(4, 16))

g2 <- g_base +
  geom_histogram(aes(x = volatile.acidity), binwidth = 0.05) +
  scale_x_continuous(breaks = seq(0, 2, 0.5)) +
  coord_cartesian(xlim = c(0, 2))

g3 <- g_base +
  geom_histogram(aes(x = total.acidity), binwidth = 0.25) +
  scale_x_continuous(breaks = seq(0, 18, 1)) +
  coord_cartesian(xlim = c(4, 18))

g4 <- g_base +
  geom_histogram(aes(x = citric.acid), binwidth = 0.05) +
  scale_x_continuous(breaks = seq(0, 1, 0.2)) +
  coord_cartesian(xlim = c(0, 1))

g5 <- g_base +
  geom_histogram(aes(x = residual.sugar), binwidth = 0.5) +
  scale_x_continuous(breaks = seq(0, 16, 2)) +
  coord_cartesian(xlim = c(0, 16))

g6 <- g_base +
  geom_histogram(aes(x = chlorides), binwidth = 0.01) +
  scale_x_continuous(breaks = seq(0, 0.75, 0.25)) +
  coord_cartesian(xlim = c(0, 0.75))

g7 <- g_base +
  geom_histogram(aes(x = free.sulfur.dioxide), binwidth = 2.5) +
  scale_x_continuous(breaks = seq(0, 75, 25)) +
  coord_cartesian(xlim = c(0, 75))

g8 <- g_base +
  geom_histogram(aes(x = total.sulfur.dioxide), binwidth = 10) +
  scale_x_continuous(breaks = seq(0, 300, 100)) +
  coord_cartesian(xlim = c(0, 295))

g9 <- g_base +
  geom_histogram(aes(x = density), binwidth = 0.0005) +
  scale_x_continuous(breaks = seq(0.99, 1.005, 0.005)) +
  coord_cartesian(xlim = c(0.99, 1.005))

g10 <- g_base +
  geom_histogram(aes(x = pH), binwidth = 0.05) +
  scale_x_continuous(breaks = seq(2.5, 4.5, 0.5)) +
  coord_cartesian(xlim = c(2.5, 4.5))

g11 <- g_base +
  geom_histogram(aes(x = sulphates), binwidth = 0.05) +
  scale_x_continuous(breaks = seq(0, 2, 0.5)) +
  coord_cartesian(xlim = c(0, 2))

g12 <- g_base +
  geom_histogram(aes(x = alcohol), binwidth = 0.25) +
  scale_x_continuous(breaks = seq(8, 15, 2)) +
  coord_cartesian(xlim = c(8, 15))

grid.arrange(g1, g2, g3, g4, g5, g6,
             g7, g8, g9, g10, g11, g12, ncol=3)
```
There are some interesting variations in the distributions here. Looking
closer at a few of the more interesting ones might prove valuable.
Working from the top-left plot to the right, selected plots are analysed.
```{r echo=FALSE, warning=FALSE, message=FALSE, single_variable_hist}
base_hist <- ggplot(
  data = wine,
  aes(color=I('black'), fill=I('#990000'))
)
```
### Acidity

```{r echo=FALSE, acidity_plot}
ac1 <- base_hist +
  geom_histogram(aes(x = fixed.acidity), binwidth = 0.25) +
  scale_x_continuous(breaks = seq(4, 16, 2)) +
  coord_cartesian(xlim = c(4, 16))

ac2 <- base_hist +
  geom_histogram(aes(x = volatile.acidity), binwidth = 0.05) +
  scale_x_continuous(breaks = seq(0, 2, 0.5)) +
  coord_cartesian(xlim = c(0, 2))

grid.arrange(ac1, ac2, nrow=2)
```
**Fixed acidity** is determined by acids that do not evaporate easily, such as
tartaric acid. It contributes to many other attributes, including the taste, pH,
color, and stability to oxidation, i.e., it prevents the wine from tasting flat.
On the other hand, **volatile acidity** is responsible for the sour taste in
wine. A very high value can lead to sour tasting wine, while a low value can
make the wine seem heavy.
(References: [1](http://waterhouse.ucdavis.edu/whats-in-wine/fixed-acidity), [2](http://waterhouse.ucdavis.edu/whats-in-wine/volatile-acidity).)
```{r echo=FALSE, warning=FALSE, message=FALSE, acidity_univariate}
ac1 <- base_hist +
  geom_histogram(aes(x = fixed.acidity), binwidth = 0.25) +
  scale_x_continuous(breaks = seq(4, 16, 2)) +
  coord_cartesian(xlim = c(4, 16))

ac2 <- base_hist +
  geom_histogram(aes(x = volatile.acidity), binwidth = 0.05) +
  scale_x_continuous(breaks = seq(0, 2, 0.5)) +
  coord_cartesian(xlim = c(0, 2))

ac3 <- base_hist +
  geom_histogram(aes(x = total.acidity), binwidth = 0.25) +
  scale_x_continuous(breaks = seq(0, 18, 2)) +
  coord_cartesian(xlim = c(0, 18))

grid.arrange(ac1, ac2, ac3, nrow=3)

print("Summary statistics of Fixed Acidity")
summary(wine$fixed.acidity)
print("Summary statistics of Volatile Acidity")
summary(wine$volatile.acidity)
print("Summary statistics of Total Acidity")
summary(wine$total.acidity)
```
Of the wines we have in our dataset, the most common fixed acidity is about
7.5. The median fixed acidity is 7.9, and the mean is 8.32. There is a
slight positive skew in the data because a few wines possess a very high fixed
acidity. The median volatile acidity is 0.52 g/dm^3, and the mean is
0.5278 g/dm^3. *It will be interesting to see in the bivariate section which
levels of acidity are associated with which quality of wine.*
### Citric Acid

Citric acid is part of the fixed acid content of most wines. A non-volatile
acid, citric acid adds many of the same characteristics as tartaric acid does.
Again, here I would guess most good wines have a balanced amount of citric
acid.
```{r echo=FALSE, warning=FALSE, message=FALSE, citric_acid_univariate}
base_hist +
  geom_histogram(aes(x = citric.acid), binwidth = 0.05) +
  scale_x_continuous(breaks = seq(0, 1, 0.2)) +
  coord_cartesian(xlim = c(0, 1))

print("Summary statistics of Citric Acid")
summary(wine$citric.acid)
print('Number of Zero Values')
table(wine$citric.acid == 0)
```
There is a very high count of zeros in citric acid. To check whether these are
genuinely zero or merely 'not available' values, a quick check using the
`table` function shows that there are 132 observations with a value of zero and
no NA values in the reported citric acid concentration. The citric acid
concentration could have been too low to be significant and hence was reported
as zero.

In terms of content, the wines have a median citric acid level of
0.26 g/dm^3, and a mean level of 0.271 g/dm^3.
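The zero-versus-missing check described above can be sketched on a toy vector. The values below are made up for illustration (the real column is `wine$citric.acid`), and the names `citric`, `n_zero`, `n_na` and `share_zero` are hypothetical.

```r
# Hypothetical concentrations in g/dm^3, not taken from the wine data.
citric <- c(0.00, 0.26, 0.00, 0.31, 0.49, 0.00)

n_zero <- sum(citric == 0, na.rm = TRUE)  # exact zeros
n_na   <- sum(is.na(citric))              # missing values
share_zero <- n_zero / length(citric)     # fraction of observations at zero

cat("zeros:", n_zero, " NAs:", n_na, " share of zeros:", share_zero, "\n")
```

Counting zeros and `NA`s separately makes it clear whether a spike at zero comes from reported values or from missing data.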
### Sulfur-Dioxide & Sulphates

**Free sulfur dioxide** is the free form of SO2, which exists in equilibrium
between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents
microbial growth and the oxidation of wine. **Sulphates** are a wine additive
which can contribute to sulfur dioxide gas (SO2) levels, which acts as an
antimicrobial and antioxidant, *overall keeping the wine fresh*.
```{r echo=FALSE, warning=FALSE, message=FALSE, sulfur_univariate}
sul1 <- base_hist + geom_histogram(aes(x = free.sulfur.dioxide))
sul2 <- base_hist + geom_histogram(aes(x = log10(free.sulfur.dioxide)))
sul3 <- base_hist + geom_histogram(aes(x = total.sulfur.dioxide))
sul4 <- base_hist + geom_histogram(aes(x = log10(total.sulfur.dioxide)))
sul5 <- base_hist + geom_histogram(aes(x = sulphates))
sul6 <- base_hist + geom_histogram(aes(x = log10(sulphates)))

grid.arrange(sul1, sul2, sul3, sul4, sul5, sul6, nrow=3)
```
The distributions of all three variables are positively skewed with a long
tail. The log-transformation results in a roughly normal distribution for
'total sulfur dioxide' and 'sulphates'.
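The effect of a log transformation on a long right tail can be quantified with a sample skewness. The sketch below uses synthetic log-normal values as a stand-in for a sulfur dioxide column; the helper `skewness()` and the variables `x`, `skew_raw` and `skew_log` are hypothetical and written here only for illustration.

```r
set.seed(42)

# Synthetic, positively skewed values; not the wine data.
x <- rlnorm(1000, meanlog = 3, sdlog = 0.6)

# Sample skewness: third central moment divided by the cubed standard deviation
# (population form, dividing by n).
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

skew_raw <- skewness(x)         # clearly positive for a long right tail
skew_log <- skewness(log10(x))  # close to zero after the transformation

cat("raw skewness:", round(skew_raw, 2),
    " log10 skewness:", round(skew_log, 2), "\n")
```

Note that `log10()` needs strictly positive input, so a column containing zeros would have to be handled separately before applying it.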
### Alcohol

Alcohol is what adds that special something that turns rotten grape juice
into a drink many people love. Hence, by intuitive understanding, it should
be crucial in determining the wine quality.
```{r echo=FALSE, warning=FALSE, message=FALSE, alcohol_univariate}
base_hist +
  geom_histogram(aes(x = alcohol), binwidth = 0.25) +
  scale_x_continuous(breaks = seq(8, 15, 2)) +
  coord_cartesian(xlim = c(8, 15))

print("Summary statistics for alcohol %age.")
summary(wine$alcohol)
```
The mean alcohol content for our wines is 10.42%, and the median is 10.2%.
### Quality

```{r echo=FALSE,warning=FALSE, message=FALSE, quality_univariate}
qplot(x=quality, data=wine, geom='bar',
      fill=I("#990000"),
      col=I("black"))

print("Summary statistics - Wine Quality.")
summary(wine$quality)
```
Overall wine quality, rated on a scale from 0 to 10, has a roughly normal shape
with very few exceptionally high or low quality ratings.

It can be seen that the minimum rating is 3 and the maximum is 8.
Hence, a variable called 'rating' is created based on the variable quality.

* 7 to 8 are rated A.
* 5 to 6 are rated B.
* 3 to 4 are rated C.
```{r echo=FALSE, quality_rating}
# Dividing the quality into 3 rating levels
wine$rating <- ifelse(wine$quality < 5, 'C',
                      ifelse(wine$quality < 7, 'B', 'A'))

# Changing it into an ordered factor
wine$rating <- ordered(wine$rating,
                       levels = c('C', 'B', 'A'))
summary(wine$rating)

# 'quality' is an ordered factor with levels 0-10, so it is converted via
# as.character() to recover the scores rather than the level positions.
qr1 <- ggplot(aes(as.numeric(as.character(quality)), fill=rating), data=wine) +
  geom_bar() +
  ggtitle("Barchart of Quality with Rating") +
  scale_x_continuous(breaks=seq(3,8,1)) +
  xlab("Quality") +
  theme_pander() + scale_colour_few()

qr2 <- qplot(x=rating, data=wine, geom='bar',
             fill=I("#990000"),
             col=I("black")) +
  xlab("Rating") +
  ggtitle("Barchart of Rating") +
  theme_pander()

grid.arrange(qr1, qr2, ncol=2)
```
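The same three-way split can also be expressed with base R's `cut()`. This is a sketch on made-up numeric scores, not the method used in this report; the vector `quality` below is a hypothetical stand-in for the numeric quality column.

```r
# Hypothetical scores covering the observed range 3 to 8.
quality <- c(3, 4, 5, 6, 7, 8)

# Intervals are right-closed: [0,4] -> C, (4,6] -> B, (6,10] -> A.
rating <- cut(quality,
              breaks = c(0, 4, 6, 10),
              labels = c("C", "B", "A"),
              include.lowest = TRUE,   # keeps a score of 0 inside the first bin
              ordered_result = TRUE)

table(rating)
```

Because `ordered_result = TRUE` returns an ordered factor, comparisons such as `rating > "C"` work directly.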
Most wines fall into the 'B' rating, as already seen in the quality
distribution. This is likely to cause overplotting. Therefore,
only the 'C' and 'A' wines are compared, to find distinctive
properties that separate these two. The comparison is made using summary
statistics.
```{r echo=FALSE, rating_comparison}
print("Summary statistics of Wine with Rating 'A'")
summary(subset(wine, rating=='A'))
print("Summary statistics of Wine with Rating 'C'")
summary(subset(wine, rating=='C'))
```
On comparing the *mean statistic* of different attributes for 'A-rated' and
'C-rated' wines (A → C), the following percentage changes are noted.

1. `fixed.acidity` - mean reduced by 11%.
2. `volatile.acidity` - mean increased by 80%.
3. `citric.acid` - mean increased by 117%.
4. `sulphates` - mean reduced by 20.3%.
5. `alcohol` - mean reduced by 12.7%.
6. `residual.sugar` and `chlorides` showed very little variation.

These changes are, however, only suitable for identifying variables that may
affect quality and for guiding further analysis. No conclusion
can be drawn from them.
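A percentage change between two group means depends on which group is taken as the baseline. The sketch below uses a made-up four-row data frame (`toy`, with hypothetical values) purely to show the arithmetic; it does not reproduce the figures listed above.

```r
# Made-up values; only the mechanics are meant to carry over.
toy <- data.frame(
  rating  = c("A", "A", "C", "C"),
  alcohol = c(12.0, 11.0, 10.0, 10.5)
)

group_means <- tapply(toy$alcohol, toy$rating, mean)
mean_a <- group_means[["A"]]  # 11.5
mean_c <- group_means[["C"]]  # 10.25

# Change going from A to C, relative to the A mean.
pct_a_to_c <- 100 * (mean_c - mean_a) / mean_a
# Change going from C to A, relative to the C mean.
pct_c_to_a <- 100 * (mean_a - mean_c) / mean_c

round(pct_a_to_c, 1)  # -10.9
round(pct_c_to_a, 1)  # 12.2
```

The two results differ in size as well as sign, so it helps to state the direction once and apply it to every variable.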
## Univariate Analysis - Summary

### Overview

The red wine dataset features 1599 separate observations, each for a different
red wine sample. As presented, each wine sample is provided as a single row in
the dataset. Due to the nature of how some measurements are gathered, some
values given represent *components* of a measurement total.

For example, `fixed.acidity` and `volatile.acidity` are both obtained
via separate measurement techniques, and must be summed to indicate the total
acidity present in a wine sample. For these cases, I supplemented the data
given by computing the total and storing it in the data frame as a
`total.*` variable (here, `total.acidity`).
### Features of Interest

An interesting measurement here is the wine `quality`. It is the
subjective measurement of how attractive the wine might be to a consumer. The
goal here is to try to correlate non-subjective wine properties with its
quality.

I am curious about a few trends in particular. **Sulphates vs. Quality**:
low sulphate wine has a reputation for not causing hangovers.
**Acidity vs. Quality**: given that it impacts many factors like pH,
taste, and color, it is compelling to see if it affects the quality.
**Alcohol vs. Quality**: just an interesting measurement.

At first, the lack of an *age* metric was surprising since it is commonly
a factor in quick assumptions of wine quality. However, since the actual effect
of wine age is on the wine's measurable chemical properties, its inclusion here
might not be necessary.
### Distributions

Many measurements that were clustered close to zero had a positive skew
(you cannot have negative percentages or amounts). Others, such as `pH`,
`total.acidity` and `quality`, had normal-looking distributions.

The distributions studied in this section were primarily used to identify the
trends in variables present in the dataset. This helps set the direction
for the bivariate and multivariate analysis.
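## Bivariate Plots Section

```{r echo=FALSE, message=FALSE, warning=FALSE, correlation_plots}
# ggcorr() ignores non-numeric columns, so 'quality' is converted back to a
# number to keep it in the correlation matrix.
wine_numeric <- wine
wine_numeric$quality <- as.numeric(as.character(wine_numeric$quality))

ggcorr(wine_numeric,
       size = 2.2, hjust = 0.8,
       low = "#4682B4", mid = "white", high = "#E74C3C")
```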
**Observations from the correlation matrix.**

* Total acidity is highly correlated with fixed acidity.
* pH appears to be correlated with acidity, citric acid, chlorides, and
  residual sugars.
* No single property appears to correlate strongly with quality.

Further, in this section, metrics of interest are evaluated to check their
significance for wine quality. Moreover, bivariate relationships between
other variables are also studied.
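For a single pair of variables, the strength of association can also be checked numerically. Since quality is an ordinal score, a rank-based coefficient such as Spearman's rho is one reasonable choice; this is an assumption of the sketch, not something the report itself uses. The data below is simulated, and `alcohol` and `quality` here are hypothetical stand-ins for the real columns.

```r
set.seed(1)

# Simulated data, not the wine data: quality rises weakly with alcohol.
n <- 200
alcohol <- runif(n, min = 9, max = 14)
raw_score <- 5.5 + 0.4 * (alcohol - 11) + rnorm(n, sd = 0.8)
quality <- pmin(8, pmax(3, round(raw_score)))  # clamp to the observed 3-8 range

# Spearman's rho is computed on ranks, which suits an ordinal score.
rho <- cor(alcohol, quality, method = "spearman")
round(rho, 2)
```

With the real data, the numeric quality score would first have to be recovered from the ordered factor before calling `cor()`.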
### Acidity vs. Rating & Quality

```{r echo=FALSE, message=FALSE, warning=FALSE, acidity_rating}
aq1 <- ggplot(aes(x=rating, y=total.acidity), data = wine) +
  geom_boxplot(fill = '#ffeeee') +
  coord_cartesian(ylim=c(0, quantile(wine$total.acidity, 0.99))) +
  geom_point(stat='summary', fun.y=mean, color='red') +
  xlab('Rating') + ylab('Total Acidity')

aq2 <- ggplot(aes(x=quality, y=total.acidity), data = wine) +
  geom_boxplot(fill = '#ffeeee') +
  coord_cartesian(ylim=c(0, quantile(wine$total.acidity, 0.99))) +
  geom_point(stat='summary', fun.y=mean, color='red') +
  xlab('Quality') + ylab('Total Acidity') +
  geom_jitter(alpha=1/10, color='#990000') +
  ggtitle("\n")

grid.arrange(aq1, aq2, ncol=1)
```
The boxplots by quality also show how the wines are distributed across
scores, and we can again see that wines of quality 5 and 6 have the largest
share. The red dot is the mean, and the middle line shows the median.

The box plots show how the acidity decreases as the quality of wine improves.
However, the difference is not very noticeable. Since most wines tend to
maintain a similar acidity level, and given that *volatile acidity* is
responsible for the sour taste in wine, a density plot of that
attribute is plotted to investigate the data further.
```{r echo=FALSE, message=FALSE, warning=FALSE, acidity_quality_rating}
ggplot(aes(x = volatile.acidity, fill = quality, color = quality),
       data = wine) +
  geom_density(alpha=0.08)
```
Red wines of `quality` 7 and 8 have their peaks for `volatile.acidity` well
below the 0.4 mark. Wine with `quality` 3 has its peak furthest to the right
(towards more volatile acidity). This shows that the better quality
wines are less sour and in general have less acidity.
### Alcohol vs. Quality

```{r echo=FALSE, message=FALSE, warning=FALSE, alcohol_quality_sugar}
# as.character() recovers the quality score from the ordered factor;
# as.numeric() alone would return the level position (score + 1).
qas0 <- ggplot(aes(x=alcohol, y=as.numeric(as.character(quality))), data=wine) +
  geom_jitter(alpha=1/12) +
  geom_smooth() +
  ggtitle("Alcohol Content vs. Quality") +
  ylab("Quality") + xlab("Alcohol")

qas1 <- ggplot(aes(x=alcohol), data=wine) +
  geom_density(fill=I("#BB0000")) +
  facet_wrap("quality") +
  ggtitle("Alcohol Content for \nWine Quality Ratings") +
  ylab("Density") + xlab("Alcohol")

qas2 <- ggplot(aes(x=residual.sugar, y=alcohol), data=wine) +
  geom_jitter(alpha=1/12) +
  geom_smooth() +
  ggtitle("Alcohol vs. Residual Sugar Content") +
  ylab("Alcohol") + xlab("Residual Sugar")

grid.arrange(qas1, arrangeGrob(qas0, qas2), ncol=2)
```
The plot of residual sugar against alcohol content suggests that there is no
clear relation between sugar and alcohol content, which is surprising as
alcohol is a byproduct of the yeast feeding off of sugar during the
fermentation process. That relationship could not be established here.

Alcohol and quality appear to be somewhat correlated. Lower quality wines
tend to have lower alcohol content. This can be further studied using boxplots.
```{r echo=FALSE, message=FALSE, warning=FALSE}
# Mean properties for each distinct alcohol level.
quality_groups <- group_by(wine, alcohol)
wine.quality_groups <- summarize(quality_groups,
                                 acidity_mean = mean(volatile.acidity),
                                 pH_mean = mean(pH),
                                 sulphates_mean = mean(sulphates),
                                 # as.character() recovers the score from the ordered factor
                                 qmean = mean(as.numeric(as.character(quality))),
                                 n = n())
wine.quality_groups <- arrange(wine.quality_groups, alcohol)
```
```{r echo=FALSE, message=FALSE, warning=FALSE, alcohol_quality}
ggplot(aes(y=alcohol, x=factor(quality)), data = wine) +
  geom_boxplot(fill = '#ffeeee') +
  xlab('quality')
```
The boxplots indicate that higher quality wines have higher alcohol
content. This trend holds across all quality grades from 3 to 8 except
quality grade 5.

**Does this mean that by adding more alcohol, we'd get better wine?**
```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(aes(alcohol, qmean), data=wine.quality_groups) +
  geom_smooth() +
  ylab("Quality Mean") +
  scale_x_continuous(breaks = seq(0, 15, 0.5)) +
  xlab("Alcohol %")
```
The above line plot indicates a nearly linear increase up to 13% alcohol
concentration, followed by a steep downward trend. The graph has to be
smoothed to remove variance and noise.
### Sulphates vs. Quality

```{r echo=FALSE, message=FALSE, warning=FALSE, sulphates_quality}
ggplot(aes(y=sulphates, x=quality), data=wine) +
  geom_boxplot(fill="#ffeeee")
```
Good wines have higher sulphates values than bad wines, though the difference
is not that wide.
```{r echo=FALSE, message=FALSE, warning=FALSE, sulphates_qplots}
# as.character() recovers the quality score from the ordered factor.
sq1 <- ggplot(aes(x=sulphates, y=as.numeric(as.character(quality))), data=wine) +
  geom_jitter(alpha=1/10) +
  geom_smooth() +
  xlab("Sulphates") + ylab("Quality") +
  ggtitle("Sulphates vs. Quality")

sq2 <- ggplot(aes(x=sulphates, y=as.numeric(as.character(quality))),
              data=subset(wine, sulphates < 1)) +
  geom_jitter(alpha=1/10) +
  geom_smooth() +
  xlab("Sulphates") + ylab("Quality") +
  ggtitle("\nSulphates vs Quality without Outliers")

grid.arrange(sq1, sq2, nrow = 2)
```
There is a slight trend implying a relationship between sulphates and wine
quality, mainly if extreme sulphate values are ignored: disregarding
measurements where sulphates > 1.0 is the same as disregarding
the positive tail of the distribution, keeping just the normal-looking portion.
However, the relationship is still mathematically weak.
## Bivariate Analysis - Summary

There is no apparent and mathematically strong correlation between any wine
property and the given quality. Alcohol content is a strong contender, but even
so, the correlation was not particularly strong.

Most properties have roughly normal distributions, with some skew in one tail.
Scatterplot relationships between these properties often showed a slight trend
within the bulk of property values. However, as soon as we leave the
expected range, the trends reverse, for example in Alcohol Content vs. Quality
or Sulphates vs. Quality. The trend is not a definitive one, but it is seen in
different variables.

Possibly, obtaining an outlier property (say sulphate content) is particularly
challenging to do in the wine making process. Alternatively, there is a chance
that the wines that exhibit outlier properties are deliberately of a
non-standard variety. In that case, it could be that wine judges have a harder
time agreeing on a quality rating.
## Multivariate Plots Section

This section includes visualizations that take the bivariate analysis a step
further, i.e., to understand the earlier patterns better or to strengthen the
arguments that were presented in the previous section.
### Alcohol, Volatile Acid & Wine Rating

```{r echo=FALSE, message=FALSE, warning=FALSE, alcohol_acid_quality}
ggplot(wine, aes(x=alcohol, y=volatile.acidity, color=quality)) +
  geom_jitter(alpha=0.8, position = position_jitter()) +
  geom_smooth(method="lm", se = FALSE, size=1) +
  scale_color_brewer(type='seq',
                     guide=guide_legend(title='Quality')) +
  theme_pander()
```
Earlier inspections suggested that volatile acidity and alcohol had relatively
high correlations with quality, negative and positive respectively. Alcohol
seems to vary more than volatile acidity across quality levels; nearly every
Rating A wine has a volatile acidity of less than 0.6.
### Understanding the Significance of Acidity

```{r echo=FALSE, message=FALSE, warning=FALSE, acid_quality}
ggplot(subset(wine, rating=='A'|rating=='C'),
       aes(x=volatile.acidity, y=citric.acid)) +
  geom_point() +
  geom_jitter(position=position_jitter(), aes(color=rating)) +
  geom_vline(xintercept=c(0.6), linetype='dashed', size=1, color='black') +
  geom_hline(yintercept=c(0.5), linetype='dashed', size=1, color='black') +
  scale_x_continuous(breaks = seq(0, 1.6, .1)) +
  theme_pander() + scale_colour_few()
```
Nearly every wine has a volatile acidity of less than 0.8. As discussed
earlier, nearly all A rating wines have a `volatile.acidity` of less than 0.6.
For wines with rating B, the volatile acidity is between 0.4 and 0.8. Some
C rating wines have a volatile acidity value of more than 0.8.

Most A rating wines have a citric acid value of 0.25 to 0.75, while the B
rating wines have a citric acid value below 0.50.
### Understanding the Significance of Sulphates

```{r echo=FALSE, message=FALSE, warning=FALSE}
ggplot(subset(wine, rating=='A'|rating=='C'), aes(x = alcohol, y = sulphates)) +
  geom_jitter(position = position_jitter(), aes(color=rating)) +
  geom_hline(yintercept=c(0.65), linetype='dashed', size=1, color='black') +
  theme_pander() + scale_colour_few() +
  scale_y_continuous(breaks = seq(0, 2, .2))
```
It is striking that nearly all wines lie below a sulphates level of 1.0.
Due to overplotting, wines with rating B have been removed. It can be seen that
rating A wines mostly have sulphate values between 0.5 and 1, and the best
rated wines have sulphate values between 0.6 and 1. Alcohol shows the same
pattern as seen before.
### Density & Sugar

```{r echo=FALSE, message=FALSE, warning=FALSE, Multivariate_Plots2}
# as.character() recovers the quality score from the ordered factor.
# Axis labels follow the aesthetics: x is density, y is total acidity.
da1 <- ggplot(aes(x=density, y=total.acidity,
                  color=as.numeric(as.character(quality))),
              data=wine) +
  geom_point(position='jitter') +
  geom_smooth() +
  labs(x="Density", y="Total Acidity", color="Quality") +
  ggtitle("Density vs. Acidity Colored by Wine Quality Ratings")

cs2 <- ggplot(aes(x=residual.sugar, y=density,
                  color=as.numeric(as.character(quality))),
              data=wine) +
  geom_point(position='jitter') +
  geom_smooth() +
  labs(x="Residual Sugar", y="Density", color="Quality") +
  ggtitle("\nSugar vs. Density Colored by Wine Quality Ratings")

grid.arrange(da1, cs2)
```
Higher quality wines appear to have a slight correlation with higher acidity
across all densities. Moreover, there are abnormally high and low quality wines
coincident with higher-than-usual sugar content.
## Multivariate Analysis - Summary

Based on the investigation, it can be said that higher `citric.acid` and
lower `volatile.acidity` contribute towards better wines. Also, better wines
tend to have higher alcohol content.

There were surprising results with the `sulphates` and `alcohol` graphs.
Sulphates had a better correlation with quality than citric acid, yet the
distribution was not that distinct between the different quality wines.
Further, nearly all wines had a sulphate content of less than 1, irrespective
of the alcohol content; sulphate is a byproduct of fermentation just like
alcohol.

Based on the analysis presented, it can be noted that because wine rating is a
subjective measure, statistical correlation values are not a very
suitable metric for finding important factors. This was realized half-way
through the study. The graphs aptly depict that there is a suitable range, and
that it is some combination of chemical factors that contributes to the
flavour of wine.
## Final Plots and Summary

### Plot One

```{r echo=FALSE, message=FALSE, warning=FALSE, plot_2}
# as.character() recovers the quality score from the ordered factor;
# as.numeric() alone would return the level position (score + 1).
qr1 <- ggplot(aes(as.numeric(as.character(quality)), fill=rating), data=wine) +
  geom_bar() +
  ggtitle("Barchart of Quality with Rating") +
  scale_x_continuous(breaks=seq(3,8,1)) +
  xlab("Quality") +
  theme_pander() + scale_colour_few()

qr2 <- qplot(x=rating, data=wine, geom='bar',
             fill=I("#990000"),
             col=I("black")) +
  xlab("Rating") +
  ggtitle("Barchart of Rating") +
  theme_pander()

grid.arrange(qr1, qr2, ncol=2)
```
#### Description One

The plot is from the univariate section, which introduced the idea behind
this analysis. Throughout the analysis, there are plenty of visualizations
which only plot data points from A and C rated wines. A first comparison of
only the 'C' and 'A' wines helped find distinctive properties that separate
these two.

It also suggests that the critics can be highly subjective, as
they do not rate any wine with a score of 1, 2, 9 or 10. With most wines
being mediocre, the wines that received the less common ratings must have
caught the attention of the wine experts; hence, the idea was derived to
compare these two rating classes.

### Plot Two

```{r plot_1a, echo=FALSE, warning=FALSE, message=FALSE}
# Density of alcohol content, one panel per quality score.
ggplot(aes(x=alcohol), data=wine) +
  geom_density(fill="#BB0000") +
  facet_wrap(~ quality) +
  ggtitle("Alcohol Content for Wine Quality Ratings") +
  labs(x="Alcohol [%age]", y="") +
  theme(plot.title = element_text(face="plain"),
        axis.title.x = element_text(size=10),
        axis.title.y = element_text(size=10))
```

```{r plot_1b, echo=FALSE, message=FALSE, warning=FALSE}
# Top panel: alcohol content per quality score.
fp1 <- ggplot(aes(y=alcohol, x=quality), data=wine) +
  geom_boxplot() +
  labs(x="Quality", y="Alcohol [%age]") +
  ggtitle("Boxplot of Alcohol and Quality") +
  theme(plot.title = element_text(face="plain"),
        axis.title.x = element_text(size=10),
        axis.title.y = element_text(size=10))

# Bottom panel: smoothed mean quality as a function of alcohol content.
fp2 <- ggplot(aes(alcohol, qmean), data=wine.quality_groups) +
  geom_smooth() +
  scale_x_continuous(breaks = seq(0, 15, 0.5)) +
  ggtitle("\nLine Plot of Quality Mean & Alcohol Percentage") +
  labs(x="Alcohol [%age]", y="Quality (Mean)") +
  theme(plot.title = element_text(face="plain"),
        axis.title.x = element_text(size=10),
        axis.title.y = element_text(size=10))

grid.arrange(fp1, fp2)
```

#### Description Two

These plots are taken from the bivariate analysis section, which discusses the
effect of alcohol percentage on quality.

The first visualization was especially appealing to me because you can almost
see the distribution shift from left to right as the wine ratings increase.
Again, this shows a general tendency rather than a substantial significance in
judging wine quality.

The boxplots above show a steady rise in alcohol level as quality increases. An
interesting trend, a decline in quality above 13% alcohol, led to further
analysis, which shows that a general correlation measure might not be suitable
for this study.

The plot that follows set the basis on which I carried out the complete
analysis. Rather than emphasizing mathematical correlation measures, I drew
inferences from investigating the visualizations. This felt suitable given the
subjectivity in the measure of wine quality.

### Plot Three

```{r plot_3, echo=FALSE, message=FALSE, warning=FALSE}
# Top panel: citric acid vs. volatile acidity for A- and C-rated wines only.
fp3 <- ggplot(subset(wine, rating=='A'|rating=='C'),
              aes(x = volatile.acidity, y = citric.acid)) +
  geom_point() +
  geom_jitter(position=position_jitter(), aes(color=rating)) +
  # Dashed lines mark the visually chosen boundaries.
  geom_vline(xintercept=c(0.6), linetype='dashed', size=1, color='black') +
  geom_hline(yintercept=c(0.5), linetype='dashed', size=1, color='black') +
  scale_x_continuous(breaks = seq(0, 1.6, .1)) +
  theme_pander() + scale_colour_few() +
  ggtitle("Wine Rating vs. Acids") +
  labs(x="Volatile Acidity (g/dm^3)", y="Citric Acid (g/dm^3)") +
  theme(plot.title = element_text(face="plain"),
        axis.title.x = element_text(size=10),
        axis.title.y = element_text(size=10),
        legend.title = element_text(size=10))

# Bottom panel: sulphates vs. alcohol for the same two rating classes.
fp4 <- ggplot(subset(wine, rating=='A'|rating=='C'),
              aes(x = alcohol, y = sulphates)) +
  geom_jitter(position = position_jitter(), aes(color=rating)) +
  geom_hline(yintercept=c(0.65), linetype='dashed', size=1, color='black') +
  theme_pander() + scale_colour_few() +
  scale_y_continuous(breaks = seq(0,2,.2)) +
  ggtitle("\nSulphates, Alcohol & Wine-Rating") +
  labs(x="Alcohol [%]", y="Sulphates (g/dm^3)") +
  theme(plot.title = element_text(face="plain"),
        axis.title.x = element_text(size=10),
        axis.title.y = element_text(size=10),
        legend.title = element_text(size=10))

grid.arrange(fp3, fp4, nrow=2)
```

#### Description Three

These plots served to find distinguishing boundaries for the given attributes,
i.e., `sulphates`, `citric.acid`, `alcohol` and `volatile.acidity`. The
conclusions drawn from them are that sulphates should be high but less than 1,
with an alcohol concentration of around 12-13% and low (< 0.6) volatile
acidity. The plots can be viewed almost as a depiction of a classification
methodology that does not apply any machine learning algorithm. Moreover, they
strengthened the arguments made in the earlier analysis of the data.
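
The boundaries described here can be written down as a simple hand-made rule.
The following is only a sketch: the function name `flag_promising` and the toy
data frame are hypothetical, the lower sulphate bound of 0.65 is taken from the
dashed line in the plot, and the 12-13% alcohol window is treated as a hard
cut-off purely for illustration.

```r
# Hand-made rule built from the visually chosen boundaries.
# `flag_promising` is a hypothetical helper, not part of the analysis itself.
flag_promising <- function(df) {
  with(df,
       sulphates >= 0.65 & sulphates < 1 &   # high sulphates, capped at 1
       volatile.acidity < 0.6 &              # low volatile acidity
       alcohol >= 12 & alcohol <= 13)        # alcohol around 12-13%
}

# Made-up example rows, not taken from the wine dataset.
toy <- data.frame(
  alcohol          = c(12.5, 9.8, 12.8, 13.5),
  volatile.acidity = c(0.35, 0.70, 0.65, 0.40),
  sulphates        = c(0.80, 0.55, 0.75, 1.20)
)

toy$promising <- flag_promising(toy)
print(toy)
```

Only the first toy row satisfies every condition. A rule like this describes a
tendency seen in the plots; it is not a validated predictor of quality.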

------

## Reflection

In this project, I was able to examine the relationships between
*physicochemical* properties and identify the key variables that determine red
wine quality: alcohol content, volatile acidity and sulphate levels.

The dataset is quite interesting, though limited in its large-scale
implications. I believe that if it held just one additional variable, it would
be vastly more useful to the layman. If *price* were supplied along with this
data, one could target the best wines within each price category and see which
aspects correlate with a high-performing wine in any price bracket.

Overall, I was initially surprised by the seemingly dispersed nature of the
wine data. Nothing immediately correlated with being an inherent quality of
good wines. However, upon reflection, this is a sensible finding. Winemaking is
still something of both a science and an art, and if there were one single
property or process that consistently yielded high-quality wines, the field
wouldn't be what it is.

According to the study, the best wines are those with an alcohol concentration
of about 13%, low volatile acidity and a high sulphate level (with an upper cap
of 1.0 g/dm^3).

### Future Work & Limitations

With my amateurish knowledge of wine tasting, I tried my best to relate the
ratings to how I would rate a bottle of wine when dining. In the future,
however, I would like to do some research into the winemaking process. Some
winemakers might actively aim for certain property values or combinations, and
finding those combinations (of 3 or more properties) might be the key to truly
predicting wine quality. This investigation was not able to find a robust,
generalized model that would consistently predict wine quality with any degree
of certainty.

If I were to continue with this specific dataset, I would aim to train a
classifier to correctly predict the wine category, in order to better grasp the
nuances of what makes a good wine.
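
As a rough idea of what such a classifier could look like, here is a minimal
sketch using a decision tree from the `rpart` package. Everything in it is an
assumption made for illustration: the data are randomly generated rather than
taken from the wine dataset, the labelling rule is invented so that the tree
has something to learn, and only two rating classes are used.

```r
library(rpart)  # decision trees; a recommended package shipped with R

set.seed(42)
n <- 300

# Synthetic stand-in for the wine data (NOT the real dataset).
toy <- data.frame(
  alcohol          = runif(n, 8.5, 14.5),
  volatile.acidity = runif(n, 0.1, 1.2),
  sulphates        = runif(n, 0.3, 1.5)
)

# Invented labelling rule, used only to give the example a learnable signal.
good <- toy$alcohol > 11.5 & toy$volatile.acidity < 0.6
toy$rating <- factor(ifelse(good, "A", "C"), levels = c("A", "C"))

# Hold out 30% of the rows to check the fitted tree on unseen data.
train_idx <- sample(n, size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

fit  <- rpart(rating ~ alcohol + volatile.acidity + sulphates,
              data = train, method = "class")
pred <- predict(fit, newdata = test, type = "class")

# Confusion matrix and overall accuracy on the held-out rows.
confusion <- table(predicted = pred, actual = test$rating)
accuracy  <- mean(pred == test$rating)
print(confusion)
print(accuracy)
```

With the real data, the synthetic data frame would be replaced by the wine
table and its `rating` column. The accuracy obtained on this toy data says
nothing about how well wine quality can actually be predicted.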

Additionally, knowing the wine type would be helpful for further analysis.
Sommeliers might prefer certain types of wine to have different properties and
behaviors. For example, a Port (a sweet dessert wine) is surely rated
differently from a dark and robust Cabernet Sauvignon, which is rated
differently from a bright and fruity Syrah. Without knowing the type of wine,
it is entirely possible that we are almost literally comparing apples to
oranges and cannot find a correlation.