Red_Wine_Quality.R 35 KB

  1. #' What Makes A Good Wine?
  2. #' ========================================================
  3. #'
  4. #' In this project, a data set of red wine quality is explored based on its
  5. #' physicochemical properties. The objective is to find physicochemical properties
  6. #' that distinguish good quality wine from lower quality ones. An attempt to build
  7. #' a linear model of wine quality is also shown.
  8. #'
  9. #' ### Dataset Description
  10. #' This tidy dataset contains 1,599 red wines with 11 variables on the chemical
  11. #' properties of the wine. An additional variable records the quality of each
  12. #' wine, as rated by at least 3 wine experts. The preparation of the dataset
  13. #' has been described in [this link](https://goo.gl/HVxAzY).
  14. #'
  15. ## ----global_options, include=FALSE---------------------------------------
  16. knitr::opts_chunk$set(fig.path='Figs/',
  17. echo=FALSE, warning=FALSE, message=FALSE)
  18. #'
  19. ## ----echo=FALSE, message=FALSE, warning=FALSE, packages------------------
  20. library(ggplot2)
  21. library(gridExtra)
  22. library(GGally)
  23. library(ggthemes)
  24. library(dplyr)
  25. library(memisc)
  26. #'
  27. #' First, the structure of the dataset is explored using ``summary`` and ``str``
  28. #' functions.
  29. ## ----echo=FALSE, warning=FALSE, message=FALSE, Load_the_Data-------------
  30. wine <- read.csv("wineQualityReds.csv")
  31. str(wine)
  32. summary(wine)
  33. # Setting the theme for plotting.
  34. # theme_set(theme_minimal(10))
  35. # Converting 'quality' to ordered type.
  36. wine$quality <- ordered(wine$quality,
  37. levels=c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
  38. # Adding 'total.acidity'.
  39. wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity
  40. #'
  41. #' **The following observations are made/confirmed:**
  42. #'
  43. #' 1. There are 1599 samples of Red Wine properties and quality values.
  44. #'
  45. #' 2. No wine achieves either a terrible (0) or perfect (10) quality score.
  46. #'
  47. #' 3. Citric Acid had a minimum of 0.0. No other property values were precisely 0.
  48. #'
  49. #' 4. Residual Sugar measurement has a maximum that is nearly 20 times farther
  50. #' away from the 3rd quartile than the 3rd quartile is from the 1st. The data
  51. #' may be heavily skewed, or it may contain some outliers.
  52. #'
  53. #' 5. The 'quality' attribute is originally read as an integer;
  54. #' I have converted this field into an ordered factor, which is much more
  55. #' representative of the variable itself.
  56. #'
  57. #' 6. There are two attributes related to the 'acidity' of wine, i.e. 'fixed.acidity'
  58. #' and 'volatile.acidity'. Hence, a combined acidity variable is added
  59. #' using ``wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity``.
  60. #'
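#' The difference between an ordered factor's level codes and its original
#' scores matters whenever `quality` is converted back to a number. A minimal
#' sketch, using a made-up vector `q` (not part of the dataset) with the same
#' levels as above:
#'
## ----echo=TRUE, ordered_factor_sketch-------------------------------------
# Hypothetical scores, stored the same way as wine$quality.
q <- ordered(c(3, 5, 8), levels = 0:10)
# as.numeric() returns the position of each level, i.e. score + 1 here.
codes <- as.numeric(q)
# Going through as.character() recovers the original scores.
scores <- as.numeric(as.character(q))
# Comparisons against a score still work directly on the ordered factor.
below_five <- q < 5
#'
#' Because the levels start at 0, the codes are offset by one from the scores,
#' so `as.numeric(as.character(...))` is the safer conversion for plots and means.
#'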
  61. #' ## Univariate Plots Section
  62. #' To lead the univariate analysis, I’ve chosen to build a grid of histograms.
  63. #' These histograms represent the distributions of each variable in the dataset.
  64. #'
  65. ## ----echo=FALSE, warning=FALSE, message=FALSE, Univariate_Grid_Plot------
  66. g_base <- ggplot(
  67. data = wine,
  68. aes(color=I('black'), fill=I('#990000'))
  69. )
  70. g1 <- g_base +
  71. geom_histogram(aes(x = fixed.acidity), binwidth = 0.25) +
  72. scale_x_continuous(breaks = seq(4, 16, 2)) +
  73. coord_cartesian(xlim = c(4, 16))
  74. g2 <- g_base +
  75. geom_histogram(aes(x = volatile.acidity), binwidth = 0.05) +
  76. scale_x_continuous(breaks = seq(0, 2, 0.5)) +
  77. coord_cartesian(xlim = c(0, 2))
  78. g3 <- g_base +
  79. geom_histogram(aes(x = total.acidity), binwidth = 0.25) +
  80. scale_x_continuous(breaks = seq(0, 18, 1)) +
  81. coord_cartesian(xlim = c(4, 18))
  82. g4 <- g_base +
  83. geom_histogram(aes(x = citric.acid), binwidth = 0.05) +
  84. scale_x_continuous(breaks = seq(0, 1, 0.2)) +
  85. coord_cartesian(xlim = c(0, 1))
  86. g5 <- g_base +
  87. geom_histogram(aes(x = residual.sugar), binwidth = 0.5) +
  88. scale_x_continuous(breaks = seq(0, 16, 2)) +
  89. coord_cartesian(xlim = c(0, 16))
  90. g6 <- g_base +
  91. geom_histogram(aes(x = chlorides), binwidth = 0.01) +
  92. scale_x_continuous(breaks = seq(0, 0.75, 0.25)) +
  93. coord_cartesian(xlim = c(0, 0.75))
  94. g7 <- g_base +
  95. geom_histogram(aes(x = free.sulfur.dioxide), binwidth = 2.5) +
  96. scale_x_continuous(breaks = seq(0, 75, 25)) +
  97. coord_cartesian(xlim = c(0, 75))
  98. g8 <- g_base +
  99. geom_histogram(aes(x = total.sulfur.dioxide), binwidth = 10) +
  100. scale_x_continuous(breaks = seq(0, 300, 100)) +
  101. coord_cartesian(xlim = c(0, 295))
  102. g9 <- g_base +
  103. geom_histogram(aes(x = density), binwidth = 0.0005) +
  104. scale_x_continuous(breaks = seq(0.99, 1.005, 0.005)) +
  105. coord_cartesian(xlim = c(0.99, 1.005))
  106. g10 <- g_base +
  107. geom_histogram(aes(x = pH), binwidth = 0.05) +
  108. scale_x_continuous(breaks = seq(2.5, 4.5, 0.5)) +
  109. coord_cartesian(xlim = c(2.5, 4.5))
  110. g11 <- g_base +
  111. geom_histogram(aes(x = sulphates), binwidth = 0.05) +
  112. scale_x_continuous(breaks = seq(0, 2, 0.5)) +
  113. coord_cartesian(xlim = c(0, 2))
  114. g12 <- g_base +
  115. geom_histogram(aes(x = alcohol), binwidth = 0.25) +
  116. scale_x_continuous(breaks = seq(8, 15, 2)) +
  117. coord_cartesian(xlim = c(8, 15))
  118. grid.arrange(g1, g2, g3, g4, g5, g6,
  119. g7, g8, g9, g10, g11, g12, ncol=3)
  120. #'
  121. #' There are some really interesting variations in the distributions here. Looking
  122. #' closer at a few of the more interesting ones might prove quite valuable.
  123. #' Working from top-left to right, selected plots are analysed.
  124. #'
  125. ## ----echo=FALSE, warning=FALSE, message=FALSE, single_variable_hist------
  126. base_hist <- ggplot(
  127. data = wine,
  128. aes(color=I('black'), fill=I('#990000'))
  129. )
  130. #'
  131. #' ### Acidity
  132. ## ----echo=FALSE, acidity_plot--------------------------------------------
  133. ac1 <- base_hist +
  134. geom_histogram(aes(x = fixed.acidity), binwidth = 0.25) +
  135. scale_x_continuous(breaks = seq(4, 16, 2)) +
  136. coord_cartesian(xlim = c(4, 16))
  137. ac2 <- base_hist +
  138. geom_histogram(aes(x = volatile.acidity), binwidth = 0.05) +
  139. scale_x_continuous(breaks = seq(0, 2, 0.5)) +
  140. coord_cartesian(xlim = c(0, 2))
  141. grid.arrange(ac1, ac2, nrow=2)
  142. #'
  143. #' **Fixed acidity** is determined by acids that do not evaporate easily, such as
  144. #' tartaric acid. It contributes to many other attributes, including the taste, pH,
  145. #' color, and stability to oxidation, i.e., it prevents the wine from tasting flat.
  146. #' On the other hand, **volatile acidity** is responsible for the sour taste in
  147. #' wine. A very high value can lead to sour tasting wine, a low value can make
  148. #' the wine seem heavy.
  149. #' (References: [1](http://waterhouse.ucdavis.edu/whats-in-wine/fixed-acidity),
  150. #' [2](http://waterhouse.ucdavis.edu/whats-in-wine/volatile-acidity)).
  151. #'
  152. ## ----echo=FALSE, warning=FALSE, message=FALSE, acidity_univariate--------
  153. ac1 <- base_hist +
  154. geom_histogram(aes(x = fixed.acidity), binwidth = 0.25) +
  155. scale_x_continuous(breaks = seq(4, 16, 2)) +
  156. coord_cartesian(xlim = c(4, 16))
  157. ac2 <- base_hist +
  158. geom_histogram(aes(x = volatile.acidity), binwidth = 0.05) +
  159. scale_x_continuous(breaks = seq(0, 2, 0.5)) +
  160. coord_cartesian(xlim = c(0, 2))
  161. ac3 <- base_hist +
  162. geom_histogram(aes(x = total.acidity), binwidth = 0.25) +
  163. scale_x_continuous(breaks = seq(0, 18, 2)) +
  164. coord_cartesian(xlim = c(0, 18))
  165. grid.arrange(ac1, ac2, ac3, nrow=3)
  166. print("Summary statistics of Fixed Acidity")
  167. summary(wine$fixed.acidity)
  168. print("Summary statistics of Volatile Acidity")
  169. summary(wine$volatile.acidity)
  170. print("Summary statistics of Total Acidity")
  171. summary(wine$total.acidity)
  172. #'
  173. #' Of the wines in our dataset, the most common fixed acidity is around 7.5 g/dm^3.
  174. #' The median fixed acidity is 7.9 g/dm^3, and the mean is 8.32 g/dm^3. There is a
  175. #' slight positive skew because a few wines possess a very high fixed acidity.
  176. #' The median volatile acidity is 0.52 g/dm^3, and the mean is 0.5278 g/dm^3. *It
  177. #' will be interesting to see in the bivariate section how wine quality relates
  178. #' to the level of acidity.*
  179. #'
  180. #' ### Citric Acid
  181. #' Citric acid is part of the fixed acid content of most wines. As a non-volatile
  182. #' acid, it adds many of the same characteristics as tartaric acid does.
  183. #' Again, I would guess that most good wines have a balanced amount of citric
  184. #' acid.
  185. #'
  186. ## ----echo=FALSE, warning=FALSE, message=FALSE, citric_acid_univariate----
  187. base_hist +
  188. geom_histogram(aes(x = citric.acid), binwidth = 0.05) +
  189. scale_x_continuous(breaks = seq(0, 1, 0.2)) +
  190. coord_cartesian(xlim = c(0, 1))
  191. print("Summary statistics of Citric Acid")
  192. summary(wine$citric.acid)
  193. print('Number of Zero Values')
  194. table(wine$citric.acid == 0)
  195. #'
  196. #' There is a very high count of zeros in citric acid. It is worth checking whether
  197. #' these are genuine zeros or merely ‘not available’ values. A quick check using the
  198. #' table function shows that there are 132 observations of zero and no NA values
  199. #' in the reported citric acid concentration. The citric acid concentration could
  200. #' have been too low to measure and was therefore reported as zero.
  201. #'
  202. #' In terms of content, the wines have a median citric acid level of
  203. #' 0.26 g/dm^3 and a mean level of 0.271 g/dm^3.
  204. #'
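#' A zero and a missing value are counted differently in R, which is what the
#' check above relies on. A minimal sketch on a made-up vector `x` (hypothetical
#' values, not taken from the dataset):
#'
## ----echo=TRUE, zero_vs_na_sketch-----------------------------------------
x <- c(0, 0.26, NA, 0, 0.49)
n_zero <- sum(x == 0, na.rm = TRUE)   # genuine zeros
n_missing <- sum(is.na(x))            # values recorded as NA
# table() drops NA by default; useNA = "ifany" makes missing values visible.
counts <- table(x == 0, useNA = "ifany")
#'
#' On the wine data, the same calls would be applied to `wine$citric.acid`.
#'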
  205. #' ### Sulfur-Dioxide & Sulphates
  206. #' **Free sulfur dioxide** is the free form of SO2, which exists in equilibrium between
  207. #' molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial
  208. #' growth and the oxidation of wine. **Sulphates** are a wine additive which can
  209. #' contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial
  210. #' and antioxidant, *overall keeping the wine fresh*.
  211. #'
  212. ## ----echo=FALSE, warning=FALSE, message=FALSE, sulfur_univariate---------
  213. sul1 <- base_hist + geom_histogram(aes(x = free.sulfur.dioxide))
  214. sul2 <- base_hist + geom_histogram(aes(x = log10(free.sulfur.dioxide)))
  215. sul3 <- base_hist + geom_histogram(aes(x = total.sulfur.dioxide))
  216. sul4 <- base_hist + geom_histogram(aes(x = log10(total.sulfur.dioxide)))
  217. sul5 <- base_hist + geom_histogram(aes(x = sulphates))
  218. sul6 <- base_hist + geom_histogram(aes(x = log10(sulphates)))
  219. grid.arrange(sul1, sul2, sul3, sul4, sul5, sul6, nrow=3)
  220. #'
  221. #' The distributions of all three values are positively skewed with a long tail.
  222. #' The log-transformation results in a roughly normal distribution for 'total
  223. #' sulfur dioxide' and 'sulphates'.
  224. #'
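#' One way to put a number on the effect of the log-transformation is to
#' compare sample skewness before and after it. The sketch below uses simulated
#' log-normal data rather than the wine data, and `skewness()` is a helper
#' defined here for illustration (base R does not provide one):
#'
## ----echo=TRUE, log_skewness_sketch---------------------------------------
# Moment-based sample skewness: third central moment over variance^1.5.
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
set.seed(42)
x <- rlnorm(1000, meanlog = 0, sdlog = 0.5)  # positively skewed by construction
skew_raw <- skewness(x)
skew_log <- skewness(log10(x))
#'
#' A skewness close to zero after the transformation is consistent with a
#' roughly symmetric distribution.
#'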
  225. #' ### Alcohol
  226. #' Alcohol is what adds that special something that turns rotten grape juice
  227. #' into a drink many people love. Hence, by intuitive understanding, it should
  228. #' be crucial in determining the wine quality.
  229. #'
  230. ## ----echo=FALSE, warning=FALSE, message=FALSE, alcohol_univariate--------
  231. base_hist +
  232. geom_histogram(aes(x = alcohol), binwidth = 0.25) +
  233. scale_x_continuous(breaks = seq(8, 15, 2)) +
  234. coord_cartesian(xlim = c(8, 15))
  235. print("Summary statistics for alcohol %age.")
  236. summary(wine$alcohol)
  237. #'
  238. #' The mean alcohol content for our wines is 10.42%, and the median is 10.2%.
  239. #'
  240. #' ### Quality
  241. ## ----echo=FALSE,warning=FALSE, message=FALSE, quality_univariate---------
  242. qplot(x=quality, data=wine, geom='bar',
  243. fill=I("#990000"),
  244. col=I("black"))
  245. print("Summary statistics - Wine Quality.")
  246. summary(wine$quality)
  247. #'
  248. #' Overall wine quality, rated on a scale from 0 to 10, has a roughly normal shape
  249. #' with very few exceptionally high or low-quality ratings.
  250. #'
  251. #' The minimum quality score is 3 and the maximum is 8.
  252. #' Hence, a variable called ‘rating’ is created based on the variable quality.
  253. #'
  254. #' * 7 to 8 are Rated A.
  255. #'
  256. #' * 5 to 6 are Rated B.
  257. #'
  258. #' * 3 to 4 are Rated C.
  259. #'
  260. ## ----echo=FALSE, quality_rating------------------------------------------
  261. # Dividing the quality into 3 rating levels
  262. wine$rating <- ifelse(wine$quality < 5, 'C',
  263. ifelse(wine$quality < 7, 'B', 'A'))
  264. # Changing it into an ordered factor
  265. wine$rating <- ordered(wine$rating,
  266. levels = c('C', 'B', 'A'))
  267. summary(wine$rating)
  268. qr1 <- ggplot(aes(as.numeric(as.character(quality)), fill=rating), data=wine) +  # scores, not factor level codes
  269. geom_bar() +
  270. ggtitle ("Barchart of Quality with Rating") +
  271. scale_x_continuous(breaks=seq(3,8,1)) +
  272. xlab("Quality") +
  273. theme_pander() + scale_colour_few()
  274. qr2 <- qplot(x=rating, data=wine, geom='bar',
  275. fill=I("#990000"),
  276. col=I("black")) +
  277. xlab("Rating") +
  278. ggtitle("Barchart of Rating") +
  279. theme_pander()
  280. grid.arrange(qr1, qr2, ncol=2)
  281. #'
  282. #' Most wines fall into the 'B' rating, as already seen in the quality
  283. #' distribution. This is likely to cause overplotting. Therefore,
  284. #' a comparison of only the 'C' and 'A' wines is done to find distinctive
  285. #' properties that separate these two. The comparison is made using summary
  286. #' statistics.
  287. #'
  288. ## ----echo=FALSE, rating_comparison---------------------------------------
  289. print("Summary statistics of Wine with Rating 'A'")
  290. summary(subset(wine, rating=='A'))
  291. print("Summary statistics of Wine with Rating 'C'")
  292. summary(subset(wine, rating=='C'))
  293. #'
  294. #' On comparing the *mean statistic* of different attributes for 'A-rated' and
  295. #' 'C-rated' wines (A → C), the following percentage changes are noted.
  296. #'
  297. #' 1. `fixed.acidity`: mean reduced by 11%.
  298. #'
  299. #' 2. `volatile.acidity`: mean increased by 80%.
  300. #'
  301. #' 3. `citric.acid`: mean increased by 117%.
  302. #'
  303. #' 4. `sulphates`: mean reduced by 20.3%.
  304. #'
  305. #' 5. `alcohol`: mean reduced by 12.7%.
  306. #'
  307. #' 6. `residual.sugar` and `chlorides` showed very little variation.
  308. #'
  309. #' These changes are, however, only suitable for identifying variables that may
  310. #' affect quality and for guiding further analysis. No conclusion
  311. #' can be drawn from them.
  312. #'
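#' The percentage changes above depend on the direction of the comparison,
#' so it helps to compute them explicitly. A minimal sketch on made-up values
#' (the data frame `toy` and the helper `pct_change()` are illustrative and
#' not part of the analysis):
#'
## ----echo=TRUE, pct_change_sketch-----------------------------------------
# Percentage change when moving from `from` to `to`.
pct_change <- function(from, to) 100 * (to - from) / from
toy <- data.frame(
  rating = c('A', 'A', 'C', 'C'),
  alcohol = c(11, 12, 10, 10.4),
  sulphates = c(0.7, 0.8, 0.5, 0.7)
)
props <- c('alcohol', 'sulphates')
mean_a <- colMeans(toy[toy$rating == 'A', props])
mean_c <- colMeans(toy[toy$rating == 'C', props])
a_to_c <- round(pct_change(from = mean_a, to = mean_c), 1)
c_to_a <- round(pct_change(from = mean_c, to = mean_a), 1)
#'
#' The two directions give different magnitudes for the same pair of means,
#' so the baseline should be stated alongside each figure.
#'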
  313. #' ## Univariate Analysis - Summary
  314. #'
  315. #' ### Overview
  316. #' The red wine dataset features 1599 separate observations, each for a different
  317. #' red wine sample. As presented, each wine sample is provided as a single row in
  318. #' the dataset. Due to the nature of how some measurements are gathered, some
  319. #' values given represent *components* of a measurement total.
  320. #'
  321. #' For example, `fixed.acidity` and `volatile.acidity` are both obtained
  322. #' via separate measurement techniques, and must be summed to indicate the total
  323. #' acidity present in a wine sample. For this case, I supplemented the data
  324. #' given by computing the total and storing it in the data frame as the
  325. #' `total.acidity` variable.
  326. #'
  327. #' ### Features of Interest
  328. #' An interesting measurement here is the wine `quality`. It is the
  329. #' subjective measurement of how attractive the wine might be to a consumer. The
  330. #' goal here is to try and correlate non-subjective wine properties with its
  331. #' quality.
  332. #'
  333. #' I am curious about a few trends in particular: **Sulphates vs. Quality**, as
  334. #' low sulphate wine has a reputation for not causing hangovers;
  335. #' **Acidity vs. Quality**, since acidity impacts many factors like pH,
  336. #' taste, and color, so it is compelling to see if it affects the quality; and
  337. #' **Alcohol vs. Quality**, which is simply an interesting measurement.
  338. #'
  339. #' At first, the lack of an *age* metric was surprising since it is commonly
  340. #' a factor in quick assumptions of wine quality. However, since the actual effect
  341. #' of wine age is on the wine's measurable chemical properties, its inclusion here
  342. #' might not be necessary.
  343. #'
  344. #' ### Distributions
  345. #' Many measurements that were clustered close to zero had a positive skew
  346. #' (you cannot have negative percentages or amounts). Others, such as `pH`,
  347. #' `total.acidity`, and `quality`, had normal-looking distributions.
  348. #'
  349. #' The distributions studied in this section were primarily used to identify the
  350. #' trends in variables present in the dataset. This helps in setting up a track
  351. #' for moving towards bivariate and multivariate analysis.
  352. #'
  353. #' ## Bivariate Plots Section
  354. #'
  355. ## ----echo=FALSE, message=FALSE, warning=FALSE, correlation_plots---------
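  356. ggcorr(transform(wine, quality = as.numeric(as.character(quality))),
  357. size = 2.2, hjust = 0.8,  # quality passed as numeric scores; ggcorr drops non-numeric columns
  358. low = "#4682B4", mid = "white", high = "#E74C3C")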
  359. #'
  360. #' **Observations from the correlation matrix.**
  361. #'
  362. #' * Total Acidity is highly correlated with fixed acidity.
  363. #'
  364. #' * pH appears to be correlated with acidity, citric acid, chlorides, and residual
  365. #' sugars.
  366. #'
  367. #' * No single property appears to correlate strongly with quality.
  368. #'
  369. #' Further, in this section, metrics of interest are evaluated to check their
  370. #' significance on the wine quality. Moreover, bivariate relationships between
  371. #' other variables are also studied.
  372. #'
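#' Because `quality` is stored as an ordered factor, it has to be converted to
#' its numeric scores before a correlation with it can be computed. A minimal
#' sketch on a made-up data frame `toy`; the choice of Spearman's rank
#' correlation is an assumption made here because quality is ordinal:
#'
## ----echo=TRUE, quality_correlation_sketch--------------------------------
toy <- data.frame(
  quality = ordered(c(3, 4, 5, 6, 7, 8), levels = 0:10),
  alcohol = c(9.0, 9.5, 10.0, 10.8, 11.5, 12.0),
  volatile.acidity = c(0.9, 0.7, 0.6, 0.5, 0.4, 0.3)
)
# Recover the scores rather than the factor level codes.
score <- as.numeric(as.character(toy$quality))
props <- c('alcohol', 'volatile.acidity')
# One rank correlation per property, each against the quality score.
rho <- sapply(toy[props], cor, y = score, method = 'spearman')
#'
#' On the wine data, `toy` would be replaced by `wine` and `props` by the
#' names of the physicochemical columns.
#'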
  373. #' ### Acidity vs. Rating & Quality
  374. #'
  375. ## ----echo=FALSE, message=FALSE, warning=FALSE, acidity_rating------------
  376. aq1 <- ggplot(aes(x=rating, y=total.acidity), data = wine) +
  377. geom_boxplot(fill = '#ffeeee') +
  378. coord_cartesian(ylim=c(0, quantile(wine$total.acidity, 0.99))) +
  379. geom_point(stat='summary', fun.y=mean,color='red') +
  380. xlab('Rating') + ylab('Total Acidity')
  381. aq2 <- ggplot(aes(x=quality, y=total.acidity), data = wine) +
  382. geom_boxplot(fill = '#ffeeee') +
  383. coord_cartesian(ylim=c(0, quantile(wine$total.acidity, 0.99))) +
  384. geom_point(stat='summary', fun.y=mean, color='red') +
  385. xlab('Quality') + ylab('Total Acidity') +
  386. geom_jitter(alpha=1/10, color='#990000') +
  387. ggtitle("\n")
  388. grid.arrange(aq1, aq2, ncol=1)
  389. #'
  390. #' The boxplots by quality also show the distribution
  391. #' of the various wines, and we can again see that wines of quality 5 and 6 have
  392. #' the largest share. The red dot is the mean, and the middle line shows the median.
  393. #'
  394. #' The box plots show how the acidity decreases as the quality of wine improves.
  395. #' However, the difference is not very noticeable. Since most wines tend to
  396. #' maintain a similar acidity level, and given that *volatile acidity* is
  397. #' responsible for the sour taste in wine, a density plot of that
  398. #' attribute is plotted to investigate the data.
  399. #'
  400. ## ----echo=FALSE, message=FALSE, warning=FALSE, acidity_quality_rating----
  401. ggplot(aes(x = volatile.acidity, fill = quality, color = quality),
  402. data = wine) +
  403. geom_density(alpha=0.08)
  404. #'
  405. #' Red wines of `quality` 7 and 8 have their peaks for `volatile.acidity` well
  406. #' below the 0.4 mark. Wine with `quality` 3 has its peak farthest to the right
  407. #' (towards more volatile acidity). This shows that the better quality
  408. #' wines are less sour and in general have lower acidity.
  409. #'
  410. #' ### Alcohol vs. Quality
  411. #'
  412. ## ----echo=FALSE, message=FALSE, warning=FALSE, alcohol_quality_sugar-----
  413. qas0 <- ggplot(aes(x=alcohol, y=as.numeric(as.character(quality))), data=wine) +  # scores, not factor level codes
  414. geom_jitter(alpha=1/12) +
  415. geom_smooth() +
  416. ggtitle("Alcohol Content vs. Quality") +
  417. ylab("Quality") + xlab("Alcohol")
  418. qas1 <- ggplot(aes(x=alcohol), data=wine) +
  419. geom_density(fill=I("#BB0000")) +
  420. facet_wrap("quality") +
  421. ggtitle("Alcohol Content for \nWine Quality Ratings") +
  422. ylab("Density") + xlab("Alcohol")
  423. qas2 <- ggplot(aes(x=residual.sugar, y=alcohol), data=wine) +
  424. geom_jitter(alpha=1/12) +
  425. geom_smooth() +
  426. ggtitle("Alcohol vs. Residual Sugar Content") +
  427. ylab("Alcohol") + xlab("Residual Sugar")
  428. grid.arrange(qas1, arrangeGrob(qas0, qas2), ncol=2)
  429. #'
  430. #' The plot of residual sugar against alcohol content suggests that there is no
  431. #' clear relation between sugar and alcohol content, which is surprising as
  432. #' alcohol is a byproduct of the yeast feeding off of sugar during the
  433. #' fermentation process. That link could not be established here.
  434. #'
  435. #' Alcohol and quality appear to be somewhat correlated. Lower quality wines
  436. #' tend to have lower alcohol content. This can be further studied using boxplots.
  437. #'
  438. ## ----echo=FALSE, message=FALSE, warning=FALSE----------------------------
  439. quality_groups <- group_by(wine, alcohol)
  440. wine.quality_groups <- summarize(quality_groups,
  441. acidity_mean = mean(volatile.acidity),
  442. pH_mean = mean(pH),
  443. sulphates_mean = mean(sulphates),
  444. qmean = mean(as.numeric(as.character(quality))),  # mean of scores, not factor level codes
  445. n = n())
  446. wine.quality_groups <- arrange(wine.quality_groups, alcohol)
  447. #'
  448. ## ----echo=FALSE, message=FALSE, warning=FALSE, alcohol_quality-----------
  449. ggplot(aes(y=alcohol, x=factor(quality)), data = wine) +
  450. geom_boxplot(fill = '#ffeeee')+
  451. xlab('quality')
  452. #'
  453. #' The boxplots show an indication that higher quality wines have higher alcohol
  454. #' content. This trend is shown by all the quality grades from 3 to 8 except
  455. #' quality grade 5.
  456. #'
  457. #' **Does this mean that by adding more alcohol, we'd get better wine?**
  458. #'
  459. ## ----echo=FALSE, message=FALSE, warning=FALSE----------------------------
  460. ggplot(aes(alcohol, qmean), data=wine.quality_groups) +
  461. geom_smooth() +
  462. ylab("Quality Mean") +
  463. scale_x_continuous(breaks = seq(0, 15, 0.5)) +
  464. xlab("Alcohol %")
  465. #'
  466. #' The above line plot indicates a nearly linear increase up to 13% alcohol
  467. #' concentration, followed by a steep downward trend. The curve is
  468. #' smoothed to reduce variance and noise.
  469. #'
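#' Averaging quality over every distinct alcohol value leaves groups with very
#' few wines, which is one source of the noise. An alternative, sketched here
#' on made-up values, is to average within alcohol bins; the one-percent bin
#' width is an arbitrary choice for illustration:
#'
## ----echo=TRUE, binned_quality_sketch-------------------------------------
toy <- data.frame(
  alcohol = c(9.1, 9.4, 10.2, 10.7, 11.3, 11.8),
  quality = c(5, 5, 5, 6, 6, 7)
)
# cut() assigns each wine to a half-open interval such as (9,10].
toy$alcohol_bin <- cut(toy$alcohol, breaks = seq(9, 12, by = 1))
# Mean quality per bin, one row for each bin that contains wines.
binned <- aggregate(quality ~ alcohol_bin, data = toy, FUN = mean)
#'
#' Wider bins give steadier means at the cost of resolution along the alcohol axis.
#'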
  470. #' ### Sulphates vs. Quality
  471. #'
  472. ## ----echo=FALSE, message=FALSE, warning=FALSE, sulphates_quality---------
  473. ggplot(aes(y=sulphates, x=quality), data=wine) +
  474. geom_boxplot(fill="#ffeeee")
  475. #'
  476. #' Good wines have higher sulphates values than bad wines, though the difference
  477. #' is not that wide.
  478. #'
  479. ## ----echo=FALSE, message=FALSE, warning=FALSE, sulphates_qplots----------
  480. sq1 <- ggplot(aes(x=sulphates, y=as.numeric(as.character(quality))), data=wine) +  # scores, not factor level codes
  481. geom_jitter(alpha=1/10) +
  482. geom_smooth() +
  483. xlab("Sulphates") + ylab("Quality") +
  484. ggtitle("Sulphates vs. Quality")
  485. sq2 <- ggplot(aes(x=sulphates, y=as.numeric(as.character(quality))),
  486. data=subset(wine, wine$sulphates < 1)) +
  487. geom_jitter(alpha=1/10) +
  488. geom_smooth() +
  489. xlab("Sulphates") + ylab("Quality") +
  490. ggtitle("\nSulphates vs Quality without Outliers")
  491. grid.arrange(sq1, sq2, nrow = 2)
  492. #'
  493. #' There is a slight trend implying a relationship between sulphates and wine
  494. #' quality, mainly if extreme sulphate values are ignored. Disregarding
  495. #' measurements where sulphates > 1.0 is the same as disregarding
  496. #' the positive tail of the distribution, keeping just the normal-looking portion.
  497. #' However, the relationship is still mathematically weak.
  498. #'
  499. #' ## Bivariate Analysis - Summary
  500. #'
  501. #' There is no apparent and mathematically strong correlation between any wine
  502. #' property and the given quality. Alcohol content is a strong contender, but even
  503. #' so, the correlation was not particularly strong.
  504. #'
  505. #' Most properties have roughly normal distributions, with some skew in one tail.
  506. #' Scatterplot relationships between these properties often showed a slight trend
  507. #' within the bulk of property values. However, as soon as we leave the
  508. #' expected range, the trends reverse, for example in Alcohol Content or
  509. #' Sulphates vs. Quality. The trend is not a definitive one, but it is seen in
  510. #' different variables.
  511. #'
  512. #' Possibly, obtaining an outlier property (say sulphate content) is particularly
  513. #' challenging to do in the wine making process. Alternatively, there is a chance
  514. #' that the wines that exhibit outlier properties are deliberately of a
  515. #' non-standard variety. In that case, it could be that wine judges have a harder
  516. #' time agreeing on a quality rating.
  517. #'
  518. #' ## Multivariate Plots Section
  519. #'
  520. #' This section includes visualizations that take bivariate analysis a step
  521. #' further, i.e., to understand the earlier patterns better or to strengthen the
  522. #' arguments that were presented in the previous section.
  523. #'
  524. #' ### Alcohol, Volatile Acid & Wine Rating
  525. #'
  526. ## ----echo=FALSE, message=FALSE, warning=FALSE, alcohol_acid_quality------
  527. ggplot(wine, aes(x=alcohol, y=volatile.acidity, color=quality)) +
  528. geom_jitter(alpha=0.8, position = position_jitter()) +
  529. geom_smooth(method="lm", se = FALSE, size=1) +
  530. scale_color_brewer(type='seq',
  531. guide=guide_legend(title='Quality')) +
  532. theme_pander()
  533. #'
  534. #' Earlier inspections suggested that volatile acidity and alcohol had comparatively
  535. #' high negative and positive correlations with quality, respectively. Alcohol seems
  536. #' to vary more than volatile acidity across quality levels, and nearly every Rating A
  537. #' wine has less than 0.6 volatile acidity.
  538. #'
  539. #' ### Understanding the Significance of Acidity
  540. #'
  541. ## ----echo=FALSE, message=FALSE, warning=FALSE, acid_quality--------------
  542. ggplot(subset(wine, rating=='A'|rating=='C'),
  543. aes(x=volatile.acidity, y=citric.acid)) +
  544. geom_point() +
  545. geom_jitter(position=position_jitter(), aes(color=rating)) +
  546. geom_vline(xintercept=c(0.6), linetype='dashed', size=1, color='black') +
  547. geom_hline(yintercept=c(0.5), linetype='dashed', size=1, color='black') +
  548. scale_x_continuous(breaks = seq(0, 1.6, .1)) +
  549. theme_pander() + scale_colour_few()
  550. #'
  551. #' Nearly every wine has volatile acidity less than 0.8. As discussed earlier, the
  552. #' A-rated wines nearly all have volatile.acidity of less than 0.6. For wines with
  553. #' rating B, the volatile acidity is between 0.4 and 0.8. Some C-rated wines have
  554. #' a volatile acidity value of more than 0.8.
  555. #'
  556. #' Most A-rated wines have citric acid values of 0.25 to 0.75, while the B-rated
  557. #' wines have citric acid values below 0.50.
  558. #'
  559. #' ### Understanding the Significance of Sulphates
  560. #'
  561. ## ----echo=FALSE, message=FALSE, warning=FALSE----------------------------
  562. ggplot(subset(wine, rating=='A'|rating=='C'), aes(x = alcohol, y = sulphates)) +
  563. geom_jitter(position = position_jitter(), aes(color=rating)) +
  564. geom_hline(yintercept=c(0.65), linetype='dashed', size=1, color='black') +
  565. theme_pander() + scale_colour_few() +
  566. scale_y_continuous(breaks = seq(0, 2, .2))
  567. #'
  568. #' It is incredible to see that nearly all wines lie below the 1.0 sulphates level.
  569. #' Due to overplotting, wines with rating B have been removed. It can be seen that
  570. #' rating A wines mostly have sulphate values between 0.5 and 1 and the best rated
  571. #' wines have sulphate values between 0.6 and 1. Alcohol has the same values as
  572. #' seen before.
  573. #'
  574. #' ### Density & Sugar
  575. #'
  576. ## ----echo=FALSE, message=FALSE, warning=FALSE, Multivariate_Plots2-------
  577. da1 <- ggplot(aes(x=density, y=total.acidity, color=as.numeric(as.character(quality))),
  578. data=wine) +
  579. geom_point(position='jitter') +
  580. geom_smooth() +
  581. labs(x="Density", y="Total Acidity", color="Quality") +  # labels match the mapped axes
  582. ggtitle("Density vs. Acidity Colored by Wine Quality Ratings")
  583. cs2 <- ggplot(aes(x=residual.sugar, y=density, color=as.numeric(as.character(quality))),
  584. data=wine) +
  585. geom_point(position='jitter') +
  586. geom_smooth() +
  587. labs(x="Residual Sugar", y="Density", color="Quality") +
  588. ggtitle("\nSugar vs. Density Colored by Wine Quality Ratings")
  589. grid.arrange(da1, cs2)
  590. #'
#' Higher-quality wines appear to be slightly associated with higher acidity
#' across all densities. Moreover, both unusually high- and low-quality wines
#' coincide with higher-than-usual sugar content.
#'
#' ## Multivariate Analysis - Summary
#' Based on the investigation, higher `citric.acid` and lower
#' `volatile.acidity` contribute towards better wines. Better wines also tend to
#' have higher alcohol content.
#'
#' The `sulphates` and `alcohol` graphs gave surprising results. Sulphates had a
#' better correlation with quality than citric acid, yet the distribution was
#' not that distinct between wines of different quality. Furthermore, nearly all
#' wines had a sulphate content of less than 1, irrespective of the alcohol
#' content; sulphate is a byproduct of fermentation just like alcohol.
#'
#' Because wine rating is a subjective measure, statistical correlation values
#' are not a very suitable metric for finding the important factors. This was
#' realized half-way through the study. The graphs suggest that each factor has
#' a suitable range and that some combination of chemical factors contributes to
#' the flavour of wine.
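#'
#' Since quality is an ordinal score, one way to probe the point about
#' correlation is to compare a linear (Pearson) coefficient with a rank-based
#' (Spearman) one. This is a minimal sketch, not something the original study
#' ran: `quality_correlations()` is a hypothetical helper, and the
#' `as.character()` step assumes `quality` may be stored as a factor.
#'
## ----echo=TRUE, message=FALSE, warning=FALSE, sketch_rank_correlation----
# Pearson and Spearman correlation of one column with the quality score.
# (Hypothetical helper, not defined anywhere in the original script.)
quality_correlations <- function(df, column) {
  # Recover the numeric scores whether quality is a factor or already numeric.
  score <- as.numeric(as.character(df$quality))
  c(pearson  = cor(df[[column]], score, method = "pearson"),
    spearman = cor(df[[column]], score, method = "spearman"))
}

# Runs only when the `wine` data frame loaded earlier in the script exists.
if (exists("wine") && is.data.frame(wine)) {
  print(sapply(c("alcohol", "sulphates", "citric.acid", "volatile.acidity"),
               function(col) quality_correlations(wine, col)))
}
#'
#' Both coefficients only capture monotonic trends, so neither can reveal a
#' "suitable range" such as the drop in quality at high alcohol levels.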
#'
#' ## Final Plots and Summary
#'
#' ### Plot One
#'
## ----echo=FALSE, message=FALSE, warning=FALSE, plot_2--------------------
qr1 <- ggplot(aes(as.numeric(quality), fill=rating), data=wine) +
  geom_bar() +
  ggtitle("Barchart of Quality with Rating") +
  scale_x_continuous(breaks=seq(3,8,1)) +
  xlab("Quality") +
  theme_pander() + scale_colour_few()
qr2 <- qplot(x=rating, data=wine, geom='bar',
             fill=I("#990000"),
             col=I("black")) +
  xlab("Rating") +
  ggtitle("Barchart of Rating") +
  theme_pander()
grid.arrange(qr1, qr2, ncol=2)
#'
#' #### Description One
#' The plot is from the univariate section, which introduced the idea behind
#' this analysis. Many of the visualizations in the analysis plot only the data
#' points from A and C rated wines. Comparing only the 'C' and 'A' wines first
#' helped find the distinctive properties that separate the two.
#'
#' It also suggests that the critics may be highly subjective, as they do not
#' rate any wine 1, 2, 9 or 10. With most wines being mediocre, the wines that
#' received the less common ratings must have caught the attention of the wine
#' experts; hence the idea of comparing these two rating classes.
#'
#' ### Plot Two
#'
## ----echo=FALSE, warning=FALSE, message=FALSE, plot_1a-------------------
ggplot(aes(x=alcohol), data=wine) +
  geom_density(fill=I("#BB0000")) +
  facet_wrap("quality") +
  ggtitle("Alcohol Content for Wine Quality Ratings") +
  labs(x="Alcohol [%age]", y="") +
  theme(plot.title = element_text(face="plain"),
        axis.title.x = element_text(size=10),
        axis.title.y = element_text(size=10))
#'
## ----echo=FALSE, message=FALSE, warning=FALSE, plot_1b-------------------
fp1 <- ggplot(aes(y=alcohol, x=quality), data = wine) +
  geom_boxplot() +
  labs(x="Quality", y="Alcohol [%age]") +
  ggtitle("Boxplot of Alcohol and Quality") +
  theme(plot.title = element_text(face="plain"),
        axis.title.x = element_text(size=10),
        axis.title.y = element_text(size=10))
fp2 <- ggplot(aes(alcohol, qmean), data=wine.quality_groups) +
  geom_smooth() +
  scale_x_continuous(breaks = seq(0, 15, 0.5)) +
  ggtitle("\nLine Plot of Quality Mean & Alcohol Percentage") +
  labs(x="Alcohol [%age]", y="Quality (Mean)") +
  theme(plot.title = element_text(face="plain"),
        axis.title.x = element_text(size=10),
        axis.title.y = element_text(size=10))
grid.arrange(fp1, fp2)
#'
#' #### Description Two
#'
#' These plots are taken from the bivariate analysis section, which discusses
#' the effect of alcohol percentage on quality.
#'
#' The first visualization was especially appealing to me because you can
#' almost see the distribution shift from left to right as wine ratings
#' increase. Again, this shows a general tendency rather than a substantial
#' significance in judging wine quality.
#'
#' The boxplots above show a steady rise in the level of alcohol as quality
#' increases. An interesting trend, a decrease in quality above 13% alcohol,
#' gave way to further analysis, which shows that a general correlation measure
#' might not be suitable for the study.
#'
#' The plot that follows set the basis on which I carried out the complete
#' analysis. Rather than emphasizing mathematical correlation measures, I based
#' the inferences on investigating the visualizations. This felt suitable given
#' the subjectivity of the wine quality measure.
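#'
#' The line plot relies on `wine.quality_groups` and its `qmean` column, which
#' are built earlier in the script and are not shown in this section. As a
#' rough sketch only, a summary of that shape could be produced as follows;
#' `mean_quality_by()` is a hypothetical helper, and the `as.character()` step
#' assumes `quality` may be stored as a factor.
#'
## ----echo=TRUE, message=FALSE, warning=FALSE, sketch_mean_quality--------
# Mean quality score for every distinct value of `column`.
# (Hypothetical helper; the original construction may differ.)
mean_quality_by <- function(df, column = "alcohol") {
  # Recover the numeric scores whether quality is a factor or already numeric.
  score <- as.numeric(as.character(df$quality))
  out <- aggregate(score, by = list(df[[column]]), FUN = mean)
  names(out) <- c(column, "qmean")
  out
}

# Runs only when the `wine` data frame loaded earlier in the script exists.
if (exists("wine") && is.data.frame(wine)) {
  print(head(mean_quality_by(wine)))
}
#'
#' Plotting `qmean` against alcohol, rather than a single correlation value, is
#' what makes the dip in quality at high alcohol levels visible.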
#'
#' ### Plot Three
#'
## ----echo=FALSE, message=FALSE, warning=FALSE, plot_3--------------------
fp3 <- ggplot(subset(wine, rating=='A'|rating=='C'),
              aes(x = volatile.acidity, y = citric.acid)) +
  geom_point() +
  geom_jitter(position=position_jitter(), aes(color=rating)) +
  geom_vline(xintercept=c(0.6), linetype='dashed', size=1, color='black') +
  geom_hline(yintercept=c(0.5), linetype='dashed', size=1, color='black') +
  scale_x_continuous(breaks = seq(0, 1.6, .1)) +
  theme_pander() + scale_colour_few() +
  ggtitle("Wine Rating vs. Acids") +
  labs(x="Volatile Acidity (g/dm^3)", y="Citric Acid (g/dm^3)") +
  theme(plot.title = element_text(face="plain"),
        axis.title.x = element_text(size=10),
        axis.title.y = element_text(size=10),
        legend.title = element_text(size=10))
fp4 <- ggplot(subset(wine, rating=='A'|rating=='C'),
              aes(x = alcohol, y = sulphates)) +
  geom_jitter(position = position_jitter(), aes(color=rating)) +
  geom_hline(yintercept=c(0.65), linetype='dashed', size=1, color='black') +
  theme_pander() + scale_colour_few() +
  scale_y_continuous(breaks = seq(0,2,.2)) +
  ggtitle("\nSulphates, Alcohol & Wine-Rating") +
  labs(x="Alcohol [%]", y="Sulphates (g/dm^3)") +
  theme(plot.title = element_text(face="plain"),
        axis.title.x = element_text(size=10),
        axis.title.y = element_text(size=10),
        legend.title = element_text(size=10))
grid.arrange(fp3, fp4, nrow=2)
#'
#' #### Description Three
#' These plots served to find distinguishing boundaries for the given
#' attributes, i.e., `sulphates`, `citric.acid`, `alcohol` and
#' `volatile.acidity`. The conclusions drawn from them are that sulphates should
#' be high but less than 1, with an alcohol concentration around 12-13% and low
#' (< 0.6) volatile acidity. The plots can be viewed almost as a depiction of a
#' classification methodology without the application of any machine learning
#' algorithm. Moreover, they strengthened the arguments made in the earlier
#' analysis of the data.
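#'
#' The boundaries described above can be written down as an explicit rule. The
#' chunk below is a minimal sketch, not part of the original analysis:
#' `flag_promising()` is a hypothetical helper. The 0.6 and 0.65 cut-offs are
#' the dashed lines in the plots, the cap of 1 and the 12-13% alcohol band come
#' from the text, and treating 0.65 as the lower bound for "high" sulphates is
#' an assumption.
#'
## ----echo=TRUE, message=FALSE, warning=FALSE, sketch_rule_flag-----------
# TRUE for wines that fall inside all of the visually identified boundaries.
# (Hypothetical helper; the default thresholds are read off the plots.)
flag_promising <- function(df,
                           max_volatile   = 0.6,
                           sulphate_range = c(0.65, 1.0),
                           alcohol_range  = c(12, 13)) {
  df$volatile.acidity < max_volatile &
    df$sulphates >= sulphate_range[1] & df$sulphates < sulphate_range[2] &
    df$alcohol >= alcohol_range[1] & df$alcohol <= alcohol_range[2]
}

# Runs only when the `wine` data frame loaded earlier in the script exists.
if (exists("wine") && is.data.frame(wine)) {
  # Cross-tabulate the rule against the rating to see how well it separates.
  print(table(rule = flag_promising(wine), rating = wine$rating))
}
#'
#' If the rule were a good separator, most flagged wines would carry rating A
#' and very few would carry rating C.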
#'
#' ------
#'
#' ## Reflection
#' In this project, I was able to examine the relationships between
#' *physicochemical* properties and identify the key variables that determine
#' red wine quality: alcohol content, volatile acidity and sulphate levels.
#'
#' The dataset is quite interesting, though limited in large-scale implications.
#' I believe that if this dataset held just one additional variable, it would be
#' vastly more useful to the layman. If *price* were supplied along with this
#' data, one could target the best wines within price categories and see which
#' aspects correlate with a high-performing wine in any price bracket.
#'
#' Overall, I was initially surprised by the seemingly dispersed nature of the
#' wine data. Nothing correlated immediately with being an inherent quality
#' of good wines. However, upon reflection, this is a sensible finding. Wine
#' making is still something of a science and an art, and if there were one
#' single property or process that continually yielded high quality wines, the
#' field wouldn't be what it is.
#'
#' According to the study, the best wines are the ones with an alcohol
#' concentration of about 13%, low volatile acidity and a high sulphates level
#' (with an upper cap of 1.0 g/dm^3).
#'
#' ### Future Work & Limitations
#' With my amateurish knowledge of wine-tasting, I tried my best to relate the
#' analysis to how I would rate a bottle of wine while dining. In the future,
#' however, I would like to do some research into the winemaking process. Some
#' winemakers might actively aim for particular property values or
#' combinations, and finding those combinations (of 3 or more properties) might
#' be the key to truly predicting wine quality. This investigation was not able
#' to find a robust generalized model that would consistently predict wine
#' quality with any degree of certainty.
#'
#' If I were to continue further with this specific dataset, I would aim to
#' train a classifier to correctly predict the wine category, in order to better
#' grasp the subtleties of what makes a good wine.
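#'
#' As a starting point for that idea, the chunk below sketches a simple
#' classification tree. It is only an illustration under stated assumptions:
#' the choice of `rpart` and of these four predictors is mine rather than the
#' study's, `fit_rating_tree()` is a hypothetical helper, and `rating` is
#' assumed to hold the wine category used throughout the analysis.
#'
## ----echo=TRUE, message=FALSE, warning=FALSE, sketch_rating_tree---------
library(rpart)  # recommended package shipped with standard R installations

# Fit a classification tree for the wine category on four chemical properties.
# (Hypothetical helper; the predictor set is an assumption.)
fit_rating_tree <- function(df) {
  rpart(rating ~ alcohol + volatile.acidity + sulphates + citric.acid,
        data = df, method = "class")
}

# Runs only when the `wine` data frame loaded earlier in the script exists.
if (exists("wine") && is.data.frame(wine)) {
  set.seed(42)  # reproducible train/test split
  train_rows <- sample(nrow(wine), size = floor(0.8 * nrow(wine)))
  tree <- fit_rating_tree(wine[train_rows, ])
  predicted <- predict(tree, newdata = wine[-train_rows, ], type = "class")
  # Confusion matrix on the held-out rows.
  print(table(predicted = predicted, actual = wine$rating[-train_rows]))
}
#'
#' A tree is used here because its splits are thresholds on single variables,
#' which makes them directly comparable to the boundaries read off the plots.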
#'
#' Additionally, having the wine type would be helpful for further analysis.
#' Sommeliers might prefer certain types of wines to have different
#' properties and behaviors. For example, a Port (a sweet dessert wine)
#' surely is rated differently from a dark and robust Cabernet Sauvignon,
#' which is rated differently from a bright and fruity Syrah. Without knowing
#' the type of wine, it is entirely possible that we are almost literally
#' comparing apples to oranges and can't find a correlation.