In this blog assignment, we use the UScereal data set from the MASS package.
Information about the data can be found here.
I choose three variables from (calories, protein, fat, sodium, fibre, carbo, sugars, potassium, vitamins) and construct a scatterplot matrix, a coplot, and a spinning 3-dimensional scatterplot for them. Then, based on my graphs, I describe the general relationships between the three variables. In addition, I look for two “special” cereals that seem to deviate from the general relationship patterns.
First, I choose calories, protein, and fiber and draw their scatterplot matrix.
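A minimal sketch of how such a scatterplot matrix could be drawn with base R. The object name cereal3 and the plot title are my own choices for illustration; the column names (calories, protein, fibre) are the ones used in UScereal.

```r
library(MASS)  # provides the UScereal data frame

# Keep only the three variables of interest (cereal3 is a hypothetical name).
cereal3 <- UScereal[, c("calories", "protein", "fibre")]

# Draw every pairwise scatterplot in one matrix.
pairs(cereal3, main = "UScereal: calories, protein, fibre")
```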
There is a positive association between calories and protein, calories and fiber, and protein and fiber. It is easy to see that there are some outliers in the data. Removing them might reveal a more clearly linear relationship, but as it stands the relationship is not linear. I have a few guesses about which points are outliers, so I remove them to see how the scatterplots change.
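The post does not say exactly which cereals were removed, so the following is only a sketch under an assumption: it drops the three cereals with the highest calories (n_drop = 3 is a hypothetical choice) and redraws the scatterplot matrix.

```r
library(MASS)  # provides the UScereal data frame

cereal3 <- UScereal[, c("calories", "protein", "fibre")]

# Assumption: treat the n_drop highest-calorie cereals as the suspected outliers.
n_drop <- 3
by_calories <- order(cereal3$calories, decreasing = TRUE)
drop_names <- rownames(cereal3)[by_calories][seq_len(n_drop)]

# Remove the suspected outliers and redraw the scatterplot matrix.
trimmed <- cereal3[!(rownames(cereal3) %in% drop_names), ]
pairs(trimmed, main = "UScereal without suspected outliers")
```

Changing n_drop, or replacing drop_names with a hand-picked vector of cereal names, shows how sensitive the picture is to the choice of points removed.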
We notice an increasing pattern between protein and calories: on average, more protein means more calories. There is also an increasing pattern between fiber and protein.
With a few data points removed, the associations are easier to see, but the relationships are still nonlinear.
Conditional plot:
“A conditional plot, also known as a coplot or subset plot, is a plot of two variables conditional on the value of a third variable (called the conditioning variable). The conditioning variable may be either a variable that takes on only a few discrete values or a continuous variable that is divided into a limited number of subsets.
One limitation of the scatter plot matrix is that it cannot show interaction effects with another variable. This is the strength of the conditioning plot. It is also useful for displaying scatter plots for groups in the data. Although these groups can also be plotted on a single plot with different plot symbols, it can often be visually easier to distinguish the groups using the conditional plot.”
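A conditioning plot of calories against protein given fiber could be sketched with base R's coplot() as follows. The number of bins, the overlap, and the use of panel.smooth to add a smoothed trend line are assumptions for illustration, not necessarily the settings behind the figure discussed here.

```r
library(MASS)  # provides the UScereal data frame

# Intervals of the conditioning variable (fibre) that coplot() uses by default:
# up to 6 bins with 50% overlap. Ties in fibre can reduce the number of bins.
bins <- co.intervals(UScereal$fibre, number = 6, overlap = 0.5)
print(bins)

# Calories versus protein, one panel per fibre bin, with a smoothed trend line.
coplot(calories ~ protein | fibre, data = UScereal,
       number = 6, overlap = 0.5, panel = panel.smooth)
```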
There is a nonlinear association between the calories per portion and the amount of protein in almost all the fiber bins. In the first bin, panel (1,1), where the amount of fiber is less than 2, calories increase as protein increases, then decrease, and then increase again.
As the amount of fiber increases, the association between calories and protein becomes more positive and perhaps moves toward linearity.
We see some outliers in the last panel: among cereals with a high amount of fiber, a few have very high calories as protein increases. Which cereals are those outliers?
I believe the ones that don’t follow the general pattern are Grape-Nuts and Great Grains Pecan, because they have much higher calorie and protein counts than the rest of the cereals.
The following code was helpful in finding the outlying cereals.
UScereal[which(UScereal$calories == max(UScereal$calories)), ]
UScereal[which(UScereal$calories == max(UScereal$calories[which(row.names(UScereal) != "Grape-Nuts")])), ]
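An alternative sketch that reaches the same two cereals in one step is to sort by calories and keep the top rows. The name top2 and the choice to show only three columns are mine.

```r
library(MASS)  # provides the UScereal data frame

# Sort cereals from highest to lowest calories and keep the first two rows.
by_calories <- order(UScereal$calories, decreasing = TRUE)
top2 <- head(UScereal[by_calories, c("calories", "protein", "fibre")], 2)
print(top2)
```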
Next, 3-dimensional scatterplot:
Based on this three-dimensional graph, we again see that at high levels of fiber some outlying cereals have higher protein and calorie counts than the others. We can also see the positive associations among the three variables.
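A spinning 3-dimensional scatterplot like the one described could be sketched with the rgl package. This is an assumption on my part: the post does not name the package it used, and rgl must be installed separately. The plot is only drawn in an interactive session, and the spin speed and duration are arbitrary.

```r
library(MASS)  # provides the UScereal data frame

cereal3 <- UScereal[, c("calories", "protein", "fibre")]

# Assumption: rgl is installed; skip the plot in non-interactive sessions.
if (interactive() && requireNamespace("rgl", quietly = TRUE)) {
  # Protein and fibre on the horizontal axes, calories on the vertical axis.
  rgl::plot3d(cereal3$protein, cereal3$fibre, cereal3$calories,
              xlab = "protein", ylab = "fibre", zlab = "calories",
              type = "s", size = 1, col = "steelblue")
  # Rotate the scene about the vertical (z) axis for 10 seconds.
  rgl::play3d(rgl::spin3d(axis = c(0, 0, 1), rpm = 6), duration = 10)
}
```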