In this blog post we will use two different methods of visualizing multivariate data. The first method is a scatterplot matrix. The data set we will be working with is UScereal, which ships with the MASS package in R. Of the variables it contains, we will focus on only three: calories, fat, and sugars. A scatterplot matrix gives us an idea of the pairwise relationships among these three variables.
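A scatterplot matrix like the one described above can be drawn with base R's `pairs()`. This is a minimal sketch rather than the exact call behind the figure; the object name `cereal` and the plot title are assumptions introduced for illustration.

```r
library(MASS)  # provides the UScereal data set

# Keep only the three variables discussed in this post
cereal <- UScereal[, c("calories", "fat", "sugars")]

# Scatterplot matrix: one panel for every pair of variables
pairs(cereal, main = "UScereal: calories, fat, and sugars")
```

Each off-diagonal panel plots one variable against another, so the calories vs. fat and calories vs. sugars panels are the ones to look at first.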
Comparing caloric content with fat content, we see a roughly linear, positive relationship: the more fat in the cereal, the more calories it contains. We see the same kind of relationship when we compare caloric content with the amount of sugars. However, two observations deviate fairly severely from that pattern in the calories vs. sugars scatterplot. They belong to Grape Nuts and Great Grains Pecan. Both cereals have roughly average amounts of sugar compared with the other cereals in our data set, but they have the two largest calorie values.
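The visual impressions above can be checked numerically. The sketch below computes the pairwise correlations and pulls out the two rows with the highest calorie values; the names `cereal`, `cor_mat`, and `top_cal` are assumptions introduced here, not part of the original analysis.

```r
library(MASS)

cereal <- UScereal[, c("calories", "fat", "sugars")]

# Pairwise Pearson correlations among the three variables
cor_mat <- cor(cereal)
print(round(cor_mat, 2))

# The two cereals with the highest calorie values, with their fat and sugars
top_cal <- cereal[order(cereal$calories, decreasing = TRUE)[1:2], ]
print(top_cal)
```

The row names of `top_cal` identify the cereals, and comparing their `sugars` values with `summary(cereal$sugars)` shows where they sit relative to the rest of the data.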
Another plot we can use to visualize the relationships between these three variables is a coplot.
In this coplot we consider six intervals of fat content, and for each interval we draw a scatterplot with calories on the vertical axis and sugars on the horizontal axis, to see how the relationship between calories and sugars changes with fat content. Reading the panels from lower left to upper right moves us through the intervals in order of increasing fat content, and the positive pattern between calories and sugars described earlier does not change much. This suggests that fat content has little impact on the relationship between the other two variables. However, we do notice that the two observations flagged as outliers from the calories and sugars pattern no longer appear together in the same panel, so they must fall in different fat content intervals.
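A conditioning plot of this kind can be produced with base R's `coplot()`. This is a sketch under the assumption that the default interval settings were used; `fat_intervals` is a name introduced here for illustration.

```r
library(MASS)

# Conditioning plot: calories vs. sugars, given fat, using six fat intervals
coplot(calories ~ sugars | fat, data = UScereal, number = 6)

# The fat intervals that number = 6 produces with the default overlap of 0.5
fat_intervals <- co.intervals(UScereal$fat, number = 6)
print(fat_intervals)
```

With the default `overlap = 0.5`, neighboring intervals share observations, so a cereal can appear in more than one panel. Passing `overlap = 0` gives intervals that do not share observations, which can make it easier to see which panel a particular cereal belongs to.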