
Reading Assignment

For the reading assignment I chose to read “Scaling the Heights (and Widths)*” and “Graphs in the Presidential Campaign: Why weren’t they used by more than one candidate?”  Both articles are by Howard Wainer and were published in Chance magazine.

My presentation on the two articles is available through the link below:

https://docs.google.com/presentation/d/1ZakD1SgvzkqnIzcgMFILb45hp0dJ_JqyI1dBdQ8hLqg/edit?usp=sharing

 

Pop Charts

This week we are examining better ways to present the data in what Cleveland calls “pop charts” (pie charts, divided bar charts, and area charts) that will facilitate better pattern perception and table lookup.

Pie Chart

The first pop chart we will look at is a pie chart.  The chart below is from BGSU’s College of Business information page about student demographics and breakdown.

The pie chart is trying to communicate the breakdown of graduate students in the College of Business by program of study.  When viewing this chart, the reader has to engage in inefficient table look-up to match the color of each pie chart section to the correct program.  In addition, while it is simple to see that the Masters of Applied Statistics program accounts for the greatest share of graduate students in the College of Business, it is more difficult to discriminate between the smaller programs.  To enhance table look-up I graphed the same data with a dot plot below.

This dotplot greatly facilitates table lookup.  First, the programs are presented in descending order by percent of total graduate students in each program.  Second, it is much easier to identify which program has which share of students.  In the original chart the reader has to move their eyes down to the color key and then back to the pie chart.  Finally, this dot plot allows the reader to read off the percentage of total graduate students in each program.  The original plot didn’t allow for this.

Divided Bar Chart

The next chart we will look at is a divided bar chart from the website “awfulannouncing.com”.  This chart displays the weekly viewership of the NFL in six geographical regions of the country through week 8 of the 2013-2016 seasons.

While this chart does allow the reader to perceive that NFL viewership generally declined from 2013-2016, table look-up is not very efficient.  The viewership numbers are awkwardly perched above or inside each of the bars, and some of them are almost impossible to read.  The reader’s eyes have to travel to the bottom of the chart to identify which region each of the bars represents, and then to the right of the chart to identify which season the bar represents.  Finally, it is difficult to compare the viewership in each region by season.  Below is the same data displayed with a multiway dot plot.

 

(Note: It appears that the makers of the original chart used the same viewership numbers for the Southeast and Pacific regions, which I didn’t realize until I examined the multiway dotplot.)

The multiway dotplot facilitates better pattern perception and better table lookup than the original chart.  It is easy to see that the declining viewership trend for 2013-2016 is present in all regions.  Furthermore, this chart makes it much easier to compare viewership for a certain year between regions.  The biggest improvements are found in table lookup.  It is much easier to find the viewership for a given region and season than in the original chart.  It is also quicker to identify the regions and their corresponding viewership levels.  Finally, at an aesthetic level the original chart seemed a bit too “busy” with its use of solid color and bars.  This chart allows the data to stand out better, which aids the communication of the data to the reader.

This week’s blog assignment showed how relatively easy it is to make an alternative to the “pop charts” that are all over the internet.  These alternative charts do a better job of displaying the data, are easier to use, and provide more insight than the original charts.

Multivariate Data

This week we are looking at different methods of graphing multivariate data.  So that all of the variables can be graphed at the same time, we are limiting the number of variables to three.  We will be looking at pairs plots, coplots, and a 3D scatterplot.  We are working with the dataset “UScereal” in the R MASS package.  The variables I selected are Calories, Carbohydrates, and Sugars.

Pairs Plots

The first graph we will be examining is the pairs plot.

From the pairs plots we can see there is a positive and linear relationship between the carbohydrates in a cereal and the calories in the cereal.  The relationships between sugars and carbohydrates and between sugars and calories are a little less clear.  There seems to be a moderate positive linear relationship between sugars and calories, but almost none at all between sugars and carbohydrates.

We can also see several outliers in the pairs plots.  In the pairs plot for calories and carbohydrates, the two outliers are the cereals “Grape Nuts” and “Great Grains Pecan”.

Coplot

Next we will be looking at a coplot.  I experimented with different variables as the “given”, and decided that using the amount of sugar as the “given” provided the most insightful coplots.

From the top of the coplot we can see that the data is divided into 6 ranges of sugar content, the lowest being from 0 to 6 grams and the highest being from 13 to 22 grams.  Interestingly enough, the “given” condition of sugar doesn’t really change the scatterplot of carbohydrates and calories.  Each scatterplot has almost the same shape, which is a positive linear relationship with roughly the same slope.  We can see the two outliers, “Grape Nuts” and “Great Grains Pecan”, in the 4th and 5th coplots, but besides those two outliers the scatterplots are almost the same.  Just out of curiosity, I plotted the same data set in a coplot without the two outliers.

Removing those two outliers, we can see the scatterplots have very similar ranges for the data, as well as correlations.  The conclusion is that the amount of sugar in a cereal isn’t very strongly related to the amount of carbs and calories.  This is probably because most sugar in cereal is in the form of added sugar.

3D Plot

The third plot we will be looking at is a 3D plot of the data, which unfortunately cannot be displayed on the blog.  The 3D scatterplot reveals the trends discussed above: the positive linear relationship between carbohydrates and calories, and the fact that sugar content isn’t strongly related to either variable.

These two screen grabs show the outliers “Grape Nuts” and “Great Grains Pecan”, circled in red.

 

These next two screen grabs show that sugar content isn’t strongly related to calories or carbohydrates.  The first graph is looking at the data in a roughly parallel perspective to the sugar axis.  We can see the relationship between calories and carbohydrates previously described.  The second view is from the graph rotated roughly 90 degrees counterclockwise, around the “north south” axis.  We can see the values of sugar content are fairly evenly distributed along the Calorie-Carbohydrate plane.

Overall, all three graphs were effective in displaying this data.  The 3D plot was by far the most fun to use, but in higher dimensions the pairs plot would probably be the most useful for quickly assessing relationships between variables.

Blog 10: Color

This week we are examining how to effectively use color to enhance the communication of the data.  First, we will examine some time series data from the Broadway attendance data set.  I chose four of the most popular shows, “Wicked”, “Lion King”, “Mamma Mia!”, and “The Phantom Of The Opera”, and plotted their weekly attendance as a percentage of the total capacity for the years 2012-2015.

 

We can see that in each of the years “Wicked” had busier periods in the summer and over the Christmas holidays.  The color palette used is the default one in ggplot2.

For “Phantom of the Opera” we see a similar trend, with the summer drawing the most consistent crowds, and an even more precipitous spike in attendance over the Christmas holidays.  The color palette used is “YlOrRd”.  It is somewhat more difficult to differentiate between years, so I would not recommend using this one.

“Mamma Mia!” shows the same Christmas attendance jump, but less of the summer rise, with the exception of the last year it ran, 2015.  The color palette used is “RdYlBu”, which does a pretty good job of setting the years apart.

Finally, “The Lion King” uses the “Dark2” color palette, which is not as effective as the default palette or the “RdYlBu” palette, as it is difficult to tell the colors representing 2013 and 2015 apart.  In addition, I realized when making this graph that attendance in the spring and fall appears to have very dramatic drop-offs, but in reality the graph simply uses a different y-axis scale.

When a similar scale is used, we can see that the “Lion King” actually had by far the highest average capacity throughout the year, with the majority of shows sold out or very close to selling out.

Overall, I appreciate the default color palette in ggplot2 because it makes it easy to tell the different categories apart by their color.  In general, it seems that sequential color spectrums such as “YlOrRd” aren’t as effective in allowing the reader to differentiate categories.

Now we are going to examine color in the use of a contour plot.

Below is similar data in a contour plot, graphed with different color palettes.

The palettes I tried that did not have a gradient to them I found to be ineffective.  In this data the color is representing the density of the distribution, so it should enhance the reader’s ability to interpret the graph.  Trying palettes such as “Dark2” or “Spectral” did not yield effective graphs, as the different colors were not assigned to densities in any logical manner.

I found the “Grey” option to do a good job representing the levels of density.  I tried inverting the gradient, but found the edges to be hard to distinguish from the grid.  The benefit of this palette is that it is easy to reproduce on black and white printing.

My favorite palette I found was the “RdBu” palette.  There is something natural about having red represent the “hottest”, or in this case, densest, areas of the distribution.  In some ways, this choice of color functions like a heat map.  The colors help communicate the rising levels of density to the reader of the graph.

Finally, I tried the “RdYlGn” palette.  I visually find this choice effective in highlighting the different and rising levels of density as we approach the center of the graph, but aesthetically I don’t appreciate this palette as much.

If I were using a palette in a paper or assignment where I was graphing similar contour plots I would choose either the “Grey” palette or the “RdBu” palette, as these have both aesthetically pleasing qualities and helpful visual logic.

Loess

This week we are looking at different Loess smoothing curves, and examining the effect that changing the value of alpha (the “span” argument in the ggplot smoothing function) has on the curve and on the residuals of the Loess.  When we use a Loess we want the residuals between the original values and the fitted Loess values to be independent of the values of x.  First, we will examine the residuals from the Loess at the default setting of alpha.

We can see there is some periodicity in the residuals using the default value of span in the ggplot smoothing function.  We can also examine the residuals versus fits.

We should lower the value for span and see if that makes the residuals of the Loess more independent of x.  We will try alpha = 0.6.

By changing the value of span to 0.6 we can see from the residuals plot that the residuals depend less on x, but some dependence remains.  Below are the residuals and fits plots.

From the residuals plot it is clear that lowering the value for “span” has improved the situation.

Now we shall try an Alpha = 0.5.

The Loess through the residuals is almost completely horizontal, which is what we desire.

From the residuals plot there is no discernible dependence on the x values for the residuals.  We would get even better results if we lowered Alpha further, but at the cost of the smoothness of our fitted values curve on the original data.  It seems that an Alpha = 0.5 is a good choice for this data.
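To make the role of the span more concrete, here is a minimal pure-Python sketch of a local-linear smoother with tricube weights.  It is not the R or ggplot2 Loess that produced the graphs above (R’s default fit is locally quadratic, and the function name `loess_fit` and the toy data are assumptions made for illustration).  It only shows that a smaller alpha uses fewer neighbouring points for each local fit, which leaves less structure behind in the residuals.

```python
import math

def loess_fit(xs, ys, alpha):
    """Local-linear fit at each x using the nearest alpha * n points (tricube weights)."""
    n = len(xs)
    q = max(2, math.ceil(alpha * n))            # number of neighbours per local fit
    fitted = []
    for x0 in xs:
        # The distance to the q-th nearest point sets the width of the local window.
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[q - 1] or 1e-12
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        # Weighted least squares for a straight line through the window.
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Toy periodic series (an assumption for illustration, not the post's data).
xs = [float(i) for i in range(60)]
ys = [math.sin(x / 5) + 0.02 * x for x in xs]

def rss(alpha):
    """Residual sum of squares of the smoother for a given alpha."""
    fit = loess_fit(xs, ys, alpha)
    return sum((y - f) ** 2 for y, f in zip(ys, fit))

for alpha in (0.75, 0.6, 0.5):
    print(alpha, round(rss(alpha), 3))
```

The printed residual sums of squares give a numeric counterpart to the residual plots: on a periodic series, a wide window averages across the cycles and leaves the periodicity in the residuals.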

 

Dot Plots

This week we are looking at different ways we can use dot plots to display numerical data with multiple categorical classifications.  The data set I am working with is the number of hurricanes that formed from the years 2000 – 2015 in four ocean regions of the world.  The data is in the form of a 4 by 16 matrix, where each row is an ocean region and each column is a year.

Means Dotplot

This first dotplot will display the mean number of hurricanes in each of the four ocean regions and order them in descending order.

This display clearly shows that the West Pacific ocean region has the highest average number, almost double that of the next highest region.  It also shows that the North Atlantic and Indian Ocean have similar 16-year averages of around 7, and the East Pacific region has the fewest with around 4.  This means dotplot doesn’t show any patterns within each region, and only serves to summarize the 16-year period.
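The computation behind a means dotplot is small enough to sketch.  The following Python snippet is only an illustration: the region labels and counts are made up (they are not the hurricane data used in this post), and `ordered_means` and `text_dotplot` are hypothetical helper names.  It averages each row of a region-by-year table, sorts the regions in descending order, and prints a rough text version of the dotplot.

```python
from statistics import mean

# Hypothetical counts (rows = regions, columns = years); not the post's hurricane data.
counts = {
    "Region A": [7, 9, 4, 8],
    "Region B": [15, 14, 16, 13],
    "Region C": [3, 5, 4, 4],
    "Region D": [8, 6, 0, 7],
}

def ordered_means(table):
    """Return (label, mean) pairs sorted from the largest mean to the smallest."""
    return sorted(((label, mean(values)) for label, values in table.items()),
                  key=lambda pair: pair[1], reverse=True)

def text_dotplot(pairs, width=40):
    """Place one 'o' per row along a common horizontal scale."""
    top = max(value for _, value in pairs)
    lines = []
    for label, value in pairs:
        pos = round(value / top * (width - 1))
        lines.append(f"{label:<10}|" + "." * pos + "o" + f"  {value:.1f}")
    return lines

for line in text_dotplot(ordered_means(counts)):
    print(line)
```

Sorting before plotting is what makes the ranking readable at a glance; the same ordering step applies whichever plotting tool draws the final graph.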

Dotplot, Grouped by Region

The next dotplot will display the number of hurricanes over this time period, grouped by the ocean region (rows).

By grouping by ocean region we can get a better idea of the trends within each region over this 16-year period.  For example, the Indian Ocean and North Atlantic had similar 16-year averages, but the Indian Ocean had zero hurricanes from 2009 – 2015, and relatively more from 2000-2008 than the North Atlantic.  This display also shows that the number of hurricanes in the North Atlantic per year has quite a bit of variability compared to the number in the West Pacific.

Dotplot, Grouped by Year

Now we will examine the same data, this time grouped by year.

This dotplot allows us to examine global trends by year for the number of hurricanes.  For example, in 2009 there were very few hurricanes across the globe.  It is difficult to say more than that, as comparing 16 plots is somewhat difficult.

Conclusion

Overall, grouping by region seems to be the most effective way of presenting the data.  The means dotplot is very concise, but the information it communicates is extremely limited.  Examining by year over this 16-year time frame is too much information at once.  Perhaps it would have been better to plot a smaller number of 5-year averages instead.  Grouping by region allows for straightforward comparison between regions, and gives a clear idea of the distribution for each region.  In addition, if a significant time effect were influencing the number of hurricanes per year, it would stand out the most in this display.

Distributions

This week we are looking at ways to compare two distributions graphically.  The data we will be working with is from the student data set in the “LearnBayes” package in R.  First, 100 students were randomly selected from the data set and divided by gender.  24 of these students were male, and 76 were female.  The variable of interest is the price they pay for haircuts.

Parallel Dot Plot

The first graph we will look at is a parallel one-dimensional dot plot comparing haircut cost by gender.

This parallel dot plot shows that the range of prices that females from this sample paid is much wider than the range of prices that males paid.  The black dot represents the median price for each gender.

Parallel Quantile Plot

The second graph we shall use to examine the data is a parallel quantile plot.

The parallel quantile plot shows that the first quartiles for haircut price are very similar; many students paid nothing for their haircuts.  The second quartiles are also similar, but females in the third quartile paid close to double what males paid and females in the fourth quartile paid significantly more than double.  This parallel quantile plot shows that the difference between what males and females paid is not consistent throughout the quantiles and increases as the quantiles increase.

Quantile-Quantile Plot

The third graph we shall look at is a Q-Q plot for male and female haircut prices, where the quantile values for males are plotted against the quantile values for females.  Overlaid on the Q-Q plot is a line with a slope of 1 and an intercept of y = 0.  If the distributions of haircut prices were similar we would expect most of the pairs of quantiles to fall on this line.

From the Q-Q plot we can see that the majority of the female quantile values are larger than the male quantile values, and the difference between them increases as the quantiles increase.  This would indicate that these haircut prices are coming from different distributions.

Tukey Mean-Difference Plot

The final graph we will look at is the Tukey Mean-Difference plot, where we compare the means of the paired quantile values to the differences between them.  Since the Q-Q plot suggests that the difference between the quantiles increases, we would expect this to show up in the Mean-Difference plot.

Overlaid on the Mean-Difference plot is a line with a slope of 1 and y intercept y=0.  This line is to help the reader generalize the relationship between male and female values.  In general, when the mean increases by 1 the difference between female and male haircut price quantile values increases by 1 as well.  In other words, the prices of female haircuts are higher than the prices of male haircuts, and this difference increases as the haircuts get more expensive.
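For readers who want to see how the points on a Q-Q plot and a Tukey mean-difference plot are obtained when the two groups have different sizes, here is a small Python sketch.  The prices are hypothetical (they are not the LearnBayes sample), and `qq_pairs` and `mean_difference` are illustrative names.  The sketch assumes that matching quantiles are computed at the same probabilities in both groups, which is one common convention.

```python
from statistics import quantiles

# Hypothetical haircut prices in dollars; not the LearnBayes sample used in the post.
male = [0, 0, 10, 12, 15, 15, 18, 20]
female = [0, 0, 15, 20, 25, 30, 35, 40, 50, 60, 75, 100]

def qq_pairs(x, y, n=4):
    """Matching quantiles of two samples (the n - 1 cut points of each sample)."""
    qx = quantiles(x, n=n, method="inclusive")
    qy = quantiles(y, n=n, method="inclusive")
    return list(zip(qx, qy))

def mean_difference(pairs):
    """Tukey mean-difference coordinates: ((x + y) / 2, y - x) for each pair."""
    return [((x + y) / 2, y - x) for x, y in pairs]

pairs = qq_pairs(male, female, n=10)            # deciles of each group
for (m, f), (avg, diff) in zip(pairs, mean_difference(pairs)):
    print(f"male={m:6.2f}  female={f:6.2f}  mean={avg:6.2f}  diff={diff:6.2f}")
```

Plotting the first two columns against each other gives the Q-Q plot, and plotting the last two gives the mean-difference plot, so the two graphs are different views of the same quantile pairs.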

Comparison

These four graphs are all effective at comparing the distribution of female haircut prices to the distribution of male haircut prices.  For an at-a-glance comparison the parallel dot plot and the parallel quantile plot provide the simplest comparison of the distributions.  From both, it is easy to see that females paid more for haircuts than males across the board.  The Q-Q plot seems to be the least useful for a measurable graphical comparison.  From it we can tell that the distributions are different, but it is more difficult to quantify this difference.  The Tukey Mean-Difference plot gives the most detailed information about the differences in the distributions, but for readers less familiar with mean-difference plots it takes longer to absorb the message the graph is communicating.  If I were to choose only one graph to display this difference I would choose the parallel dot plot, as it is very accessible and summarizes the differences in distribution efficiently.

Pythagorean Relationship

The 1995-96 NBA season saw the Chicago Bulls set a then-record of 72 wins, finishing with a final record of 72-10.  The league’s 29 franchises finished with win totals ranging from the Vancouver Grizzlies’ 15 wins to the Bulls’ 72 wins.  It would be interesting to see how well the Pythagorean relationship, which uses the ratio between points scored and points against, predicts these win totals.  The relationship is:

$$\frac{W}{L} = \left(\frac{P}{\mathit{PA}}\right)^{k}$$

where W = total wins in the season, L = total losses in the season, P = points for, PA = Points against, and k is some constant.  Below is the data for each team.

Below are the plots of log(P/PA) against log(W/L), along with the residuals plot for the fitted regression equation.

From the original scatter plot we can see that the regression line does a good job fitting the data.  The residuals plot confirms this; there appears to be no pattern and no significant outliers.  The fact that there are no significant outliers speaks to the predictive accuracy of this relationship.  The Chicago Bulls, while having 72 wins, also had an average point differential of +12.3.  On the other hand, the league-worst Vancouver Grizzlies had an average point differential of -10.0.  Teams with a log(W/L) of 0, in this season the Phoenix Suns and Charlotte Hornets, had point differentials of 0.3 and 0.6 respectively.  These two plots demonstrate that no team in the NBA this season was very “unlucky” or “lucky” in terms of the number of wins they got compared with their ratio of P/PA.

The regression estimate for the constant k is 14.286, giving us a final model of

log(W/L) = 0.0062 + 14.286*log(P/PA) + error.

This estimate of k is very close to the ones Daryl Morey and Dean Oliver came up with in their initial application of the Pythagorean Expectation to basketball.
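As a rough sketch of how the fitted model turns a points ratio into an expected win total, the Python snippet below inverts the regression equation.  It assumes the logs in the model are natural logs (the base changes only how the small intercept is interpreted, not k), and the per-game scoring figures are illustrative values, not numbers read from the data table above.  The function name `expected_win_fraction` is hypothetical.

```python
import math

INTERCEPT = 0.0062   # fitted intercept quoted in the post
K = 14.286           # fitted exponent k quoted in the post

def expected_win_fraction(points_for, points_against):
    """Convert a points ratio into an expected winning fraction W / (W + L)."""
    log_ratio = INTERCEPT + K * math.log(points_for / points_against)
    win_loss = math.exp(log_ratio)              # this is W / L
    return win_loss / (1 + win_loss)

# Illustrative per-game scoring figures (assumed, not taken from the post's table).
frac = expected_win_fraction(105.0, 93.0)
print(round(82 * frac, 1), "expected wins over an 82-game season")
```

Because the exponent is large, even a modest edge in the points ratio translates into a winning fraction far from one half, which is why the win totals spread out so widely across the league.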

 

Visualizing Amounts

This week I am examining different ways to compare amounts between different classifications.  The data set I am working with consists of data about Broadway shows that have run over the past several decades.  I am interested in comparing Broadway shows from two time periods: 2000-2008 and 2009-2016.  Of interest is which shows were the most popular over these time periods.  I am defining “most popular” by the total number of people who saw the show over each period.  In addition, I am only interested in shows that ran during both time periods, so that we can examine the trends of popularity among shows.  I selected the top seven shows in attendance that ran in both periods.  These shows are The Lion King, Wicked, Mamma Mia!, The Phantom of the Opera, Chicago, Jersey Boys, and Mary Poppins.  I will be using two types of graphs to compare attendance: grouped bar plots and stacked bar plots.

First, we will look at a grouped bar graph that could use some improvement.

This grouped bar graph shows each production, and plots the total attendance for that production on the y-axis.  Some of the show titles are difficult to read because the names are long.  This causes the names to run into each other, creating a mess.  In addition, the order of shows on the x-axis is arbitrary.  In this case, ggplot defaults to alphabetical order.  It would be more informative if the shows were in ascending or descending order based on total attendance.

Below is the same data graphed in a stacked bar graph.

Once again, the x-axis label issues and x-axis order issues are present.  In addition, the total attendance is displayed as a power of 10, making reading the y-axis somewhat tricky.  We can improve upon these three issues in the next two graphs.

This grouped bar chart and stacked bar chart solve the x-axis label problem by rotating the graphs 90 degrees and displaying the y values horizontally.  In addition, the shows are now ordered from greatest total attendance to least total attendance.  This allows the reader to see how the attendance changed between time periods, and also to see which shows have remained strong draws over time.  For example, it is easy to see that longer running shows like The Lion King, The Phantom of the Opera, and Mamma Mia! declined in popularity between the two time periods, but newer shows like Wicked (2003) and Jersey Boys (2005) have increased in popularity.  Finally, the attendance values are displayed in millions, which makes interpreting the raw numbers from the graphs much more straightforward.

Two Scales and Comparison

It can sometimes be useful to graph data with two different axis scales.  In this blog post I will explore how this can be used to effectively communicate a feature of the data.  Below is a graph of the population growth in India between the years 1960-1969.

The left y-axis shows the log2 of the population, in millions, of India.  The line that would connect the 10 data points is almost perfectly straight.  This shows the reader that India’s population grew exponentially, at a consistent rate, during this time period.  We can see that the log2 population rose by about 0.3 during this time period, which corresponds to a factor of 2^0.3, or an increase of about 23%.  The raw population figures on the right y-axis allow the reader to get a sense of how large an increase this was; the population of India increased by roughly 100 million people over these ten years.
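The arithmetic linking the two axes can be checked with a short Python sketch.  A rise of 0.3 units on a log2 axis multiplies the raw value by 2 raised to the power 0.3.  The start and end populations in the second half are assumed round numbers in millions, used only to show the reverse calculation, and `growth_from_log2_change` is a hypothetical helper name.

```python
import math

def growth_from_log2_change(delta):
    """A rise of `delta` units on a log2 axis multiplies the raw value by 2 ** delta."""
    return 2 ** delta

factor = growth_from_log2_change(0.3)
print(f"factor = {factor:.3f}, increase = {100 * (factor - 1):.1f}%")

# Going the other way: the log2 change implied by a raw increase.
# The start and end values are assumed round numbers in millions, for illustration only.
start, end = 450.0, 550.0
print(f"log2 change = {math.log2(end / start):.3f}")
```

This is the reason a log2 scale is convenient for growth: equal vertical steps on the left axis mean equal percentage changes, while the right axis shows what those changes amount to in people.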

Using two axes can also be helpful in comparing the growth rates of two separate countries, like in the graph below.

This graph shows that India and China had somewhat similar and exponential growth rates from 1960-1969.  India had a more consistent growth rate throughout the decade.  China experienced a negative growth rate during 1960 and 1961, but then had a consistent growth rate slightly higher than India’s from 1962 to 1969.  The “Great Leap Forward” (1958-1961), an attempt by the Chinese Communist government to implement rapid industrialization and centralization, led to about 20 million deaths from starvation.  This effect, and the subsequent growth rate increase once the policy was repealed, can be clearly seen on the graph.  The second y-axis of population on this graph is also helpful to capture the population differences between India and China during this time period.

Unclear Vision

The data set titled “UScereal” in the R MASS library has nutritional information on a variety of US cereals, and several of the nutritional variables are correlated to a certain degree.  The purpose of this blog post is to examine some of these correlations, as well as good and bad ways to display them.  First, consider a scatter plot of two variables, “Protein” and “Fiber”.

This graph shows that there is fairly strong correlation between the amount of protein and the amount of fiber a cereal contains.  Cereals that have higher amounts of protein tend to have higher amounts of fiber as well.  This is probably due to a main source of both protein and fiber being whole grains.

The following graph is of the same data, but this time violating attributes of Clear Vision as laid out by Cleveland.

This graph is a modified version of the original graph, with two deleterious changes.  First, the inclusion of labels on the data points is an interesting idea, but due to the density of cereals at the lower end of both axes the labels become impossible to read and distract the reader.  If labels were required in the graph, a smaller data set or a different way of labeling the data should be used.  Second, the use of a square plotting shape makes it hard to differentiate between data points in the same crowded area of the scatter plot.  In some areas where data points almost overlap it is difficult to see how many cereals have that combination of protein and fiber.  In the original graph the use of smaller circles makes it easier to differentiate data points, even when they are close together.

Finally, the following graph displays the same data, but this time uses a color gradient to show the caloric content of each of the cereals.

The correlation between calories and fiber, and calories and protein is moderate, but it is still visible in the graph due to the color gradient.  We see that as the grams of protein and grams of fiber increase, the caloric content of the cereal tends to increase as well.

I experimented with dividing the caloric data into 3 or 4 calorie ranges and displaying the data with different colors that way, but the correlation between the caloric content and the other two variables wasn’t strong enough to create a meaningful pattern on the graph.  It is also interesting to experiment with different color gradients.  Choices like “blue” to “red” didn’t do a very good job of highlighting differences in caloric content, since it is hard to distinguish the transitional purples from the blues and reds.  Using cyan to red makes it visually much easier, but aesthetically I am not thrilled with this choice.  If I were doing this graph over again it would be interesting to see if the size of each observation could be used to represent the caloric count.

Tuition Growth at BGSU

Below is a scatterplot that shows the cost of instructional fees per semester at BGSU between 1960-2018 for seven individual terms.  The y-axis displays the instructional fees in base 10 log.  Between 1960 and 2010 the cost of instructional fees roughly doubled each decade.  In recent years the rate of growth has slowed, possibly due to Governor Kasich’s tuition freeze.

The basic message this graph is intending to convey is that instructional fees at BGSU have steadily risen since the 1960s.  The data points that are displayed in this scatterplot are spaced roughly every ten years, giving the viewer of the graph enough of a sense of this trend without crowding the graph.  The lines connecting the data points further emphasize this consistent increase in instructional fees.  Displaying the instructional fees as a log10 allows the reader to get a clearer idea of the increase in each decade relative to the previous decade.  If the instructional fees were plotted as a dollar amount it would appear that the fees were barely increasing between 1960 and 1970, and then rapidly rising after that decade.  However, displaying the data in this manner shows that the percent increase in instructional fees has been fairly steady over the past 58 years.
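To put a number on “doubling each decade”, the Python sketch below converts a doubling time into the equivalent constant yearly growth rate and into the slope such growth traces on a log10 axis.  It assumes perfectly steady growth, which the BGSU data only approximates, and the function names are hypothetical.

```python
import math

def annual_rate_from_doubling(years_to_double):
    """Constant yearly growth rate that doubles a quantity in the given number of years."""
    return 2 ** (1 / years_to_double) - 1

def log10_slope_per_year(years_to_double):
    """Slope of the straight line that such growth traces on a log10 axis."""
    return math.log10(2) / years_to_double

rate = annual_rate_from_doubling(10)
print(f"annual growth is about {100 * rate:.2f}%")
print(f"log10 slope is about {log10_slope_per_year(10):.4f} per year")
```

A constant slope on the log10 axis is exactly what a steady percentage increase looks like, which is why the connected points on the graph fall close to a straight line.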

I have not previously had much experience using ggplot in R, and so making this graph was quite the learning experience.  I found out how to adjust many of the aesthetic and design features of this scatterplot such as point size and color, as well as title and axis color and font.  I would have liked to make the numbers on the x and y axes larger, and was also unsure of where to place the annotation “Year I started college” on the vertical line at x=2008.

Is Horsepower of a Car Related to Its Mileage?

Motor Trend magazine collected the horsepower and mileage for 32 cars in the 1973-74 model year. To see if there is any relationship between horsepower and mileage, I constructed a scatterplot of the two variables.

There appears to be a clear relationship between the horsepower and fuel efficiency for cars in the 1973-74 model year.  As the horsepower of a car increases, the fuel efficiency tends to decrease.