Reading Assignment

Howard Wainer has written a number of interesting “statistical stories” that make extensive use of statistical graphics to learn about patterns in data. Many of his articles appear in the “Visual Revelations” column in the magazine Chance. For this assignment, I read three interesting articles:

  1. Looking at Blood Sugar
  2. Giving the Finger to Dating Services
  3. Trilinear Plots

I summarized “Looking at Blood Sugar” and “Giving the Finger to Dating Services”; the presentation about the two articles can be found here.

Pop Charts

In this blog, I looked at pop charts in the media and ways to redraw these charts using a dot plot or a multiway dot plot. Examples of pop charts include pie charts, divided bar charts and area charts. These charts are commonly used in the media because they are easy to construct and interpret. But sometimes they do not fulfil their purpose: we are unable to see patterns or draw conclusions from them. Instead of pop charts, I’ll redraw the charts to give a more efficient look at the trends and patterns in the data.

I looked at some graphs from YouGov.com. The post I considered was about distraction, whether from texting while walking or texting while driving. A poll asked residents of various cities across the US whether texting while driving was a problem in their city. Another poll asked residents whether texting while walking was a problem in their city. In the post, divided bar charts were used to display the results of each poll separately. The graphs can be found here:

https://today.yougov.com/topics/automotive/articles-reports/2018/11/19/texting-while-driving-viewed-problem-cities-across

Although the responses for those who think it’s a problem are sorted in descending order in both cases, it is difficult to compare the responses of the two polls across the cities because each poll has its own graph. I constructed a multiway dot plot to help compare the results of the two polls. The plots are arranged in different ways to bring out the patterns in the results. The graph below shows a multiway dot plot of the percentage of residents in each city who felt that texting while driving or texting while walking was a problem.
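
A minimal sketch of how a multiway dot plot can be drawn in base R. The poll percentages are not reproduced here, so the example uses R’s built-in VADeaths matrix as a stand-in; this is an assumption for illustration, not the code behind the graphs in this post.

```r
# Avoid writing Rplots.pdf when this is run as a script
if (!interactive()) pdf(NULL)

# Stand-in data: death rates per 1000 in Virginia, 1940
# (rows = age groups, columns = population groups)
rates <- VADeaths

# For a matrix, dotchart() draws one group per column and one dot per row
dotchart(rates, xlab = "Deaths per 1000",
         main = "Multiway dot plot, grouped by population")

# Transposing swaps the roles: one group per age band
dotchart(t(rates), xlab = "Deaths per 1000",
         main = "Multiway dot plot, grouped by age")
```

Drawing the same matrix both ways mirrors the two arrangements used in this post: one makes comparisons within a group easy, the other makes comparisons across groups easy.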

While more than 60% of residents believe texting while driving is a problem, less than 60% believe that texting while walking is a problem. In both cases, less than 20% of residents in every city said they didn’t know. While more than 30% of the residents across the US cities believe texting while walking is not a problem, less than 30% believe texting while driving is not a problem. Hence, only a small percentage of residents in the selected cities believe texting while driving is not a problem. The patterns and trends in the responses are easier to perceive in this graph than in the plots in the post.

The next graph helps to compare the percentages within the cities more clearly. For example, in Minneapolis and Cleveland, a higher percentage (more than 50%) of residents believe it is not a problem to text while walking. But we see a clear difference in response when residents were asked whether texting while driving was a problem. Most cities, especially Denver and Houston, see distracted driving as a major problem. However, compared to other cities, Minneapolis had one of the lowest percentages of residents saying texting while driving was a problem and the highest percentage saying it was not a problem.

I also looked at another interesting post on YouGov.com about online courses and how likely students were to cheat in them. The post was pretty interesting because it discussed which age group was likely to take online courses and the reason behind their decision. Among the results presented, there was a pie chart showing whether students thought online courses were as effective as traditional in-class courses. The post can be found here:

https://today.yougov.com/topics/education/articles-reports/2018/08/02/online-education-distance-learning-classes

Instead of a pie chart, I redrew the chart using a dot plot, shown below. We don’t need a key/legend to show which color represents which response. All the information is captured in the dot plot and we see a clear pattern. We observe that 35% of students believed the online courses were as effective as the traditional in-class courses whereas 30% believed they were much or somewhat less effective. Only 14% believed the online courses were much or somewhat more effective.

I believe the dot plot and multiway dot plot provide a better visualization of the data compared to the pop charts. And as Cleveland stated, “Any data that can be encoded by one of these pop charts (such as a pie chart, divided bar chart or an area chart) can also be decoded by either a dot plot or multiway dot plot that typically provides far more pattern perception and table look-up than the pop-chart encoding.”

 

Multivariate Data

In this blog, I will re-visit the US Cereal data set. The data set consists of 65 observations with 11 input variables including the name of the cereal, manufacturer, calories per serving, grams of protein, grams of fat, milligrams of sodium, grams of fiber, milligrams of potassium, grams of sugars, etc. Among these variables, I will focus on three: the number of calories, the grams of fat and the grams of complex carbohydrates in one portion.

I constructed a scatter plot matrix for these selected variables to look into the relationships between them. This is shown below:

From the plot, we see that as the amount of fat increases, the number of calories tends to increase as well. Most cereals have 6 grams of fat or less and fewer than 300 calories. But Great Grains Pecan has more than 9 grams of fat with about 363.64 calories, which is quite interesting. Another interesting cereal is Grape-Nuts, which has no fat but has 440 calories. Considering the relationship between calories and grams of complex carbohydrates, we observe that as the amount of carbohydrates increases, the number of calories tends to increase as well. Most cereals have less than 32 grams of complex carbohydrates, but Grape-Nuts and Great Grains Pecan are high in complex carbohydrates, with Grape-Nuts having the most at about 68 grams. Comparing the grams of fat to the grams of carbohydrates, it’s hard to tell if there is an association between those variables, but there is a slightly positive (upward) trend. Most cereals have less than 6 grams of fat and less than 32 grams of complex carbohydrates, but both Great Grains Pecan and Grape-Nuts fall away from the cluster of points.
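
A rough sketch of how such a scatter plot matrix could be produced in base R. It assumes the data are the UScereal data frame from the MASS package, with columns named calories, fat and carbo; the thresholds used to pick out unusual cereals are simply the ones mentioned in the paragraph above.

```r
# Avoid writing Rplots.pdf when this is run as a script
if (!interactive()) pdf(NULL)

library(MASS)  # assumed source of the UScereal data frame

# Keep only the three variables discussed above
cereal3 <- UScereal[, c("calories", "fat", "carbo")]

# Scatter plot matrix: every pair of variables in its own panel
pairs(cereal3, main = "US cereals: calories, fat, complex carbohydrates")

# Pairwise correlations put a number on the trends seen in the panels
round(cor(cereal3), 2)

# Cereals that sit away from the main cluster
cereal3[cereal3$calories > 300 | cereal3$carbo > 32, ]
```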

Next I constructed a conditioning plot, also known as coplot. Here, I used amount of calories as the response variable and the amount of fat as the explanatory variable conditioning on the amount of complex carbohydrates.

The coplot panels are ordered from lower left to upper right by the level of carbohydrates. We see a similar trend in all the panels: the number of calories tends to increase as fat increases at each level of carbohydrates. At every level of carbohydrates, most cereals have 6 grams of fat or less and at most 300 calories. But both Great Grains Pecan and Grape-Nuts fall into the upper right panel, with higher amounts of carbohydrates and more calories.
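
For readers who want to try a conditioning plot, here is a small sketch using base R’s coplot(). It again assumes the UScereal data frame from MASS; the number of panels and the overlap are the function defaults written out explicitly, not necessarily the settings used for the plot above.

```r
# Avoid writing Rplots.pdf when this is run as a script
if (!interactive()) pdf(NULL)

library(MASS)  # assumed source of the UScereal data frame

# Calories against fat, conditioned on overlapping intervals of carbohydrates
coplot(calories ~ fat | carbo, data = UScereal,
       number = 6, overlap = 0.5,
       xlab = c("Fat (g)", "Given: complex carbohydrates (g)"),
       ylab = "Calories")

# The conditioning intervals behind the panels, one row per panel
carb_intervals <- co.intervals(UScereal$carbo, number = 6, overlap = 0.5)
carb_intervals
```

Printing the intervals makes it easier to say which range of carbohydrates each panel covers.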

Finally, I constructed a spinning 3D scatter plot. We can see a positive association between the variables, and clearly Great Grains Pecan and Grape-Nuts fall away from the cluster of points since they both have high amounts of calories and carbohydrates.


Color

I will re-visit the Broadway library for this blog and compare different time series in a single graph, distinguishing the series using colors. The data set was obtained from the Broadway Library https://think.cs.vt.edu/corgis/csv/broadway/broadway.html. It contains more than 30,000 observations with 12 input variables. The input variables include the year of performances, attendance, capacity, name of the production, name of the theatre, among others.

I will plot the mean capacity (percentage of the theatre that was filled during that week) as a function of week number, to investigate how full the theatres were on average each week from 2013 to 2015. The plot is shown below:

From the graph, we see that for the 3-year period, the trend is quite similar but intertwined. The average percentage of the theatre that was filled during the first few weeks of the Broadway shows in 2013 was relatively low compared to the other years. But as the weeks went by, the average percentage of the theatre that was filled picked up and after week 40, we see a relatively higher percentage filled in 2013 compared to 2014 and 2015. Again, in 2013, the 6th week saw the lowest mean percentage of the theatre that was filled for the Broadway shows but saw the highest mean capacity by the end of the year. The last few weeks in 2015 also saw significant decrease in the mean percentage of the theatre that was filled compared to the other years which saw a slight decrease in the last weeks as well.

Using the geom_smooth function, I overlaid a smooth curve on the graph to see the overall trend and we see that there were ups and downs in the average percentage of the theatre that was filled over the weeks of the Broadway shows for the different years.

Next, I simulated a sample of size 200 from a bivariate normal distribution with correlation rho = -0.9 and used a bivariate density estimation algorithm to construct a contour graph of the density estimate. First, the Spectral palette is used to color the regions of the contours, as shown below:

From the graph, there are several colors and shades of the colors used and we have to constantly refer to the key to understand the order of the values. But what we want is an effortless perception of the order of the data values and to clearly see the boundaries between adjacent levels. I changed the color palette to YlGn to help improve the graph.

In this graph, I used fewer colors than in the first contour graph. We see the boundaries clearly, and the color changes steadily as we move to the outer contours. We no longer have to constantly check the key, and the order of the encoded quantities is perceived effortlessly, so this is a better set of colors than in the first contour plot.
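
The simulation and contour plot can be sketched in base R roughly as follows. This is an illustration under assumptions: it uses kde2d() from MASS as the density estimator and filled.contour() with an hcl.colors() palette, which may differ from the functions actually used for the graphs above.

```r
# Avoid writing Rplots.pdf when this is run as a script
if (!interactive()) pdf(NULL)

library(MASS)  # for kde2d(), a two-dimensional kernel density estimate

set.seed(1)
n   <- 200
rho <- -0.9

# Build correlated normals from two independent standard normals
z1 <- rnorm(n)
z2 <- rnorm(n)
x  <- z1
y  <- rho * z1 + sqrt(1 - rho^2) * z2

# Density estimate on a 50 x 50 grid
dens <- kde2d(x, y, n = 50)

# A sequential palette (YlGn): one hue family, ordered from light to dark
filled.contour(dens,
               color.palette = function(k) hcl.colors(k, "YlGn", rev = TRUE),
               xlab = "x", ylab = "y")
```

Swapping "YlGn" for "Spectral" in hcl.colors() gives a many-hued palette for comparison.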

Loess

In today’s blog, we will be investigating a method for fitting a smooth curve to a scatterplot. The curve is produced by a smoothing procedure called locally weighted scatterplot smoother (or locally weighted regression), also known as loess. I will use the geom_smooth() function in R and control the amount of smoothing with the span argument. To do this, I will simulate a bivariate data set using the following R code:

simdata <- function(sigma = 0.4) {
  # x values evenly spaced over [-pi, pi]
  x <- seq(-pi, pi, length = 200)
  # pick one of four underlying curves at random
  curve <- sample(1:4, size = 1)
  f <- (sin(x) + cos(x)) * (curve == 1) +
    (sin(x) - cos(x)) * (curve == 2) +
    (sin(x) * cos(x)) * (curve == 3) +
    (.28 - .88 * x - 0.03 * x^2 + .14 * x^3) * (curve == 4)
  # add normal noise with standard deviation sigma
  y <- f + rnorm(length(f), 0, sigma)
  data.frame(x = x, y = y)
}

d <- simdata()

With the data, I will construct a scatterplot and overlay a loess smooth using the default value of span in the loess function.

From the scatterplot, we observe that the relationship between x and y is nonlinear, and we are exploring this relationship using loess. Note that I used the default value of span. I want to check how well the curve fits the data, that is, whether the loess curve is distorting the data, so I’ll construct a residual plot. If the smooth curve fits the data, then when we add a loess smooth to the graph of residuals, the curve should be a nearly horizontal line.

From the graph, we see that the smoothing curve is not horizontal, which suggests that there is some dependence of the residuals on x. This may be because the value of the span is too large. To fix this, I will reduce the span to 0.42 and construct a scatterplot with a smooth curve overlaid using the new span. The plot is shown below:

From the scatterplot, the smoother curve is more defined and fits the data better and the residual graph below shows that the curve is nearly a horizontal line which suggests no dependence of the residuals on x.

Clearly, the smoothing curve with span value 0.42 is not distorting the underlying pattern and the curve fits the data well.
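
The residual check can be sketched with base R’s loess() instead of geom_smooth(); this swaps in a different interface to the same kind of smoother. The data below fix one of the four curves from simdata() so the example is reproducible; the seed and the choice of curve are assumptions for illustration.

```r
# Avoid writing Rplots.pdf when this is run as a script
if (!interactive()) pdf(NULL)

set.seed(42)
x <- seq(-pi, pi, length.out = 200)
y <- sin(x) * cos(x) + rnorm(length(x), 0, 0.4)  # curve 3 from simdata()

# Two fits: the default span (0.75) and the reduced span from the post
fit_default <- loess(y ~ x)
fit_small   <- loess(y ~ x, span = 0.42)

res_default <- residuals(fit_default)
res_small   <- residuals(fit_small)

# Smooth the residuals: a flat curve near zero suggests that the fit
# has left no systematic pattern behind
op <- par(mfrow = c(1, 2))
scatter.smooth(x, res_default, main = "span = 0.75", ylab = "Residual")
abline(h = 0, lty = 2)
scatter.smooth(x, res_small, main = "span = 0.42", ylab = "Residual")
abline(h = 0, lty = 2)
par(op)
```

Putting the two residual plots side by side makes it easier to judge which span leaves less structure in the residuals.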

Flights Arriving at Atlanta International Airport by Major Airlines

The data set I used is the number of flights arriving at Hartsfield–Jackson Atlanta International Airport from January to June 2018. Atlanta International is one of the busiest US airports, and I focused on 5 of the biggest airlines: Delta Air Lines, Spirit Airlines, American Airlines, United Airlines and Southwest Airlines. Below is a summary of the data:

First, I constructed a dotplot to compare the average number of flights arriving at Atlanta International Airport for the 6-month period. To see the trend, the averages were ordered from highest to lowest. The dotplot is shown below:

Clearly, we see that Delta Air Lines has the highest average number of flights arriving at Atlanta International Airport over the 6-month period. United Airlines and Spirit Airlines have the fewest flights arriving on average. Next, I constructed a dotplot, grouping by month.

Consistently, Delta Air Lines has the largest number of flights arriving at Atlanta International Airport across the 6-month period, followed by Southwest and American Airlines. Both United and Spirit continue to have the fewest flights arriving. Given this, it made more sense to order the number of flights arriving within each month, as shown below.

Now we consider grouping the number of flights arriving by airline for the 6-month period. This helps to see whether the number of flights arriving has stayed about the same or has declined.

From the graph, the number of flights arriving in February declined slightly for the different airlines, and this may be due to weather conditions. United Airlines consistently had fewer flights arriving at Atlanta International Airport.

Delta Air Lines operates one of its largest hubs at Atlanta International Airport and has, on average, more than 20,000 flights arriving at the airport each month, which is consistent with the dotplots I have.

Comparing distributions

The data set, studentdata was obtained from the LearnBayes package in R. The data contains 657 observations and 11 input variables including height, gender, number of shoes owned by student, among others. For this blog, I will randomly sample 100 observations from the data set and the goal is to compare the number of shoes owned by males and females. Out of the 100 observations, 38 are males and 62 are females.

First, I constructed a parallel one-dimensional scatterplot of the number of shoes owned by males and females. I created this scatterplot by gender to compare the number of shoes owned by females to the number of shoes owned by males. The one-dimensional scatterplot helps to show the individual distributions of the male and female values.

From the graph, females tend to have more shoes compared to males (which is quite obvious!!!). While the smallest number of shoes owned by a female is 5, males tend to have 10 or fewer shoes. The females have a larger spread compared to males, with one female having about 40 shoes. But in general, the parallel one-dimensional scatterplot does not provide enough information to compare the distributions.

Next, I constructed the parallel quantile plot, which is often more effective for comparing data distributions. The data set is split into 2 groups, male shoes and female shoes, and the f-values are computed separately for each group. The plot of the number of shoes against the f-values is shown below:

From the graph, we see that the quantile values for females are higher than those of males. For example, the median number of shoes owned by females is 15 compared to about 5 for males. Typically females have more shoes than males. Comparing the upper quartiles (Q3), 75% of the males have about 8 shoes or fewer, whereas about 75% of the females have at most 25 shoes. The females have a larger spread or variability in their data compared to males.
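
To make the f-values concrete: the i-th smallest of n observations is assigned f = (i - 0.5)/n. The sketch below applies this to two groups of unequal size. Since the studentdata sample is not reproduced here, it uses the built-in mtcars data (mpg for automatic and manual cars) purely as a stand-in, and it overlays the two groups in one panel for brevity.

```r
# f-value of the i-th smallest of n observations: (i - 0.5) / n
fvalues <- function(n) (seq_len(n) - 0.5) / n

# Stand-in data: mpg for automatic (am = 0) and manual (am = 1) cars
auto   <- sort(mtcars$mpg[mtcars$am == 0])
manual <- sort(mtcars$mpg[mtcars$am == 1])

# Avoid writing Rplots.pdf when this is run as a script
if (!interactive()) pdf(NULL)

# Sorted values against their f-values, one symbol per group
plot(fvalues(length(auto)), auto, pch = 1, ylim = range(mtcars$mpg),
     xlab = "f-value", ylab = "Miles per gallon")
points(fvalues(length(manual)), manual, pch = 16)
legend("topleft", legend = c("automatic", "manual"), pch = c(1, 16))

# Medians and upper quartiles for each group
sapply(list(automatic = auto, manual = manual),
       quantile, probs = c(0.5, 0.75))
```

Because the f-values are computed within each group, the two groups can be compared at the same f even though they contain different numbers of observations.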

I also constructed a quantile-quantile plot of the male and female values since it is an effective way to compare the quantiles. Here, we graph the quantiles of the females against the corresponding quantiles of the males. It is a simple but powerful tool for comparing two distributions. The graph is shown below:

The line y = x is overlaid (it is the black line on the graph). All the points are above the line, so I don’t think it is a good idea to summarize the points with this equation. Throughout most of the range of the distribution, the female quantiles are higher than the male quantiles. The corresponding quantiles do not differ by a constant, so it is hard to say how many more shoes the females own compared to the males. The medians of the two groups differ, and the high quantiles for females are larger than the high quantiles for males. Also, higher quantiles differ by more than lower quantiles.

Lastly, I constructed a Tukey mean-difference plot from the quantiles of the two distributions. Graphing the difference in quantiles of the shoes owned by males and females against the mean of the quantiles, we see an upward trend. That is, as the mean of the quantiles increases, the difference between the quantiles of males and females tends to increase as well. From the median to the top quantiles, most of the quantiles for the females are about 15 to 25 shoes higher than those for the males, but the difference decreases to less than 10 shoes for the lower quantiles. This means there is a larger gender difference in the number of shoes at larger mean values. The difference between the numbers of shoes owned by males and females is therefore complicated, unlike cases where the q-q plot shows a simple linear pattern.
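
A sketch of how the q-q plot and the Tukey mean-difference plot can be built from the same pairs of quantiles. As an assumption for illustration, it uses the built-in mtcars data (mpg for automatic and manual cars) rather than the shoe data, and evaluates both groups at the f-values of the smaller group using quantile() with type = 5, whose definition matches f = (i - 0.5)/n.

```r
auto   <- mtcars$mpg[mtcars$am == 0]
manual <- mtcars$mpg[mtcars$am == 1]

# Evaluate both groups at the f-values of the smaller group
n <- min(length(auto), length(manual))
f <- (seq_len(n) - 0.5) / n
q_auto   <- quantile(auto,   probs = f, type = 5, names = FALSE)
q_manual <- quantile(manual, probs = f, type = 5, names = FALSE)

# Tukey mean-difference coordinates
md_mean <- (q_auto + q_manual) / 2
md_diff <- q_manual - q_auto

# Avoid writing Rplots.pdf when this is run as a script
if (!interactive()) pdf(NULL)

op <- par(mfrow = c(1, 2))
# q-q plot with the reference line y = x
plot(q_auto, q_manual, xlab = "Automatic quantiles",
     ylab = "Manual quantiles", main = "q-q plot")
abline(0, 1)
# Mean-difference plot with the reference line at zero
plot(md_mean, md_diff, xlab = "Mean of quantiles",
     ylab = "Difference (manual - automatic)", main = "Tukey m-d plot")
abline(h = 0, lty = 2)
par(op)
```

The mean-difference plot is the q-q plot rotated so that the line y = x becomes horizontal, which makes departures from a constant difference easier to see.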

From the 4 graphs, we can conclude that females tend to own more shoes than males. I believe the parallel quantile plot and the Tukey mean-difference plot provide a better graphical comparison of the shoes owned by males and females in the statistics class. The parallel quantile plot helps to compare the medians and quartiles of the two distributions, and the Tukey mean-difference plot gives further information about the mean and difference between the quantiles.

 

Exploring the Pythagorean Formula

The data is about the NFL standings of the National Football Conference (NFC) in 2017 – 2018. There are 16 teams in the NFC including the Minnesota Vikings, LA Rams, Dallas Cowboys, Green Bay Packers and the defending champions, the Philadelphia Eagles. The data set was obtained from http://www.espn.com/nfl/standings/_/season/2017/group/conference. I did not include data for the American Football Conference (AFC) because the Cleveland Browns lost all their games during that season. (But they’ve won one game this season so let’s see how the season goes.)

For each team, the number of games won (W), the number of games lost (L), the total number of points scored during the season (PF) and the total number of points allowed during the season (PA) were collected. The data is shown below:

The Pythagorean formula (described first by Bill James in the context of baseball) relates the number of points scored by a team and the points allowed to the winning percentage. Thus W/L = (PF/PA)^k, where k is a constant that depends on the particular sport. Taking logs, we can express the formula as log(W/L) = k log(PF/PA). I created a scatterplot of log(W/L) against log(PF/PA) and overlaid the best fitting line to estimate the value of k. I also constructed a plot of the residuals against log(PF/PA).

The top panel is the graph of log(W/L) against log(PF/PA). The line is the least-squares fit with a slope of 3.005. Though the points deviate from the fitted line, most of them lie in a narrow band around it.

The bottom panel is the graph of residuals. Most of the residuals lie between -0.4 and 0.4. This means the deviations of the actual ratio of wins to losses (or winning percentage) from the ideal ones range roughly between -40% and 40%. The largest residual, about 0.6068, means that the ratio of wins to losses for the Arizona Cardinals is roughly 61% larger than the ideal ratio. I don’t think any teams were significantly unusual, but the Giants scored fewer touchdowns and allowed quite a number of touchdowns against them.
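
The estimate of k can be sketched as a least-squares fit on the log scale. The standings table is not reproduced here, so the example below checks the method on synthetic numbers generated with a known exponent; estimate_k is a hypothetical helper name, and the values of PF and PA are made up for the check, not NFL data.

```r
# Estimate k in W/L = (PF/PA)^k by least squares on the log scale.
# estimate_k is a hypothetical helper written for this illustration.
estimate_k <- function(W, L, PF, PA) {
  fit <- lm(log(W / L) ~ log(PF / PA))
  list(k = unname(coef(fit)[2]),
       residuals = unname(residuals(fit)),
       fit = fit)
}

# Synthetic check (not NFL data): win/loss ratios generated with k = 3
PF <- c(300, 350, 400, 450, 280)
PA <- c(350, 330, 320, 300, 400)
ratio <- (PF / PA)^3

est <- estimate_k(W = ratio, L = rep(1, 5), PF = PF, PA = PA)
est$k
```

Two practical notes. A team with no wins or no losses cannot be used, because log(W/L) is not finite in that case. And if natural logs are used, a residual r corresponds to an exact relative deviation of exp(r) - 1 in the win/loss ratio; reading r itself as a percentage is only a good approximation when r is small (for r = 0.6068, exp(r) - 1 is about 0.83).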

 

Broadway Shows

The data set was obtained from the Broadway Library https://think.cs.vt.edu/corgis/csv/broadway/broadway.html. It contains more than 30,000 observations with 12 input variables. The input variables include the year of performances, attendance, capacity, name of the production, name of the theatre, among others.

I considered two time intervals: 2000 – 2008 and 2009 – 2016. I wanted to compare the best Broadway shows for the two time intervals based on the total attendance over the entire week. First, I constructed a grouped bar graph for the entire Broadway data, shown below:

The vertical axis of the graph is the total attendance and the horizontal axis represents the name of the production. It is extremely difficult to read the names of the productions on the horizontal axis because there are too many shows. Also, there is no title for the graph. This makes it hard to compare the Broadway shows during the two time periods.

I then constructed a stacked graph with the same data, but this time I set a minimum attendance of about 2,000,000 people. Thus I subset the data to include only the shows that had at least 2,000,000 people attending. The graph is shown below:

Looking at the horizontal axis, we cannot clearly read the titles of the shows because the names overlap. To fix this, we can flip the coordinates so that the total attendance is on the horizontal axis and the name of the show is on the vertical axis. The horizontal and vertical scale labels can be improved as well. This is shown below:

Lastly, I created a grouped bar chart with the minimum attendance set, but with the horizontal axis labeled as total attendance in millions. This avoids the use of 0e+00, 2e+06, 4e+06 and 6e+06 on the horizontal scale. I added a title as well.
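
The relabelling idea is simple enough to show on its own: keep the tick positions in raw counts, but print them in millions. The sketch below uses base R’s axis() on an empty plot, with the four tick positions mentioned above; it is an illustration of the idea rather than the code used for the chart.

```r
# Tick positions in raw counts, as they appeared on the original scale
ticks <- c(0, 2e6, 4e6, 6e6)

# The same positions expressed in millions for the labels
labels_millions <- format(ticks / 1e6, trim = TRUE)

# Avoid writing Rplots.pdf when this is run as a script
if (!interactive()) pdf(NULL)

# Empty plot with the default axes suppressed, then a relabelled x axis
plot(NA, xlim = range(ticks), ylim = c(0, 1), xaxt = "n", yaxt = "n",
     xlab = "Total attendance (millions)", ylab = "")
axis(1, at = ticks, labels = labels_millions)
```

Moving the unit into the axis title keeps the tick labels short and avoids scientific notation.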

Looking at the graph, we observe that on average more people attended the shows in 2000 – 2008 than in 2009 – 2016. Examples are The Lion King and Mamma Mia!. But Wicked had higher attendance in 2009 – 2016 than in 2000 – 2008. Some shows, such as Beauty and the Beast, The Producers and Hairspray, did not run in 2009 – 2016, whereas Jersey Boys, The Book of Mormon, Mary Poppins and others started their performances after 2008.

 

Population Growth

The population data for Bahrain were obtained from the website http://data.worldbank.org/indicator/SP.POP.TOTL?cid=GPD_1. I collected the population for a ten-year period from 1999 – 2008 where the growth was exponential. To understand the growth rate of Bahrain, I graphed the log (base 2) of the population against the year, as shown below.

In the graph, the vertical scale is the log base 2 of the population in millions, and the population growth is somewhat stable through time. The population of Bahrain grows in an exponential pattern, so graphing on a log scale gives a roughly linear trend. On a log scale, equal vertical steps correspond to equal percentage changes, and the percentage change from year to year tends to increase. From the graph, we see an overall increase in the growth rate of Bahrain as the years go by.

 

Next, I gave the graph two vertical scales: the left scale shows the log (base 2) of the population in thousands and the right scale shows the population in thousands for the ten-year period. This helps to interpret the growth of Bahrain on both the log scale and the original scale.

 

I also collected the population of Equatorial Guinea for the ten-year period from 1999 – 2008. I displayed both curves on the same panel, plotting the log (base 2) of the populations in thousands against the year. The left and right vertical scales were labeled. The graph is shown below.

Graphing on the log scale, we observe that the growth of Equatorial Guinea is very nearly linear. Also, Bahrain has a bigger population than Equatorial Guinea. Overall, the growth rate tends to increase from year to year. The population of Bahrain increased at a higher rate than that of Equatorial Guinea from 1999 – 2008. In Bahrain, the population increased over the ten-year period by a factor of 1.751, about a 75% increase, whereas in Equatorial Guinea it increased by a factor of 1.472, about a 47% increase.
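
The two growth factors can be turned into quantities that are easier to compare. The sketch below assumes that each factor is the ratio of the 2008 population to the 1999 population, which spans 9 annual steps, and that growth is roughly exponential over the period; growth_summary is a hypothetical helper name.

```r
# Summarize an overall growth factor observed over a number of annual steps.
# growth_summary is a hypothetical helper written for this illustration.
growth_summary <- function(growth_factor, steps) {
  doublings <- log2(growth_factor)  # vertical change on the log2 scale
  c(doublings     = doublings,
    annual_rate   = growth_factor^(1 / steps) - 1,  # average yearly growth
    doubling_time = steps / doublings)              # years per doubling
}

steps <- 2008 - 1999  # 9 annual steps between the first and last year

bahrain   <- growth_summary(1.751, steps)
eq_guinea <- growth_summary(1.472, steps)

round(rbind(Bahrain = bahrain, `Equatorial Guinea` = eq_guinea), 3)
```

The doublings column is exactly the rise of each curve on the log (base 2) scale, which is why that scale is convenient: a rise of 1 unit means the population has doubled.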

 

Analysis of Breakfast Cereal

The US cereal dataset comes from the 1993 ASA Statistical Graphics Exposition. It is taken from the F&DA food label. The data consists of 65 observations with 11 input variables including the name of the cereal, manufacturer, calories per serving, grams of protein, grams of fat, milligrams of sodium, grams of fibre, milligrams of potassium, grams of sugars, etc.

I construct a scatterplot to investigate whether there is a linear association between the grams of fibre and grams of potassium in one portion of cereal.

From the plot above, there is a strong positive linear association between the grams of fibre and grams of potassium. Thus cereals with high grams of fibre tend to have high grams of potassium as well. These three cereals, All-Bran, 100% Bran and All-Bran with Extra Fiber have higher grams of fibre as well as potassium and they fall away from the rest of the data.

 

Next, I constructed another scatterplot of the grams of fibre and grams of potassium in one portion of cereal. In this graph, I labeled all the data points and decreased the size of the plotting symbols. This violates two of the attributes of Clear Vision. First, the data do not stand out because the plotting symbols are not visually prominent. We cannot tell how many data values are plotted in the lower left corner. Second, I allowed the data labels in the interior of the scale-line rectangle to interfere with the quantitative data and clutter the graph. Hence we cannot visually distinguish the labels from the plotting symbols.

 

Adding a third variable

I chose a third variable in the USCereal dataset that might be associated with the amount of fibre and potassium: protein. Using color and size, I constructed a new graph that incorporates this third variable in the display.

 

From the plot, we notice that a good number of cereals have lower grams of protein. Cereals with lower grams of fibre tend to have lower amounts of potassium and lower grams of protein as well. The All-Bran cereals tend to have higher amounts of potassium and protein along with their higher grams of fibre.


Instructional fees at BGSU

The data is about instructional fees (per term) at BGSU for selected years. We want to investigate whether the instructional fee has changed (increased) over the years. I constructed a scatterplot of the two variables.

 

The graph is a plot of the log10 of instructional fees against selected years at BGSU. From the plot, the log10 of instructional fees tends to increase as the years go by. The vertical reference line shows the year I started college, 2007, when my estimated instructional fee would have been between $3200 and $3500.

Is Horsepower of a Car Related to Its Mileage?

The horsepower and mileage for 32 cars were collected by Motor Trend magazine. The goal is to see if there is any relationship between the horsepower and mileage of a car. To do this, I constructed a scatterplot of horsepower and mileage.

From the scatterplot, there is a negative association between the horsepower and mileage of the car. As the horsepower of the car increases, the mileage tends to decrease. Clearly, the association is not linear.
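
A short sketch of this scatterplot in base R. It assumes the data are R’s built-in mtcars data frame (32 cars from Motor Trend, with hp for horsepower and mpg for mileage), and it adds a lowess smooth to help judge the curvature; taking logs of both variables is one possible way to examine the nonlinearity, not something done in the post.

```r
# Avoid writing Rplots.pdf when this is run as a script
if (!interactive()) pdf(NULL)

# Assumed data: horsepower and miles per gallon from mtcars
cars <- mtcars[, c("hp", "mpg")]

plot(cars$hp, cars$mpg, xlab = "Horsepower", ylab = "Miles per gallon",
     main = "Mileage against horsepower")
lines(lowess(cars$hp, cars$mpg), lwd = 2)  # smooth curve to reveal curvature

# Correlation on the raw scale and after taking logs of both variables
r_raw <- cor(cars$hp, cars$mpg)
r_log <- cor(log(cars$hp), log(cars$mpg))
c(raw = r_raw, log = r_log)
```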