Comparing Distributions with Graphs

Data

The dataset “studentdata” comes from the LearnBayes package and contains the results of a survey from an introductory statistics class.

I took a random sample of 100 observations from the dataset and aimed to display a graphical comparison of the haircut prices of the men and women in the class. (The relevant variables in the dataset are “Haircut” and “Gender”.)

Note: A haircut price of zero is generally not realistic. So, I removed all observations with NA values or zeroes in these two variables, because keeping them would inflate the sample size and distort the quantiles.
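The cleaning step above can be sketched in a few lines. This is a minimal Python illustration, not the code used for the post: the records are made-up values standing in for the “Gender” and “Haircut” columns, the helper names are hypothetical, and sampling before cleaning is an assumption about the order of the two steps.

```python
import math
import random

# Made-up records standing in for (Gender, Haircut) rows of studentdata.
# None and NaN play the role of NA; none of these numbers come from the survey.
records = [
    ("female", 45.0), ("male", 15.0), ("female", 0.0), ("male", None),
    (None, 20.0), ("female", 80.0), ("male", 12.0), ("female", float("nan")),
]

def is_missing(value):
    """Treat None and NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def draw_sample(rows, size, seed=0):
    """Random sample without replacement, capped at the number of rows."""
    rng = random.Random(seed)
    return rng.sample(rows, min(size, len(rows)))

def clean(rows):
    """Keep rows with a known gender and a known, strictly positive price."""
    return [
        (gender, price)
        for gender, price in rows
        if not is_missing(gender) and not is_missing(price) and price > 0
    ]

# Assumed order: sample first, then drop unusable rows.
sample = draw_sample(records, 100)
cleaned = clean(sample)
print(len(sample), "sampled,", len(cleaned), "kept")
```

Dropping the rows, rather than leaving zeroes in place, keeps n equal to the number of real prices, which is what the quantile calculations rely on.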

Parallel Dotplots

The parallel dotplots show that the distribution of the female group is clearly right-skewed, and that most of the haircut prices in the female group are higher than those in the male group. However, the pattern of the differences is hard to see from this plot.

Parallel Quantile Plots

Notice that, in this case, there are more black circles than red circles, since we have more observations for the female group than for the male group.

To construct the parallel quantile plots, I sort the data in each group from smallest to largest, denoted X(i), i = 1, …, n, where n is the sample size of that group. Then I create equally spaced fractions between 0 and 1; these are the f-values of each group, computed as (i - 0.5)/n. The last step is to plot the fractions against the sorted values.

This plot makes it easy to compare quartiles, medians, and other quantiles. For instance, suppose we want to compare the maximums: the maximum in the female group is nearly 150 bucks, while the maximum in the male group is around 25 bucks. Also, the difference between the two groups at the same fraction grows substantially as the fraction increases.
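The construction of the f-values can be written out as a short sketch. The function name `quantile_points` and the prices are hypothetical; only the formula (i - 0.5)/n comes from the description above.

```python
def quantile_points(values):
    """Return (f_i, x_(i)) pairs with f_i = (i - 0.5) / n for i = 1, ..., n."""
    ordered = sorted(values)          # x_(1) <= x_(2) <= ... <= x_(n)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Made-up prices, not taken from the survey sample.
female = [30.0, 45.0, 25.0, 150.0, 60.0]
male = [15.0, 10.0, 25.0, 12.0]

for label, group in (("female", female), ("male", male)):
    for f, x in quantile_points(group):
        print(f"{label:6s} f = {f:.3f}  price = {x:.1f}")
```

Each group gets its own set of fractions because n differs between groups, which is why the two sets of points in the plot do not line up on the same f-values.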

Quantile-Quantile Plot

The Q-Q plot is more specific than the previous two graphs. It shows that below the 50% quantile, the differences between matching quantiles are under 10 bucks. Beyond roughly the 80% quantile, the differences increase dramatically.
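Because the two groups have different sample sizes, their sorted values cannot be paired one to one. The post does not say how the quantiles were matched; a common approach, assumed in this sketch, is to take the f-values of the smaller group and linearly interpolate the larger group's sorted values at those fractions. The function names and the prices are hypothetical.

```python
def f_values(n):
    """Fractions (i - 0.5) / n for i = 1, ..., n."""
    return [(i - 0.5) / n for i in range(1, n + 1)]

def quantile_at(ordered, f):
    """Interpolate sorted data at fraction f, where x_(i) sits at (i - 0.5)/n."""
    n = len(ordered)
    pos = f * n + 0.5                 # fractional 1-based position of f
    if pos <= 1:
        return ordered[0]
    if pos >= n:
        return ordered[-1]
    lower = int(pos)                  # index of the point just below f
    weight = pos - lower
    return ordered[lower - 1] + weight * (ordered[lower] - ordered[lower - 1])

def qq_pairs(x, y):
    """Pair quantiles of x and y at the f-values of the smaller group."""
    xs, ys = sorted(x), sorted(y)
    m = min(len(xs), len(ys))
    return [(quantile_at(xs, f), quantile_at(ys, f)) for f in f_values(m)]

# Made-up prices, not taken from the survey sample.
male = [15.0, 10.0, 25.0, 12.0]
female = [30.0, 45.0, 25.0, 150.0, 60.0]

for m_q, f_q in qq_pairs(male, female):
    print(f"male {m_q:6.2f}   female {f_q:7.3f}")
```

Points above the 45-degree line in the resulting plot correspond to pairs where the female quantile exceeds the male quantile.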

Tukey Mean-Difference Plot

Of the four graphs above, I would choose the Tukey mean-difference plot as the best graphical comparison of the two sets of measurements.

The Tukey mean-difference plot is really an alternative form of the Q-Q plot. Instead of reading differences as distances from a 45-degree diagonal line, we read them as distances from a horizontal line at zero. Here it shows that the differences between matching quantiles are all positive. We reach the same conclusion as with the Q-Q plot, but more easily and quickly, because the y-axis is the difference itself.

In summary, women pay more for haircuts than men. At the low end of the price range, the difference between men and women is small, but at the high end, women pay much more than men.
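The conversion from Q-Q pairs to mean-difference points is a one-line transformation, sketched here. The quantile pairs are made-up numbers and `mean_difference` is a hypothetical helper; the difference is taken as female minus male, which is an assumption consistent with the positive differences described above.

```python
def mean_difference(pairs):
    """Map each (x, y) quantile pair to (mean, difference) = ((x + y) / 2, y - x)."""
    return [((x + y) / 2, y - x) for x, y in pairs]

# Made-up (male quantile, female quantile) pairs, not the survey values.
qq = [(10.0, 25.0), (12.0, 35.0), (15.0, 55.0), (25.0, 140.0)]

md = mean_difference(qq)
for mean, diff in md:
    print(f"mean = {mean:5.1f}   difference = {diff:5.1f}")

# All points lie above the zero line exactly when every difference is positive.
all_positive = all(diff > 0 for _, diff in md)
print("all differences positive:", all_positive)
```

Plotting the difference against the mean puts the reference line at zero, so the size of each gap can be read directly from the vertical axis.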
