In this post I compare the distribution of the number of shoes owned by female and male students in an introductory statistics class. At BGSU, all instructors give their introductory statistics students a survey with several questions about their lifestyle and preferences. Later in the semester, the instructor is supposed to use the data to teach different plots, such as the scatter plot. I find this very useful, as it can make the students more excited about data analysis.
I use the tidyverse package in R to take a sample of 100 students. It is always good to do some very basic exploratory analysis before any formal analysis. A simple exploration could be to get a summary of the number of shoes, or to count how many male and female students are in my sample. It turns out that I have 66 female students and 34 male students. Male students own at most 10 shoes in my sample, while the maximum for female students is 164! Could this be an outlier? Or was the value entered in the data set by mistake? In either case, we can still analyse the data without much trouble.
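The kind of quick exploration described above can be sketched in a few lines of base R. This is only an illustration: the data frame `toy` and its values are made up for the example and stand in for the real sample (`d.sample` in the code at the end of this post).

```r
# Made-up toy data standing in for the real sample (values are illustrative only)
toy <- data.frame(
  Gender = factor(c("female", "female", "female", "male", "male")),
  Shoes  = c(12, 30, 164, 4, 10)
)

# How many students of each gender are in the sample?
gender_counts <- table(toy$Gender)
print(gender_counts)

# Summary of Shoes within each gender
shoe_summary <- tapply(toy$Shoes, toy$Gender, summary)
print(shoe_summary)

# Largest value per gender, to spot suspicious entries such as 164
max_by_gender <- tapply(toy$Shoes, toy$Gender, max, na.rm = TRUE)
print(max_by_gender)
```

The same three calls should work on the real sample by replacing `toy` with `d.sample`, assuming its columns are named `Gender` and `Shoes` as in the code below.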
Next, I plot the number of shoes of the female and male students in order to compare their distributions. There are several different types of plots that can help with this task. In the following parts, I show how each plot can help compare the distributions.
Before getting to the plots, I should mention that I sometimes try to plot with both base R plotting code and ggplot. ggplot has all the embellishments we need to make a graph prettier and more interpretable. To some extent, I leave the judgement to the audience.
Part 1: Construct parallel one-dimensional scatter plot


We see that female students clearly tend to have more shoes. The number of shoes owned by females stretches all the way to 164, while males own at most 10.
Part 2: Parallel Quantile Plot
In this part, I show two different styles of parallel quantile plot. I prefer the second one, since it lets me compare the quantiles more easily.
From this kind of plot, we can easily compare the quantiles of male and female students. The quantile for a given Fraction is the value below which approximately (100*Fraction)% of the observations in the sample fall. For example, Fraction=0.5 corresponds to the median.
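To make the link between fractions and quantiles concrete, here is a small sketch. The vector `shoes` is made up for illustration; the fraction formula `(i - 0.5) / n` is the one used in the code at the end of this post.

```r
# Made-up shoe counts, for illustration only
shoes <- c(3, 5, 8, 12, 20, 25, 40)

n <- length(shoes)
# Fraction attached to the i-th smallest value: (i - 0.5) / n
fraction <- (seq_len(n) - 0.5) / n
sorted_shoes <- sort(shoes)

# Each row pairs a fraction with the sorted value plotted at that fraction
quantile_table <- data.frame(Fraction = round(fraction, 3), Shoes = sorted_shoes)
print(quantile_table)

# The middle value sits at Fraction = 0.5 and equals the median
med <- median(shoes)
print(med)
```

With an odd number of observations, the middle sorted value gets Fraction 0.5 exactly, which is why that row of the table matches `median(shoes)`.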


The median number of shoes owned by males is 5, meaning that about 50% of the male students have 5 or fewer shoes. The median for the female students is 20, which means that about 50% of the female students have 20 or fewer shoes!
All of the quantiles for females are larger than the corresponding quantiles for males. Also, near the maximum, we see that there must be some female student with an unusually large number of shoes. I mentioned this peculiarity at the beginning.
Part 3: Quantile-Quantile plot
For this part, I display the Q-Q plot with both base R and ggplot tools.


“In statistics, a Q–Q (quantile-quantile) plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other”
This quantile-quantile plot vividly shows that men have fewer shoes than women, as all the dots are below the line y=x.
Although it is hard to read off the individual quantiles, since the quantiles of males and females are combined into single points, we can tell that female students have the larger value at every quantile. We also see that the gap between the quantiles grows as the quantiles increase. For example, at the quantile where a female student owns 50 shoes, the corresponding male student owns 10. We can explore the differences between the quantiles in the Tukey plot, which I explain next.
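As a rough sketch of what a Q-Q plot pairs up, the snippet below evaluates the quantiles of two groups on a common grid of fractions and checks whether every point would fall below the line y=x. The vectors `female` and `male` are made-up values, not the survey data, and the grid of 5 fractions is an arbitrary choice for the example.

```r
# Made-up values for two groups of different sizes (illustration only)
female <- c(8, 12, 15, 20, 24, 30, 45, 60)
male   <- c(2, 3, 5, 6, 10)

# A common grid of fractions at which both groups are evaluated
f <- (1:5 - 0.5) / 5

# Quantiles of each group at the same fractions
q_female <- quantile(female, f, names = FALSE)
q_male   <- quantile(male, f, names = FALSE)

# Each row is one point of the Q-Q plot: x = female quantile, y = male quantile
qq <- data.frame(Fraction = f, Female = q_female, Male = q_male)
print(qq)

# TRUE when every point lies below the line y = x
all_below <- all(q_male < q_female)
print(all_below)
```

Because the two groups have different sizes, their sorted values cannot be paired one to one, so both are evaluated at the same fractions instead.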
Part 4: Tukey Mean-Difference Plot
In this last part, I use both base R commands and ggplot to display the Tukey mean-difference plot.


As the quantiles increase, the differences between the quantiles of males and females also increase. In this plot the trend of increasing spread is perhaps easier to see, and the differences between the quantiles can be read off directly. This is because we plot the difference of the quantiles on the y axis against the mean of the quantiles on the x axis.
Only for the first few quantiles is the difference in the number of shoes between male and female students small. We saw this in the other plots too. I mentioned in the Q-Q plot that when a female student owns 50 shoes, a male student owns 10. Here in the Tukey plot we see the difference of the quantiles for those counts is 25.
I prefer the parallel quantile plots because I can compare the corresponding quantiles more easily, as the example above showed. The Tukey plot is also effective because it shows both the differences of the quantiles and their averages.
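The coordinates of a mean-difference plot can be sketched as follows. The quantile vectors `q_female` and `q_male` are made-up numbers chosen for illustration; with real data they would come from `quantile()` evaluated at the same fractions for both groups.

```r
# Made-up paired quantiles for two groups (illustration only)
q_female <- c(8, 14, 20, 28, 44)
q_male   <- c(3, 4, 5, 7, 9)

# x axis: mean of each pair of quantiles; y axis: their difference
md <- data.frame(
  Mean       = (q_female + q_male) / 2,
  Difference = q_female - q_male
)
print(md)

# TRUE when the gap between the groups widens at every step
widening <- all(diff(md$Difference) > 0)
print(widening)
```

Calling `plot(md$Mean, md$Difference)` on this data frame would draw the plot. A horizontal cloud of points would suggest a constant shift between the groups, while an upward trend like the one here suggests the gap grows with the number of shoes.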
R Code:
library(LearnBayes)
library(tidyverse)
library(ggplot2)
library(devtools)
#install_github("kassambara/easyGgplot2")
library(easyGgplot2)
library(gridExtra)
set.seed(13555)
d.sample <- sample_n(studentdata, size = 100)
names(d.sample)
#1. Construct parallel one-dimensional scatterplots of the variable by gender.
levels(d.sample$Gender)
summary(d.sample$Shoes)
SP1=d.sample[which(d.sample$Gender== "female"),]$Shoes
summary(SP1)
SP2=d.sample[which(d.sample$Gender== "male"),]$Shoes
summary(SP2)
#SP1<- subset(d.sample, Gender == "female")$Shoes
#SP2 <- subset(d.sample, Gender == "male")$Shoes
#stripchart produces one dimensional scatter plots (or dot plots) of the given data.
#These plots are a good alternative to boxplots when sample sizes are small.
stripchart(list("Female" = SP1, "Male" = SP2),
main = "Parallel Scatterplot \n comparing the number of shoes", xlim = c(1,164), xlab = "Number of Shoes", ylab = "Gender")
##########
ggplot2.stripchart(data=d.sample, xName="Shoes",yName="Gender",
groupName="Gender",
xTickLabelFont=c(7,"bold", "#993333"),
yTickLabelFont=c(14,"bold", "#993333"),
mainTitle="Parallel Scatter plot via ggplot2",
position = position_dodge())+
scale_x_discrete(breaks= pretty(d.sample$Shoes, n=60))+
labs(title = "Parallel Scatter plot via ggplot2",
subtitle = "Comparing the distribution of Number of shoes owned by female vs. male",
caption = "", x="Number of Shoes")
#Based on the graph, we can see that female students tend to have more shoes than males.
#2. Construct parallel quantile plots of the male values and
#of the female values. Write several sentences that help the
#reader interpret the quantile plots.
n1=length(SP1)
n1
n2=length(SP2)
n2
f1 <- (1:n1 - 0.5) / n1
plot(f1, sort(SP1, na.last = TRUE),
xlab = "Fraction", ylab = "Shoes",
main = "Parallel Quantile Plot", pch=21, col="blue")
f2 <- (1:n2 - 0.5) / n2
points(f2, sort(SP2, na.last = TRUE), col="red")
legend(0, 150, legend=c("Female", "Male"),
col=c("blue", "red"), cex=0.8, pch=c(21,21))
# We need a data.frame to get the ggplot.
SP1
SORT.F=sort(SP1, na.last = TRUE)
df=cbind.data.frame(SP=SP1, FR=f1, Sort.Shoes=SORT.F,GENDER=rep("Female", length(SP1)))
names(df)
#View(df)
SORT.M=sort(SP2, na.last = TRUE)
dm=cbind.data.frame(SP=SP2, FR=f2, Sort.Shoes=SORT.M, GENDER=rep("Male", length(SP2)))
names(dm)
# To get the side by side parallel quantile plot, like the one in the book.
# Make two plots and then use the grid to put them side by side.
Pf=ggplot(df, aes(Sort.Shoes,FR))+
geom_point(shape= 1, fill="orange",size=3, col = "orange")+
theme(aspect.ratio = 1)+
labs(title = "Parallel Quantile Plots",
subtitle = "Female",
caption = "",
x = "Number of shoes", y = "Fraction")
Pm=ggplot(dm, aes(Sort.Shoes,FR))+
geom_point(shape= 1, fill="orange",size=3, col = "orange")+
theme(aspect.ratio = 1)+
theme(axis.title.y=element_blank())+
labs(title = "",
subtitle = "Male",
caption = "",
x = "Number of shoes", y = "Fraction")
grid.arrange(Pf, Pm, nrow=1, ncol=2)
# To get a plot like the one Dr. Albert made in his post:
BOTH=rbind.data.frame(df,dm)
nrow(BOTH)
names(BOTH)
#View(BOTH)
ggplot(BOTH, aes(FR, Sort.Shoes, colour=GENDER))+
geom_point()+
labs(title = "Parallel Quantile Plot",
subtitle = "Comparing the distribution of Number of shoes owned by female vs. male",
caption = "Source: studentdata from the LearnBayes",
x = "Fraction", y = "Number of shoes")
# Female students tend to have more shoes. The difference between the number of shoes
# owned by females vs. males increases as the number of shoes owned increases.
# It seems that in my sample, if you are a male student and own fewer than 10 shoes, you can
# find a female student with fewer than 10 shoes. But if you are a female student and have more than 30
# shoes, you won't find a male student who has as many shoes as you have.
# The median number of shoes owned by males is 5, meaning about 50% of the male students have 5 or fewer shoes.
# The median for the female students is 20, meaning about 50% of the female
# students have 20 or fewer shoes!
# For the most part the quantiles of females are larger than the corresponding quantiles
# of males. Near the maximum, we see that there must have been some
# females with a very large number of shoes, because the quantiles near 100% for women are far above those for
# men in this sample.
# 3. Construct a quantile-quantile plot of the male and female
#values.
#Quantile-Quantile Plot
#A different plot is a scatterplot of the corresponding quantiles.
#Below I set up a grid of fraction values, find the quantiles of each dataset
#corresponding to these fractions, and construct the scatterplot.
qqplot(SP1, SP2, main="QQ Plot", xlab = "Number of Shoes owned by Females",
ylab = "Number of Shoes owned by Males", col="blue")
abline(a=0,b=1,col="red",lwd=2)
text(labels="Y=X",x=4,y=8,col="blue")
f <- (1:15 - 0.5) / 15
q1 <- quantile(SP1, f, na.rm = TRUE)
q2 <- quantile(SP2, f, na.rm = TRUE)
# plot(q1, q2, main = "Quantile-Quantile Plot")
# abline(0, 1,col="red",lwd=2)
##############
# QQplot:
d <- as.data.frame(qqplot(SP1, SP2, plot.it=FALSE))
ggplot(d) + geom_point(aes(x=x, y=y), col="blue")+
geom_abline(intercept = 0, slope = 1, col="red")+
labs(title = "Quantile Quantile Plot",
subtitle = "Comparing the distribution of Number of shoes owned by female vs. male",
caption = "",
x = "Number of Shoes owned by Females", y = "Number of shoes owned by Males")+
annotate(geom="text", x = 0, y = 5, label = "Y=X", color="red")+
annotate(geom="text", x=162, y=9, label="164 Shoes",
color="red")
# This quantile-quantile plot vividly shows that men have fewer shoes than women, as all the
# dots are below the line y=x.
####################
##4. Construct a Tukey m-d plot from the quantiles of the two
#samples. Interpret the plot. Is there a simple relationship
#between the male values and the female values?
#plot((q1 + q2) / 2, (q2-q1),main = "Tukey Mean-Difference Plot")
quant_male=quantile(SP2,probs=seq(0,1,by=.1), na.rm = TRUE)
quant_female=quantile(SP1,probs=seq(0,1,by=.1), na.rm = TRUE)
Difference=quant_female-quant_male
# The m-d plot puts the mean of each pair of quantiles on the x axis.
Mean=(quant_female+quant_male)/2
plot(Mean,Difference, main="Tukey Mean-Difference Plot")
#abline(h = 0, col = "blue")
## ggplot version.
Q.data=cbind.data.frame(quant_female,quant_male, Difference, Mean)
names(Q.data)
ggplot(Q.data) + geom_point(aes(x=Mean, y=Difference), col="blue")+
labs(title = "Tukey Mean-Difference Plot",
subtitle = "Comparing the distribution of Number of shoes owned by female vs. male",
caption = "",
x = "Mean of Quantiles", y = "Difference of Quantiles")
#+ geom_hline(yintercept=0)