In this blog post we explore the Pythagorean formula, first described by Bill James in the context of baseball.
The Pythagorean formula is (W/L) = (P/PA)^k, where k is a constant. Taking logs of both sides gives log(W/L) = k * log(P/PA), so we can estimate k as the slope of a linear regression of log(W/L) on log(P/PA) fitted through the origin (no intercept).
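To make the log step explicit, here is the algebra as a short sketch. The symbols y_i, x_i, the error term and the hat on k are notation introduced here for illustration; the last line is the standard least-squares slope for a line with no intercept.

```latex
\begin{align*}
\frac{W}{L} &= \left(\frac{P}{PA}\right)^{k} \\
\log\frac{W}{L} &= k\,\log\frac{P}{PA} \\
y_i &= k\,x_i + \varepsilon_i,
  \qquad y_i = \log\frac{W_i}{L_i},\quad x_i = \log\frac{P_i}{PA_i} \\
\hat{k} &= \frac{\sum_i x_i\,y_i}{\sum_i x_i^{2}}
\end{align*}
```

Because the model has no intercept, a team with P = PA is predicted to have W = L, which is why the line is forced through the origin.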
Part 1: Data
The data are for NFL (American football) teams in the 2017 regular season.
I collected my data from here.
The original data looks like this:
I chose 10 teams randomly and focused on four columns: “W”, the number of games won; “L”, the number of games lost; “P”, the number of points (or runs, goals, etc.) scored by the team; and “PA”, the number of points allowed by the team.
Part 2: Data Analysis
Fitting the model as a regression through the origin,
my.model=lm(log_WtoL~log_PtoPA+0, data = footbal), gives k = 2.4406.
For every one-unit increase in log(P/PA), log(W/L) increases by approximately 2.4406. In other words, for these NFL teams, the win-to-loss ratio is approximately the ratio of points scored to points allowed raised to the power 2.4406.
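As a rough illustration of how the fitted exponent can be used, the sketch below turns a points ratio into a predicted win-to-loss ratio and an expected number of wins. The team totals (400 points scored, 320 allowed) are hypothetical and not taken from the data, and ties are ignored; only k = 2.4406 and the 16-game season come from the 2017 setting.

```r
# Slope reported in the post for the 10 selected 2017 NFL teams.
k <- 2.4406

# Hypothetical team, for illustration only (not taken from the data).
points_scored  <- 400
points_allowed <- 320
games <- 16  # length of the 2017 NFL regular season

# Predicted W/L ratio from the Pythagorean formula.
pred_ratio <- (points_scored / points_allowed)^k

# Convert the ratio W/L into a win fraction W/(W+L), assuming no ties.
pred_win_frac <- pred_ratio / (1 + pred_ratio)

# Expected number of wins over the season.
pred_wins <- games * pred_win_frac

cat("Predicted W/L ratio:   ", round(pred_ratio, 3), "\n")
cat("Predicted win fraction:", round(pred_win_frac, 3), "\n")
cat("Expected wins:         ", round(pred_wins, 1), "\n")
```

The conversion in the middle step is the same as writing the win fraction as P^k / (P^k + PA^k), which is the form in which the Pythagorean formula is often quoted.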
Part 3: Plots
Part 4: Residual Analysis
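A residual here is the observed log(W/L) minus the value predicted from log(P/PA), so a negative residual means a team won less often than its points ratio predicts. The least lucky team is the Los Angeles Chargers, with the lowest residual, -0.39721208.
The luckiest team is the Pittsburgh Steelers, with the highest residual, 0.79238146.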
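One way to pick out the luckiest and least lucky teams programmatically is sketched below. The data frame toy and its values are made up for illustration and are not the 2017 NFL numbers; the names toy, k_hat, k_closed, least_lucky and luckiest are likewise hypothetical. The sketch also checks the fitted slope against the closed-form slope of a no-intercept regression.

```r
# Hypothetical data for illustration only; these are NOT the 2017 NFL values.
toy <- data.frame(
  team = c("A", "B", "C", "D", "E"),
  W    = c(12, 10, 8, 6, 4),
  L    = c(4, 6, 8, 10, 12),
  P    = c(420, 380, 350, 300, 280),
  PA   = c(300, 340, 350, 360, 400)
)
toy$log_WtoL  <- log(toy$W / toy$L)
toy$log_PtoPA <- log(toy$P / toy$PA)

# Regression through the origin, as in the post.
toy_model <- lm(log_WtoL ~ log_PtoPA + 0, data = toy)
k_hat <- unname(coef(toy_model))

# Closed-form slope for a no-intercept fit: sum(x * y) / sum(x^2).
k_closed <- sum(toy$log_PtoPA * toy$log_WtoL) / sum(toy$log_PtoPA^2)

# Residual = observed log(W/L) minus fitted log(W/L).
toy$resid <- unname(residuals(toy_model))

# Lowest residual = least lucky, highest residual = luckiest.
least_lucky <- toy$team[which.min(toy$resid)]
luckiest    <- toy$team[which.max(toy$resid)]

print(toy[order(toy$resid), c("team", "resid")])
cat("Least lucky:", least_lucky, " Luckiest:", luckiest, "\n")
```

Note that in a regression through the origin the residuals need not sum to zero, so "lucky" and "unlucky" teams do not have to balance out exactly.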
Part 5: R Code
# Blog assignment.
# I chose football.
library(dplyr)
library(ggplot2)
library(ggfortify)
# install.packages("gridExtra")  # run once if gridExtra is not installed
library(gridExtra)
footbal <- read.csv("C:/Users/Sima/Desktop/math 6820 Graphics/footbal.txt")
names(footbal)
# "Tm" "W" "L" "W.L." "PF" "PA" "PD" "MoV" "SoS" "SRS" "OSRS" "DSRS"
# Keep the following columns:
#W – the number of games won
#L – the number of games lost
#P – the number of points (or runs, goals, etc) scored by the team
#PA – the number of points allowed by the team
footbal=as.data.frame(footbal)
colnames(footbal)[colnames(footbal)=="PF"] = "P"
colnames(footbal)[colnames(footbal)=="Tm"] = "Team Name"
head(footbal,10)
footbal=select(footbal, c("Team Name","W", "L", "P", "PA"))
head(footbal,10)
#Choose 10 teams:
footbal=footbal[c(1:7, 14:16),]
View(footbal)
##### Do some math here:
footbal$WtoL<-round(footbal$W/footbal$L,3)
footbal$log_WtoL<-round(log(footbal$WtoL),3)
footbal$PtoPA<-round(footbal$P/footbal$PA,3)
footbal$log_PtoPA<-round(log(footbal$PtoPA),3)
# Fit the regression line on the data:
my.model=lm(log_WtoL~log_PtoPA+0, data = footbal)
summary(my.model)
summary(my.model$residuals)
# The slope, which is the k we are looking for, is 2.4406.
# Construct the scatter plot with the regression line superimposed on it.
# Also, construct the residual plot.
# ggplot(footbal, aes(x=log_PtoPA, y=log_WtoL)) +
# geom_point()+
# geom_smooth(method=lm, se=FALSE)
#############################
P1=ggplot(footbal, aes(x=log_PtoPA, y=log_WtoL))+
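# shape 20 is a solid point, so only the colour applies (the duplicate col argument is removed)
geom_point(shape=20, col="red", size=3)+
# formula = y ~ x + 0 draws the same through-origin fit as my.model
geom_smooth(method="lm", formula = y ~ x + 0, se=FALSE)+
ylab("LOGe W/L")+
#xlab("")+
ggtitle("Ratio of Wins to Losses Versus \n Ratio of Pts. Earned to Pts. Given in NFL")+
theme(plot.title = element_text(hjust =0.5, lineheight=.9, face="bold"))+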
theme(aspect.ratio = .75)+
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
P1
#e <- sum(1/factorial(0:100))
##### Now residuals plot:
#my.model$residuals
modf <- fortify(my.model)
modf$.resid
modf$log_PtoPA
P2=ggplot(modf, aes(x=log_PtoPA, y=.resid))+
geom_point(shape=20, col="red", size=3)+
ylab("Residual")+
xlab("LOGe (P/PA)")+
#scale_y_continuous(sec.axis = sec_axis(trans = ~., name = "Residual"))+
# horizontal reference line at residual = 0
geom_hline(yintercept=0,
color = "blue", size=1)+
theme(aspect.ratio = .75)
P2
grid.arrange(P1, P2)