Exploring the Pythagorean Formula

In this blog post we explore the Pythagorean formula, first described by Bill James in the context of baseball.

The Pythagorean formula is (W/L) = (P/PA)^k, where k is a constant. Taking logs of both sides gives log(W/L) = k log(P/PA), so we can fit a linear regression through the origin of log(W/L) on log(P/PA) and read off the slope as an estimate of k.
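
To see why the log transformation works, here is a minimal sketch in base R. The ratios below are invented for illustration (they are not the NFL data) and are generated with a known exponent, so we can check that a regression through the origin recovers it. The names `true_k`, `k_hat` and `k_closed` are my own.

```r
# Hypothetical illustration: these ratios are invented, not NFL data.
true_k <- 2.5
PtoPA  <- c(0.80, 0.90, 1.00, 1.10, 1.25)  # made-up points ratios
WtoL   <- PtoPA^true_k                     # exact Pythagorean relationship

x <- log(PtoPA)
y <- log(WtoL)

# Regression through the origin: the "+ 0" removes the intercept.
fit   <- lm(y ~ x + 0)
k_hat <- unname(coef(fit))

# Closed form for the least-squares slope through the origin.
k_closed <- sum(x * y) / sum(x^2)

print(k_hat)
print(k_closed)
```

Both numbers should agree with `true_k` up to floating-point error, because the made-up data follow the formula exactly. Real data will scatter around the line, which is what the residuals in Part 4 measure.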

Part 1: Data

The data cover NFL (football) teams for the 2017 regular season.

I collected my data from here.

The original data looks like this:

I chose 10 teams randomly and focused on four columns: “W”, the number of games won; “L”, the number of games lost; “P”, the number of points (or runs, goals, etc.) scored by the team; and “PA”, the number of points allowed by the team.

 

Part 2: Data Analysis 

Fitting the model as a regression through the origin,

my.model=lm(log_WtoL~log_PtoPA+0, data = footbal)

gives k = 2.4406.

For every one-unit increase in log(P/PA), log(W/L) increases by approximately 2.4406. In other words, for these NFL teams the win-to-loss ratio is approximately the ratio of points scored to points allowed raised to the power 2.4406.
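
One way to use the fitted exponent is to turn a points ratio into an expected share of games won, since W/L = (P/PA)^k implies W/(W+L) = P^k / (P^k + PA^k). The sketch below assumes the estimate k = 2.4406 from above. The function name `expected_win_fraction` and the example team (400 points scored, 320 allowed) are hypothetical.

```r
# Convert points scored and allowed into an expected win fraction.
# W/L = (P/PA)^k  implies  W/(W+L) = P^k / (P^k + PA^k).
expected_win_fraction <- function(P, PA, k = 2.4406) {
  ratio_k <- (P / PA)^k
  ratio_k / (1 + ratio_k)
}

# Hypothetical team: 400 points scored, 320 allowed (invented numbers).
frac <- expected_win_fraction(400, 320)

# The 2017 NFL regular season had 16 games per team.
expected_wins <- 16 * frac

print(frac)
print(expected_wins)
```

A team that scores exactly as many points as it allows gets a fraction of 0.5 for any k, which is a quick sanity check on the function.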

 

Part 3: Plots

 

Part 4:  Residual Analysis

The least lucky team is the Los Angeles Chargers, with the lowest residual, -0.39721208: they won fewer games than their points ratio predicts.

The luckiest team is the Pittsburgh Steelers, with the highest residual, 0.79238146: they won more games than their points ratio predicts.
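
The way I read "luck" off the residuals can be sketched as follows. The data frame `toy` is hypothetical: the team labels and values are invented so the example runs on its own, and they are not the NFL numbers.

```r
# Hypothetical data: team labels and values are invented for illustration.
toy <- data.frame(
  team      = c("A", "B", "C", "D"),
  log_PtoPA = c(-0.20, -0.05, 0.10, 0.25),
  log_WtoL  = c(-0.70, 0.10, 0.15, 0.60),
  stringsAsFactors = FALSE
)

# Same model form as in the post: regression through the origin.
fit <- lm(log_WtoL ~ log_PtoPA + 0, data = toy)
toy$resid <- residuals(fit)

# Largest residual: more wins than predicted ("luckiest").
luckiest <- toy$team[which.max(toy$resid)]
# Smallest residual: fewer wins than predicted ("least lucky").
least_lucky <- toy$team[which.min(toy$resid)]

print(toy[order(toy$resid), ])
```

Because the model has no intercept, the residuals need not sum to zero, so it is worth looking at the whole sorted table and not only the two extremes.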

 

Part 5: R Code

# Blog assignment.
# I chose football.

library(dplyr)
library(ggplot2)
library(ggfortify)
# install.packages("gridExtra")  # run once if gridExtra is not installed
library(gridExtra)

footbal <- read.csv("C:/Users/Sima/Desktop/math 6820 Graphics/footbal.txt")
names(footbal)

# "Tm" "W" "L" "W.L." "PF" "PA" "PD" "MoV" "SoS" "SRS" "OSRS" "DSRS"

# Get the following:

#W – the number of games won
#L – the number of games lost
#P – the number of points (or runs, goals, etc) scored by the team
#PA – the number of points allowed by the team

footbal=as.data.frame(footbal)
colnames(footbal)[colnames(footbal)=="PF"] = "P"
colnames(footbal)[colnames(footbal)=="Tm"] = "Team Name"
head(footbal,10)

footbal=select(footbal, c("Team Name", "W", "L", "P", "PA"))
head(footbal,10)

#Choose 10 teams:
footbal=footbal[c(1:7, 14:16),]
View(footbal)

##### Do some math here:

footbal$WtoL<-round(footbal$W/footbal$L,3)
footbal$log_WtoL<-round(log(footbal$WtoL),3)

footbal$PtoPA<-round(footbal$P/footbal$PA,3)
footbal$log_PtoPA<-round(log(footbal$PtoPA),3)

# Fit the regression line on the data:

my.model=lm(log_WtoL~log_PtoPA+0, data = footbal)
summary(my.model)
summary(my.model$residuals)

# The slope, which is the k we are looking for, is 2.440.
# Construct the scatter plot with the regression line superimposed on it.
# Also, construct the residual plot.

# ggplot(footbal, aes(x=log_PtoPA, y=log_WtoL)) +
# geom_point()+
# geom_smooth(method=lm, se=FALSE)

#############################

P1=ggplot(footbal, aes(x=log_PtoPA, y=log_WtoL))+
  # col was given twice in the original call; one colour is kept
  # (shape 20 ignores fill, so fill is dropped)
  geom_point(shape=20, col="red", size=3)+
  # formula = y ~ x + 0 draws the line through the origin, matching my.model
  geom_smooth(method=lm, formula = y ~ x + 0, se=FALSE)+
  ylab("LOGe (W/L)")+
  #xlab("")+
  ggtitle("Ratio of Wins to Losses Versus \n Ratio of Pts. Earned to Pts. Given in NFL")+
  theme(plot.title = element_text(hjust=0.5, lineheight=.9, face="bold"))+
  theme(aspect.ratio = .75)+
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

P1

#e <- sum(1/factorial(0:100))
##### Now residuals plot:

#my.model$residuals
modf <- fortify(my.model)
modf$.resid
modf$log_PtoPA

P2=ggplot(modf, aes(x=log_PtoPA, y=.resid))+
  # col was given twice in the original call; one colour is kept
  geom_point(shape=20, col="red", size=3)+
  ylab("Residual")+
  xlab("LOGe (P/PA)")+
  #scale_y_continuous(sec.axis = sec_axis(trans = ~., name = "Residual"))+
  # reference line at zero residual (a stray extra argument was removed)
  geom_hline(yintercept=0,
             color = "blue", size=1)+
  theme(aspect.ratio = .75)

P2

grid.arrange(P1, P2)

 

 
