Daily Archives: October 5, 2018

Exploring the Pythagorean Formula

In this post we explore the Pythagorean formula, first described by Bill James in the context of baseball.

The Pythagorean formula is W/L = (P/PA)^k, where k is some constant. Taking logarithms of both sides gives log(W/L) = k·log(P/PA), so we can regress log(W/L) on log(P/PA) with a linear model through the origin (no intercept) to estimate k.
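To see why the log transformation recovers k, here is a minimal R sketch on made-up ratios (not the NFL data). The exponent k_true = 2 and the vector pts_ratio are assumptions chosen purely for illustration; with noise-free data, the through-origin fit should return the exponent that generated it.

```r
# Hypothetical points ratios P/PA for five imaginary teams (illustration only).
pts_ratio <- c(0.8, 0.9, 1.1, 1.3, 1.5)

# Assumed exponent used to generate exact win-to-loss ratios W/L = (P/PA)^k.
k_true <- 2
win_ratio <- pts_ratio^k_true

# On the log scale the relationship is a line through the origin with slope k.
toy_fit <- lm(log(win_ratio) ~ log(pts_ratio) + 0)
k_hat <- unname(coef(toy_fit))

# The least-squares slope through the origin also has the closed form
# sum(x * y) / sum(x^2), which should agree with the lm() estimate.
x <- log(pts_ratio)
y <- log(win_ratio)
k_closed <- sum(x * y) / sum(x^2)
```

Both k_hat and k_closed should match k_true up to floating-point error, which is the property the regression on the real data relies on.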

Part 1: Data

I collected data on NFL (football) teams for the 2017 regular season.

I collected my data from here.

The original data looks like this:

I chose 10 teams at random and focused on four variables: “W”, the number of games won; “L”, the number of games lost; “P”, the number of points (or runs, goals, etc.) scored by the team; and “PA”, the number of points allowed by the team.

 

Part 2: Data Analysis 

Fitting a regression through the origin,

my.model=lm(log_WtoL~log_PtoPA+0, data = footbal)

gives an estimated k of 2.4406.

For every one-unit increase in log(P/PA), log(W/L) increases by approximately 2.4406. In other words, for these NFL teams, the win-to-loss ratio is approximately the ratio of points scored to points allowed, raised to the power 2.4406.
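The fitted relationship can also be read as a predicted winning fraction: since W/L = (P/PA)^k, it follows algebraically that W/(W+L) = P^k/(P^k + PA^k). The sketch below uses the k estimated above; the function name pyth_win_frac and the example point totals are hypothetical, chosen only for illustration.

```r
# Predicted fraction of games won implied by W/L = (P/PA)^k.
# k defaults to the estimate reported in this post;
# pyth_win_frac is an illustrative name, not part of the original analysis.
pyth_win_frac <- function(p, pa, k = 2.4406) {
  p^k / (p^k + pa^k)
}

# Made-up example: a team scoring 400 points and allowing 300.
frac <- pyth_win_frac(400, 300)

# Consistency check: the implied win-to-loss ratio equals (P/PA)^k.
implied_ratio <- frac / (1 - frac)
```

A team that scores exactly as many points as it allows gets a predicted fraction of 0.5, whatever the value of k.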

 

Part 3: Plots

 

Part 4:  Residual Analysis

The least lucky team is the Los Angeles Chargers, with the lowest residual, -0.39721208: they won fewer games than their points ratio predicts.

The luckiest team is the Pittsburgh Steelers, with the highest residual, 0.79238146: they won more games than their points ratio predicts.
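A minimal sketch of how such teams can be picked out programmatically, assuming a data frame with one row per team. The data frame toy and its values are made up for illustration and are not the NFL data.

```r
# Made-up log ratios for four imaginary teams (illustration only).
toy <- data.frame(
  team      = c("A", "B", "C", "D"),
  log_PtoPA = c(-0.2, -0.1, 0.1, 0.2),
  log_WtoL  = c(-0.4, 0.1, 0.2, 0.1),
  stringsAsFactors = FALSE
)

# Same through-origin model form as in the post.
toy_model <- lm(log_WtoL ~ log_PtoPA + 0, data = toy)
toy$resid <- residuals(toy_model)

# Largest residual = "luckiest"; smallest residual = "least lucky".
luckiest    <- toy$team[which.max(toy$resid)]
least_lucky <- toy$team[which.min(toy$resid)]
```

With these toy values the fitted slope is 1.1, so team B has the largest residual and team A the smallest.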

 

Part 5: R Code:

# Blog assignment.
# I chose football.

library(dplyr)
library(ggplot2)
library(ggfortify)
# install.packages("gridExtra")  # run once if gridExtra is not installed
library(gridExtra)

footbal <- read.csv("C:/Users/Sima/Desktop/math 6820 Graphics/footbal.txt")
names(footbal)

#"Tm" "W" "L" "W.L." "PF" "PA" "PD" "MoV" "SoS" "SRS" "OSRS" "DSRS"

# Keep the following columns:

#W – the number of games won
#L – the number of games lost
#P – the number of points (or runs, goals, etc) scored by the team
#PA – the number of points allowed by the team

footbal=as.data.frame(footbal)
colnames(footbal)[colnames(footbal)=="PF"] = "P"
colnames(footbal)[colnames(footbal)=="Tm"] = "Team Name"
head(footbal,10)

footbal=select(footbal, c("Team Name", "W", "L", "P", "PA"))
head(footbal,10)

#Choose 10 teams:
footbal=footbal[c(1:7, 14:16),]
View(footbal)

##### Do some math here:

footbal$WtoL<-round(footbal$W/footbal$L,3)
footbal$log_WtoL<-round(log(footbal$WtoL),3)

footbal$PtoPA<-round(footbal$P/footbal$PA,3)
footbal$log_PtoPA<-round(log(footbal$PtoPA),3)

# Fit the regression line on the data:

my.model=lm(log_WtoL~log_PtoPA+0, data = footbal)
summary(my.model)
summary(my.model$residuals)

# The slope, which is the k we are looking for, is 2.440.
# Construct the scatter plot with the regression line superimposed on it.
# Also, construct the residual plot.

# ggplot(footbal, aes(x=log_PtoPA, y=log_WtoL)) +
# geom_point()+
# geom_smooth(method=lm, se=FALSE)

#############################

P1=ggplot(footbal, aes(x=log_PtoPA, y=log_WtoL))+
  # shape 20 ignores fill, and col was given twice; keep a single point colour.
  geom_point(shape=20, col="red", size=3)+
  # Draw the line through the origin so it matches my.model.
  geom_smooth(method=lm, formula=y~x+0, se=FALSE)+
  ylab("LOGe W/L")+
  #xlab("")+
  ggtitle("Ratio of Wins to Losses Versus \n Ratio of Pts. Earned to Pts. Given in NFL")+
  theme(plot.title = element_text(hjust=0.5, lineheight=.9, face="bold"))+
  theme(aspect.ratio = .75)+
  theme(axis.title.x=element_blank(),
        axis.text.x=element_blank(),
        axis.ticks.x=element_blank())

P1

#e <- sum(1/factorial(0:100))
##### Now residuals plot:

#my.model$residuals
modf <- fortify(my.model)
modf$.resid
modf$log_PtoPA

P2=ggplot(modf, aes(x=log_PtoPA, y=.resid))+
  # shape 20 ignores fill, and col was given twice; keep a single point colour.
  geom_point(shape=20, col="red", size=3)+
  ylab("Residual")+
  xlab("LOGe (P/PA)")+
  #scale_y_continuous(sec.axis = sec_axis(trans = ~., name = "Residual"))+
  # Horizontal reference line at zero residual (stray positional 0 removed).
  geom_hline(yintercept=0, color="blue", size=1)+
  theme(aspect.ratio = .75)

P2

grid.arrange(P1, P2)