How to rate the performance of a soccer team? An application of Principal Components Analysis

How many times have you seen a team that is playing great go through a rough streak of lost matches, with the head coach at risk of being fired because of bad results? When results are the only measure you have of a team’s performance, it is quite difficult to know whether the team is improving or not, and this kind of situation tends to happen quite often. Most of the time, rating the performance of a team relies exclusively on qualitative analysis, and usually nobody agrees on it (if you don’t believe me, watch any sports show and see how the ‘experts’ can never agree on anything).

In this post I want to show how to rate the performance of a team using only simple match statistics. For that, I will use a database that contains information on matches from La Liga and the Premier League, from 2011 to 2016.

As the title suggests, I will use principal components analysis (PCA) for this task. Unlike other posts, this time I will also highlight some technical points around the analysis and the code.
First, let’s load the libraries and the data, and set some options to make the data wrangling easier.


#Load libraries
libraries <- c('plyr','zoo', 'magrittr', 'stringr', 'tidyverse', 'ggplot2',
 'forecast', 'lubridate')

lapply(libraries,require, character.only = TRUE )

#Define plotting and data importing specifications
options(stringsAsFactors = F)

cbPalette <- c( "#D55E00", "#0072B2")

#Load data and prepare it for the analysis
STATS <- read.csv('Index_performance/DATABASE.csv')
#STATS <- read.csv('

# Change class to several columns
STATS[,-c(1:4)] %<>% sapply(as.numeric)
STATS$Date %<>% as.Date

#Subset the data
STATS %<>% filter(variable == 'Home')


The data contains statistics for 4142 matches from the Premier League and La Liga.


Once the data is in a nice format we can run the principal component analysis (PCA). PCA can be performed on either the covariance matrix or the correlation matrix. In this case I will use the correlation matrix, because some variables, like Fouls or Possession, have a much wider range than others, like Goals or Red Cards. This technical decision has a huge impact on the analysis, mainly because it makes the components interpretable: as you will see below, the components have a fairly intuitive explanation for this dataset.
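To make the covariance versus correlation choice concrete, here is a minimal sketch on synthetic data. The `toy` data frame and its values are made up for illustration and are not the match database. It shows that a wide-range variable dominates a covariance-based PCA, and that `cor = TRUE` gives the same loadings as running the covariance-based PCA on standardized columns.

```r
# Synthetic data (made up for illustration): two variables on very different scales
set.seed(1)
n <- 200
toy <- data.frame(
  Fouls = rnorm(n, mean = 13, sd = 4),   # wide range
  Goals = rnorm(n, mean = 1.5, sd = 1)   # narrow range
)

pca_cov <- princomp(toy, cor = FALSE)  # PCA on the covariance matrix
pca_cor <- princomp(toy, cor = TRUE)   # PCA on the correlation matrix

# Covariance matrix: the wide-range variable dominates the first component
abs(pca_cov$loadings[, 1])

# Correlation matrix: both variables weigh the same (in absolute value)
abs(pca_cor$loadings[, 1])

# Correlation-based PCA matches covariance-based PCA on standardized columns
# (loadings are compared in absolute value because signs are arbitrary)
pca_std <- princomp(scale(toy), cor = FALSE)
all.equal(abs(unclass(pca_cor$loadings)), abs(unclass(pca_std$loadings)))
```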


#Principal components
Princomp <- princomp(STATS[,-c(1:4)], cor = T )

#Find number of components that explain most of the data

biplot(Princomp, col =c('white', 'red'), cex = 1)

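#Loadings of the first two components
#(the closing arrange() is assumed from the ordering of the table below)
(Loadings <- Princomp$loadings[,1:2] %>% round(2) %>% data.frame %>%
 mutate(Attribute = rownames(.)) %>%
 select(Attribute, everything()) %>%
 arrange(Comp.1))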


PCA is a dimensionality reduction technique, which allows us to plot just the first 2 components and makes the results easier to present. The biplot is a great tool to show how the variables are related and how much each one weighs on each component. In it you can see 3 main groups: the offensive variables, such as goals and shots; the defensive variables, such as the opponent’s goals; and the disciplinary variables, such as fouls and yellow cards.
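Before keeping only two components, it helps to check how much of the total variance they carry. The sketch below does this on synthetic data: the `toy` columns borrow names from the dataset, but the values are simulated, with a hypothetical `attack` factor linking two of them. It relies only on the `sdev` element returned by `princomp`.

```r
# Simulated data: two correlated "offensive" columns and two unrelated ones
set.seed(2)
n <- 300
attack <- rnorm(n)   # hypothetical latent attacking factor
toy <- data.frame(
  Shots        = attack + rnorm(n, sd = 0.5),
  Corner.kicks = attack + rnorm(n, sd = 0.5),
  Fouls        = rnorm(n),
  Yellow.Cards = rnorm(n)
)

pca <- princomp(toy, cor = TRUE)

# The variance of each component is its squared standard deviation
prop_var <- pca$sdev^2 / sum(pca$sdev^2)

# Cumulative share of variance explained by the first 1, 2, ... components
round(cumsum(prop_var), 2)
```

`summary(pca)` and `screeplot(pca)` present the same information in table and chart form.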


The table below shows the loadings of each component. From it, it is simple to see that the first component rates the team’s performance, while the second one captures how often the referee was involved during the match.


Attribute           Comp.1   Comp.2
Possession            -0.4     0.04
Shots                -0.39     0.21
Corner.kicks         -0.31     0.15
Saves_Opp             -0.3     0.15
Fouls_Opp            -0.15    -0.44
Goals                -0.14     0.06
Yellow.Cards_Opp     -0.13    -0.43
Red.Cards_Opp        -0.12    -0.13
Offsides             -0.08    -0.09
Yellow.Cards          0.03    -0.47
Fouls                 0.06    -0.45
Offsides_Opp          0.08    -0.09
Red.Cards             0.11    -0.16
Goals_Opp             0.16      0.1
Saves                 0.31      0.1
Corner.kicks_Opp      0.33     0.04
Shots_Opp              0.4     0.14


The sign of a principal component is arbitrary: you can flip it as long as you flip the sign of every value in that component. On the first component, a great performance gets a negative value, which is not intuitive, so to make things more natural let’s reverse the direction of the first component.
Now let’s take 4 teams and plot their first component as a time series.
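Why flipping a component is harmless can be checked directly. This is a small sketch on simulated data (the `toy` columns `a`, `b`, `c` are placeholders, not dataset variables): negating the loadings and the scores of the first component together leaves both the reconstructed data and the component’s variance unchanged.

```r
# Simulated data with three placeholder variables
set.seed(3)
toy <- data.frame(a = rnorm(100))
toy$b <- toy$a + rnorm(100, sd = 0.5)
toy$c <- rnorm(100)

pca <- princomp(toy, cor = TRUE)

# Flip the first component: negate its loadings AND its scores together
loadings_flipped <- unclass(pca$loadings)
scores_flipped   <- pca$scores
loadings_flipped[, 1] <- -loadings_flipped[, 1]
scores_flipped[, 1]   <- -scores_flipped[, 1]

# The data rebuilt from scores and loadings is the same either way
rebuilt_original <- pca$scores %*% t(unclass(pca$loadings))
rebuilt_flipped  <- scores_flipped %*% t(loadings_flipped)
all.equal(rebuilt_original, rebuilt_flipped)

# The variance carried by the component does not change either
all.equal(var(pca$scores[, 1]), var(scores_flipped[, 1]))
```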


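#Store the component scores; PC1 is negated so that higher means better
#(assumed step: the original listing does not show how PC1 and PC2 were created)
STATS$PC1 <- -Princomp$scores[,1]
STATS$PC2 <- Princomp$scores[,2]

Teams.example <- c("Granada", 'Barcelona', 'Manchester City', "Stoke City")
Sample <- STATS %>% filter(Team %in% Teams.example)

ggplot(data= Sample, aes(x= Date, y = PC1, color = League)) +
 geom_hline(yintercept= 0, colour ='red') +
 geom_line(show.legend = F) +
 facet_wrap(~Team)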


Time series


From here it is possible to see that teams such as Barcelona or Manchester City consistently get positive ratings on the first component, while smaller teams such as Granada or Stoke City have mixed values (sorry Granada and Stoke City fans, but that’s the truth).

As you can see, the first component rates the performance of a team, which naturally leads to a table that compares the ratings of all the teams. The following table presents the average performance rating of the top teams over the last six years (up to January 2017).


STATS %>% group_by(Team) %>% filter(n()>30) %>% summarise(
 Index.1 = mean(PC1),
 Index.2 = mean(PC2)
 ) %>% arrange(desc(Index.1)) %>%
 select(Team, Index.1) %>%
 data.frame %>% write.csv(row.names =F)


Team                 Performance Rate
Barcelona            2.57
Real Madrid          2.08
Manchester City      1.62
Liverpool            1.33
Tottenham Hotspur    1.20
Chelsea              1.09
Arsenal              1.00
Atletico Madrid      0.88
Manchester United    0.84


If you are a Barca fan you might agree with the table, although if you are a Real Madrid fan you will disagree, just remember that variables such as players’ handsomeness are not included in the analysis.

So far I have focused only on the first component, since it is the one that explains most of the dispersion in the data, but the second component can also provide interesting information. To analyse both components, let’s take the average of each team’s last 15 games to obtain an ‘updated’ rating for both components.
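As a quick illustration of what a trailing average does, here is a base-R sketch on made-up scores. It uses `stats::filter` with `sides = 1` instead of `zoo::rollmeanr` so that it runs without extra packages, and a window of 3 rather than 15 to keep the output readable; for this input the result should match `rollmeanr(pc1, 3, fill = NA)`.

```r
# Hypothetical PC1 scores for seven consecutive matches of one team
pc1 <- c(2, 4, 6, 8, 10, 12, 14)
window <- 3  # the post uses 15

# sides = 1 averages the current value and the (window - 1) values before it,
# so the first (window - 1) positions have no complete window and come out as NA
moving_avg <- as.numeric(stats::filter(pc1, rep(1 / window, window), sides = 1))
moving_avg
# → NA NA 4 6 8 10 12
```

The last value of this series is the ‘updated’ rating: the mean of the most recent matches only.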


#Trailing 15-match average of both components for each team
STATS %<>% group_by(Team) %>% filter(n()>30) %>%
 arrange(Date) %>%
 mutate(
 M.A.PC1 = rollmeanr(PC1, 15, fill = NA ),
 M.A.PC2 = rollmeanr(PC2, 15, fill = NA )
 )

#Keep the most recent value of each team
Index.2016 <- STATS %>% group_by(Team) %>%
 filter(Date == max(Date)) %>%
 select(Team, M.A.PC1, M.A.PC2, Date, League) %>%
 filter(Date>= as.Date('2016-11-01')) %>% arrange(M.A.PC1)

ggplot(Index.2016, aes(x = M.A.PC1, y = M.A.PC2, color = League)) +
 geom_point(show.legend = F) +
 geom_text(aes(label = Team), check_overlap = TRUE, nudge_y = 0.08,
 show.legend = F, size = 2.9) +
 scale_color_manual(values=cbPalette) +
 labs(x = 'PC1', y = 'PC2')




It is strikingly evident that the Premier League teams sit at the top of the chart, while the Spanish teams sit at the bottom. This means that the variables that drive the second component are not equally distributed between the leagues. In terms of the data, Premier League matches have consistently fewer referee interruptions due to fouls and misconduct than La Liga matches.


To confirm this, we can run the following lines to compute the mean of the disciplinary columns for each league.


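Fouls.cols <- grep('Fouls|Cards', names(STATS), value = TRUE) #assumed definition, missing from the original listing
(Fouls.summary <- STATS %>% ddply(.(League), function(x){
 sapply(x[,c(Fouls.cols, 'PC2')], mean)
}))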


As expected, the table below shows that La Liga has higher values on every disciplinary variable, which explains the pattern in the previous plot.


Variable                 La Liga   Premier League
Fouls                    14        11.4
Yellow Cards             2.7       1.8
Red Cards                0.2       0.1
Fouls Opponent           14.2      10.8
Yellow Cards Opponent    2.5       1.5
Red Cards Opponent       0.13      0.06
PC2                      -0.73     0.74


Since these variables vary between leagues, let’s standardize them using each league’s own statistics by running the following lines:


STATS <- read.csv('')

STATS[,-c(1:4)] %<>% sapply(as.numeric)
STATS$Date %<>% as.Date

STATS %<>% filter(variable == 'Home')

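#Tweak scale function so it returns a plain vector instead of a matrix
#(scale.vec is a placeholder name; the original name was lost)
scale.vec <- function(x){ scale(x) %>% c }

#Standardize all variables related with fouls within each league
STATS %<>% group_by(League) %>%
 mutate(across(c(contains("Fouls"), contains('Cards')), scale.vec)) %>%
 ungroup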
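The effect of standardizing within each league can be seen on a tiny made-up example. This sketch uses base R’s `ave` instead of the dplyr pipeline, and the `toy` values are invented: after the transformation each league has mean 0 and standard deviation 1, so a league-wide difference in foul counts no longer shows up.

```r
# Made-up foul counts: La Liga matches sit about 4 fouls above Premier League ones
toy <- data.frame(
  League = rep(c("La Liga", "Premier League"), each = 4),
  Fouls  = c(12, 14, 16, 18, 8, 10, 12, 14)
)

# ave() applies the function within each League and keeps the original row order
toy$Fouls_std <- ave(toy$Fouls, toy$League, FUN = function(x) c(scale(x)))

# Each league now has mean 0 and standard deviation 1
tapply(toy$Fouls_std, toy$League, mean)
tapply(toy$Fouls_std, toy$League, sd)
```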


Once the variables are transformed, I can run the PCA again and redraw the chart shown previously.



Using this plot we can see that the first component (General performance) remains the same, but the second one (Fouls and Misconduct score) changed as expected.

Even though clustering is not the main goal of PCA, it is possible to create groups based on the first two components. As you might have seen in the previous plot, the dotted lines show where the mean of each component lies, and these lines create 4 groups.

  • The Artists. – Teams with good performance and fluent matches, typically teams that have entertaining matches. (Real Madrid, Barcelona, Liverpool)
  • The Street Fighters. – Teams with poor performance and matches that include tons of fouls and referee interruptions (Espanyol, Sunderland, Celta de Vigo)
  • The Rugby Teams. – Teams with good performance and a lot of fouls and interruptions. They might prefer rugby rules. (Manchester United, Chelsea, Real Sociedad)
  • No Fouls, No Goal Teams. – Teams that have bad performances but fluent matches. (Hull City, Villarreal, Everton)
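The four groups above can be assigned with a few comparisons against the component means. The sketch below uses hypothetical teams and scores, not the values behind the plot, and it assumes that a higher PC2 means fewer fouls and interruptions; if the sign of the second component comes out reversed, the comparison on PC2 should be reversed too.

```r
# Hypothetical team-level scores (not the values from the post)
teams <- data.frame(
  Team = c("A", "B", "C", "D"),
  PC1  = c(1.5, 1.0, -0.8, -1.2),
  PC2  = c(0.9, -0.7, 0.6, -1.1)
)

# Split each component at its mean, like the dotted lines in the plot
good_performance <- teams$PC1 > mean(teams$PC1)
fluent_match     <- teams$PC2 > mean(teams$PC2)  # assumption: higher PC2 = fewer interruptions

teams$Group <- ifelse(good_performance & fluent_match,  "The Artists",
               ifelse(good_performance & !fluent_match, "The Rugby Teams",
               ifelse(!good_performance & fluent_match, "No Fouls, No Goal Teams",
                                                        "The Street Fighters")))
teams
```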

Lastly, I would like to point out one curious pattern in the last image: most of the Premier League teams with good performance tend to have a lot of misconduct interruptions, while the opposite happens with La Liga teams. Does this have to do with referees applying the rules differently in each league? Or is it just that team styles vary between these two leagues? Perhaps that could be a topic for a new post.

You can find the github code here.

