Since the English conceived football (and probably even before), different ideas and styles have been created around the game, mainly with one purpose: being the best. Looking at the recent history of soccer, two almost opposite styles have become quite popular and successful. One is Catenaccio, an Italian idea that basically consists of defending your own net (with 3 goalkeepers and 13 defenders, if the rules allowed it) without having the ball, while one guy alone at the other end of the pitch tries to convert the few chances that destiny and luck give him. The other is tiki-taka, popularised in the late 2000s by Pep Guardiola during his time as Barcelona coach, which basically consists of monopolising ball possession.
These two styles have been tested against each other many times. The one I remember most is the 2010 Champions League semi-final between Jose Mourinho’s Inter and Pep Guardiola’s Barcelona. In those years everyone wanted to play like Barcelona; coaches were obsessed with possession, thinking that the perfect strategy had been created. That time Catenaccio won: Mourinho found a way to beat tiki-taka.
Despite this and other similar matchups in which the strategy that neglects ball possession wins, nobody denies that possession is extremely important in soccer. The question is: how important? What is the real advantage of possession?
To find an answer, I decided to analyse data from 2630 matches of the Premier League and La Liga.
Possession is recorded as a percentage, one for the home team and one for the away team. In this analysis, when I refer to possession I mean the possession percentage of the home team, since the possession of the away team is simply 100 minus the home team's possession.
As expected, home team advantage also shows up in possession: the mean and the median are almost the same, at around 52%. The histogram below makes this visible. Although the data is not normally distributed (it fails normality tests, as almost every big data set does), it has almost a ‘textbook behaviour’, which is why I show it in the analysis.
The figure below shows the correlation between the variables (H means home team and A away team); a darker color represents a stronger correlation. Ignoring the obvious correlations (goals and results), the ones that stand out are between possession and shots on goal, for both the home team and the away team.
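For readers who want to see what sits behind each cell of a correlation figure like this one, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The function name `pearson` and the five toy rows are assumptions made up for illustration; they are not taken from the real data set or from the linked repository, which may well compute the matrix differently.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up toy rows, NOT real matches:
# (home possession %, home shots on goal, away shots on goal)
toy_matches = [(62, 8, 2), (55, 6, 4), (48, 5, 5), (41, 3, 6), (35, 2, 7)]
home_possession = [m[0] for m in toy_matches]
home_shots = [m[1] for m in toy_matches]
away_shots = [m[2] for m in toy_matches]

# In these toy rows more possession goes with more home shots
# and with fewer away shots, so the two signs are opposite.
r_home = pearson(home_possession, home_shots)
r_away = pearson(home_possession, away_shots)
print(r_home > 0, r_away < 0)
```

A value close to 1 or -1 would appear as a dark cell in the figure, and a value close to 0 as a light one.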
The scatterplot of shots on goal for the home team against possession is shown in the figure below. Since several observations share the same values, the color of each point shows how often that value appears in the data.
The missing part of the previous plot is the shots on goal of the away team. To include all 3 variables, the following figure shows on the y axis the difference between the shots made by the home team and by the away team. The scatterplot is similar, but with less dispersion.
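The two derived quantities are simple enough to write down. The sketch below assumes a hypothetical `Match` record with three fields; the field names and the example values are invented for illustration and do not come from the repository.

```python
from collections import namedtuple

# Hypothetical record layout, assumed for this sketch only.
Match = namedtuple("Match", "home_possession home_shots away_shots")


def shot_difference(match):
    """Home shots on goal minus away shots on goal (the y axis)."""
    return match.home_shots - match.away_shots


def possession_difference(match):
    """Home possession minus away possession, in percentage points."""
    away_possession = 100 - match.home_possession
    return match.home_possession - away_possession


# A made-up match, not a real observation.
example = Match(home_possession=60, home_shots=9, away_shots=3)
print(shot_difference(example), possession_difference(example))
```

Because the away possession is always 100 minus the home possession, the possession difference is just twice the home possession minus 100, so no information is lost by plotting only the home value.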
Fitting a linear regression, set up so that its coefficients are easier to interpret, it is possible to say that home team advantage alone gives your team 2.2 more shots than your opponent, and every percentage point of possession you have over your opponent gives you 0.22 more shots on goal.
For example, if you are a home team with 60% of ball possession (that is, 20 percentage points more than your opponent), then you will have around 7 more shots on goal than your opponent (2.2 + 0.22 × 20 = 6.6).
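The regression and the worked example above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code from the repository: `fit_line` is a hand-written ordinary least squares fit, and the toy points are deliberately placed exactly on the reported line (intercept 2.2, slope 0.22), so the fit only shows the mechanics and says nothing new about the real data.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Coefficients reported in the post.
REPORTED_INTERCEPT = 2.2   # home advantage, in shots
REPORTED_SLOPE = 0.22      # extra shots per percentage point of possession

# Toy points lying exactly on the reported line; NOT real matches.
possession_diffs = [-30, -10, 0, 10, 20]   # home % minus away %
shot_diffs = [REPORTED_INTERCEPT + REPORTED_SLOPE * d for d in possession_diffs]

intercept, slope = fit_line(possession_diffs, shot_diffs)


def expected_shot_difference(home_possession_pct):
    """Expected home-minus-away shots for a given home possession."""
    possession_diff = 2 * home_possession_pct - 100
    return intercept + slope * possession_diff


print(round(expected_shot_difference(60), 1))
```

With 60% possession the function returns about 6.6, the "around 7" of the example, and with 50% it returns the 2.2 shots of pure home advantage.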
A high ball possession percentage means more opportunities to score than your rival. The quality of those opportunities seems to be independent of possession, but I will leave that for another post.
If you want to check the code, here is the GitHub link: https://github.com/fconick/fcoStartistician