An Exploration of Expected Goals
Part I: Introduction and Data Exploration
Here, I will introduce the concept of expected goals (xG) and conduct an exploration of event data. Part II will be centered around constructing a machine-learning model from this event data, while Part III will explore the applications, strengths and deficiencies of this model. For a more thorough walkthrough of the entire process, including reproducible code with step-by-step instructions, please visit my GitHub page.
What is xG?
Results in football, more so than in any other sport, can be greatly influenced by random moments and “luck.” Near misses, deflected shots, goalkeeping errors, and controversial refereeing decisions alone can dictate the final result. Football is a game of inches.
These effects are amplified by the fact that goals are rare events; a match produces 2.5 goals on average. Furthermore, a large majority of matches end in a draw or are decided by just a couple of goals, meaning a single goal can be highly significant to the result of a match.
Luck and randomness can therefore have a notable effect when so many matches are defined by fine margins. This also makes performances difficult to evaluate; is a dogged 1-0 win a product of a deserving performance or a series of fortunate events? Sometimes, this is difficult to evaluate with the naked eye. It is our hope to quantify and qualify performances by eliminating as much randomness as possible when examining a match.
In order to score a goal, you must first attempt a shot at goal. Assessing a performance ten or so years ago would simply entail looking at the total shots and shots on target. While these are useful tools for assessing chance creation, they do not tell the whole story, as not all shots are created equal. There are many factors that influence the likelihood of a shot resulting in a goal.
This is where xG comes into play. xG measures the probability that a shot will result in a goal based on a number of factors. Such factors include the distance from which the shot was taken, the angle with respect to the goal line, the game state (the score at the time), whether it was a header, whether the shot came during a counter-attack, and others. For the purpose of simplicity, our exploration will focus on just three of these factors: distance, angle and whether the shot was a header. We can sum this metric over all the chances in a match to determine how many goals a team should have scored based on the factors included in our model. We can go even further and apply this to a stretch of games, a season or even a manager’s tenure.
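To make the summation concrete, here is a minimal sketch in Python. The function name and the per-shot probabilities are purely illustrative assumptions, not values taken from the Wyscout data or from any fitted model.

```python
def expected_goals(shot_probabilities):
    """Sum per-shot scoring probabilities into a single xG total."""
    for p in shot_probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return sum(shot_probabilities)


# Hypothetical probabilities for four shots by one team in one match.
match_shots = [0.30, 0.05, 0.12, 0.03]
team_xg = expected_goals(match_shots)
print(f"xG from {len(match_shots)} shots: {team_xg:.2f}")
```

The same function works over a longer stretch: pass in every shot from a run of games or a full season instead of a single match.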
xG can therefore serve as a gauge of how potent a team is in attack and how solid they are at the back. It can also be used to analyze a player's ability to create shooting opportunities in dangerous areas and how well they take their chances. In summary, the xG model helps us eliminate a portion of the randomness associated with scoring opportunities when we attempt to quantify a team's ability to score goals, which in the end is the ultimate goal of football.
We will see later that we can use xG to predict future results, guide decisions on player recruitment and evaluate coaching instructions, but first let’s try to explore some data.
Data Exploration
Before we get into building our xG model, we need to consider what sort of data we are interested in. Obviously, we need a large collection of shot data but more importantly we need the data to describe the type of shots that result in goals. We can deduce that the most important factors we need would be the distance from goal when the shot was taken, the angle with respect to the goal and what part of the body the shot was taken with.
Fig 1: Please note that the angle refers to the angle made from the location of the shot and the goal line. Notice how it varies based on position.
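As a rough sketch of how these two features could be computed from a shot location, the Python below assumes the angle is the one subtended by the two goalposts at the point of the shot, which is one common reading of the figure. The coordinate convention (x measured in meters from the goal line, y measured in meters sideways from the center of the goal) and the function name are assumptions made for illustration; the 7.32 m goal width is the standard distance between the posts.

```python
import math

GOAL_WIDTH = 7.32  # meters between the posts


def shot_distance_and_angle(x, y):
    """Return (distance in meters, angle in degrees) for a shot.

    x: distance from the goal line in meters (assumed convention)
    y: sideways offset from the center of the goal in meters
    """
    distance = math.hypot(x, y)
    # Angle subtended by the two posts as seen from the shot location.
    angle = math.atan2(GOAL_WIDTH * x, x**2 + y**2 - (GOAL_WIDTH / 2) ** 2)
    return distance, math.degrees(angle)


# The penalty spot versus a wide position at the same depth.
print(shot_distance_and_angle(11.0, 0.0))
print(shot_distance_and_angle(11.0, 15.0))
```

Under this definition, the angle shrinks both as the shot moves away from goal and as it moves wide, which matches the variation by position shown in the figure.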
Football data is normally split into two forms: event data and tracking data. Event data records all on-ball events and where on the pitch they happened (such as shots, passes, tackles, dribbles), whereas tracking data records the positions of players and the ball throughout the game at regular intervals.
The event data that I will use today comes from Wyscout. It covers all events from all matches across the top 5 domestic leagues in Europe (English Premier League, Ligue 1, Bundesliga, La Liga, Serie A) from the 2017/2018 season.
While some of the findings in this section may seem elementary to those who have an extensive understanding of football, I always believe it is important to test our assumptions, as they can be misleading at times. I think the best place to start is to ask, where do most shots happen on the pitch?
Right away, there are some conclusions we can draw. The distributions suggest that:
A majority of shots happen between 10 and 20 meters.
Shots taken within about 6 meters are quite rare in comparison to shots taken outside 10 meters.
Oddly enough, there is a local trough between 18 meters and about 25 meters.
The angle distribution agrees with the distance distribution in that shots taken from closer (larger angles) are much more difficult to produce.
Just with a simple distribution chart, we can conclude that it is quite difficult to produce shots that are close and central to goal. While we now know how shots are distributed by distance and angle, we have yet to address how shots that result in goals differ from those that do not.
The violin plot above plays a similar role to a box-and-whisker plot but also provides us with the kernel density estimate of the data (essentially a smoothing of the distribution). In splitting up the data by the result of the shot, we can see that on average, shots that result in goals are taken from much closer to goal than shots that do not result in goals. The mean distance of shots resulting in goals is about 12 meters compared to about 18 meters for those that don't bulge the net. Similarly, goals are typically scored from angles of 20 degrees to about 50 degrees.
So while it is difficult to produce shooting chances close to goal, the violin plot suggests that those that are close and central tend to result in goals. There is also substantial overlap of the data which implies that it is difficult to clearly distinguish between goals and misses based on distance and angle alone.
Let's see how headers impact the mean and the distributions.
Headers, as we might expect, are normally taken within the 18-yard box (16.5 m). Interestingly, the means and distributions for goals and misses do not differ by much, so that is something we should consider down the line.
We have gained some fabulous insight through some basic distribution plots but we can take this a step further. We can better visualize how these variables impact the result by plotting the density of shots on a pitch. That is, we would like to split the pitch up into bins, calculate the number of shots taken within each bin and then use a color gradient to visualize how the density differs from bin to bin.
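The binning step described above might look something like the following sketch. The bin counts, the function name and the toy coordinates are assumptions for illustration; the 105 m x 68 m pitch is the generalized size used throughout this exploration.

```python
PITCH_LENGTH = 105.0  # meters (generalized pitch size)
PITCH_WIDTH = 68.0


def bin_shots(shots, n_x=21, n_y=17):
    """Count shots per bin on an n_x by n_y grid covering the pitch.

    shots: iterable of (x, y) positions in meters, assumed to lie on the pitch.
    Returns a list of n_x rows, each holding n_y counts.
    """
    counts = [[0] * n_y for _ in range(n_x)]
    for x, y in shots:
        # Clamp so a shot exactly on the far boundary falls in the last bin.
        i = min(int(x / PITCH_LENGTH * n_x), n_x - 1)
        j = min(int(y / PITCH_WIDTH * n_y), n_y - 1)
        counts[i][j] += 1
    return counts


# Toy coordinates only, not real Wyscout events.
toy_shots = [(94.0, 34.0), (94.5, 34.2), (88.0, 30.0), (99.0, 41.0)]
counts = bin_shots(toy_shots)
print(counts[18][8])  # shots in the 90-95 m by 32-36 m bin
```

With the default grid, each bin is 5 m long and 4 m wide. The resulting counts can then be passed to any heatmap routine to apply the color gradient; numpy.histogram2d performs the same counting on arrays.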
These density plots serve a similar function to the violin plots above but give us a much better visual understanding of which areas of the pitch normally produce shots and goals. This is because we can see how both distance and angle impact the distribution of shots on the same plot.
As we learned with the violin plots:
Shots are seldom taken from either side of the box due to the poor angle.
A majority of the shots are taken around the penalty spot (11m).
Goals are normally scored within 11 meters and within a very narrow passage.
Before we delve deeper, we need to address why there is a sharp decrease in the number of shots on the edge of the box. This could be due to a number of reasons.
We have assumed that all football pitches are 105m x 68m when in reality every pitch has its own unique dimensions. The trough may be a product of this generalization.
Another factor may be due to how Wyscout records their data. There may be an inclination to record a shot near the penalty box line as either inside the box or outside. Therefore it is possible that shots that happen on the line are being mischaracterized.
Lastly, there may be a psychological effect happening to the players. Players may resist the temptation to shoot when just outside the box in order to dribble into the box in the hope of winning a penalty. Defenders tend to defend much more tentatively when the opposition are in the box due to the possibility of a costly foul and it is possible that attackers want to take advantage of this.
There are other possibilities and it is difficult to make any sort of conclusion regarding the trough but for now we will have to live with it.
Now to return to some more data visualization. As you might guess, we can plot a probability density to assess which areas of the pitch have a high probability of a shot resulting in a goal and which do not.
As expected, the closer to goal you shoot from, the more likely the shot is to result in a goal. Notice that there are certain outlier bins in which the probability is very high. This is because very few shots were taken from those areas and those few happened to result in goals. If we had, say, 10 seasons of data, we would see a much more homogeneous probability density.
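One way to see where those outliers come from is to compute the per-bin probability as goals divided by shots. The sketch below does that on a toy 2 x 2 grid; the function name, the min_shots threshold and all of the counts are hypothetical and only serve to illustrate the small-sample effect.

```python
def goal_probability(shot_counts, goal_counts, min_shots=1):
    """Return goals / shots for each bin, or None where shots < min_shots."""
    threshold = max(min_shots, 1)  # never divide by zero
    grid = []
    for shot_row, goal_row in zip(shot_counts, goal_counts):
        grid.append([
            goals / shots if shots >= threshold else None
            for shots, goals in zip(shot_row, goal_row)
        ])
    return grid


# Toy counts: one bin has only 2 shots, and both happened to be goals.
shots = [[40, 2], [0, 10]]
goals = [[12, 2], [0, 1]]

print(goal_probability(shots, goals))
print(goal_probability(shots, goals, min_shots=5))
```

With the default threshold, the two-shot bin shows a probability of 1.0, exactly the kind of outlier described above. Raising min_shots masks such bins instead of trusting them, which is a crude stand-in for having more seasons of data.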
One of the things that is most surprising to those who have never used xG is that the probability of scoring from beyond 11 meters out is in fact lower than we might appreciate when we see a game live. It is for this reason that this sort of analysis is important. We tend to over-estimate the quality of chances, such as shots from outside the box, when in fact these shots are quite difficult and inefficient. So, the next time your favorite player misses from 11 meters out, remember that he only really had a 3 in 10 chance of scoring anyway.
Now what about headers?
While headers exhibit a similar trend to regular shots, they have lower probability values overall. This seems to suggest that while headers happen closer to goal on average, they also represent a much more difficult chance to put away. As we will see down the line, this is an important discovery and one that impacts our interpretation of chance evaluation.
Before we progress into some machine learning, I want to close by trying to gain some insight into these trends. We have hypothesized that both the distance and the angle affect the probability of a shot resulting in a goal and we have seen that with the graphs above. But what is the nature of this relationship? As we move away from the goal, how does the probability of scoring change? We can address these questions as we have done before, with some well-designed graphs.
The first thing that pops out, and it is quite intriguing, is that as we move further away from goal, the probability of scoring appears to fall off roughly exponentially. That is profound because it vastly diminishes the value of shots from distance. So why would this be? Up until now, we have ignored the fact that the angle with the goal decreases as we move away from the goal. So we have this sort of 'doubling factor' for the distance. We can hypothesize that this is because, as we increase the distance a shot is taken from, the ball not only has further to travel but the target also becomes smaller.
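The shrinking target can be illustrated with a little trigonometry. The sketch below assumes a shot taken from directly in front of the center of the goal and treats the angle as the one subtended by the two posts; the function name is illustrative and the 7.32 m goal width is the standard distance between the posts.

```python
import math

GOAL_WIDTH = 7.32  # meters between the posts


def central_angle_deg(distance):
    """Angle (degrees) subtended by the goal for a central shot."""
    return math.degrees(2 * math.atan((GOAL_WIDTH / 2) / distance))


for d in (6, 11, 16.5, 25):
    print(f"{d:>5} m -> {central_angle_deg(d):.1f} degrees")
```

The angle drops steadily as the distance grows, so a shot from range is penalized twice: once for the extra distance and once for the narrower window to aim at. This geometric effect is only part of the hypothesis above and does not by itself establish the shape of the probability curve.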
Fantastic! Now we have seen the power of data visualization and how even the simplest graphs can help us discern the information locked behind large datasets. It is for this reason that spending time and effort on data exploration is so important. This will serve as a good foundation when we move onto some machine learning in the next part. See you soon!