Which teams are leading the set piece leaderboards?

I don’t think I have to tell you that I love set pieces. The whole reason I started looking more at data metrics and models was that I felt there were so few metrics concerning set pieces. I think that’s still the case, but we are getting there, one day at a time.


All’s Well That Ends Well… With a Long Throw-in

I have been waiting for an excuse to look deeper into long throw-ins for a while now, but I couldn’t find the right hook. Now it’s all over the place, and I know my research is far from innovative, but I do still think there are so many interesting things we could talk about. Long throw-ins seem to be emerging in the Premier League, but is this a league-specific trend, or can we see a similar development in the other top leagues in Europe? Let me take you on a wonderful trip through long throw-ins.


Scraping Soccerdonna

The transfer window is coming to a close in women’s football, and with that I wanted to express the importance of data. This time I don’t want to talk as much about performance data, because I don’t want to measure how well or badly a player performs. I want to gather data that’s important for building a portfolio for recruitment. In short, I want biographical data.


Expected Danger: modelling a hybrid shot model with emphasis on blocked shots

It’s been about two years since I first started writing about data away from player or team insight. I started with the Corner Delivery-model and two years later I have designed 20+ metrics, scores, indexes and models. They all took a variety of work, effort and hours – but in the end, they all worked and took some time to develop. Some need time to brew and cook up, effectively shelving them for another time.

There is one model that has come off the shelf and has been put back up there repeatedly in my brain. But I think now is the time for me to properly launch it into the world. Not only because I think it’s a decent model after a long time of deliberation and changes, but also because I truly believe in the process. The process is where we learn; it’s where our creativity comes to the surface and converts into meaningful data engineering.

The model I’m talking about is the expected shot danger model.

Introduction

There are always two reasons for me to publish models. First of all, and this is rather selfish, I’m really proud of myself for finishing a long project. Due to my mental disorder, limitations are often put on what I can produce, but with football, this seems to be different. The second one, perhaps more relevant for you who are reading this, is that I think this provides meaningful content for you to read, and a model you can use in your own endeavours.

It’s a question you might have and also one I wanted to answer for myself: why do we need this model when there already exists a quite comprehensive and complex model that values shots in expected goals?

Excellent question, I might add. Expected goals looks at the quality of a chance and calculates the probability that a certain shot will be converted into a goal. This is based on historical data. This is something I really like, but it feels very focused on goalscoring and the probability of that happening. I want to measure threat and/or danger based on location, and I found that expected goals fell short of how I wanted to approach it.

Threat? Is there not another model that deals with threat or expected threat? Yes, there is. Karun Singh developed the expected threat model (xT) a few years back, and it has been instrumental in how we approach bin count values with progression towards the goal. He can explain it much better than I can, so go read his full article here:

https://karun.in/blog/expected-threat.html

These are excellent models that will help going forward, but there are some fundamental elements missing in combining threat with shots, so that’s why I decided to create my own model.

Aim

So why do I do this, and what’s my aim? My aim is to create a shot-based value model that is a hybrid between expected goals and expected threat that seeks to measure outcome danger whilst including blocked shots.

The Expected Danger model is designed to overcome a key limitation of existing football analytics models. It combines the strengths of two popular metrics:

  • Expected Goals (xG): This model calculates the probability of a shot resulting in a goal based on factors like shot location, body part used, and play type. Its primary focus is on the final outcome: a goal. However, it assigns a value of zero to shots that are blocked, even if the shot was taken from a high-danger location.
  • Expected Threat (xT): This model measures the change in a team’s probability of scoring when the ball moves from one location to another. It values passes and carries that move the ball into more dangerous areas, but it doesn’t directly value the shot itself.

In short, this new model gives us a better idea of actual shot danger, because blocked shots are included in it.

Theoretical framework

There are a few theoretical concepts included in this model that we need to have a look at. Two variables of the model come directly from ice hockey analytics: Corsi and Fenwick. Inside the Rink (article here) describes them as follows:

“What are Corsi and Fenwick? The answer may feel underwhelming. Despite the tendency of the hockey community to refer to them as “advanced stats” Corsi and Fenwick are very simple concepts. Corsi is nothing more than another name for shot attempts. NHL.com defines shot attempts (Corsi) as: any time a player tries to shoot the puck. So, any shot on goal, blocked shot or missed shot is classified as a shot attempt. The name Corsi was given to the stat by a hockey blogger because he liked Jim Corsi’s moustache (Jim was the goalie coach of the Buffalo Sabres at the time).”

“Fenwick is Corsi without the blocked shots. So only missed shots or shots on goal are counted as a Fenwick shot. Compared with Corsi, Fenwick is barely used in practice. We will spend at least one future article on the uses of Fenwick, but for the rest of this piece, we will focus solely on Corsi. However, all the concepts we will discuss with respect to Corsi can also be applied to Fenwick.”

It’s not a wildly innovative set of metrics, not at all, but I like the distinction between blocked shots and unblocked shots in the metrics. Because of how I value blocked shots in this model, I will refer to Corsi and Fenwick danger in the expected models as well.

Mathematical approach

There are three different mathematical approaches towards this model that I employ. These form the basis of how my model is calculated.

  1. Rule-based heuristics: Rule-based heuristics are simple decision-making methods that rely on predefined rules to solve problems or guide actions. Instead of complex models or deep calculations, they use “if–then” logic or straightforward guidelines derived from experience or domain knowledge. These heuristics are fast, interpretable, and easy to apply, but they can oversimplify complex situations and may not adapt well when conditions change.
  2. Shot-type weights: Shot-type weights in football are empirical values assigned to different kinds of shots to reflect their likelihood of resulting in a goal. Instead of treating all attempts equally, analysts use historical data to assign probabilities. These weights are built from large samples of past matches, capturing how effective each shot type usually is. They form the backbone of metrics like Expected Goals (xG), though they simplify reality by not fully accounting for defender pressure or game context.
  3. Supervised learning with logistic regression: In football analytics, supervised learning with logistic regression calibrated to real goal outcomes is a method for estimating the probability that a shot results in a goal. Historical data on shots is collected with features such as distance, angle, body part, shot type, and defensive pressure. A logistic regression model is trained on this data, using actual goal outcomes (goal = 1, miss = 0) as labels. The model learns how each feature influences scoring likelihood and outputs a probability between 0 and 1. Calibrating the model against real match outcomes provides a more accurate estimate than simple shot-type weights.

These form the basis for our data approach as we look to calculate the new model.


How are we going to transform the raw event data into meaningful danger scores? First, we are going to load the data from a series of JSON files. I want all games played in the Dutch Eredivisie 2024-2025, and I want the JSON files flattened so we get a workable dataframe we can use in Python. I am an avid Python user, but of course you can also use other programming languages that can do similar calculations. From the dataframe, I will only look at the features/variables I need to work with:

  • Player ID, team ID, coordinates (xy), time
  • Event type (goal, on target, blocked, etc.)
  • Qualifiers (body part, assist, play type)
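
The loading and flattening step could look roughly like the sketch below. It is a minimal illustration, not the exact pipeline used here: the top-level `event` key, the column names in `KEEP` and the one-file-per-match layout are all assumptions about the feed, so adjust them to whatever your provider delivers.

```python
import json
from pathlib import Path

import pandas as pd

# Hypothetical column names; rename these to match your own event feed.
KEEP = ["match_file", "playerId", "contestantId", "typeId",
        "x", "y", "timeMin", "timeSec", "qualifier"]


def load_events(folder: str) -> pd.DataFrame:
    """Read every JSON file in `folder` and flatten its events into one dataframe."""
    frames = []
    for path in sorted(Path(folder).glob("*.json")):
        with open(path, encoding="utf-8") as fh:
            match = json.load(fh)
        # Assumption: each file holds a top-level "event" list of event dicts.
        events = pd.json_normalize(match.get("event", []), sep="_")
        events["match_file"] = path.stem  # remember which match a row came from
        frames.append(events)
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)


def select_columns(events: pd.DataFrame, keep=KEEP) -> pd.DataFrame:
    """Keep only the listed columns that are actually present."""
    return events[[col for col in keep if col in events.columns]]
```

Calling `select_columns(load_events("data/eredivisie"))` (the folder name is made up) would then give one row per event with only the variables listed above.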

The next step is to process the data, which means feature engineering. We map typeId to shot categories: typeId 13, 14, 15 and 16 become wide shot, shot on post, shot on target and goal, respectively. We then extract two groups of features:

  1. Positional features: Distance to goal (Euclidean), Shot angle relative to goalposts
  2. Derived contextual features: Shot body part (foot/head/other), Play type (open play, set piece, counter, penalty), Assist information
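
As a rough sketch of the positional features, the snippet below maps the typeIds to shot categories and derives distance and angle. Several things here are my assumptions rather than facts from the data: coordinates run from 0 to 100 on both axes with the attacked goal centred at (100, 50), the pitch is 105 by 68 metres, and the angle is measured in degrees away from the central axis (0 is straight in front of goal, 90 is on the byline).

```python
import numpy as np
import pandas as pd

# typeId -> shot category, as described in the text.
SHOT_CATEGORIES = {13: "wide shot", 14: "shot on post", 15: "shot on target", 16: "goal"}

# Assumed pitch dimensions in metres, used to rescale 0-100 coordinates.
PITCH_LENGTH_M, PITCH_WIDTH_M = 105.0, 68.0


def add_shot_features(events: pd.DataFrame) -> pd.DataFrame:
    """Filter shot events and add category, distance and angle columns."""
    shots = events[events["typeId"].isin(list(SHOT_CATEGORIES))].copy()
    shots["shot_category"] = shots["typeId"].map(SHOT_CATEGORIES)

    # Offsets from the centre of the goal at (100, 50), converted to metres.
    dx = (100.0 - shots["x"]) / 100.0 * PITCH_LENGTH_M
    dy = (shots["y"] - 50.0).abs() / 100.0 * PITCH_WIDTH_M

    shots["distance"] = np.hypot(dx, dy)               # Euclidean distance
    shots["angle"] = np.degrees(np.arctan2(dy, dx))    # 0 = central, 90 = byline
    return shots
```

Other angle definitions exist, such as the opening angle between the two posts; I chose this one only because it runs from 0 to 90 degrees.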

We move to the heuristic model. We implemented two possession-based proxies and one geometric danger model, inspired by hockey analytics:

  • Corsi: all shot attempts per team, ratio over match total.
  • Fenwick: unblocked shot attempts per team, ratio over match total.
  • Heuristic danger score (Fenwick-style): combines location geometry into a simple rating: $\text{Score}_{\text{fenwick}} = \left(0.5\left(1 - \frac{d}{100}\right) + 0.5\left(1 - \frac{\theta}{90}\right)\right) \times 100$, where $d$ = shot distance and $\theta$ = shot angle.

Additionally, we applied shot-type weighting (goals > shots on target > misses > blocked).
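
To make the heuristic layer concrete, here is a small sketch of the shot shares and the geometric score. The column names (`match_id`, `team_id`, `is_blocked`, `outcome`) are hypothetical, and the weight values are placeholders that only respect the ordering goals > shots on target > misses > blocked; the real weights are not given here. Multiplying the score by the weight is also just one possible way of applying it.

```python
import pandas as pd

# Placeholder weights: only the ordering is taken from the text.
SHOT_TYPE_WEIGHTS = {"goal": 1.0, "on_target": 0.75, "miss": 0.5, "blocked": 0.25}


def heuristic_danger(distance, angle):
    """Score = (0.5 * (1 - d / 100) + 0.5 * (1 - theta / 90)) * 100."""
    return (0.5 * (1 - distance / 100) + 0.5 * (1 - angle / 90)) * 100


def shot_share(shots: pd.DataFrame, include_blocked: bool) -> pd.Series:
    """Share of a match's shot attempts per team.

    include_blocked=True  -> Corsi (all attempts)
    include_blocked=False -> Fenwick (unblocked attempts only)
    """
    pool = shots if include_blocked else shots[~shots["is_blocked"]]
    counts = pool.groupby(["match_id", "team_id"]).size()
    return counts / counts.groupby(level="match_id").transform("sum")


def weighted_danger(shots: pd.DataFrame) -> pd.Series:
    """Heuristic score scaled by the (assumed) shot-type weight."""
    base = heuristic_danger(shots["distance"], shots["angle"])
    return base * shots["outcome"].map(SHOT_TYPE_WEIGHTS)
```

For example, a shot from distance 10 at angle 0 scores (0.5 × 0.9 + 0.5 × 1) × 100 = 95 before any weighting.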

After this follows the statistical model:

  • Input features: categorical (shot type, body part, play type), numerical (distance, angle, assist flag, etc.).
  • Model choice: Logistic regression.
  • Calibration: Applied Platt scaling (sigmoid) via CalibratedClassifierCV to ensure predicted probabilities are well-calibrated with observed frequencies.

This produces a probabilistic estimate of shot danger per shot.

Analysis 1: Expected Danger Fenwick + Corsi

In the image above, you can see two pitches with shots plotted. On the left side, you see the shots with the Fenwick Danger Score, which covers all unblocked shots. The average danger per shot is 0.58.

On the second pitch, you see all shots with the Corsi Danger Score. The average per shot is 0.66, a notable difference of 0.08 per shot. We can conclude from this that blocked shots do add to the danger of shooting, but they are not always considered in traditional shot models.

The main thing we can take from this is that when we incorporate blocked shots in our calculations, the danger of a shot is higher because of the potential it stands for. Every shot can have both danger values, but to capture the danger being generated on the pitch, it’s essential that we don’t disregard or ignore blocked shots.

Analysis 2: Expected Danger Blocked Shots

In the shotmap above, we see all shots that have been blocked. We use the Corsi model, as it deals with blocked shots as well. The average danger score is 0.6, and what I think is quite interesting is that the blocked shots with the highest values come from the central areas.

Analysis 3: Expected Danger prevented blocked shots

We have shot locations of blocked shots, but we also have the xy-coordinates of where the shots have been blocked by the opposition. We can use that to calculate the danger prevented by blocking shots using our expected danger model.

As we can see on the pitch, the average danger prevented per blocked shot is 0.35. We can use this kind of information to evaluate opposition defending, but also to evaluate our own players and how much danger they prevent apart from disrupting passes.

Example 1: Expected Danger Build-up + Chains (passmaps + passnetworks)

Now we have seen how the model can work and what the danger can represent, how can we use this in analysis? In other words, how can we make this a bit more actionable?

We are quite familiar with passing networks in football, but we can also create build-up and chains from expected danger, and that’s what we will do.

In the image above, you can see the season passing network of PSV in the Eredivisie 2025-2025. The size of the nodes corresponds with the value of the expected danger build-up score. The same goes for the colour. What we can see is that in the expected danger build-up, Dest, Veerman and Saibari score very high.

In the bar graph above, you can see the players in the Eredivisie 2025-2026 who have the highest xDanger build-up after three matches (clubs playing European qualifiers have a game less). Karouani, Smits and Sano score really high.

Example 2: Shotmaps with values Expected Danger Shots

What we can do with all this information is create a shotmap, just like we do with expected goals. You can see all shots by Berghuis (Ajax) and their value. He has shot 14 times and scored 1 goal. Five shots were blocked. His average danger was 0.61 for unblocked shots and 0.68 when including blocked shots.

We also look at the average Location Quality Score, which measures how optimal his shot location is in relation to his expected danger.

Challenges

There are numerous challenges in creating this model, and they are the reason it is not entirely flawless:

  • Model Limitations: I train a logistic regression to estimate xDanger, but realise it sometimes overestimates danger for simple shots and underestimates complex plays. It ignores defensive pressure and positioning nuances. Scaling the scores across matches adds another layer of inconsistency.
  • Interpretability: Even after calculating the xDanger, I might struggle to explain why a particular action has a certain danger score. Combining distance, angle, and shot type into a single number obscures the reasoning. Plus, a 0.6 xDanger doesn’t directly translate to a probability of scoring unless carefully calibrated.

Final Thoughts

I wanted to create a hybrid between expected goals and expected threat that focuses on shots, with a split between Fenwick and Corsi models. I believe I have succeeded in that.

It is important to know that this is one of the first iterations of the model and that I need to keep evolving it with defensive pressure, off-ball data and tracking data. Will it be an alternative to expected goals? Oh hell no. To me, it’s an extra source of information with which we can measure the danger of shots while also incorporating blocked shots.

Introducing the Goalkeeper Value Model (GVM)

I am a very chaotic person by nature, and I need structure to make sure everything works right. This makes me thrive. But when the structure isn’t good enough, it happens that projects stay on the shelf for a long time. One of these projects was the Goalkeeper Value Model (GVM), but not anymore! I finished it and I’m very happy with it.
