2-0 Advantage in European soccer leagues (Part 1)
A few weeks ago a friend of mine sent me an interesting message on X that went more or less like this:
“I feel like 2-0 is the most unstable result. If a team goes up and starts to miss chances they often get punished. Would be interesting to do some investigation on the frequency the team ends up losing or drawing after being 2-0 up?”
He didn’t need to ask twice. I started looking at it over the weekend and ended up with what I think is an interesting piece of analysis for soccer bettors and fans alike.
The analysis is pretty long, so I split it into two parts. In this part we will gather the data and check how often a team throws away a 2-0 advantage. In the second part we will see how to exploit it in a betting strategy.
Let’s dive into it. Feel free to skip the technical details, but if you want to get the most out of it I suggest you try to replicate my findings.
Getting the data
The idea of the analysis is to select all matches where a team loses or draws after taking a 2-0 lead, and to understand whether there are match signals and statistics that can predict the team eventually losing that lead. For this reason I looked into match data using the Understat API. First step, let’s load the packages.
import asyncio
import json
import aiohttp
from understat import Understat
import pandas as pd
This is the usual stuff, plus asyncio and aiohttp, which we will use to make async requests to the Understat API (as recommended by them).
The first thing we want to do is to get as many matches as possible to analyze. Together with it, we want to get the results, so that later we will be able to only select the interesting ones. For example, a match that ends up in a 0-0 is not interesting for our analysis.
# get all results of all leagues
# get the results in the last 5 seasons
seasons = [2019, 2020, 2021, 2022, 2023]
leagues = ['epl', 'la_liga', 'serie_a',
           'bundesliga', 'ligue_1']
all_matches = []
async def main():
    async with aiohttp.ClientSession() as session:
        understat = Understat(session)
        for league in leagues:
            for season in seasons:
                fixtures = await understat.get_league_results(
                    league,
                    season)
                all_matches.extend(fixtures)
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
The code above loops over the last 5 seasons (from 2019/20 to 2023/24) and the 5 most competitive European leagues (Premier League, La Liga, Serie A, Bundesliga, and Ligue 1). In the loop we just call the get_league_results method to get the results of all matches that have been played in that season and league. We then save them all in a list called all_matches.
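On recent Python versions the same league-by-season loop can also be driven with asyncio.run, which creates and closes the event loop for us. The snippet below is only a sketch of that pattern: fetch_results is a hypothetical stand-in for the get_league_results call, returning one fake match per league and season so that it runs without network access.

```python
import asyncio

leagues = ['epl', 'la_liga', 'serie_a', 'bundesliga', 'ligue_1']
seasons = [2019, 2020, 2021, 2022, 2023]


async def fetch_results(league, season):
    # Hypothetical stand-in for understat.get_league_results(league, season):
    # returns a single fake match so the loop can be run offline.
    await asyncio.sleep(0)
    return [{'id': f'{league}-{season}', 'isResult': True}]


async def collect_all():
    # Same nested loop as in the article, but returning the list
    # instead of filling a global one.
    matches = []
    for league in leagues:
        for season in seasons:
            matches.extend(await fetch_results(league, season))
    return matches


all_matches_demo = asyncio.run(collect_all())
print(len(all_matches_demo))  # 25: one fake match per league and season
```

Inside a Jupyter notebook an event loop is already running, so there the usual alternative is to write `await collect_all()` directly in a cell.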
Once the code above has finished running, we can explore the result.
# check how many matches we have collected
print(f"Matches: {len(all_matches)}")
# check the 1st match
print(f"First match:\n{all_matches[0]}")
We can see that we have collected 8955 matches, and the elements of all_matches look something like this:
{'id': '11643',
'isResult': True,
'h': {'id': '87', 'title': 'Liverpool', 'short_title': 'LIV'},
'a': {'id': '79', 'title': 'Norwich', 'short_title': 'NOR'},
'goals': {'h': '4', 'a': '1'},
'xG': {'h': '2.23456', 'a': '0.842407'},
'datetime': '2019-08-09 20:00:00',
'forecast': {'w': '0.7377', 'd': '0.1732', 'l': '0.0891'}}
This is great. We have the teams, the result and the match id, which we will use to get the full statistics about that match.
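Note that every value in the dictionary is a string, including goals and xG. As a small sketch (flatten_match is a helper name made up for this illustration, not part of the Understat package), this is one way to turn a result into a flat record with numeric types, using the Liverpool match shown above:

```python
# One result dictionary, copied from the output above (forecast omitted)
match = {
    'id': '11643',
    'isResult': True,
    'h': {'id': '87', 'title': 'Liverpool', 'short_title': 'LIV'},
    'a': {'id': '79', 'title': 'Norwich', 'short_title': 'NOR'},
    'goals': {'h': '4', 'a': '1'},
    'xG': {'h': '2.23456', 'a': '0.842407'},
    'datetime': '2019-08-09 20:00:00',
}


def flatten_match(match):
    """Convert one result dictionary into a flat record with numeric types."""
    return {
        'match_id': match['id'],
        'home': match['h']['title'],
        'away': match['a']['title'],
        'home_goals': int(match['goals']['h']),
        'away_goals': int(match['goals']['a']),
        'home_xg': float(match['xG']['h']),
        'away_xg': float(match['xG']['a']),
        'datetime': match['datetime'],
    }


record = flatten_match(match)
print(record['home'], record['home_goals'], record['away_goals'], record['away'])
# Liverpool 4 1 Norwich
```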
Filter the data
Before diving into the match statistics, let’s filter the interesting results first. Since it takes time to get the statistics for all those matches, and we are only interested in the ones where one of the teams was leading 2-0 at some point, we can build a filter like this.
# filter the fixtures where at least one team has >= 2 goals
match_ids_filter = []
for match in all_matches:
    if int(match['goals']['h']) >= 2 or int(match['goals']['a']) >= 2:
        match_ids_filter.append((match['id'], match['datetime']))
len(match_ids_filter)
Here we save in a new list, match_ids_filter, only the matches where either the home team or the away team has scored at least 2 goals. It’s not a perfect filter, but it takes out a lot of low-scoring matches. Also, it’s the best that we can do at this stage since we only know the final result.
We should have gone from 8955 to 5846 matches.
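To see why the filter is only a first pass, here is a small sketch on four made-up results (the ids and scores are invented for illustration). A 2-1 final score passes the filter even though the match may have gone 1-0, 1-1, 2-1 and never been at 2-0, which is why the shot-by-shot check is still needed later.

```python
# Four invented final results, in the same shape as the Understat output
sample = [
    {'id': '1', 'goals': {'h': '0', 'a': '0'}},  # dropped
    {'id': '2', 'goals': {'h': '1', 'a': '1'}},  # dropped
    {'id': '3', 'goals': {'h': '2', 'a': '1'}},  # kept, but may never have been 2-0
    {'id': '4', 'goals': {'h': '0', 'a': '3'}},  # kept, must have passed through 0-2
]


def may_have_been_two_nil(match):
    """Necessary (not sufficient) condition for a 2-0 scoreline at some point."""
    return int(match['goals']['h']) >= 2 or int(match['goals']['a']) >= 2


kept = [m['id'] for m in sample if may_have_been_two_nil(m)]
print(kept)  # ['3', '4']
```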
At this point it’s a good idea to save those match ids in a file, in case we want to retrieve them back separately later on.
# save the filtered matches in a csv file
df = pd.DataFrame(match_ids_filter, columns=['match_id', 'date'])
df.to_csv('matches_20_ids.csv')
# read back the match ids
match_ids_filter = pd.read_csv('matches_20_ids.csv')['match_id'].values
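One detail to keep in mind: the ids come from the API as strings, but by default read_csv parses them back as integers. If you prefer to keep them as strings and avoid the extra index column in the file, a possible variation looks like the sketch below. It writes to an in-memory buffer instead of a file, and the two rows are just the example matches seen earlier.

```python
import io

import pandas as pd

# Two (match id, datetime) pairs taken from the examples in this article
match_ids_demo = [('11643', '2019-08-09 20:00:00'),
                  ('11652', '2019-08-11 16:30:00')]
df_ids = pd.DataFrame(match_ids_demo, columns=['match_id', 'date'])

# Write without the index column (in-memory buffer used instead of a file)
buffer = io.StringIO()
df_ids.to_csv(buffer, index=False)
buffer.seek(0)

# Read back, forcing the ids to stay strings
restored = pd.read_csv(buffer, dtype={'match_id': str})
ids = restored['match_id'].tolist()
print(ids)  # ['11643', '11652']
```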
Extract the match data
Now we can get our hands on the shots. Let’s test the API first with a single call (rather than all 6,000 or so at once) to understand what is inside the match statistics.
# check the match shots API
test_shots = []
match_id = match_ids_filter[5]
async def main():
    async with aiohttp.ClientSession() as session:
        understat = Understat(session)
        print(match_id)
        shots = await understat.get_match_shots(match_id)
        test_shots.append(shots)
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
test_shots
Here we select a single match id (the 6th of the list), call get_match_shots with the match_id as input, and save the shots in the test_shots list. Once the cell has finished running, we can verify that test_shots looks like this:
[{'h': [{'id': '310295',
'minute': '6',
'result': 'SavedShot',
'X': '0.8280000305175781',
'Y': '0.639000015258789',
'xG': '0.04247729852795601',
'player': 'Anthony Martial',
'h_a': 'h',
'player_id': '553',
'situation': 'OpenPlay',
'season': '2019',
'shotType': 'RightFoot',
'match_id': '11652',
'h_team': 'Manchester United',
'a_team': 'Chelsea',
'h_goals': '4',
'a_goals': '0',
'date': '2019-08-11 16:30:00',
'player_assisted': None,
'lastAction': 'None'},
...]
}]
We have a dictionary with 2 keys, h and a, which represent the home and away shots. Each key maps to a list of all the shots taken in the match by the home (or away) team. In the case above we can see that Martial took a shot in the 6th minute, with an xG of 0.04, and the shot was saved.
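As a quick sketch of how this structure can be explored, the snippet below merges the two lists into a single timeline sorted by minute and counts the shot outcomes per side. The four shots are invented and trimmed to the fields used here; only the field names and result labels follow the output above.

```python
from collections import Counter

# Invented shots, keeping only the fields used in this sketch
shots = {
    'h': [
        {'minute': '6', 'result': 'SavedShot', 'xG': '0.04', 'h_a': 'h'},
        {'minute': '18', 'result': 'Goal', 'xG': '0.76', 'h_a': 'h'},
    ],
    'a': [
        {'minute': '11', 'result': 'MissedShots', 'xG': '0.10', 'h_a': 'a'},
        {'minute': '9', 'result': 'BlockedShot', 'xG': '0.05', 'h_a': 'a'},
    ],
}

# Minutes are strings, so convert them before sorting
timeline = sorted(shots['h'] + shots['a'], key=lambda s: int(s['minute']))
minutes = [int(s['minute']) for s in timeline]

# Count outcomes per side, e.g. ('h', 'Goal') -> 1
results_by_side = Counter((s['h_a'], s['result']) for s in timeline)

# Total xG of the home team
xg_home = sum(float(s['xG']) for s in shots['h'])

print(minutes)  # [6, 9, 11, 18]
```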
Let’s now run this for all matches by adding a for loop to the code above.
# loop on all filtered matches and check the stats
match_shots = []
async def main():
    async with aiohttp.ClientSession() as session:
        understat = Understat(session)
        for match_id in match_ids_filter:
            print(f"Analyzing {match_id}...")
            shots = await understat.get_match_shots(match_id)
            match_shots.append(shots)
loop = asyncio.get_event_loop()
loop.run_until_complete(main())
This will take a few minutes to run.
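The loop above waits for each request to finish before sending the next one. If the waiting time becomes a problem, one option is to run a few requests at the same time while capping how many are in flight. The sketch below shows that pattern with asyncio.Semaphore and asyncio.gather. It is an assumption on my side that this is appropriate for the site, so keep the limit low; fetch_shots is a hypothetical stand-in for get_match_shots so the snippet runs offline.

```python
import asyncio


async def fetch_shots(match_id):
    # Hypothetical stand-in for understat.get_match_shots(match_id)
    await asyncio.sleep(0.01)
    return {'match_id': match_id, 'h': [], 'a': []}


async def fetch_many(match_ids, max_concurrent=5):
    # The semaphore caps the number of requests running at the same time
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_one(match_id):
        async with semaphore:
            return await fetch_shots(match_id)

    # gather returns the results in the same order as the inputs
    return await asyncio.gather(*(fetch_one(m) for m in match_ids))


demo_shots = asyncio.run(fetch_many(['11643', '11652', '11660']))
print([s['match_id'] for s in demo_shots])  # ['11643', '11652', '11660']
```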
Calculate the match statistics
Once it’s finished running, we are ready for the real analysis. We are going to loop over the matches and review all the events up to the moment one of the teams goes 2-0 up. We will save, for both the leading and the trailing team:
- minute of the 2-0 goal
- total shots
- total missed shots
- total blocked shots
- total saved shots
- total shots on the post
- total xG
- total goals at the end of the match
- the team that was leading 2-0
This can be done by writing a function that loops over all shots and saves the information in a big list.
def get_match_stats(all_shots):
    # check the stats of the teams before the 2-0 was scored
    # idea is to predict if the result will hold or not
    # based on statistics before and including the 2-0 scored
    home_shots = all_shots['h']
    away_shots = all_shots['a']
    ah_shots = home_shots + away_shots
    # turn the minute into an int
    for idx, shot in enumerate(ah_shots):
        shot.update({'minute': int(shot['minute'])})
        ah_shots[idx] = shot
    ah_shots = sorted(ah_shots,
                      key=lambda x: x['minute'])
    # calculate if the result gets to 2-0 or 0-2 at some point
    goals_home = 0
    goals_away = 0
    shots_home = 0
    shots_away = 0
    xg_home = 0
    xg_away = 0
    miss_home = 0
    miss_away = 0
    block_home = 0
    block_away = 0
    saved_home = 0
    saved_away = 0
    post_home = 0
    post_away = 0
    stop_analysis = False
    for shot in ah_shots:
        if not stop_analysis:
            # collect stats of teams
            if shot['h_a'] == 'a':
                shots_away += 1
                xg_away += float(shot['xG'])
                if shot['result'] == 'BlockedShot':
                    block_away += 1
                elif shot['result'] == 'MissedShots':
                    miss_away += 1
                elif shot['result'] == 'SavedShot':
                    saved_away += 1
                elif shot['result'] == 'ShotOnPost':
                    post_away += 1
            else:
                shots_home += 1
                xg_home += float(shot['xG'])
                if shot['result'] == 'BlockedShot':
                    block_home += 1
                elif shot['result'] == 'MissedShots':
                    miss_home += 1
                elif shot['result'] == 'SavedShot':
                    saved_home += 1
                elif shot['result'] == 'ShotOnPost':
                    post_home += 1
        # goals keep being counted until the end of the match
        if shot['result'] == 'Goal':
            if shot['h_a'] == 'a':
                goals_away += 1
            else:
                goals_home += 1
        # stop analysis only if the result is 2-0
        if (goals_away == 0 and goals_home == 2) or (goals_away == 2 and goals_home == 0):
            stop_analysis = True
            # decide winning team
            if goals_away == 2:
                winning_team = 'away'
            else:
                winning_team = 'home'
            # record the minute when the goal was scored
            if shot['result'] == 'Goal':
                minute_second_goal = shot['minute']
    # we never saw a 2-0
    if not stop_analysis:
        return False
    if winning_team == 'away':
        leading_shots = shots_away
        leading_miss = miss_away
        leading_blocked = block_away
        leading_saved = saved_away
        leading_post = post_away
        leading_xg = xg_away
        leading_team = shot['a_team']
        leading_goals = goals_away
        trailing_shots = shots_home
        trailing_miss = miss_home
        trailing_blocked = block_home
        trailing_saved = saved_home
        trailing_post = post_home
        trailing_xg = xg_home
        trailing_team = shot['h_team']
        trailing_goals = goals_home
    else:
        leading_shots = shots_home
        leading_miss = miss_home
        leading_blocked = block_home
        leading_saved = saved_home
        leading_post = post_home
        leading_xg = xg_home
        leading_team = shot['h_team']
        leading_goals = goals_home
        trailing_shots = shots_away
        trailing_miss = miss_away
        trailing_blocked = block_away
        trailing_saved = saved_away
        trailing_post = post_away
        trailing_xg = xg_away
        trailing_team = shot['a_team']
        trailing_goals = goals_away
    understat_link = f''
    season = shot['season']
    all_stats = [leading_shots, leading_miss, leading_blocked, leading_saved, leading_post, leading_xg, leading_goals,
                 leading_team, trailing_shots, trailing_miss, trailing_blocked, trailing_saved, trailing_post,
                 trailing_xg, trailing_goals, trailing_team, minute_second_goal, season]
    return all_stats
This is quite a long function, but it’s not really complex (and maybe I could have made it shorter with some optimization). Notice how we stop collecting the shot statistics once the 2-0 has been scored, while we keep counting goals until the end of the match, and how we return False if the match never reached a 2-0 scoreline.
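The core of the function is the detection of the 2-0 moment, and that part can be isolated in a few lines. The sketch below is a simplified illustration rather than a replacement: first_two_nil is a name made up here, it ignores all the shot statistics, and like the function above it only counts shots whose result is 'Goal'. The two example matches are invented.

```python
def first_two_nil(shots):
    """Return (leader, minute) when the score first reads 2-0 or 0-2, else None."""
    score = {'h': 0, 'a': 0}
    for shot in sorted(shots, key=lambda s: int(s['minute'])):
        if shot['result'] != 'Goal':
            continue
        score[shot['h_a']] += 1
        if sorted(score.values()) == [0, 2]:
            leader = 'home' if score['h'] == 2 else 'away'
            return leader, int(shot['minute'])
        if min(score.values()) > 0:
            # both teams have scored, so 2-0 can no longer happen
            return None
    return None


# Invented match: away team goes 0-2 up in minute 55, home scores later
went_two_up = [
    {'minute': '10', 'result': 'Goal', 'h_a': 'a'},
    {'minute': '30', 'result': 'SavedShot', 'h_a': 'h'},
    {'minute': '55', 'result': 'Goal', 'h_a': 'a'},
    {'minute': '80', 'result': 'Goal', 'h_a': 'h'},
]
# Invented match: 1-0, 1-1, 2-1, so never 2-0 despite a team scoring twice
never_two_up = [
    {'minute': '5', 'result': 'Goal', 'h_a': 'h'},
    {'minute': '20', 'result': 'Goal', 'h_a': 'a'},
    {'minute': '70', 'result': 'Goal', 'h_a': 'h'},
]

print(first_two_nil(went_two_up))   # ('away', 55)
print(first_two_nil(never_two_up))  # None
```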
To apply the function to all matches we simply loop over match_shots and call it on each element.
# aggregate the stats
full_data = []
for shots in match_shots:
    all_stats = get_match_stats(shots)
    if all_stats:
        full_data.append(all_stats)
Now full_data contains all the information about the matches where a team went up 2-0. To make it easier to manipulate, as usual, we can turn this into a DataFrame and save it to a separate file.
df = pd.DataFrame(full_data, columns=['leading_shots', 'leading_miss', 'leading_blocked', 'leading_saved',
'leading_post', 'leading_xg', 'leading_goals', 'leading_team',
'trailing_shots', 'trailing_miss', 'trailing_blocked', 'trailing_saved',
'trailing_post', 'trailing_xg', 'trailing_goals', 'trailing_team',
'2-0_minute', 'season'])
df.to_csv("matches_20_fullstats.csv")
We can now answer a few questions. For example, “How many matches where a team was leading 2-0 ended up with a draw or a loss for the leading team?”. To get the answer, we just filter the DataFrame for the rows where the final goals of the leading team were lower than or equal to the goals of the trailing team.
# select the matches where the results was overturned from a 2-0 (either draw or loss)
df_turned = df[df.trailing_goals>=df.leading_goals][['season', 'leading_team', 'trailing_team',
'leading_goals', 'trailing_goals',
'leading_xg', 'trailing_xg', '2-0_minute']]
df_turned
And similarly, we can select the matches where the leading team ended up winning.
# select the matches where the leading team ended up winning from a 2-0
df_staied = df[df.trailing_goals<df.leading_goals][['season', 'leading_team', 'trailing_team',
'leading_goals', 'trailing_goals',
'leading_xg', 'trailing_xg', '2-0_minute']]
df_staied
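Since the original question was about losing or drawing, it can also be useful to keep the two cases apart. The sketch below labels each row as a win, draw or loss for the leading team and computes the share of each label. The four rows are a toy table invented for this example (only the column names match the DataFrame above), so the resulting shares say nothing about the real data.

```python
import pandas as pd

# Toy data: invented rows using the same column names as the real DataFrame
toy = pd.DataFrame({
    'leading_team': ['A', 'B', 'C', 'D'],
    'leading_goals': [2, 3, 2, 2],
    'trailing_goals': [0, 1, 2, 3],
})


def label_outcome(row):
    """Final outcome from the point of view of the team that led 2-0."""
    if row['leading_goals'] > row['trailing_goals']:
        return 'win'
    if row['leading_goals'] == row['trailing_goals']:
        return 'draw'
    return 'loss'


toy['outcome'] = toy.apply(label_outcome, axis=1)

# Share of each outcome among the matches that reached 2-0
shares = toy['outcome'].value_counts(normalize=True)
print(shares)
```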
We can check the percentage of overturned results by simply dividing the number of rows in df_turned by the number of rows in the full DataFrame. We can also calculate the average xG of the leading and trailing teams in both situations.
# check the % of overturned results
print(f"Overturned results: {100*len(df_turned)/len(df):.2f}%")
# check the leading/trailing xg average
print("Leading xGs")
print(f"Avg 2-0 staied: {df_staied['leading_xg'].mean():.2f}")
print(f"Avg 2-0 turned: {df_turned['leading_xg'].mean():.2f}")
# check the leading/trailing xg average
print("Trailing xGs")
print(f"Avg 2-0 staied: {df_staied['trailing_xg'].mean():.2f}")
print(f"Avg 2-0 turned: {df_turned['trailing_xg'].mean():.2f}")
It turns out that in 7.95% of the cases where a team goes up 2-0, it ends up either losing or drawing the game. It’s also quite interesting to notice that the average xG of the leading team (up to the 2-0) is noticeably higher in matches where the team ends up winning: 1.37 xG vs 1.03 xG in the matches where the team ends up drawing or losing.
In the next part we will investigate the winning ratio as a function of a few important stats, like the xG of the leading and trailing teams and the minute when the 2-0 was scored. As these values change, the winning ratio also changes significantly. This makes it possible to find the right situations to bet on a team losing the lead, and to quantify how often we expect to be right (according to historical data).
If you are interested in this type of analysis, I have written a few books where I go into the details of how to get the data, visualize and train a model to predict football results for the Premier League, La Liga, Serie A, Bundesliga and the other major European national tournaments, complete with code examples.