2-0 Advantage in European soccer leagues (Part 2)
In the first part of the analysis we arrived at the conclusion that in 7.95% of the cases where a team goes up 2-0 it will either lose or draw the game. And we also noticed that the average xG of the leading team (up to the 2-0) is significantly higher in matches where the team ends up winning.
In this second part we investigate the winning ratio as a function of a few important stats: the xG of the leading and trailing teams, and the minute when the 2-0 was scored. As these values change, the winning ratio changes significantly. This makes it possible to identify the situations in which it is worth betting on a team losing the lead, and to quantify how often we expect to be right (according to historical data).
Building a matrix
The metrics with the most impact on the prediction of the final result seem to be the xG of the two teams and the minute when the 2-0 was scored. It is then natural to ask questions like this: “What is the chance that a 2-0 gets overturned when the xG of the leading team is below 1 and the xG of the trailing team is above 1.5?”
To answer this question, the simplest approach is to build a 3-dimensional matrix where each cell is a particular configuration of these 3 variables, and then count the number of wins, losses and draws of the leading team in that particular configuration. We will create a `build_win_matrix` function that does exactly that.
import pandas as pd


def build_win_matrix(df, df_turned):
    # df: all matches with a 2-0 lead
    # df_turned: assumed to hold only the matches where the lead was overturned
    # search in slices across 3 dimensions (xG lead, xG trail, minute)

    # create grids: xG edges 0.0, 0.5, ..., 4.0 and minute edges 0, 10, ..., 100
    range_xg = [0.01 * i for i in range(0, 450, 50)]
    range_minutes = list(range(0, 110, 10))

    def in_cell(data, xg_lead, xg_trail, minute):
        # Series.between is inclusive on both ends, so a value that sits
        # exactly on an edge is counted in both neighbouring cells
        return data[data['leading_xg'].between(*xg_lead) &
                    data['trailing_xg'].between(*xg_trail) &
                    data['2-0_minute'].between(*minute)]

    result = []
    for min_1, min_2 in zip(range_minutes, range_minutes[1:]):
        for xg_lead_1, xg_lead_2 in zip(range_xg, range_xg[1:]):
            for xg_trail_1, xg_trail_2 in zip(range_xg, range_xg[1:]):
                cell = ((xg_lead_1, xg_lead_2), (xg_trail_1, xg_trail_2), (min_1, min_2))
                # count the overturned leads and the total matches in the cell
                no_wins = len(in_cell(df_turned, *cell))
                tot = len(in_cell(df, *cell))
                # save result (the ratio is set to 0 for empty cells)
                result.append({"xg_lead": f"{xg_lead_1}-{xg_lead_2}",
                               "xg_trail": f"{xg_trail_1}-{xg_trail_2}",
                               "minute": f"{min_1}-{min_2}",
                               "no_wins": no_wins,
                               "total": tot,
                               "ratio": no_wins / tot if tot else 0})
    df_results = pd.DataFrame(result)
    return df_results
Let’s have a look at what the function does, step by step.
- Create grids for the xGs and the minute. The xGs are grouped in steps of 0.5 in a range between 0 and 4. The minutes are grouped in 10-minute steps between 0 and 100.
- Loop over each of those xG and minute cells and count the matches where the lead was overturned, together with the total number of matches in the cell.
- Save `no_wins`, the total number of matches and their ratio, that is, the share of matches in the cell where the leading team drew or lost.
- Return a DataFrame with all the possible combinations.
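The same steps can also be sketched without explicit loops, using `pd.cut` to assign each match to a cell and `groupby` to count. This is only an illustrative alternative, not the function used in the rest of the article: the name `build_win_matrix_binned` is hypothetical, the bins are half-open (so a value on an edge is counted once), empty cells are omitted, and the cell labels come out in interval notation such as `[0.5, 1.0)` rather than `0.5-1.0`.

```python
import pandas as pd

CELLS = ["xg_lead", "xg_trail", "minute"]


def build_win_matrix_binned(df, df_turned):
    """Hypothetical vectorised variant: count matches per cell with pd.cut."""
    xg_edges = [0.5 * i for i in range(9)]      # 0.0, 0.5, ..., 4.0
    minute_edges = list(range(0, 110, 10))      # 0, 10, ..., 100

    def bin_counts(frame, name):
        # assign every match to a half-open bin [left, right) on each axis
        binned = pd.DataFrame({
            "xg_lead": pd.cut(frame["leading_xg"], xg_edges, right=False),
            "xg_trail": pd.cut(frame["trailing_xg"], xg_edges, right=False),
            "minute": pd.cut(frame["2-0_minute"], minute_edges, right=False),
        })
        # drop matches outside the grid, then count the rows in each cell
        labels = binned.dropna().astype(str)
        return labels.groupby(CELLS).size().reset_index(name=name)

    total = bin_counts(df, "total")
    turned = bin_counts(df_turned, "no_wins")
    # keep every cell that has at least one match; cells with no overturned lead get 0
    out = total.merge(turned, on=CELLS, how="left")
    out["no_wins"] = out["no_wins"].fillna(0).astype(int)
    out["ratio"] = out["no_wins"] / out["total"]
    return out[CELLS + ["no_wins", "total", "ratio"]]
```

Because the bin edges are handled differently, the counts from this sketch can differ slightly from those of `build_win_matrix` for matches that sit exactly on a cell boundary.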
We can call the function like this to build the matrix and, for example, sort the result by the total number of matches.
df_ratios = build_win_matrix(df, df_turned)
df_ratios.sort_values(by='total', ascending=False)
And we will get a matrix that looks like this.
xg_lead | xg_trail | minute | no_wins | total | ratio |
---|---|---|---|---|---|
0.5-1.0 | 0.0-0.5 | 20-30 | 20 | 158 | 0.126 |
1.0-1.5 | 0.0-0.5 | 30-40 | 12 | 147 | 0.081 |
… | … | … | … | … | … |
So we can immediately see that, although the average ratio of overturned 2-0 leads is below 8%, in the particular case where the 2-0 is scored between the 20th and 30th minute, the leading team's xG is between 0.5 and 1 and the trailing team's xG is below 0.5, the ratio goes up to 12.6%, which is well above the average.
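Since each cell contains a limited number of matches, it can help to put an uncertainty band around a ratio like 20 out of 158. The sketch below uses the Wilson score interval, which is a standard choice for binomial proportions but is my addition, not part of the original analysis; the function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half_width, center + half_width


# first row of the table above: 20 overturned leads out of 158 matches
low, high = wilson_interval(20, 158)
print(f"{low:.3f} - {high:.3f}")
```

For this cell the interval spans roughly 8% to 19%, so the lower end sits close to the overall average. Cells with fewer matches will have even wider intervals, which is one reason to require a minimum number of matches per cell.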
Betting strategy
One way to exploit this is to devise a betting strategy that takes advantage of the fact that the probability of a 2-0 being overturned depends strongly on the xGs of the teams and the minute when the 2-0 was scored, as we have seen.
To profit from this, we have to compare the probabilities implied by the odds with the probabilities calculated from the matrix. If the matrix probability is higher than the implied probability, we have a value bet.
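As a minimal sketch of that comparison, assuming decimal odds and ignoring the bookmaker's margin, the implied probability is simply the inverse of the odds. The function names and the odds of 9.0 below are hypothetical and only serve as an illustration.

```python
def implied_probability(decimal_odds):
    """Probability implied by decimal odds, ignoring the bookmaker margin."""
    if decimal_odds <= 1:
        raise ValueError("decimal odds must be greater than 1")
    return 1.0 / decimal_odds


def is_value_bet(matrix_probability, decimal_odds):
    """True when the matrix probability exceeds the odds-implied probability."""
    return matrix_probability > implied_probability(decimal_odds)


# hypothetical odds of 9.0 on the leading team failing to win
odds = 9.0
print(implied_probability(odds))   # 1 / 9, about 0.111
print(is_value_bet(0.126, odds))   # matrix cell with ratio 0.126
print(is_value_bet(0.081, odds))   # matrix cell with ratio 0.081
```

With these made-up odds, the first cell of the table (ratio 0.126) would count as a value bet while the second (ratio 0.081) would not. In practice the margin built into the odds makes the implied probabilities sum to more than 1, so a more careful comparison would account for it.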
We can simulate this betting strategy on available data from the 2023/24 season.
- Build the matrix on the seasons before 2023/24
- Select potential value bets in the 2023/24 season
- Evaluate the winning rate
This translates into the following code.
df_train = df[df.season != 2023]
df_test = df[df.season == 2023]
df_turned_train = df_turned[df_turned.season != 2023]
df_turned_test = df_turned[df_turned.season == 2023]

# create a matrix of results on the training dataset
df_ratios_train = build_win_matrix(df=df_train, df_turned=df_turned_train)

# select only the cells that have more than min_matches matches and a ratio > threshold
min_matches = 10
min_ratio = 0.15
df_ratios_train_sel = df_ratios_train[(df_ratios_train['total'] > min_matches) &
                                      (df_ratios_train['ratio'] > min_ratio)]


def count_selected(df_matches, df_cells):
    # count how many matches fall into one of the selected cells
    count = 0
    for _, row in df_matches.iterrows():
        # extract the xGs and the minute of the 2-0
        xg_lead_sel = row['leading_xg']
        xg_trail_sel = row['trailing_xg']
        minute_sel = row['2-0_minute']
        # check if it is in the selected ratios
        for _, cell in df_cells.iterrows():
            # parse the cell limits back from labels such as "0.5-1.0"
            xg_lead_min, xg_lead_max = map(float, cell['xg_lead'].split('-'))
            xg_trail_min, xg_trail_max = map(float, cell['xg_trail'].split('-'))
            minute_min, minute_max = map(float, cell['minute'].split('-'))
            # check if the match is in the cell (a match on a shared edge
            # of two selected cells is counted once per cell)
            if (xg_lead_min <= xg_lead_sel <= xg_lead_max) and \
               (xg_trail_min <= xg_trail_sel <= xg_trail_max) and \
               (minute_min <= minute_sel <= minute_max):
                count += 1
    return count


# 2023/24 matches in the selected cells where the lead was overturned
count_sel_ok = count_selected(df_turned_test, df_ratios_train_sel)
# all 2023/24 matches in the selected cells
count_sel_tot = count_selected(df_test, df_ratios_train_sel)
We have split the dataset into the matches before the 2023/24 season, which we called `df_train`, and the matches in the 2023/24 season, which we called `df_test`. We have then built the matrix and selected only those cells where we have enough matches (more than 10) and a high enough ratio (above 15%). Finally, we have selected only the matches in the 2023/24 season that fall into these cells.
We can then count how many of those selected matches actually ended with a correct prediction (a draw or a win by the trailing team).
count_sel_ok/(count_sel_tot)
This gives us 17.1%. Since we selected cells with a historical ratio above 15%, we were expecting a number around that level, and the result is in line with it. This gives us some confidence that the model is stable enough across seasons to be used in combination with the odds-implied probabilities.
If you are interested in this type of analysis, I have written a few books where I go into the details of how to get the data, visualize and train a model to predict football results for the Premier League, La Liga, Serie A, Bundesliga and the other major European national tournaments, complete with code examples.