Predicting goals Euro 2024 (matchday one)
In the previous article I explained how to predict the results for the matches played in matchday 1 of the Euro tournament.
In this one, I will show a similar approach, but to predict the total goals scored in each match. Many of the steps are very similar to the ones used in the previous article, so I will focus on the differences. At the end I will produce a slip predicting the number of goals in each match and compare it with the popular Over/Under 2.5 market odds.
Getting the data
The idea is to build a model, train it on data from Euro 2020 (the most recent edition), and back-test it on the total goals scored in the Euro 2020 group stage. In this way we also get an estimate of the model's accuracy and of its reliability when applied to the Euro 2024 matches.
In the article where I predicted the match results, I explained how to get the statistics about the teams’ performance in the Qualifiers from the UEFA website.
I have downloaded and saved all the data about the Euro 2020 Qualifiers, the Euro 2020 final stage and the Euro 2024 Qualifiers in my GitHub repository. Feel free to have a look there. Those data show the aggregated statistics for all teams that participated in the Euro 2020 Qualifiers, both the ones that qualified and the ones that didn’t.
The goals scored in the Euro 2020 group stage matches are also needed, in order to train and back-test the model. We can find them in this dataset, which collects all international match results since 1872, from official tournaments to friendly matches.
The last step is simply to calculate how many goals were scored in each match of the Euro 2020 group stage. The dataset will then look like this:
Date | Home Team | Away Team | Result | Total Goals |
---|---|---|---|---|
2021-06-11 | Italy | Turkey | 3-0 | 3 |
2021-06-12 | Wales | Switzerland | 1-1 | 2 |
2021-06-12 | Denmark | Finland | 0-1 | 1 |
2021-06-12 | Russia | Belgium | 0-3 | 3 |
2021-06-13 | Austria | North Macedonia | 3-1 | 4 |
… | … | … | … | … |
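As a minimal sketch of this last step, assume the results are loaded into a pandas DataFrame with hypothetical columns `home_score` and `away_score` (the actual column names in the dataset may differ). The total is then just the sum of the two scores:

```python
import pandas as pd

# Illustrative rows only; the column names are assumptions, not the dataset's schema.
matches = pd.DataFrame(
    {
        "date": ["2021-06-11", "2021-06-12", "2021-06-12"],
        "home_team": ["Italy", "Wales", "Denmark"],
        "away_team": ["Turkey", "Switzerland", "Finland"],
        "home_score": [3, 1, 0],
        "away_score": [0, 1, 1],
    }
)

# Human-readable result string, e.g. "3-0".
matches["result"] = (
    matches["home_score"].astype(str) + "-" + matches["away_score"].astype(str)
)

# Target variable: total goals scored in the match.
matches["total_goals"] = matches["home_score"] + matches["away_score"]

print(matches[["date", "home_team", "away_team", "result", "total_goals"]])
```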
Transform the data
We now need to normalize the aggregated statistical data and filter them. This is exactly the same approach used in the previous article, so I won’t go into details here. The steps to follow are:
- Divide the cumulative statistics by the number of matches played by the teams in the Qualifiers.
- Leave the average statistics (like ball possession) as they are since they already show the average.
- Keep only the data of the teams that qualified for the tournament.
- Join the match data with the statistical data.
In the final dataset, we will have one row per match, with the statistics of both teams and the number of goals, which will be the target of our model, the variable we want to predict.
Date | Home Team | Away Team | Attempts Home | Attempts Away | … | Total Goals |
---|---|---|---|---|---|---|
2021-06-11 | Italy | Turkey | 20.5 | 13.5 | … | 3 |
2021-06-12 | Wales | Switzerland | 12.87 | 22.0 | … | 2 |
2021-06-12 | Denmark | Finland | 17.37 | 11.5 | … | 1 |
2021-06-12 | Russia | Belgium | 20.7 | 20.7 | … | 3 |
… | … | … | … | … | … | … |
The model that we will train will try to find correlations between the Total Goals column and all those statistics that represent the teams’ strength.
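The transformation steps listed above can be sketched with pandas as follows. This is only an illustration under stated assumptions: the column names (`matches_played`, `attempts`, `possession`, `qualified`) are hypothetical, and the numbers are made up for the example rather than taken from the real Qualifiers data.

```python
import pandas as pd

# Hypothetical aggregated Qualifiers statistics (illustrative values, not real data).
stats = pd.DataFrame(
    {
        "team": ["Italy", "Turkey", "Kosovo"],
        "matches_played": [10, 10, 8],
        "attempts": [205, 135, 90],        # cumulative over the Qualifiers
        "possession": [62.0, 51.0, 47.0],  # already an average, left as is
        "qualified": [True, True, False],
    }
)
matches = pd.DataFrame(
    {"home_team": ["Italy"], "away_team": ["Turkey"], "total_goals": [3]}
)

# 1. Turn cumulative statistics into per-match averages.
cumulative_cols = ["attempts"]
per_match = stats.copy()
per_match[cumulative_cols] = per_match[cumulative_cols].div(
    per_match["matches_played"], axis=0
)

# 2. Keep only the teams that qualified, and only the feature columns.
features = per_match.loc[per_match["qualified"], ["team", "attempts", "possession"]]

# 3. Join the statistics twice: once for the home team, once for the away team.
home = features.add_suffix("_home").rename(columns={"team_home": "home_team"})
away = features.add_suffix("_away").rename(columns={"team_away": "away_team"})
dataset = matches.merge(home, on="home_team").merge(away, on="away_team")

print(dataset)
```

Joining the same feature table twice with different suffixes is one simple way to end up with one row per match carrying both teams' statistics.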
Training the model
Training the model is relatively straightforward once we have done the heavy lifting of creating a dataset that has all the features needed. The standard steps are the following.
- We split the data into 60% train and 40% test.
- We train and evaluate the model. This is done with cross-validation: the training set is split into N parts, N-1 parts are used to train the model and the remaining part is used to evaluate it, rotating the held-out part each time.
- We use the R2 metric to evaluate the model.
To be sure to choose the best model, we train a few different ones and then compare their R2. The higher the R2, the better the model is at explaining the variation in the data. We obtain a table like the one below.
Model | R2 |
---|---|
Decision Tree | 100% |
Linear Regression | 100% |
kNN | 48% |
Random Forest | 78% |
Due to the small number of data points in our training set (relative to the number of features of the model), the Decision Tree and the Linear Regression clearly overfit. This means that they adapt too closely to the training data, so their perfect scores are unreliable. There are a few techniques to correct this, and some of them are applied in the Random Forest algorithm out of the box, since it averages many trees, each trained on a random sample of the data. With R2=78%, the Random Forest gives the best result among the models that do not overfit. Our next step is to train this model on the full dataset and apply it to the Euro 2024 group stage matches.
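The comparison described in this section could look roughly like the sketch below, written with scikit-learn. The data here are random placeholders (36 matches, 10 made-up features), and the hyperparameters and the choice of 3 folds are assumptions, so the printed scores say nothing about the real models; the point is only the mechanics of splitting, cross-validating and scoring with R2.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real dataset: 36 matches, 10 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(36, 10))
y = rng.poisson(lam=2.7, size=36).astype(float)

# 60% train / 40% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

models = {
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Linear Regression": LinearRegression(),
    "kNN": KNeighborsRegressor(n_neighbors=3),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    # 3-fold cross-validation on the training set, scored with R2.
    cv_r2 = cross_val_score(model, X_train, y_train, cv=3, scoring="r2").mean()
    # R2 on the held-out test set, after fitting on the whole training set.
    test_r2 = model.fit(X_train, y_train).score(X_test, y_test)
    scores[name] = {"cv_r2": cv_r2, "test_r2": test_r2}
    print(f"{name:18s} CV R2 = {cv_r2:6.2f}   test R2 = {test_r2:6.2f}")
```

Comparing the cross-validated score with the score on the held-out test set is a quick way to spot overfitting: a model that looks perfect on the data it was fit on but scores poorly on unseen matches is memorising rather than generalising.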
Prediction of Euro 2024 Group stage
Once the model has been trained on the full 2020 data, it can be used to predict the Euro 2024 matches, by simply giving it as input the Euro 2024 Qualifiers statistics instead of the Euro 2020 ones. The model will output the expected number of goals that will be scored in total in each match.
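A minimal sketch of this step, again with scikit-learn and random placeholder arrays in place of the real feature tables (the names `X_2020`, `y_2020`, `X_2024` and the number of trees are assumptions made for the example):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Placeholder features and targets standing in for the full Euro 2020 dataset.
X_2020 = rng.normal(size=(36, 10))
y_2020 = rng.poisson(lam=2.7, size=36).astype(float)

# Placeholder features for 11 Euro 2024 fixtures, which in practice would be
# built the same way but from the Euro 2024 Qualifiers statistics.
X_2024 = rng.normal(size=(11, 10))

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_2020, y_2020)                # train on all available 2020 matches
predicted_goals = model.predict(X_2024)  # expected total goals per fixture

print(np.round(predicted_goals, 1))
```

The 2024 feature columns must be in the same order and on the same scale as the 2020 ones, otherwise the predictions are meaningless.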
Applied to the first matchday of the Euro 2024 group stage, the model gives the results below. Notice that the Germany match is missing: we do not have any Qualifiers data for the host country, so the model cannot give us a prediction for it.
Date | Match | Goal Prediction | Over/Under 2.5 | Odds |
---|---|---|---|---|
15-06-24 | Hungary-Switzerland | 4.4 | Over | 2.49 |
15-06-24 | Spain-Croatia | 2.9 | Over | 2.13 |
15-06-24 | Italy-Albania | 1.6 | Under | 1.85 |
16-06-24 | Slovenia-Denmark | 3.2 | Over | 2.22 |
16-06-24 | Serbia-England | 4.2 | Over | 1.85 |
16-06-24 | Poland-Netherlands | 2.9 | Over | 1.88 |
17-06-24 | Austria-France | 3.0 | Over | 1.73 |
17-06-24 | Romania-Ukraine | 2.5 | Under | 1.69 |
17-06-24 | Belgium-Slovakia | 2.9 | Over | 1.81 |
18-06-24 | Türkiye-Georgia | 3.2 | Over | 2.20 |
18-06-24 | Portugal-Czechia | 4.5 | Over | 1.81 |
Above, we have compared the model predictions with the average odds.
Most of the results predicted by the model are Overs, with the only exceptions being Italy-Albania and Romania-Ukraine. In the latter case, the number of predicted goals is 2.5, exactly on the line, so that could really go either way. You could take a more conservative approach though: bet on the Over only when the model predicts more than 3 goals, bet on the Under when it predicts fewer than 2, and do not bet when the prediction is between 2 and 3, where the model is less certain of the outcome.
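The two decision rules can be written down in a few lines of Python. The function names are made up for this sketch, and the thresholds are the ones mentioned in the text; the sample predictions are taken from the table above.

```python
def simple_pick(predicted_goals, line=2.5):
    """Over if the prediction is above the line, otherwise Under."""
    return "Over" if predicted_goals > line else "Under"


def conservative_pick(predicted_goals, low=2.0, high=3.0):
    """Bet only when the prediction is clearly away from the 2.5 line."""
    if predicted_goals > high:
        return "Over"
    if predicted_goals < low:
        return "Under"
    return "No bet"


# A few predictions from the table above.
predictions = {
    "Hungary-Switzerland": 4.4,
    "Spain-Croatia": 2.9,
    "Italy-Albania": 1.6,
    "Austria-France": 3.0,
    "Romania-Ukraine": 2.5,
}

for match, goals in predictions.items():
    print(f"{match:22s} {goals:3.1f}  {simple_pick(goals):5s}  {conservative_pick(goals)}")
```

With the conservative rule, Spain-Croatia, Austria-France and Romania-Ukraine all become "No bet", since their predictions fall between 2 and 3.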
Conclusions
Above we have applied a pretty standard method to predict the number of goals in the first matchday of the Euro 2024 group stage. We have shown how to build a predictive model of the goals scored in a match, using historical football results and statistics taken from various online sources. Using the same approach, but changing only the target variable, it is also possible to build a similar model that predicts the number of corners, the yellow and red cards, and so on.
If you are interested in learning more about how to build a betting model for Euro 2024 and more, you can check out my books where I go into the details of how to get the data, visualize and train a model, complete with code examples.