Predicting Euro 2024 group matches

In the previous article I explained how I created a model to rank all teams participating in the final stage of the Euro 2024 tournament, from getting the data to creating a ranking model, to comparing that with the bookmakers’ odds.

In this one, I look at the Group stage matches and build a model that can give us predictions for the winner of each match. I will go through the data collection process, the creation of the model, and the back-testing on the Euro 2020 tournament dataset and results.

Getting the data

The idea is to train a model on the data from the most recent tournament, Euro 2020, and back-test it on the Euro 2020 Group stage results. This also gives us an estimate of the model’s accuracy and of how reliable it is likely to be when we apply it to the Euro 2024 matches.

On the UEFA website we can access many statistics about the teams’ performance in the Qualifiers. For example, we can see all the data about goals scored, possession, passing accuracy and more. Unfortunately, there is no button or URL that we can use to download the data. A bit of research, however, shows that the UEFA website uses a public API to access those data in a machine-readable format. We can use this API, and with a single HTTP request we get all the data presented on the page.

We can use the endpoint https://compstats.uefa.com/v1/team-ranking?competitionId=3 to get the information about all Euro competitions. By adding a few parameters, like seasonYear and stats, we can narrow down our search to Euro 2020 only and select only the metrics that we are interested in.
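As a rough sketch, the request URL could be assembled like this. The endpoint and the parameter names competitionId, seasonYear and stats come from the text above; the metric names, the value format of seasonYear and the idea that several metrics are passed as one comma-separated value are my assumptions for illustration, so check them against what the UEFA page actually sends.

```python
from urllib.parse import urlencode

BASE_URL = "https://compstats.uefa.com/v1/team-ranking"


def build_stats_url(season_year, stats, competition_id=3):
    """Build the team-ranking URL for one season and a list of metrics."""
    params = {
        "competitionId": competition_id,
        "seasonYear": season_year,
        # Assumption: several metrics are sent as one comma-separated value.
        "stats": ",".join(stats),
    }
    # safe="," keeps the commas readable instead of encoding them as %2C.
    return f"{BASE_URL}?{urlencode(params, safe=',')}"


# Hypothetical metric names, used only to show the shape of the URL.
url = build_stats_url(2020, ["goals", "attempts"])
print(url)
```

The resulting URL can then be fetched with any HTTP client. The layout of the response is not described here, so it is worth inspecting it before writing any parsing code.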

I have downloaded and saved all data about the Euro 2020 Qualifiers, the Euro 2020 final stage and the Euro 2024 Qualifiers in my GitHub repository. Feel free to have a look there. The data are aggregated per team and cover all teams that took part in the Euro 2020 Qualifiers, both the ones that qualified and the ones that didn’t.

We also need all the results of the Euro 2020 group stage matches, in order to train and back-test the model. We can find them in this dataset, which collects all international match results since 1872, from official tournaments to friendly matches. This dataset contains only the match results, and it looks like this:

Date	Home Team	Away Team	Result
2021-06-11	Italy	Turkey	3-0
2021-06-12	Wales	Switzerland	1-1
2021-06-12	Denmark	Finland	0-1
2021-06-12	Russia	Belgium	0-3
2021-06-13	Austria	North Macedonia	3-1
…	…	…	…

Transforming the data

The first step is to normalize the aggregated statistical data for the Euro 2020 Qualifying campaign. Since not all teams played the same number of matches in the Qualifying phase, it’s necessary to divide all statistics by the number of matches played. This gives us a per-match average of each metric that is unaffected by the number of matches played.
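A minimal sketch of this normalization, assuming the aggregated totals are held in a dictionary keyed by team. The team names, the field names and the numbers below are made up for illustration and are not real UEFA figures.

```python
def per_match_averages(totals, matches_key="matches"):
    """Divide every aggregated statistic by the number of matches played."""
    averages = {}
    for team, stats in totals.items():
        played = stats[matches_key]
        averages[team] = {
            name: round(value / played, 2)
            for name, value in stats.items()
            if name != matches_key  # the match count itself is not a feature
        }
    return averages


# Made-up totals for two hypothetical teams.
totals = {
    "Team A": {"matches": 10, "goals": 40, "attempts": 207},
    "Team B": {"matches": 8, "goals": 20, "attempts": 110},
}
averages = per_match_averages(totals)
print(averages["Team A"])
print(averages["Team B"])
```

Team A played two more matches than Team B, but after the division the two teams can be compared directly on goals and attempts per match.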

Once that is done, we will have a set of normalized metrics, like average attempts, average attempts on target and so on. We have these for all teams that went through the Qualifiers, so the next step is to keep only the 24 teams that qualified for the final stage of Euro 2020. This set of normalized data looks like this.

Team	Goals	Attempts	Attempts on target	…
Belgium	4	20.7	9.2	…
Italy	3.7	20.5	8.1	…
England	4.6	13.75	8.0	…
…	…	…	…	…

The final step in building the training dataset is to join the results data with the statistics data. For each row of the results table we will create two sets of additional columns. The first set will contain the statistics for the home team, the second set will contain the statistics for the away team. We will also add a column that will serve as the target variable, which is what we want to predict. We want to predict the match result, which falls into one of three categories: home win (labeled as 1), away win (labeled as 2) or draw (labeled as X).

The final dataset will then look like this.

Date	Home Team	Away Team	Attempts Home	Attempts Away	…	Result
2021-06-11	Italy	Turkey	20.5	13.5	…	1
2021-06-12	Wales	Switzerland	12.87	22.0	…	X
2021-06-12	Denmark	Finland	17.37	11.5	…	2
2021-06-12	Russia	Belgium	20.7	20.7	…	2
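The join described above can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the dictionary layout, the function names and the column suffixes _home and _away are not taken from the original code, and only the attempts metric from the table above is included.

```python
def match_outcome(score):
    """Map a 'home-away' score string to 1 (home win), X (draw) or 2 (away win)."""
    home_goals, away_goals = (int(goals) for goals in score.split("-"))
    if home_goals > away_goals:
        return "1"
    if home_goals < away_goals:
        return "2"
    return "X"


def build_training_rows(results, team_stats):
    """Attach home and away statistics to each match and add the target label."""
    rows = []
    for match in results:
        row = {"date": match["date"], "home": match["home"], "away": match["away"]}
        for name, value in team_stats[match["home"]].items():
            row[f"{name}_home"] = value
        for name, value in team_stats[match["away"]].items():
            row[f"{name}_away"] = value
        row["result"] = match_outcome(match["score"])
        rows.append(row)
    return rows


results = [
    {"date": "2021-06-11", "home": "Italy", "away": "Turkey", "score": "3-0"},
    {"date": "2021-06-12", "home": "Wales", "away": "Switzerland", "score": "1-1"},
]
# Attempts per match as shown in the table above; other metrics are omitted.
team_stats = {
    "Italy": {"attempts": 20.5},
    "Turkey": {"attempts": 13.5},
    "Wales": {"attempts": 12.87},
    "Switzerland": {"attempts": 22.0},
}
rows = build_training_rows(results, team_stats)
for row in rows:
    print(row)
```

Each match ends up as one flat row, with the same metric appearing twice, once for the home side and once for the away side.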

The model that we will train will try to find correlations between the Result column and all those statistics that represent the teams’ strength in one way or another.

Training the model

Training the model is relatively straightforward once we have done the heavy lifting of creating a dataset that has all the features needed. The standard steps are the following.

We split the data into 60% train and 40% test.
Since the target variable is a string, we need to apply a transformation that turns it into an integer. We assign the value 0 to the draw, the value 1 to the home win and the value 2 to the away win.
We train and evaluate the model with cross-validation: the training set is split into N parts, N-1 parts are used to train the model and the remaining part is used to evaluate it, rotating the held-out part so that each one is used for evaluation once.
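To make the last two steps more concrete, here is a small standard-library sketch of the label encoding and of how cross-validation partitions the training rows. The helper names are hypothetical, and a real project would normally rely on a machine learning library for both steps.

```python
import random

# Encoding from the text: draw -> 0, home win -> 1, away win -> 2.
LABEL_TO_INT = {"X": 0, "1": 1, "2": 2}


def encode_labels(labels):
    """Turn the string labels 1/X/2 into integers."""
    return [LABEL_TO_INT[label] for label in labels]


def k_fold_indices(n_samples, n_folds, seed=0):
    """Yield (train_indices, validation_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal parts.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


encoded = encode_labels(["1", "X", "2", "2"])
splits = list(k_fold_indices(n_samples=10, n_folds=5))
for train, validation in splits:
    print(sorted(train), sorted(validation))
```

Every row is used for validation exactly once and for training in all the other rounds, which is why the averaged score is less dependent on one lucky or unlucky split.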

To make sure we choose the best model, we train a few different ones and then compare their accuracy, that is, the share of matches for which the predicted result is the correct one. We obtain a table like the one below.

Model	Accuracy
Decision Tree	49%
Logistic Regression	62%
kNN	53%
Random Forest	58%
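A comparison like the one in the table can be sketched with scikit-learn as follows. This is not the exact code behind the numbers above: the data here are synthetic stand-ins generated with make_classification, and the hyperparameters and the number of folds are my own assumptions, so the accuracies it prints will differ from the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature table: 3 classes (0=draw, 1=home, 2=away).
X, y = make_classification(
    n_samples=120, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0,
)

# 60% train, 40% test, keeping the class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean cross-validated accuracy on the training part, one value per model.
cv_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    cv_accuracy[name] = scores.mean()
    print(f"{name}: {cv_accuracy[name]:.0%}")

# Refit the best model on the whole training part and check the held-out 40%.
best_name = max(cv_accuracy, key=cv_accuracy.get)
test_accuracy = models[best_name].fit(X_train, y_train).score(X_test, y_test)
print(f"Best: {best_name}, test accuracy {test_accuracy:.0%}")
```

With a dataset as small as one group stage, the differences between models can be within the noise of the split, so it is worth repeating the comparison with a few different random seeds before trusting the ranking.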

In our case, it looks like the best model is the Logistic Regression one. Our next step is to train the model on the full dataset and apply it to the Euro 2024 Group stage matches.

Prediction of Euro 2024 Group stage

Once the model has been trained on the full 2020 data, it can be used to predict the Euro 2024 matches, by simply giving it as input the Euro 2024 Qualifiers statistics instead of the Euro 2020 ones. The model will output not only the predicted result but also the probability that each of the results will happen.
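The sketch below shows how the predicted result and its probability can be read from a scikit-learn Logistic Regression model. The feature values are random stand-ins for the Euro 2020 and Euro 2024 statistics and the variable names are hypothetical, so the printed predictions carry no meaning; only the mechanics are of interest.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Reverse of the encoding used for training: 0 -> X, 1 -> home win, 2 -> away win.
INT_TO_LABEL = {0: "X", 1: "1", 2: "2"}

rng = np.random.default_rng(0)
# Random stand-ins: 36 training matches and 3 matches to predict, 4 features each.
X_2020 = rng.normal(size=(36, 4))
y_2020 = np.tile([0, 1, 2], 12)  # placeholder labels covering all three classes
X_2024 = rng.normal(size=(3, 4))

# Train on the full 2020 data, then score the 2024 rows.
model = LogisticRegression(max_iter=1000).fit(X_2020, y_2020)
probabilities = model.predict_proba(X_2024)  # one row per match, one column per class

predictions = []
for row in probabilities:
    best = int(np.argmax(row))
    # model.classes_ gives the class label behind each probability column.
    label = INT_TO_LABEL[int(model.classes_[best])]
    predictions.append((label, float(row[best])))

for i, (label, prob) in enumerate(predictions, start=1):
    print(f"match {i}: {label} ({prob:.0%})")
```

Because the three probabilities sum to 100%, the probability of the predicted result can never be lower than about 33%, which is useful to keep in mind when reading values close to that floor.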

When applied to the first match day of the Euro 2024 group stage, the model gives the results below. Note that Germany’s match is missing because we do not have any Qualifiers data for the host country, so the model cannot give us any prediction for it.

Match	Prediction	Probability
Hungary-Switzerland	2	45%
Spain-Croatia	1	58%
Italy-Albania	1	42%
Slovenia-Denmark	2	44%
Serbia-England	1	49%
Poland-Netherlands	1	48%
Austria-France	1	48%
Romania-Ukraine	2	58%
Belgium-Slovakia	X	37%
Türkiye-Georgia	2	45%
Portugal-Czechia	1	58%

The model is relatively confident (58%) in only 3 of the 11 matches: Spain beating Croatia, Ukraine beating Romania, and Portugal beating Czechia. It also predicts a few potentially surprising results, such as Serbia beating England, Poland beating the Netherlands, Austria beating France, and a draw between Belgium and Slovakia. The probabilities associated with these outcomes are below 50%, so I would not be surprised if they don’t happen.

Conclusions

The above exercise shows the basics of how to build a predictive model using historical football results and statistics, applied to international teams. Using the same approach, and changing only the target variable, it is also possible to build a similar model that predicts the number of goals scored in a match, the number of corners, the yellow and red cards, and so on.

If you are interested in learning more about how to build a betting model for Euro 2024 and more, you can check out my books where I go into the details of how to get the data, visualize and train a model, complete with code examples.


12 Jun 2024