We created an R script that automatically collects data from a website, focusing on
specific soccer statistics we believe are good indicators of a team's league standing.
We aim to predict which teams will end up at the top of the league table and which
might not perform as well. This method saves time and allows us to gather and
analyze large amounts of data that we think are most relevant to team
performance.
To account for all aspects of the game, we collected key statistics for every team
in three areas: goalkeeping, defending, and attacking. Out of more than 100 available
features, we narrowed the set down to 27 by removing correlated or redundant
statistics. We used data from the 2019, 2022, and 2023 seasons only: 2019 and 2022
for training, and 2023 for prediction.
Prediction by Linear Regression
We started with a Linear Regression model, fitted with the formula
Model <- lm(Pts ~ . - Squad, data = train_data)
which gave an RMSE of 9.05. We then obtained the predicted points and their
prediction intervals. For 15 of the 20 teams, the actual points total fell between
the lower and upper bounds of the interval.
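The linear model and its prediction intervals can be sketched in base R as follows. This is a minimal illustration on synthetic data, not the project's actual script: the feature columns (SavePct, xG, Tackles), the helper functions, and the choice to compute RMSE on the held-out season are all assumptions made for the example.

```r
# Minimal sketch on synthetic data. Only Pts and Squad come from the report;
# every other name here is a hypothetical stand-in for the scraped statistics.
set.seed(42)

make_season <- function(n_teams = 20) {
  df <- data.frame(
    Squad   = paste0("Team_", seq_len(n_teams)),
    SavePct = runif(n_teams, 60, 80),    # hypothetical goalkeeping feature
    xG      = runif(n_teams, 30, 80),    # hypothetical attacking feature
    Tackles = runif(n_teams, 500, 700),  # hypothetical defending feature
    stringsAsFactors = FALSE
  )
  # Synthetic points: a linear signal plus noise, purely for illustration.
  df$Pts <- round(0.5 * df$SavePct + 0.8 * df$xG - 30 + rnorm(n_teams, sd = 5))
  df
}

train_data <- rbind(make_season(), make_season())  # two training seasons
test_data  <- make_season()                        # one held-out season

# Squad is an identifier, not a predictor, so it is removed from the formula.
Model <- lm(Pts ~ . - Squad, data = train_data)

# 95% prediction intervals for each team's points in the held-out season.
pred <- predict(Model, newdata = test_data, interval = "prediction", level = 0.95)

# RMSE on the held-out season, and the number of teams inside their interval.
rmse      <- sqrt(mean((test_data$Pts - pred[, "fit"])^2))
in_bounds <- sum(test_data$Pts >= pred[, "lwr"] & test_data$Pts <= pred[, "upr"])
```

Here `pred` is a matrix with columns `fit`, `lwr`, and `upr`, so `in_bounds` corresponds to the "points predicted between the upper and lower bound" figure reported in the highlights.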
Results:
Our ultimate goal is to predict the standings, which we obtain by predicting each
team's points and ordering the teams by those predictions.
Highlights:
Positions Guessed Correctly : 6 (highest)
Mean Position Difference : 1.6 (lowest)
Mean Points Difference : 7.4
Points predicted between the upper and lower bound: 15
Prediction by Random Forest
We then used Random Forest as our second model, with 500 trees and 8 variables
considered at each split.
It did very well on points prediction: the actual points of 19 of the 20 squads fell
within the lower and upper bounds.
Results:
After ordering the teams by their predicted points, we compared the predicted
standings against the actual results.
Highlights:
Positions Guessed Correctly : 5
Mean Position Difference : 2.3
Mean Points Difference : 7.3
Teams that scored points within the Prediction Interval : 19 (Highest)
Prediction by Extreme Gradient Boosting (XGBoost)
Extreme Gradient Boosting also performed very well at predicting the points scored
in a league season. With prediction intervals added, 19 of the 20 actual values fell
within its lower and upper bounds.
Results:
We chose XGBoost (xgbm) as our final model.
Highlights:
Positions Guessed Correctly : 5
Mean Position Difference : 2.1
Mean Points Difference : 6.9 (lowest)
Teams that scored points within the Prediction Interval : 19
Expected Goals Prediction
We also built a model based on expected goals. It does not involve any machine
learning and relies instead on the Law of Large Numbers.
This model does not use all the statistics we scraped: it only uses expected goals
and shots taken at home and away. It shows how teams would perform if the
expected goals (xG) metric were the only metric available for analysis.
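One way to picture an xG-only model that leans on the Law of Large Numbers is to simulate every fixture many times and average the points. The sketch below is an illustration under stated assumptions, not the project's actual method: it assumes goals follow a Poisson distribution with a team's per-match xG as the rate, uses made-up xG values for four hypothetical teams, and ignores the shot counts.

```r
set.seed(2023)

# Hypothetical per-match expected goals at home and away for four teams.
teams <- data.frame(
  Squad   = c("A", "B", "C", "D"),
  xG_home = c(2.1, 1.6, 1.3, 1.0),
  xG_away = c(1.7, 1.2, 1.0, 0.7)
)

n_sims     <- 5000  # simulations per fixture; larger values reduce noise
exp_points <- setNames(numeric(nrow(teams)), teams$Squad)

for (h in seq_len(nrow(teams))) {
  for (a in seq_len(nrow(teams))) {
    if (h == a) next
    # Assumption: goals are Poisson-distributed with the team's xG as the rate.
    home_goals <- rpois(n_sims, teams$xG_home[h])
    away_goals <- rpois(n_sims, teams$xG_away[a])
    # Average points over the simulated matches (3 for a win, 1 for a draw).
    home_pts <- mean(3 * (home_goals > away_goals) + (home_goals == away_goals))
    away_pts <- mean(3 * (away_goals > home_goals) + (home_goals == away_goals))
    exp_points[h] <- exp_points[h] + home_pts
    exp_points[a] <- exp_points[a] + away_pts
  }
}

# Expected-points table, ordered from first to last place.
xg_table <- data.frame(Squad = names(exp_points), xPts = round(exp_points, 1))
xg_table <- xg_table[order(-xg_table$xPts), ]
```

Averaging over many simulated matches is what makes each fixture's expected points stable; with few simulations the resulting table would vary noticeably from run to run.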
Results:
The resulting table for the 2023 season uses expected goals only. The points
difference (+/-) and the position changes are relative to the actual 2023 points
table.
Highlights:
Positions Guessed Correctly : 2
Mean Position Difference : 2.5
Mean Points Difference : 12.05
Conclusion
After comparing the results of the four models, we draw the following
conclusions.
● Points cannot be predicted very precisely from statistics alone, because in
soccer each team influences its opponents' statistics. We have more success
predicting teams' points relative to one another (i.e., the standings).
● Expected Goals is not a reliable metric on its own for predicting results and
standings. It helps, but it is not the be-all and end-all stat of football that it
is sometimes claimed to be.
● The linear model, with variable selection, is our best model for predicting the
standings. Random Forest was very consistent in its points predictions.
Executive Summary
This project aims to forecast the standings of soccer teams in a league using various
predictive models and key match statistics. The project explores the application of
different statistical methods and machine learning models to predict the league
position of soccer teams based on performance indicators. The methods compared
include a basic Educated-Guessing approach based on expected goals (xG) and
three machine learning approaches: Linear Regression, Random Forest, and
Extreme Gradient Boosting (XGBoost). The effectiveness of these models is
evaluated to determine which is most reliable for predicting soccer league
standings.
Introduction
Predictive modeling in sports analytics focuses on using statistical data to forecast
outcomes in sports events. This project leverages data from the Premier League to
predict team standings at the end of the season using different statistical metrics
related to team performance, including goalkeeping, defense, and attacking
statistics.
Project Objectives
To utilize key soccer match statistics to predict league standings.
To compare different predictive models in terms of accuracy and reliability.
To enhance understanding and insights into soccer analytics through effective data
utilization.
Data Collection
Data was automatically gathered via an R script, pulling information from a
dedicated sports statistics website. The dataset consists of team performance
metrics from the 2019, 2022, and 2023 seasons, with the first two seasons used for
training the models and the latest season for prediction. The collected data focus on
three main aspects of the game: goalkeeping, defending, and attacking, totaling 27
specific features.
Methodology
Data Preparation
Data from over 100 potential statistics were narrowed down to 27 relevant
indicators, ensuring the removal of correlated or redundant metrics. The selected
variables cover a broad spectrum of team performance, including specific metrics
like Save Percentage, Clean Sheet Percentage, and Progressive Passes.
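The report does not say how correlated metrics were identified, so the following is only one plausible approach: a greedy filter that keeps a column unless its absolute correlation with an already-kept column exceeds a cutoff. The function name `drop_correlated`, the cutoff of 0.9, and the synthetic columns are all assumptions made for illustration.

```r
set.seed(1)
n <- 40

# Synthetic stand-ins for scraped team statistics (hypothetical columns).
stats <- data.frame(
  Shots   = rnorm(n),
  Tackles = rnorm(n),
  SavePct = rnorm(n)
)
# A near-duplicate of Shots, to give the filter something to remove.
stats$ShotsOnTarget <- stats$Shots + rnorm(n, sd = 0.05)

# Greedy filter: walk the columns in order and keep a column only if it is
# not highly correlated with any column that has already been kept.
drop_correlated <- function(df, cutoff = 0.9) {
  cors <- abs(cor(df))
  keep <- character(0)
  for (col in names(df)) {
    if (length(keep) == 0 || all(cors[col, keep] < cutoff)) {
      keep <- c(keep, col)
    }
  }
  df[, keep, drop = FALSE]
}

reduced <- drop_correlated(stats)
names(reduced)
```

Because the filter is order-dependent, placing the preferred metrics first in the data frame determines which member of a correlated pair survives.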
Predictive Modeling
Four primary predictive approaches were employed:
Educated-Guessing Using Expected Goals (xG): This non-machine learning approach
uses xG to simulate matches and predict outcomes based solely on goal
probabilities.
Linear Regression: This model uses team statistics to predict the number of points a
team will earn, focusing on minimizing the root mean squared error (RMSE).
Random Forest: This ensemble model uses 500 decision trees to predict points,
considering 8 variables at each split to capture complex patterns in the data.
XGBoost: An implementation of gradient boosted trees designed for speed and
performance, this model also aims to predict team points accurately.
Evaluation Metrics
Models were evaluated based on several metrics:
Positions Guessed Correctly
Mean Position Difference
Mean Points Difference
Points Predicted Within the Upper and Lower Bounds
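These four metrics can be computed from a table of actual and predicted points. The sketch below uses a made-up five-team table and assumes that the "differences" are mean absolute differences and that ties in points are broken by order of appearance; the report does not state either convention, and the column names are hypothetical.

```r
# Hypothetical toy input: actual points, predicted points, and interval bounds.
results <- data.frame(
  Squad     = c("A", "B", "C", "D", "E"),
  Pts       = c(89, 84, 75, 71, 60),
  Predicted = c(85, 86, 70, 72, 66)
)
results$Lower <- results$Predicted - 5  # illustrative interval bounds
results$Upper <- results$Predicted + 5

# League position: rank 1 goes to the team with the most points.
results$ActualPos    <- rank(-results$Pts, ties.method = "first")
results$PredictedPos <- rank(-results$Predicted, ties.method = "first")

positions_correct  <- sum(results$ActualPos == results$PredictedPos)
mean_position_diff <- mean(abs(results$ActualPos - results$PredictedPos))
mean_points_diff   <- mean(abs(results$Pts - results$Predicted))
within_bounds      <- sum(results$Pts >= results$Lower & results$Pts <= results$Upper)
```

In this toy table only team E keeps its position, so `positions_correct` is 1, `mean_position_diff` is 0.8, `mean_points_diff` is 3.6, and `within_bounds` is 4.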
Results
Model Comparisons
Expected Goals Prediction: This model performed the weakest, with the highest
mean points difference and only 2 positions guessed correctly.
Linear Regression: Showed strong performance in predicting the standings with the
lowest mean position difference and a good balance between accuracy and
reliability in point predictions.
Random Forest: Excelled in the accuracy of point predictions, with 19 out of 20
teams' points falling within the predicted intervals.
XGBoost: Similar to Random Forest in point prediction accuracy, with the same
number of positions guessed correctly (5) but a slightly lower mean position
difference (2.1 versus 2.3).
Key Findings
The effectiveness of predictive models varies, with machine learning approaches
generally outperforming simpler statistical methods.
No single model perfectly predicts outcomes due to the unpredictable nature of
soccer.
Expected goals, while a popular metric, may not be sufficient alone to predict
overall team performance accurately.
Conclusion
The study concludes that while predictive modeling can provide valuable insights
into soccer analytics, the inherent unpredictability of sports means that predictions
will always have limitations. The Linear Regression model, despite its simplicity,
proved effective for this specific application, balancing between the granularity of
data and prediction accuracy. However, for more consistent point prediction,
ensemble methods like Random Forest and XGBoost are recommended.
Final Project Report: Predictive Analysis of Soccer League Standings Based on Key
Match Statistics