A novel way to Soccer Match Prediction
Jongho Shin, Robert Gasparyan
Email: [email protected], [email protected]

Abstract: Our hypothesis is that the video game industry, in its attempt to simulate a realistic experience, has inadvertently collected very accurate data that can be used to solve problems in the real world. In this paper we describe a novel approach to soccer match prediction that uses only virtual data collected from a video game (FIFA 2015). Our results were comparable to, and in some cases better than, the results achieved by predictors that used real data. We also use the data provided for each player, together with the players present in each squad, to analyze team strategy. Based on our analysis, we were able to suggest better strategies for weak teams.

I. INTRODUCTION

Sports betting has motivated many machine learning and rule-based attempts at match prediction over the years, and one of the most common approaches has been to use some combination of previous match results as the feature set. In this paper we attempt a novel approach to soccer match prediction that does not use match statistics. This, as far as we know, is the first attempt to predict soccer match outcomes using only “Virtual Data” (video game data). Our goal is to build a predictor that is better than, or at least comparable to, the ones based on “Real Data”, which would verify our hypothesis about the validity of video game data.

Video games have long been a multi-billion dollar industry, for no other reason than that they are really “good”. They have become so good because a lot of resources have been spent on making them as realistic as possible, which requires collecting and curating a lot of real-world data. Of course, this data has been collected for the sole purpose of making the simulation more realistic and thus more engaging for the user. A representative of a soccer video game company said, “The information gathered by our network of more than 1,300 scouts around the world, combined with Prozone’s amazing performance data, makes this an invaluable tool for any football club that takes player recruitment seriously”. If this statement is true, and video games are that accurate, the question becomes: can data collected by video games be used for solving other problems in the same medium? In this paper, we try to answer exactly this question.

The medium we looked at is European soccer and the video game considered is “FIFA 2015”. The makers of this video game are famous for creating the most realistic soccer player avatars, which look and act very much like the real players. In order to provide this level of authenticity, experts have been hired to collect information about the playing style of each player and to curate this data so that each player can be accurately represented as a combination of 33+ features. These features include both physical skills (speed, power, acceleration, etc.) and technical skills (dribbling, heading, accuracy, etc.), each represented as an integer from 1 to 100. The features are updated constantly as the form of the players changes over the season. It is interesting to point out that this data is already being used to solve other problems: scouts in the English Premier League use video games to look for young, promising players with specific skills.¹ And not too long ago a video game player was appointed as general manager of a real soccer team.²

The remainder of this paper is structured as follows. In Sections II and III we describe the data sources for this project and the preprocessing. Section IV describes our model selection for match prediction and the results. In Section V we use unsupervised learning techniques to identify the strategy used by the teams based on their squads.

A. Previous Work

Sports prediction is a very hot topic and has always attracted the interest of sports fans. Also, in countries where sports betting is legal, sports prediction is as critical as predicting stock prices. But despite this great interest in sports forecasting, it was hard to find serious published work on the subject. This may be because “Sports Prediction” is somewhat removed from academic interest. We were able to find a few papers published in economics journals. One paper [3] tried a multi-layer perceptron for sports prediction and explains how hard it is to predict sports games. [4] investigated three other prediction methods and compared their accuracy. Their results show that prediction markets and betting odds provide much better forecasts than tipsters. [2] analyzed a sports prediction market for the FIFA World Cup 2006 and compared prediction accuracy based on history with a random draw. The paper shows that history-based prediction is better than random prediction.

Interestingly, it is easier to find sports prediction papers from other machine learning classes. Students tried to predict their favorite sports games using machine learning. Many different prediction algorithms have been used, but they all had one thing in common: they tried to predict match outcomes based on previous match results. We did not find any attempts at strategy analysis or any instance of unsupervised learning. This is mainly due to the limitation of their feature sets (previous game history): they can only have a small number of features, and a small number of features hides many factors of the game. Thus they could use only a small part of machine learning.

¹ http://www.theguardian.com/football/2014/aug/11/football-manager-computer-game-premier-league-clubs-buy-players
² http://www.dailymail.co.uk/sport/football/article-2340324/Football-Manager-Vugar-Huseynzade-got-FC-Baku-job.html
TABLE I: Game data feature list

| Attacking        | Skill              | Movement     | Power      | Mentality     | Defending       | Goalkeeping    |
|------------------|--------------------|--------------|------------|---------------|-----------------|----------------|
| Crossing         | Dribbling          | Acceleration | Shot Power | Aggression    | Marking         | GK Diving      |
| Finishing        | Curve              | Sprint Speed | Jumping    | Interceptions | Standing Tackle | GK Handling    |
| Heading Accuracy | Free Kick Accuracy | Agility      | Stamina    | Positioning   | Sliding Tackle  | GK Kicking     |
| Short Passing    | Long Passing       | Reactions    | Strength   | Vision        |                 | GK Positioning |
| Volleys          | Ball Control       | Balance      | Long Shots | Penalties     |                 | GK Reflexes    |
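The grouping in Table I can also be written down as a small lookup structure, which makes it easy to check that the seven categories cover all 33 features and to compute the per-category squad sums used later in Section III-C. The following is only a minimal sketch under stated assumptions: the names `FEATURES_BY_CATEGORY`, `TOP_K`, `aggregate_squad`, and `category_totals` are hypothetical, players are assumed to be plain dictionaries mapping feature names to integer ratings, and summing the top-k ratings per feature is one reading of the aggregation, not necessarily the exact implementation used for the experiments.

```python
# Feature names grouped by category, transcribed from Table I.
FEATURES_BY_CATEGORY = {
    "Attacking": ["Crossing", "Finishing", "Heading Accuracy", "Short Passing", "Volleys"],
    "Skill": ["Dribbling", "Curve", "Free Kick Accuracy", "Long Passing", "Ball Control"],
    "Movement": ["Acceleration", "Sprint Speed", "Agility", "Reactions", "Balance"],
    "Power": ["Shot Power", "Jumping", "Stamina", "Strength", "Long Shots"],
    "Mentality": ["Aggression", "Interceptions", "Positioning", "Vision", "Penalties"],
    "Defending": ["Marking", "Standing Tackle", "Sliding Tackle"],
    "Goalkeeping": ["GK Diving", "GK Handling", "GK Kicking", "GK Positioning", "GK Reflexes"],
}

# Number of top ratings kept per category: the [4, 5, 5, 5, 5, 4, 1] choice of Section III-C.
TOP_K = {"Attacking": 4, "Skill": 5, "Movement": 5, "Power": 5,
         "Mentality": 5, "Defending": 4, "Goalkeeping": 1}


def aggregate_squad(squad):
    """Return {feature: sum of the top-k ratings in the squad} for all 33 features.

    `squad` is a list of player dicts mapping feature name -> rating (1..100).
    Assumption of this sketch: a missing rating counts as 0.
    """
    team = {}
    for category, features in FEATURES_BY_CATEGORY.items():
        k = TOP_K[category]
        for feature in features:
            ratings = sorted((p.get(feature, 0) for p in squad), reverse=True)
            team[feature] = sum(ratings[:k])
    return team


def category_totals(team):
    """Sum the aggregated features of each category."""
    return {c: sum(team[f] for f in fs) for c, fs in FEATURES_BY_CATEGORY.items()}
```

For a squad of eleven players rated 100 on every feature, `category_totals` reaches 2000 for Attacking, 2500 for Skill, Movement, Power, and Mentality, 1200 for Defending, and 500 for Goalkeeping, which matches the upper bounds quoted for the aggregation in Section III-C.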
II. DATASET

In this project, we looked at games played in the Spanish ‘La Liga’ (Primera Liga). There are 20 teams in the league, which play each other twice in a round-robin fashion (A plays B once at home and once away), which adds up to 380 fixtures a year. For this project, we acquired data from three different sources:

1) Fixture results
   These are the results of the matches played in ‘La Liga’. They correspond to the output variable of the prediction algorithm. The scores were reduced to three binary indicators: home win, draw, away win. We also collected the line-ups for each match. We focused on line-ups rather than teams, because teams tend to field different line-ups depending on strategy. The data were gathered from http://www.goal.com by scripted web crawlers.
2) Real Data
   This data represents statistics about team performance from match history: goals, shots on target, yellow cards, etc. We have looked at many examples of sports prediction techniques and they all use data similar to this. This data was used in this project to create a baseline prediction algorithm that our new technique (using virtual data) can be compared against.
3) Virtual Data
   This data was collected from http://sofifa.com and represents 33 features for each player, set by experts. Each feature corresponds to a physical or technical skill of the player and is an integer from 1 to 100. These are the features used by the video game to simulate the actions of each player. In this paper we try to verify how legitimate this data is by training a predictor with it and comparing the results to a predictor that uses “Real Data”. The full set of “Virtual Features” can be found in Table I.

Notice that Real Data and Virtual Data both provide features for the same output variables in Fixture results.

III. FEATURES AND PREPROCESSING

A. PCA

Sanity check with PCA: “Virtual Data” was collected from sofifa.com using web crawlers and represents a 33-feature vector for each player in the league (280 players in total). Our hypothesis was that this data was sufficient to build a legitimate predictor for real match results, but before we started creating and training models we decided to do a basic “sanity check”. Being familiar with the league and some of the players, we wanted to see whether the scores given to the players are consistent and able to distinguish different types of players from each other. In order to make the manual comparison of 280 players in a 33-dimensional feature space more feasible, we applied PCA. This was a successful and rather interesting use of PCA, because by reducing the data to 3D (a visually observable space) we could actually look at our data. The first thing we noticed after plotting the results of PCA (Figure 1) was a small set of points separated from the rest. We checked, and these points correspond to all the goalkeepers in the league, which is very intuitive because goalkeepers require a very different set of skills from field players. Examining Figure 1 further, we noticed that the top two players of the league ended up next to each other in the 3D plot. In fact, in the reduced space it was easy to see that players that are close to each other are similar and players that are far apart are not. We then ran k-means clustering on the data and got three distinct groups (Defensive Players, Offensive Players, Goalkeepers).

Fig. 1: (a) PCA in 3D, (b) PCA in 2D. Each point represents a soccer player. The small separated chunk of players are the goalkeepers.

B. Real Data

“Real Data”, collected from http://www.football-data.co.uk, represents a set of 380 matches played between 20 teams. Each match entry has 24 different statistics (home red cards, away red cards, home team shots on target, etc.). This data is used to create a baseline predictor that our “virtual” predictor can be compared against. The aggregation function averages a team's performance statistics over all its other matches.

Team Stat = (Home Team Shots, Away Team Shots, ..., Away Team Red Cards)
\[
\text{Team Real Features}(\text{Team})_i = \frac{1}{m} \sum_{j=1}^{m} \text{Team Stat}_j(\text{Team})_i
\]

Team Stat represents 12 statistics for a team in a given match. Team Real Features is computed by averaging these statistics over all the other matches the team has played; here i indexes the statistic and j runs over those m matches.

C. Virtual Data

“Virtual Data” represents individual soccer players, but we want to predict the outcome of a match played between two squads (a squad consists of 11 players). We define an aggregation function that represents a squad as a combination of the features of its players. First, we simply calculated the average value of each skill for each team. This naive approach smooths out each skill score; for example, there is only one goalkeeper, but the other players' goalkeeping skills make the difference between teams small. Thus we need to sum up the statistics according to position. To prevent smoothing out, we selected a small number of top scores from each squad. We tried several combinations of the top 2 to 5 scores for each feature, and we ended up selecting the top [4, 5, 5, 5, 5, 4, 1] scores for the seven categories of Table I, respectively. Thus one set of data looks as follows.

\[
\text{Team Virtual Features}(\text{team}) =
\begin{cases}
\sum_{\text{top 4}} \text{Attacking}, & \in [0, 2000] \\
\sum_{\text{top 5}} \text{Skill, Movement, Power, Mentality}, & \in [0, 2500] \\
\sum_{\text{top 4}} \text{Defending}, & \in [0, 1200] \\
\sum_{\text{top 1}} \text{Goalkeeping}, & \in [0, 500]
\end{cases}
\]

An on-field player is represented as a set of 33 features, which are split into 6 main categories (Table I). A goalkeeper is represented as a set of 5 features. A squad is given as a set of 11 players, one of whom is a goalkeeper; the other 10 are usually split into a defensive group and an offensive group. The aggregation function for a team is computed by taking all the goalkeeper features and combining the field players' features over the top 4 Attacking and Defending scores and the top 5 Skill, Movement, Power, and Mentality scores from Table I. The rationale behind this is that 1) the goalkeeper's features must be present in the aggregation because there is only one goalkeeper per team, and 2) there are usually 4 defenders, so it makes sense to consider only the top 4 purely defensive skills. The other coefficients were approximated by trying a set of different combinations and choosing the one that produces the best hit rates.

1) Match result: For each match, there are three outcomes: home team win, draw, and away team win. From the raw data (home team score and away team score), we derived three binary indicators, one per outcome. For example, if home team score > away team score, then home team win = 1 and the other two are 0.

2) Feature selection: To find out which features are more relevant, we also conducted feature selection over the data. We tried two feature selection algorithms: sequential forward selection and best-first selection. From the former, the top 5 features were home team skill, home team movement, home team mentality, home team defence, and home team goalkeeping. From the latter, the top 5 features were home team attacking, home team skill, home team movement, home team mentality, and home team goalkeeping. With SVM, the prediction accuracy obtained from these features was almost the same as the accuracy obtained with the entire feature set.

IV. MODELS AND RESULTS (SUPERVISED LEARNING)

We treated each match played between a home and an away team as a sample, labeled as one of (home win, draw, away win). Notice that in soccer there are three possible outcomes for each game, but most of our classifiers have binary outputs. We overcome this issue by slightly redefining the problem: we solve three binary classification problems, where in each one we try to distinguish between one of the labels and the other two.

Example (for home team win prediction):

\[
Y =
\begin{cases}
1 & \text{if home team won} \\
0 & \text{otherwise}
\end{cases}
\]

This classifier tells us whether the home team won or not.

A. Real Predictor

Each sample is a match between TeamA and TeamB, which is labeled with Y. As the feature set we used {Team Real Features(TeamA), Team Real Features(TeamB)}, which is a 24-dimensional vector. We applied Logistic Regression and Linear SVM to predict the label of each match. We experimented with the feature set by looking only at each team's performance in the last 3 matches in order to capture the “current form” of the team, but the hit rate was not affected. We achieved a hit rate of around 0.75 (Table II), which is comparable to results in related work and is a good baseline for our “Virtual Predictor”.

B. Virtual Predictor

Similar to the “Real Predictor”, we combined the features for each line-up, but we used the virtual features instead: {Team Virtual Features(TeamA), Team Virtual Features(TeamB)}, which is a 66-dimensional vector. We applied Linear SVM, RBF SVM, Logistic Regression, SGD, and Multinomial NB (we were required to discretize our values for this model). The “Virtual Predictor” produced results comparable to the “Real Predictor”, and with a little bit of tuning Linear SVM performed better (Table II).

V. IDENTIFYING STRATEGIES (UNSUPERVISED LEARNING)

Soccer managers are responsible for developing a team strategy before each match in order to surprise the opponent. Strategy includes things like player positioning, tactics, set pieces, etc. But team strategy is very often predicted by sports reporters just by examining the squad before the game. This is possible because a manager trying to play an offensive strategy has to include a lot of offensive players in the squad, and vice versa. We take advantage of this property
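TABLE II: Prediction results comparison

| Data         | Model          | Home Win | Draw | Away Win |
|--------------|----------------|----------|------|----------|
| Real Data    | Linear SVM     | 73%      | 75%  | 71%      |
| Real Data    | Logistic Reg   | 73%      | 72%  | 74%      |
| Virtual Data | Linear SVM     | 78%      | 80%  | 78%      |
| Virtual Data | RBF SVM        | 69%      | 81%  | 80%      |
| Virtual Data | Logistic Reg   | 70%      | 75%  | 76%      |
| Virtual Data | SGD            | 64%      | 70%  | 67%      |
| Virtual Data | Multinomial NB | 78%      | 70%  | 75%      |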
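The three binary problems behind the columns of Table II can be sketched as follows, using the convention of Section III-C.1 in which the outcome that occurred is encoded as 1. This is an illustration only: the helper names `outcome`, `one_vs_rest_labels`, and `hit_rate` are hypothetical, scores are assumed to be (home goals, away goals) pairs, and "hit rate" is assumed to mean the fraction of matches whose binary label is predicted correctly.

```python
OUTCOMES = ("home_win", "draw", "away_win")


def outcome(home_goals, away_goals):
    """Map a final score to one of the three match outcomes."""
    if home_goals > away_goals:
        return "home_win"
    if home_goals < away_goals:
        return "away_win"
    return "draw"


def one_vs_rest_labels(scores, target):
    """Binary labels for one outcome: 1 if the match ended in `target`, else 0."""
    if target not in OUTCOMES:
        raise ValueError(f"unknown outcome: {target}")
    return [1 if outcome(h, a) == target else 0 for h, a in scores]


def hit_rate(y_true, y_pred):
    """Fraction of matches whose binary label was predicted correctly."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up example scores, not taken from the data set.
scores = [(2, 1), (0, 0), (1, 3), (4, 0)]
labels = {target: one_vs_rest_labels(scores, target) for target in OUTCOMES}
```

Each of the three label lists would be paired with the same feature matrix and passed to its own binary classifier, which yields one hit rate per outcome, as in the three result columns of Table II.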
and attempt to use unsupervised machine learning techniques to identify different types of strategies based solely on the players' skills present in each squad. The “Fixture results” collected from http://www.goal.com also include the squads that were fielded in each match. We define an aggregation function similar to Team Virtual Features, but notice that Team Virtual Features captures the strength of the team, not its strategy. In order to identify strategy we need to normalize the features in Team Virtual Features.

A. Preprocessing

Normalization is required for k-means clustering, because Team Virtual Features for top teams are likely to be higher than the ones for weaker teams in every aspect: they will have better defense, offense, etc. We therefore normalize these features so that all the scores in Team Virtual Features add up to 1. The value of each feature in Squad Strategy is thus represented in proportion to all the other features in the composition. This allows us to observe whether one of the features is overemphasized, which defines the strategy. For example, a weak team and a very strong team may have different Team Virtual Features, but they may have a similar Squad Strategy if they are playing a similar strategy.

Squad Strategy = Normalized(Team Virtual Features)

Thus we can see which skills are emphasized and which are not in a certain combination. We assume that this emphasis of skills reflects the strategy of the squad.

B. Clustering results

We computed Squad Strategy for each squad that played in the 380 matches of the season. We then applied k-means clustering to the data, with various values of k, to identify distinct types of strategies. Figure 2 shows the root mean square error (RMSE) of k-means clustering for different values of k. Even though a larger k gives a better fit, the clusters become meaningless if k is too large. From the RMSE plot, 5 or 6 is an appropriate value for k. Thus the rest of the analysis is based on 5-means clustering.

Fig. 2: K-means RMSE

Figure 3 shows the 5-means clustering. Each cluster shows a distinct combination: cluster0 is well balanced except for attacking, cluster1 focuses more on attacking and individual skills, cluster2 is biased toward defending and movement, cluster3 focuses on goalkeeping and individual skills, and cluster4 concentrates on goalkeeping and defending.

C. Clustering analysis

Given the 5-means clustering result, we analyzed the winning odds between clusters. Hence we can see which combination, i.e. strategy, is more plausible in certain situations.

1) Winning odds between clusters: General winning odds between clusters are given in Table III. Winning odds are very different depending on the opponent and the stadium, but in general cluster4 does well. Since the table shows the winning probability of the home team, positive numbers are good for rows, and negative numbers are good for columns.

TABLE III: Winning probability of home team by clusters

| Home\Away | Cluster0 | Cluster1 | Cluster2 | Cluster3 | Cluster4 |
|-----------|----------|----------|----------|----------|----------|
| Cluster0  | 0.00000  | 0.25000  | 0.40000  | -0.75000 | -0.75000 |
| Cluster1  | 0.00000  | 0.33333  | -0.50000 | 0.00000  | 0.20000  |
| Cluster2  | 1.00000  | 0.00000  | -0.33333 | 0.00000  | -0.20000 |
| Cluster3  | 0.50000  | -0.20000 | -0.66667 | 0.00000  | 0.00000  |
| Cluster4  | 0.28571  | 0.50000  | 0.80000  | 0.14286  | 0.20000  |

However, Table III shows winning odds regardless of actual point differences. That means cluster4 may simply contain more strong teams. Thus we also conducted the analysis for weak teams, where a weak team is a team with fewer stat points than its opponent. Table IV shows the winning clusters. This table shows that cluster3 and cluster4 are good strategies for weak teams in general: cluster3 can win against clusters 1, 2, and 3, and cluster4 can win against clusters 0, 1, and 3. Thus if a team is weaker than its opponent, it is better to focus on goalkeeping and defense, like cluster3 and cluster4, to increase the winning odds.

TABLE IV: Weak team's winning strategy

|         | Cluster0 | Cluster1 | Cluster2 | Cluster3 | Cluster4 |
|---------|----------|----------|----------|----------|----------|
| Against | 0        | 1        | 0,1      | 1,2,3    | 0,1,3    |

VI. DISCUSSION

A. Supervised Learning

In Section IV we described two approaches to soccer match prediction: “Real Predictor” and “Virtual Predictor”. The “Real Predictor” represents the traditional approach, which applies machine learning to match statistics collected throughout the season (we refer to this as “Real Data”). This approach achieved
Fig. 3: Clusters: (a) cluster0, (b) cluster1, (c) cluster2, (d) cluster3, (e) cluster4. Each cluster represents a distinct strategy. The radial graph shows how different skills are emphasized in each strategy.
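The normalization behind the strategies plotted in Fig. 3 (Section V-A) can be sketched in a few lines. The function name `squad_strategy` and the example numbers are hypothetical; the only assumption taken from the text is that the aggregated scores are rescaled so that they add up to 1.

```python
def squad_strategy(team_virtual_features):
    """Rescale aggregated team scores so that they sum to 1.

    `team_virtual_features` maps a feature (or category) name to its
    aggregated score. The result expresses each score as a share of the
    total, which removes overall team strength and keeps only the emphasis.
    """
    total = sum(team_virtual_features.values())
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return {name: value / total for name, value in team_virtual_features.items()}


# Made-up totals for a strong and a weak squad with the same emphasis.
strong = {"Attacking": 1600, "Defending": 800, "Goalkeeping": 400}
weak = {"Attacking": 1200, "Defending": 600, "Goalkeeping": 300}

strong_strategy = squad_strategy(strong)
weak_strategy = squad_strategy(weak)
```

Although the two example squads differ in overall strength, they map to the same proportions, which is the situation described in Section V-A: different Team Virtual Features but a similar Squad Strategy. Vectors of this kind would then be the input to k-means clustering.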
a 0.75 hit rate for our data, while the highest hit rate we found in related work using this approach was 0.83 [1]. The “Virtual Predictor”, which uses only data collected from video games, achieved a 0.80 hit rate. This demonstrates that data collected by the video game industry can be used to solve real problems in this medium, soccer, with comparable or even better results.

B. Unsupervised Learning

In Section V we were able to use k-means clustering to identify 5 different types of playing strategies based on the players' skills present in the squad (Figure 3). We were able to measure how these 5 strategies perform against each other and observed that most top teams in soccer have very offensive strategies. This is in contrast to the NBA, where the best teams were the ones with good defensive stats [1]. We also observed that weaker teams perform better against stronger teams if they use defensive strategies.

VII. CONCLUSION AND FUTURE WORK

One of the main challenges in machine learning projects is data collection, which can be very time consuming and expensive. In this paper we demonstrate an alternative source of curated data: video games. Video game data is often overlooked because of its origin. However, video games have come a long way since Pac-Man and Frogger and have created phenomenally accurate simulations of the real world, which can only be done through very intensive data collection. This data can be used in machine learning projects to make predictions in the real world with very accurate results. Of course, this would be made easier if the video game industry shared this information in the public domain. FIFA 2015 has done a great job simulating the actions of soccer players, but it is likely that its makers used more than the 33 features presented in the game to accomplish this. We believe that if the entire feature set were available, the “Virtual Predictor” would achieve even higher hit rates.

REFERENCES

[1] M. Beckler, H. Wang, and M. Papamichael. NBA Oracle. Last visited, 17:2008–2009, 2013.
[2] S. Luckner, J. Schröder, and C. Slamka. On the forecast accuracy of sports prediction markets. In Negotiation, Auctions, and Market Engineering, pages 227–234. Springer, 2008.
[3] A. McCabe and J. Trevathan. Artificial intelligence in sports prediction. In Information Technology: New Generations, 2008. ITNG 2008. Fifth International Conference on, pages 1194–1197. IEEE, 2008.
[4] M. Spann and B. Skiera. Sports forecasting: a comparison of the forecast accuracy of prediction markets, betting odds and tipsters. Journal of Forecasting, 28(1):55–72, 2009.