Soccer Player Evaluation via AI
Abstract
Scientifically evaluating soccer players represents a challenging Machine Learning problem. Unfor-
Figure 2. (a) The ET model computes the position of each entity. It then passes the coordinates to the online tracker to: (1) compute the embedding of each player, (2) warp each coordinate with the homography from HE. (b) The HE starts by doing a direct estimation of the homography. Available keypoints are then predicted and used to compute another estimation of the homography. Both predictions are used to remove potential outliers. (c) The REID model computes an embedding for each detected player. It also compares the IoUs of each pair of players and applies a Kalman Filter to each trajectory.
the idea that our EDG model is trained purely on simulation, and never on real soccer data. This makes it much easier to train, and utterly independent from large soccer datasets that are either too complicated to produce or too expensive to acquire.

Our EDG model generalizes well to many playstyles and many teams. It also produces similar or even more accurate results than existing models.

Our tracking model predicts the players' coordinates and computes precise results even on difficult scenes, such as sunny or shadow-covered ones. This is also the first end-to-end open-source framework for soccer player tracking, and it could easily be extended to other team sports.

2. Related Work

2.1. Tracking players on a sport field

Player detection. Detecting soccer players can be a complicated task, especially in a multi-actor situation. There are multiple ways of doing this detection. Kaarthick (2019) explored the usage of a HOG color-based detector that gives a detection of the players for each image. Johnson (2020) worked on a deeper model: an open-source multi-person pose estimator named AlphaPose, based on the COCO dataset. Another approach is to consider players as objects and use Object Detection Models. Ramanathan et al. (2015) used such a model, a CNN-based multibox detector, for the player detection.

Field registration. One key element of any sports tracking model is how player coordinates can be represented in two dimensions. One way to do so is Sport Field Registration, the act of finding the homography placing a camera view into a two-dimensional view, fixed across the entire game. More about Field Registration can be found in the Supplementary Materials-A. Sharma et al. (2017) formulated the registration problem as a nearest neighbor search over a generated dictionary of edge maps and homography pairs that they made synthetically. To extract the edge data from their images, they compared 3 different approaches: Histogram of Oriented Gradients (HOG) features, chamfer matching, and convolutional neural networks (CNN). Finally, they enhanced their findings using a Markov Random Field (MRF). Homayounfar et al. (2017) figured out the homography's attributes by parametrizing the problem as one of inference in an MRF, whose features they determine using structured
SVMs. The use of synthetic data is also explored by Chen & Little (2018), who take Homayounfar et al. (2017) and Sharma et al. (2017)'s work one step further by building a generative adversarial network (GAN) model. Finally, Jiang et al. (2019) used a ResNet to directly predict the homography parameters iteratively, while Citraro et al. (2020) used particular keypoints detected with a Unet from the field to achieve the same goal.

Entities tracking in video. Ramanathan et al. (2015) used a Kanade–Lucas–Tomasi feature tracker for player tracking across a game. Recent Re-Identification models go a step further: Zheng et al. (2019) used an embedding model to extract player features, and a mixture of IoU and embedding similarities to track persons. Another approach was used by Liang (2020), with a k-Shortest Path algorithm and an embedding model to extract players' numbers and colors.

2.2. Valuing player actions

Before the revolution brought by data analysis in football, valuing players was mostly based on traditional statistics such as the number of goals, the number of assists, pass accuracy, or the number of steals. These post-game statistics can be relevant as an overview of a player's quality, but they do not reflect how a player can impact a game thanks to his vision or his moves. Furthermore, focusing on general statistics also means ignoring the entire universe of actions in which a player took part. Thanks to the growing availability of event and tracking data, researchers started to create models using all game events and tracking data to evaluate players' actions on and off the ball. In 2018, Spearman (2018) suggested an indicator to evaluate off-ball positioning called Off-Ball Scoring Opportunities (OBSO). Thanks to his previous fundamental works on Pitch Control (PC), he was able to measure the positioning quality of attacking players by evaluating the danger of their position if they were to receive the ball.

The same year, using tracking data, Fernández & Bornn (2018) presented two main indicators to measure the use of space and its creation during a game: Space Occupation Gain (SOG) and Space Generation Gain (SGG). Later on, in 2019, Fernández (2019) adapted Expected Possession Value (EPV), a deep learning based method, from basketball to football, enabling analysts to reach a goal probability estimator at each instant of a possession. By discounting EPV, researchers were able to assess the impact of a single event, like a key pass, which would increase EPV significantly. Similarly, Decroos et al. (2019) presented their VAEP model, for which they developed an entire language around event data called the Soccer Player Action Description Language (SPADL). Thanks to supervised learning, their model was trained to evaluate the impact an event could have on the scoring or conceding probability. Therefore, this model enables analysts to estimate the effect of each action from a player and then evaluate his global performance.

We start by going into the Theoretical framework of both our tracking model and our EDG model. In Experimental methods, we review the implementations and learning procedures of our models. Finally, in Results, we quickly present the tracking results to focus more on the insights given by our EDG model.

3. Theoretical framework

3.1. Tracking model

Our EDG model needs to know the state of the game at any time t: we chose to do so by representing a soccer game with a list of entity coordinates. Given a list of images, our tracking model computes the 2-dimensional trajectories of both the ball and each player. The model adopts a 3-step method: ENTITY TRACKING, HOMOGRAPHY ESTIMATION, REIDENTIFICATION (ET, HE, REID; see Figure 2). Each of these steps is represented by one or several models, and more theoretical background on the homography estimation can be found in the Supplementary Materials-A. The first 2 steps do not consider any temporal information, as they approximate results for each image separately. However, the last step compares each new image to the list of images that have already been processed. Doing so allows us to take into consideration each entity's movement over time and counters the mistakes the previous models might make. We quickly review here each of the 3 steps: ET, HE, REID.

ENTITY TRACKING definition. The ET, $\mathbb{R}^{n \times n \times 3} \to \mathbb{R}^{m \times 4} \times [0,1]^m \times [0,1]^m$, takes an image as input, and predicts a list of bounding boxes associated with a class prediction (Player or Ball) and a confidence value¹.

HOMOGRAPHY ESTIMATION definition. The HE is made of 2 separate models. The first one, $\mathbb{R}^{n \times n \times 3} \to \mathbb{R}^{3 \times 3}$, takes an image as input and predicts the homography directly. The second one, $\mathbb{R}^{n \times n \times 3} \to \mathbb{R}^{p \times n \times n}$, also takes an image as input, but predicts p masks, each mask representing a particular keypoint on the field (see Figure 3). The homography is computed knowing the coordinates of available keypoints on the image, by mapping them to the keypoints' coordinates on a 2-dimensional field (see Supplementary Materials-A for more details). Using both models allows us to have better result stability, and to use one model or the other when outliers are detected.

REIDENTIFICATION definition. The REID model gathers all information from the ET model and builds the embedding model as well. The embedding model, $\mathbb{R}^{n \times n \times 3} \to \mathbb{R}^{751}$, takes the image of a detected player and gives a meaningful embedding². The model initializes several tracklets based on the boxes from ET in the first image. In the following ones, the model links the boxes to the existing tracklets according to: (1) their distance measured by the embedding model, (2) their distance measured by the boxes' IoUs. A Kalman Filter (Kalman, 1960) is also applied to predict the location of the tracklets in the current image and remove the ones that are too far from the linked detection. The model also updates the embedding of each player over time with another Kalman Filter.

¹ m is defined as the number of predictions we make and is usually fixed at 100.
² 751 was the initial number of classes in the Market-1501 Dataset. We kept the same number for the size of our embedding, although a smaller embedding size might lead to better results according to recent papers (Zheng et al., 2019).
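To make these three definitions concrete, here is a minimal sketch of how the steps compose for a single frame. The `et_model` and `keypoint_model` callables and the field-keypoint table are hypothetical stand-ins for illustration, not the released implementation.

```python
import numpy as np
import cv2

# Known 2D field positions of the p keypoints (placeholder values).
FIELD_KEYPOINTS_2D = np.zeros((28, 2), np.float32)

def track_frame(image, et_model, keypoint_model):
    # (1) ET: bounding boxes (x1, y1, x2, y2), classes, and confidences.
    boxes, classes, scores = et_model(image)          # (m, 4), (m,), (m,)

    # (2) HE (keypoints-based): locate each visible keypoint mask and map
    # it to its known 2D field position; needs >= 4 visible keypoints,
    # otherwise the direct-estimation model would be used instead.
    masks = keypoint_model(image)                     # (p, n, n)
    visible = [k for k in range(len(masks)) if masks[k].max() > 0.5]
    img_pts = np.float32([np.unravel_index(masks[k].argmax(),
                                           masks[k].shape)[::-1]
                          for k in visible])
    H, _ = cv2.findHomography(img_pts, FIELD_KEYPOINTS_2D[visible])

    # (3) Warp each box's foot point (bottom-center) into 2D field
    # coordinates; REID then links these detections across frames.
    feet = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2, boxes[:, 3]], axis=1)
    field_xy = cv2.perspectiveTransform(feet[None].astype(np.float32), H)[0]
    return field_xy, classes, scores
```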
Figure 3. Positions of the 28 keypoints the HE model predicts. During training and inference, each keypoint is associated with a different mask. Such a mask is a binary matrix, with 1s at the position of the keypoint and 0s elsewhere. We have 28 masks, one per keypoint.

3.2. EDG model

We assume $s_t \in \mathcal{S}$ is the state of the game at time t. It may be the positions of each player and the ball, for example. Given an action $a \in \mathcal{A}$ (e.g. a pass, a shot; more details in the Supplementary Materials-B), and a state $s' \in \mathcal{S}$, we note $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$ the probability $P(s' \mid s, a)$ of getting to state $s'$ from $s$ following action $a$. Applying actions over K time steps yields a trajectory of states and actions, $\tau_{t_{0:K}} = s_{t_0}, a_{t_0}, \ldots, s_{t_K}, a_{t_K}$. We denote $r_t$ the reward given going from $s_t$ to $s_{t+1}$ (e.g. +1 if the team scores a goal). More importantly, the cumulative discounted reward along $\tau_{t_{0:K}}$ is defined as:

$$R(\tau_{t_{0:K}}) = \sum_{n=0}^{K} \gamma^n r_{t_n}$$

It represents the discounted expected number of goals the team will score (or concede) from a particular state. To build a good policy, one can define an objective function based on the cumulative discounted reward:

$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right]$$

and seek the optimal parametrization $\theta^*$ that maximizes $J(\theta)$:

$$\theta^* = \arg\max_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\left[R(\tau)\right]$$

To that end, we can compute the gradient of such a cost function³, $\nabla_\theta J(\theta)$, to update our parameters with $\theta \leftarrow \theta + \lambda \nabla_\theta J(\theta)$. In our case, the evaluation of $V^\pi$ and $\pi_\theta$ is done using Neural Networks, and $\theta$ represents the weights of such networks (more details on Neural Networks can be found in Goodfellow et al. (2017)). At inference, our model will take the state of the game as input, and will output the estimation of the EDG. A more advanced view of Reinforcement Learning can be found in Sutton & Barto (1998).

³ Using a log probability trick, we can show that we have the following equality: $\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log\left(\pi_\theta(a_t \mid s_t)\right) R(\tau)\right]$
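As an illustration of this update rule, here is a minimal REINFORCE-style sketch in PyTorch, built directly on the log-probability identity from footnote 3. The `policy` network (flat state vector to action logits) is a hypothetical stand-in; the distributed training setup actually used for the Agent is not reproduced here.

```python
import torch

def reinforce_update(policy, optimizer, states, actions, rewards, gamma=0.993):
    """One gradient step from a single trajectory.

    states: (T, state_dim) float tensor; actions: (T,) long tensor;
    rewards: (T,) float tensor with r_t for each transition.
    """
    # R(tau) = sum_n gamma^n * r_{t_n}
    discounts = gamma ** torch.arange(len(rewards), dtype=torch.float32)
    R = torch.sum(discounts * rewards)

    # grad_theta J = E[ (sum_t grad log pi(a_t | s_t)) * R(tau) ]
    log_probs = torch.log_softmax(policy(states), dim=-1)
    taken = log_probs.gather(1, actions.view(-1, 1)).squeeze(1)
    loss = -taken.sum() * R        # gradient ascent on J == descent on -J

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(R)
```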
4. Experimental methods

All of the models were implemented using standard deep learning libraries and the usual image processing libraries. Models, code, and datasets will be released at [Link]

4.1. Tracking implementation details

ENTITY TRACKING details. The ET model is based on a Single Shot MultiBox Detector (SSD) (Liu et al., 2016). The model takes images of shape (512, 512), and makes predictions for 2 classes: players (including referees) and the
Figure 5. We can track almost every player and the ball, even in very different settings. We present here some examples of tracking failures and successes. In sequence (a), the REID model fails to identify player 19 between frames 1. and 2., giving the same player a new id (15). Between frames 3., 4. and 5., the boxes of players 4 and 5 merge into id 20. When the merge is undone, the model can retrieve the right id for each player. In sequence (b), the same phenomenon appears in frames 2., 3. and 4., but the model fails to recognize player 8 and gives him a new id (16). Another issue with our model is when a player runs out of the camera's range and returns later. This almost always leads to a change of id, as seen in frame 5. with player 17. However, this can easily be managed by merging the two ids' trajectories.
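As a small illustration of the id-merge fix mentioned at the end of the caption, assuming a hypothetical `tracks` dictionary mapping each id to its list of (frame, x, y) detections:

```python
def merge_ids(tracks, old_id, new_id):
    """Reassign the detections of a spurious new id back to the old id."""
    merged = sorted(tracks.pop(old_id) + tracks.pop(new_id))  # sort by frame
    tracks[old_id] = merged  # the original id now covers the full trajectory
    return tracks
```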
for the human eye. Because of its relatively small size and its frequent occlusion by players, the ball Average Precision does not get higher than 60%.

HOMOGRAPHY ESTIMATION results. The results and comparison against existing methods are available in Table 1. We find similar results to (Citraro et al., 2020) for our keypoints-based method, and slightly better results than the direct estimation implementation from (Jiang et al., 2019). This can be explained by the higher number of dense layers added at the end of our Resnet18 architecture. Our combination of 2 very different strategies allows the complete model to react better to outliers and to bypass the condition of 4 non-collinear points needed by the keypoints-based model.

Producing high-quality estimations of the homography is a hard task with many consequences. If an estimation is incorrect, the players' coordinates will be just as bad and produce noisy or false trajectories. For this reason, it is useful to manually check some estimations by hand during the tracking process, and remove them if they are too far from the truth. The more detailed results of the keypoints mask prediction are available in Table 3.
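Part of that manual check can be automated by comparing the two HE estimates on a grid of image points and flagging frames where they disagree. The sketch below is an assumption-level illustration (the grid size and threshold are made up), not the paper's exact procedure.

```python
import numpy as np
import cv2

def homographies_disagree(H_direct, H_keypoints, img_size=512, max_m=2.0):
    """Flag a frame when the direct and keypoints-based homographies
    (image -> 2D field, in meters) send image points too far apart."""
    xs = np.linspace(0, img_size - 1, 5)
    grid = np.float32([[x, y] for x in xs for y in xs]).reshape(1, -1, 2)
    pts_a = cv2.perspectiveTransform(grid, H_direct)
    pts_b = cv2.perspectiveTransform(grid, H_keypoints)
    error = np.linalg.norm(pts_a - pts_b, axis=-1).mean()
    return error > max_m  # if True, review or discard this estimation
```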
REIDENTIFICATION results. Since we have no dataset of soccer player re-identification, it is impossible to evaluate our embedding model. However, we can count the number of operations we had to do manually after a tracking is produced. On average, we had to delete 3.5 trajectories per action (one trajectory to remove every 60.57 frames, or 3 seconds, on average). This number includes the referees that we also detect in our dataset and model. We merge 5 trajectories per action on average (e.g. a player is not recognized and gets a new id). Finally, we manually add 3.75 missing coordinates on average, mostly when we do not find the ball at a critical moment (e.g. the beginning of a pass or a shot).

We present the results of our entire model in Figure 5 for 2 different actions, and give a more in-depth analysis of when our model successfully tracks each player even when the tracking model fails to detect them, but also highlight particular cases of Re-Identification failure. We discuss the rest of the architectural choices in the Supplementary Materials-C. We also display more results for each model in Figure C.3.

5.2. Valuing expected discounted goals

Our Agent imparts a satisfactory understanding of soccer and soccer dynamics, even while training on simulations. First, Table 4 shows the Average Goal Difference between our Agent and its opponent at the end of each training step (50M updates): not only does the Agent learn to beat easy and medium bots, but it also learns more complex dynamics while training against itself. The major impact of this selfplay component can be seen in Figure 7 (right), and in
Figure 6. EDG mapping based on the position of the ball. In (1), passing the ball to the midfield would release pressure and increase the EDG, but lose momentum. A forward pass inside the box or to the right wing would also increase the EDG and keep the current momentum. In (2), a forward pass to the right wing would have less impact than one inside the box, given the current players' velocities. We present in (3) an evolution of the EDG over time. In the first frame, a pass to the middle would help release the high pressure. The team prefers to stay on the right side and to create momentum towards the goal. The EDG then evolves to increase on the wing. The EDG does not take into account the different physical capabilities of each player, and thus underestimates the potential of the team's attacking speed.
Figure 7. (left) We present here the EDG over time alongside the tracking of players. The green arrows represent the action taken in the next frame by the player. Between frames (1) and (2), the EDG increases with the long forward pass to the right wing. It keeps increasing from frame (2) to (3) with a decisive pass inside the box, before decreasing due to the very high pressure. It increases again with a shot. (right) Comparison between our agent's EDG map and that of an agent trained for 50M steps against an easy bot. The easy agent attributes on average the same potential to every location on the field. It shows how training against a more difficult agent and self-play helped our agent learn real soccer players' behavior.
the Supplementary Materials Figure C.4. We displayed on these maps the EDG estimated by our Agent if the ball were located at this location. After the first 2 steps of training, the Agent attributes on average the same EDG to each location of the field. After selfplay, a more realistic and insightful potential is discovered.

This potential can be seen in Figure 6 for 4 different games. While the front of the center circle and the penalty spot both get a high EDG (for releasing the pressure and a high probability of goal, respectively), the Agent also accurately captures the most dangerous zones. For example, in both Figure 6 (1) and (2), a forward pass to the right wing would bring a lot of value to the action. Since the Agent's inputs take into consideration the four last frames, it can consider each entity's velocity.

Our model can also estimate the EDG of an action over time. We present in Figure 7 (left) the evolution of the EDG during an action, with the player tracking as well. While our Agent can capture the potential of an action, we believe that 2 components make the EDG extremely valuable for soccer analysis:

The discount factor, γ. We chose a discount factor of 0.993, meaning that if a goal takes place 100 frames later, it brings a value of $0.993^{100} \approx 0.5$. It reduces the bias from which a single goal would bring immense value to each previous action, even ones that are far away. This
Training Step    Opponent         Average Goal Difference
Step 1           Easy Bot         8.14
Step 2           Medium Bot       4.7
Step 3           end of Step 2    5.3
Step 4           end of Step 3    4.8

Table 4. Average Goal Differences at the end of the training between our Agent and the opponent.

Figure 8. We compare here, on the same scenario, VAEP (Decroos et al., 2019) (a) and our EDG (b). Our EDG is negative on pass (1), as we take into account the 2 close surrounding opponents to the receiver. The next 2 passes are highly important, and get more value in our framework as they allow to: (2) release pressure, (3) add momentum and danger close to the goal. The next two actions get overall the same value as the VAEP. However, the shot gets a much smaller value within our framework, as our model discounts the goal and attributes the potential of an action more uniformly.

5.3. Comparison with existing frameworks

One way to ensure our Agent produces good analysis is to compare it to other algorithms whose purpose is to evaluate soccer players. We focus on 2 different models: the Expected Possession Value (EPV) from Fernández (2019), and Valuing Actions by Estimating Probabilities (VAEP) from Decroos et al. (2019). We compare the three approaches on the scenarios they used in their papers, as we do not have access to their training procedures or their data.

Overall, we find that our approach, while never trained on data extracted from a real game, finds on average the same value as the other 2 approaches. The EPV and our Agent always agree on the positive or negative impact of an action, and when the surroundings are not too impactful, the VAEP and our Agent also agree on the value of an action⁵.

It does, however, find more precise and meaningful values for more subtle actions. This is a powerful result and proof that our Agent learnt realistic soccer behaviors in a simulation, and can understand and process information at various levels: the actions a player can perform, pressing on a player, players' velocities. For example, a forward pass would get a good value from the VAEP algorithm, while our Agent could give it a weak value because opponents are close, or if a pass is risky because of the field limits. Overall, we find that the value of a goal is much more uniformly distributed over the previous actions with our framework. For example, an assist and its related goal would get a value distribution of 10% and 90% in the other frameworks, while our Agent would give a distribution of 35% and 65%.

⁴ Such as completely splitting the relations between passes and shots, for example, or attributing the same reward to each action that led to a goal, or only to the last 10.
⁵ Since the VAEP doesn't take into consideration if opponents are close to a player, for example.

6. Conclusion

We presented a generic framework for soccer player tracking and evaluation based on machine learning and Deep Reinforcement Learning. Our experiments show that we can produce trajectories of players from only a single camera, and that our EDG framework captures more insightful potential than previous models, while being based on simulations. We find that both the tracking model and the EDG agent are the first of their kind to be open-sourced, while being more accurate than previous approaches.

Regarding the tracking models, our model could easily be extended to other sports, with appropriate datasets. Some
Figure 9. We compare here, on the same scenario, the EPV (Fernández, 2019) and our EDG. We pick up the same increase in value for actions A and B, while our model doesn't value action C as much. On the other hand, our model gives a lot of value to action a, which is a long forward pass to the left wing, and a strong situation that could lead to crossing the ball to a teammate. We also kept track of the end of the action (situation b), with a great pass followed by a caught header.
improvements may be obtained by building a soccer re-identification dataset or building an unsupervised approach.

Our DRL Agent would benefit a lot from being replaced by a multi-agent algorithm, where more than one player is controlled at once. Improvements to the soccer environment would also help a lot: bringing more physical statistics for players or particular teams, for example. While we focus on pure simulation-based training, finetuning the Agent on expert data from real games could help bridge the gap with a more realistic playstyle.

Finally, this work aims to give a broader audience access to soccer tracking data and analysis tools. We also hope that our datasets will help to build more performant models over time.

Acknowledgements

We thank Arthur Verrez and Gauthier Guinet for valuable feedback on the work and manuscript, and we thank Matthias Cremieux for advice on object detection models. We also thank Last Row for the first tracking data we used to test our agent.

References

Citraro, L., Márquez-Neila, P., Savarè, S., Jayaram, V., Dubout, C., Renaut, F., Hasfura, A., Ben Shitrit, H., and Fua, P. Real-time camera pose estimation for sports fields. Machine Vision and Applications, 31(3), Mar 2020. ISSN 1432-1769. doi: 10.1007/s00138-020-01064-7.

Decroos, T., Bransen, L., Van Haaren, J., and Davis, J. Actions speak louder than goals. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD, 2019. doi: 10.1145/3292500.3330758.

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., Legg, S., and Kavukcuoglu, K. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures, 2018.

Fernández, J., Bornn, L., and Cervone, D. Decomposing the immeasurable sport: A deep learning expected possession value framework for soccer. 2019.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition, 2015.

Homayounfar, N., Fidler, S., and Urtasun, R. Sports field localization via deep structured models. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4012-4020, 2017.

Jiang, W., Higuera, J. C. G., Angles, B., Sun, W., Javan, M., and Yi, K. M. Optimizing through learned errors for accurate sports field registration, 2019.

Johnson, N. Extracting player tracking data from video using non-stationary cameras and a combination of computer vision techniques. 2020.

Kaarthick, P. An automated player detection and tracking in basketball game. Computers, Materials & Continua, 58:625-639, 01 2019. doi: 10.32604/cmc.2019.05161.

Kalman, R. E. A new approach to linear filtering and prediction problems. Transactions of the ASME - Journal of Basic Engineering, 1960.

Kendall, A., Grimes, M., and Cipolla, R. PoseNet: A convolutional network for real-time 6-DoF camera relocalization, 2015.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization, 2014.

Kurach, K., Raichuk, A., Stanczyk, P., Zajac, M., Bachem, O., Espeholt, L., Riquelme, C., Vincent, D., Michalski, M., Bousquet, O., and Gelly, S. Google research football: A novel reinforcement learning environment, 2019.

Le, H., Liu, F., Zhang, S., and Agarwala, A. Deep homography estimation for dynamic scenes, 2020.

Lewis, M. Moneyball: The Art of Winning an Unfair Game. 2003.

Liang, Q., Wu, W., Yang, Y., Zhang, R., Peng, Y., and Xu, M. Multi-player tracking for multi-view sports videos with improved k-shortest path algorithm. Appl. Sci., 864, 2020.

Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. Feature pyramid networks for object detection, 2016.

Lowe, D. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91-110, 11 2004. doi: 10.1023/B:VISI.0000029664.99615.94.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Ramanathan, V., Huang, J., Abu-El-Haija, S., Gorban, A. N., Murphy, K., and Fei-Fei, L. Detecting events and key actors in multi-person videos. CoRR, abs/1511.02917, 2015.

Schoenfeld, B. How data (and some breathtaking soccer) brought Liverpool to the cusp of glory. New York Times, 05 2019.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms, 2017.

Sharma, R. A., Bhat, B., Gandhi, V., and Jawahar, C. V. Automated top view registration of broadcast football videos. CoRR, abs/1703.01437, 2017.

Spearman, W. Beyond expected goals. 03 2018.

Sutton, R. S. and Barto, A. G. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN 0262193981.

Tan, M. and Le, Q. V. EfficientNet: Rethinking model scaling for convolutional neural networks, 2019.

Yakubovskiy, P. Segmentation models. https://github.com/qubvel/segmentation_models, 2019.

Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., and Tian, Q. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, 2015.

Zheng, Z., Yang, X., Yu, Z., Zheng, L., Yang, Y., and Kautz, J. Joint discriminative and generative learning for person re-identification. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
$$X_2 = H X_1 = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix} X_1$$

where we assume $h_{33} = 1$ to normalize $H$, since $H$ only has 8 degrees of freedom as it is estimated only up to a scale factor. An example of such a homography is available in Figure A.1 (left). The equation above yields the following 2 equations:

$$x_2 = \frac{h_{11} x_1 + h_{12} y_1 + h_{13}}{h_{31} x_1 + h_{32} y_1 + 1}, \qquad y_2 = \frac{h_{21} x_1 + h_{22} y_1 + h_{23}}{h_{31} x_1 + h_{32} y_1 + 1}$$

or, more concisely:

$$\begin{pmatrix} x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1 x_2 & -y_1 x_2 \\ 0 & 0 & 0 & x_1 & y_1 & 1 & -x_1 y_2 & -y_1 y_2 \end{pmatrix} h = \begin{pmatrix} x_2 \\ y_2 \end{pmatrix}$$

where $h = [h_{11}, h_{12}, h_{13}, h_{21}, h_{22}, h_{23}, h_{31}, h_{32}]^\top$. We can stack such constraints for $n$ pairs of points, leading to a system of equations of the form $Ah = b$, where $A$ is a $2n \times 8$ matrix and $b$ stacks the corresponding $(x_2, y_2)$ values. Given the 8 degrees of freedom and the system above, we need at least 8 points (4 in each plane) to compute an estimation of our homography. This is the method we used to build the HOMOGRAPHY DATASET, and it is also the method we use to compute the homography after our keypoints predictions. An example of the point correspondences in soccer is available in Figure A.1 (right).

In reality, our Direct Estimation model doesn't predict the 8 homography parameters directly, but the coordinates of 4 control points instead (Jiang et al., 2019). The predicted coordinates of these 4 control points are then used to estimate the homography using the system of equations above.
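For illustration, the stacked system can be solved in the least-squares sense with a few lines of NumPy. The names `src` and `dst` are illustrative; in practice a robust routine such as OpenCV's findHomography (with RANSAC) would typically be preferred.

```python
import numpy as np

def estimate_homography(src, dst):
    """src, dst: (n, 2) arrays of matching points, n >= 4.
    Solves A h = b for the 8 parameters, then appends h33 = 1."""
    rows, rhs = [], []
    for (x1, y1), (x2, y2) in zip(src, dst):
        rows.append([x1, y1, 1, 0, 0, 0, -x1 * x2, -y1 * x2])
        rows.append([0, 0, 0, x1, y1, 1, -x1 * y2, -y1 * y2])
        rhs.extend([x2, y2])
    A, b = np.asarray(rows, float), np.asarray(rhs, float)
    h, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.append(h, 1.0).reshape(3, 3)
```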
Figure A.1. (left) Example of a planar surface (our soccer field) viewed by two camera positions: the 2D field camera and the Live camera. Finding the homography between them allows us to produce 2D coordinates for each player. (right) Example of point correspondences between the 2D Soccer Field Camera and the Live Camera.
Table B.1. List of actions the agent can choose at each timestep.
Figure B.1. Example of the Tracking Dataset (a), with the bounding boxes for Players, Referees and the Ball. In (b) are shown examples
of the Keypoints Dataset, with the position of the keypoints highlighted in blue.
C. Supplementary results
C.1. Architectural choices for tracking
In the following paragraphs, we discuss some failed model experiments and architectural choices. Besides the complete model we presented above, we tested 4 other configurations:
For (1), we tried different scenarios based on a first direct homography estimation model. Giving the first model's prediction to a second one, also based on Resnet18, does not improve the performance significantly (0.8%) (confirming previous results (Jiang et al., 2019)). We also tried to stack the field, warped with our homography estimation, onto our original image, but without any improvement at all. Finally, using both the warped field and the homography estimation doesn't improve the performance either.
For (2), we implemented and trained a 3-step multi-scale model from (Le et al., 2020). This model progressively estimates and refines the homography, from a third of the image to the entire one, in addition to the warped field from the previous step's estimation. While promising, this model did not lead to better results. We believe that this is a result of the high degree of transformation between the field and the camera input. However, it could be a great model for estimating the homography modifications happening between 2 consecutive frames.
For (3), we realized that producing the image in higher quality and then splitting it into separate panels allowed the tracking model to detect more players. An example is available in Figure C.1. We compared the results of the tracking model on 3 experiments: detection on the entire image (with shape (1024, 1024)), 4 detections on 4 panels (each of shape (512, 512)), and 8 detections on panels of shape (256, 256). Overall, we found that the second experiment led to the best results.
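A minimal sketch of that panel-splitting experiment, assuming a hypothetical `detector` callable that returns boxes as (x1, y1, x2, y2) in panel coordinates:

```python
import numpy as np

def detect_on_panels(image, detector, panel=512):
    """Run the detector on (panel, panel) tiles of a larger frame and
    shift the boxes back into full-image coordinates."""
    all_boxes, all_scores = [], []
    for oy in range(0, image.shape[0], panel):
        for ox in range(0, image.shape[1], panel):
            boxes, scores = detector(image[oy:oy + panel, ox:ox + panel])
            if len(boxes):
                all_boxes.append(boxes + np.array([ox, oy, ox, oy]))
                all_scores.append(scores)
    if not all_boxes:
        return np.empty((0, 4)), np.empty(0)
    return np.concatenate(all_boxes), np.concatenate(all_scores)
```

As Figure C.1 notes, players cropped at panel borders can be missed, so the panel detections can be combined with a direct full-image prediction.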
For (4), we tried to apply a Savitzky-Golay (Savgol) filter to the sequence of estimated homographies over the entire action. However, the results were mixed: it would help the stability over time, but would completely ruin the tracking as soon as one outlier was produced. To avoid such behaviour, we ended up applying no filter to the sequence of homographies.
Figure C.1. Example of the difference in results between a direct prediction and a split prediction. The split prediction gives better results on average, while letting some players (the cropped ones) be missed. This issue can be addressed by using both the direct prediction and the split one.
Parameter Value
Clipping Range 0.08
Discount Factor 0.993
Entropy Coefficient 0.003
GAE 0.95
Gradient Norm Clipping 0.64
Learning Rate 0.000343
Number of Actors 16
Training Epochs per Update 2
Value Function Coefficient 0.5
Table C.1. Hyperparameters used for each global step of the Agent training. We reused these for selfplay too.
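For illustration only, here is how the values in Table C.1 would map onto a Stable-Baselines3 PPO configuration with the Google Research Football environment (Kurach et al., 2019). The choice of library, policy, and scenario name are assumptions; the paper does not state the exact implementation.

```python
# Hedged sketch: Table C.1 hyperparameters plugged into SB3 PPO.
import gfootball.env as football_env
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

make_env = lambda: football_env.create_environment(
    env_name="11_vs_11_easy_stochastic",  # assumed scenario for "Easy Bot"
    representation="extracted")

model = PPO(
    "CnnPolicy",
    make_vec_env(make_env, n_envs=16),  # Number of Actors
    learning_rate=3.43e-4,              # Learning Rate
    n_epochs=2,                         # Training Epochs per Update
    gamma=0.993,                        # Discount Factor
    gae_lambda=0.95,                    # GAE
    clip_range=0.08,                    # Clipping Range
    ent_coef=0.003,                     # Entropy Coefficient
    vf_coef=0.5,                        # Value Function Coefficient
    max_grad_norm=0.64,                 # Gradient Norm Clipping
)
```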
Figure C.2. Ablation study on the keypoints dataset to generate masks. Each model is plotted based on its number of parameters and its metric. We only tested one backbone for the LinkNet and Unet architectures because of their poor results in comparison to the FPN.
Figure C.3. Additional results from each component of our tracking model. Each image was selected randomly, not cherry-picked.
Figure C.4. Another example following the idea from Figure 7 (right). Again, we show here how the easy Agent doesn’t grasp any
meaningful information from the game.