
Predicting Baseball Wins Using Machine Learning

Mian Ahmad Nasrat Ullah


Teesside University
Student ID: [Your Student ID Here]
Email: [email protected]

Abstract—This project aimed to build a predictive model for forecasting the number of games a baseball team can win in a season. Using a historical dataset that included team performance stats before the year 2002, several machine learning algorithms like Linear Regression, Random Forest Regressor, SVR, and AdaBoost Regressor were tested. After data cleaning and feature imputation using KNN, models were trained and evaluated based on RMSE. Random Forest gave the best performance. The report also touches on the ethical and legal considerations involved in using machine learning for sports analytics and decision-making.
I. INTRODUCTION
Baseball has always been deeply associated with numbers and statistics. Every play, every hit, every score is measured, recorded, and analyzed. This makes the sport an ideal domain for applying machine learning techniques, especially regression-based models. The main goal of this project was to build a system that could take in various team-level statistics and then predict how many games a team would win in that particular season.

This project focuses on data before the year 2002, filtered intentionally to avoid modern rule changes or player behavior shifts that might affect the outcomes. The data comes in CSV format (baseball.csv) and includes performance metrics like the number of runs a team scored or allowed, their offensive and defensive on-base and slugging percentages, batting averages, and several rankings. The model aims to learn patterns from this data and be able to forecast wins (W), a key indicator of overall performance in any baseball league. The idea is not just to find a model that works well, but to understand why it works and what the most impactful features are.

A number of regression algorithms were chosen for this experiment. These included basic models like Linear Regression, more advanced ensemble methods like Random Forest and AdaBoost, and a support vector-based approach using SVR.
II. DATASET OVERVIEW AND PREPARATION

The dataset initially had 614 rows, but after filtering to exclude seasons from 2002 onwards, we ended up with fewer rows. This was acceptable for the purpose of this task.

The dataset had several missing values in key features like OBP and SLG. We used KNN Imputer with 1 nearest neighbor to fill these missing values. This approach works by looking at similar rows to estimate missing entries, making the assumption that nearby data points (in feature space) are likely to share similar values.

Next, only numerical features were selected for modeling, and this included:
• RS, RA, OBP, SLG, BA
• OOBP, OSLG, G
• RankSeason, RankPlayoffs, Playoffs
Categorical features were either irrelevant or already encoded numerically.

Data was normalized using StandardScaler, which brings all features to a similar scale. This step is especially important for models like SVR, which are sensitive to feature magnitude.
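As a concrete illustration of this preprocessing, the sketch below loads the data, filters out the 2002+ seasons, imputes missing values with a single-neighbor KNN imputer, and standardizes the features. It is a minimal sketch, assuming the season column is called Year and that the column names match the feature list above; it is not the exact notebook code.

import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Load the dataset and keep only seasons before 2002
# (the season column name "Year" is an assumption).
df = pd.read_csv("baseball.csv")
df = df[df["Year"] < 2002]

# Numerical features used for modeling, and the target (wins).
features = ["RS", "RA", "OBP", "SLG", "BA",
            "OOBP", "OSLG", "G",
            "RankSeason", "RankPlayoffs", "Playoffs"]
target = "W"

# Fill missing values (e.g. in OBP and SLG) by copying from the
# single most similar row in feature space.
X = KNNImputer(n_neighbors=1).fit_transform(df[features])

# Standardize so that scale-sensitive models such as SVR are not
# dominated by large-valued columns.
X = StandardScaler().fit_transform(X)
y = df[target].to_numpy()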
III. EXPLORATORY DATA ANALYSIS

Before jumping to model building, it was important to understand the relationships between features. A correlation heatmap revealed that the number of wins (W) had strong correlations with RS and RA, which makes sense: scoring more and allowing fewer runs generally results in more wins. Surprisingly, features like Playoffs or RankSeason did not show strong linear relationships with wins, possibly because they are more outcomes than causes. That said, their indirect effects might still be captured by complex models like ensemble regressors.

Histograms showed that the target variable W was roughly normally distributed, with a few outliers at both ends from teams that performed extremely well or poorly.
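A minimal sketch of how the heatmap and histogram could be produced with seaborn and matplotlib is given below; it reuses the df, features and target names from the preprocessing sketch and is illustrative rather than the exact code used.

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap of the numerical features and the target W.
corr = df[features + [target]].corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation heatmap")
plt.show()

# Distribution of the target: roughly normal, with a few
# extreme seasons at both ends.
df[target].hist(bins=30)
plt.xlabel("Wins (W)")
plt.ylabel("Team seasons")
plt.show()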
IV. MODEL SELECTION AND IMPLEMENTATION

The following machine learning regression algorithms were selected and implemented using Python's scikit-learn:

1) Linear Regression
As a baseline, this simple model was trained first. It helps to get a quick look at the linearity of the problem. The implementation was straightforward using LinearRegression() from sklearn. However, it did not handle the slight non-linearity and interaction effects well.

2) Support Vector Regression (SVR)
SVR is useful for capturing non-linear relationships. A grid search over different kernels and hyperparameters (C and epsilon) was done to find the optimal configuration. However, SVR struggled with performance, possibly due to the small dataset and many irrelevant features. It also took significantly longer to train.

3) Random Forest Regressor
This model combines multiple decision trees and averages the predictions. It handles both linear and non-linear relationships well and is less likely to overfit due to bootstrapping. The Random Forest Regressor produced the best results in terms of RMSE.

4) AdaBoost Regressor
This boosting technique tries to correct the mistakes of weak learners in a sequential way. It gave a decent performance, slightly worse than Random Forest but better than SVR. It was sensitive to hyperparameters like the learning rate and the number of estimators.

All models were evaluated using Root Mean Square Error (RMSE). This metric penalizes larger errors more than smaller ones, making it suitable for this regression task. A sketch of the training and evaluation loop is shown below.
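The sketch below shows how the four models could be trained and compared on a held-out test set, including a small grid search over the kernel, C and epsilon for the SVR as described above. The split ratio, hyperparameter grid and random seed are illustrative assumptions, not the values used in the original experiment.

import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Grid search for SVR over kernel, C and epsilon (example grid).
svr = GridSearchCV(SVR(),
                   param_grid={"kernel": ["rbf", "linear"],
                               "C": [0.1, 1, 10],
                               "epsilon": [0.1, 0.5, 1.0]},
                   scoring="neg_root_mean_squared_error",
                   cv=5)

models = {
    "Linear Regression": LinearRegression(),
    "SVR": svr,
    "AdaBoost Regressor": AdaBoostRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
}

# RMSE penalizes large errors more heavily; lower is better.
for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"{name}: RMSE = {rmse:.2f}")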
V. MODEL EVALUATION

TABLE I: Model RMSE Comparison

Model                 RMSE (wins)
Linear Regression         7.18
SVR                      10.88
AdaBoost Regressor        6.90
Random Forest             5.51

From the table, Random Forest clearly had the best performance. Its ensemble structure helped generalize well even with noisy or missing data. It also provided a built-in feature importance metric, which helped understand what drove the predictions.
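For reference, the metric reported in Table I is the root mean square error over the n evaluated team seasons, where y_i is the true win total and \hat{y}_i the predicted one:

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}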
VI. FEATURE IMPORTANCE

Using the Random Forest model, we extracted feature importances. The most significant predictors were:
• RS (most important)
• RA, OBP, SLG (moderate)
• RankSeason, OOBP (low importance)

Interestingly, some features like RankSeason, OOBP, and Playoffs had very low importances. This suggests that simple performance stats like runs and on-base stats are better indicators of season wins than final rankings or playoff status.
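As an illustration, the built-in importances of a fitted RandomForestRegressor can be read off as follows; the models and features names refer back to the earlier sketches and are assumptions.

# Impurity-based importances exposed by a fitted Random Forest.
rf = models["Random Forest"]
ranking = sorted(zip(features, rf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")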
VII. DISCUSSION

This project highlights that machine learning can be very useful in sports analytics. By training models on historical data, it is possible to forecast a team's performance to a good degree of accuracy.

However, there are limitations. The dataset used was relatively small and from before 2002, which might not reflect current trends. Also, all features are at the team level; more granular player-level stats could further improve accuracy.

Ensemble models like Random Forest and AdaBoost were clearly better suited for this kind of task. They handled outliers and missing data more gracefully than linear models or SVR. It was also found that no single model is perfect. Depending on the use case (e.g., speed vs accuracy), different models may be chosen.
VIII. ETHICAL, LEGAL, AND PROFESSIONAL ISSUES

Even though this dataset contains only public sports statistics, any use of personal data (such as player salary, injury status, or ethnicity) would trigger ethical and legal concerns.

From an ethical perspective, models must not be used to unfairly judge players or teams. For instance, if a model recommends benching a player based purely on historic stats without considering context, it could lead to unfair decisions. Moreover, if such models are used in gambling or sports betting, they could enable manipulation or even match fixing, which is illegal and harmful to the sport's integrity.

Transparency and explainability also matter. Teams and coaches using ML tools should be able to understand why a prediction was made, not just what the outcome is. Black-box models might offer good accuracy but fail on interpretability.
IX. FUTURE WORK

There are several ways this project could be improved:
1) Use player-level statistics: More granular data could help refine predictions.
2) Model time-based trends: Stats change over years; time series models like LSTM could be considered.
3) Test on modern datasets: Applying the model to data from 2002–2023 could validate generalizability.
4) Use AutoML tools: Automated pipelines could help find optimal models faster.
5) Cross-validation and ensembling: More robust evaluation using k-fold cross-validation and model stacking (see the sketch after this list).
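As an example of the more robust evaluation suggested in item 5, a k-fold cross-validated RMSE for the Random Forest could be computed as follows; five folds and the reuse of X and y from the earlier sketches are assumptions.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# scikit-learn returns negated RMSE scores, so flip the sign.
scores = cross_val_score(RandomForestRegressor(random_state=42),
                         X, y, cv=5,
                         scoring="neg_root_mean_squared_error")
print(f"Cross-validated RMSE: {(-scores).mean():.2f}")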
X. CONCLUSION

In conclusion, this project has successfully built a predictive system for forecasting baseball wins using historical performance data. It went through data preprocessing, exploratory analysis, modeling using different algorithms, evaluation, and interpretation. The Random Forest model stood out as the most accurate, while also being interpretable.

The project not only demonstrates the application of machine learning in sports analytics but also opens the door for more advanced future work in this area. Careful attention must also be given to ethical and legal aspects when deploying such models in real scenarios.
