
Predicting Baseball Wins Using Machine Learning

Mian Ahmad Nasrat Ullah


Teesside University
Student ID: [Your Student ID Here]
Email: [email protected]

Abstract—This project aimed to build a predictive model for forecasting the number of games a baseball team can win in a season. Using a historical dataset that included team performance stats before the year 2002, several machine learning algorithms like Linear Regression, Random Forest Regressor, SVR, and AdaBoost Regressor were tested. After data cleaning and feature imputation using KNN, models were trained and evaluated based on RMSE. Random Forest gave the best performance. The report also touches on the ethical and legal considerations involved in using machine learning for sports analytics and decision-making.
I. INTRODUCTION
Baseball has always been deeply associated with numbers and statistics. Every play, every hit, every score is measured, recorded, and analyzed. This makes the sport an ideal domain for applying machine learning techniques, especially regression-based models. The main goal of this project was to build a system that could take in various team-level statistics and then predict how many games a team would win in that particular season.

This project focuses on data before the year 2002, filtered intentionally to avoid modern rule changes or player behavior shifts that might affect the outcomes. The data comes in CSV format (baseball.csv) and includes performance metrics like the number of runs a team scored or allowed, their offensive and defensive on-base and slugging percentages, batting averages, and several rankings. The model aims to learn patterns from this data and be able to forecast wins (W), a key indicator of overall performance in any baseball league. The idea is not just to find a model that works well, but to understand why it works and what the most impactful features are.

A number of regression algorithms were chosen for this experiment. These included basic models like Linear Regression, more advanced ensemble methods like Random Forest and AdaBoost, and a support vector-based approach using SVR.
II. DATASET OVERVIEW AND PREPARATION

The dataset initially had 614 rows, but after filtering to exclude seasons from 2002 onwards, we ended up with fewer rows. This was acceptable for the purpose of this task.

The dataset had several missing values in key features like OBP and SLG. We used KNN Imputer with 1 nearest neighbor to fill these missing values. This approach works by looking at similar rows to estimate missing entries, making the assumption that nearby data points (in feature space) are likely to share similar values.

Next, only numerical features were selected for modeling, and this included:
• RS, RA, OBP, SLG, BA
• OOBP, OSLG, G
• RankSeason, RankPlayoffs, Playoffs
Categorical features were either irrelevant or already encoded numerically.

Data was normalized using StandardScaler, which brings all features to a similar scale. This step is especially important for models like SVR, which are sensitive to feature magnitude.
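As a concrete illustration of this preprocessing, the sketch below loads the data, filters out the 2002+ seasons, imputes missing values with a single-neighbor KNN imputer, and standardizes the features. It is a minimal sketch, assuming the season column is called Year and that the column names match the feature list above; it is not the exact notebook code.

import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Load the dataset and keep only seasons before 2002
# (the season column name "Year" is an assumption).
df = pd.read_csv("baseball.csv")
df = df[df["Year"] < 2002]

# Numerical features used for modeling, and the target (wins).
features = ["RS", "RA", "OBP", "SLG", "BA",
            "OOBP", "OSLG", "G",
            "RankSeason", "RankPlayoffs", "Playoffs"]
target = "W"

# Fill missing values (e.g. in OBP and SLG) by copying from the
# single most similar row in feature space.
X = KNNImputer(n_neighbors=1).fit_transform(df[features])

# Standardize so that scale-sensitive models such as SVR are not
# dominated by large-valued columns.
X = StandardScaler().fit_transform(X)
y = df[target].to_numpy()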
III. EXPLORATORY DATA ANALYSIS

Before jumping to model building, it was important to understand the relationships between features. A correlation heatmap revealed that the number of wins (W) had strong correlations with RS and RA, which makes sense: scoring more and allowing fewer runs generally results in more wins. Surprisingly, features like Playoffs or RankSeason did not show strong linear relationships with wins, possibly because they are more outcomes than causes. That said, their indirect effects might still be captured by complex models like ensemble regressors.

Histograms showed that the target variable W was roughly normally distributed, with a few outliers at both ends from teams that performed extremely well or poorly.
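A minimal sketch of how the heatmap and histogram could be produced with seaborn and matplotlib is given below; it reuses the df, features and target names from the preprocessing sketch and is illustrative rather than the exact code used.

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation heatmap of the numerical features and the target W.
corr = df[features + [target]].corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation heatmap")
plt.show()

# Distribution of the target: roughly normal, with a few
# extreme seasons at both ends.
df[target].hist(bins=30)
plt.xlabel("Wins (W)")
plt.ylabel("Team seasons")
plt.show()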
IV. MODEL SELECTION AND IMPLEMENTATION

The following machine learning regression algorithms were selected and implemented using Python's scikit-learn:

1) Linear Regression
As a baseline, this simple model was trained first. It helps to get a quick look at the linearity of the problem. The implementation was straightforward using LinearRegression() from sklearn. However, it did not handle the slight non-linearity and interaction effects well.

2) Support Vector Regression (SVR)
SVR is useful for capturing non-linear relationships. A grid search over different kernels and hyperparameters (C and epsilon) was done to find the optimal configuration. However, SVR struggled with performance, possibly due to the small dataset and many irrelevant features. It also took significantly longer to train.

3) Random Forest Regressor
This model combines multiple decision trees and averages the predictions. It handles both linear and non-linear relationships well and is less likely to overfit due to bootstrapping. The Random Forest Regressor produced the best results in terms of RMSE.

4) AdaBoost Regressor
This boosting technique tries to correct the mistakes of weak learners in a sequential way. It gave a decent performance, slightly worse than Random Forest but better than SVR. It was sensitive to hyperparameters like the learning rate and the number of estimators.

All models were evaluated using Root Mean Square Error (RMSE). This metric penalizes larger errors more than smaller ones, making it suitable for this regression task. A sketch of the training and evaluation loop is shown below.
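The sketch below shows how the four models could be trained and compared on a held-out test set, including a small grid search over the kernel, C and epsilon for the SVR as described above. The split ratio, hyperparameter grid and random seed are illustrative assumptions, not the values used in the original experiment.

import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Grid search for SVR over kernel, C and epsilon (example grid).
svr = GridSearchCV(SVR(),
                   param_grid={"kernel": ["rbf", "linear"],
                               "C": [0.1, 1, 10],
                               "epsilon": [0.1, 0.5, 1.0]},
                   scoring="neg_root_mean_squared_error",
                   cv=5)

models = {
    "Linear Regression": LinearRegression(),
    "SVR": svr,
    "AdaBoost Regressor": AdaBoostRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
}

# RMSE penalizes large errors more heavily; lower is better.
for name, model in models.items():
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    print(f"{name}: RMSE = {rmse:.2f}")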
V. MODEL EVALUATION

TABLE I: Model RMSE Comparison

Model                 RMSE (wins)
Linear Regression         7.18
SVR                      10.88
AdaBoost Regressor        6.90
Random Forest             5.51

From the table, Random Forest clearly had the best performance. Its ensemble structure helped generalize well even with noisy or missing data. It also provided a built-in feature importance metric, which helped understand what drove the predictions.
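For reference, the metric reported in Table I is the root mean square error over the n evaluated team seasons, where y_i is the true win total and \hat{y}_i the predicted one:

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}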
VI. FEATURE IMPORTANCE

Using the Random Forest model, we extracted feature importances. The most significant predictors were:
• RS (most important)
• RA, OBP, SLG (moderate)
• RankSeason, OOBP (low importance)

Interestingly, some features like RankSeason, OOBP, and Playoffs had very low importances. This suggests that simple performance stats like runs and on-base stats are better indicators of season wins than final rankings or playoff status.
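As an illustration, the built-in importances of a fitted RandomForestRegressor can be read off as follows; the models and features names refer back to the earlier sketches and are assumptions.

# Impurity-based importances exposed by a fitted Random Forest.
rf = models["Random Forest"]
ranking = sorted(zip(features, rf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")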
VII. DISCUSSION

This project highlights that machine learning can be very useful in sports analytics. By training models on historical data, it is possible to forecast a team's performance to a good degree of accuracy.

However, there are limitations. The dataset used was relatively small and from before 2002, which might not reflect current trends. Also, all features are at the team level; more granular player-level stats could further improve accuracy.

Ensemble models like Random Forest and AdaBoost were clearly better suited for this kind of task. They handled outliers and missing data more gracefully than linear models or SVR. It was also found that no single model is perfect. Depending on the use case (e.g., speed vs accuracy), different models may be chosen.
VIII. ETHICAL, LEGAL, AND PROFESSIONAL ISSUES

Even though this dataset contains only public sports statistics, any use of personal data (such as player salary, injury status, or ethnicity) would trigger ethical and legal concerns.

From an ethical perspective, models must not be used to unfairly judge players or teams. For instance, if a model recommends benching a player based purely on historic stats without considering context, it could lead to unfair decisions. Moreover, if such models are used in gambling or sports betting, they could enable manipulation or even match fixing, which is illegal and harmful to the sport's integrity.

Transparency and explainability also matter. Teams and coaches using ML tools should be able to understand why a prediction was made, not just what the outcome is. Black-box models might offer good accuracy but fail on interpretability.
IX. FUTURE WORK

There are several ways this project could be improved:
1) Use player-level statistics: More granular data could help refine predictions.
2) Model time-based trends: Stats change over years; time series models like LSTM could be considered.
3) Test on modern datasets: Applying the model to data from 2002–2023 could validate generalizability.
4) Use AutoML tools: Automated pipelines could help find optimal models faster.
5) Cross-validation and ensembling: More robust evaluation using k-fold cross-validation and model stacking (see the sketch after this list).
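As an example of the more robust evaluation suggested in item 5, a k-fold cross-validated RMSE for the Random Forest could be computed as follows; five folds and the reuse of X and y from the earlier sketches are assumptions.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# scikit-learn returns negated RMSE scores, so flip the sign.
scores = cross_val_score(RandomForestRegressor(random_state=42),
                         X, y, cv=5,
                         scoring="neg_root_mean_squared_error")
print(f"Cross-validated RMSE: {(-scores).mean():.2f}")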
X. CONCLUSION

In conclusion, this project has successfully built a predictive system for forecasting baseball wins using historical performance data. It went through data preprocessing, exploratory analysis, modeling using different algorithms, evaluation, and interpretation. The Random Forest model stood out as the most accurate, while also being interpretable.

The project not only demonstrates the application of machine learning in sports analytics but also opens the door for more advanced future work in this area. Careful attention must also be given to ethical and legal aspects when deploying such models in real scenarios.
