A comprehensive machine learning system utilizing Gradient Boosting Regression Trees (GBRT) to analyze historical Olympic data and predict medal counts for the 2028 Los Angeles Summer Olympics.
Based on the research paper: Prediction and Analysis Based on the GBRT Model.
💡 Note: These plots are generated by running the visualization scripts in src/visualization/.
Forecasted total medal counts for top performing nations, including confidence intervals.

Analysis of factors contributing to Gold vs. Total medal counts.

Derived from the core GBRT analysis in our Research Paper.
A systematic approach to multi-feature regression for Olympic medal forecasting.

Global heatmap showing predicted gold medal distribution for the 2028 Olympics.

Heatmap showing the winning rates of top 10 countries across different events.

Residual plots validating the GBRT model's prediction accuracy for gold and total medals.

- Advanced Data Cleaning: Robust pipelines to handle multi-source data (Athletes, Medals, Hosts, Programs).
- Feature Engineering: Extraction of critical signals including Host Country Effect and Historical Momentum.
- GBRT Modeling: Optimized Gradient Boosting Regressor with Grid Search hyperparameter tuning.
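
As a rough illustration of the two signals named above, the sketch below derives a host flag and a lagged medal count with pandas. The table layout, the column names (`country`, `year`, `total`, `host`), the placeholder country codes, and the `build_features` helper are all hypothetical; the scripts in src/feature_engineering/ may structure this differently.

```python
import pandas as pd

# Toy medal table (made-up numbers): one row per country per Games.
medals = pd.DataFrame({
    "country": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "year": [2008, 2012, 2016, 2008, 2012, 2016],
    "total": [10, 12, 11, 5, 9, 6],
})
# Toy host table (made-up): one row per Games.
hosts = pd.DataFrame({"year": [2008, 2012, 2016], "host": ["CCC", "BBB", "DDD"]})


def build_features(medals: pd.DataFrame, hosts: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add a host flag and the previous Games' total."""
    df = medals.merge(hosts, on="year", how="left")
    df = df.sort_values(["country", "year"])
    # Host Country Effect: 1 when the country hosts that edition, else 0.
    df["is_host"] = (df["country"] == df["host"]).astype(int)
    # Historical Momentum: total medals at the country's previous Games.
    df["prev_total"] = df.groupby("country")["total"].shift(1)
    return df.drop(columns="host").reset_index(drop=True)


features = build_features(medals, hosts)
print(features)
```

The first Games for each country has no predecessor, so `prev_total` is missing there; whether to drop or impute those rows is a modeling choice this sketch leaves open.
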
Predicting-Medals/
├── data/
│ ├── raw/ # Original datasets (from Kaggle/Olympic.org)
│ └── processed/ # Cleaned and merged CSVs
├── docs/ # Research papers and documentation
├── outputs/ # Generated plots and prediction results
├── result/ # Static visualization assets from research paper
├── src/
│ ├── analysis/ # Statistical analysis scripts (Coach effect, etc.)
│ ├── data_cleaning/ # Preprocessing pipelines
│ ├── feature_engineering/ # Feature construction
│ ├── models/ # GBRT and Linear Regression models
│ └── visualization/ # Plotting scripts
├── requirements.txt # Project dependencies
├── .gitignore # Git exclusion rules
├── LICENSE # MIT License
└── README.md

- Clone the repository
git clone https://github.com/1EchA/Predicting-medals.git
cd Predicting-medals
- Install dependencies
pip install -r requirements.txt
- Prepare Data: Unzip the raw data into the data/raw directory:
unzip data/Data.zip -d data/raw/

To reproduce the analysis and predictions, follow this pipeline:
Standardize names, handle missing values, and merge datasets.
python src/data_cleaning/clean_athletes.py
python src/data_cleaning/clean_medals.py
python src/data_cleaning/clean_hosts.py
python src/data_cleaning/clean_programs.py

Construct the training dataset with historical features.
python src/feature_engineering/build_dataset.py
python src/feature_engineering/merge_events.py

Train the GBRT model and generate predictions for 2028.
python src/models/gbrt_model.py

Generate the plots shown above.
python src/visualization/gbrt_visualization.py
python src/visualization/feature_visualization.py

The GBRT model was tuned using Grid Search with 5-fold cross-validation.
| Metric | Value |
|---|---|
| Model | Gradient Boosting Regressor |
| CV MSE | Minimized via grid search (value not reported) |
| Key Hyperparams | n_estimators: [100, 200], learning_rate: [0.05, 0.1] |
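
The tuning setup in the table can be sketched with scikit-learn's `GridSearchCV`. This is a minimal example under stated assumptions: the data is synthetic, and the use of `neg_mean_squared_error` as the scoring metric and `random_state=0` are guesses made for illustration, not settings confirmed by src/models/gbrt_model.py. Only the grid values and the 5-fold setting come from the table.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real feature matrix and medal counts.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = 3.0 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=120)

# Grid taken from the table above.
param_grid = {"n_estimators": [100, 200], "learning_rate": [0.05, 0.1]}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=5,                              # 5-fold cross-validation
    scoring="neg_mean_squared_error",  # assumed metric; sklearn maximizes scores
)
search.fit(X, y)

# best_score_ is the negated MSE, so flip the sign to report CV MSE.
cv_mse = -search.best_score_
print(search.best_params_, cv_mse)
```

With two values per hyperparameter, the search evaluates four candidate settings, each across five folds.
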
Key Findings:
- Host Effect: Significant boost in medal counts for host nations.
- Historical Momentum: Previous games' performance is the strongest predictor.
- Gender Parity: Balanced gender ratios correlate with higher overall medal counts in modern games.
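
One common way to compare predictors in a fitted GBRT is its impurity-based `feature_importances_` attribute. The sketch below shows only how to read that ranking; it runs on synthetic data deliberately constructed so that the lagged medal count dominates, so it is not evidence for the findings above. The feature names (`prev_total`, `is_host`, `female_share`) and all coefficients are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: the target is built to depend mostly on prev_total.
rng = np.random.default_rng(42)
n = 300
prev_total = rng.uniform(0, 100, size=n)
is_host = (rng.random(n) < 0.05).astype(float)
female_share = rng.uniform(0.2, 0.5, size=n)
y = (0.9 * prev_total + 5.0 * is_host + 4.0 * female_share
     + rng.normal(scale=2.0, size=n))

feature_names = ["prev_total", "is_host", "female_share"]
X = np.column_stack([prev_total, is_host, female_share])

model = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.1, random_state=0
).fit(X, y)

# Impurity-based importances are normalized to sum to 1.
ranking = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can favor continuous, high-variance features, so a permutation-based check may be a useful complement when interpreting a ranking like this.
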
- 1EchA - Lead Developer & Researcher
This project is licensed under the MIT License - see the LICENSE file for details.