Sales forecasting

This project focuses on sales prediction and data enrichment using the CatBoost algorithm and Upgini. The goal is to forecast sales for the next 3 months and determine whether enriching the data improves model accuracy.

Roadmap

Research CatBoost and Upgini
Obtain dataset
Perform basic exploratory data analysis (EDA)
Implement and train the CatBoost model
Assess the model's performance using the SMAPE metric
Enrich the dataset using Upgini
Retrain the CatBoost model
Compare SMAPE values before and after data enrichment
Summarise findings and assess how effectively Upgini improves the accuracy of the CatBoost model

Stack

EDA
CatBoost
Upgini

Data Overview

Tabular data
5 years' worth of product sales (19k samples)
4 features:
- date, store_id, item_id, and sales
Sales data before 2017 will be used as training data (15213 samples), while everything from 2017 onwards will be our test data (3787 samples)
Limited information: only date and sales are useful, which gives the model little to learn from when predicting future sales
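The chronological split described above can be sketched as follows. This is a minimal illustration on a hypothetical four-row frame, not the notebook's actual code; the cutoff of 1 January 2017 is an assumption drawn from the description of the split, and the column names follow the feature list above.

```python
import pandas as pd

# Hypothetical toy data using the four columns named in the overview.
df = pd.DataFrame({
    "date": pd.to_datetime(["2016-12-30", "2016-12-31", "2017-01-01", "2017-01-02"]),
    "store_id": [1, 1, 1, 1],
    "item_id": [7, 7, 7, 7],
    "sales": [12, 15, 9, 11],
})

# Assumed cutoff: rows before 2017 train the model, the rest are held out.
cutoff = pd.Timestamp("2017-01-01")
train = df[df["date"] < cutoff]
test = df[df["date"] >= cutoff]
```

Splitting by date rather than at random keeps every test row later than every training row, which mirrors how a forecast is used in practice.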

Methodology

The methodology consists of three key steps: basic exploratory data analysis, application of the CatBoost algorithm, and data enrichment using Upgini.

Exploratory Data Analysis (EDA):

The initial step is a basic exploratory data analysis to gain insights into the sales data.

Model creation using the CatBoost algorithm:

Once the data has been briefly analysed, the CatBoost algorithm will be applied to build a time series forecasting model. CatBoost is a gradient-boosting algorithm on decision trees that handles categorical features natively. The model will be trained on our sales data, using the relevant features identified during the exploratory data analysis.
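Tree-based models generally work with numeric or categorical columns rather than raw timestamps, so a common preparation step is to derive calendar features from the date. The sketch below shows one way to do that with pandas. The helper name `add_calendar_features` and the particular features chosen are assumptions for illustration; the source does not say which features the notebook uses.

```python
import pandas as pd

def add_calendar_features(df, date_col="date"):
    """Return a copy of df with calendar columns derived from date_col.

    Hypothetical helper: the feature set is an illustrative assumption.
    """
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    out["year"] = dates.dt.year
    out["month"] = dates.dt.month
    out["day_of_week"] = dates.dt.dayofweek  # Monday is 0, Sunday is 6
    out["is_weekend"] = dates.dt.dayofweek >= 5
    return out

# Toy input with the columns named in the data overview.
sample = pd.DataFrame({
    "date": ["2016-12-31", "2017-01-02"],
    "store_id": [1, 1],
    "item_id": [7, 7],
    "sales": [15, 11],
})
features = add_calendar_features(sample)
```

The resulting columns, together with `store_id` and `item_id`, could then serve as model inputs, with `sales` as the target.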

Data Enrichment with Upgini:

To further enhance the forecasting model, the data will be enriched using Upgini. Upgini is a Python library for feature search and enrichment: given search keys such as a date, it looks for relevant features in external data sources and adds them to the dataset. The enriched dataset will be used to retrain the CatBoost model, to check whether the additional features produce more accurate sales forecasts.

Symmetric Mean Absolute Percentage Error (SMAPE)

The evaluation of the developed model will primarily be based on the SMAPE metric, which measures the accuracy of the predictions compared to the actual sales values. The performance of the model will be assessed by comparing the SMAPE values before and after data enrichment using Upgini.
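SMAPE has several variants in the literature, and the source does not state which one the notebook uses. The sketch below assumes the common form that averages 2|F - A| / (|A| + |F|) over all points and expresses the result as a percentage, so values range from 0% to 200% and lower is better. Treating a pair where both values are zero as zero error is also an assumption.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(actual) == 0:
        raise ValueError("inputs must not be empty")

    total = 0.0
    for a, f in zip(actual, forecast):
        denominator = abs(a) + abs(f)
        if denominator == 0:
            # Both values are zero: count this point as zero error (assumption).
            continue
        total += 2.0 * abs(f - a) / denominator
    return 100.0 * total / len(actual)
```

For example, `smape([100], [50])` is about 66.7, since 2 * 50 / 150 is roughly 0.667.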

Evaluation and Conclusion

To assess whether the enrichment process made a meaningful difference, we compared SMAPE values before and after enrichment. The baseline model yielded a SMAPE value of 37%.

After enriching the dataset with Upgini and retraining the model on the enriched data, the SMAPE value dropped to 14%.

Overall, by using Upgini to enrich our data and retraining the CatBoostRegressor model, we reduced the SMAPE value from 37% to 14%. This outcome indicates that Upgini was effective in improving the model's accuracy on this dataset and supports the use of CatBoostRegressor for regression tasks.
