Table of Contents
PREFACE
RESEARCH
Introduction
Project Overview
Dataset:
Approach:
Problem Definition
Algorithms Used
a. Linear Regression
b. Random Forest Regressor
c. Gradient Boosting Regressor
Exploratory Data Analysis (EDA)
Dataset Overview
Target Variable Distribution
Target Variable Distribution
Preprocessing & Feature Engineering
Steps
Train-Test Split
Model Training & Evaluation
Models Used
Training Code
Model initialization and fitting
Evaluation Metrics
Making a prediction on new data
Saving the Model for Future Use
Feature Importance
Conclusion
What We Can Do to Improve the Model Further
Final Thoughts
PREFACE
This project investigates how different machine learning algorithms can be applied to predict
daily mobile data usage based on user behavior and smartphone characteristics. With the
growing need for efficient data plan management and usage forecasting, this project seeks to
demonstrate the practicality of predictive models like Linear Regression, Random Forest, and
Gradient Boosting in estimating mobile data consumption. The goal is to provide insight into
how such models can assist telecom companies, device manufacturers, and end-users in
understanding and managing mobile data consumption patterns.
RESEARCH
We utilized the Smartphone Usage and Behavioral Dataset sourced from Kaggle. This dataset
contains 700 records and 11 features, including App Usage Time, Screen On Time, Battery
Drain, Number of Installed Apps, Age, Gender, Device Model, and Operating System. The target
variable, Daily Data Usage (MB), is continuous.
To begin, we conducted exploratory data analysis (EDA) using tools such as Pandas, Seaborn,
and Matplotlib. This helped us understand the distribution of data, detect outliers, and examine
relationships between features and the target variable. We observed a right-skewed distribution
for data usage, indicating that most users consume moderate amounts of data while a few
consume very high amounts.
Following the EDA, we preprocessed the data through One-Hot Encoding for categorical
variables and Standard Scaling for numeric variables. The dataset was then split into training and
testing subsets in an 80:20 ratio.
We trained three machine learning models (Linear Regression, Random Forest Regressor, and
Gradient Boosting Regressor) and compared their performance using two evaluation metrics:
Mean Absolute Error (MAE) and Root Mean Squared Log Error (RMSLE). Gradient Boosting
yielded the best results in our experiments (MAE 113.64, RMSLE 0.1980), suggesting its
suitability for this kind of regression task. Furthermore, we examined feature importance,
which showed that Battery Drain and App Usage Time were the most influential predictors of
daily mobile data consumption.
Introduction
Objective: Predict daily mobile data usage (MB/day) based on user behavior and device
characteristics.
Project Overview
We'll analyze a dataset containing information about mobile device usage and user behavior to
predict daily data consumption. The dataset includes features like app usage time, screen time,
battery drain, number of apps installed, and demographic information.
Dataset:
● 700 rows, 11 features (e.g., App Usage Time, Screen On Time, Battery Drain, Age, Gender).
● Target Variable: Data Usage (MB/day) (continuous).
● Source: Smartphone Usage and Behavioral Dataset
Approach:
1. Exploratory Data Analysis (EDA)
2. Preprocessing & Feature Engineering
3. Model Training (3 Algorithms)
4. Evaluation & Comparison
Problem Definition
We're trying to predict how much mobile data (in MB) a user will consume per day based on
their device characteristics and usage patterns. This is a regression problem since we're
predicting a continuous numerical value.
Figure 1: Image shows a code snippet that shows the project dependencies.
Algorithms Used
a. Linear Regression
● Type: Simple and interpretable linear model.
● How it works: It finds the best-fitting straight line through the data by
minimizing the difference between predicted and actual values (using least
squares).
● Use case: Good for baseline models and when relationships between features and
the target are mostly linear.
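● Equation: $\hat{y} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$, where $\hat{y}$ is the
predicted value, $x_1, \dots, x_n$ are the features, and $\beta_0, \dots, \beta_n$ are the
fitted coefficients.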
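To illustrate the least-squares idea described above, here is a minimal sketch using NumPy. The five data points and the variable names are hypothetical and chosen only for illustration; they are not taken from the project dataset.

```python
import numpy as np

# Hypothetical toy data: app usage time (minutes/day) and data usage (MB/day).
x = np.array([60.0, 120.0, 180.0, 240.0, 300.0])
y = np.array([210.0, 390.0, 610.0, 790.0, 1010.0])

# Design matrix with a leading column of ones so an intercept is fitted too.
X = np.column_stack([np.ones_like(x), x])

# Least squares: find beta that minimizes ||X @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta

# Predictions of the fitted straight line.
y_hat = X @ beta
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

At the least-squares solution the residuals are orthogonal to every column of the design matrix, which is what "best-fitting" means here.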
b. Random Forest Regressor
● Type: Ensemble model (uses multiple decision trees).
● How it works: Builds many decision trees on random subsets of the data and
averages their predictions to reduce overfitting and improve accuracy.
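● Strength: Handles non-linear relationships and feature interactions well and does not
require feature scaling; note that scikit-learn's implementation still needs categorical
variables to be encoded numerically.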
● Key concept: Bagging – training each tree on a different random sample of the
data.
c. Gradient Boosting Regressor
● Type: Ensemble model using boosting.
● How it works: Builds decision trees sequentially, where each new tree learns
from the errors of the previous ones.
● Strength: Highly accurate, great for capturing complex patterns in data.
● Key concept: Boosting – correcting the previous model's mistakes step by step to
improve performance.
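The boosting idea can be sketched by hand: each new tree is fitted to the residuals (errors) of the current ensemble. This is a simplified illustration for squared-error loss on synthetic data, not the internals of GradientBoostingRegressor; the learning rate, tree depth, and number of rounds are assumed values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic, non-linear data used only for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 500 + 100 * np.sin(X[:, 0]) + rng.normal(0, 5, size=200)

learning_rate = 0.1  # assumed value
n_rounds = 50        # assumed value

# Start from a constant prediction (the mean of the target).
prediction = np.full_like(y, y.mean())
initial_mse = np.mean((y - prediction) ** 2)

trees = []
for _ in range(n_rounds):
    residuals = y - prediction                 # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)                     # the new tree learns those errors
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

final_mse = np.mean((y - prediction) ** 2)
print(f"training MSE: {initial_mse:.1f} -> {final_mse:.1f}")
```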
Exploratory Data Analysis (EDA)
Dataset Overview
Figure 2: Image shows a code snippet that prints the first 5 samples in our dataset.
Figure 3: Image shows a code snippet that displays the information about the dataset.
We imported the data with Pandas and printed the first 5 rows.
Key Observations:
● Mixed data types (numeric + categorical).
● No missing values (df.info() confirms all columns are complete).
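These two observations can be checked with a few Pandas calls. The sketch below uses a small hypothetical DataFrame as a stand-in for the real dataset; the column names and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset.
df = pd.DataFrame({
    "App Usage Time": [120, 300, 45, 510],
    "Battery Drain": [900, 1800, 400, 2600],
    "Operating System": ["Android", "iOS", "Android", "Android"],
    "Gender": ["Male", "Female", "Female", "Male"],
})

print(df.head())   # first rows of the data
df.info()          # column dtypes and non-null counts

# Count missing values per column.
missing_per_column = df.isnull().sum()

# Separate numeric from categorical (object) columns.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include="object").columns.tolist()
```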
Target Variable Distribution
Figure 4: Image shows a distribution plot of the target variable.
Interpretation:
● Right-skewed distribution.
● Most users consume 300 to 1000 MB/day, with outliers (>2000 MB).
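Right skew can also be checked numerically rather than only by eye. The sketch below uses a synthetic log-normal sample as a stand-in for the Data Usage column (an assumption; the real values are not reproduced here) and compares the mean with the median.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for Data Usage (MB/day).
rng = np.random.default_rng(42)
usage = pd.Series(rng.lognormal(mean=6.3, sigma=0.6, size=700))

# A positive skewness and mean > median both indicate a long right tail.
skewness = usage.skew()
print(f"skewness={skewness:.2f}, mean={usage.mean():.0f}, median={usage.median():.0f}")
```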
Target Variable Distribution
Next, we draw a boxplot of Data Usage for each User Behavior Class to see how usage varies across classes.
Figure 5: Image shows a plot of user behavior class against their data usage.
Preprocessing & Feature Engineering
Steps
1. Categorical Encoding:
○ OneHotEncoder for Device Model, Operating System, Gender.
2. Numerical Scaling:
○ StandardScaler for Battery Drain, Screen On Time, etc.
Figure 6: Image shows the feature extraction and preprocessing of categorical data.
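The two preprocessing steps could be combined in a single ColumnTransformer, as in the following sketch. The original snippet is not reproduced in the text, so this is one plausible arrangement; the toy rows and exact column names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows standing in for the real dataset.
df = pd.DataFrame({
    "Battery Drain": [1200.0, 800.0, 2500.0, 1500.0],
    "Screen On Time": [4.5, 2.0, 9.0, 5.5],
    "Device Model": ["A", "B", "A", "C"],
    "Operating System": ["Android", "iOS", "Android", "Android"],
    "Gender": ["Male", "Female", "Female", "Male"],
})

numeric_cols = ["Battery Drain", "Screen On Time"]
categorical_cols = ["Device Model", "Operating System", "Gender"]

preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),                             # zero mean, unit variance
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # one column per category
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocessor.fit_transform(df)
print(X.shape)
```

Setting handle_unknown="ignore" is a design choice that lets the encoder cope with categories at prediction time that were not seen during fitting.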
Train-Test Split
80% Train, 20% Test (train_test_split).
Figure 7: Image shows how the dataset has been split into the training and test sets.
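A minimal sketch of the 80:20 split with train_test_split follows. The arrays are placeholders with the same number of rows as the dataset (700), and random_state=42 is an assumed seed, not necessarily the one used in the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and target with 700 rows, matching the dataset size.
X = np.arange(700 * 3).reshape(700, 3)
y = np.arange(700)

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```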
Model Training & Evaluation
Models Used
Training Code
Model initialization and fitting.
Figure 8: Image shows model initialization and fitting.
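Initializing and fitting the three models might look like the sketch below. It runs on synthetic data so that it is self-contained; the hyperparameters (n_estimators, random_state) are assumed defaults, not values confirmed by the original snippet.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the preprocessed features and target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 4))
y = 200 + 800 * X[:, 0] + 300 * X[:, 1] + rng.normal(0, 50, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

# Fit each model on the training set and keep its test-set predictions.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
```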
Evaluation Metrics
Model Comparison:
Model               MAE (MB/day)   RMSLE
Linear Regression   117.04         0.2014
Random Forest       114.74         0.2023
Gradient Boosting   113.64         0.1980
(MAE = Mean Absolute Error; RMSLE = Root Mean Squared Log Error)
Interpretation:
● Gradient Boosting performs best (lowest MAE and RMSLE).
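For reference, the two metrics can be computed with scikit-learn as sketched below. RMSLE is taken as the square root of mean_squared_log_error, which compares log(1 + prediction) with log(1 + actual). The four values are made up for illustration and are not the project's predictions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_log_error

# Made-up actual and predicted data usage values (MB/day).
y_true = np.array([300.0, 650.0, 1200.0, 2100.0])
y_pred = np.array([350.0, 600.0, 1000.0, 2400.0])

# MAE: average absolute difference, in the same unit as the target.
mae = mean_absolute_error(y_true, y_pred)

# RMSLE: square root of the mean squared difference of log(1 + value).
rmsle = np.sqrt(mean_squared_log_error(y_true, y_pred))

print(f"MAE={mae:.2f}, RMSLE={rmsle:.4f}")
```

Because RMSLE works on a log scale, it penalizes relative rather than absolute errors, which suits a right-skewed target.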
Making a prediction on new data
Figure 9: Image shows sample predictions made on the model.
Saving the Model for Future Use
Save the best model so we can deploy it for use.
Figure 10: Image shows saving the model for future use.
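One common way to persist a fitted scikit-learn model is joblib, sketched below. Whether the project used joblib or pickle is not stated in the text, so treat this as an assumption; the file name and the synthetic training data are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Fit a small model on synthetic data so the example is self-contained.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 3))
y = 500 * X[:, 0] + rng.normal(0, 10, size=100)
model = GradientBoostingRegressor(random_state=42).fit(X, y)

# Save to a temporary directory (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "data_usage_model.joblib")
joblib.dump(model, path)

# Load it back and confirm it makes the same predictions.
loaded_model = joblib.load(path)
same = np.allclose(loaded_model.predict(X), model.predict(X))
```

In practice the fitted preprocessing steps would need to be saved alongside the model, for example by wrapping both in a scikit-learn Pipeline.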
Feature Importance
After training the model, we examine which features contributed most to its predictions.
Figure 11: Image shows the most important features that we considered during the training.
Top Features:
1. Battery Drain contributed about 45%
2. App Usage Time contributed about 28%
3. Number of Apps Installed contributed about 19%
N.B. Features that contributed little to the model, such as Operating System, Device Model,
and Gender, could be dropped in future data collection, since their contribution to the model
is not significant.
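Importance scores of this kind can be read from the feature_importances_ attribute of a fitted tree ensemble. The sketch below uses synthetic data whose feature names mirror the report; the text does not say which model the reported percentages came from, so the use of Gradient Boosting here is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: the target depends most strongly on the first feature.
rng = np.random.default_rng(0)
feature_names = ["Battery Drain", "App Usage Time", "Number of Apps Installed", "Age"]
X = rng.uniform(0, 1, size=(400, 4))
y = 900 * X[:, 0] + 500 * X[:, 1] + 200 * X[:, 2] + rng.normal(0, 20, size=400)

model = GradientBoostingRegressor(random_state=42).fit(X, y)

# Impurity-based importances sum to 1; sort them from most to least important.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```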
Conclusion
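In this project, we successfully built a predictive system to estimate daily data usage (MB/day)
based on user and device behavior. We compared three machine learning models (Linear
Regression, Random Forest, and Gradient Boosting), of which Gradient Boosting performed best.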
What We Can Do to Improve the Model Further
Here are practical steps for improvement:
1. Feature Engineering
2. Data Cleaning & Outlier Handling
3. Model Optimization
● Try more advanced models like: XGBoost, LightGBM, CatBoost
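Model optimization could, for example, start with a small hyperparameter search before moving to other libraries. The following sketch uses GridSearchCV on synthetic data; the grid values, the 3-fold setting, and the scoring choice are illustrative assumptions, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the preprocessed training set.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(150, 4))
y = 200 + 800 * X[:, 0] + 300 * X[:, 1] + rng.normal(0, 50, size=150)

# Small illustrative grid; real searches would usually cover more values.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",  # higher (closer to 0) is better
    cv=3,
)
search.fit(X, y)

print(search.best_params_, -search.best_score_)
```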
Final Thoughts
This project shows how user behavior data can be leveraged to predict mobile data usage, which
could be useful for:
● Telecom companies optimizing data plans,
● Device manufacturers understanding usage patterns,
● End users tracking and managing data consumption.