Predicting Mobile Data Usage

This project explores the application of machine learning algorithms to predict daily mobile data usage based on user behavior and smartphone characteristics, utilizing a dataset of 700 records. Three models were trained: Linear Regression, Random Forest, and Gradient Boosting, with Gradient Boosting achieving the best performance. Key features influencing data usage were identified as Battery Drain and App Usage Time, and recommendations for further model improvement were provided.

Uploaded by

Loweh Fonyuy

Table of Contents

PREFACE
RESEARCH
Introduction
Project Overview
Dataset
Approach
Problem Definition
Algorithms Used
a. Linear Regression
b. Random Forest Regressor
c. Gradient Boosting Regressor
Exploratory Data Analysis (EDA)
Dataset Overview
Target Variable Distribution
Target Variable Distribution
Preprocessing & Feature Engineering
Steps
Train-Test Split
Model Training & Evaluation
Models Used
Training Code
Model initialization and fitting
Evaluation Metrics
Making a prediction on new data
Saving the Model for Future Use
Feature Importance
Conclusion
What We Can Do to Improve the Model Further
Final Thoughts

PREFACE

This project investigates how different machine learning algorithms can be applied to predict
daily mobile data usage based on user behavior and smartphone characteristics. With the
growing need for efficient data plan management and usage forecasting, this project seeks to
demonstrate the practicality of predictive models like Linear Regression, Random Forest, and
Gradient Boosting in estimating mobile data consumption. The goal is to provide insight into
how such models can assist telecom companies, device manufacturers, and end-users in
understanding and managing mobile data consumption patterns.

RESEARCH

We utilized the Smartphone Usage and Behavioral Dataset sourced from Kaggle. This dataset
contains 700 records and 11 features, including App Usage Time, Screen On Time, Battery
Drain, Number of Installed Apps, Age, Gender, Device Model, and Operating System. The target
variable is Daily Data Usage in MB, a continuous value.

To begin, we conducted exploratory data analysis (EDA) using tools such as Pandas, Seaborn,
and Matplotlib. This helped us understand the distribution of data, detect outliers, and examine
relationships between features and the target variable. We observed a right-skewed distribution
for data usage, indicating that most users consume moderate amounts of data while a few
consume very high amounts.

Following the EDA, we preprocessed the data through One-Hot Encoding for categorical
variables and Standard Scaling for numeric variables. The dataset was then split into training and
testing subsets in an 80:20 ratio.

We trained three machine learning models (Linear Regression, Random Forest Regressor, and
Gradient Boosting Regressor) and compared their performance using Mean Absolute Error (MAE)
and Root Mean Squared Log Error (RMSLE). Gradient Boosting
yielded the best results in our experiments, suggesting its suitability for this kind of regression
task. Furthermore, we examined feature importance, revealing that Battery Drain and App Usage
Time were the most influential predictors of daily mobile data consumption.

Introduction
Objective: Predict daily mobile data usage (MB/day) based on user behavior and device
characteristics.

Project Overview

We'll analyze a dataset containing information about mobile device usage and user behavior to
predict daily data consumption. The dataset includes features like app usage time, screen time,
battery drain, number of apps installed, and demographic information.

Dataset:

●​ 700 rows, 11 features (e.g., App Usage Time, Screen On Time, Battery Drain, Age, Gender).
●​ Target Variable: Data Usage (MB/day) (continuous).
●​ Source: Smartphone Usage and Behavioral Dataset

Approach:

1. Exploratory Data Analysis (EDA)
2. Preprocessing & Feature Engineering
3. Model Training (3 Algorithms)
4. Evaluation & Comparison

Problem Definition

We're trying to predict how much mobile data (in MB) a user will consume per day based on
their device characteristics and usage patterns. This is a regression problem since we're
predicting a continuous numerical value.

Figure 1: Code snippet listing the project dependencies.

Algorithms Used

a.​ Linear Regression


●​ Type: Simple and interpretable linear model.
●​ How it works: It finds the best-fitting straight line through the data by
minimizing the difference between predicted and actual values (using least
squares).
●​ Use case: Good for baseline models and when relationships between features and
the target are mostly linear.
●​ Equation
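In its standard textbook form, multiple linear regression with p features and its least-squares objective can be written as follows. The symbols are our own notation and are not taken from the original figure.

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p,
\qquad
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```

Here the x_j are the input features (for example App Usage Time or Battery Drain), the beta_j are the learned coefficients, and n is the number of training rows.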

b.​ Random Forest Regressor


●​ Type: Ensemble model (uses multiple decision trees).
●​ How it works: Builds many decision trees on random subsets of the data and
averages their predictions to reduce overfitting and improve accuracy.
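● Strength: Handles non-linear relationships well. Some implementations also handle missing
data and categorical variables natively, although scikit-learn expects categorical
variables to be encoded first.
● Key concept: Bagging, i.e. training each tree on a different random sample of the data.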
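The bagging idea can be sketched by hand. This is a minimal illustration on synthetic data, not the project's code: the tree depth, the number of trees, and the data are all assumptions, and a real Random Forest additionally randomizes the features considered at each split.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (not the project's dataset): a noisy non-linear target.
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) * 100 + rng.normal(0, 10, size=200)

trees = []
for _ in range(25):
    # Bootstrap sample: draw as many rows as the dataset has, with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(max_depth=4, random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# The ensemble prediction is the average of the individual tree predictions.
per_tree = np.stack([t.predict(X) for t in trees])  # shape (25, 200)
bagged_pred = per_tree.mean(axis=0)
```

Averaging over trees trained on different samples is what smooths out the overfitting of any single tree.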

c.​ Gradient Boosting Regressor


●​ Type: Ensemble model using boosting.
●​ How it works: Builds decision trees sequentially, where each new tree learns
from the errors of the previous ones.
●​ Strength: Highly accurate, great for capturing complex patterns in data.
● Key concept: Boosting, i.e. correcting the previous models' mistakes step by step to
improve performance.
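The sequential error-correcting loop can be sketched as follows for squared error. The data are synthetic, and the learning rate, tree depth, and number of rounds are illustrative assumptions rather than the settings used in this project.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data (not the project's dataset).
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) * 100 + rng.normal(0, 10, size=200)

learning_rate = 0.1
pred = np.full_like(y, y.mean())       # start from a constant prediction
errors = [np.mean((y - pred) ** 2)]    # training MSE after each round

for _ in range(50):
    residuals = y - pred               # what the current ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    pred += learning_rate * tree.predict(X)   # small corrective step
    errors.append(np.mean((y - pred) ** 2))
```

Each round fits a small tree to the residuals of the ensemble so far, so the training error in `errors` shrinks from round to round.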

Exploratory Data Analysis (EDA)

Dataset Overview

Figure 2: Image shows a code snippet that prints the first 5 samples in our dataset.

Figure 3: Image shows a code snippet that displays the information about the dataset.

We imported the data with Pandas and printed the first 5 rows.

Key Observations:
●​ Mixed data types (numeric + categorical).
● No missing values (the DataFrame's info() output confirms all columns are complete).
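A minimal sketch of these checks is shown below. The rows are made up and the column names are assumptions; in practice the frame would come from reading the downloaded Kaggle CSV with `pd.read_csv`.

```python
import pandas as pd

# Made-up rows standing in for the Kaggle CSV; column names are assumptions.
df = pd.DataFrame({
    "App Usage Time (min/day)": [120, 300, 45, 510, 230],
    "Battery Drain (mAh/day)": [900, 1800, 400, 2700, 1500],
    "Operating System": ["Android", "iOS", "Android", "Android", "iOS"],
    "Gender": ["Female", "Male", "Male", "Female", "Female"],
    "Data Usage (MB/day)": [310, 940, 150, 1900, 720],
})

print(df.head())   # first five rows
df.info()          # dtype and non-null count for every column

# Explicit checks behind the two observations above.
missing_per_column = df.isnull().sum()
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
```

If `missing_per_column` is zero everywhere, no imputation step is needed before modelling.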

Target Variable Distribution

Figure 4: Image shows a distribution plot of the target variable.

Interpretation:
●​ Right-skewed distribution.
●​ Most users consume 300 to 1000 MB/day, with outliers (>2000 MB).

Target Variable Distribution


We then draw a boxplot of Data Usage for each User Behavior Class to compare the distributions
across classes.

Figure 5: Image shows a box plot of data usage by user behavior class.

Preprocessing & Feature Engineering

Steps

1. Categorical Encoding:
○ OneHotEncoder for Device Model, Operating System, Gender.
2. Numerical Scaling:
○ StandardScaler for Battery Drain, Screen On Time, etc.
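One common way to combine the two steps is scikit-learn's `ColumnTransformer`. The sketch below assumes that approach. The rows, the exact column names, and the device labels "A", "B", "C" are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows; column names and values are assumptions, not the real data.
X = pd.DataFrame({
    "Battery Drain": [1200, 800, 2400, 1500],
    "Screen On Time": [6.4, 4.7, 9.1, 5.0],
    "Device Model": ["A", "B", "A", "C"],
    "Operating System": ["Android", "iOS", "Android", "Android"],
    "Gender": ["Male", "Female", "Female", "Male"],
})

numeric_cols = ["Battery Drain", "Screen On Time"]
categorical_cols = ["Device Model", "Operating System", "Gender"]

preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),                             # zero mean, unit variance
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),   # one column per category
    ],
    sparse_threshold=0,  # always return a dense array
)

X_prepared = preprocessor.fit_transform(X)
print(X_prepared.shape)
```

In this toy frame the result has 9 columns: 2 scaled numeric columns plus 3 + 2 + 2 one-hot columns. Fitting the preprocessor on the training split only, and reusing it on the test split, avoids leaking test statistics into training.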

Figure 6: Image shows the feature extraction and preprocessing of categorical data.

Train-Test Split

80% Train, 20% Test (train_test_split).

Figure 7: Image shows how the dataset has been split into the training and test sets.
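A minimal sketch of the 80:20 split with `train_test_split` follows. The arrays are placeholders with the dataset's row count, and `random_state=42` is an assumed seed used only to make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the dataset's row count (700); the values are synthetic.
X = np.arange(700 * 3).reshape(700, 3)
y = np.arange(700)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (560, 3) (140, 3)
```

With 700 rows, this leaves 560 rows for training and 140 for testing.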

Model Training & Evaluation

Models Used

Training Code

Model initialization and fitting.

Figure 8: Image shows model initialization and fitting.
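Initializing and fitting the three models might look like the sketch below. It runs on synthetic data, and the hyperparameters shown (library defaults plus a fixed `random_state`) are assumptions, since the original settings are only visible in the figure.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features and the MB/day target.
rng = np.random.default_rng(42)
X = rng.normal(size=(700, 5))
y = 500 + 200 * X[:, 0] + 100 * X[:, 1] ** 2 + rng.normal(0, 30, size=700)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

# Fit every model on the training split and keep its test-set predictions.
predictions = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
```

Keeping the models in a dictionary makes it easy to loop over them again when computing the evaluation metrics.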

Evaluation Metrics

Model Comparison:

Linear Regression Performance:
Mean Absolute Error: 117.04
Root Mean Squared Log Error: 0.2014

Random Forest Performance:
Mean Absolute Error: 114.74
Root Mean Squared Log Error: 0.2023

Gradient Boosting Performance:
Mean Absolute Error: 113.64
Root Mean Squared Log Error: 0.1980

Interpretation:
● Gradient Boosting performs best (lowest MAE and RMSLE), although the margins are small: its
MAE is about 3.4 MB/day lower than that of Linear Regression.
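For reference, the two metrics can be computed as in the sketch below. It assumes the usual log(1 + y) form of RMSLE, which is also what scikit-learn's `mean_squared_log_error` is based on. The sample values are made up.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference, in the target's own unit (MB/day here)."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmsle(y_true, y_pred):
    """Root mean squared difference of log(1 + y); penalizes relative error."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# Made-up values for illustration only.
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 400.0])

mae_value = mean_absolute_error(y_true, y_pred)
rmsle_value = rmsle(y_true, y_pred)
```

MAE reads directly as an average miss in MB/day, while RMSLE is unitless and less dominated by the few very heavy users in a right-skewed target.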

Making a prediction on new data

Figure 9: Image shows sample predictions made with the model.

Saving the Model for Future Use


Save the best model so we can deploy it for use.

Figure 10: Image shows saving the model for future use.
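The snippet in the figure is not reproduced here. One simple way to persist a fitted scikit-learn model is Python's built-in `pickle` module, sketched below on synthetic data. The file name `data_usage_model.pkl` is hypothetical, and `joblib` is a common alternative.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Small synthetic model standing in for the trained Gradient Boosting model.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 0.1, size=100)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Hypothetical file name, written to the system temp directory.
path = os.path.join(tempfile.gettempdir(), "data_usage_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later, for example in a deployment script, load the model back and predict.
with open(path, "rb") as f:
    restored = pickle.load(f)

same_predictions = np.allclose(model.predict(X), restored.predict(X))
```

When preprocessing is part of the workflow, the fitted encoder and scaler need to be saved alongside the model so that new data is transformed the same way.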

Feature Importance
After training the model, we examine which features contributed most to its predictions.

Figure 11: Image shows the most important features that we considered during the training.

Top Features:
1. Battery Drain contributed about 45%
2. App Usage Time contributed about 28%
3. Number of Apps Installed contributed about 19%

N.B. Features that contributed little to the model, such as Operating System, Device Model, and
Gender, could be dropped in future data collection, since their contribution is not significant.
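Tree-based scikit-learn models expose such shares through the `feature_importances_` attribute. The sketch below uses a synthetic target whose dependence on each feature is built in by construction, so it illustrates the mechanism only and does not reproduce the percentages above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
feature_names = ["Battery Drain", "App Usage Time", "Number of Apps Installed", "Age"]
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=feature_names)

# Synthetic target: the first feature matters most, Age not at all (by construction).
y = (
    5 * X["Battery Drain"]
    + 3 * X["App Usage Time"]
    + 1 * X["Number of Apps Installed"]
    + rng.normal(0, 0.5, size=500)
)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Importances are normalized to sum to 1, so they can be read as shares.
importances = pd.Series(model.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

These impurity-based importances describe how much each feature was used by the trees; they do not by themselves show that a feature causes higher data usage.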

Conclusion
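In this project, we successfully built a predictive system to estimate daily data usage (MB/day)
based on user and device behavior. We trained three machine learning models (Linear Regression,
Random Forest, and Gradient Boosting), with Gradient Boosting giving the lowest errors.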

What We Can Do to Improve the Model Further

Here are practical steps for improvement:

1. Feature Engineering
2. Data Cleaning & Outlier Handling
3. Model Optimization
● Try more advanced models like XGBoost, LightGBM, CatBoost

Final Thoughts
This project shows how user behavior data can be leveraged to predict mobile data usage, which
could be useful for:
●​ Telecom companies optimizing data plans,
●​ Device manufacturers understanding usage patterns,
●​ End users tracking and managing data consumption.​
