Part 1 – Data analysis and predictive model
The following steps were performed to analyze the Bike Sharing Dataset and build a predictive
model:
Descriptive Analysis
Missing Value Analysis
Outlier Analysis
Correlation Analysis
Model Selection
Random Forest Training and Feature Ranking
1 – Business Case
The hourly prediction of the bike sharing count value is not only important for estimating the expected revenue of the bike sharing service, but also for providing the required number of bicycles at each station of a distributed bike sharing service. With more information about the individual stations, one could predict station-level demand and schedule the rebalancing of returned bikes. In this task, I investigate the prediction of the hourly bike count based on the specific hour, the expected weather and day information.
2 – Analysis and data preparation
The missing value analysis revealed that the data set does not contain any not-a-number or null values that would require replacement for further processing. In the next step, outliers in the count values were removed using the median and the interquartile range (IQR), as the count values do not follow a normal distribution. This reduced the data set from 15641 to 15179 samples.
Figure: Box plots of the count values for the data with and without outliers.
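A minimal sketch of the IQR-based outlier removal, assuming the data is loaded into a pandas DataFrame with the count column named ‘cnt’ (the file name, column name and the 1.5 multiplier are assumptions, not taken from the report):

import pandas as pd

# load the hourly bike sharing data (file name is an assumption)
df = pd.read_csv("hour.csv")

# median and interquartile range of the count values
median = df["cnt"].median()
q1, q3 = df["cnt"].quantile(0.25), df["cnt"].quantile(0.75)
iqr = q3 - q1

# keep samples whose count lies within median +/- 1.5 * IQR (multiplier is an assumption)
mask = (df["cnt"] - median).abs() <= 1.5 * iqr
df_clean = df[mask]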
The box plots and the correlation matrix of the numerical features revealed that the hour and the temperature are promising feature variables for predicting the hourly count value. The correlation analysis also revealed that temperature and feeling temperature are highly correlated. To reduce the model complexity and avoid collinearity, the feeling temperature feature was discarded.
Figure: Correlation matrix of the numerical features.
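A sketch of the correlation check and the removal of the collinear feeling temperature column, assuming the usual ‘temp’ and ‘atemp’ column names of the Bike Sharing Dataset:

# correlation of the numerical features with the count value
corr = df_clean.corr(numeric_only=True)
print(corr["cnt"].sort_values(ascending=False))

# temperature ('temp') and feeling temperature ('atemp') are highly correlated,
# so the feeling temperature column is dropped to avoid collinearity
df_clean = df_clean.drop(columns=["atemp"])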
3 – Model selection
The prediction of the count values requires a regression algorithm based on categorical and
numerical features. The dataset is quite small with fewer than 20K samples, and the analysis steps revealed that a few features could be particularly significant. Based on these characteristics of the
task and the data, I evaluated a set of possible algorithms: Lasso, Elastic Net, Support Vector
Regression with different kernels, Ridge Regression and Random Forests.
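A sketch of how such a comparison could be set up with scikit-learn, assuming the feature matrix X (with categorical features already encoded) and the target vector y have been prepared; the hyperparameters are illustrative defaults, not the values used in the report:

from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# candidate regressors with default hyperparameters (illustrative, not tuned)
models = {
    "Lasso": Lasso(),
    "Elastic Net": ElasticNet(),
    "SVR (RBF kernel)": SVR(kernel="rbf"),
    "SVR (linear kernel)": SVR(kernel="linear"),
    "Ridge Regression": Ridge(),
    "Random Forest": RandomForestRegressor(),
}

for name, model in models.items():
    # three-fold cross-validated mean absolute error for each candidate
    scores = cross_val_score(model, X, y, cv=3, scoring="neg_mean_absolute_error")
    print(f"{name}: MAE = {-scores.mean():.2f}")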
4 – Random Forest
Random forests showed the most promising results on the Bike Sharing Dataset and were selected for the final model. The final random forest model consists of 200 decision trees trained on various subsamples of the dataset and uses averaging to improve the predictive accuracy and control overfitting. An internal node needs at least four samples to be split, and the mean squared error was used to estimate the quality of a split.
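The corresponding scikit-learn configuration could look like the following sketch (the random_state is an assumption added for reproducibility; X_train and y_train denote the training fold of a split):

from sklearn.ensemble import RandomForestRegressor

# 200 trees, at least four samples required to split an internal node,
# mean squared error as split criterion (named "mse" in older scikit-learn versions)
model = RandomForestRegressor(
    n_estimators=200,
    min_samples_split=4,
    criterion="squared_error",
    random_state=42,  # assumption, added for reproducibility
)
model.fit(X_train, y_train)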
The final model achieves a mean absolute error of 44.30 averaged over three splits using three-fold cross-validation:
Model                  Split  Mean Squared Error  Mean Absolute Error  RMSLE  R² Score
RandomForestRegressor  1      4489.95             43.72                0.41   0.86
RandomForestRegressor  2      4636.60             44.60                0.41   0.86
RandomForestRegressor  3      4691.67             44.57                0.41   0.86
RandomForestRegressor  Mean   4606.07             44.30                0.41   0.86
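The per-split metrics in the table can be computed as in the following sketch, assuming X_test and y_test hold the held-out fold of the respective split:

import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    mean_squared_log_error,
    r2_score,
)

# evaluate one cross-validation split on its held-out test fold
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
rmsle = np.sqrt(mean_squared_log_error(y_test, y_pred))  # requires non-negative values
r2 = r2_score(y_test, y_pred)
print(f"MSE={mse:.2f}  MAE={mae:.2f}  RMSLE={rmsle:.2f}  R²={r2:.2f}")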
The feature ranking of the decision trees in the random forest regressor was computed on the training samples of the first split.
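A sketch of how this ranking is obtained from the trained forest, assuming X_train is a pandas DataFrame so that column names are available:

import numpy as np

# impurity-based feature importances, averaged over all trees in the forest
importances = model.feature_importances_
for idx in np.argsort(importances)[::-1]:
    print(f"{X_train.columns[idx]}: {importances[idx]:.3f}")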
5 – Code
The Python code to reproduce the provided analysis, plots and models is included, together with documentation and unit tests. This improves the reliability of the code, facilitates maintenance and allows potential colleagues to extend the code.
Create a Python environment using conda and the ‘env.yaml’ file:
conda env create --file=env.yaml
All analysis steps are provided in more detail in the ‘Bike Sharing Data Science Project.ipynb’ Jupyter notebook. Run it with the following command from the source folder:
jupyter notebook
The random forest is implemented in the ‘random_forest.py’ file. Run it with the following command
from the source folder:
python random_forest.py
To run the unit tests for the random forest code, use Python’s unittest test discovery:
python -m unittest
Part 2 – Large-scale dataset
The sklearn Python implementation of random forest regression used here slows down drastically for large-scale datasets (> 10 million samples). This is caused by the costly computations and by the fact that the data can no longer be stored completely in main memory. In the worst case, the sklearn implementation will crash and is not usable for large-scale datasets.
A good solution would be the combination of state-of-the-art top-tree pre-classification, as in woody, with distributed computation, e.g. with Apache Spark ML.
The Python woody implementation uses top trees for a coarse pre-classification and distributes the samples to bottom random forests, which are implemented in pure C and highly optimized. In a direct comparison (see the arXiv paper), woody shows good results in runtime performance and accuracy.
Apache Spark ML is a machine learning framework highly optimized for distributed computation and would allow the utilization of a computation cluster. Spark ML can run on Hadoop, Apache Mesos, Kubernetes, etc., can access data from popular Apache databases such as Apache Cassandra, and claims to run workloads up to 100 times faster than classic Hadoop MapReduce jobs.
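A minimal sketch of a distributed random forest regression with Spark ML (pyspark), assuming the prepared data is stored as Parquet with a ‘features’ vector column and a ‘cnt’ label column; the path, column names and split ratio are assumptions:

from pyspark.sql import SparkSession
from pyspark.ml.regression import RandomForestRegressor

spark = SparkSession.builder.appName("bike-sharing-rf").getOrCreate()

# features are expected as a single vector column, e.g. assembled with VectorAssembler
data = spark.read.parquet("hdfs:///data/bike_sharing.parquet")
train, test = data.randomSplit([0.8, 0.2], seed=42)

rf = RandomForestRegressor(featuresCol="features", labelCol="cnt", numTrees=200)
model = rf.fit(train)
predictions = model.transform(test)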
One idea could be to distribute the data with a coarse top-tree pre-classification onto distributed machines and to learn the bottom random forests on those machines. Such a solution only makes sense if a large random forest is required to model the underlying data properly. The success of this solution depends on the following factors:
• Pre-classification runtime: The runtime of the pre-classification could slow down the data distribution.
• Network delay: Especially in a computation cluster, the network could be the bottleneck and slow down the computation.
• Hardware: A computation cluster provides a great deal of computational power, but is not affordable in every use case.
I have theoretical knowledge about distributed computation and cluster architectures. I have no
hands-on experience with frameworks like Hadoop or Apache Spark ML so far.