Predictive Modeling

The document presents an analysis of a dataset containing information about visitors, ad impressions, and content views from a streaming platform. It includes data cleaning, exploratory data analysis with visualizations, and the development of predictive models using linear regression and random forest regression. Key findings include a strong correlation between trailer views and content views, as well as the impact of various features on visitor counts.

Uploaded by

anuragsingh0406

import numpy as np

import pandas as pd

df=pd.read_csv("ottdata.csv")

df.head()

   visitors  ad_impressions  major_sports_event     genre  dayofweek  season  views_trailer  views_content
0      1.67         1113.81                   0    Horror  Wednesday  Spring          56.70           0.51
1      1.46         1498.41                   1  Thriller     Friday    Fall          52.69           0.32
2      1.47         1079.19                   1  Thriller  Wednesday    Fall          48.74           0.39
3      1.85         1342.77                   1    Sci-Fi     Friday    Fall          49.81           0.44
4      1.46         1498.41                   0    Sci-Fi     Sunday  Winter          55.83           0.46

df.describe()

          visitors  ad_impressions  major_sports_event  views_trailer  views_content
count  1000.000000     1000.000000         1000.000000     1000.00000    1000.000000
mean      1.704290     1434.712290            0.400000       66.91559       0.473400
std       0.231973      289.534834            0.490143       35.00108       0.105914
min       1.250000     1010.870000            0.000000       30.08000       0.220000
25%       1.550000     1210.330000            0.000000       50.94750       0.400000
50%       1.700000     1383.580000            0.000000       53.96000       0.450000
75%       1.830000     1623.670000            1.000000       57.75500       0.520000
max       2.340000     2424.200000            1.000000      199.92000       0.890000

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   visitors            1000 non-null   float64
 1   ad_impressions      1000 non-null   float64
 2   major_sports_event  1000 non-null   int64
 3   genre               1000 non-null   object
 4   dayofweek           1000 non-null   object
 5   season              1000 non-null   object
 6   views_trailer       1000 non-null   float64
 7   views_content       1000 non-null   float64
dtypes: float64(4), int64(1), object(3)
memory usage: 62.6+ KB

import matplotlib.pyplot as plt

import seaborn as sns

sns.set(style="whitegrid")

plt.figure(figsize=(10, 6))
sns.histplot(df['views_content'], bins=30, kde=True, color='red')
plt.title('Distribution of Content Views')
plt.xlabel('Views on Content')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()

plt.figure(figsize=(10, 6))
# Assign genre to hue as well so the palette applies without seaborn's
# deprecation warning; legend=False keeps the plot itself unchanged.
sns.countplot(data=df, x='genre', hue='genre',
              order=df['genre'].value_counts().index,
              palette='Set2', legend=False)
plt.title('Distribution of Content Genres')
plt.xlabel('Genre')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Mean content views per release day, sorted from lowest to highest
day_avg_views = df.groupby('dayofweek')['views_content'].mean().sort_values()

plt.figure(figsize=(8, 5))
# hue mirrors x so the palette applies without the deprecation warning
sns.barplot(x=day_avg_views.index, y=day_avg_views.values,
            hue=day_avg_views.index, palette='coolwarm', legend=False)
plt.title("Average Content Views by Day of Release")
plt.xlabel("Day of the Week")
plt.ylabel("Average Content Views")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

plt.figure(figsize=(5, 5))
# One fixed color per season
custom_palette = {'Spring': 'orange', 'Fall': 'blue', 'Summer': 'green', 'Winter': 'red'}
sns.boxplot(data=df, x='season', y='views_content', hue='season',
            palette=custom_palette, legend=False)

<Axes: xlabel='season', ylabel='views_content'>

correlation = df['views_trailer'].corr(df['views_content'])
print(f"Correlation between trailer views and content views: {correlation:.2f}")

Correlation between trailer views and content views: 0.75
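
For reference, `Series.corr` defaults to the Pearson coefficient: the covariance of the two columns divided by the product of their standard deviations. The sketch below recomputes it by hand and compares it with pandas. It uses only the five rows printed by `df.head()` above, so its value will not match the 0.75 obtained on the full dataset; `pearson_r` is a hypothetical helper name introduced for illustration.

```python
import numpy as np
import pandas as pd


def pearson_r(x: pd.Series, y: pd.Series) -> float:
    """Pearson correlation: sum of centered cross-products over the
    square root of the product of the centered sums of squares."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


# The five rows shown by df.head(); far too few for a stable estimate.
head_rows = pd.DataFrame({
    "views_trailer": [56.70, 52.69, 48.74, 49.81, 55.83],
    "views_content": [0.51, 0.32, 0.39, 0.44, 0.46],
})

r_manual = pearson_r(head_rows["views_trailer"], head_rows["views_content"])
r_pandas = head_rows["views_trailer"].corr(head_rows["views_content"])
print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```

The two numbers should agree up to floating-point rounding, which is a quick way to confirm what the single-line call in the notebook is measuring.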

plt.figure(figsize=(8, 6))
sns.scatterplot(data=df, x='views_trailer', y='views_content')
sns.regplot(data=df, x='views_trailer', y='views_content',
scatter=False, color='red')
plt.title('Correlation between Trailer Views and Content Views')
plt.xlabel('Trailer Views')
plt.ylabel('Content Views')
plt.show()
duplicates = df[df.duplicated()]
print(f"Number of duplicate rows: {duplicates.shape[0]}")

Number of duplicate rows: 0

df.isnull().sum()

visitors 0
ad_impressions 0
major_sports_event 0
genre 0
dayofweek 0
season 0
views_trailer 0
views_content 0
dtype: int64

sns.boxplot(x=df['views_content'])
plt.title('Boxplot of Content Views')
plt.xlabel('Content Views')
plt.show()
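
The boxplot flags outliers visually. A rough numeric counterpart, assuming the usual 1.5 × IQR whisker rule, can be worked out from the quartiles already reported by `df.describe()` for `views_content` (25% = 0.40, 75% = 0.52). `iqr_fences` is a hypothetical helper name used only for this sketch.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple:
    """Return the (lower, upper) fences used by the k * IQR outlier rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles of views_content as printed by df.describe() above
lower, upper = iqr_fences(0.40, 0.52)
print(f"lower fence: {lower:.2f}, upper fence: {upper:.2f}")

# With the full dataset loaded, the flagged rows could be selected with:
# outliers = df[(df['views_content'] < lower) | (df['views_content'] > upper)]
```

Under this rule the fences come out at roughly 0.22 and 0.70, so the reported maximum of 0.89 lies beyond the upper fence while the minimum of 0.22 sits at the lower one.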
from sklearn.model_selection import train_test_split

X = df.iloc[:, :-1]
y = df.iloc[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)

print("X_train shape:", X_train.shape)

print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (800, 7)

X_test shape: (200, 7)
y_train shape: (800,)
y_test shape: (200,)

from sklearn.linear_model import LinearRegression

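df_encoded = pd.get_dummies(df, drop_first=True)

# Caution: get_dummies places the dummy columns after the numeric ones, so the
# last column of df_encoded is the season_Winter dummy, not views_content.
# The positional slices below therefore use season_Winter as the target and
# keep views_content among the features, which is what the coefficients
# printed further down reflect.
X = df_encoded.iloc[:, :-1]
y = df_encoded.iloc[:, -1]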
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression()

print("Intercept:", model.intercept_)
print("Coefficients:")
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef}")

Intercept: 0.18811987016589798
Coefficients:
visitors: -0.04440734743739965
ad_impressions: 8.990939399706324e-06
major_sports_event: 0.07516397750507174
views_trailer: -0.0027899014914202452
views_content: 1.3118786110793876
genre_Comedy: 0.011095989825962885
genre_Drama: -0.0016560167230568895
genre_Horror: -0.08656635482699294
genre_Others: -0.056001381566065864
genre_Romance: -0.04919131479908598
genre_Sci-Fi: 0.02188836518732013
genre_Thriller: -0.08216761016780921
dayofweek_Monday: -0.16879363232769098
dayofweek_Saturday: -0.03442416007668321
dayofweek_Sunday: -0.04048380937273572
dayofweek_Thursday: 0.05764282958222824
dayofweek_Tuesday: -0.1028157961476516
dayofweek_Wednesday: -0.08262541105422154
season_Spring: -0.5195187305634899
season_Summer: -0.55614695920083

coefficients = pd.DataFrame({
'Feature': X.columns,
'Coefficient': model.coef_
})

print("\nModel Coefficients:")
print(coefficients)

Model Coefficients:
Feature Coefficient
0 visitors -0.044407
1 ad_impressions 0.000009
2 major_sports_event 0.075164
3 views_trailer -0.002790
4 views_content 1.311879
5 genre_Comedy 0.011096
6 genre_Drama -0.001656
7 genre_Horror -0.086566
8 genre_Others -0.056001
9 genre_Romance -0.049191
10 genre_Sci-Fi 0.021888
11 genre_Thriller -0.082168
12 dayofweek_Monday -0.168794
13 dayofweek_Saturday -0.034424
14 dayofweek_Sunday -0.040484
15 dayofweek_Thursday 0.057643
16 dayofweek_Tuesday -0.102816
17 dayofweek_Wednesday -0.082625
18 season_Spring -0.519519
19 season_Summer -0.556147
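
Note that `views_content` appears in the coefficient list as a feature and `season_Winter` is absent from it. That pattern is what positional slicing with `iloc[:, -1]` produces after `pd.get_dummies`, because the dummy columns are appended after the numeric ones. The sketch below reproduces the effect on a small toy frame (values are illustrative, not the full dataset) and shows one way to select the target by name instead; it is an assumption about the intended target, not the notebook's own code.

```python
import pandas as pd

# Toy frame with one categorical column placed before the intended target
toy = pd.DataFrame({
    "visitors": [1.67, 1.46, 1.47, 1.85],
    "season": ["Spring", "Fall", "Fall", "Winter"],
    "views_content": [0.51, 0.32, 0.39, 0.44],
})

encoded = pd.get_dummies(toy, drop_first=True)

# Dummy columns come last, so the final column is a season dummy
last_col = encoded.columns[-1]
print("last column after encoding:", last_col)

# Selecting by name keeps the intended target regardless of column order
X = encoded.drop(columns="views_content")
y = encoded["views_content"]
print("features:", list(X.columns))
print("target:", y.name)
```

Selecting columns by name is slightly more verbose than `iloc`, but it stays correct when encoding or feature engineering changes the column order.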

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import statsmodels.api as sm

X = df.drop(columns='visitors')
y = df['visitors']

categorical_features = ['genre', 'dayofweek', 'season']

numerical_features = ['ad_impressions', 'major_sports_event',
'views_trailer', 'views_content']

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first'), categorical_features)
    ],
    remainder='passthrough',
    sparse_threshold=0  # always return a dense array for statsmodels
)

X_processed = preprocessor.fit_transform(X)
X_processed = sm.add_constant(X_processed)  # Add intercept
model = sm.OLS(y, X_processed).fit()

residuals = model.resid
fitted = model.fittedvalues

plt.figure(figsize=(16, 12))

plt.subplot(2, 2, 1)
sns.scatterplot(x=fitted, y=y)
plt.plot(fitted, fitted, color='red')
plt.title("Linearity: Fitted vs Actual")

Text(0.5, 1.0, 'Linearity: Fitted vs Actual')

plt.subplot(2, 2, 2)
sns.scatterplot(x=fitted, y=residuals)
plt.axhline(0, color='red', linestyle='--')
plt.title("Homoscedasticity: Residuals vs Fitted")

Text(0.5, 1.0, 'Homoscedasticity: Residuals vs Fitted')

plt.subplot(2, 2, 3)
sns.histplot(residuals, kde=True)
plt.title("Normality of Residuals")

Text(0.5, 1.0, 'Normality of Residuals')

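from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             r2_score)

df_clean = df.dropna()

df_encoded = pd.get_dummies(df_clean, columns=['genre', 'dayofweek', 'season'],
                            drop_first=True)

X = df_encoded.drop('visitors', axis=1)
y = df_encoded['visitors']

# Caution: X and y are not split again here, so X_train, y_train, X_test and
# y_test still come from the earlier LinearRegression cell (target: the
# season_Winter dummy). The metrics reported below describe that split,
# not a model of visitors.
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)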

RandomForestRegressor(random_state=42)

y_pred = model.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

mae, mse, rmse, r2

(0.26805, 0.1555445, 0.3943913031495497, 0.18123700486906158)
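
In the cell above, `X` and `y` are redefined for `visitors`, but `train_test_split` is not called again before `model.fit`, so the fit and the metrics reuse the earlier split. A minimal sketch of the full sequence (encode, split, fit, score) is given below. It runs on synthetic stand-in data generated in the block, so its numbers say nothing about `ottdata.csv`; `fit_and_score` is a hypothetical helper name, and the coefficients used to generate the toy target are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def fit_and_score(frame: pd.DataFrame, target: str, seed: int = 42) -> dict:
    """Encode categoricals, split, fit a random forest, return test metrics."""
    encoded = pd.get_dummies(frame, drop_first=True)
    X = encoded.drop(columns=target)
    y = encoded[target]
    # Split AFTER defining X and y so the model trains on the intended target
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    model = RandomForestRegressor(random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    return {
        "mae": mean_absolute_error(y_test, y_pred),
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "r2": r2_score(y_test, y_pred),
    }


# Synthetic stand-in data (assumption: not drawn from the real dataset)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "ad_impressions": rng.uniform(1000, 2400, n),
    "season": rng.choice(["Fall", "Spring", "Summer", "Winter"], n),
})
toy["visitors"] = 1.2 + 0.0003 * toy["ad_impressions"] + rng.normal(0, 0.05, n)

metrics = fit_and_score(toy, target="visitors")
print(metrics)
```

Keeping the split inside one function makes it harder for stale `X_train` or `y_test` variables from an earlier notebook cell to leak into a later model.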
