Hypothesis Testing
Hypothesis Testing – Pearson's Correlation Coefficient, Chi-Squared Test, One-Way Analysis of Variance Test (ANOVA).
Predictive Analysis: Steps to Building a Multiple Linear Regression, Model Diagnostics.
https://towardsdatascience.com/hypothesis-testing-for-data-scientists-everything-you-need-to-know-8c36ddde4cd2
Statistical Analysis
Defining Hypothesis
The scientific question that must be answered, expressed in the form of the Null Hypothesis (H₀) and the Alternative Hypothesis (H₁ or Hₐ). H₀ and H₁ must be mutually exclusive, and H₁ shouldn't contain equality:
H₀: μ = x, H₁: μ ≠ x
H₀: μ ≤ x, H₁: μ > x
H₀: μ ≥ x, H₁: μ < x
Choosing a statistical test by variable measurement and count:
• Independent: Nominal (1); Dependent: Nominal (1) → Chi-Square Test
• Independent: Nominal (1); Dependent: Ordinal/Continuous (1) → ANOVA
• Independent: Nominal (1); Dependent: Interval or Ratio (1) → One-Way ANOVA
• Independent: Nominal (2); Dependent: Interval or Ratio (1) → Two-Way ANOVA
• Independent: Interval or Ratio (2 or more); Dependent: Interval or Ratio (1) → Regression
• Single sample: Interval or Ratio (1) → One-Sample 't' test
• Independent samples – Variable I: Nominal (with two groups), Variable II: Interval or Ratio → Independent-sample 't' test
• Dependent samples (before and after) – Interval or Ratio → Paired 't' test
• Variable values – Interval or Ratio (no differentiation between independent and dependent variables) → Correlation
Logistic regression is applied to predict a categorical dependent variable. In other words, it is used when the prediction is categorical, for example, yes or no, true or false, 0 or 1. The predicted output of logistic regression is one of these categories, with no middle ground; internally the model estimates a probability, which is thresholded into a class.
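A minimal sketch of this, assuming hypothetical toy data (scikit-learn's LogisticRegression; the feature values and labels below are invented for illustration):
#Minimal logistic regression sketch (hypothetical toy data, for illustration only)
import numpy as np
from sklearn.linear_model import LogisticRegression
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])   # single feature
y = np.array([0, 0, 0, 1, 1, 1])                            # binary outcome
model = LogisticRegression().fit(X, y)
print(model.predict([[2.5]]))         # predicted class: 0 or 1, no middle ground
print(model.predict_proba([[2.5]]))   # underlying class probabilities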
Decision and Conclusion
If the p-value is smaller than alpha (the significance level), there is enough evidence against H₀: reject H₀.
Otherwise, fail to reject H₀. Rejecting H₀ supports H₁. However, failing to reject H₀ does not mean H₀ is valid, nor does it mean H₁ is wrong.
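A minimal sketch of this decision rule (alpha and p below are placeholders for the chosen significance level and a computed p-value):
alpha = 0.05   # significance level (placeholder)
p = 0.03       # p-value returned by a statistical test (placeholder)
if p < alpha:
    print("Reject H0")          # enough evidence against H0
else:
    print("Fail to reject H0")  # not enough evidence against H0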
Pearson Correlation
Tests whether two samples have a linear relationship.
• Assumptions:
• The samples must be independent and identically distributed (iid).
• The samples must follow a normal distribution.
• The samples should have homoscedasticity (equal variance).
• Hypothesis:
• H₀ (Null Hypothesis): The two samples are independent (no linear
relationship).
• H₁ (Alternative Hypothesis): There is a dependency (linear correlation)
between the samples.
If the Pearson correlation coefficient (r) is significantly different from
zero, it indicates a linear relationship between the two variables.
Independent – The observations in one sample should not influence or depend on the observations in the other sample.
Identically Distributed – All observations in a sample should come from the same probability distribution.
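The normality assumption can be checked before running the test; a minimal sketch using scipy's shapiro on placeholder samples x and y (the arrays are invented for illustration):
#Normality check sketch (placeholder data)
import numpy as np
from scipy.stats import shapiro
rng = np.random.default_rng(0)
x = rng.normal(size=100)   # placeholder sample 1
y = rng.normal(size=100)   # placeholder sample 2
for name, sample in (("x", x), ("y", y)):
    stat, p = shapiro(sample)   # H0: the sample comes from a normal distribution
    print('%s: p=%.3f -> %s' % (name, p, 'looks normal' if p > 0.05 else 'not normal'))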
Step-1-Display of Data
import pandas as pd
import numpy as np
WA=pd.read_excel(r"D:\DataSet2024\HypothesisTesting2024\Web_Analytics_6.xlsx")
print(WA)
Step-2-Displaying Meta data
WA.info()
RangeIndex: 250 entries, 0 to 249
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Source_Medium 250 non-null object
1 Year 250 non-null int64
2 MonthOfYear 250 non-null int64
3 Users 250 non-null int64
4 NewUsers 250 non-null int64
5 Sessions 250 non-null int64
6 Bounce Rate 250 non-null float64
7 Pageviews 250 non-null int64
8 AverageSessionDuration 250 non-null object
9 ConversionRatePercentage 250 non-null object
10 Transactions 250 non-null int64
11 Revenue 250 non-null int64
dtypes: float64(1), int64(8), object(3)
Step-3-#Dropping All Rows with Missing Values
WAD=WA.dropna()
print(WAD)
Step-4-Dropping variables other than numerical
WADC=pd.DataFrame(WAD)
WADC.drop(['Year','MonthOfYear','Source_Medium','AverageSessionDuration','ConversionRatePercentage'],axis=1,inplace=True)
Step-5-Finding correlation
WAD_Correlation=WADC.corr()
WAD_Correlation
Users NewUsers Sessions Bounce Rate Pageviews Transactions Revenue
Users 1.000000 0.986702 0.990824 0.272593 0.977173 0.717779 0.735674
NewUsers 0.986702 1.000000 0.965485 0.266696 0.947228 0.695307 0.712963
Sessions 0.990824 0.965485 1.000000 0.267220 0.982619 0.746478 0.760513
Bounce Rate 0.272593 0.266696 0.267220 1.000000 0.243034 0.039252 0.016691
Pageviews 0.977173 0.947228 0.982619 0.243034 1.000000 0.764345 0.775667
Transactions 0.717779 0.695307 0.746478 0.039252 0.764345 1.000000 0.981740
Revenue 0.735674 0.712963 0.760513 0.016691 0.775667 0.981740 1.000000
#Testing Hypothesis
#Hypothesis: Correlation
#Variables: Pageviews, Revenue
#H0: There is no significant association between Pageviews and Revenue
#HA: There is a significant association between Pageviews and Revenue
Step-6-#Importing Library functions
from scipy.stats import pearsonr
import scipy.stats as stat
Step-7-Correlation – Hypothesis Testing
WADP=pd.DataFrame(WAD,columns=['Pageviews','Revenue'])
stat,p=pearsonr(WADP['Pageviews'],WADP['Revenue'])
print('stat=%.3f, p=%.3f' % (stat,p))
if p>0.05:
    print('No association between Pageviews and Revenue')
else:
    print('There is an association between Pageviews and Revenue')
Dataset – HealthInsurance
Test : Correlation
# To display Data
import pandas as pd
import numpy as np
HI=pd.read_excel(r"D:\DataSet2024\HypothesisTesting2024\HealthInsurance_10.xlsx")
print(HI)
#Finding Correlations
HIC=pd.DataFrame(HI)
HIC.drop(['Gender','Smoker','Region'],axis=1,inplace=True)
Correlation=HIC.corr()
Correlation
Correlation Age BMI Children Charges
Age 1.000000 0.109272 0.042469 0.299008
BMI 0.109272 1.000000 0.012759 0.198341
Children 0.042469 0.012759 1.000000 0.067998
Charges 0.299008 0.198341 0.067998 1.000000
# Hypothesis - Correlation
# Variables : Age and Insurance charges
# H0 : There is no significant association between age
and insurance charges
# H1 : There is a significant association between age
and insurance charges
#Importing Library Functions
from scipy.stats import pearsonr
import scipy.stats as stat
Note: in the format string 'stat=%.3f', the f refers to floating-point decimal format and the .3 rounds to 3 places after the decimal point. See https://docs.python.org/2/library/stdtypes.html#string-formatting
Finding correlation with 'p' Value
df=pd.DataFrame(HI,columns=['Age','Charges'])   # the dataset was loaded above as HI
stat,p=pearsonr(df['Age'], df['Charges'])
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('No association between Age and Charges')
else:
    print('There is an association between Age and Charges')
Output
stat=0.299, p=0.000
There is an association between age and
insurance charges
Dataset – HealthInsurance
Test : Chi-Square
Chi-Square test
Tests whether two categorical variables are related or independent.
Assumptions:
Observations used in the calculation of the contingency table are
independent.
25 or more samples in each cell of the contingency table.
Interpretation
H0: the two samples are independent.
H1: there is a dependency between the samples.
Step-1-Data Display
import pandas as pd
import numpy as np
HR=pd.read_excel(r"D:\DataSet2024\HypothesisTesting2024\HR_Anaytics_7.xlsx")
print(HR)
Step-2-Meta Data Display
HR.info()
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1470 non-null int64
1 Attrition 1470 non-null object
2 BusinessTravel 1470 non-null object
3 DailyRate 1470 non-null int64
4 Department 1470 non-null object
5 DistanceFromHome 1470 non-null int64
6 Education 1470 non-null int64
7 EducationField 1470 non-null object
8 EmployeeCount 1470 non-null int64
9 EmployeeNumber 1470 non-null int64
10 EnvironmentSatisfaction 1470 non-null int64
11 Gender 1470 non-null object
12 HourlyRate 1470 non-null int64
13 JobInvolvement 1470 non-null int64
14 JobLevel 1470 non-null int64
15 JobRole 1470 non-null object
16 JobSatisfaction 1470 non-null int64
17 MaritalStatus 1470 non-null object
18 MonthlyIncome 1470 non-null int64
19 MonthlyRate 1470 non-null int64
20 NumCompaniesWorked 1470 non-null int64
21 Over18 1470 non-null object
22 OverTime 1470 non-null object
23 PercentSalaryHike 1470 non-null int64
24 PerformanceRating 1470 non-null int64
25 RelationshipSatisfaction 1470 non-null int64
26 StandardHours 1470 non-null int64
27 StockOptionLevel 1470 non-null int64
28 TotalWorkingYears 1470 non-null int64
29 TrainingTimesLastYear 1470 non-null int64
30 WorkLifeBalance 1470 non-null int64
31 YearsAtCompany 1470 non-null int64
32 YearsInCurrentRole 1470 non-null int64
33 YearsSinceLastPromotion 1470 non-null int64
34 YearsWithCurrManager 1470 non-null int64
dtypes: int64(26), object(9)
Step-3-Hypothesis – Category to Category –
Object (Data)
#H0: There is no association between Attrition
and Overtime
#HA: There is an association between Attrition
and Overtime
Step-4-Importing library and constructing Cross Tabulation
from scipy.stats import chi2_contingency
HRChi=pd.DataFrame(HR,columns=['Attrition','OverTime'])
HRChi=pd.crosstab(HR.Attrition,HR.OverTime)
HRChi
Step-5-Calculation of Chi-Square
stat,p,dof,expected=chi2_contingency(HRChi)   # HRChi is the cross-tabulation from Step 4
print('stat=%.3f, p=%.3f' % (stat,p))
if p>0.05:
    print('No association between Attrition and Overtime')
else:
    print('There is an association between Attrition and Overtime')
Chi-Square-2 - HealthInsurance
# Hypothesis -Chi-Square Test
# Variables : Gender and Smoking Habits
#H0 : There is no association between smoking and
gender
#H1: There is a significant association between
smoking and gender
#Chi-Square Data Display
import pandas as pd
import numpy as np
HI=pd.read_excel(r"D:\DataSet2024\HypothesisTesting2024\HealthInsurance_10.xlsx")
print(HI)
# Cross-Tabulation & Table
from scipy.stats import chi2_contingency
df=pd.DataFrame(HI,columns=['Gender','Smoker'])   # the dataset was loaded above as HI
df=pd.crosstab(HI.Gender,HI.Smoker)
df
#To get Chi-Square Value
stat, p, dof, expected = chi2_contingency(df)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('There is no association between smoking and gender')
else:
    print('There is an association between smoking and gender')
Output
stat=7.393, p=0.007
Smoker     no   yes
Gender
female    547   115
male      517   159
Gender and smoking habits are dependent.
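The expected array returned by chi2_contingency can be used to check the cell-count assumption stated earlier; a short sketch (df is the Gender x Smoker cross-tabulation built above):
#Checking the cell-count assumption (sketch)
stat, p, dof, expected = chi2_contingency(df)
print(expected)                 # expected frequency for each cell
print((expected >= 25).all())   # True if every cell meets the 25-per-cell guideline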
Hypothesis Testing
Analysis of Variance Test (ANOVA)
Tests whether the means of two or more independent samples are
significantly different.
• Assumptions:
• Observations in each sample are independent and identically distributed
(i.i.d.).
• Observations in each sample are normally distributed.
• The samples have equal variances (homogeneity of variance).
Interpretation:
• H₀ (Null Hypothesis): The means of the samples are equal.
• H₁ (Alternative Hypothesis): At least one of the sample means is different.
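The equal-variance and normality assumptions can be checked before running the ANOVA; a minimal sketch with scipy's levene and shapiro on placeholder group arrays (the data below is invented for illustration):
#ANOVA assumption checks (placeholder data)
import numpy as np
from scipy.stats import levene, shapiro
rng = np.random.default_rng(1)
g1 = rng.normal(loc=0.0, size=50)   # placeholder group 1
g2 = rng.normal(loc=0.2, size=50)   # placeholder group 2
g3 = rng.normal(loc=0.4, size=50)   # placeholder group 3
stat, p = levene(g1, g2, g3)        # H0: the groups have equal variances
print('levene p=%.3f' % p)
for g in (g1, g2, g3):
    print('shapiro p=%.3f' % shapiro(g)[1])   # H0: the group is normally distributed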
Dataset - HealthInsurance
# To display Data
import pandas as pd
import numpy as np
HI=pd.read_excel(r"D:\DataSet2024\HypothesisTesting2024\HealthInsurance_10.xlsx")
print(HI)
#To display – Meta Data
HI.info()
Health Insurance
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1338 non-null int64
1 Gender 1338 non-null object
2 BMI 1338 non-null float64
3 Children 1338 non-null int64
4 Smoker 1338 non-null object
5 Region 1338 non-null object
6 Charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
# Hypothesis : One-Way ANOVA
#H0: Mean insurance charges are equal in all four regions
#H1: Mean insurance charges differ in at least one region
# One-Way ANOVA – to get data in the form of a table
from scipy.stats import f_oneway
df=pd.DataFrame(HI,columns=['Region','Charges'])
df
Data=pd.crosstab(HI.Charges,HI.Region)
Data
# ANOVA Results
import scipy.stats as stats
R1=df['Charges'][df['Region']=='northeast']
R2=df['Charges'][df['Region']=='northwest']
R3=df['Charges'][df['Region']=='southeast']
R4=df['Charges'][df['Region']=='southwest']
ANOVA_Result=stats.f_oneway(R1,R2,R3,R4)
print(ANOVA_Result)
Hypothesis Testing
# Significance level
alpha = 0.05
# Hypothesis testing
if ANOVA_Result.pvalue < alpha:
    print("Reject the null hypothesis: At least one of the sample means is significantly different.")
else:
    print("Fail to reject the null hypothesis: The sample means are not significantly different.")
Output
F_onewayResult(statistic=2.96962669358912,
pvalue=0.0308933560705201)
Example 2 : ANOVA
Test whether the mean number of bike bookings is equal across all four seasons (see the sketch below).
Dataset : Season
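A sketch of how Example 2 could be coded; the file path and the column names Season and Bookings are assumptions, not taken from the actual dataset:
#Season ANOVA sketch (hypothetical file path and column names)
import pandas as pd
from scipy.stats import f_oneway
SE = pd.read_excel(r"D:\DataSet2024\HypothesisTesting2024\Season.xlsx")   # assumed path
groups = [g['Bookings'] for _, g in SE.groupby('Season')]   # one array of bookings per season
stat, p = f_oneway(*groups)
print('stat=%.3f, p=%.3f' % (stat, p))
#p < 0.05 would indicate that mean bookings differ in at least one season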
Predictive Analytics
Transforming Data into the Future
Syllabus
Steps to Building a Multiple Linear
Regression Model and Model Diagnostics
Predictive Analytics
Predictive analytics is the process of using historical
or primary data to forecast future outcomes.
It involves applying data analysis, machine learning,
artificial intelligence, and statistical models to identify
patterns that can be used to predict future behavior.
Simple Linear Regression
Simple linear regression is a statistical technique used for finding the existence of a relationship between a dependent variable (aka response variable or outcome variable) and an independent variable (aka explanatory variable, predictor variable or feature).
https://www.analyticsvidhya.com/blog/2021/10/everything-you-need-to-know-about-linear-regression/
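A minimal sketch of a simple linear regression with statsmodels, using invented arrays x (predictor) and y (response) for illustration:
#Simple linear regression sketch (invented data)
import numpy as np
import statsmodels.api as sm
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # hypothetical predictor
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])    # hypothetical response
X = sm.add_constant(x)                      # adds the intercept column
model = sm.OLS(y, X).fit()
print(model.params)                         # [intercept, slope]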
Examples
1. A hospital may be interested in finding how the total cost
of a patient for a treatment varies with the body weight of
the patient
2. Insurance companies would like to understand the
association between healthcare costs and ageing
3. An organization may be interested in finding the
relationship between revenue generated from a product
and features such as the price, money spent on promotion,
competitors’ price, and promotion expenses
4.Restaurants would like to know the relationship between the customer
waiting time after placing the order and the revenue
5.E-commerce companies such as Amazon, Big Basket, and Flipkart would like
to understand the relationship between revenue and features such as
(a) Number of customer visits to their portal
(b) Number of clicks on products
(c) Number of items on sale
(d) Average discount percentage
6.Banks and other financial institutions would like to understand the impact of
variables such as unemployment rate, marital status, balance in the bank
account, rain fall, etc. on the percentage of non-performing assets (NPA)
Steps In Building a Regression Model
Step 1: Collect/Extract Data
Step 2: Pre-Process the Data
Step 3: Dividing Data into Training and Validation Datasets
Step 4: Perform Descriptive Analytics or Data Exploration
Step 5: Build the Model
Step 6: Perform Model Diagnostics
Step 7: Validate the Model and Measure Model Accuracy
Step 8: Decide on Model Deployment
Step 1 – import libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
Step-2 – Setting NumPy print options – Decimal Values
## Setting NumPy print options to print decimal values up to 4 decimal places
np.set_printoptions(precision=4, linewidth=100)
Step-3- #Read the Excel file
RetailSales=pd.read_excel(r"D:\DataSet2025\mlr\Retail_12.xlsx",index_col=0)
print(RetailSales)
RetailSales.info()
RetailSales.shape
Step-4 – To remove the rows that contain NULL values (if any)
RetailSales=RetailSales.dropna()   # dropna returns a new DataFrame, so reassign to keep the result
Step-5-Creating Feature set (X) – Independent Variable
RT=pd.DataFrame(RetailSales,columns=['Temperature','Fuel_Price','CPI','Unemployment'])
X=sm.add_constant(RT)
X.head()
Reference
RT=RT.dropna()   # reassign: dropna does not modify in place
X=X.dropna()
Step-6-Creating Outcome Variable(Y) - Dependent Variable
RT1=pd.DataFrame(RetailSales,columns=['Weekly_Sales'])
Y=RT1
Y.head()
Reference
Y=Y.dropna()
Step-7-Splitting Data Set
from sklearn.model_selection import train_test_split
train_X,test_X,train_Y,test_Y=train_test_split(X,Y,train_size=0.8,random_state=100)
Note: train_size = 0.8 implies 80% of the data is used for
training the model and the remaining 20% is used for
validating the model.
Step-8- Fitting the Model - Fit the model using OLS method and pass train_Y and train_X as parameters
RetailWS_LM=sm.OLS(train_Y,train_X).fit()
print(RetailWS_LM.params)
Step-9-Regression Model Summary
RetailWS_LM.summary2()
Hypothesis Test for the Regression Coefficient
The regression coefficient b₁ (the slope of y on x) captures the existence of a linear relationship between the outcome variable and the feature.
If b₁ = 0, we can conclude that there is no statistically significant linear relationship between the two variables.
H0: b1 = 0
HA: b1 ≠ 0
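statsmodels computes these t-tests automatically; a short sketch of reading them off the model fitted in Step 8:
#Per-coefficient t-statistics and p-values for H0: b1 = 0
print(RetailWS_LM.tvalues)   # t statistic for each coefficient
print(RetailWS_LM.pvalues)   # reject H0 for coefficients with p-value < 0.05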
Model Output
Results: Ordinary least squares
=====================================================================
Model:              OLS               Adj. R-squared:     0.022
Dependent Variable: Weekly_Sales      AIC:                150832.0011
Date:               2023-10-22 11:52  BIC:                150864.7329
No. Observations:   5148              Log-Likelihood:     -75411.
Df Model:           4                 F-statistic:        30.39
Df Residuals:       5143              Prob (F-statistic): 4.87e-25
R-squared:          0.023             Scale:              3.1014e+11
---------------------------------------------------------------------
               Coef.         Std.Err.    t        P>|t|   [0.025        0.975]
---------------------------------------------------------------------
const          1798119.3852  88643.2319  20.2849  0.0000  1624340.9461  1971897.8244
Temperature    -773.7856     442.6573    -1.7480  0.0805  -1641.5822    94.0111
Fuel_Price     -32229.0894   17567.6200  -1.8346  0.0666  -66669.0970   2210.9182
CPI            -1503.6908    218.2353    -6.8902  0.0000  -1931.5248    -1075.8568
Unemployment   -41633.7558   4467.3299   -9.3196  0.0000  -50391.6226   -32875.8891
---------------------------------------------------------------------
Omnibus:        294.750   Durbin-Watson:    1.975
Prob(Omnibus):  0.000     Jarque-Bera (JB): 347.278
Skew:           0.636     Prob(JB):         0.000
Kurtosis:       3.051     Condition No.:    2154
=====================================================================
Estimated parameters (model.params): const 1798119.3852, Temperature -773.7856, Fuel_Price -32229.0894, CPI -1503.6908, Unemployment -41633.7558
Estimated parameters
The estimated (predicted) model can be written as:
Weekly_Sales = 1798119.3852 - 773.7856(Temperature) - 32229.0894(Fuel_Price) - 1503.6908(CPI) - 41633.7558(Unemployment)
The equation can be interpreted as follows: holding the other features constant, a one-unit increase in Temperature is associated with a decrease of 773.79 in weekly sales; likewise 32229.09 for Fuel_Price, 1503.69 for CPI, and 41633.76 for Unemployment. The value 1798119.39 is the intercept.
Output-Summary
1. The model R-squared value is 0.023, that is, the model explains 2.3% of the variation in weekly sales.
2. The p-values for the t-tests are 0.0805 (Temperature), 0.0666 (Fuel_Price), 0.0000 (CPI) and 0.0000 (Unemployment), which indicates that CPI and Unemployment have a statistically significant relationship with Weekly Sales (at significance level α = 0.05), while Temperature and Fuel_Price do not.
Also, the probability value of the F-statistic of the model is 0.00, which indicates that the overall model is statistically significant.
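Step 7 of the process (validate the model and measure accuracy) is not shown above; a minimal sketch using the held-out split from Step 7 of the code (sklearn's metrics are one common choice):
#Model validation sketch on the 20% hold-out split
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
pred_Y = RetailWS_LM.predict(test_X)   # predictions on the validation data
print('RMSE:', np.sqrt(mean_squared_error(test_Y, pred_Y)))
print('R-squared:', r2_score(test_Y, pred_Y))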
Outlier Analysis
Outliers are observations whose values show a large
deviation from the mean value. Presence of an outlier can
have a significant influence on the values of regression
coefficients.
• Cook’s Distance - Value of more than 1 indicates highly
influential observation
#Library Functions
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
Step-10-Cooks Distance
import numpy as np
WS_Influence=RetailWS_LM.get_influence()
(c,p)=WS_Influence.cooks_distance   # Cook's distance values and corresponding p-values
plt.stem(np.arange(len(train_X)),np.round(c,3))
plt.title("Cooks distance")
plt.xlabel("Row Index")
plt.ylabel("Cooks Distance")
plt.show()
print(c,p)
Interpretation
It is clear from the figure that none of the observations has a Cook's distance exceeding 1, so there are no highly influential outliers.
Multi-Collinearity and Handling Multi-Collinearity
When the dataset has a large number of independent variables (features), it is possible that a few of these features may be highly correlated. The existence of a high correlation between independent variables is called multi-collinearity.
The presence of multi-collinearity can destabilize the multiple linear regression model.
Variance Inflation Factor (VIF)
Variance Inflation Factor (VIF) is a measure used for identifying
the existence of multi-collinearity
• The variance_inflation_factor() method available in the statsmodels.stats.outliers_influence package can be used to calculate VIF for the features.
• Features with a VIF value of more than 4 indicate multi-collinearity and are candidates for removal.
Step-11- Variance Inflation Factor (VIF)
from statsmodels.stats.outliers_influence import
variance_inflation_factor
from statsmodels.tools.tools import add_constant
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)
Step-12-Correlation
Correlation=RT.corr(method='pearson');
print(Correlation)
VIF output (Step-11):
Temperature     1.104991
Fuel_Price      1.081709
CPI             1.220733
Unemployment    1.149112
dtype: float64
Correlation output (Step-12):
              Temperature  Fuel_Price       CPI  Unemployment
Temperature      1.000000    0.144982  0.176888      0.101158
Fuel_Price       0.144982    1.000000 -0.170642     -0.034684
CPI              0.176888   -0.170642  1.000000     -0.302020
Unemployment     0.101158   -0.034684 -0.302020      1.000000
#Step-13-Regression Plot
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
fig = plt.figure(figsize=(14, 8))
fig = sm.graphics.plot_regress_exog(RetailWS_LM,'CPI',fig=fig)
plt.show()
fig = plt.figure(figsize=(14, 8))
fig = sm.graphics.plot_regress_exog(RetailWS_LM,'Fuel_Price',fig=fig)
plt.show()
fig = plt.figure(figsize=(14, 8))
fig = sm.graphics.plot_regress_exog(RetailWS_LM,'Unemployment',fig=fig)
plt.show()
Work Book : MLR_2
Dataset: Song
Preamble
Humans have long associated themselves with songs and music. Music can improve mood, decrease pain and anxiety, and facilitate opportunities for emotional expression. Research suggests that music can benefit our physical and mental health in numerous ways. To understand songs and their popularity, the dataset provides factors such as song duration, Acousticness, Danceability and Energy. Develop a multiple regression equation to predict song popularity.
#Step-1- Import Library Modules
import pandas as pd
import numpy as np
import statsmodels.api as sm
import pandas.testing as tm
#Setting NumPy print option to decimal values up to 4 decimal places
np.set_printoptions(precision=4,linewidth=100)
#Step-2-Read the excel file
#Read the excel file
SongPop=pd.read_excel(r"D:\DataSet2025\Song_10.xlsx")
SongPop.head(10)
#MetaData
SongPop.info()
#Step-3-#Creating feature set(X) and outcome variable(Y)
SP=pd.DataFrame(SongPop,columns=['Duration','Acousticness','Danceability','Energy'])
SP=SP.dropna()
X=sm.add_constant(SP)
X.head(5)
#Creating Outcome set (Y) dependent variable
SP1=pd.DataFrame(SongPop,columns=['Popularity'])
SP1=SP1.dropna()
Y=SP1
Y.head(5)
#Step-4-#Splitting data set
from sklearn.model_selection import train_test_split
train_X,test_X,train_Y,test_Y=train_test_split(X,Y,train_size=0.8,random_state=100)
#Step-5-#Fit the model using OLS method and pass train_Y and train_X as parameters
SongPop_LM=sm.OLS(train_Y,train_X).fit()
#Step-6-#Printing estimated parameters
print(SongPop_LM.params)
print(SongPop_LM.summary2())
Estimated parameters (print(SongPop_LM.params)):
const           51.065922
Duration        -0.000003
Acousticness    -6.745192
Danceability    12.760540
Energy          -6.006211
dtype: float64
Results: Ordinary least squares
====================================================================
Model:              OLS               Adj. R-squared:     0.015
Dependent Variable: Popularity        AIC:                135136.8742
Date:               2024-05-04 16:45  BIC:                135174.9599
No. Observations:   15020             Log-Likelihood:     -67563.
Df Model:           4                 F-statistic:        58.21
Df Residuals:       15015             Prob (F-statistic): 7.80e-49
R-squared:          0.015             Scale:              472.92
---------------------------------------------------------------------
              Coef.    Std.Err.  t        P>|t|   [0.025   0.975]
---------------------------------------------------------------------
const         51.0659  1.4455    35.3265  0.0000  48.2325  53.8994
Duration      -0.0000  0.0000    -0.9233  0.3559  -0.0000  0.0000
Acousticness  -6.7452  0.8387    -8.0422  0.0000  -8.3892  -5.1012
Danceability  12.7605  1.1689    10.9163  0.0000  10.4693  15.0518
Energy        -6.0062  1.1129    -5.3969  0.0000  -8.1876  -3.8248
---------------------------------------------------------------------
Omnibus:        654.701  Durbin-Watson:    1.998
Prob(Omnibus):  0.000    Jarque-Bera (JB): 737.904
Skew:           -0.538   Prob(JB):         0.000
Kurtosis:       2.856    Condition No.:    2414309
====================================================================
#Step-7-Outlier Analysis-Cooks Distance
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
import numpy as np
Song_Influence=SongPop_LM.get_influence()
(c,p)=Song_Influence.cooks_distance
plt.stem(np.arange(len(train_X)),np.round(c,3))
plt.title("Cooks distance")
plt.xlabel("Row Index")
plt.ylabel("Cooks Distance")
plt.show()
#Print Cooks distance values
np.set_printoptions(suppress=True)
Cook=(c,p)
print(Cook)
#cooks_distance returns an array of Cook's distance values for each observation followed by an array of corresponding p-values.
#Step-8-Multicollinearity-#Variance Inflation Factor
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)
#Step-9- Correlation Matrix
Correlation=SP.corr(method='pearson');
print(Correlation)
#Step-10-Heatmap-Dropping Column Song Name
HSongPop=SongPop.drop(['Song Name'],axis=1)
HSongPop=HSongPop.dropna()   # reassign: dropna does not modify in place
#Heatmap
fig,ax=plt.subplots()
sn.heatmap(HSongPop.corr(),annot=True,cmap="YlGnBu",linecolor='r',linewidths=0.5)   # note: the 'l' in linewidths is a lowercase L
plt.show()
#Step-11-Regression Plot
import seaborn as sns
sns.set_style('whitegrid')
sns.lmplot(x='Danceability', y='Popularity', data=SongPop)
plt.show()
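Once fitted, the model can score a new song; a sketch where the feature values are invented for illustration (column order must match the design matrix built in Step 3):
#Predicting popularity for a hypothetical new song
import pandas as pd
new_song = pd.DataFrame([[1.0, 210000, 0.30, 0.70, 0.60]],
                        columns=['const','Duration','Acousticness','Danceability','Energy'])
print(SongPop_LM.predict(new_song))   # predicted Popularity score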