Hypothesis Testing
Hypothesis Testing – Pearson's Correlation Coefficient, Chi-Squared Test, One-Way Analysis of Variance Test (ANOVA).
Predictive Analysis: Steps to Building a Multiple Linear Regression, Model Diagnostics.
https://towardsdatascience.com/hypothesis-testing-for-data-scientists-everything-you-need-to-know-8c36ddde4cd2
Statistical Analysis
Defining Hypothesis
The scientific question that must be answered, expressed in the form of the Null Hypothesis (H₀) and the Alternative Hypothesis (H₁ or Hₐ). H₀ and H₁ must be mutually exclusive, and H₁ shouldn't contain equality:
H₀: μ = x, H₁: μ ≠ x
H₀: μ ≤ x, H₁: μ > x
H₀: μ ≥ x, H₁: μ < x
Choosing a statistical test by variable measurement and count:
• Independent: Nominal (1); Dependent: Nominal (1) → Chi-Square Test
• Independent: Nominal (1); Dependent: Ordinal/Continuous (1) → ANOVA
• Independent: Nominal (1); Dependent: Interval or Ratio (1) → One-Way ANOVA
• Independent: Nominal (2); Dependent: Interval or Ratio (1) → Two-Way ANOVA
• Independent: Interval or Ratio (2 or more); Dependent: Interval or Ratio (1) → Regression
• Single sample: Interval or Ratio (1) → One-Sample 't' test
• Independent samples – Variable I: Nominal (with two groups), Variable II: Interval or Ratio → Independent-sample 't' test
• Dependent samples (before and after) – Interval or Ratio → Paired 't' test
• Variable values – Interval or Ratio (no differentiation between independent and dependent variables) → Correlation
Logistic regression is applied to predict a categorical dependent variable. In other words, it is used when the prediction is categorical, for example, yes or no, true or false, 0 or 1. The predicted output of logistic regression is one of these categories, with no middle ground; internally the model estimates a probability, which is thresholded into a class.
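A minimal sketch of this, assuming hypothetical toy data (scikit-learn's LogisticRegression; the feature values and labels below are invented for illustration):
#Minimal logistic regression sketch (hypothetical toy data, for illustration only)
import numpy as np
from sklearn.linear_model import LogisticRegression
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])   # single feature
y = np.array([0, 0, 0, 1, 1, 1])                            # binary outcome
model = LogisticRegression().fit(X, y)
print(model.predict([[2.5]]))         # predicted class: 0 or 1, no middle ground
print(model.predict_proba([[2.5]]))   # underlying class probabilities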
Decision and Conclusion
If the p-value is smaller than alpha (the significance level), there is enough evidence against H₀: reject H₀.
Otherwise, fail to reject H₀. Rejecting H₀ supports H₁. However, failing to reject H₀ does not mean H₀ is valid, nor does it mean H₁ is wrong.
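A minimal sketch of this decision rule (alpha and p below are placeholders for the chosen significance level and a computed p-value):
alpha = 0.05   # significance level (placeholder)
p = 0.03       # p-value returned by a statistical test (placeholder)
if p < alpha:
    print("Reject H0")          # enough evidence against H0
else:
    print("Fail to reject H0")  # not enough evidence against H0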
Pearson Correlation
Tests whether two samples have a linear relationship.
• Assumptions:
• The samples must be independent and identically distributed (iid).
• The samples must follow a normal distribution.
• The samples should have homoscedasticity (equal variance).
• Hypothesis:
• H₀ (Null Hypothesis): The two samples are independent (no linear
relationship).
• H₁ (Alternative Hypothesis): There is a dependency (linear correlation)
between the samples.
If the Pearson correlation coefficient (r) is significantly different from
zero, it indicates a linear relationship between the two variables.
Independent – The observations in one sample should not influence or depend on the observations in the other sample.
Identically Distributed – All observations in a sample should come from the same probability distribution.
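The normality assumption can be checked before running the test; a minimal sketch using scipy's shapiro on placeholder samples x and y (the arrays are invented for illustration):
#Normality check sketch (placeholder data)
import numpy as np
from scipy.stats import shapiro
rng = np.random.default_rng(0)
x = rng.normal(size=100)   # placeholder sample 1
y = rng.normal(size=100)   # placeholder sample 2
for name, sample in (("x", x), ("y", y)):
    stat, p = shapiro(sample)   # H0: the sample comes from a normal distribution
    print('%s: p=%.3f -> %s' % (name, p, 'looks normal' if p > 0.05 else 'not normal'))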
Step-1-Display of Data
import pandas as pd
import numpy as np
WA=pd.read_excel(r"D:\DataSet2024\HypothesisTesting2024\Web_Analytics_6.xlsx")
print(WA)
Step-2-Displaying Meta data
WA.info()
RangeIndex: 250 entries, 0 to 249
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Source_Medium 250 non-null object
1 Year 250 non-null int64
2 MonthOfYear 250 non-null int64
3 Users 250 non-null int64
4 NewUsers 250 non-null int64
5 Sessions 250 non-null int64
6 Bounce Rate 250 non-null float64
7 Pageviews 250 non-null int64
8 AverageSessionDuration 250 non-null object
9 ConversionRatePercentage 250 non-null object
10 Transactions 250 non-null int64
11 Revenue 250 non-null int64
dtypes: float64(1), int64(8), object(3)
Step-3-#Dropping All Rows with Missing Values
WAD=WA.dropna()
print(WAD)
Step-4-Dropping variables other than numerical
WADC=pd.DataFrame(WAD)
WADC.drop(['Year','MonthOfYear','Source_Medium','AverageSessionDuration','ConversionRatePercentage'],axis=1,inplace=True)
Step-5-Finding correlation
WAD_Correlation=WADC.corr()
WAD_Correlation
Users NewUsers Sessions Bounce Rate Pageviews Transactions Revenue
Users 1.000000 0.986702 0.990824 0.272593 0.977173 0.717779 0.735674
NewUsers 0.986702 1.000000 0.965485 0.266696 0.947228 0.695307 0.712963
Sessions 0.990824 0.965485 1.000000 0.267220 0.982619 0.746478 0.760513
Bounce Rate 0.272593 0.266696 0.267220 1.000000 0.243034 0.039252 0.016691
Pageviews 0.977173 0.947228 0.982619 0.243034 1.000000 0.764345 0.775667
Transactions 0.717779 0.695307 0.746478 0.039252 0.764345 1.000000 0.981740
Revenue 0.735674 0.712963 0.760513 0.016691 0.775667 0.981740 1.000000
#Testing Hypothesis
#Hypothesis: Correlation
#Variables: Pageviews, Revenue
#H0: There is no significant association between Pageviews and Revenue
#HA: There is a significant association between Pageviews and Revenue
Step-6-#Importing Library functions
from scipy.stats import pearsonr
import scipy.stats as stat
Step-7-Correlation – Hypothesis Testing
WADP=pd.DataFrame(WAD,columns=['Pageviews','Revenue'])
stat,p=pearsonr(WADP['Pageviews'],WADP['Revenue'])
print('stat=%.3f, p=%.3f' % (stat,p))
if p>0.05:
    print('No association between Pageviews and Revenue')
else:
    print('There is an association between Pageviews and Revenue')
Dataset – HealthInsurance
Test : Correlation
# To display Data
import pandas as pd
import numpy as np
HI=pd.read_excel(r"D:\DataSet2024\HypothesisTesting2024\HealthInsurance_10.xlsx")
print(HI)
#Finding Correlations
HIC=pd.DataFrame(HI)
HIC.drop(['Gender','Smoker','Region'],axis=1,inplace=True)
Correlation=HIC.corr()
Correlation
Correlation Age BMI Children Charges
Age 1.000000 0.109272 0.042469 0.299008
BMI 0.109272 1.000000 0.012759 0.198341
Children 0.042469 0.012759 1.000000 0.067998
Charges 0.299008 0.198341 0.067998 1.000000
# Hypothesis - Correlation
# Variables : Age and Insurance charges
# H0 : There is no significant association between age
and insurance charges
# H1 : There is a significant association between age
and insurance charges
#Importing Library Functions
from scipy.stats import pearsonr
import scipy.stats as stat
Note: in the format string 'stat=%.3f', the f refers to floating-point decimal format and the .3 rounds to 3 places after the decimal point. See https://docs.python.org/2/library/stdtypes.html#string-formatting
Finding correlation with 'p' Value
df=pd.DataFrame(HI,columns=['Age','Charges'])   # the dataset was loaded above as HI
stat,p=pearsonr(df['Age'], df['Charges'])
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('No association between Age and Charges')
else:
    print('There is an association between Age and Charges')
Output
stat=0.299, p=0.000
There is an association between age and
insurance charges
Dataset – HealthInsurance
Test : Chi-Square
Chi-Square test
Tests whether two categorical variables are related or independent.
Assumptions:
Observations used in the calculation of the contingency table are
independent.
25 or more samples in each cell of the contingency table.
Interpretation
H0: the two samples are independent.
H1: there is a dependency between the samples.
Step-1-Data Display
import pandas as pd
import numpy as np
HR=pd.read_excel(r"D:\DataSet2024\HypothesisTesting2024\HR_Anaytics_7.xlsx")
print(HR)
Step-2-Meta Data Display
HR.info()
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1470 non-null int64
1 Attrition 1470 non-null object
2 BusinessTravel 1470 non-null object
3 DailyRate 1470 non-null int64
4 Department 1470 non-null object
5 DistanceFromHome 1470 non-null int64
6 Education 1470 non-null int64
7 EducationField 1470 non-null object
8 EmployeeCount 1470 non-null int64
9 EmployeeNumber 1470 non-null int64
10 EnvironmentSatisfaction 1470 non-null int64
11 Gender 1470 non-null object
12 HourlyRate 1470 non-null int64
13 JobInvolvement 1470 non-null int64
14 JobLevel 1470 non-null int64
15 JobRole 1470 non-null object
16 JobSatisfaction 1470 non-null int64
17 MaritalStatus 1470 non-null object
18 MonthlyIncome 1470 non-null int64
19 MonthlyRate 1470 non-null int64
20 NumCompaniesWorked 1470 non-null int64
21 Over18 1470 non-null object
22 OverTime 1470 non-null object
23 PercentSalaryHike 1470 non-null int64
24 PerformanceRating 1470 non-null int64
25 RelationshipSatisfaction 1470 non-null int64
26 StandardHours 1470 non-null int64
27 StockOptionLevel 1470 non-null int64
28 TotalWorkingYears 1470 non-null int64
29 TrainingTimesLastYear 1470 non-null int64
30 WorkLifeBalance 1470 non-null int64
31 YearsAtCompany 1470 non-null int64
32 YearsInCurrentRole 1470 non-null int64
33 YearsSinceLastPromotion 1470 non-null int64
34 YearsWithCurrManager 1470 non-null int64
dtypes: int64(26), object(9)
Step-3-Hypothesis – Category to Category –
Object (Data)
#H0: There is no association between Attrition
and Overtime
#HA: There is an association between Attrition
and Overtime
Step-4-Importing library and constructing Cross Tabulation
from scipy.stats import chi2_contingency
HRChi=pd.DataFrame(HR,columns=['Attrition','OverTime'])
HRChi=pd.crosstab(HR.Attrition,HR.OverTime)
HRChi
Step-5-Calculation of Chi-Square
stat,p,dof,expected=chi2_contingency(HRChi)   # HRChi is the cross-tabulation from Step 4
print('stat=%.3f, p=%.3f' % (stat,p))
if p>0.05:
    print('No association between Attrition and Overtime')
else:
    print('There is an association between Attrition and Overtime')
Chi-Square-2 - HealthInsurance
# Hypothesis -Chi-Square Test
# Variables : Gender and Smoking Habits
#H0 : There is no association between smoking and
gender
#H1: There is a significant association between
smoking and gender
#Chi-Square Data Display
import pandas as pd
import numpy as np
HI=pd.read_excel(r"D:\DataSet2024\HypothesisTesting2024\HealthInsurance_10.xlsx")
print(HI)
# Cross-Tabulation & Table
from scipy.stats import chi2_contingency
df=pd.DataFrame(HI,columns=['Gender','Smoker'])   # the dataset was loaded above as HI
df=pd.crosstab(HI.Gender,HI.Smoker)
df
#To get Chi-Square Value
stat, p, dof, expected = chi2_contingency(df)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('There is no association between smoking and gender')
else:
    print('There is an association between smoking and gender')
Output
stat=7.393, p=0.007
Smoker     no   yes
Gender
female    547   115
male      517   159
Gender and smoking habits are dependent.
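The expected array returned by chi2_contingency can be used to check the cell-count assumption stated earlier; a short sketch (df is the Gender x Smoker cross-tabulation built above):
#Checking the cell-count assumption (sketch)
stat, p, dof, expected = chi2_contingency(df)
print(expected)                 # expected frequency for each cell
print((expected >= 25).all())   # True if every cell meets the 25-per-cell guideline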
Hypothesis Testing
Analysis of Variance Test (ANOVA)
Tests whether the means of two or more independent samples are
significantly different.
• Assumptions:
• Observations in each sample are independent and identically distributed
(i.i.d.).
• Observations in each sample are normally distributed.
• The samples have equal variances (homogeneity of variance).
Interpretation:
• H₀ (Null Hypothesis): The means of the samples are equal.
• H₁ (Alternative Hypothesis): At least one of the sample means is different.
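The equal-variance and normality assumptions can be checked before running the ANOVA; a minimal sketch with scipy's levene and shapiro on placeholder group arrays (the data below is invented for illustration):
#ANOVA assumption checks (placeholder data)
import numpy as np
from scipy.stats import levene, shapiro
rng = np.random.default_rng(1)
g1 = rng.normal(loc=0.0, size=50)   # placeholder group 1
g2 = rng.normal(loc=0.2, size=50)   # placeholder group 2
g3 = rng.normal(loc=0.4, size=50)   # placeholder group 3
stat, p = levene(g1, g2, g3)        # H0: the groups have equal variances
print('levene p=%.3f' % p)
for g in (g1, g2, g3):
    print('shapiro p=%.3f' % shapiro(g)[1])   # H0: the group is normally distributed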
Dataset - HealthInsurance
# To display Data
import pandas as pd
import numpy as np
HI=pd.read_excel(r"D:\DataSet2024\HypothesisTesting2024\HealthInsurance_10.xlsx")
print(HI)
#To display – Meta Data
HI.info()
Health Insurance
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 1338 non-null int64
1 Gender 1338 non-null object
2 BMI 1338 non-null float64
3 Children 1338 non-null int64
4 Smoker 1338 non-null object
5 Region 1338 non-null object
6 Charges 1338 non-null float64
dtypes: float64(2), int64(2), object(3)
# Hypothesis : One-Way ANOVA
#H0: Mean insurance charges are equal in all four regions
#H1: Mean insurance charges differ in at least one region
# One-Way ANOVA – to get data in the form of a table
from scipy.stats import f_oneway
df=pd.DataFrame(HI,columns=['Region','Charges'])
df
Data=pd.crosstab(HI.Charges,HI.Region)
Data
# ANOVA Results
import scipy.stats as stats
R1=df['Charges'][df['Region']=='northeast']
R2=df['Charges'][df['Region']=='northwest']
R3=df['Charges'][df['Region']=='southeast']
R4=df['Charges'][df['Region']=='southwest']
ANOVA_Result=stats.f_oneway(R1,R2,R3,R4)
print(ANOVA_Result)
Hypothesis Testing
# Significance level
alpha = 0.05
# Hypothesis testing
if ANOVA_Result.pvalue < alpha:
    print("Reject the null hypothesis: At least one of the sample means is significantly different.")
else:
    print("Fail to reject the null hypothesis: The sample means are not significantly different.")
Output
F_onewayResult(statistic=2.96962669358912,
pvalue=0.0308933560705201)
Example 2 : ANOVA
Test whether the mean number of bike bookings is equal across all four seasons (see the sketch below).
Dataset : Season
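A sketch of how Example 2 could be coded; the file path and the column names Season and Bookings are assumptions, not taken from the actual dataset:
#Season ANOVA sketch (hypothetical file path and column names)
import pandas as pd
from scipy.stats import f_oneway
SE = pd.read_excel(r"D:\DataSet2024\HypothesisTesting2024\Season.xlsx")   # assumed path
groups = [g['Bookings'] for _, g in SE.groupby('Season')]   # one array of bookings per season
stat, p = f_oneway(*groups)
print('stat=%.3f, p=%.3f' % (stat, p))
#p < 0.05 would indicate that mean bookings differ in at least one season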
Predictive Analytics
Transforming Data into the Future
Syllabus
Steps to Building a Multiple Linear
Regression Model and Model Diagnostics
Predictive Analytics
Predictive analytics is the process of using historical
or primary data to forecast future outcomes.
It involves applying data analysis, machine learning,
artificial intelligence, and statistical models to identify
patterns that can be used to predict future behavior.
Simple Linear Regression
Simple linear regression is a statistical technique used for finding the existence of a relationship between a dependent variable (aka response variable or outcome variable) and an independent variable (aka explanatory variable, predictor variable or feature).
https://www.analyticsvidhya.com/blog/2021/10/everything-you-need-to-know-about-linear-regression/
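A minimal sketch of a simple linear regression with statsmodels, using invented arrays x (predictor) and y (response) for illustration:
#Simple linear regression sketch (invented data)
import numpy as np
import statsmodels.api as sm
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # hypothetical predictor
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])    # hypothetical response
X = sm.add_constant(x)                      # adds the intercept column
model = sm.OLS(y, X).fit()
print(model.params)                         # [intercept, slope]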
Examples
1. A hospital may be interested in finding how the total cost
of a patient for a treatment varies with the body weight of
the patient
2. Insurance companies would like to understand the
association between healthcare costs and ageing
3. An organization may be interested in finding the
relationship between revenue generated from a product
and features such as the price, money spent on promotion,
competitors’ price, and promotion expenses
4.Restaurants would like to know the relationship between the customer
waiting time after placing the order and the revenue
5.E-commerce companies such as Amazon, Big Basket, and Flipkart would like
to understand the relationship between revenue and features such as
(a) Number of customer visits to their portal
(b) Number of clicks on products
(c) Number of items on sale
(d) Average discount percentage
6.Banks and other financial institutions would like to understand the impact of
variables such as unemployment rate, marital status, balance in the bank
account, rain fall, etc. on the percentage of non-performing assets (NPA)
Steps In Building a Regression Model
Step 1: Collect/Extract Data
Step 2: Pre-Process the Data
Step 3: Dividing Data into Training and Validation Datasets
Step 4: Perform Descriptive Analytics or Data Exploration
Step 5: Build the Model
Step 6: Perform Model Diagnostics
Step 7: Validate the Model and Measure Model Accuracy
Step 8: Decide on Model Deployment
Step 1 – import libraries
import pandas as pd
import numpy as np
import statsmodels.api as sm
Step-2 – Setting NumPy print options – Decimal Values
## Setting NumPy print options to print decimal values up to 4 decimal places
np.set_printoptions(precision=4, linewidth=100)
Step-3- #Read the Excel file
RetailSales=pd.read_excel(r"D:\DataSet2025\mlr\Retail_12.xlsx",index_col=0)
print(RetailSales)
RetailSales.info()
RetailSales.shape
Step-4 – To remove the rows that contain NULL values (if any)
RetailSales=RetailSales.dropna()   # dropna returns a new DataFrame, so reassign to keep the result
Step-5-Creating Feature set (X) – Independent Variable
RT=pd.DataFrame(RetailSales,columns=['Temperature','Fuel_Price','CPI','Unemployment'])
X=sm.add_constant(RT)
X.head()
Reference
RT=RT.dropna()   # reassign: dropna does not modify in place
X=X.dropna()
Step-6-Creating Outcome Variable(Y) - Dependent Variable
RT1=pd.DataFrame(RetailSales,columns=['Weekly_Sales'])
Y=RT1
Y.head()
Reference
Y=Y.dropna()
Step-7-Splitting Data Set
from sklearn.model_selection import train_test_split
train_X,test_X,train_Y,test_Y=train_test_split(X,Y,train_size=0.8,random_state=100)
Note: train_size = 0.8 implies 80% of the data is used for
training the model and the remaining 20% is used for
validating the model.
Step-8- Fitting the Model - Fit the model using OLS method and pass train_Y and train_X as parameters
RetailWS_LM=sm.OLS(train_Y,train_X).fit()
print(RetailWS_LM.params)
Step-9-Regression Model Summary
RetailWS_LM.summary2()
Hypothesis Test for the Regression Coefficient
The regression coefficient b₁ (the slope of y on x) captures the existence of a linear relationship between the outcome variable and the feature.
If b₁ = 0, we can conclude that there is no statistically significant linear relationship between the two variables.
H0: b1 = 0
HA: b1 ≠ 0
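statsmodels computes these t-tests automatically; a short sketch of reading them off the model fitted in Step 8:
#Per-coefficient t-statistics and p-values for H0: b1 = 0
print(RetailWS_LM.tvalues)   # t statistic for each coefficient
print(RetailWS_LM.pvalues)   # reject H0 for coefficients with p-value < 0.05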
Model Output
Results: Ordinary least squares
=====================================================================
Model:              OLS               Adj. R-squared:     0.022
Dependent Variable: Weekly_Sales      AIC:                150832.0011
Date:               2023-10-22 11:52  BIC:                150864.7329
No. Observations:   5148              Log-Likelihood:     -75411.
Df Model:           4                 F-statistic:        30.39
Df Residuals:       5143              Prob (F-statistic): 4.87e-25
R-squared:          0.023             Scale:              3.1014e+11
---------------------------------------------------------------------
               Coef.         Std.Err.    t        P>|t|   [0.025        0.975]
---------------------------------------------------------------------
const          1798119.3852  88643.2319  20.2849  0.0000  1624340.9461  1971897.8244
Temperature    -773.7856     442.6573    -1.7480  0.0805  -1641.5822    94.0111
Fuel_Price     -32229.0894   17567.6200  -1.8346  0.0666  -66669.0970   2210.9182
CPI            -1503.6908    218.2353    -6.8902  0.0000  -1931.5248    -1075.8568
Unemployment   -41633.7558   4467.3299   -9.3196  0.0000  -50391.6226   -32875.8891
---------------------------------------------------------------------
Omnibus:        294.750   Durbin-Watson:    1.975
Prob(Omnibus):  0.000     Jarque-Bera (JB): 347.278
Skew:           0.636     Prob(JB):         0.000
Kurtosis:       3.051     Condition No.:    2154
=====================================================================
Estimated parameters (model.params): const 1798119.3852, Temperature -773.7856, Fuel_Price -32229.0894, CPI -1503.6908, Unemployment -41633.7558
Estimated parameters
The estimated (predicted) model can be written as:
Weekly_Sales = 1798119.3852 - 773.7856(Temperature) - 32229.0894(Fuel_Price) - 1503.6908(CPI) - 41633.7558(Unemployment)
The equation can be interpreted as follows: holding the other features constant, a one-unit increase in Temperature is associated with a decrease of 773.79 in weekly sales; likewise 32229.09 for Fuel_Price, 1503.69 for CPI, and 41633.76 for Unemployment. The value 1798119.39 is the intercept.
Output-Summary
1. The model R-squared value is 0.023, that is, the model explains 2.3% of the variation in weekly sales.
2. The p-values for the t-tests are 0.0805 (Temperature), 0.0666 (Fuel_Price), 0.0000 (CPI) and 0.0000 (Unemployment), which indicates that CPI and Unemployment have a statistically significant relationship with Weekly Sales (at significance level α = 0.05), while Temperature and Fuel_Price do not.
Also, the probability value of the F-statistic of the model is 0.00, which indicates that the overall model is statistically significant.
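Step 7 of the process (validate the model and measure accuracy) is not shown above; a minimal sketch using the held-out split from Step 7 of the code (sklearn's metrics are one common choice):
#Model validation sketch on the 20% hold-out split
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
pred_Y = RetailWS_LM.predict(test_X)   # predictions on the validation data
print('RMSE:', np.sqrt(mean_squared_error(test_Y, pred_Y)))
print('R-squared:', r2_score(test_Y, pred_Y))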
Outlier Analysis
Outliers are observations whose values show a large
deviation from the mean value. Presence of an outlier can
have a significant influence on the values of regression
coefficients.
• Cook’s Distance - Value of more than 1 indicates highly
influential observation
#Library Functions
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
Step-10-Cooks Distance
import numpy as np
WS_Influence=RetailWS_LM.get_influence()
(c,p)=WS_Influence.cooks_distance   # Cook's distance values and corresponding p-values
plt.stem(np.arange(len(train_X)),np.round(c,3))
plt.title("Cooks distance")
plt.xlabel("Row Index")
plt.ylabel("Cooks Distance")
plt.show()
print(c,p)
Interpretation
It is clear from the figure that none of the observations has a Cook's distance exceeding 1, so there are no highly influential outliers.
Multi-Collinearity and Handling Multi-Collinearity
When the dataset has a large number of independent variables (features), it is possible that a few of these features may be highly correlated. The existence of a high correlation between independent variables is called multi-collinearity.
The presence of multi-collinearity can destabilize the multiple linear regression model.
Variance Inflation Factor (VIF)
Variance Inflation Factor (VIF) is a measure used for identifying
the existence of multi-collinearity
• The variance_inflation_factor() method available in the statsmodels.stats.outliers_influence package can be used to calculate VIF for the features.
• Features with a VIF value of more than 4 indicate multi-collinearity and are candidates for removal.
Step-11- Variance Inflation Factor (VIF)
from statsmodels.stats.outliers_influence import
variance_inflation_factor
from statsmodels.tools.tools import add_constant
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)
Step-12-Correlation
Correlation=RT.corr(method='pearson');
print(Correlation)
VIF output (Step-11):
Temperature     1.104991
Fuel_Price      1.081709
CPI             1.220733
Unemployment    1.149112
dtype: float64
Correlation output (Step-12):
              Temperature  Fuel_Price       CPI  Unemployment
Temperature      1.000000    0.144982  0.176888      0.101158
Fuel_Price       0.144982    1.000000 -0.170642     -0.034684
CPI              0.176888   -0.170642  1.000000     -0.302020
Unemployment     0.101158   -0.034684 -0.302020      1.000000
#Step-13-Regression Plot
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
fig = plt.figure(figsize=(14, 8))
fig = sm.graphics.plot_regress_exog(RetailWS_LM,'CPI',fig=fig)
plt.show()
fig = plt.figure(figsize=(14, 8))
fig = sm.graphics.plot_regress_exog(RetailWS_LM,'Fuel_Price',fig=fig)
plt.show()
fig = plt.figure(figsize=(14, 8))
fig = sm.graphics.plot_regress_exog(RetailWS_LM,'Unemployment',fig=fig)
plt.show()
Work Book : MLR_2
Dataset: Song
Preamble
Humans have long associated themselves with songs and music. Music can improve mood, decrease pain and anxiety, and facilitate opportunities for emotional expression. Research suggests that music can benefit our physical and mental health in numerous ways. To understand songs and their popularity, the dataset provides factors such as song duration, Acousticness, Danceability and Energy. Develop a multiple regression equation to predict song popularity.
#Step-1- Import Library Modules
import pandas as pd
import numpy as np
import statsmodels.api as sm
import pandas.testing as tm
#Setting NumPy print option to decimal values up to 4 decimal places
np.set_printoptions(precision=4,linewidth=100)
#Step-2-Read the excel file
#Read the excel file
SongPop=pd.read_excel(r"D:\DataSet2025\Song_10.xlsx")
SongPop.head(10)
#MetaData
SongPop.info()
#Step-3-#Creating feature set(X) and outcome variable(Y)
SP=pd.DataFrame(SongPop,columns=['Duration','Acousticness','Danceability','Energy'])
SP=SP.dropna()
X=sm.add_constant(SP)
X.head(5)
#Creating Outcome set (Y) dependent variable
SP1=pd.DataFrame(SongPop,columns=['Popularity'])
SP1=SP1.dropna()
Y=SP1
Y.head(5)
#Step-4-#Splitting data set
from sklearn.model_selection import train_test_split
train_X,test_X,train_Y,test_Y=train_test_split(X,Y,train_size=0.8,random_state=100)
#Step-5-#Fit the model using OLS method and pass train_Y and train_X as parameters
SongPop_LM=sm.OLS(train_Y,train_X).fit()
#Step-6-#Printing estimated parameters
print(SongPop_LM.params)
print(SongPop_LM.summary2())
Estimated parameters (print(SongPop_LM.params)):
const           51.065922
Duration        -0.000003
Acousticness    -6.745192
Danceability    12.760540
Energy          -6.006211
dtype: float64
Results: Ordinary least squares
====================================================================
Model:              OLS               Adj. R-squared:     0.015
Dependent Variable: Popularity        AIC:                135136.8742
Date:               2024-05-04 16:45  BIC:                135174.9599
No. Observations:   15020             Log-Likelihood:     -67563.
Df Model:           4                 F-statistic:        58.21
Df Residuals:       15015             Prob (F-statistic): 7.80e-49
R-squared:          0.015             Scale:              472.92
---------------------------------------------------------------------
              Coef.    Std.Err.  t        P>|t|   [0.025   0.975]
---------------------------------------------------------------------
const         51.0659  1.4455    35.3265  0.0000  48.2325  53.8994
Duration      -0.0000  0.0000    -0.9233  0.3559  -0.0000  0.0000
Acousticness  -6.7452  0.8387    -8.0422  0.0000  -8.3892  -5.1012
Danceability  12.7605  1.1689    10.9163  0.0000  10.4693  15.0518
Energy        -6.0062  1.1129    -5.3969  0.0000  -8.1876  -3.8248
---------------------------------------------------------------------
Omnibus:        654.701  Durbin-Watson:    1.998
Prob(Omnibus):  0.000    Jarque-Bera (JB): 737.904
Skew:           -0.538   Prob(JB):         0.000
Kurtosis:       2.856    Condition No.:    2414309
====================================================================
#Step-7-Outlier Analysis-Cooks Distance
import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline
import numpy as np
Song_Influence=SongPop_LM.get_influence()
(c,p)=Song_Influence.cooks_distance
plt.stem(np.arange(len(train_X)),np.round(c,3))
plt.title("Cooks distance")
plt.xlabel("Row Index")
plt.ylabel("Cooks Distance")
plt.show()
#Print Cooks distance values
np.set_printoptions(suppress=True)
Cook=(c,p)
print(Cook)
#cooks_distance returns an array of Cook's distance values for each observation followed by an array of corresponding p-values.
#Step-8-Multicollinearity-#Variance Inflation Factor
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)
#Step-9- Correlation Matrix
Correlation=SP.corr(method='pearson');
print(Correlation)
#Step-10-Heatmap-Dropping Column Song Name
HSongPop=SongPop.drop(['Song Name'],axis=1)
HSongPop=HSongPop.dropna()   # reassign: dropna does not modify in place
#Heatmap
fig,ax=plt.subplots()
sn.heatmap(HSongPop.corr(),annot=True,cmap="YlGnBu",linecolor='r',linewidths=0.5)   # note: the 'l' in linewidths is a lowercase L
plt.show()
#Step-11-Regression Plot
import seaborn as sns
sns.set_style('whitegrid')
sns.lmplot(x='Danceability', y='Popularity', data=SongPop)
plt.show()
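Once fitted, the model can score a new song; a sketch where the feature values are invented for illustration (column order must match the design matrix built in Step 3):
#Predicting popularity for a hypothetical new song
import pandas as pd
new_song = pd.DataFrame([[1.0, 210000, 0.30, 0.70, 0.60]],
                        columns=['const','Duration','Acousticness','Danceability','Energy'])
print(SongPop_LM.predict(new_song))   # predicted Popularity score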