
PROJECT

ON

FRA PROJECT (MILESTONE-1)

SUBMITTED BY:

ABHIJIT KUMAR KALITA


PROBLEM STATEMENT
Businesses or companies can fall prey to default if they are not able to keep up with their debt obligations. Defaults lead to a lower credit rating for the company, which in turn reduces its chances of getting credit in the future, and the company may have to pay higher interest on existing debts as well as on any new obligations. From an investor's point of view, one would want to invest in a company that is capable of handling its financial obligations, can grow quickly, and is able to manage its growth.

A balance sheet is a financial statement of a company that provides a snapshot of what a company owns,
owes, and the amount invested by the shareholders. Thus, it is an important tool that helps evaluate the
performance of a business.

The available data includes information from the companies' financial statements for the previous year (2015). Information about the net worth of each company in the following year (2016) is also provided, which can be used to derive the labeled field.

Hints :

Dependent variable - We need to create a default variable that takes the value 1 when net worth next year is negative and 0 when net worth is positive.

Test Train Split - Split the data into Train and Test datasets in a 67:33 ratio and use random_state = 42. Model building is to be done on the Train dataset and model validation on the Test dataset.

1.1) OUTLIER TREATMENT

Answer:

To treat the outliers we first need to identify them, which was done with per-column box plots using the code below:

import matplotlib.pyplot as plt
import seaborn as sns

col_names = list(Default.columns)
fig, ax = plt.subplots(len(col_names), figsize=(8, 100))
for i, col_val in enumerate(col_names):
    sns.boxplot(y=Default[col_val], ax=ax[i])            # one box plot per column
    ax[i].set_title('Box plot - {}'.format(col_val), fontsize=10)
    ax[i].set_xlabel(col_val, fontsize=8)
plt.show()
The outliers were then capped at the whisker bounds; a sketch of that step is shown below, followed by the per-column bounds printed after treatment.
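A minimal sketch of that capping step, assuming the usual 1.5 × IQR whisker rule (the cap_outliers helper is illustrative, not the report's exact code):

import numpy as np
import pandas as pd

def cap_outliers(df):
    # Cap every numeric column at its 1.5*IQR whiskers
    for col in df.select_dtypes(include=np.number).columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df

Default = cap_outliers(Default)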

Per-column whisker bounds printed after treatment, one row per column (abridged here; the trailing '<built-in function min> <built-in function max>' tokens in the original output show that the built-in min and max functions were printed without being called):

-17.445 3.985 123.8025 1978.8225
0.5 3.75 19.5175 131.24
-11.6975 3.8925 117.2975 1829.0825
... (one row for each of the remaining columns, down to)
0.0 0.0 0.0 1.0

1.2) MISSING VALUE TREATMENT

Answer:

I have found the per-column missing-value counts as below:

Co_Code 0
Co_Name 0
Networth_Next_Year 0
Equity_Paid_Up 0
Networth 0
Capital_Employed 0
Total_Debt 0
Gross_Block 0
Net_Working_Capital 0
Current_Assets 0
Current_Liabilities_and_Provisions 0
Total_Assets_by_Liabilities 0
Gross_Sales 0
Net_Sales 0
Other_Income 0
Value_Of_Output 0
Cost_of_Production 0
Selling_Cost 0
PBIDT 0
PBDT 0
PBIT 0
PBT 0
PAT 0
Adjusted_PAT 0
CP 0
Revenue_earnings_in_forex 0
Revenue_expenses_in_forex 0
Capital_expenses_in_forex 0
Book_Value_Unit_Curr 0
Book_Value_Adj._Unit_Curr 4
Market_Capitalisation 0
CEPS_annualised_Unit_Curr 0
Cash_Flow_From_Operating_Activities 0
Cash_Flow_From_Investing_Activities 0
Cash_Flow_From_Financing_Activities 0
ROG_Net_Worth_perc 0
ROG_Capital_Employed_perc 0
ROG_Gross_Block_perc 0
ROG_Gross_Sales_perc 0
ROG_Net_Sales_perc 0
ROG_Cost_of_Production_perc 0
ROG_Total_Assets_perc 0
ROG_PBIDT_perc 0
ROG_PBDT_perc 0
ROG_PBIT_perc 0
ROG_PBT_perc 0
ROG_PAT_perc 0
ROG_CP_perc 0
ROG_Revenue_earnings_in_forex_perc 0
ROG_Revenue_expenses_in_forex_perc 0
ROG_Market_Capitalisation_perc 0
Current_Ratio_Latest_ 1
Fixed_Assets_Ratio_Latest_ 1
Inventory_Ratio_Latest_ 1
Debtors_Ratio_Latest_ 1
Total_Asset_Turnover_Ratio_Latest_ 1
Interest_Cover_Ratio_Latest_ 1
PBIDTM_perc_Latest_ 1
PBITM_perc_Latest_ 1
PBDTM_perc_Latest_ 1
CPM_perc_Latest_ 1
APATM_perc_Latest_ 1
Debtors_Velocity_Days 0
Creditors_Velocity_Days 0
Inventory_Velocity_Days 103
Value_of_Output_by_Total_Assets 0
Value_of_Output_by_Gross_Block 0
default 0

The indices of the columns with missing values are as follows:

array([29, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 64], dtype=int64)

I have treated these missing values with the median; replacing with the median eliminates the impact of outliers. The code used:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Drop identifier columns that carry no predictive information
Default.drop('Co_Code', axis=1, inplace=True)
Default.drop('Co_Name', axis=1, inplace=True)

# Median-impute every remaining (numeric) column
col = list(Default)
imputer = SimpleImputer(missing_values=np.nan, strategy='median')
Default = pd.DataFrame(imputer.fit_transform(Default))
Default.columns = col
Default.head()
1.3 TRANSFORM TARGET VARIABLE INTO 0 AND 1

Answer:

We define the target variable as "default"; since no such target variable exists in the data set, it is created as described in the question from the existing variable "Networth_Next_Year", using the code below:


import numpy as np

# default = 1 when next year's net worth is negative, 0 when it is positive
conditions = [
    (Default['Networth_Next_Year'] < 0),
    (Default['Networth_Next_Year'] > 0)
]
values = [1, 0]   # integer labels rather than the strings '1'/'0'
Default['default'] = np.select(conditions, values)

1.4 UNIVARIATE (4 MARKS) & BIVARIATE (6 MARKS) ANALYSIS WITH PROPER INTERPRETATION.

ANSWER:

UNIVARIATE ANALYSIS:

The univariate analysis shows that most of the column variables have a right-skewed distribution, which indicates outliers on the right side, since the mean is greater than the median for these variables.

A few of these analyses, done using distplots and boxplots, are shown below (see the sketch that follows):
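A minimal sketch of these univariate plots (histplot replaces the deprecated seaborn distplot; the three columns shown are an illustrative subset):

import matplotlib.pyplot as plt
import seaborn as sns

for col in ['Networth_Next_Year', 'Equity_Paid_Up', 'Networth']:
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 3))
    sns.histplot(Default[col], kde=True, ax=ax1)  # distribution shape / skewness
    sns.boxplot(x=Default[col], ax=ax2)           # outliers beyond the whiskers
    fig.suptitle(col)
    plt.show()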

Distplots and boxplots were produced for NETWORTH_NEXT_YEAR, EQUITY_PAID_UP, NETWORTH, TOTAL_DEBT, GROSS_BLOCK, NET_WORKING_CAPITAL and CURRENT_ASSETS (plot images not reproduced here).
BIVARIATE ANALYSIS:

For the bivariate analysis I have used scatterplots, correlation, a heatmap and barplots between a few of the column variables, as follows (see the sketch after the headings below):

- A scatterplot between Gross_Sales and Net_Sales shows a direct relationship between them.

- A scatterplot between Networth_Next_Year and Networth shows a direct relationship between them.

- A scatterplot between Cost_of_Production and Selling_Cost does not show a direct relationship between them.
CORRELATION OF ALL COLUMN VARIABLES OF THE DATA SET (IMAGE IS NOT COMPLETE DUE TO SIZE CONSTRAINTS):

HEATMAP OF THE ENTIRE DATA SET:

- BARPLOT BETWEEN CURRENT_ASSETS AND TOTAL_ASSETS_BY_LIABILITIES:

- BARPLOT BETWEEN TOTAL_DEBT AND NET_WORKING_CAPITAL:
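A minimal sketch of how these bivariate plots can be produced, assuming the imputed Default frame from section 1.2 (figure sizes and column pairs are illustrative):

import matplotlib.pyplot as plt
import seaborn as sns

# Scatterplot for a pair of related columns
sns.scatterplot(x='Gross_Sales', y='Net_Sales', data=Default)
plt.show()

# Correlation heatmap over all numeric columns
plt.figure(figsize=(20, 20))
sns.heatmap(Default.corr(), cmap='coolwarm')
plt.show()

# Barplot between two columns (as in the report; works best after binning x)
sns.barplot(x='Current_Assets', y='Total_Assets_by_Liabilities', data=Default)
plt.show()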

1.5 TRAIN TEST SPLIT


Answer:

I have split the data in a 67:33 ratio with random_state = 42, using the code below for the entire operation:

import pandas as pd
from sklearn.model_selection import train_test_split

X = Default.drop(['default', 'Networth_Next_Year'], axis=1)
y = Default['default']

# 67:33 stratified split, reproducible via random_state=42
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=Default['default'])

Default_train = pd.concat([X_train, y_train], axis=1)
Default_test = pd.concat([X_test, y_test], axis=1)

Default_train.to_csv('Default_train.csv', index=False)
Default_test.to_csv('Default_test.csv', index=False)
The Default_train columns are:

Index(['Equity_Paid_Up', 'Networth', 'Capital_Employed', 'Total_Debt',


'Gross_Block', 'Net_Working_Capital', 'Current_Assets',
'Current_Liabilities_and_Provisions', 'Total_Assets_by_Liabilities',
'Gross_Sales', 'Net_Sales', 'Other_Income', 'Value_Of_Output',
'Cost_of_Production', 'Selling_Cost', 'PBIDT', 'PBDT', 'PBIT', 'PBT',
'PAT', 'Adjusted_PAT', 'CP', 'Revenue_earnings_in_forex',
'Revenue_expenses_in_forex', 'Capital_expenses_in_forex',
'Book_Value_Unit_Curr', 'Book_Value_Adj._Unit_Curr',
'Market_Capitalisation', 'CEPS_annualised_Unit_Curr',
'Cash_Flow_From_Operating_Activities',
'Cash_Flow_From_Investing_Activities',
'Cash_Flow_From_Financing_Activities', 'ROG_Net_Worth_perc',
'ROG_Capital_Employed_perc', 'ROG_Gross_Block_perc',
'ROG_Gross_Sales_perc', 'ROG_Net_Sales_perc',
'ROG_Cost_of_Production_perc', 'ROG_Total_Assets_perc',
'ROG_PBIDT_perc', 'ROG_PBDT_perc', 'ROG_PBIT_perc', 'ROG_PBT_perc',
'ROG_PAT_perc', 'ROG_CP_perc', 'ROG_Revenue_earnings_in_forex_perc',
'ROG_Revenue_expenses_in_forex_perc', 'ROG_Market_Capitalisation_perc',
'Current_Ratio_Latest_', 'Fixed_Assets_Ratio_Latest_',
'Inventory_Ratio_Latest_', 'Debtors_Ratio_Latest_',
'Total_Asset_Turnover_Ratio_Latest_', 'Interest_Cover_Ratio_Latest_',
'PBIDTM_perc_Latest_', 'PBITM_perc_Latest_', 'PBDTM_perc_Latest_',
'CPM_perc_Latest_', 'APATM_perc_Latest_', 'Debtors_Velocity_Days',
'Creditors_Velocity_Days', 'Inventory_Velocity_Days',
'Value_of_Output_by_Total_Assets', 'Value_of_Output_by_Gross_Block',
'default'], dtype='object')
Before proceeding to model building I have checked for multicollinearity. Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model; it is checked here using VIF scores, with the code below:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calc_vif(X):
    # Variance Inflation Factor for every column of the predictor frame X
    vif = pd.DataFrame()
    vif["variables"] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif
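For example, the function can be applied to the predictor frame X defined in the split step:

vif_scores = calc_vif(X)
print(vif_scores.sort_values('VIF', ascending=False).head(10))  # worst offenders first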

The VIF scores of a few of the columns are given below:


Now I have created the logistic regression model on the training data:

f_1 = ('default ~ ROG_Revenue_earnings_in_forex_perc + ROG_Revenue_expenses_in_forex_perc + '
       'ROG_Gross_Block_perc + Current_Ratio_Latest_ + ROG_Market_Capitalisation_perc + '
       'Creditors_Velocity_Days + Inventory_Ratio_Latest_ + Inventory_Velocity_Days + '
       'Debtors_Velocity_Days + Debtors_Ratio_Latest_ + Interest_Cover_Ratio_Latest_ + '
       'ROG_Cost_of_Production_perc + ROG_Net_Worth_perc + Cash_Flow_From_Financing_Activities + '
       'Revenue_earnings_in_forex + Capital_expenses_in_forex + Equity_Paid_Up + Selling_Cost + '
       'Other_Income + Revenue_expenses_in_forex + Cash_Flow_From_Investing_Activities + '
       'Market_Capitalisation + ROG_Total_Assets_perc + ROG_Capital_Employed_perc + '
       'CEPS_annualised_Unit_Curr + Total_Debt + Net_Working_Capital')

model_1 = SM.logit(formula=f_1, data=Default).fit()   # SM: the statsmodels.formula.api alias

The model's adjusted pseudo R-squared value is 0.3017.

The adjusted pseudo R-squared is lower than the pseudo R-squared, which means insignificant variables are present in the model. Variables whose p-value is greater than 0.05 were therefore removed and the model rebuilt using

f_2 = ('default ~ ROG_Gross_Block_perc + Current_Ratio_Latest_ + '
       'Creditors_Velocity_Days + Inventory_Ratio_Latest_ + Inventory_Velocity_Days + '
       'Debtors_Velocity_Days + Debtors_Ratio_Latest_ + Interest_Cover_Ratio_Latest_ + '
       'ROG_Cost_of_Production_perc + ROG_Net_Worth_perc + '
       'Cash_Flow_From_Investing_Activities + Market_Capitalisation + '
       'CEPS_annualised_Unit_Curr + Total_Debt + Net_Working_Capital')

The new adjusted pseudo R-squared value is 0.3079. The adjusted pseudo R-squared is now close to the pseudo R-squared, suggesting fewer insignificant variables in the model; the current model has no insignificant variables and can be used for prediction.

Now let us test the predictions of this model on the train and test datasets.

The boxplot of the target variable against the train data is:

From the above boxplot, we need to decide on a cut-off value that gives the most reasonable descriptive power for the model. Let us take a cut-off of 0.07 and check.

Let us now look at the predicted classes and check the accuracy of the model using the confusion matrix for the training set; a sketch of this step follows.
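A minimal sketch of this step, assuming the rebuilt statsmodels model is named model_2 (an illustrative name) and the 0.07 cut-off chosen above:

from sklearn.metrics import confusion_matrix, classification_report

# Predicted default probabilities from the fitted logit model
train_prob = model_2.predict(Default_train)
# Classify as default (1) when the probability exceeds the 0.07 cut-off
train_pred = (train_prob > 0.07).astype(int)

print(confusion_matrix(Default_train['default'], train_pred))
print(classification_report(Default_train['default'], train_pred))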

Finally, the classification report on the training data is found to be:

So the accuracy of the model, i.e. the percentage of overall correct predictions, is 69%.

The sensitivity of the model is 90%, i.e. 90% of those who defaulted were correctly identified as defaulters by the model.

Now let us check the predictions on the test set.

Checking the accuracy of the model using the confusion matrix for the test set:

The classification report for the test data is given below:


1.6 BUILD LOGISTIC REGRESSION MODEL (USING STATSMODEL LIBRARY) ON MOST
IMPORTANT VARIABLES ON TRAIN DATASET AND CHOOSE THE OPTIMUM CUTOFF.

Answer:
After encoding the data, we converted 'no' to 0 and 'yes' to 1, and likewise for 'foreign'.

feature: vote

[Labour, Conservative]

Categories (2, object): [Conservative, Labour]

[1 0]

feature: gender

[female, male]

Categories (2, object): [female, male]

[0 1]

Scaling is not necessary for the given data set: except for age, all variables are categorical, and scaling is only needed when the variables are on different measurement scales, which is not the case here. The one continuous variable, age, can also be converted into a categorical variable so that it can be analysed in line with the other variables. Hence we have not performed scaling.

In the splitting process we create two buckets, one with the independent variables and one with the dependent variable.

Importing train_test_split from sklearn.model_selection, we divide the data into training and testing sets, which gives us X_train, X_test, train_labels and test_labels; these are used to build the models and to evaluate their performance.

The code used is:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1)

Here the data set is divided randomly into 70% training data and 30% testing data. The training data is used to fit the model, while the testing data is held out as unseen data so that we get a clear idea of how accurately the model predicts outputs it has not seen.

1.4) Apply Logistic Regression and LDA (Linear Discriminant Analysis) (3 pts). Interpret the inferences of both models (2 pts)
Answer:

Logistic Regression Model:

We have created the Logistic Regression model using

LogisticRegression(solver='newton-cg', max_iter=10000, penalty='none',
                   verbose=True, n_jobs=2, fit_intercept=True)

and fit the model with X_train and y_train.

The Classification report for Train data of Logistic Regression Model is –

precision recall f1-score support


0 0.74 0.64 0.69 307
1 0.86 0.91 0.88 754
accuracy 0.83 1061
macro avg 0.80 0.77 0.79 1061
weighted avg 0.83 0.83 0.83 1061

The Classification report for Test data of Logistic Regression Model is -

precision recall f1-score support


0 0.76 0.74 0.72 153
1 0.87 0.88 0.88 303
accuracy 0.84 456
macro avg 0.82 0.81 0.81 456
weighted avg 0.83 0.84 0.83 456

Linear Discriminant Analysis Model (LDA):

We have created the Linear Discriminant Analysis (LDA) model and fit it with X_train and y_train; a sketch of this fit-and-report pattern follows.
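A minimal sketch of the fit-and-report pattern, shown here for LDA; the same three lines apply to the KNN and GaussianNB models in the next section:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report

lda = LinearDiscriminantAnalysis()
lda.fit(X_train, y_train)                                    # fit on the training split
print(classification_report(y_train, lda.predict(X_train)))  # train report
print(classification_report(y_test, lda.predict(X_test)))    # test report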
The Classification report for Train data of LDA Model is -

precision recall f1-score support


0 0.74 0.65 0.69 322
1 0.86 0.91 0.89 739
accuracy 0.83 1061
macro avg 0.80 0.78 0.79 1061
weighted avg 0.83 0.82 0.83 1061

The Classification report for Test data of LDA Model is -

precision recall f1-score support


0 0.77 0.73 0.74 153
1 0.86 0.89 0.88 303
accuracy 0.83 456
macro avg 0.82 0.81 0.81 456
weighted avg 0.83 0.83 0.83 456

Interpret the inferences of both models:

From the Logistic and LDA results, the accuracy score on test data is 0.84 for Logistic and 0.83 for LDA. The f1-score for voters who vote Labour is 0.75 for Logistic and 0.74 for LDA, and for voters who vote Conservative it is 0.88 for both. So the LDA and Logistic models give similar results.

1.5) Apply KNN Model and Naïve Bayes Model(5 pts). Interpret the inferences of each model (2
pts)

Answer:

KNN:
We have created the KNN model using KNeighborsClassifier(n_neighbors=5, weights='distance') and fit the model with X_train and y_train.

The Classification report for Train data of KNN Model is –

precision recall f1-score support


0 0.79 0.66 0.72 307
1 0.87 0.93 0.90 754
accuracy 0.85 1061
macro avg 0.83 0.80 0.81 1061
weighted avg 0.85 0.85 0.85 1061

The Classification report for Test data of KNN Model is -

precision recall f1-score support


0 0.77 0.65 0.70 153
1 0.83 0.90 0.87 303
accuracy 0.82 456
macro avg 0.80 0.77 0.78 456
weighted avg 0.81 0.82 0.81 456

Naïve Bayes(GaussianNB):

We have created the GaussianNB Model and fit the model with X_train and y_train.

The Classification report for Train data of GaussianNB Model is -

precision recall f1-score support


0 0.73 0.69 0.71 307
1 0.88 0.90 0.89 754
accuracy 0.84 1061
macro avg 0.80 0.79 0.80 1061
weighted avg 0.83 0.84 0.83 1061

The Classification report for Test data of GaussianNB Model is -

precision recall f1-score support


0 0.74 0.73 0.73 153
1 0.87 0.87 0.87 303
accuracy 0.82 456
macro avg 0.80 0.80 0.80 456
weighted avg 0.82 0.82 0.82 456

Interpret the inferences of both models:

From the KNN and GaussianNB results, the accuracy score on test data is 0.82 for KNN and 0.82 for GaussianNB. The f1-score for voters who vote Labour is 0.70 for KNN and 0.73 for GaussianNB, and for voters who vote Conservative it is 0.87 for both. So the GaussianNB model gives a slightly better result than the KNN model.

1.6) Model Tuning (2 pts), Bagging (2.5 pts) and Boosting (2.5 pts).

Answer:

For Bagging and Boosting I have done the analysis both with and without model tuning for each algorithm.

BAGGING (RANDOM FOREST) WITH MODEL TUNING:


After tuning, I found the following parameters to be the best for the Bagging model:

{'max_depth': 5, 'max_features': 4, 'min_samples_leaf': 10, 'min_samples_split': 50, 'n_estimators': 300}

Using these parameters I created the Bagging model with a Random Forest classifier,

RandomForestClassifier(max_depth=5, max_features=4, min_samples_leaf=10,
                       min_samples_split=50, n_estimators=300, random_state=1)

and fit the model with X_train and y_train. A sketch of the tuning step follows.
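A minimal sketch of how such best parameters are typically found with a grid search (the grid values are illustrative):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [4, 5, 6],
    'max_features': [3, 4, 5],
    'min_samples_leaf': [10, 25],
    'min_samples_split': [30, 50],
    'n_estimators': [100, 300],
}
grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)   # e.g. the parameter set reported above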

The Classification report for Train data of Tuned BAGGING Model is –

precision recall f1-score support


0 0.79 0.68 0.73 307
1 0.88 0.93 0.90 754
accuracy 0.85 1061
macro avg 0.83 0.80 0.82 1061
weighted avg 0.85 0.85 0.85 1061

The Classification report for Test data of Tuned BAGGING Model is -

precision recall f1-score support


0 0.79 0.67 0.73 153
1 0.85 0.91 0.88 303
accuracy 0.83 456
macro avg 0.82 0.79 0.80 456
weighted avg 0.83 0.83 0.83 456

BAGGING (RANDOM FOREST) WITHOUT MODEL TUNING:

We have created the Bagging model using RandomForestClassifier(n_estimators=50, random_state=1, max_features=5) and fit the model with X_train and y_train.

The Classification report for Train data of BAGGING Model is –

precision recall f1-score support


0 1 1 1 307
1 1 1 1 754
accuracy 1 1061
macro avg 1 1 1 1061
weighted avg 1 1 1 1061

The Classification report for Test data of BAGGING Model is -

precision recall f1-score support


0 0.76 0.71 0.73 153
1 0.86 0.89 0.87 303
accuracy 0.83 456
macro avg 0.81 0.80 0.80 456
weighted avg 0.82 0.83 0.83 456

The above model clearly shows an over-fitting issue: the train scores are perfect while the test scores are much lower. Hence, for Bagging, model tuning gives more reliable results than Bagging without model tuning.

BOOSTING WITH MODEL TUNING:

After tuning, I found the following parameters to be the best for the Boosting model:

{'algorithm': 'SAMME.R', 'learning_rate': 0.1, 'n_estimators': 70}

Using these parameters I created the Boosting model with AdaBoostClassifier and fit the model with X_train and y_train.
The Classification report for Train data of Tuned Boosting Model is -

precision recall f1-score support


0 0.77 0.58 0.66 307
1 0.84 0.93 0.89 754
accuracy 0.83 1061
macro avg 0.81 0.76 0.77 1061
weighted avg 0.82 0.83 0.82 1061

The Classification report for Test data of Tuned Boosting Model is -

precision recall f1-score support


0 0.75 0.61 0.67 153
1 0.82 0.90 0.86 303
accuracy 0.80 456
macro avg 0.78 0.75 0.76 456
weighted avg 0.80 0.80 0.79 456

BOOSTING WITHOUT MODEL TUNING:

We have created the Boosting model using AdaBoostClassifier(n_estimators=5, random_state=1) and fit the model with X_train and y_train.

The Classification report for Train data of BOOSTING Model is -

precision recall f1-score support


0 0.72 0.71 0.71 307
1 0.88 0.89 0.88 754
accuracy 0.84 1061
macro avg 0.80 0.80 0.80 1061
weighted avg 0.84 0.84 0.84 1061

The Classification report for Test data of BOOSTING Model is -

precision recall f1-score support


0 0.67 0.68 0.67 153
1 0.84 0.83 0.83 303
accuracy 0.78 456
macro avg 0.75 0.75 0.75 456
weighted avg 0.78 0.78 0.78 456

Interpret the inferences of both models:

From untuned Bagging and Boosting, the accuracy score on test data is 83 for Bagging and 78 for Boosting. The f1-score for voters who vote Labour is 73 for Bagging and 67 for Boosting, and for voters who vote Conservative it is 87 for Bagging and 83 for Boosting. So the untuned Bagging model gives better results than the untuned Boosting model.

From tuned Bagging and tuned Boosting, the accuracy score on test data is 83 for Bagging and 80 for Boosting. The f1-score for voters who vote Labour is 73 for Bagging and 67 for Boosting, and for voters who vote Conservative it is 88 for Bagging and 86 for Boosting. So the tuned Bagging model also gives better results than the tuned Boosting model.

Finally, I can say that Bagging, both with and without tuning, gives better predictions than Boosting with and without tuning.

1.7 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model (4 pts) Final
Model - Compare all models on the basis of the performance metrics in a structured tabular
manner. Describe on which model is best/optimized (3 pts)
Answer :

Performance of predictions on the train set:

Model                       Accuracy   Confusion matrix         ROC_AUC score
Logistic Regression         83         [[196 111] [ 68 686]]    89.00
LDA                         83         [[200 107] [ 69 685]]    88.90
KNN                         85         [[204 103] [ 53 701]]    92.30
Naïve Bayes                 82         [[211  96] [ 79 675]]    88.80
Bagging (without tuning)    100        [[307   0] [  0 754]]    100
Bagging (with tuning)       85         [[209  98] [ 56 698]]    91.23
Boosting (without tuning)   84         [[218  89] [ 85 669]]    87.80
Boosting (with tuning)      83         [[178 129] [ 52 702]]    89.80

(ROC curve plots not reproduced here; a sketch of how they are generated follows the test-set table.)

Performance of Predictions on Test Sets:

Model                       Accuracy   Confusion matrix         ROC_AUC score
Logistic Regression         84         [[113  40] [ 35 268]]    88.30
LDA                         83         [[111  42] [ 34 269]]    88.80
KNN                         82         [[ 99  54] [ 30 273]]    85.20
Naïve Bayes                 82         [[112  41] [ 40 263]]    87.60
Bagging (without tuning)    83         [[108  45] [ 34 269]]    88.50
Bagging (with tuning)       83         [[103  50] [ 28 275]]    89.05
Boosting (without tuning)   81         [[185  49] [ 52 251]]    85.10
Boosting (with tuning)      80         [[ 93  60] [ 31 272]]    88.10

Combined ROC AUC curves for the training data for all models:
Combined ROC AUC curves for the testing data for all models:
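A minimal sketch of how each ROC curve and AUC score can be produced (model stands for any of the fitted classifiers above):

import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

probs = model.predict_proba(X_test)[:, 1]        # probability of class 1
fpr, tpr, _ = roc_curve(y_test, probs)
plt.plot(fpr, tpr, label='AUC = {:.3f}'.format(roc_auc_score(y_test, probs)))
plt.plot([0, 1], [0, 1], linestyle='--')         # chance line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()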

In the training data set:

The accuracy scores on train data are: Logistic 83, LDA 83, KNN 85, Naïve Bayes 82, Bagging (RF) 100, Boosting 84, Bagging (tuned) 85 and Boosting (tuned) 83. The f1-scores for voters who vote Labour / Conservative are: Logistic 69/88, LDA 69/89, KNN 72/90, Naïve Bayes 71/89, Bagging 100/100, Boosting 71/88, Bagging (tuned) 73/90 and Boosting (tuned) 66/89.

In the testing data set:

The accuracy scores on test data are: Logistic 84, LDA 83, KNN 82, Naïve Bayes 82, Bagging (RF) 83 and Boosting 78. The f1-scores for voters who vote Labour / Conservative are: Logistic 75/88, LDA 74/88, KNN 70/87, Naïve Bayes 73/87, Bagging 73/87, Boosting 67/83, Bagging (tuned) 73/88 and Boosting (tuned) 67/86.

Hence, looking at all these parameters, the tuned Bagging model gives the best/most optimized predictions of voters voting for the Conservative party.

1.8) Based on your analysis and working on the business problem, detail out appropriate insights
and recommendations to help the management solve the business objective.
Answer:

The data set covers a group of 1525 voters and whether they voted for the Labour Party or the Conservative Party in the recent elections, so that the news agency can create an exit poll to help predict the overall win and the seats covered by a particular party. For this prediction we are given 8 factors: age of the voter, assessment of current national economic conditions, assessment of current household economic conditions, Labour leader rating, Conservative leader rating, the voter's attitude toward European integration, knowledge of the parties' positions on European integration, and gender.

The mean age of the voters is close to 54, meaning it is a mixed group of younger and older voters. The mean assessments of current national and household economic conditions are close to 3 and 3.5 respectively, which means the economic conditions of the nation and the household sit at an average level between surplus and deficit; a party therefore needs to focus its election manifesto on future plans for developing the national economy and on improving the household economy. Blair is the assessment score for the Labour party leader: the average rating of 3.33 indicates that the popularity of the Labour leader is average. Hague is the assessment score for the Conservative party leader: the average rating of 2.74 indicates that the popularity of the Conservative leader is below average. On the 11-point Europe scale the average value is 6.75, meaning not all voters have Eurosceptic sentiment. Political knowledge also plays a vital role: the voter group has a mean of 1.54 on a scale of 0 to 3, so it is a mixed group in terms of political knowledge. Finally, the gender split is 808 females and 709 males, so women have the major share as a deciding factor in the election results.

The tuned Bagging model, applied to the 70:30 train/test split, gives accuracy scores of 85 on training data and 83 on test data. For both train and test data the f1-score for voters who vote Labour is 73; for voters who vote Conservative it is 90 on training data and 88 on test data. The AUC score is 91.23 for the training set and 89.05 for the test set. Therefore the tuned Bagging model is the best-suited model for predicting voters who will vote for the Conservative party.

PROBLEM 2:
In this particular project, we are going to work on the inaugural corpora from the nltk in Python.
We will be looking at the following speeches of the Presidents of the United States of America:

President Franklin D. Roosevelt in 1941

President John F. Kennedy in 1961

President Richard Nixon in 1973

2.1) Find the number of characters, words and sentences for the mentioned documents. (Hint: use .words(), .raw(), .sents() for extracting the counts.)

Answer:

For the 1941-Roosevelt speech:

- The number of words in the Roosevelt speech is 1536
- The number of sentences in the Roosevelt speech is 68
- The number of characters in the Roosevelt speech is 7571

For the 1961-Kennedy speech:

- The number of words in the Kennedy speech is 1546
- The number of sentences in the Kennedy speech is 52
- The number of characters in the Kennedy speech is 7618

For the 1973-Nixon speech:

- The number of words in the Nixon speech is 2028
- The number of sentences in the Nixon speech is 69
- The number of characters in the Nixon speech is 9991
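A minimal sketch of how these counts can be obtained from the NLTK inaugural corpus:

import nltk
from nltk.corpus import inaugural
nltk.download('inaugural')

for fileid in ['1941-Roosevelt.txt', '1961-Kennedy.txt', '1973-Nixon.txt']:
    print(fileid,
          '| characters:', len(inaugural.raw(fileid)),
          '| words:', len(inaugural.words(fileid)),
          '| sentences:', len(inaugural.sents(fileid)))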

2.2) Remove all the stopwords from the three speeches.

Answer:
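A minimal sketch of the cleaning step, assuming NLTK's English stopword list plus Porter stemming (the Roosevelt and Kennedy outputs below appear stemmed, while the Nixon output does not):

import nltk
from nltk.corpus import inaugural, stopwords
from nltk.stem import PorterStemmer
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Lowercase, keep alphabetic tokens, drop stopwords, then stem
words = [w.lower() for w in inaugural.words('1941-Roosevelt.txt') if w.isalpha()]
cleaned = [stemmer.stem(w) for w in words if w not in stop_words]
print(' '.join(cleaned))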
Roosevelt Speech after Removing Stopwords is below –

'nation day inaugur sinc peopl renew sens dedic unit state washington day task peopl creat weld
togeth nation lincoln day task peopl preserv nation disrupt within day task peopl save nation
institut disrupt without us come time midst swift happen paus moment take stock recal place
histori rediscov may risk real peril inact live nation determin count year lifetim human spirit life
man three score year ten littl littl less life nation full measur live men doubt men believ democraci
form govern frame life limit measur kind mystic artifici fate unexplain reason tyranni slaveri
becom surg wave futur freedom eb tide american know true eight year ago life republ seem
frozen fatalist terror prove true midst shock act act quickli boldli decis later year live year fruit
year peopl democraci brought us greater secur hope better understand life ideal measur materi
thing vital present futur experi democraci success surviv crisi home put away mani evil thing
built new structur endur line maintain fact democraci action taken within three way framework
constitut unit state coordin branch govern continu freeli function bill right remain inviol freedom
elect wholli maintain prophet downfal american democraci seen dire predict come naught
democraci die know seen reviv grow know cannot die built unhamp initi individu men women
join togeth common enterpris enterpris undertaken carri free express free major know democraci
alon form govern enlist full forc men enlighten know democraci alon construct unlimit civil
capabl infinit progress improv human life know look surfac sens still spread everi contin human
advanc end unconquer form human societi nation like person bodi bodi must fed cloth hous
invigor rest manner measur object time nation like person mind mind must kept inform alert must
know understand hope need neighbor nation live within narrow circl world nation like person
someth deeper someth perman someth larger sum part someth matter futur call forth sacr guard
present thing find difficult even imposs hit upon singl simpl word yet understand spirit faith
america product centuri born multitud came mani land high degre mostli plain peopl sought earli
late find freedom freeli democrat aspir mere recent phase human histori human histori permeat
ancient life earli peopl blaze anew middl age written magna charta america impact irresist
america new world tongu peopl contin new found land came believ could creat upon contin new
life life new freedom vital written mayflow compact declar independ constitut unit state
gettysburg address first came carri long spirit million follow stock sprang move forward
constantli consist toward ideal gain statur clariti gener hope republ cannot forev toler either
undeserv poverti self serv wealth know still far go must greatli build secur opportun knowledg
everi citizen measur justifi resourc capac land enough achiev purpos alon enough cloth feed bodi
nation instruct inform mind also spirit three greatest spirit without bodi mind men know nation
could live spirit america kill even though nation bodi mind constrict alien world live america
know would perish spirit faith speak us daili live way often unnot seem obviou speak us capit
nation speak us process govern sovereignti state speak us counti citi town villag speak us nation
hemispher across sea enslav well free sometim fail hear heed voic freedom us privileg freedom
old old stori destini america proclaim word propheci spoken first presid first inaugur word almost
direct would seem year preserv sacr fire liberti destini republican model govern justli consid
deepli final stake experi intrust hand american peopl lose sacr fire let smother doubt fear shall
reject destini washington strove valiantli triumphantli establish preserv spirit faith nation furnish
highest justif everi sacrific may make caus nation defens face great peril never encount strong
purpos protect perpetu integr democraci muster spirit america faith america retreat content stand
still american go forward servic countri god'

Kennedy Speech after Removing Stopwords is below –

'vice presid johnson mr speaker mr chief justic presid eisenhow vice presid nixon presid truman
reverend clergi fellow citizen observ today victori parti celebr freedom symbol end well begin
signifi renew well chang sworn almighti god solemn oath forebear l prescrib nearli centuri three
quarter ago world differ man hold mortal hand power abolish form human poverti form human
life yet revolutionari belief forebear fought still issu around globe belief right man come generos
state hand god dare forget today heir first revolut let word go forth time place friend foe alik torch
pass new gener american born centuri temper war disciplin hard bitter peac proud ancient heritag
unwil wit permit slow undo human right nation alway commit commit today home around world
let everi nation know whether wish us well ill shall pay price bear burden meet hardship support
friend oppos foe order assur surviv success liberti much pledg old alli whose cultur spiritu origin
share pledg loyalti faith friend unit littl cannot host cooper ventur divid littl dare meet power
challeng odd split asund new state welcom rank free pledg word one form coloni control shall
pass away mere replac far iron tyranni shall alway expect find support view shall alway hope find
strongli support freedom rememb past foolishli sought power ride back tiger end insid peopl hut
villag across globe struggl break bond mass miseri pledg best effort help help whatev period
requir communist may seek vote right free societi cannot help mani poor cannot save rich sister
republ south border offer special pledg convert good word good deed new allianc progress assist
free men free govern cast chain poverti peac revolut hope cannot becom prey hostil power let
neighbor know shall join oppos aggress subvers anywher america let everi power know
hemispher intend remain master hous world assembl sovereign state unit nation last best hope age
instrument war far outpac instrument peac renew pledg support prevent becom mere forum invect
strengthen shield new weak enlarg area writ may run final nation would make adversari offer
pledg request side begin anew quest peac dark power destruct unleash scienc engulf human plan
accident self destruct dare tempt weak arm suffici beyond doubt certain beyond doubt never
employ neither two great power group nation take comfort present cours side overburden cost
modern weapon rightli alarm steadi spread deadli atom yet race alter uncertain balanc terror stay
hand mankind final war let us begin anew rememb side civil sign weak sincer alway subject proof
let us never negoti fear let us never fear negoti let side explor problem unit us instead belabor
problem divid us let side first time formul seriou precis propos inspect control arm bring absolut
power destroy nation absolut control nation let side seek invok wonder scienc instead terror
togeth let us explor star conquer desert erad diseas tap ocean depth encourag art commerc let side
unit heed corner earth command isaiah undo heavi burden let oppress go free beachhead cooper
may push back jungl suspicion let side join creat new endeavor new balanc power new world law
strong weak secur peac preserv finish first day finish first day life administr even perhap lifetim
planet let us begin hand fellow citizen mine rest final success failur cours sinc countri found
gener american summon give testimoni nation loyalti grave young american answer call servic
surround globe trumpet summon us call bear arm though arm need call battl though embattl call
bear burden long twilight struggl year year rejoic hope patient tribul struggl common enemi man
tyranni poverti diseas war forg enemi grand global allianc north south east west assur fruit life
mankind join histor effort long histori world gener grant role defend freedom hour maximum
danger shrink respons welcom believ us would exchang place peopl gener energi faith devot
bring endeavor light countri serv glow fire truli light world fellow american ask countri ask
countri fellow citizen world ask america togeth freedom man final whether citizen america citizen
world ask us high standard strength sacrific ask good conscienc sure reward histori final judg
deed let us go forth lead land love ask bless help know earth god work must truli'

Nixon Speech after Removing Stopwords is below –

'mr vice president mr speaker mr chief justice senator cook mrs eisenhower fellow citizens great
good country share together met four years ago america bleak spirit depressed prospect seemingly
endless war abroad destructive conflict home meet today stand threshold new era peace world
central question us shall use peace let us resolve era enter postwar periods often time retreat
isolation leads stagnation home invites new danger abroad let us resolve become time great
responsibilities greatly borne renew spirit promise america enter third century nation past year
saw far reaching results new policies peace continuing revitalize traditional friendships missions
peking moscow able establish base new durable pattern relationships among nations world
america bold initiatives long remembered year greatest progress since end world war ii toward
lasting peace world peace seek world flimsy peace merely interlude wars peace endure
generations come important understand necessity limitations america role maintaining peace
unless america work preserve peace peace unless america work preserve freedom freedom let us
clearly understand new nature america role result new policies adopted past four years shall
respect treaty commitments shall support vigorously principle country right impose rule another
force shall continue era negotiation work limitation nuclear arms reduce danger confrontation
great powers shall share defending peace freedom world shall expect others share time passed
america make every nation conflict make every nation future responsibility presume tell people
nations manage affairs respect right nation determine future also recognize responsibility nation
secure future america role indispensable preserving world peace nation role indispensable
preserving peace together rest world let us resolve move forward beginnings made let us continue
bring walls hostility divided world long build place bridges understanding despite profound
differences systems government people world friends let us build structure peace world weak safe
strong respects right live different system would influence others strength ideas force arms let us
accept high responsibility burden gladly gladly chance build peace noblest endeavor nation
engage gladly also act greatly meeting responsibilities abroad remain great nation remain great
nation act greatly meeting challenges home chance today ever history make life better america
ensure better education better health better housing better transportation cleaner environment
restore respect law make communities livable insure god given right every american full equal
opportunity range needs great reach opportunities great let us bold determination meet needs new
ways building structure peace abroad required turning away old policies failed building new era
progress home requires turning away old policies failed abroad shift old policies new retreat
responsibilities better way peace home shift old policies new retreat responsibilities better way
progress abroad home key new responsibilities lies placing division responsibility lived long
consequences attempting gather power responsibility washington abroad home time come turn
away condescending policies paternalism washington knows best person expected act responsibly
responsibility human nature let us encourage individuals home nations abroad decide let us locate
responsibility places let us measure others today offer promise purely governmental solution
every problem lived long false promise trusting much government asked deliver leads inflated
expectations reduced individual effort disappointment frustration erode confidence government
people government must learn take less people people let us remember america built government
people welfare work shirking responsibility seeking responsibility lives let us ask government
challenges face together let us ask government help help national government great vital role play
pledge government act act boldly lead boldly important role every one us must play individual
member community day forward let us make solemn commitment heart bear responsibility part
live ideals together see dawn new age progress america together celebrate th anniversary nation
proud fulfillment promise world america longest difficult war comes end let us learn debate
differences civility decency let us reach one precious quality government cannot provide new
level respect rights feelings one another new level respect individual human dignity cherished
birthright every american else time come us renew faith america recent years faith challenged
children taught ashamed country ashamed parents ashamed america record home role world every
turn beset find everything wrong america little right confident judgment history remarkable times
privileged live america record century unparalleled world history responsibility generosity
creativity progress let us proud system produced provided freedom abundance widely shared
system history world let us proud four wars engaged century including one bringing end fought
selfish advantage help others resist aggression let us proud bold new initiatives steadfastness
peace honor made break toward creating world world known structure peace last merely time
generations come embarking today era presents challenges great nation generation ever faced
shall answer god history conscience way use years stand place hallowed history think others
stood think dreams america think recognized needed help far beyond order make dreams come
true today ask prayers years ahead may god help making decisions right america pray help
together may worthy challenge let us pledge together make next four years best four years
america history th birthday america young vital began bright beacon hope world let us go forward
confident hope strong faith one another sustained faith god created us striving always serve
purpose'

2.3) Which word occurs the most number of times in his inaugural address for each president?
Mention the top three words. (after removing the stopwords)

Answer:
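The counts can be read from a frequency distribution; a minimal sketch, where cleaned is the stopword-free token list from question 2.2:

from nltk import FreqDist

freq = FreqDist(cleaned)
print(freq.most_common(3))   # the top three words with their counts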
Roosevelt:

The word that occurs most often in Roosevelt's inaugural address is:

- nation, 17 times

The top three words appearing most often in Roosevelt's inaugural address are:

- nation (17 times), know (10 times) and people (9 times)

Kennedy:

The word that occurs most often in Kennedy's inaugural address is:

- let, 16 times

The top three words appearing most often in Kennedy's inaugural address are:

- let (16 times), us (12 times) and power (9 times)

Nixon:

The word that occurs most often in Nixon's inaugural address is:

- us, 26 times

The top three words appearing most often in Nixon's inaugural address are:

- us (26 times), let (22 times) and america (21 times)

2.4) Plot the word cloud of each of the three speeches. (after removing the stopwords)
Answer:
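A minimal sketch using the wordcloud package, where cleaned_text is the stopword-free speech string from question 2.2:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

wc = WordCloud(width=800, height=400, background_color='white').generate(cleaned_text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()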
Roosevelt Speech Word Cloud:
Kennedy Speech Word Cloud:
Nixon Speech Word Cloud:
