ML-1 Project

(Coded)
DSBA

By:
E. AuroRajashri

List of Content
Clustering - Part 1
1.1 Clustering: Define the problem and perform Exploratory Data Analysis
1.1.1 Problem definition
1.1.2 Check shape, Data types, statistical summary
1.1.3 Univariate analysis and Bivariate analysis. Key meaningful observations on individual
variables and the relationship between variables
1.2 Clustering: Data Preprocessing
1.2.1 Missing value check and treatment
1.2.2 Outlier Treatment
1.2.3 z-score scaling
1.3 Clustering: Hierarchical Clustering
1.3.1 Construct a dendrogram using Ward linkage and Euclidean distance
1.3.2 Identify the optimum number of Clusters
1.4 Clustering: K-means Clustering
1.4.1 Apply K-means Clustering
1.4.2 Plot the Elbow curve
1.4.3 Check Silhouette Scores and Figure out the appropriate number of clusters
1.4.4 Cluster Profiling
1.5 Clustering: Actionable Insights & Recommendations
Extract meaningful insights (at least 3) from the clusters to identify the most effective types
of ads, target audiences, or marketing strategies that can be inferred from each segment.
Based on the clustering analysis and key insights, provide actionable recommendations
(at least 3) to Ads24x7 on how to optimize their digital marketing efforts, allocate budgets
efficiently, and tailor ad content to specific audience segments.

PCA - Part 2
2.1 PCA: Define the problem and perform Exploratory Data Analysis
2.1.1 Problem Definition - Check shape, Data types, statistical summary
2.1.2 Perform an EDA on the data to extract useful insights
2.2 PCA: Data Preprocessing
2.2.1 Check for and treat (if needed) missing values
2.2.2 Check for and treat (if needed) data irregularities
2.2.3 Scale the Data using the z-score method
2.2.4 Visualize the data before and after scaling and comment on the impact on outliers
2.3 PCA: PCA
2.3.1 Create the covariance matrix
2.3.2 Get eigen values and eigen vectors
2.3.3 Identify the optimum number of PCs
2.3.4 Show Scree plot
2.3.5 Compare PCs with Actual Columns and identify which is explaining most variance
2.3.6 Write inferences about all the PCs in terms of actual variables
2.3.7 Write linear equation for first PC

PART-1: Clustering:
Digital Ads Data:
ads24x7 is a digital marketing company that has received seed funding of $10
million. It is expanding into marketing analytics. The company collected data
from its Marketing Intelligence team and now wants you (its newly appointed
data analyst) to segment types of ads based on the features provided. Use a clustering
procedure to segment the ads into homogeneous groups.
The following three features are commonly used in digital marketing:
CPM = (Total Campaign Spend / Number of Impressions) * 1,000. Note
that the Total Campaign Spend refers to the 'Spend' column in the dataset and the
Number of Impressions refers to the 'Impressions' column in the dataset.
CPC = Total Cost (Spend) / Number of Clicks. Note that the Total Cost (Spend)
refers to the 'Spend' column in the dataset and the Number of Clicks refers to the
'Clicks' column in the dataset.
CTR = (Total Measured Clicks / Total Measured Ad Impressions) * 100.
Note that the Total Measured Clicks refers to the 'Clicks' column in the dataset
and the Total Measured Ad Impressions refers to the 'Impressions' column in the
dataset.
The Data Dictionary and the detailed description of the formulas for
CPM, CPC and CTR are given in sheet 2 of the Clustering Clean
ads_data Excel file.
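
As a quick check of the three formulas, the sketch below computes them for one made-up ad. The function names and the numbers are illustrative assumptions, not values from the dataset.

```python
def cpm(spend: float, impressions: float) -> float:
    """Cost per thousand impressions: (spend / impressions) * 1,000."""
    return spend * 1000 / impressions


def cpc(spend: float, clicks: float) -> float:
    """Cost per click: spend / clicks."""
    return spend / clicks


def ctr(clicks: float, impressions: float) -> float:
    """Click-through rate in percent: (clicks / impressions) * 100."""
    return clicks * 100 / impressions


# Hypothetical ad: spend of 50, 20,000 impressions, 100 clicks.
example_cpm = cpm(50, 20_000)   # 2.5
example_cpc = cpc(50, 100)      # 0.5
example_ctr = ctr(100, 20_000)  # 0.5
```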
Perform the following in the given order:
 Read the data and perform basic analysis such as printing a few rows (head
and tail), info, data summary, null values, duplicate values, etc.
 Treat missing values in CPC, CTR and CPM using the formulas given. You may
refer to the Bank_KMeans Solution File to understand the coding behind
treating the missing values using a specific formula. You basically have to
create a user-defined function and then call the function for imputing.
 Check if there are any outliers.
 Do you think treating outliers is necessary for K-Means clustering? Based on
your judgement, decide whether to treat outliers and, if yes, which method to
employ. (As an analyst, your judgement may be different from another
analyst's.)
 Perform z-score scaling and discuss how it affects the speed of the algorithm.
 Perform clustering and do the following:
 Perform hierarchical clustering by constructing a dendrogram using Ward linkage and
Euclidean distance.
 Make an elbow plot (up to n=10) and identify the optimum number of clusters for the
k-means algorithm.
 Print silhouette scores for up to 10 clusters and identify the optimum number of
clusters.
 Profile the ads based on the optimum number of clusters using the silhouette score
and your domain understanding.
[Hint: Group the data by clusters and take the sum or mean to identify trends in
clicks, spend, revenue, CPM, CTR, & CPC based on Device Type. Make bar
plots.]
 Conclude the project by providing a summary of your learnings.

1.1 Clustering: Define the problem and perform
Exploratory Data Analysis
1.1.1 Problem Definition
 Imported necessary libraries like NumPy, Pandas, matplotlib and seaborn.
 Loaded the given dataset into the dataframe df.

Fig 1: Dataset Head rows

1.1.2 Check shape, Data types, statistical summary


 The dataset has 23066 rows and 19 columns. It has 6 float, 7 integer and
6 object columns.

Fig 2: Dataset Info

 Below is the statistical summary of the dataset.

Fig 3: Dataset Statistical Summary

 There are no duplicates in the dataset.
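
The checks in this subsection can be sketched with pandas as follows. The tiny dataframe is a made-up stand-in (the real dataset has 23066 rows and 19 columns), and the report does not show its exact code, so treat this as one possible way to run the checks.

```python
import pandas as pd

# Toy stand-in for the ads data; column names follow the problem statement.
df = pd.DataFrame({
    "Device Type": ["Mobile", "Desktop", "Mobile"],
    "Spend": [10.5, 20.0, 10.5],
    "Clicks": [3, 7, 3],
})

n_rows, n_cols = df.shape                            # shape of the data
dtype_counts = df.dtypes.astype(str).value_counts()  # columns per data type
summary = df.describe()                              # statistical summary (numeric columns)
n_duplicates = int(df.duplicated().sum())            # fully duplicated rows
```

In this toy frame the first and last rows are identical, so one duplicate is reported; on the report's data the same check returned none.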

1.1.3 Univariate analysis and Bivariate analysis


 Categorical Variables

Fig 4: Inventory Type Analysis

Fig 5: Platform, Device type and Format Analysis

1. Format 4 is the most used Inventory type, followed by Format 5 and Format 1.
2. The most preferred platform is Video, followed by Web and App.
3. Mobile is preferred over Desktop as a device type.
4. The Display and Video formats are used almost equally.

 Numerical Variables

1. The median spend lies between 1000 and 2000.
2. Most data points have a low number of impressions, with the frequency
peaking at around 0.5 million impressions.
3. The KDE confirms that the distribution is skewed towards lower
impression counts.
4. Clicks follow a right-skewed distribution: most data points are clustered
at the lower end of the click range, so lower click counts are more common.

Fig 6: Spend, Impressions, Revenue and Clicks distribution

Fig 7: CTR, CPM, CPC distribution

 Relationship between Numerical Variables


Based on Pair plot:
1) Ad Length and Ad Width: There is a clear upward trend indicating that as the ad
length increases, the ad width tends to increase as well
2) Ad Length and Ad Size: Since ad size is likely a function of ad length and ad width, it's
not surprising to see a positive correlation here, with larger ad lengths contributing to
larger overall ad size
3) Ad Width and Ad Size: Similar to ad length, as the ad width increases, the ad size also
increases, showing a positive correlation
4) Impressions and Clicks: There is a positive correlation, as more impressions typically
lead to more clicks
5) Impressions and Revenue: The scatter plot suggests that higher impressions are
associated with higher revenue, indicating a positive correlation
6) Clicks and Revenue: This scatter plot also shows a positive correlation, where more
clicks are associated with higher revenue

Fig 8: Pair plot of numeric variables

7) The revenue distribution by device type (Fig 9) shows that mobile revenue is generally
lower than desktop revenue: the most prominent peak of the mobile distribution occurs at a
much lower value than the peak of the desktop distribution.

Fig 9: Revenue based on Device Type

8) The median spend on the App platform is higher than on the Video platform
but lower than on the Web platform.
9) Spend is therefore highest on Web, followed by App and then Video.

Fig 10: Spend based on Platform

1.2 Clustering: Data Preprocessing


1.2.1 Missing value check and treatment
 CTR, CPM and CPC each have 4736 missing values, as shown below.

Fig 11: Missing values in the dataset

 The missing values were imputed using the CPM, CPC and CTR formulas from the problem
statement; no null values remain after imputation.

Fig 12: Formulae and missing values in the dataset after imputation
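
A minimal sketch of formula-based imputation with a user-defined function is given below. The function name `impute_metrics` and the two-row dataframe are assumptions made for illustration; only the column names and formulas come from the problem statement.

```python
import numpy as np
import pandas as pd

# Made-up rows: the first has missing CPM, CPC and CTR, the second is complete.
df = pd.DataFrame({
    "Spend": [50.0, 30.0],
    "Impressions": [20000, 10000],
    "Clicks": [100, 60],
    "CPM": [np.nan, 3.0],
    "CPC": [np.nan, 0.5],
    "CTR": [np.nan, 0.6],
})


def impute_metrics(data: pd.DataFrame) -> pd.DataFrame:
    """Fill missing CPM, CPC and CTR from Spend, Impressions and Clicks."""
    out = data.copy()
    out["CPM"] = out["CPM"].fillna(out["Spend"] * 1000 / out["Impressions"])
    out["CPC"] = out["CPC"].fillna(out["Spend"] / out["Clicks"])
    out["CTR"] = out["CTR"].fillna(out["Clicks"] * 100 / out["Impressions"])
    return out


df_imputed = impute_metrics(df)
remaining_nulls = int(df_imputed[["CPM", "CPC", "CTR"]].isnull().sum().sum())
```

Only the missing cells are filled; existing values are left untouched. Rows with zero clicks or zero impressions would produce infinite or undefined values and need separate handling.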

1.2.2 Outlier Treatment


 Outliers are present in all features except Ad_length and Ad_width.
They were treated using the IQR method.
Before Outlier treatment:

Fig 13: Before Outlier treatment

After Outlier treatment:

Fig 14: After Outlier treatment
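
The report does not say whether outliers were capped or dropped. The sketch below assumes capping at the usual 1.5 x IQR whiskers; the helper name `cap_outliers_iqr` and the sample values are hypothetical.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


spend = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # 100 is an obvious outlier
capped = cap_outliers_iqr(spend)                # Q1=2, Q3=4, bounds are [-1, 7]
```

Capping keeps every row, which matters when the cluster labels are later added back to the original dataframe.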

1.2.3 z-score scaling


 As shown below, the features are on different scales, so scaling is
needed. Z-score scaling is applied.
Before scaling:

Fig 15: Before scaling

After scaling:

Fig 16: After scaling
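
Z-score scaling can be sketched as below, assuming scikit-learn's `StandardScaler` (the report does not name the function it used). The small array is made up to show two features on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales (e.g. spend and a rate).
X = np.array([[1000.0, 0.1],
              [2000.0, 0.2],
              [3000.0, 0.3]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column: (x - mean) / std

col_means = X_scaled.mean(axis=0)   # approximately 0 for every column
col_stds = X_scaled.std(axis=0)     # approximately 1 for every column
```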

1.3 Clustering: Hierarchical Clustering
1.3.1 Construct a dendrogram using Ward linkage and
Euclidean distance
 Imported dendrogram and linkage from scipy.cluster.hierarchy.
 Using the ‘Ward’ method and the ‘euclidean’ metric, constructed the dendrogram below,
truncated to the last 10 merged clusters.

Fig 17: Truncated Dendrogram
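
A minimal sketch of the dendrogram step, assuming a scaled feature matrix named `X_scaled` (here replaced by random stand-in data):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(30, 3))  # stand-in for the scaled ads features

# Ward linkage on Euclidean distances.
Z = linkage(X_scaled, method="ward", metric="euclidean")

# Keep only the last 10 merged clusters. no_plot=True just computes the
# layout; set it to False (with matplotlib installed) to draw the figure.
dend = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)
```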

1.3.2 Identify the optimum number of Clusters


 From the above dendrogram, ‘5’ appears to be the optimal
number of clusters.
 fcluster was applied and its output is shown below. The resulting cluster
labels were added as a column to the original df.

Fig 18: Hierarchical cluster

Fig 19: Hierarchical cluster added to original df
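
Cutting the tree into 5 clusters with `fcluster` might look like the sketch below. The `maxclust` criterion and the column name `H_cluster` are assumptions, since the report does not show the call; the data is a random stand-in.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(30, 3)), columns=["f1", "f2", "f3"])

Z = linkage(df.to_numpy(), method="ward", metric="euclidean")

# Ask for at most 5 flat clusters; labels start at 1.
labels = fcluster(Z, t=5, criterion="maxclust")
df["H_cluster"] = labels  # hypothetical column name
```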

1.4 Clustering: K-means Clustering


1.4.1 Apply K-means Clustering
 Imported KMeans, silhouette_score and silhouette_samples.
 Below is the K-means inertia for the first 10 values of k.

Fig 20: K means Inertia

 The differences in inertia are easier to see in the elbow curve plot.

1.4.2 Plot the Elbow curve


 From the elbow curve below, inertia drops sharply from 1 to 5 clusters.
Beyond 5, the drop is slow and smooth. So, 5 clusters would be the
optimal number.

Fig 21: Elbow Curve
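
The inertia values behind an elbow curve can be collected as in this sketch. The data is a random stand-in, and `n_init` and `random_state` are illustrative choices rather than the report's settings.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(200, 3))  # stand-in for the scaled ads features

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=1)
    km.fit(X_scaled)
    inertias.append(km.inertia_)  # within-cluster sum of squares for this k
```

Plotting `range(1, 11)` against `inertias` (for example with matplotlib) gives the elbow curve; the elbow is the value of k after which the drop flattens.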

1.4.3 Check Silhouette Scores and Figure out the appropriate
number of clusters
 The silhouette score is used to evaluate the model. It is around 0.52 for 5
clusters. Since the score ranges from -1 to 1, this positive value indicates that
the clusters are reasonably well separated.

Fig 22: Silhouette score

 The silhouette width shown below is also positive, which means the observations
are mapped to the appropriate cluster.

Fig 23: Silhouette width


 Silhouette width is added to the data frame as shown below.

Fig 24: Silhouette width added to df
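
A sketch of the silhouette check follows. The data is two synthetic, well-separated blobs rather than the ads data, so the best k here is 2, not the 5 found in the report.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
# Two synthetic, well-separated blobs as stand-in data.
X_scaled = np.vstack([
    rng.normal(0, 0.5, size=(50, 2)),
    rng.normal(5, 0.5, size=(50, 2)),
])

# The silhouette score needs at least 2 clusters, so k starts at 2.
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
best_labels = KMeans(n_clusters=best_k, n_init=10, random_state=1).fit_predict(X_scaled)

# One silhouette width per observation; negative values flag points that
# sit closer to a neighbouring cluster than to their own.
sil_width = silhouette_samples(X_scaled, best_labels)
```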

1.4.4 Cluster Profiling
 The data are grouped by K-means cluster and the mean of each variable is taken, as shown
below.

Fig 25: Data grouped by clusters

 As per the rubric, the above table is plotted as bar graphs split by device
type.

Fig 26: Data grouped by clusters by device type as hue
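
Cluster profiling by group means can be sketched as below. The column name `kmeans_cluster` and the four rows are made up for illustration.

```python
import pandas as pd

# Made-up rows; "kmeans_cluster" is a hypothetical name for the label column.
df = pd.DataFrame({
    "kmeans_cluster": [0, 0, 1, 1],
    "Device Type": ["Mobile", "Desktop", "Mobile", "Desktop"],
    "Spend": [100.0, 300.0, 1000.0, 3000.0],
    "Clicks": [10, 30, 50, 70],
})

# Mean of each variable per cluster.
cluster_profile = df.groupby("kmeans_cluster")[["Spend", "Clicks"]].mean()

# Same means split by device type, in long form ready for a bar plot.
by_device = (
    df.groupby(["kmeans_cluster", "Device Type"])[["Spend", "Clicks"]]
    .mean()
    .reset_index()
)
```

The long-form `by_device` table can be passed to a bar-plot function with the cluster on the x-axis and the device type as the hue.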

1.5 Clustering: Actionable Insights &
Recommendations
 Based on the above analysis, the clusters can be grouped into high-spending,
medium-spending and low-spending segments.
 Clusters 2 and 4 are high spending, clusters 0 and 1 are medium spending and
cluster 3 is low spending.
 Cluster 0 has the lowest Cost-Per-Click (CPC). Among the analysed clusters,
it is therefore the most cost-effective segment for digital advertising,
with the lowest average cost incurred per click on advertisements.
 Cluster 4 has the highest CTR for both Desktop and Mobile devices, making it
the best-performing of the five clusters in terms of CTR.
 Cluster 3 has the highest revenue, which suggests that it is the best cluster
for revenue among those analysed.
 Cluster 1 has the highest spend for both Desktop and Mobile devices,
making it the top cluster in terms of spend.
 Clusters 1 and 3 show a significant difference in ad dimensions (length and
width) and in their performance metrics. Consider testing different ad sizes to
find the most effective dimensions for engagement and clicks.
 Cluster 4 has a high number of clicks and a substantial revenue figure. Analyze
the characteristics of ads in this cluster to understand what makes them
successful and replicate these features in other ads.
 Clusters 0 and 2 have a lower spend-to-revenue ratio than the others.
Evaluate the ROI of each cluster and adjust ad spend accordingly to
maximize profitability.

PCA:
PART 2: PCA FH (FT): Primary census abstract for female headed households excluding
institutional households (India & States/UTs - District Level), Scheduled tribes - 2011
PCA for Female Headed Household Excluding Institutional Household. The Indian
Census has the reputation of being one of the best in the world. The first Census in India
was conducted in the year 1872. This was conducted at different points of time in
different parts of the country. In 1881 a Census was taken for the entire country
simultaneously. Since then, Census has been conducted every ten years, without a
break. Thus, the Census of India 2011 was the fifteenth in this unbroken series since
1872, the seventh after independence and the second census of the third millennium
and twenty-first century. The census has continued uninterrupted despite
several adversities like wars, epidemics, natural calamities, political unrest, etc. The
Census of India is conducted under the provisions of the Census Act 1948 and the
Census Rules, 1990. The Primary Census Abstract, which is an important publication of the
2011 Census, gives basic information on Area, Total Number of Households, Total
Population, Scheduled Castes, Scheduled Tribes Population, Population in the age group
0-6, Literates, Main Workers and Marginal Workers classified by the four broad
industrial categories, namely, (i) Cultivators, (ii) Agricultural Laborers, (iii) Household
Industry Workers, and (iv) Other Workers and also Non-Workers. The characteristics of
the Total Population include Scheduled Castes, Scheduled Tribes, Institutional and
Houseless Population and are presented by sex and rural-urban residence. Census 2011
covered 35 States/Union Territories, 640 districts, 5,924 sub-districts, 7,935 Towns and
6,40,867 Villages.
The collected data has so many variables that it is difficult to find useful details
without using data science techniques. You are tasked with performing detailed EDA and
identifying the optimum principal components that explain the most variance in the data. Use
sklearn only.
 Note: The 24 variables given in the rubric are just for performing EDA. You will have to
consider the entire dataset, including all the variables, for performing PCA.
Data file - PCA India Data Census.xlsx

2.1 PCA: Define the problem and perform Exploratory
Data Analysis
2.1.1 Problem Definition - Check shape, Data types, statistical
summary
 Imported necessary libraries like NumPy, Pandas and Seaborn.
 The data is read using pd.read_excel and the first 5 rows are shown below.
 The dataset has 640 rows and 61 columns.

Fig 27: Data Head

Fig 28: Data Statistical Summary

 The dataset has 59 numeric variables and 2 object variables, and there are no null
values, as shown below.
 There are no duplicates in the dataset.
 In the statistical summary, the 50% row represents the median. For many
columns, the mean is higher than the median, indicating a
right-skewed distribution.

Fig 29: Data info

2.1.2 Perform an EDA on the data to extract useful insights


 As per the graph below, Uttar Pradesh, Madhya Pradesh and Bihar have the highest
number of area names.
 These 5 variables were considered for EDA: State, LIT_F, LIT_M, TOT_M, TOT_F.

Fig 30: State of India

 Kerala has the highest female and male population, followed by West Bengal, as
shown in the graphs below.

Fig 31: Total Male and Female Population of different states

 Kerala, Maharashtra and Tamil Nadu have higher median values of LIT_M,
which means a high number of literate males.
 Kerala has the highest number of literate females and Bihar has the lowest.

Fig 32: Total Male and Female Literates of different states

 Mumbai Suburban in Maharashtra has the highest number of literate males, followed by
North 24 Parganas in West Bengal.
 The top 5 areas by literate males, grouped by state and area, are shown below.

Fig 33: Top 5 Literate male grouped by state and area

2.2 PCA: Data Preprocessing


2.2.1 Check for and treat (if needed) missing values
 There are no null values as shown below.

Fig 34: Null values in dataset

2.2.3 Scale the Data using the z-score method
Before scaling:
 The variables are on different scales. To improve the analysis, scaling is
necessary. The z-score technique is used.

Fig 35: Before Scaling

After Scaling:

Fig 36: After Scaling

2.2.4 Visualize the data before and after scaling and comment
on the impact on outliers
Before Scaling: Outliers

Fig 37: Before Scaling

After Scaling: Outliers

 Scaling has no impact on the outliers. This can be seen by comparing the two
pair plots, before and after scaling.

Fig 38: After Scaling
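
The reason scaling leaves the outliers in place is that the z-score is a linear transformation: it shifts and rescales values but keeps their relative positions. The sketch below checks this on made-up numbers with a hypothetical IQR-based helper.

```python
import numpy as np


def iqr_outlier_mask(x: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 95.0])  # 95 is the outlier
z = (x - x.mean()) / x.std()                        # z-score scaling

mask_before = iqr_outlier_mask(x)
mask_after = iqr_outlier_mask(z)  # the same points are flagged
```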

2.3 PCA: PCA


2.3.1 Create the covariance matrix
 The covariance matrix captures the variance of each variable and the relationships
between different variables in the dataset. It is shown as a heatmap below.

Fig 39: Heatmap of dataset
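
A covariance matrix of scaled data can be computed as in the sketch below, using random stand-in data. Note that after z-score scaling the covariance matrix is just the correlation matrix of the original data, up to a factor of n / (n - 1).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # stand-in for the census features
X[:, 1] += 2 * X[:, 0]         # make two columns correlated

X_scaled = StandardScaler().fit_transform(X)

# rowvar=False: columns are variables, rows are observations.
cov_matrix = np.cov(X_scaled, rowvar=False)
corr_matrix = np.corrcoef(X, rowvar=False)
```

The resulting matrix can be passed to a heatmap function for display.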

 Since the variables are correlated, Bartlett's test of sphericity was conducted to
check whether the correlation is significant. If the p-value < 0.05, the null
hypothesis that the variables are uncorrelated is rejected, so the correlation is
significant. For the given dataset, the p-value is 0, so we can proceed
with PCA.

Fig 40: Bartlett Sphericity
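
The report does not show how the test was run. As an illustration only, the sketch below writes out the standard Bartlett sphericity statistic by hand with NumPy and SciPy; the function name and the synthetic data are assumptions.

```python
import numpy as np
from scipy.stats import chi2


def bartlett_sphericity(X: np.ndarray) -> tuple[float, float]:
    """Test H0: the correlation matrix is the identity (no correlation)."""
    n, p = X.shape
    corr = np.corrcoef(X, rowvar=False)
    _, logdet = np.linalg.slogdet(corr)  # log of det(corr), numerically stable
    statistic = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) / 2
    p_value = chi2.sf(statistic, dof)
    return float(statistic), float(p_value)


rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = base + 0.3 * rng.normal(size=(200, 4))  # four strongly correlated columns

stat, p_value = bartlett_sphericity(X)  # small p-value: reject H0
```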

 To confirm sample adequacy, the KMO (Kaiser-Meyer-Olkin) test is used. A value
above 0.7 is good. For the given dataset, it is 0.8, which confirms sample adequacy.

Fig 41: Kmo model
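
For reference, the overall KMO measure compares squared correlations with squared partial correlations. The hand-written sketch below is an illustration under that definition, not the report's code; the function name and the synthetic data are assumptions.

```python
import numpy as np


def kmo_overall(X: np.ndarray) -> float:
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    corr = np.corrcoef(X, rowvar=False)
    inv = np.linalg.inv(corr)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)               # partial correlations
    off = ~np.eye(corr.shape[0], dtype=bool)      # off-diagonal entries only
    r2 = np.sum(corr[off] ** 2)
    a2 = np.sum(partial[off] ** 2)
    return float(r2 / (r2 + a2))


rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = base + 0.3 * rng.normal(size=(200, 4))  # four strongly correlated columns

kmo = kmo_overall(X)  # closer to 1 means the data suit PCA better
```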

2.3.2 Get eigen values and eigen vectors


 The number of eigenvalues equals the number of PCs.
 Below are the eigenvectors of the dataset.

Fig 42: Eigen vectors

Fig 43: Eigen value

 Below is the explained variance ratio

Fig 44: Explained Variance ratio
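
With scikit-learn, the eigenvectors, eigenvalues and explained variance ratios are available as attributes of a fitted `PCA` object. The sketch below uses random stand-in data instead of the census data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # stand-in for the census features
X[:, 1] += 3 * X[:, 0]
X_scaled = StandardScaler().fit_transform(X)

pca = PCA()  # keep all components
pca.fit(X_scaled)

eigen_vectors = pca.components_                  # one row per PC
eigen_values = pca.explained_variance_           # sorted in descending order
variance_ratio = pca.explained_variance_ratio_   # share of variance per PC
```

These eigenvalues match the eigenvalues of the covariance matrix of the scaled data.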

2.3.3 Identify the optimum number of PCs


 As per the rubric, 90% of the explained variance needs to be retained.
 As per the cumulative explained variance (in %) below, 6 PCs can be considered,
which together cover 90% of the variance.

Fig 45: Explained Variance in %
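
Picking the smallest number of PCs that reaches a 90% threshold can be sketched as below. The ratios are made up for illustration and are not the values from the report.

```python
import numpy as np

# Hypothetical explained variance ratios, not the report's values.
variance_ratio = np.array([0.50, 0.20, 0.10, 0.06, 0.05, 0.04, 0.03, 0.02])

cumulative_pct = np.cumsum(variance_ratio) * 100

# Index of the first PC where the cumulative share reaches 90%, plus one.
n_components = int(np.argmax(cumulative_pct >= 90) + 1)  # 5 for these ratios
```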

2.3.4 Show Scree plot


 As per the scree plot, the drop is slow after 6 PCs. The optimal number of PCs
would be 6, which results in dimensionality reduction.

Fig 46: Scree Plot

 After confirming the number of PCs, PCA is applied to all the features and the shape
becomes 640 rows and 6 columns, as shown below.

Fig 47: Post applying PCA

Fig 48: Correlation post PCA
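
Applying PCA with 6 components and checking the correlation between the resulting PCs might look like the sketch below. The input is random stand-in data with 640 rows and only 10 columns; `svd_solver="full"` is an illustrative choice that keeps the decomposition exact.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(640, 10))  # stand-in: 640 rows, fewer columns than the real data
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=6, svd_solver="full")
scores = pca.fit_transform(X_scaled)  # shape (640, 6)

# Principal components are uncorrelated with each other, so this matrix
# is the identity up to rounding error.
pc_corr = np.corrcoef(scores, rowvar=False)
```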

2.3.5 Compare PCs with Actual Columns and identify which is
explaining most variance
 PC1 has the highest absolute loading values compared to the other PCs.
 The first bar in the PC1 graph is the tallest among all the first bars in the other
PC graphs, which suggests that PC1 accounts for the most variance within the
data set.

Fig 49: Absolute Loadings of PCs

Fig 50: Correlations of PCs with original features

2.3.7 Write linear equation for first PC


PC1 = a1*x1 + a2*x2 + a3*x3 + ... + an*xn
where a1, ..., an are the loadings of the first PC and x1, ..., xn are the scaled variables.
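
To make the equation concrete, the sketch below builds it from the first row of `components_` and checks that the weighted sum reproduces the first PC score from scikit-learn. The three-feature stand-in data and the names x1 to x3 are assumptions; the report's actual coefficients are not reproduced here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))  # stand-in data with three features
X[:, 1] += 2 * X[:, 0]
feature_names = ["x1", "x2", "x3"]  # hypothetical variable names
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(svd_solver="full").fit(X_scaled)
a = pca.components_[0]  # loadings a1..an of the first PC

equation = "PC1 = " + " + ".join(
    f"({coef:.3f} * {name})" for coef, name in zip(a, feature_names)
)

pc1_manual = X_scaled @ a                    # a1*x1 + a2*x2 + a3*x3 per row
pc1_sklearn = pca.transform(X_scaled)[:, 0]  # first PC score from sklearn
```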
