5/22/2020 srivardhan python
Programming for data analysis
Name: katakam srivardhan hruday kuamr
student id: st20166815
Moodle code: CIS7031_S2_19
Moodle leader: Imitiaz Khan
1. Data Preparation
1.1 Downloaded the dataset for the period 2008 to 2019 from the StatsWales data source.
1.2 The data was processed and no outliers or null values were found in the dataset.
1.3 The industry names in the dataset were changed as specified in the assignment.
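The notebook does not show how the null and outlier checks in 1.2 were carried out. A minimal sketch of one way to do it, assuming a small excerpt of the table and the common 1.5 × IQR rule of thumb (both are assumptions, not the method actually used):

```python
import pandas as pd

# Small excerpt of the employment table (2009 and 2018 columns only)
df = pd.DataFrame(
    {2009: [37700, 156700, 96600, 345400], 2018: [41100, 165700, 101800, 347600]},
    index=["Agriculture", "Production", "Construction", "Retail"],
)

# Count missing values per column
null_counts = df.isnull().sum()

# Flag values lying more than 1.5 * IQR outside the quartiles of each column
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
outlier_mask = (df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)

print(null_counts.to_dict())
print(int(outlier_mask.sum().sum()))
```

Whether a flagged value is a genuine outlier still needs judgement: a large industry can sit far from the others without being a data error.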
In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
In [2]:
base_data=pd.read_excel('C:\\Users\\Admin\\Downloads\\Python\\[Link]')
Below is the final dataframe, which shows the Wales total employment values.
localhost:8888/nbconvert/html/srivardhan [Link]?download=false 1/25
In [3]:
base_data.head(15)
Out[3]:
Industry 2009 2010 2011 2012 2013 2014 2015 2016
0 Agriculture 37700 38200 36100 36100 36800 42700 40700 43200
1 Production 156700 149800 158600 154400 164200 173300 172300 162500 1
2 Construction 96600 93200 90000 91300 89300 97000 92600 102700
3 Retail 345400 344500 343100 347300 345100 337300 357700 360200 3
4 ICT 27800 27900 26400 27200 26900 35700 24000 34400
5 Finance 33800 29800 33200 31100 32400 32400 30800 31000
6 Real_Estate 13500 14600 17600 18800 18000 22200 19100 22700
7 Professional_Service 144800 145800 143600 137300 149900 152900 166200 161200 1
8 Public_Adminstration 415600 418600 425600 421000 427000 427600 423200 418500 4
9 Other_Service 64200 68000 72400 72800 75500 73300 77200 72400
In [4]:
base_data.index=base_data['Industry']
base_data.head()
Out[4]:
Industry 2009 2010 2011 2012 2013 2014 2015 2016
Industry
Agriculture Agriculture 37700 38200 36100 36100 36800 42700 40700 43200
Production Production 156700 149800 158600 154400 164200 173300 172300 162500
Construction Construction 96600 93200 90000 91300 89300 97000 92600 102700
Retail Retail 345400 344500 343100 347300 345100 337300 357700 360200
ICT ICT 27800 27900 26400 27200 26900 35700 24000 34400
In [5]:
del base_data['Industry']
In [6]:
base_data.head()
Out[6]:
2009 2010 2011 2012 2013 2014 2015 2016 2017 2
Industry
Agriculture 37700 38200 36100 36100 36800 42700 40700 43200 40200 41
Production 156700 149800 158600 154400 164200 173300 172300 162500 165100 165
Construction 96600 93200 90000 91300 89300 97000 92600 102700 90800 101
Retail 345400 344500 343100 347300 345100 337300 357700 360200 333500 347
ICT 27800 27900 26400 27200 26900 35700 24000 34400 58900 31
In [7]:
base_data['Total_Employees']=base_data.sum(axis=1)
In [8]:
base_data['Total_Employees_Growth']=round(((base_data[2018]/base_data[2009])-1)*100,2)
In [9]:
base_data.head(10)
Out[9]:
2009 2010 2011 2012 2013 2014 2015 2016 20
Industry
Agriculture 37700 38200 36100 36100 36800 42700 40700 43200 402
Production 156700 149800 158600 154400 164200 173300 172300 162500 1651
Construction 96600 93200 90000 91300 89300 97000 92600 102700 908
Retail 345400 344500 343100 347300 345100 337300 357700 360200 3335
ICT 27800 27900 26400 27200 26900 35700 24000 34400 589
Finance 33800 29800 33200 31100 32400 32400 30800 31000 321
Real_Estate 13500 14600 17600 18800 18000 22200 19100 22700 182
Professional_Service 144800 145800 143600 137300 149900 152900 166200 161200 1764
Public_Adminstration 415600 418600 425600 421000 427000 427600 423200 418500 4245
Other_Service 64200 68000 72400 72800 75500 73300 77200 72400 832
2. Data Analysis
In [10]:
base_data['Total_Employees'].min()
Out[10]:
189900
In [11]:
base_data[base_data['Total_Employees']==base_data['Total_Employees'].min()].index
Out[11]:
Index(['Real_Estate'], dtype='object', name='Industry')
In [12]:
base_data['Total_Employees'].max()
Out[12]:
4236500
In [13]:
base_data[base_data['Total_Employees']==base_data['Total_Employees'].max()].index
Out[13]:
Index(['Public_Adminstration'], dtype='object', name='Industry')
In [14]:
import plotly.express as px
2.1 Which industries employed the highest and lowest number of workers over the period?
We fetched workplace employment data by industry and area (Wales) for 12 years, from 2008 to 2019; the totals used here are summed over the 2009 to 2018 columns of the dataframe. Below is the visualisation using Python Plotly Express. Public administration has the highest total employment over the period (4,236,500), while real estate employs the fewest people (189,900).
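The highest and lowest industries can also be read off directly with `idxmax` and `idxmin` instead of filtering on the maximum and minimum value. A minimal sketch, assuming a three-industry excerpt of the `Total_Employees` column with the totals reported in this notebook:

```python
import pandas as pd

# Excerpt of the Total_Employees column (values taken from the outputs above)
totals = pd.Series(
    {"Retail": 3461700, "Real_Estate": 189900, "Public_Adminstration": 4236500},
    name="Total_Employees",
)

# idxmin / idxmax return the index label of the smallest / largest value
lowest = totals.idxmin()
highest = totals.idxmax()

print(lowest, totals[lowest])
print(highest, totals[highest])
```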
In [15]:
fig = [Link](base_data, y="Total_Employees", x=base_data.index, color=base_data.index, text='Total_Employees')
fig.update_layout(title_text='Industry Employee Numbers')
fig.show()
[Figure: Industry Employee Numbers; y-axis: Total_Employees]
2.2 Which industry has the highest and lowest overall growth over the period?
The visualisation below shows the percentage growth in employment for each industry over the period, computed from the 2009 and 2018 columns. Real estate shows the highest growth, 86.67%, while retail shows the lowest, 0.64%.
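The growth figures follow from the formula (value in 2018 / value in 2009 - 1) × 100. A short worked sketch for the two industries named above, using their 2009 and 2018 values from the dataframe:

```python
import pandas as pd

# 2009 and 2018 employment for the highest- and lowest-growth industries
df = pd.DataFrame(
    {2009: [13500, 345400], 2018: [25200, 347600]},
    index=["Real_Estate", "Retail"],
)

# Percentage growth between the first and last year, rounded to 2 decimals
growth = ((df[2018] / df[2009] - 1) * 100).round(2)

print(growth.to_dict())
```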
In [16]:
fig = [Link](base_data, y="Total_Employees_Growth", x=base_data.index, color=base_data.index, text='Total_Employees_Growth')
fig.update_layout(title_text='Industry Employee % Growth')
fig.show()
[Figure: Industry Employee % Growth; y-axis: Total_Employees_Growth]
2.3 Which years are the best and worst performing years in relation to the number of employments (highest and lowest employment)?
The bar graph shows total employment for each year. 2018 is the best performing year, with the highest employment at 1.45 million, while 2010 is the worst performing year, with the lowest employment at 1.33 million.
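The yearly totals are column sums over the industries. A minimal sketch, assuming only the 2009, 2010 and 2018 columns and restricting the sum to the year columns so that a derived column such as `Total_Employees` is not carried into the result:

```python
import pandas as pd

industries = ["Agriculture", "Production", "Construction", "Retail", "ICT",
              "Finance", "Real_Estate", "Professional_Service",
              "Public_Adminstration", "Other_Service"]
df = pd.DataFrame(
    {
        2009: [37700, 156700, 96600, 345400, 27800, 33800, 13500, 144800, 415600, 64200],
        2010: [38200, 149800, 93200, 344500, 27900, 29800, 14600, 145800, 418600, 68000],
        2018: [41100, 165700, 101800, 347600, 31500, 35500, 25200, 187100, 434900, 81800],
    },
    index=industries,
)

# A derived per-industry column, similar to the one added earlier in the notebook
df["Total_Employees"] = df.sum(axis=1)

# Sum over industries for the year columns only
year_cols = [2009, 2010, 2018]
yearly_totals = df[year_cols].sum(axis=0)

best_year = yearly_totals.idxmax()
worst_year = yearly_totals.idxmin()
print(yearly_totals.to_dict(), best_year, worst_year)
```

Selecting the year columns explicitly keeps rows such as `Total_Employees` out of a transposed frame, where they would otherwise appear as an extra "year".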
In [17]:
base_data2=base_data.T
base_data2.head()
Out[17]:
Industry Agriculture Production Construction Retail ICT Finance Real_Estate Pro
2009 37700.0 156700.0 96600.0 345400.0 27800.0 33800.0 13500.0
2010 38200.0 149800.0 93200.0 344500.0 27900.0 29800.0 14600.0
2011 36100.0 158600.0 90000.0 343100.0 26400.0 33200.0 17600.0
2012 36100.0 154400.0 91300.0 347300.0 27200.0 31100.0 18800.0
2013 36800.0 164200.0 89300.0 345100.0 26900.0 32400.0 18000.0
In [18]:
base_data2.head()
Out[18]:
Industry Agriculture Production Construction Retail ICT Finance Real_Estate Pro
2009 37700.0 156700.0 96600.0 345400.0 27800.0 33800.0 13500.0
2010 38200.0 149800.0 93200.0 344500.0 27900.0 29800.0 14600.0
2011 36100.0 158600.0 90000.0 343100.0 26400.0 33200.0 17600.0
2012 36100.0 154400.0 91300.0 347300.0 27200.0 31100.0 18800.0
2013 36800.0 164200.0 89300.0 345100.0 26900.0 32400.0 18000.0
In [19]:
base_data2['Yearly_Total_Employees']=base_data2.sum(axis=1)
In [20]:
base_data2.head(10)
Out[20]:
Industry Agriculture Production Construction Retail ICT Finance Real_Estate Pro
2009 37700.0 156700.0 96600.0 345400.0 27800.0 33800.0 13500.0
2010 38200.0 149800.0 93200.0 344500.0 27900.0 29800.0 14600.0
2011 36100.0 158600.0 90000.0 343100.0 26400.0 33200.0 17600.0
2012 36100.0 154400.0 91300.0 347300.0 27200.0 31100.0 18800.0
2013 36800.0 164200.0 89300.0 345100.0 26900.0 32400.0 18000.0
2014 42700.0 173300.0 97000.0 337300.0 35700.0 32400.0 22200.0
2015 40700.0 172300.0 92600.0 357700.0 24000.0 30800.0 19100.0
2016 43200.0 162500.0 102700.0 360200.0 34400.0 31000.0 22700.0
2017 40200.0 165100.0 90800.0 333500.0 58900.0 32100.0 18200.0
2018 41100.0 165700.0 101800.0 347600.0 31500.0 35500.0 25200.0
In [21]:
fig=px.bar(base_data2,x=base_data2.index,y="Yearly_Total_Employees")
fig.update_layout(title='Yearly Total Employee',legend=dict(x=0,y=0.5))
fig.show()
[Figure: Yearly Total Employee; y-axis: Yearly_Total_Employees]
3. Visual analysis
In [22]:
base_data3=pd.read_excel('C:\\Users\\Admin\\Downloads\\Python\\[Link]')
base_data3.index=base_data3['Industry']
base_data3.head()
Out[22]:
Industry 2009 2010 2011 2012 2013 2014 2015 2016
Industry
Agriculture Agriculture 37700 38200 36100 36100 36800 42700 40700 43200
Production Production 156700 149800 158600 154400 164200 173300 172300 162500
Construction Construction 96600 93200 90000 91300 89300 97000 92600 102700
Retail Retail 345400 344500 343100 347300 345100 337300 357700 360200
ICT ICT 27800 27900 26400 27200 26900 35700 24000 34400
3.1 Create a dynamic scatter/bubble plot showing the change of workforce number over the period using Plotly Express.
To plot the scatter chart, the dataframe first has to be converted from wide format (one column per year) into long format (one row per industry and year). The code below performs this conversion.
In [23]:
del base_data3['Industry']
In [24]:
base_data4=base_data3.T
In [25]:
# Build a long-format frame: one block of rows per industry
frames = []
for col in base_data4.columns:
    if col != 'Yearly_Total_Employees':
        final_data = pd.DataFrame(columns=['Year', 'Workforce', 'Industry', 'Workforce_Change'])
        final_data['Workforce'] = base_data4[col].tolist()
        final_data['Industry'] = col
        final_data['Year'] = base_data4[col].index
        # Year-on-year change; the first year has no predecessor, so it is filled with 0
        final_data['Workforce_Change'] = final_data['Workforce'] - final_data['Workforce'].shift()
        final_data = final_data.fillna(0)
        frames.append(final_data)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way
Final_df = pd.concat(frames)
Final dataframe used to plot the scatter chart.
In [26]:
Final_df.head(100)
Out[26]:
Year Workforce Industry Workforce_Change
0 2009 37700 Agriculture 0.0
1 2010 38200 Agriculture 500.0
2 2011 36100 Agriculture -2100.0
3 2012 36100 Agriculture 0.0
4 2013 36800 Agriculture 700.0
... ... ... ... ...
5 2014 73300 Other_Service -2200.0
6 2015 77200 Other_Service 3900.0
7 2016 72400 Other_Service -4800.0
8 2017 83200 Other_Service 10800.0
9 2018 81800 Other_Service -1400.0
100 rows × 4 columns
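The same wide-to-long reshaping can be sketched more compactly with `melt` and a grouped `diff`. This is an alternative under stated assumptions (a two-industry, three-year excerpt of the table), not the code used to produce the output above:

```python
import pandas as pd

# Wide excerpt: one row per industry, one column per year
wide = pd.DataFrame(
    {2009: [37700, 27800], 2010: [38200, 27900], 2011: [36100, 26400]},
    index=pd.Index(["Agriculture", "ICT"], name="Industry"),
)

# Long format: one row per (industry, year)
long_df = wide.reset_index().melt(
    id_vars="Industry", var_name="Year", value_name="Workforce"
)
long_df = long_df.sort_values(["Industry", "Year"]).reset_index(drop=True)

# Year-on-year change within each industry; the first year is set to 0
long_df["Workforce_Change"] = (
    long_df.groupby("Industry")["Workforce"].diff().fillna(0)
)

print(long_df)
```

Grouping by industry before calling `diff` keeps the first year of one industry from being compared with the last year of the previous one.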
The dynamic scatter plot below shows the year-on-year change in workforce over the period. The ICT industry shows the largest single-year increase, in 2017 (+24.5k), followed by retail in 2015 (+20.4k). ICT also shows the largest single-year decrease, in 2018 (-27.4k), followed by retail in 2017 (-26.7k). Retail and ICT therefore show the largest workforce changes over the period.
In [27]:
fig = px.scatter(Final_df, x="Year", y="Workforce_Change", color="Industry",
                 log_x=True, size_max=60)
fig.update_layout(title='Scatter plot of change in workforce')
fig.show()
[Figure: Scatter plot of change in workforce; y-axis: Workforce_Change]
4. PCA/Correlation
Principal component analysis (PCA) is a dimensionality reduction method that projects the dataset onto a smaller set of variables. The code below fits a PCA with two components.
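What a two-component PCA computes can be sketched with NumPy alone: centre each column, take the singular value decomposition, and project onto the first two right singular vectors. The sketch assumes only the 2009, 2010 and 2018 columns; signs of the components may differ from those returned by a library implementation.

```python
import numpy as np

# Rows: the ten industries; columns: 2009, 2010, 2018 employment
X = np.array([
    [37700, 38200, 41100],     # Agriculture
    [156700, 149800, 165700],  # Production
    [96600, 93200, 101800],    # Construction
    [345400, 344500, 347600],  # Retail
    [27800, 27900, 31500],     # ICT
    [33800, 29800, 35500],     # Finance
    [13500, 14600, 25200],     # Real_Estate
    [144800, 145800, 187100],  # Professional_Service
    [415600, 418600, 434900],  # Public_Adminstration
    [64200, 68000, 81800],     # Other_Service
], dtype=float)

# Centre each column (the data is not scaled here)
Xc = X - X.mean(axis=0)

# Rows of Vt are the principal directions, ordered by explained variance
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Scores: coordinates of each industry on the first two components
scores = Xc @ Vt[:2].T

# Share of total variance carried by each component
explained = S**2 / np.sum(S**2)
print(scores.round(2))
print(explained.round(4))
```

Because the columns are not standardised, the first component is dominated by overall workforce size, which is why the largest industries end up at one end of PC1.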
In [28]:
from sklearn.decomposition import PCA
pca = PCA()
PCA_base=base_data3[[2009,2010,2011,2012,2013,2014,2015,2016,2017,2018]]
In [29]:
PCA_base.head()
Out[29]:
2009 2010 2011 2012 2013 2014 2015 2016 2017 2
Industry
Agriculture 37700 38200 36100 36100 36800 42700 40700 43200 40200 41
Production 156700 149800 158600 154400 164200 173300 172300 162500 165100 165
Construction 96600 93200 90000 91300 89300 97000 92600 102700 90800 101
Retail 345400 344500 343100 347300 345100 337300 357700 360200 333500 347
ICT 27800 27900 26400 27200 26900 35700 24000 34400 58900 31
In [30]:
pca.n_components = 2
X_reduced = pca.fit_transform(PCA_base)
df_X_reduced = pd.DataFrame(X_reduced,columns=['PC1','PC2'], index=PCA_base.index)
df_X_reduced=round(df_X_reduced,2)
In [31]:
df_X_reduced.head(10)
Out[31]:
PC1 PC2
Industry
Agriculture -312091.23 -8151.22
Production 76819.90 912.47
Construction -137358.52 -8786.80
Retail 658534.20 -15872.69
ICT -335201.17 10284.00
Finance -334432.79 -10293.86
Real_Estate -376204.46 -7409.96
Professional_Service 58629.49 33479.53
Public_Adminstration 903354.93 2773.17
Other_Service -202050.35 3065.36
In [32]:
corr = df_X_reduced.T.corr()
corr.style.background_gradient(cmap='coolwarm')
Out[32]:
Industry Agriculture Production Construction Retail ICT Finance Real_Estate
Industry
Agriculture 1 -1 1 -1 1 1
Production -1 1 -1 1 -1 -1 -
Construction 1 -1 1 -1 1 1
Retail -1 1 -1 1 -1 -1 -
ICT 1 -1 1 -1 1 1
Finance 1 -1 1 -1 1 1
Real_Estate 1 -1 1 -1 1 1
Professional_Service -1 1 -1 1 -1 -1 -
Public_Adminstration -1 1 -1 1 -1 -1 -
Other_Service 1 -1 1 -1 1 1
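Because each industry contributes only two values (PC1 and PC2) to this matrix, every pairwise correlation is exactly +1 or -1, so the matrix only splits the industries into two groups. Real estate, finance, agriculture, ICT, construction and other services have large negative scores on principal component 1; these are the industries with a smaller workforce. Production, public administration, retail and professional services have positive scores on component 1; these are the industries with a larger workforce.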
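The pattern of +1 and -1 entries can be reproduced from a few rows of the score table. A minimal sketch, assuming four industries and their PC1/PC2 scores as printed above:

```python
import numpy as np
import pandas as pd

# PC scores for four industries, copied from the table above
scores = pd.DataFrame(
    {
        "PC1": [-312091.23, 76819.90, 658534.20, -335201.17],
        "PC2": [-8151.22, 912.47, -15872.69, 10284.00],
    },
    index=["Agriculture", "Production", "Retail", "ICT"],
)

# Transposing makes each industry a column holding just two observations,
# so the Pearson correlation between any two industries can only be +1 or -1
corr = scores.T.corr()
print(corr.round(2))
```

The sign depends only on whether PC1 is larger or smaller than PC2 for each industry, so the matrix carries less information than the scatter plot of the scores.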
In [33]:
fig = px.scatter(df_X_reduced, x='PC1', y='PC2', color=df_X_reduced.index, hover_name=df_X_reduced.index)
fig.update_layout(title='Principal Component Analysis Scatterplot')
fig.show()
[Figure: Principal Component Analysis Scatterplot; y-axis: PC2]
4.2 Correlation for each industry over years
The correlation matrix below shows the correlation between industries across the years 2009 to 2018. The agriculture industry is highly correlated with the construction industry (0.727), and other services are also positively correlated with professional services and public administration. Retail and ICT show mostly weak linear relationships with the other industries, and a moderate negative correlation with each other (-0.552).
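Each entry of the matrix is the Pearson correlation between the yearly series of two industries. A minimal sketch for one pair, using the 2009 to 2018 values of agriculture and construction from the dataframe:

```python
import pandas as pd

years = list(range(2009, 2019))
agriculture = pd.Series(
    [37700, 38200, 36100, 36100, 36800, 42700, 40700, 43200, 40200, 41100],
    index=years,
)
construction = pd.Series(
    [96600, 93200, 90000, 91300, 89300, 97000, 92600, 102700, 90800, 101800],
    index=years,
)

# Pearson correlation of the two yearly series
r = agriculture.corr(construction)
print(round(r, 3))
```

The result can be compared with the Agriculture/Construction entry of the matrix. With only ten yearly observations per industry, correlations of this size should be read as indicative rather than conclusive.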
In [34]:
PCA_base.head()
Out[34]:
2009 2010 2011 2012 2013 2014 2015 2016 2017 2
Industry
Agriculture 37700 38200 36100 36100 36800 42700 40700 43200 40200 41
Production 156700 149800 158600 154400 164200 173300 172300 162500 165100 165
Construction 96600 93200 90000 91300 89300 97000 92600 102700 90800 101
Retail 345400 344500 343100 347300 345100 337300 357700 360200 333500 347
ICT 27800 27900 26400 27200 26900 35700 24000 34400 58900 31
In [35]:
import matplotlib.pyplot as plt
corr = round(PCA_base.T.corr(),3)
corr.style.background_gradient(cmap='coolwarm')
Out[35]:
Industry Agriculture Production Construction Retail ICT Finance Real_Es
Industry
Agriculture 1 0.647 0.727 0.228 0.378 -0.005 0
Production 0.647 1 0.188 0.028 0.232 0.225 0
Construction 0.727 0.188 1 0.414 0.01 0.309 0
Retail 0.228 0.028 0.414 1 -0.552 -0.253 0
ICT 0.378 0.232 0.01 -0.552 1 0.043 0
Finance -0.005 0.225 0.309 -0.253 0.043 1 0
Real_Estate 0.668 0.604 0.598 0.232 0.154 0.316
Professional_Service 0.637 0.56 0.441 0.046 0.503 0.389 0
Public_Adminstration 0.195 0.547 0.08 -0.258 0.122 0.59 0
Other_Service 0.333 0.578 -0.031 -0.156 0.543 0.242 0
5. Clustering (k means & hierarchical)
5.1 Using the employment data of the best and worst performing years (2.3), undertake a K-means clustering analysis (K=2 and K=3) and identify which industries cluster together.
Below is the K-means clustering table. K_2 holds the cluster labels for K=2 and K_3 the labels for K=3.
In [36]:
cluster_base=base_data3[[2010,2018]]
In [37]:
cluster_base.head()
Out[37]:
2010 2018
Industry
Agriculture 38200 41100
Production 149800 165700
Construction 93200 101800
Retail 344500 347600
ICT 27900 31500
In [38]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
cluster = KMeans(n_clusters=2)
predicted_2 = cluster.fit_predict(cluster_base)
cluster2 = KMeans(n_clusters=3)
predicted_3 = cluster2.fit_predict(cluster_base)
In [39]:
cluster_base['K_2']=predicted_2+1
cluster_base['K_3']=predicted_3+1
cluster_base['K_2']=cluster_base['K_2'].astype(str)
cluster_base['K_3']=cluster_base['K_3'].astype(str)
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: [Link]s/stable/user_guide/[Link]#returning-a-view-versus-a-copy
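The warning arises because new columns are assigned to a frame that was created by selecting columns from another frame. One common way to avoid it is to take an explicit copy of the selection first. A minimal sketch, assuming a two-row stand-in for the base table and placeholder labels:

```python
import pandas as pd

# Two-row stand-in for the base table
base = pd.DataFrame(
    {2010: [38200, 344500], 2018: [41100, 347600]},
    index=["Agriculture", "Retail"],
)

# .copy() makes the selection an independent frame, so adding columns
# to it is unambiguous and leaves the source frame untouched
cluster_base = base[[2010, 2018]].copy()
cluster_base["K_2"] = ["1", "2"]  # placeholder labels for illustration

print(cluster_base)
print(list(base.columns))
```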
In [40]:
cluster_base.head(10)
Out[40]:
2010 2018 K_2 K_3
Industry
Agriculture 38200 41100 1 2
Production 149800 165700 1 1
Construction 93200 101800 1 2
Retail 344500 347600 2 3
ICT 27900 31500 1 2
Finance 29800 35500 1 2
Real_Estate 14600 25200 1 2
Professional_Service 145800 187100 1 1
Public_Adminstration 418600 434900 2 3
Other_Service 68000 81800 1 2
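Cluster sizes can be tallied directly from the label columns of the table above. A minimal sketch that re-enters the K_2 and K_3 labels by hand; K-means label numbers are arbitrary and can change between runs unless a seed is fixed, so only the groupings are meaningful:

```python
import pandas as pd

industries = ["Agriculture", "Production", "Construction", "Retail", "ICT",
              "Finance", "Real_Estate", "Professional_Service",
              "Public_Adminstration", "Other_Service"]

# Labels copied from the clustering table above
labels = pd.DataFrame(
    {
        "K_2": ["1", "1", "1", "2", "1", "1", "1", "1", "2", "1"],
        "K_3": ["2", "1", "2", "3", "2", "2", "2", "1", "3", "2"],
    },
    index=industries,
)

# Number of industries in each cluster
k2_counts = labels["K_2"].value_counts().to_dict()
k3_counts = labels["K_3"].value_counts().to_dict()

# Industries that form the smaller cluster for K=2
small_k2 = labels.index[labels["K_2"] == "2"].tolist()

print(k2_counts, k3_counts, small_k2)
```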
Scatter plot of K-means clustering with K=2. The two-cluster visualisation below shows that most industries (eight) fall in cluster 1, while cluster 2 contains only two industries, Retail and Public_Adminstration, which have the largest workforces.
In [41]:
fig = px.scatter(cluster_base, x=2010, y=2018, color="K_2", hover_name=cluster_base.index)
fig.update_layout(title='Scatter plot of K means clustering with K=2')
fig.show()
[Figure: Scatter plot of K means clustering with K=2; y-axis: 2018]
Scatter plot of K-means clustering with K=3. The three-cluster visualisation below shows that most industries (six) fall in cluster 2, while cluster 1 (Production and Professional_Service) and cluster 3 (Retail and Public_Adminstration) each contain two industries.
In [42]:
fig = px.scatter(cluster_base, x=2010, y=2018, color="K_3", hover_name=cluster_base.index)
fig.update_layout(title='Scatter plot of K means clustering with K=3')
fig.show()
[Figure: Scatter plot of K means clustering with K=3; y-axis: 2018]
5.2 Hierarchical cluster
A dendrogram is the main output of hierarchical clustering and is used to choose an appropriate number of clusters. In the dendrogram produced by the code below, the vertical axis represents the Euclidean distance at which clusters are merged. The number of clusters is read off by drawing a horizontal line at a chosen distance and counting the vertical lines it crosses. Using this approach, we identified 6 clusters.
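Cutting the dendrogram can also be done numerically. A minimal sketch using SciPy's `linkage` and `fcluster` on the 2010 and 2018 columns; asking for at most 6 clusters mirrors the choice made in this section and is not derived automatically from the data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Rows: the ten industries; columns: 2010 and 2018 employment
X = np.array([
    [38200, 41100],    # Agriculture
    [149800, 165700],  # Production
    [93200, 101800],   # Construction
    [344500, 347600],  # Retail
    [27900, 31500],    # ICT
    [29800, 35500],    # Finance
    [14600, 25200],    # Real_Estate
    [145800, 187100],  # Professional_Service
    [418600, 434900],  # Public_Adminstration
    [68000, 81800],    # Other_Service
], dtype=float)

# Ward linkage on Euclidean distances; each row of Z records one merge
Z = linkage(X, method="ward")

# Heights at which clusters are merged (the vertical axis of the dendrogram)
merge_heights = Z[:, 2]

# Cut the tree so that no more than 6 flat clusters remain
labels = fcluster(Z, t=6, criterion="maxclust")

print(merge_heights.round(0))
print(labels)
```

Large gaps between consecutive merge heights suggest natural places to cut the tree, which is the numerical counterpart of drawing a horizontal line across the dendrogram.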
In [43]:
import scipy.cluster.hierarchy as sch
# Build the dendrogram. linkage() runs the hierarchical clustering algorithm
# on the given data; here that is the 2010 and 2018 employment columns.
dendrogram = sch.dendrogram(sch.linkage(base_data3[[2010,2018]], method = "ward"))
plt.title('Dendrogram')
plt.xlabel('Industries')
plt.ylabel('Euclidean distances')
plt.show()
In [44]:
from sklearn.cluster import AgglomerativeClustering
# Ward linkage always uses Euclidean distance, so no distance argument is passed
# (the `affinity` keyword is not accepted by newer scikit-learn releases).
hc = AgglomerativeClustering(n_clusters = 6, linkage ='ward')
# Fit the hierarchical clustering algorithm to the data and return, for each
# industry, the cluster it belongs to.
y_hc=hc.fit_predict(base_data3[[2010,2018]])
In [45]:
cluster_base['Hierarchical_clustering']=y_hc
cluster_base['Hierarchical_clustering']=cluster_base['Hierarchical_clustering']+1
cluster_base['Hierarchical_clustering']=cluster_base['Hierarchical_clustering'].astype(str)
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: [Link]s/stable/user_guide/[Link]#returning-a-view-versus-a-copy
The scatter plot below is created using the 6 hierarchical clusters.
In [46]:
fig = px.scatter(cluster_base, x=2010, y=2018, color="Hierarchical_clustering", hover_name=cluster_base.index)
fig.update_layout(title='Scatter plot of Hierarchical clustering with K=6')
fig.show()
[Figure: Scatter plot of Hierarchical clustering with K=6; y-axis: 2018]
K-means forms clusters with a predetermined number of clusters. Here we identified the industry clusters for the best and worst performing years of employment with k = 2 and k = 3. Hierarchical clustering, as the name suggests, builds a hierarchy of clusters; here it produced k = 6 industry clusters for the same two years.
6. Discussion
Provide a brief discussion (~ 300 words) on employment landscape of Wales based on the employment data
analysis results.
From the report it can be observed that employment in Wales is highest in public administration services, followed by retail, production and professional services, while the smallest workforce is in real estate, which nevertheless shows the highest percentage growth in employment over the period. Although the retail workforce is the second highest, this industry has the lowest percentage growth rate over the period. Looking at total Wales employment by year, 2018 shows the highest employment, with real estate showing 38% growth relative to 2017 while ICT shows negative growth. In 2010, Wales shows the lowest total workforce, with an average negative (-2%) growth.
From the correlation matrix, it can be observed that the agriculture industry is highly correlated with the construction industry, and other services are also positively correlated with professional services and public administration. Retail and ICT show mostly weak linear relationships with the other industries, and a moderate negative correlation with each other.