K-Means Clustering with PySpark

The document outlines a process for clustering customer data using K-Means in PySpark. It includes steps for reading data, assembling features, scaling the data, applying K-Means clustering, and evaluating the model using silhouette scores. Finally, it visualizes the silhouette scores to determine the optimal number of clusters.

In [ ]:

from pyspark.sql import SparkSession

# Start a Spark session and read the customer data from CSV,
# inferring column types from the file
spark = SparkSession.builder.appName('Clustering using K-Means').getOrCreate()
data_customer = spark.read.csv('prodintdb.csv', header=True, inferSchema=True)
data_customer.printSchema()

In [ ]:

from pyspark.ml.feature import VectorAssembler

# List the available columns, then combine the numeric ones
# into a single 'features' vector for clustering
data_customer.columns
assemble = VectorAssembler(
    inputCols=['PDPcountperday', 'CheckoutHistory', 'Booked Revnue',
               'Brandname', 'Styletype'],
    outputCol='features')
assembled_data = assemble.transform(data_customer)
assembled_data.show(2)

In [ ]:

from pyspark.ml.feature import StandardScaler

# Standardize the feature vector so no single column
# dominates the distance calculations
scale = StandardScaler(inputCol='features', outputCol='standardized')
data_scale = scale.fit(assembled_data)
data_scale_output = data_scale.transform(assembled_data)
data_scale_output.show(2)

In [ ]:

from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

# Fit K-Means for k = 2..9 and record the silhouette score for each k
silhouette_score = []
evaluator = ClusteringEvaluator(predictionCol='prediction', featuresCol='standardized',
                                metricName='silhouette', distanceMeasure='squaredEuclidean')
for i in range(2, 10):
    KMeans_algo = KMeans(featuresCol='standardized', k=i)
    KMeans_fit = KMeans_algo.fit(data_scale_output)
    output = KMeans_fit.transform(data_scale_output)
    score = evaluator.evaluate(output)
    silhouette_score.append(score)
    print("Silhouette Score:", score)

In [ ]:

# Visualizing the silhouette scores in a plot
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1, figsize=(8, 6))
ax.plot(range(2, 10), silhouette_score)
ax.set_xlabel('k')
ax.set_ylabel('silhouette score')
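
The k with the highest silhouette score from the plot is the natural choice for the final model. The cell below is a minimal sketch of that last step; it assumes the silhouette_score list and data_scale_output DataFrame built above, and the names best_k, final_model, and clustered are illustrative rather than part of the original notebook.

In [ ]:

# A minimal sketch: pick the k with the highest silhouette score
# from the loop above and fit the final model with it.
best_k = range(2, 10)[silhouette_score.index(max(silhouette_score))]
final_model = KMeans(featuresCol='standardized', k=best_k).fit(data_scale_output)
clustered = final_model.transform(data_scale_output)

# Inspect how many customers fall into each cluster, and the cluster
# centres in the standardized feature space
clustered.groupBy('prediction').count().show()
print(final_model.clusterCenters())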
