© July 2020 | IJIRT | Volume 7 Issue 2 | ISSN: 2349-6002
CUSTOMER SEGMENTATION USING MACHINE
LEARNING
Aman Banduni, Prof Ilavendhan A.
School of Computing Science & Engineering, Galgotias University, Greater Noida, U.P.
Abstract- The emergence of many competitors and entrepreneurs has caused a lot of tension among competing businesses to find new buyers and keep the old ones. As a result of the former, the need for exceptional customer service becomes pertinent regardless of the size of the business.[2] Furthermore, the ability of any business to understand the needs of each of its customers will provide greater customer support in providing targeted customer services and developing customized customer service plans. This understanding is possible through structured customer segmentation. Each segment has customers who share the same market features.[5] Big data ideas and machine learning have promoted greater acceptance of automated customer segmentation approaches over traditional market analytics, which often do not work when the customer base is very large. In this paper, the k-means clustering algorithm is used for this purpose.[8] The k-means algorithm was implemented with the sklearn library (see the Appendix), and the program is trained using a 100-pattern two-factor dataset derived from the retail trade. The two features are the average number of customer purchases and the average number of monthly customers.

Index Terms- data mining; machine learning; big data; customer segment; k-means algorithm; sklearn; extrapolation

I. INTRODUCTION
Over the years, increased competition among businesses and the availability of large-scale historical data have resulted in widespread use of data mining techniques to find critical and strategic information that is hidden in organizations' data.[1] Data mining is the process of extracting meaningful information from a dataset and presenting it in a human-accessible manner for decision support. Data mining techniques draw on fields such as statistics, artificial intelligence, machine learning, and database systems. Data mining applications include, but are not limited to, bioinformatics, weather forecasting, fraud detection, financial analysis and customer segmentation. The aim of this paper is to identify customer segments in a commercial business using a data mining method.

Customer segmentation is the division of a business's customer base into groups, called customer segments, such that each customer segment has customers who share the same market characteristics.[5] These differences are based on factors that directly or indirectly affect the market or business, such as product preferences or expectations, location, behavior and so on. The importance of customer segmentation includes, inter alia: the ability of a business to customize market plans that would be appropriate for each segment of its customers;[6] support for business decisions in risky environments, such as credit relationships with its customers; identification of the products associated with individual segments and of how to manage demand and supply; the revealing of interdependencies and interactions between consumers, between products, or between customers and products, which the business may not be aware of; and the ability to predict customer attrition and which customers are likely to have problems, and to raise other market research questions and provide clues to finding solutions.

Clustering has proved effective for detecting subtle patterns or relationships buried in a database. This mode of learning is classified under unsupervised learning. Clustering algorithms include the k-means algorithm, the k-nearest neighbor algorithm, the self-organizing map (SOM), and more.[4] These algorithms, without prior knowledge of the data, are able to identify groups in them by repeatedly comparing input patterns until a stable grouping of the training examples is achieved, subject to the clustering criterion or process. Each cluster has data points that have very close similarities but differ greatly from the data points of other clusters. Clustering has great applications in pattern recognition, image analysis, bioinformatics and so on.[15] In this paper the k-means clustering algorithm was applied to customer segmentation. An sklearn program (Appendix) of the k-means algorithm was developed, and training was run, using a standard silhouette score, on a two-feature dataset of 100 training patterns from the retail trade. After several
IJIRT 149990 INTERNATIONAL JOURNAL OF INNOVATIVE RESEARCH IN TECHNOLOGY 116
iterations, four stable clusters or customer segments were identified. The two features considered in the clustering are the number of items a customer purchases per month and the average number of customers per month. From the dataset, four customer clusters or segments are identified and labeled as follows: cluster_metrics_1, cluster_metrics_2, cluster_metrics_3, cluster_metrics_4.

II. LITERATURE SURVEY

A. Customer Classification
Over the years, the commercial world has become more competitive, as organizations have to meet the needs and desires of their customers, attract new customers, and thus improve their businesses.[6] The task of identifying and meeting the needs and requirements of every customer in the business is very difficult. This is because customers can vary according to their needs, wants, demographics, size, tastes and preferences, features, etc. As it is, it is a bad practice to treat all customers equally in business. This challenge has led to the adoption of the concept of customer segmentation or market segmentation, where consumers are divided into subgroups or segments, and members of each subgroup exhibit similar market behaviors or characteristics.[9] Accordingly, customer segmentation is the process of dividing the market into homogeneous groups.

B. Big Data
Recently, Big Data research has gained momentum. Big data is a term that describes large volumes of structured and unstructured data, which cannot be analyzed using traditional methods and algorithms. Companies hold billions of data points about their customers, suppliers, and operations, and millions of networked sensors are embedded in real-world devices such as mobile phones and cars, sensing, creating and communicating data.[10] Big data offers the ability to improve forecasting, save money, increase efficiency and improve various areas such as traffic control, weather forecasting, disaster prevention, finance, fraud control, business transactions, national security, education and healthcare. Big data is mainly characterized by three Vs: volume, variety, and velocity. Two other Vs are also used, veracity and value, thus making it 5V.

C. Data Collection
Data collection is the process of collecting and measuring information on targeted variables in an established system, which enables one to answer relevant questions and evaluate the results.[12] Data collection is part of research in all fields of study, including the physical and social sciences, humanities and business. The purpose of all data collection is to obtain quality evidence that leads the analysis to construct concrete and credible answers to the questions presented. We collected data from the UCI Machine Learning Repository.

D. Clustering data
Clustering is the process of grouping the information in a dataset based on some commonalities. There are several algorithms which can be applied to datasets based on the provided condition.[7] However, no universal clustering algorithm exists, hence it becomes important to choose the appropriate clustering technique. In this paper, we have implemented three clustering algorithms using the Python sklearn library.

E. K-means
The k-means algorithm is one of the most popular clustering algorithms. This clustering algorithm relies on centroids: each data point is placed in one of K non-overlapping clusters, where K is fixed in advance in the algorithm. Clusters are created that correspond to hidden patterns in the data and provide the information needed to help the decision-making process. There are many ways to choose K for k-means clustering; we will use the elbow method.

III. METHODOLOGY
The data used in this paper were collected from the UCI Machine Learning Repository. It is a transnational dataset that includes all transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many of the company's customers are wholesalers.[10] The dataset has 8 attributes. These attributes are:
InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
Description: Product (item) name. Nominal.
Quantity: The quantity of each product (item) per transaction. Numeric.
InvoiceDate: Invoice date and time. Numeric, the date and time of each transaction.
UnitPrice: Unit price. Numeric, the product price per unit.
CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
Country: Country name. Nominal, the name of the country where each customer resides.

In this paper several steps were taken to obtain an accurate result. These include a feature normalization phase together with the centroid initialization, assignment and update phases, which are the usual phases of the k-means algorithm.

A. Feature normalization
This is a data preparation phase. Feature normalization helps to bring all data items to a common scale to improve the performance of clustering algorithms.[12] Each data point is rescaled to lie in the range -2 to +2. Normalization techniques include min-max, decimal scaling, and z-score; the standard z-score technique is used to normalize the features before the k-means algorithm is applied to the dataset.

B. Methods of customer classification
There are many ways to segment, which vary in rigor, data requirements, and purpose. The following are some of the most commonly used methods, but this is not an exhaustive list.[13] There are papers that discuss artificial neural networks, particle determination and complex types of ensemble, but they are not included here due to limited exposure. In future articles, I may go into some of these options, but for now, these general methods should suffice.
Each subsequent section of this article will include a basic description of the method, as well as a code example for the method used. If you are not technically inclined, just skip the code; you should still get a good handle on each of the 4 sub-sections included in this article.[14]

C. Cluster analysis
Cluster analysis is a clustering, or grouping, approach to consumers based on their similarity. There are 2 main types of cluster analysis in marketing strategy: hierarchical cluster analysis, and partitioning (Miller, 2015). For now, we will discuss the partitioning method called k-means.

D. K-means clustering
The k-means clustering algorithm is an algorithm often used to draw insights into the patterns and differences within a dataset.[13] In marketing, it is often used to build customer segments and understand the behavior of these unique segments. Let's try to build a clustering model in Python.

E. Centroid initialization
The initial centroids were selected first. Figure 1 presents the initialization of the cluster centroids. The four chosen centroids, shown in different sizes, were selected using Forgy's method. In Forgy's method, k data points (k = 4 in this case) are randomly selected as the cluster centroids.

Technical introduction: -
The code below was created in a Jupyter notebook using Python 3.x and some Python packages for editing, processing, analyzing, and visualizing information.[11]
Most of the code below comes from the GitHub repository of a book called Hands-on Data Science for Marketing. The book is available on Amazon, or on O'Reilly if you are a subscriber.
The open-source dataset used in the following code comes from the UCI Machine Learning Repository.

IV. PROPOSED MODEL

A) Import packages and data:
To begin, we import the necessary packages to do our analysis and then the xlsx (Excel spreadsheet) data file.[12] If you want to follow along with the same data, you have to download it from UCI. For this example, I place the xlsx file in the folder (directory) where I run the Jupyter notebook.

B) Data cleaning:
After importing the packages and data, we will see that the data is not very usable as it stands, so we need to clean and organize it in a way that lets us create more actionable insights.

C) Normalize the data:
K-means, like other clustering algorithms, is sensitive to the scale of the data used, so we need to normalize the data.[15]
A screenshot of the StackExchange answer below discusses why standardization or normalization is necessary for data used in K-means clustering. The screenshot is linked to the StackExchange question, so you can click on it and read the entirety of the discussion if you want more information.[10]

Fig. 1 Standardisation and Normalisation

D) Select the optimal number of groups:
Okay, we are ready to run cluster analysis. But first, we need to find out how many groups we want to use. There are several approaches to selecting the number of groups to use, but I am going to cover two in this article: (1) the silhouette coefficient, and (2) the elbow method.[7]
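Steps A) to C) can be made concrete with a small sketch. The version below is a minimal, library-free illustration, not the notebook code this article refers to. The sample rows, the cleaning rules (dropping cancellations, rows without a customer number, and non-positive quantities) and the helper names clean, customer_metrics and zscore_columns are all assumptions made for illustration. The per-customer columns mirror the TotalSales, OrderCount and AvgOrderValue metrics used in the later figures.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Hypothetical sample rows: (InvoiceNo, CustomerID, Quantity, UnitPrice).
rows = [
    ("536365", "17850", 6, 2.55),
    ("536366", "17850", 2, 3.39),
    ("C536379", "14527", -1, 27.50),  # cancellation: invoice starts with 'C'
    ("536367", "13047", 32, 1.69),
    ("536368", "13047", 6, 4.25),
    ("536369", None, 3, 5.95),        # missing customer number
    ("536370", "12583", 24, 3.75),
]


def clean(rows):
    """Assumed cleaning rules: drop cancellations, rows with no customer,
    and rows with a non-positive quantity."""
    return [
        r for r in rows
        if r[1] is not None and not r[0].upper().startswith("C") and r[2] > 0
    ]


def customer_metrics(rows):
    """Aggregate (TotalSales, OrderCount, AvgOrderValue) per customer."""
    sales = defaultdict(float)
    invoices = defaultdict(set)
    for invoice, customer, qty, price in rows:
        sales[customer] += qty * price
        invoices[customer].add(invoice)
    return {
        c: (sales[c], len(invoices[c]), sales[c] / len(invoices[c]))
        for c in sales
    }


def zscore_columns(table):
    """Standardize each column to mean 0 and population std 1 (z-score)."""
    stats = [(mean(col), pstdev(col)) for col in zip(*table)]
    return [
        tuple((v - m) / s if s else 0.0 for v, (m, s) in zip(row, stats))
        for row in table
    ]


metrics = customer_metrics(clean(rows))
customers = sorted(metrics)
normalized = zscore_columns([metrics[c] for c in customers])
```

After this step every column of normalized has mean 0 and unit standard deviation, so no single metric dominates the distance computation in k-means.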
E) Silhouette (clustering):
The silhouette refers to a method of interpreting and validating consistency within clusters of data. The method provides a graphical representation of how well each object has been classified. [1]
The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from -1 to +1, where a high value indicates that an object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, the clustering configuration is appropriate. If many points have a low or negative value, the clustering configuration may have too many or too few clusters.
The silhouette can be calculated with any distance metric, such as the Euclidean distance or the Manhattan distance.
Now that we know a bit more about silhouettes, we use code to find the right number of groups.

Fig. 2 Silhouette Score

Four clusters had the best silhouette score, indicating that 4 may be the best number of clusters. But we'll double-check with the elbow method.

F) Elbow criterion method (with the sum of squared errors) (SSE):
The idea behind the elbow method is to run k-means clustering on the given data for a range of values of k (num_clusters, e.g. k = 1 to 10), and for each value of k, calculate the sum of squared errors (SSE).
Then, plot a line chart of the SSE for each value of k. If the line chart looks like an arm, with a bend in the form of an angle (marked by a red circle on the line in the graph), the "elbow" on the arm is the best value of k (the number of clusters).[6] Here, we want to minimize SSE. SSE tends to fall toward 0 as we increase k (and SSE is 0 when k is equal to the number of data points, because then each data point is its own cluster, and there is no error between it and the center of its cluster).
The objective is therefore to select a small value of k that still has a low SSE, and the elbow usually represents the point where increasing k begins to give diminishing returns.
Well, with a proper understanding of the elbow method in hand, let's use it to see if it agrees with our previous result suggesting 4 clusters.

Fig. 3 Elbow Graph exported from my working Jupyter notebook

Based on the graph above, it looks like K = 4, or 4 clusters, is the correct number of clusters for this analysis. Now let's interpret the customer segments provided by these clusters.

G) Interpreting customer segments

Fig. 4 Customer table

Now we look at the cluster metrics and see what we can gather from the normalized data for each cluster.

Fig. 5 Clusters

Next, we visualize the clusters by plotting different columns on the x and y axes. Let's see what we get.
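To make sections E) and F) concrete, here is a minimal sketch of both selection criteria in plain Python. It is not the sklearn-based notebook code; the toy points, the number of restarts and the helper names kmeans, sse, silhouette and best_of are assumptions made for illustration. The kmeans function uses Forgy initialization followed by the assignment and update phases described in the methodology.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means with Forgy initialization (k random data points)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Forgy: pick k distinct data points
    labels = None
    for _ in range(max_iter):
        # Assignment phase: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update phase: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


def sse(points, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def silhouette(points, labels):
    """Mean silhouette coefficient using the Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two non-empty clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)  # cohesion
        b = min(  # separation: mean distance to the nearest other cluster
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items() if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def best_of(points, k, n_init=10):
    """Keep the lowest-SSE run out of n_init Forgy restarts."""
    runs = (kmeans(points, k, seed=s) for s in range(n_init))
    return min(runs, key=lambda run: sse(points, *run))


# Toy data (an assumption): four tight, well-separated groups in 2-D.
offsets = (-0.5, 0.0, 0.5)
points = [
    (cx + dx, cy + dy)
    for cx, cy in [(0, 0), (10, 0), (0, 10), (10, 10)]
    for dx in offsets for dy in offsets
]

for k in range(2, 8):
    labels, centroids = best_of(points, k)
    print(k, round(sse(points, labels, centroids), 2),
          round(silhouette(points, labels), 3))
```

For each k the loop prints the SSE, which is what the elbow graph plots, and the mean silhouette. Several restarts are used because a single Forgy initialization can settle in a poor local optimum. In sklearn, the KMeans estimator (through its inertia_ attribute) and the silhouette_score function provide the same two quantities.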
Fig. 6 TotalSales vs OrderCount Clusters

Green customers have the lowest total sales and lowest order count, meaning they are the lowest-value customers. On the other hand, orange customers have the highest total sales and highest order count, indicating that they are the highest-value customers.

Fig. 7 AvgOrderValue vs OrderCount Clusters

In this plot, we consider the average order value versus the order count. Once again, green customers have the lowest value and orange customers have the highest value.
You can look at it this way. You can target customers in the red cluster and try to find ways to increase their order count via email reminders or SMS notifications targeted on some other identifying feature. Maybe you can give them a discount when they come back within 30 days. Perhaps you can provide a delayed coupon (to be used at a later date) at checkout.
Similarly, with customers who are in the blue segment, you may want to try other sales and marketing strategies at the cart, possibly a quick offer based on market basket analysis (see section on market basket analysis below).

Fig. 8 AvgOrderValue vs TotalSales Clusters

In this plot, we have the average order value compared to the total sales. This plot also reinforces the previous 2 plots in identifying the orange group as the highest-value customers, green as the lowest-value customers, and blue and red as the high-opportunity customers.
From a growth perspective, I would focus my attention on the blue and red clusters. I would try to better understand each cluster and its detailed behavior on the site, to decide which cluster to focus on first and to introduce some test cycles.

H) Best-selling item by segment
We know that we have 4 segments, and we know how much they spend on each purchase, their total spend and the number of their orders. The next thing we can do to better understand the customer segments is to look at which items sell best in each segment.

Fig. 9 StockCode

V. RESULT
Here, the result suggests that the orange cluster contains the highest-value customers, green the lowest-value customers, and blue and red the high-opportunity customers.
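The per-segment best-seller lookup described in section H) can be sketched as follows. This is a minimal illustration under stated assumptions: the segment labels, the order lines and the helper name best_sellers are hypothetical, and items are ranked by total quantity sold, which may differ from the ranking used to produce the figures.

```python
from collections import Counter, defaultdict

# Hypothetical inputs: a cluster label per customer, and order lines
# given as (CustomerID, Description, Quantity).
segment_of = {"17850": "orange", "13047": "orange", "12583": "green"}
lines = [
    ("17850", "JUMBO BAG RED RETROSPOT", 10),
    ("17850", "WHITE HANGING HEART T-LIGHT HOLDER", 6),
    ("13047", "JUMBO BAG RED RETROSPOT", 4),
    ("12583", "REGENCY CAKESTAND 3 TIER", 3),
    ("12583", "JUMBO BAG RED RETROSPOT", 1),
]


def best_sellers(lines, segment_of, top_n=1):
    """Rank items by total quantity sold within each segment."""
    totals = defaultdict(Counter)
    for customer, item, qty in lines:
        segment = segment_of.get(customer)
        if segment is not None:  # skip customers without a segment label
            totals[segment][item] += qty
    return {seg: counts.most_common(top_n) for seg, counts in totals.items()}


top = best_sellers(lines, segment_of)
```

With these sample lines, top["orange"] is [("JUMBO BAG RED RETROSPOT", 14)]. Passing a larger top_n returns a short ranked list per segment, which is the kind of input a recommendation for other customers in the same segment would need.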
Fig. 10 AvgOrderValue vs TotalSales Clusters

The result also shows that the Jumbo Bag Red Retrospot is the best-selling item.

Fig. 11 StockCode

VI. CONCLUSION

As our dataset was unbalanced, in this paper we opted for internal clustering validation rather than external clustering validation, which relies on some external data such as labels. Internal cluster validation can be used to choose the clustering algorithm that best suits the dataset and can correctly cluster the data.

Customer segmentation can have a positive impact on business if done properly.

So we can give customers in the orange cluster special discounts or gift vouchers to retain them for a long time, and we can give discounts to customers in the blue and red clusters and advertise highly sold items to attract them. For the lower-value customers in the green cluster, we can organize feedback columns to find out what we can change to attract them.

Based on the above information, we now know that the Jumbo Bag Red Retrospot is the best-selling item for our highest-value segment. With that information available, we can make recommendations for other potential customers in this segment.

REFERENCES
[1] Blanchard, Tommy. Bhatnagar, Pranshu. Behera, Debasish. (2019). Data Science for Marketing Analytics: Achieve your marketing goals with the data analytics power of Python. S.l: Packt Publishing Limited.
[2] Griva, A., Bardaki, C., Pramatari, K., Papakiriakopoulos, D. (2018). Retail business analytics: Customer visit segmentation using market basket data. Expert Systems with Applications, 100, 1-16.
[3] Hong, T., Kim, E. (2011). Segmenting customers in online stores based on factors that affect the customer's intention to purchase. Expert Systems with Applications, 39 (2), 2127-2131.
[4] Hwang, Y. H. (2019). Hands-On Data Science for Marketing: Improve your marketing strategies with machine learning using Python and R. S.l: Packt Publishing Limited.
[5] Puwanenthiren Premkanth. "Market Segmentation and Its Impact on Customer Satisfaction with Special Reference to Commercial Bank of Ceylon PLC." Global Journal of Management and Business Research. Publisher: Global Journals Inc. (USA). 2012. Print ISSN: 0975-5853. Volume 12 Issue 1.
[6] Puwanenthiren Premkanth. "Market Segmentation and Its Impact on Customer Satisfaction with Special Reference to Commercial Bank of Ceylon PLC." Global Journal of Management and Business Research. Publisher: Global Journals Inc. (USA). 2012. Print ISSN: 0975-5853. Volume 12 Issue 1.
[7] Sulekha Goyat. "The basis of market segmentation: a critical review of the literature." European Journal of Business and Management, www.iiste.org. 2011. ISSN 2222-1905 (Paper), ISSN 2222-2839 (Online). Vol 3, No.9, 2011.
[8] Jerry W. Thomas. 2007. Accessed at: www.decisionanalyst.com on July 12, 2015.
[9] T. Nelson Gnanaraj, Dr. K. Ramesh Kumar, N. Monica. AnuManufactured cluster analysis using a new algorithm from structured and unstructured data. International Journal of Advances in Computer Science and Technology. 2007. Volume 3, No.2.
[10] McKinsey Global Institute. Big data: The next frontier for innovation, competition, and productivity. 2011. Accessed at: www.mckinsey.com/mgi on July 14, 2015.
[11] Jean Yan. "Big Data, Bigger Opportunities - Data.gov's roles: Promote, lead, contribute, and collaborate in the era of big data." 2013. Retrieved from http://www.meritalk.com/pdfs/bdx/bdx-whitepaper-090413.pdf on July 14, 2015.
[12] A.K. Jain, M.N. Murty and P.J. Flynn. "Data Clustering: A Review". ACM Computing Surveys. 1999. Vol. 31, No. 3.
[13] Vaishali R. Patel and Rupa G. Mehta. "Impact of Outlier Removal and Normalization Approach in Modified k-Means Clustering Algorithm". IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 5, No 2, September 2011. ISSN (Online): 1694-0814.
[14] Jayant Tikmani, Sudhanshu Tiwari, Sujata Khedkar. "Telecom Customer Classification Based on Group Analysis of K-methods", JIRCCE, Year: 2015.
[15] Vaishali R. Patel and Rupa G. Mehta. "Impact of Outlier Removal and Normalization Approach in Modified k-Means Clustering Algorithm", IJCSI, Year: 2011.