Running RFM in Python
Importing Required Libraries
#import modules
import pandas as pd # for dataframes
import matplotlib.pyplot as plt # for plotting graphs
import seaborn as sns # for plotting graphs
import datetime as dt
Loading Dataset
data = pd.read_excel(r"C:\Users\siva\Desktop\Online_Retail.xlsx")  # raw string avoids backslash escapes in the Windows path
data.head()
data.tail()
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo 541909 non-null object
StockCode 541909 non-null object
Description 540455 non-null object
Quantity 541909 non-null int64
InvoiceDate 541909 non-null datetime64[ns]
UnitPrice 541909 non-null float64
CustomerID 406829 non-null float64
Country 541909 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB
This material is not original work. This compilation draws heavily from various sources
Removing Missing Values
Rows without a CustomerID cannot be tied to a customer, so drop them:
data = data[pd.notnull(data['CustomerID'])]
Removing Duplicates
Sometimes you get a messy dataset. You may have to deal with duplicates, which will skew your
analysis. In Python, pandas offers the drop_duplicates() function, which drops repeated or
duplicate records.
filtered_data=data[['Country','CustomerID']].drop_duplicates()
filtered_data.Country.value_counts()
United Kingdom 3950
Germany 95
France 87
Spain 31
Belgium 25
Switzerland 21
Portugal 19
Italy 15
Finland 12
Austria 11
Norway 10
Denmark 9
Netherlands 9
Australia 9
Channel Islands 9
Sweden 8
Japan 8
Cyprus 8
Poland 6
Unspecified 4
Canada 4
Israel 4
Greece 4
USA 4
EIRE 3
Bahrain 2
United Arab Emirates 2
Malta 2
Lithuania 1
Singapore 1
Iceland 1
Lebanon 1
RSA 1
Saudi Arabia 1
Czech Republic 1
Brazil 1
European Community 1
filtered_data.Country.value_counts()[:10].plot(kind='bar')
filtered_data.Country.value_counts()[:5].plot(kind='bar')
Filtering Data for United Kingdom Customers
uk_data=data[data.Country=='United Kingdom']
The describe() function in pandas is convenient for getting various summary statistics. It
returns the count, mean, standard deviation, minimum and maximum values, and the
quantiles of the data.
uk_data.describe()
Quantity UnitPrice CustomerID
count 361878.000000 361878.000000 361878.000000
mean 11.077029 3.256007 15547.871368
std 263.129266 70.654731 1594.402590
min -80995.000000 0.000000 12346.000000
25% 2.000000 1.250000 14194.000000
50% 4.000000 1.950000 15514.000000
75% 12.000000 3.750000 16931.000000
max 80995.000000 38970.000000 18287.000000
Remove rows with a negative Quantity:
uk_data = uk_data[(uk_data['Quantity']>0)]
uk_data.describe()
Filter Required Columns
Here, you can filter the columns necessary for RFM analysis. You only need five columns:
CustomerID, InvoiceDate, InvoiceNo, Quantity, and UnitPrice. CustomerID uniquely identifies
each customer, InvoiceDate helps you calculate the recency of purchase, and InvoiceNo helps
you count the number of transactions performed (frequency). The Quantity purchased in each
transaction and the UnitPrice of each unit purchased by the customer will help you calculate
the total purchase amount (monetary value).
uk_data = uk_data[['CustomerID','InvoiceDate','InvoiceNo','Quantity','UnitPrice']]
uk_data['TotalPrice'] = uk_data['Quantity'] * uk_data['UnitPrice']
uk_data['InvoiceDate'].min(),uk_data['InvoiceDate'].max()
(Timestamp('2010-12-01 08:26:00'), Timestamp('2011-12-09 12:49:00'))
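The snapshot date PRESENT is set one day after the last invoice in the dataset, so recency is the whole-day gap between it and a customer's last purchase. A minimal sketch using the maximum invoice date shown above:

```python
import datetime as dt
import pandas as pd

# Snapshot date: one day after the last invoice (2011-12-09),
# so the most recent purchases get a recency of 0 days.
PRESENT = dt.datetime(2011, 12, 10)

last_purchase = pd.Timestamp('2011-12-09 12:49:00')
recency_days = (PRESENT - last_purchase).days
print(recency_days)  # 0
```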
PRESENT = dt.datetime(2011,12,10)
uk_data['InvoiceDate'] = pd.to_datetime(uk_data['InvoiceDate'])
uk_data.head()
CustomerID InvoiceDate InvoiceNo Quantity UnitPrice TotalPrice
0 17850.0 2010-12-01 08:26:00 536365 6 2.55 15.30
1 17850.0 2010-12-01 08:26:00 536365 6 3.39 20.34
2 17850.0 2010-12-01 08:26:00 536365 8 2.75 22.00
3 17850.0 2010-12-01 08:26:00 536365 6 3.39 20.34
4 17850.0 2010-12-01 08:26:00 536365 6 3.39 20.34
RFM Analysis
Here, you are going to perform the following operations:
For Recency, calculate the number of days between the present date and the date of each
customer's last purchase.
For Frequency, calculate the number of orders for each customer.
For Monetary, calculate the sum of purchase prices for each customer.
rfm = uk_data.groupby('CustomerID').agg({
    'InvoiceDate': lambda date: (PRESENT - date.max()).days,
    'InvoiceNo': lambda num: len(num),
    'TotalPrice': lambda price: price.sum()})
rfm.columns
Index(['InvoiceDate', 'InvoiceNo', 'TotalPrice'], dtype='object')
# Change the name of columns
rfm.columns=['recency','frequency','monetary']
rfm['recency'] = rfm['recency'].astype(int)
rfm.head()
recency frequency monetary
CustomerID
12346.0 325 1 77183.60
12747.0 2 103 4196.01
12748.0 0 4596 33719.73
12749.0 3 199 4090.88
12820.0 3 59 942.34
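The same table can also be built with pandas named aggregation, which replaces the count and sum lambdas with built-in reducers and names the output columns directly. A self-contained sketch on a tiny stand-in frame (the values below are hypothetical, only the column names match the real uk_data):

```python
import datetime as dt
import pandas as pd

PRESENT = dt.datetime(2011, 12, 10)

# Tiny stand-in for uk_data (hypothetical values, same column names).
uk_data = pd.DataFrame({
    'CustomerID': [1, 1, 2],
    'InvoiceDate': pd.to_datetime(['2011-12-01', '2011-12-05', '2011-11-01']),
    'InvoiceNo': ['A1', 'A2', 'B1'],
    'TotalPrice': [10.0, 20.0, 5.0],
})

# Named aggregation: each output column gets (source column, reducer).
rfm = uk_data.groupby('CustomerID').agg(
    recency=('InvoiceDate', lambda d: (PRESENT - d.max()).days),
    frequency=('InvoiceNo', 'count'),
    monetary=('TotalPrice', 'sum'),
)
print(rfm)
```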
Computing Quantile of RFM values
Customers with the lowest recency and the highest frequency and monetary amounts are
considered top customers.
qcut() is a quantile-based discretization function: it bins the data based on sample quantiles.
For example, 1000 values cut into 4 quantiles would produce a categorical object indicating
quartile membership for each customer.
rfm['r_quartile'] = pd.qcut(rfm['recency'], 4, ['1','2','3','4'])
rfm['f_quartile'] = pd.qcut(rfm['frequency'], 4, ['4','3','2','1'])
rfm['m_quartile'] = pd.qcut(rfm['monetary'], 4, ['4','3','2','1'])
rfm.head()
recency frequency monetary r_quartile f_quartile m_quartile
CustomerID
12346.0 325 1 77183.60 4 4 1
12747.0 2 103 4196.01 1 1 1
12748.0 0 4596 33719.73 1 1 1
12749.0 3 199 4090.88 1 1 1
12820.0 3 59 942.34 1 2 2
RFM Result Interpretation
Combine all three quartiles (r_quartile, f_quartile, m_quartile) into a single column; this
combined score will help you segment the customers into well-defined groups.
rfm['RFM_Score'] = rfm.r_quartile.astype(str) + rfm.f_quartile.astype(str) + rfm.m_quartile.astype(str)
rfm.head()
# Filter out Top/Best customers
rfm[rfm['RFM_Score']=='111'].sort_values('monetary', ascending=False).head()
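Beyond filtering the '111' segment, it can be useful to see how customers distribute across all combined scores. A sketch on hypothetical quartile labels (the four customers and their labels below are made up):

```python
import pandas as pd

# Hypothetical quartile labels for four customers.
rfm = pd.DataFrame({
    'r_quartile': ['1', '1', '4', '2'],
    'f_quartile': ['1', '2', '4', '1'],
    'm_quartile': ['1', '1', '4', '3'],
})

# Concatenate the three labels into one score string per customer,
# then count how many customers fall in each segment.
rfm['RFM_Score'] = rfm['r_quartile'] + rfm['f_quartile'] + rfm['m_quartile']
print(rfm['RFM_Score'].value_counts().to_dict())
```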