Title: Comprehensive Analysis of Consumer Behavior Using E-commerce Transaction Data
Data Science
Instructor: Miss Andleeb Akram
Sec A
Submitted By
Abdul Moeed B-26546 Fall 2021-2025
Ahmad Usman B-26763 Fall 2021-2025
Haider Ali B-26714 Fall 2021-2025
University of South Asia
Department of Computer Science
1. Introduction
In the rapidly evolving digital economy, e-commerce businesses thrive on their
ability to understand and anticipate consumer behavior. With the wealth of
transactional data available today, data science techniques can be employed to
extract actionable insights that drive marketing, improve product
recommendations, and enhance customer retention strategies.
This report presents a deep dive into a UK-based online retail company's
transactional dataset collected between December 2010 and December 2011. The
overarching goals are to understand purchasing patterns, identify customer
segments, and build predictive models for customer churn. The insights derived
will serve as a blueprint for making informed, data-driven business decisions.
2. Data Overview
The dataset under investigation includes over 18,000 retail transactions made by
4,338 customers across 37 countries, featuring 3,877 unique products. Key
features in the dataset include:
InvoiceNo: Transaction identifier
StockCode: Product ID
Description: Product name
Quantity: Units purchased per transaction
InvoiceDate: Timestamp of the purchase
UnitPrice: Price per item
CustomerID: Unique customer identifier
Country: Customer’s location
The data was first cleaned by removing rows with missing values (especially
missing CustomerIDs), rows with negative quantities (indicating returns or
errors), and duplicate rows. The resulting dataset of 18,532 valid transactions
formed the basis for further exploration.
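The cleaning steps described above can be sketched in pandas as follows. This is a minimal illustration, not the project's actual code: the function name `clean_transactions` and the tiny sample frame are hypothetical, and the column names are taken from the feature list in this section.

```python
import pandas as pd

def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a CustomerID, rows with negative quantities, and exact duplicates."""
    out = df.dropna(subset=["CustomerID"])   # missing customer identifiers
    out = out[out["Quantity"] >= 0]          # negative quantities: returns or errors
    out = out.drop_duplicates()              # fully identical rows
    return out.reset_index(drop=True)

# Tiny made-up sample for illustration only (not the real dataset)
raw = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A3", "A4"],
    "StockCode": ["S1", "S1", "S2", "S3", "S4"],
    "Quantity": [2, 2, -1, 5, 3],
    "UnitPrice": [1.5, 1.5, 2.0, 0.5, 4.0],
    "CustomerID": [10.0, 10.0, 11.0, None, 12.0],
})
cleaned = clean_transactions(raw)
print(cleaned)
```

In this sample, one duplicate of invoice A1, the negative-quantity row A2, and the row A3 without a CustomerID are dropped.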
Summary Statistics:
Unique Customers: 4,338
Distinct Products: 3,877
Unique Transactions: 18,532
Transaction Period: December 1, 2010 – December 9, 2011
Countries Represented: 37
3. Purchase Trend Analysis
3.1 Top-Selling Products
To determine product popularity, total quantity sold was aggregated. The top 10
most purchased products include:
1. White Hanging Heart T-Light Holder
2. Regency Cake Stand
3. Jumbo Bag Red Retrospot
4. Party Bunting
5. Paper Chain Kit
6. Feltcraft Princess Doll Kit
7. Pack of 72 Retrospot Cake Cases
8. Assorted Colour Bird Ornament
9. Set of 3 Cake Tins Pantry Design
10. Pack of 60 Pink Paisley Cake Cases
These items suggest a dominant market for decorative and party-oriented goods,
which can guide future stock planning and promotional efforts.
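The ranking above comes from summing Quantity per product. A minimal sketch of that aggregation is shown below, assuming products are identified by the Description column; the helper name `top_products`, the placeholder item names, and the quantities are made up for illustration.

```python
import pandas as pd

def top_products(df: pd.DataFrame, n: int = 10) -> pd.Series:
    """Total units sold per product description, largest first."""
    return (
        df.groupby("Description")["Quantity"]
          .sum()
          .sort_values(ascending=False)
          .head(n)
    )

# Placeholder products and quantities (hypothetical, not the report's figures)
lines = pd.DataFrame({
    "Description": ["Item A", "Item B", "Item A", "Item C", "Item B", "Item A"],
    "Quantity": [5, 3, 7, 1, 4, 2],
})
ranking = top_products(lines, n=2)
print(ranking)
```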
3.2 Hourly and Daily Patterns
Hourly transaction analysis revealed that most purchases occur between 10 AM
and 3 PM, peaking at 12 PM, likely coinciding with work breaks. Daily activity
showed Tuesday and Thursday as peak days, which suggests possible windows for
targeted marketing.
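One way to obtain hourly and daily counts of this kind is to derive the hour and weekday from InvoiceDate and count distinct invoices in each group. Counting distinct InvoiceNo values (rather than line items) is an assumption here, and the sample rows are invented for illustration.

```python
import pandas as pd

# Invented sample rows; invoice A1 has two line items
sales = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A3"],
    "InvoiceDate": pd.to_datetime([
        "2011-03-01 12:05", "2011-03-01 12:05",
        "2011-03-03 12:40", "2011-03-03 10:15",
    ]),
})
sales["Hour"] = sales["InvoiceDate"].dt.hour
sales["Weekday"] = sales["InvoiceDate"].dt.day_name()

# Distinct invoices per hour of day and per weekday
by_hour = sales.groupby("Hour")["InvoiceNo"].nunique()
by_weekday = sales.groupby("Weekday")["InvoiceNo"].nunique()
print(by_hour)
print(by_weekday)
```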
3.3 Monthly Trends
Sales volume increased sharply in November and December, coinciding with the
holiday season. A time series plot of daily sales revealed predictable seasonality,
crucial for inventory and marketing planning.
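A daily time series and monthly totals such as these can be derived as sketched below. Measuring sales as Quantity × UnitPrice is an assumption (the report does not state whether volume means units or revenue), and the sample rows are invented.

```python
import pandas as pd

# Invented sample rows for illustration
sales = pd.DataFrame({
    "InvoiceDate": pd.to_datetime([
        "2011-10-03 09:00", "2011-11-14 11:30",
        "2011-11-14 15:00", "2011-12-05 12:00",
    ]),
    "Quantity": [2, 10, 4, 6],
    "UnitPrice": [1.0, 2.0, 0.5, 3.0],
})
sales["Revenue"] = sales["Quantity"] * sales["UnitPrice"]  # assumed sales measure

# Daily series (days without sales become 0) and calendar-month totals
daily = sales.set_index("InvoiceDate")["Revenue"].resample("D").sum()
monthly = sales.groupby(sales["InvoiceDate"].dt.to_period("M"))["Revenue"].sum()
print(monthly)
```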
3.4 Additional Insights
A heatmap of purchases by hour and weekday demonstrated heightened sales
activity during late mornings on weekdays, with a drop-off on weekends.
Additionally, order value distributions showed a right-skewed pattern indicating
most purchases are low-to-mid value.
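The grid behind such a heatmap can be built as a weekday-by-hour cross-tabulation of invoices. The sketch below is one possible construction: the helper name `hour_weekday_grid` and the sample rows are hypothetical, and counting each invoice once is an assumption. Plotting is left out; the resulting table could be passed to any heatmap function.

```python
import pandas as pd

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def hour_weekday_grid(df: pd.DataFrame) -> pd.DataFrame:
    """Count invoices per (weekday, hour); rows follow calendar order."""
    invoices = df.drop_duplicates("InvoiceNo").assign(
        Weekday=lambda d: d["InvoiceDate"].dt.day_name(),
        Hour=lambda d: d["InvoiceDate"].dt.hour,
    )
    grid = pd.crosstab(invoices["Weekday"], invoices["Hour"])
    return grid.reindex(WEEKDAYS, fill_value=0)  # keep weekdays with no sales

# Invented sample rows; invoice A1 has two line items
sample = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A3"],
    "InvoiceDate": pd.to_datetime([
        "2011-03-01 11:00", "2011-03-01 11:00",
        "2011-03-08 11:30", "2011-03-05 14:00",
    ]),
})
grid = hour_weekday_grid(sample)
print(grid)
```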
4. Customer Segmentation Using RFM and KMeans
4.1 RFM Feature Engineering
To segment customers, we used RFM (Recency, Frequency, Monetary) analysis:
Recency: Days since last purchase
Frequency: Total number of purchases
Monetary: Total spend per customer
The dataset was grouped by CustomerID, and values were normalized to prevent
any feature from dominating clustering due to scale.
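The RFM construction can be sketched as below. Several details are assumptions, since the report does not specify them: Frequency is counted as distinct invoices, Monetary as the sum of Quantity × UnitPrice, Recency is measured from a chosen snapshot date, and normalization is done with z-scores. The helper names and the sample rows are hypothetical.

```python
import pandas as pd

def build_rfm(df: pd.DataFrame, snapshot: pd.Timestamp) -> pd.DataFrame:
    """One row per customer with Recency (days), Frequency, and Monetary."""
    df = df.assign(Revenue=df["Quantity"] * df["UnitPrice"])
    return df.groupby("CustomerID").agg(
        Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
        Frequency=("InvoiceNo", "nunique"),
        Monetary=("Revenue", "sum"),
    )

def standardize(rfm: pd.DataFrame) -> pd.DataFrame:
    """Z-score each column so no feature dominates by scale (assumed method)."""
    return (rfm - rfm.mean()) / rfm.std(ddof=0)

# Invented sample rows for illustration
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2],
    "InvoiceNo": ["A1", "A2", "A3"],
    "InvoiceDate": pd.to_datetime(["2011-12-01", "2011-12-08", "2011-06-01"]),
    "Quantity": [2, 1, 10],
    "UnitPrice": [1.5, 4.0, 0.5],
})
rfm = build_rfm(tx, snapshot=pd.Timestamp("2011-12-09"))
scaled = standardize(rfm)
print(rfm)
```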
4.2 Clustering with KMeans
KMeans clustering was applied (optimal clusters = 4, determined via Elbow
Method), yielding:
Cluster 0 (20%): High-spenders, frequent and recent buyers – VIPs
Cluster 1 (25%): Recent but low-frequency buyers – potential to upsell
Cluster 2 (30%): Infrequent, low-value customers – low engagement
Cluster 3 (25%): Haven’t purchased recently – likely churned
These profiles help in crafting tiered retention and engagement strategies.
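The Elbow Method and the final clustering can be sketched with scikit-learn as follows. The data here is synthetic (four well-separated blobs standing in for standardized RFM features), and the `n_init` and `random_state` values are arbitrary choices for reproducibility, not settings taken from the report.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for standardized RFM features: four separated groups
rng = np.random.default_rng(0)
centers = np.array([[-3, -3, -3], [-3, 3, 3], [3, -3, 3], [3, 3, -3]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 3)) for c in centers])

# Elbow Method: record within-cluster sum of squares (inertia) for each k
inertias = []
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(model.inertia_)

# Final model with the chosen number of clusters
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
shares = np.bincount(labels, minlength=4) / len(labels)  # cluster size shares
print([round(v, 1) for v in inertias])
print(shares)
```

The "elbow" is the value of k after which inertia stops dropping sharply; cluster shares like the percentages listed above follow from the label counts.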
5. Churn Prediction Using Machine Learning
5.1 Label Definition and Features
Churn was defined based on a 6-month inactivity threshold. Customers who had
not made a purchase within 180 days of the last dataset date were labeled as
"churned" (1), others as "active" (0).
Predictors: RFM features
Model: Random Forest Classifier
Data split: 70% training, 30% testing
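The setup above can be sketched as follows. The RFM table is randomly generated for illustration, and `n_estimators` and `random_state` are arbitrary choices rather than the report's settings. Note that in this sketch the label is computed directly from Recency, which is also one of the predictors.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Randomly generated RFM table (illustrative only, not the real customers)
rng = np.random.default_rng(42)
n = 400
rfm = pd.DataFrame({
    "Recency": rng.integers(0, 374, size=n),     # days since last purchase
    "Frequency": rng.integers(1, 30, size=n),
    "Monetary": rng.gamma(2.0, 200.0, size=n),
})

# Churn label: no purchase in the last 180 days -> 1, otherwise 0
churned = (rfm["Recency"] > 180).astype(int)

# 70% training, 30% testing
X_train, X_test, y_train, y_test = train_test_split(
    rfm, churned, test_size=0.30, random_state=42
)
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(classification_report(y_test, clf.predict(X_test)))
```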
5.2 Performance Metrics
The model achieved perfect performance on the test data:
Class           Precision   Recall   F1-Score   Support
0 (Active)      1.00        1.00     1.00       1054
1 (Churned)     1.00        1.00     1.00       248
Accuracy                             1.00       1302
Macro Avg       1.00        1.00     1.00       1302
Weighted Avg    1.00        1.00     1.00       1302
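These results should be treated with caution. The churn label is derived from
the number of days since a customer's last purchase, which closely mirrors the
Recency predictor, so the model can largely reproduce the label from that one
feature. A perfect score is therefore more likely a sign of data leakage than of
genuine predictive power. Validation using unseen or future data is necessary to
ensure generalizability.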
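One possible way to validate on future data is a time-based split: features are computed only from transactions before a cutoff date, and the churn label comes from whether the customer buys again in the following window. The sketch below is an assumption about how this could be set up, not the method used in this report; the function name `temporal_churn_frame`, the cutoff date, and the sample rows are hypothetical.

```python
import pandas as pd

def temporal_churn_frame(df: pd.DataFrame, cutoff, horizon_days: int = 180) -> pd.DataFrame:
    """RFM features from before `cutoff`; label from the `horizon_days` after it."""
    cutoff = pd.Timestamp(cutoff)
    horizon_end = cutoff + pd.Timedelta(days=horizon_days)
    past = df[df["InvoiceDate"] < cutoff]
    future = df[(df["InvoiceDate"] >= cutoff) & (df["InvoiceDate"] < horizon_end)]

    features = past.groupby("CustomerID").agg(
        Recency=("InvoiceDate", lambda d: (cutoff - d.max()).days),
        Frequency=("InvoiceNo", "nunique"),
        Monetary=("Revenue", "sum"),
    )
    # Churned = 1 when the customer makes no purchase inside the future window
    returned = features.index.isin(future["CustomerID"].unique())
    features["Churned"] = (~returned).astype(int)
    return features

# Invented sample rows for illustration
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3],
    "InvoiceNo": ["A1", "A2", "A3", "A4"],
    "InvoiceDate": pd.to_datetime(["2011-02-01", "2011-08-01", "2011-03-01", "2011-09-01"]),
    "Revenue": [10.0, 20.0, 5.0, 8.0],
})
frame = temporal_churn_frame(tx, cutoff="2011-06-01")
print(frame)
```

With this construction the label is no longer a function of the Recency feature, because the features and the label come from disjoint time periods.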
6. Visualizations
Eight key charts were generated to visually support the analysis:
1. Top 10 Most Purchased Products
2. Daily Sales Over Time (Time Series)
3. Hourly Purchase Frequency
4. Daily Purchase Frequency
5. Monthly Sales Volume
6. Heatmap of Sales by Day and Hour
7. Order Value Distribution
8. Customer Segments (Pie Chart from RFM + KMeans)
These visualizations are included in the supporting code and can be embedded
into presentations or dashboards.
7. Conclusion
Clear demand for specific product categories: The repeated purchases of
decorative and party-related items indicate a stable and predictable customer
preference, offering opportunities for focused promotions and stocking
strategies.
Predictable purchase timing and seasonality: Temporal analysis
highlights increased customer activity between 10 AM and 3 PM on weekdays
and during the holiday months of November and December. This pattern
suggests ideal timing for flash sales, newsletters, and advertising campaigns.
Well-defined customer segments based on behavior: The application of
RFM analysis and KMeans clustering uncovered four distinct customer
segments. Understanding these segments allows businesses to tailor their
retention, upselling, and engagement strategies based on customer lifetime
value.
Highly accurate churn prediction model: The Random Forest model
provided perfect accuracy on test data, demonstrating the potential of
machine learning for proactive customer retention. However, additional
validation with future datasets is recommended to ensure robustness and
avoid overfitting.
8. Recommendations
1. Personalized Campaigns: Tailor offers based on segment profiles (e.g.,
discounts for low-frequency customers).
2. Inventory Optimization: Increase stock of top-selling products before
holiday seasons.
3. VIP Engagement: Launch loyalty programs for Cluster 0 customers to
prevent churn.
4. Churn Intervention: Use churn predictions to re-engage inactive customers
with win-back offers.
5. Dashboards: Deploy real-time dashboards to track KPIs and trends
continuously.
9. Future Work
Product Category Classification: Use NLP to tag products for richer
segmentation.
A/B Testing: Validate marketing strategies on different segments.
Real-Time Prediction: Integrate with CRM systems to apply predictions
dynamically.
Enhanced Models: Explore XGBoost and LSTM models for better
performance and temporal analysis.
By expanding on this foundation, businesses can transition toward a data-first
customer intelligence framework that continuously evolves through feedback and
experimentation.