SAMPLE SALES DATA
ANALYSIS
Submission Date:
SURYANSHU KUMAR
2023000776
Table of Contents
1. Project Title Page
2. Table of Contents
3. Introduction
4. Requirements
5. Code Structure
6. Challenges & Solutions
7. Conclusion & Future Work
8. References
Introduction
Objectives
The primary objectives of this analysis are:
To perform descriptive, bivariate, and multivariate
statistical analyses on the Sample Sales Data.
To derive insights into sales patterns, customer
behavior, and shipping performance.
To identify factors influencing sales and customer
satisfaction.
Scope and Limitations
Scope: The analysis encompasses various
statistical techniques, including descriptive
statistics, hypothesis testing, correlation analysis,
regression analysis, and principal component
analysis (PCA).
Limitations: The dataset's quality and
completeness may affect the analysis. Additionally,
the findings are limited to the data provided and
may not be generalizable.
Requirements
Software & Libraries
Python 3.x
Libraries:
o pandas
o numpy
o matplotlib
o seaborn
o scipy
o statsmodels
o scikit-learn
Hardware Requirements
Standard computing hardware capable of running
Python and the aforementioned libraries.
Installation Instructions
To install the required libraries, execute:
pip install pandas numpy matplotlib seaborn scipy statsmodels scikit-learn factor_analyzer
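Once installation finishes, it can help to confirm that each library is actually present before running the analysis. The following is a minimal sketch; the helper name `installed_versions` and the `REQUIRED` list are hypothetical and not part of the original project.

```python
from importlib import metadata

# Distribution names as published on PyPI, taken from the Software & Libraries list.
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn",
            "scipy", "statsmodels", "scikit-learn"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any entry reported as NOT INSTALLED can be installed individually with pip.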
Code Structure
a. Libraries
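import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from statsmodels.formula.api import ols
from sklearn.decomposition import PCA
# Needed for the EFA step; provided by the separate factor_analyzer package
from factor_analyzer import FactorAnalyzer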
b. Inputs (Data)
Dataset: Sample Sales Data
Source: Kaggle Dataset
c. Process (Methods)
Data Loading and Cleaning
# Load the dataset
df = pd.read_csv('sample_sales_data.csv')
# Display basic information (column types, non-null counts)
df.info()
# Handle missing values by dropping incomplete rows
df.dropna(inplace=True)
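Dropping every incomplete row can discard more data than expected, so it may be worth checking how much is missing first. The sketch below uses a small toy frame in place of the real dataset; the helper `missing_report` and the toy values are illustrative assumptions.

```python
import numpy as np
import pandas as pd


def missing_report(df):
    """Return per-column missing counts and percentages, worst column first."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    return report.sort_values("missing", ascending=False)


# Toy frame standing in for the real dataset (values are illustrative).
toy = pd.DataFrame({
    "Sales": [100.0, np.nan, 250.0, 80.0],
    "Quantity": [1, 2, 3, 4],
    "Discount": [0.1, np.nan, np.nan, 0.0],
})

report = missing_report(toy)
print(report)

# Number of rows that dropna() would discard
rows_lost = len(toy) - len(toy.dropna())
print(f"dropna() would remove {rows_lost} of {len(toy)} rows")
```

If the share of lost rows is large, imputing or dropping only selected columns may be preferable to removing whole records.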
1. Descriptive/Univariate Analysis
Summaries:
# Summary statistics
df.describe()
Plots:
# Histogram
df['Sales'].hist()
plt.title('Sales Distribution')
plt.xlabel('Sales')
plt.ylabel('Frequency')
plt.show()
# Boxplot
sns.boxplot(x=df['Sales'])
plt.title('Sales Boxplot')
plt.show()
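A boxplot shows outliers visually; the same rule can be applied numerically to list them. The sketch below uses the common 1.5 × IQR fences, which is also the default whisker rule in seaborn boxplots. The helper `iqr_outliers` and the sample values are assumptions made for illustration.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the usual boxplot fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Illustrative values, not taken from the dataset.
sales = pd.Series([120, 135, 150, 160, 175, 180, 190, 2000], name="Sales")
outliers = iqr_outliers(sales)
print(outliers)
```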
# Heatmap
df_numeric = df.apply(pd.to_numeric, errors='coerce')
df_numeric = df_numeric.dropna(axis=1, how='all')
corr_matrix = df_numeric.corr()
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm",
            fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()
Normality Tests:
# Shapiro-Wilk test
stat, p = stats.shapiro(df['Sales'])
print('Statistics=%.3f, p=%.3f' % (stat, p))
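The Shapiro-Wilk test takes normality as its null hypothesis, so a small p-value is evidence against a normal distribution. The sketch below wraps the test in a hypothetical helper, `check_normality`, and runs it on synthetic right-skewed data rather than the real Sales column.

```python
import numpy as np
from scipy import stats


def check_normality(values, alpha=0.05):
    """Run Shapiro-Wilk and report whether normality is rejected at level alpha."""
    values = np.asarray(values, dtype=float)
    values = values[~np.isnan(values)]  # drop NaN so the statistic is defined
    stat, p = stats.shapiro(values)
    return {"statistic": stat, "p_value": p, "reject_normality": bool(p < alpha)}


rng = np.random.default_rng(0)
skewed = rng.exponential(scale=500.0, size=500)  # synthetic right-skewed "sales"
result = check_normality(skewed)
print(f"W={result['statistic']:.3f}, p={result['p_value']:.3g}, "
      f"reject normality: {result['reject_normality']}")
```

With very large samples the test tends to reject even mild departures from normality, so the histogram is worth reading alongside the p-value.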
Hypothesis Tests:
# One-sample t-test
t_stat, p_val = stats.ttest_1samp(df['Sales'], popmean=500)
print('t-statistic=%.3f, p-value=%.3f' % (t_stat, p_val))
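The one-sample t-statistic is the distance between the sample mean and the hypothesized mean, measured in standard errors: t = (mean − popmean) / (s / √n). The sketch below computes it by hand and compares the result with `stats.ttest_1samp`; the sample values and the helper `one_sample_t` are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def one_sample_t(values, popmean):
    """Compute t = (mean - popmean) / (s / sqrt(n)) using the sample std (ddof=1)."""
    values = np.asarray(values, dtype=float)
    n = values.size
    standard_error = values.std(ddof=1) / np.sqrt(n)
    return (values.mean() - popmean) / standard_error


sample = np.array([520.0, 610.0, 480.0, 700.0, 555.0, 630.0])  # illustrative values
t_manual = one_sample_t(sample, popmean=500)
t_scipy, p_val = stats.ttest_1samp(sample, popmean=500)
print(f"manual t={t_manual:.3f}, scipy t={t_scipy:.3f}, p={p_val:.3f}")
```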
2. Bivariate Analysis
Correlation:
# Correlation matrix
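# numeric_only=True skips text columns, which raise an error in pandas 2.x
corr_matrix = df.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=True)
plt.title('Correlation Matrix')
plt.show()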
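A correlation matrix reports the strength of each relationship but not whether it is statistically significant. One way to test a single pair is `scipy.stats.pearsonr`, sketched below on synthetic data; the relationship between the toy Sales and Quantity columns is invented for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
quantity = rng.integers(1, 20, size=200)
# Synthetic relationship, for illustration only: sales rise with quantity plus noise.
sales = 50.0 * quantity + rng.normal(0, 40, size=200)
toy = pd.DataFrame({"Sales": sales, "Quantity": quantity})

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = stats.pearsonr(toy["Sales"], toy["Quantity"])
print(f"Pearson r={r:.3f}, p={p_value:.3g}")
```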
Simple Linear Regression:
# Regression analysis
model = ols('Sales ~ Quantity', data=df).fit()
print(model.summary())
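For a single predictor, `scipy.stats.linregress` offers a compact cross-check of the slope, intercept, and R² reported in the statsmodels summary. The sketch below fits synthetic data with a known slope of 50; the data are an assumption and do not come from the dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
quantity = rng.integers(1, 20, size=300).astype(float)
# Synthetic data with a known intercept (25) and slope (50).
sales = 25.0 + 50.0 * quantity + rng.normal(0, 30, size=300)

fit = stats.linregress(quantity, sales)
print(f"slope={fit.slope:.2f}, intercept={fit.intercept:.2f}, "
      f"R^2={fit.rvalue**2:.3f}, p={fit.pvalue:.3g}")
```

The slope estimates the change in Sales for each additional unit of Quantity.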
3. Multivariate Analysis
Multiple Regression:
# Multiple regression
model = ols('Sales ~ Quantity + Discount', data=df).fit()
print(model.summary())
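Rather than reading the full summary table, the key quantities can be pulled from the fitted model directly. The sketch below fits the same formula to synthetic data with known coefficients, so the estimates can be compared against the truth; the data-generating values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(7)
n = 400
toy = pd.DataFrame({
    "Quantity": rng.integers(1, 15, size=n).astype(float),
    "Discount": rng.uniform(0.0, 0.5, size=n),
})
# Synthetic ground truth, for illustration only.
toy["Sales"] = (100.0 + 40.0 * toy["Quantity"] - 200.0 * toy["Discount"]
                + rng.normal(0, 25, size=n))

model = ols("Sales ~ Quantity + Discount", data=toy).fit()
print(model.params.round(2))      # estimated coefficients
print(model.pvalues.round(4))     # significance of each term
print(f"Adjusted R^2: {model.rsquared_adj:.3f}")
```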
Principal Component Analysis (PCA):
# PCA
features = ['Sales', 'Quantity', 'Discount']
x = df[features]
pca = PCA(n_components=2)
principal_components = pca.fit_transform(x)
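PCA is sensitive to scale, so a variable measured in large units (such as Sales) can dominate the components when the features are not standardized. The sketch below is a variant, not the original method: it adds `StandardScaler` before PCA and prints the explained variance ratio and loadings. The toy data are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
n = 300
quantity = rng.integers(1, 15, size=n).astype(float)
discount = rng.uniform(0.0, 0.5, size=n)
sales = 40.0 * quantity - 200.0 * discount + rng.normal(0, 25, size=n)
toy = pd.DataFrame({"Sales": sales, "Quantity": quantity, "Discount": discount})

# Standardize so that each feature has mean 0 and unit variance.
scaled = StandardScaler().fit_transform(toy)

pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

print("Explained variance ratio:", pca.explained_variance_ratio_.round(3))
loadings = pd.DataFrame(pca.components_.T, index=toy.columns, columns=["PC1", "PC2"])
print(loadings.round(3))
```

The explained variance ratio shows how much of the total variance each component retains, which is the usual basis for choosing `n_components`.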
Exploratory Factor Analysis (EFA):
# EFA
# Requires: from factor_analyzer import FactorAnalyzer
df_numeric = df.select_dtypes(include=[np.number])
df_numeric = df_numeric.dropna()
# Unrotated fit, used only to obtain the eigenvalues
fa_no_rotation = FactorAnalyzer(rotation=None)
fa_no_rotation.fit(df_numeric)
eigenvalues, _ = fa_no_rotation.get_eigenvalues()
# Kaiser criterion: keep factors with eigenvalue > 1
n_factors = int((eigenvalues > 1).sum())
fa = FactorAnalyzer(n_factors=n_factors, rotation='varimax')
fa.fit(df_numeric)
loadings = fa.loadings_
print("\nFactor Loadings:")
print(pd.DataFrame(loadings, index=df_numeric.columns))
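The step that selects `n_factors` applies the Kaiser criterion: retain as many factors as there are eigenvalues of the correlation matrix greater than 1. The sketch below reproduces that count with NumPy alone, on synthetic data built from two latent factors; the helper `kaiser_factor_count` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd


def kaiser_factor_count(df_numeric):
    """Count eigenvalues of the correlation matrix that exceed 1 (Kaiser criterion)."""
    corr = np.corrcoef(df_numeric.to_numpy(dtype=float), rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]  # reverse to descending order
    return int((eigenvalues > 1).sum()), eigenvalues


rng = np.random.default_rng(5)
n = 500
f1, f2 = rng.normal(size=n), rng.normal(size=n)  # two synthetic latent factors
toy = pd.DataFrame({
    "a": f1 + rng.normal(0, 0.1, n),
    "b": f1 + rng.normal(0, 0.1, n),
    "c": f2 + rng.normal(0, 0.1, n),
    "d": f2 + rng.normal(0, 0.1, n),
})

n_factors, eigenvalues = kaiser_factor_count(toy)
print("Eigenvalues:", eigenvalues.round(3))
print("Factors retained:", n_factors)
```

The eigenvalues of a correlation matrix sum to the number of variables, so an eigenvalue above 1 marks a factor that explains more variance than a single original variable.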
d. Outputs (Results – Numeric, Plots)
Descriptive Statistics:
Visualization
Statistical Test Results:
Statistics=0.927, p=0.000
t-statistic=20.791, p-value=0.000
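The `p=0.000` values above result from the `%.3f` format in the print statements; the underlying p-values are very small but not exactly zero. A small helper along the following lines could report them as "p < 0.001" instead. The name `format_p` is hypothetical.

```python
def format_p(p, threshold=0.001):
    """Format a p-value, reporting very small values as 'p < threshold'."""
    if p < threshold:
        return f"p < {threshold}"
    return f"p = {p:.3f}"


print(format_p(3.2e-15))
print(format_p(0.0421))
```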
Regression Analysis:
PCA Results:
EFA Results:
Challenges & Solutions
Challenges
Data Quality: Missing values and potential outliers.
Assumptions: Ensuring statistical tests'
assumptions are met.
Solutions
Data Cleaning: Handled missing values by
removing incomplete records.
Validation: Conducted normality tests and
visualizations to validate assumptions.
Conclusion & Future Work
Summary of Key Findings
Sales Distribution: Sales data exhibited a non-normal
distribution (Shapiro-Wilk statistic = 0.927, p < 0.001).
Correlations: Significant correlation found between
sales and quantity.
Regression Models: Quantity and discount were
significant predictors of sales.
PCA: Identified principal components explaining
variance in sales data.
Suggestions for Future Improvements
Data Enrichment: Incorporate additional variables
like customer demographics.
Advanced Models: Explore machine learning
models for better prediction accuracy.
References
Kaggle Dataset: Sample Sales Data
Python Libraries Documentation:
o pandas
o numpy
o matplotlib
o seaborn
o scipy
o statsmodels
o scikit-learn