0% found this document useful (0 votes)
39 views19 pages

Group9 DataProject

Uploaded by

thaidd22416c
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
39 views19 pages

Group9 DataProject

Uploaded by

thaidd22416c
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

Dataset 1: Products Sale Data

1. BUSINESS ISSUE UNDERSTANDING


1.1 Define business objective:
Objective: The objective of the "Product Sales Data" dataset is to provide a comprehensive
dataset for performing sales analysis. This includes identifying sales trends, evaluating
product performance, understanding customer behavior, and assessing the impact of
promotions. So, the objective of this dataset is to forecast sales, optimize pricing strategies
and improve overall business decisions based on data-driven insights.
Questions:
1) Is there any trend in the sales of all four products during certain months?
2) Out of all four products , which product has seen the highest sales in all the given years?
3) What are the key trends in sales over the past year?
4) What are the factors contributing to sales declines or growth in specific periods?
5) What is the correlation between marketing spend and sales growth?
6) What is the average time from product launch to reaching peak sales?
7) Which factors (price, product quality, customer service, etc.) have the greatest influence
on customer purchasing decisions?
Business context:
Rural Electrification Corporation Ltd (REC) is a Navratna Central Public Sector Enterprise
under the Ministry of Power in India. The company provides financial assistance to state
electricity boards, state governments, central and state power utilities, independent power
producers, rural electric cooperatives, and private sector utilities. REC finances and promotes
power projects across India, including generation, transmission, and distribution projects.
 Financial Performance:
REC has shown a strong financial performance with significant growth in key financial
metrics.
The company's Assets Under Management (AUM) grew by 21% year-on-year to ₹4.97 lakh
crore
 Stock Performance: Over the past year, REC's stock has seen a significant increase,
reflecting investor confidence in the company's performance and growth prospects
 Market Position:
REC holds a strong market position as a key financial player in the Indian power sector. Its
status as a Navratna Central Public Sector Enterprise underscores its importance and
reliability in the industry
The company's strategic focus on financing and promoting power projects positions it as a
critical entity in supporting India's energy infrastructure development.
 Competitor:
 Rural Electrification Corporation Ltd (REC) is a significant player in India's power
sector, focusing on providing financial assistance for power projects. Its primary
competitor is the Power Finance Corporation (PFC), which also operates under the
Ministry of Power. In 2019, PFC acquired a majority stake in REC, consolidating
their positions to form one of India's largest non-banking financial companies
(NBFCs) dedicated to power sector financing.
Power Finance Corporation (PFC):
PFC and REC together are pivotal in financing India’s power infrastructure, including
thermal and renewable energy projects. They have extensive loan portfolios, heavily exposed
to the thermal power sector, which poses significant non-performing asset (NPA) risks due to
obsolete coal-fired power plants
PFC's strategic goal includes shifting towards financing renewable energy projects,
which aligns with India's target of 450 GW of renewable energy by 2030
In addition, REC also has other competitors such as: Other NBFCs and Financial Institutions,
International Financial Institutions,....

1.2 The method


Diagnostic Analytics:
Diagnostic analytics is the area of data analytics that is concerned with identifying the root
cause of problems or issues. Unlike other types of data analytics, which focus on
understanding what has happened in the past or on predicting what will happen in the future,
diagnostic analytics is focused on understanding why something has happened. Being able to
give a concrete “why” behind users’ actions or lack thereof is one of the hardest tasks in data
analytics as it is a process of inference, not proof.

It often follows descriptive analytics, which focuses on what has happened in the past.
Diagnostic analytics relies on information from descriptive analytics to proceed, as we need
to know what happened before we can ask why it happened. This is followed in turn by
prescriptive analytics, which focuses on what to do in the future. Following the order of
“what?” then “why” then “what next?” is a sensible way to do data analytics, as we need to
know what happened and why before we can decide what to do next.
One example of diagnostic analytics is a marketing funnel analysis. This means looking at
the set of steps that a user might take before reaching a final goal, such as a conversion or a
sale, and understanding why they do or don’t complete each step. For example, before a user
reaches the goal of a purchase, they may reach a series of intermediate goals such as visiting
our website, adding an item to their shopping cart, and clicking the “checkout” button.
Diagnostic analytics allows us to analyze why people are not converting or purchasing by
looking at which steps they were at when they dropped off, and inferring why.
Why do we use this method?
We use diagnostic research methods to gain insights into the reasons behind sales trends and
the root causes of issues within the business. Additionally, this method helps identify
weaknesses in the sales process, thereby enhancing operational efficiency and enabling more
accurate analyses and predictions. Furthermore, we can strategize and create a competitive
advantage to make the most precise and informed decisions, optimizing revenue and profit.

Benefits of diagnostic analytics


The purpose of diagnostic analytics is to give a business more actionable information than
descriptive analytics alone. It enables us to find out why something is working (or not
working), allowing us to correct any wrong assumptions we may have had. The benefits of
diagnostic analytics include:
Greater insights: It allows for deeper insights into our data when used in conjunction with
other types of data analytics.
Forming and testing hypotheses: Having evidence of what has previously happened helps
businesses to form and test new hypotheses more easily.
Identifying anomalies: Diagnostic analytics helps us determine whether data outliers were
one-off anomalies or useful, significant findings.
Avoiding future mistakes: It helps us identify when and where something didn’t perform
well, enabling us to improve efficiency, reduce waste, and avoid costly mistakes.
Ease of understanding: Diagnostic findings are generally simple to understand, and once they
have been turned into data visualizations, they can be easily shared with stakeholders.
Limitations of diagnostic analytics
One limitation of diagnostic analytics is that it is easy to mistake correlation for causation.
For example, there is a correlation between ice cream sales and bee stings, but that doesn’t
mean that one caused the other. They are in fact both dependent on a third factor (warm
temperatures). So any correlations in our data must always be fully investigated before
assuming a causal link.

Diagnostic analytics can’t predict the future, or make suggestions about what should be done
— it can only explain why something happened, and any further information can only be
gained either from a knowledgeable person making educated guesses or from predictive or
prescriptive analytics. Nor does it answer the question “What should we do?” — this is
answered by the field of prescriptive analytics.

Diagnostic analytics doesn’t give definitive answers. It can’t tell us that A definitely caused
B, only that a certain percentage of people who encountered event A did (or did not)
encounter event B. The accuracy of outcomes can be improved, however, with better-quality
data, larger data sets, and the involvement of domain experts in interpreting the data.

How to use diagnostic analytics in our business


The first step in diagnostic analytics is deciding on the questions we want answers to. These
may include questions like:

"What causes customers to cancel their subscriptions to our online product?"


"Why has web traffic decreased by so much this month?"
"Why are so many of our employees quitting their jobs this year?"
"Why do sales always increase in November?"
We should ensure that we have access to a reasonably large data set containing good-quality
data that’s relevant to our question. This will help us to draw useful inferences and avoid
making decisions based on outliers or the opinions of a vocal minority. Some examples of
the kinds of data sets that are large enough to be useful are sales records, marketing statistics,
and product inventory records. Once we have some suitable and relevant data, we can
develop our hypothesis — our proposed reason for why the thing we are studying happened
— to help direct our analytics. Even though diagnostic analytics doesn’t deal with future
predictions, it’s still possible to have a hypothesis about the past. For example, we might
hypothesize that the reason sales fell last month was because we spent less on advertising.

Next, we will need to prepare our data by cleaning it (which may involve removing defective
data or duplicates), transforming it into a useful format, and loading it into a single location
such as a data warehouse. We can also filter the data so that only what is relevant is left for
the analysis, or do data drilling, which involves looking at hierarchical data at a higher or
lower level — so “drilling down” is when we access data at a deeper, more granular level
than before. For example, when looking at how potential customers have responded to a
particular marketing campaign, we might drill down to see how those who live in a particular
region responded. Once our data is prepared, we can use one of the diagnostic analytics
techniques below.

Finally, we will need to create some data visualizations to use when communicating our
results to any interested stakeholders. If our analytics need to be run regularly, we should
automate the above steps and run it regularly against our production data, which is known as
operationalizing our analytics. As we formalize our diagnostic analytics steps, it will be
useful to refer to the data analytics lifecycle, which covers all the necessary steps including
operationalizing our analytics.

Once we are comfortable posing questions, forming hypotheses, and using our data to
support or disprove them, we can get creative. Less-proven data sets, or data from third
parties, can be introduced to see if they can yield any additional depth or experimental
insights from our diagnostic analytics process.

Diagnostic analytics techniques


Different diagnostic analytics techniques can be appropriate depending on the type of
question we are trying to answer and how comprehensive our data is. These techniques tend
to involve either statistical analysis or machine learning.

Statistical analysis:
Statistical analysis is the process of finding trends and patterns in data through the use of
statistical models. These models are also used to help work out which relationships between
variables are the most important. Some common statistical models for diagnostic analytics
are:

Correlation analysis: finding out if there is a relationship between two or more specified
variables or data sets and investigating how strong that relationship is.
Regression analysis: working out which variables affect the thing we’re interested in the
most (for example, sales).

Machine learning:
Machine learning algorithms can also be used in diagnostic analytics, for example:

 Binary classification: making a decision about something that only has two possible
answers. The answers to questions like “is the weather affecting sales or not?” and
“do customers like our new product or not?” can only be yes or no.
 Time-series analysis: looking at trends (or data) that change over time. This could be
to answer a question such as why sales decreased in the previous quarter.
 While machine learning techniques are useful, humans with domain-specific
knowledge are still needed to provide context to the outcomes of diagnostic analytics.
For example, an expert might realize that a credit card customer making regular
withdrawals of a similar amount suggests that they may be using one credit card to
pay down another, and are thus a risk, whereas a machine might not have the context
to be able to understand what this unusual pattern means.

1.3 Scope of Work for Data Analysis Project


Data Preprocessing:

Use Python libraries such as Pandas to load the data into a DataFrame.
Check for missing values, data types, and inconsistencies.
Handle missing values appropriately (e.g., imputation, removal).
Convert date fields to datetime objects.
Ensure numerical fields (Q1, Q2, Q3, Q4, S1, S2, S3, S4) are correctly typed.

Historical Data Analysis:

Analyze historical sales and revenue data to identify trends and patterns.
Use time series forecasting methods (e.g., ARIMA, Prophet) to predict future sales and
revenue.
Provide a yearly estimate for sales and revenue for 2024, including confidence intervals for
accuracy.

Tools and Libraries

Xác định các công cụ và tài nguyên cần thiết để hoàn thành dự án: Sử dụng PowerBI,
Excel, Visual Studio, SSMS,SSIS,...
Python: For overall data processing and analysis.
Pandas: For data manipulation and analysis.
Matplotlib/Seaborn: For data visualization.
Scikit-learn/Statsmodels: For statistical analysis and modeling.
Prophet: For time series forecasting.

2. DATA UNDERSTANDING
2.1 Data information

This data update information of 4 products every single day from 13-06-2010 to 02-02-2023
The data file contains about 8 numerical parameters :
Q1- Total unit sales of product 1
Q2- Total unit sales of product 2
Q3- Total unit sales of product 3
Q4- Total unit sales of product 4
S1- Total revenue from product 1
S2- Total revenue from product 2
S3- Total revenue from product 3
S4- Total revenue from product 4
Example :
On 13-06-2010 , product 1 had been brought by 5422 people and INR 17187.74 had been
generated in revenue from product 1.

2.2 Explore data characteristics:

Step 1: Import libraries

# import the important packages

import pandas as pd # library used for data manipulation and analysis

import numpy as np # library used for working with arrays

import matplotlib.pyplot as plt # library for plots and visualizations

import seaborn as sns # library for visualizations

%matplotlib inline

Step 2: Loading the datasets

# data = pd.read_csv('statsfinal.csv')

# Checking the first 5 and last 5 rows of the dataset

data.head(-1)
4599 rows × 9 columns

We can observe the first entry in the data, starts at 13-06-2010. This means the data for year
2010 is not complete.

We can observe the last entry in the data, ends at 02-02-2023. This means the data for year 2023
is also not complete.

 it will be best to drop year 2010 and year 2023

Step 3: Check for missing values


data.isnull().sum()
we have no missing data

Step 4: EDA

#Extract the year, month and Day from the date

# Extract year from the 'Day' 'Month' 'year' from the 'Date' column using a lambda function

# We need to get the year from the data to analyse sales year to year

data['Day'] = data['Date'].apply(lambda x: x.split('-')[0])

data['Month'] = data['Date'].apply(lambda x: x.split('-')[1])

data['Year'] = data['Date'].apply(lambda x: x.split('-')[2])

data

#Drop downs for years 2010 and 2023:

data_reduced = data.query("Year != '2010' and Year != '2023'")

#Graph our TOTAL & MEAN unit sold for each product using a histogram:

#Create a function that allows us to plot a bar chart for the 4 products

def plot_bar_chart(df, columns, stri, str1, val):


# Aggregate sales for each product by year, by sum or mean

if val == 'sum':

sales_by_year = df.groupby('Year')[columns].sum().reset_index()

elif val == 'mean':

sales_by_year = df.groupby('Year')[columns].mean().reset_index()

# Melt the data to make it easier to plot

sales_by_year_melted = pd.melt(sales_by_year, id_vars='Year', value_vars=columns,


var_name='Product', value_name='Sales')

# Create a bar chart

plt.figure(figsize=(20,4))

sns.barplot(data=sales_by_year_melted, x='Year', y='Sales', hue='Product')


#,palette="cividis")

plt.xlabel('Year')

plt.ylabel(stri)

plt.title(f'{stri} by {str1}')

plt.xticks(rotation=45)

plt.show()

#use the plot_bar_chart function, enter the Unit Sales Columns and the Unit Sales string

plot_bar_chart(data_reduced, ['Q-P1', 'Q-P2', 'Q-P3', 'Q-P4'],'Total Unit Sales', 'Year', 'sum')

plot_bar_chart(data_reduced, ['Q-P1', 'Q-P2', 'Q-P3', 'Q-P4'],'Mean Unit Sales', 'Year',


'mean')
We can observe that P1 has the highest unit sales for each year. And it's highest is in year
2014.

We can observe that P4 has the lowest unit sales of all the products.

#use the plot_bar_chart function, enter the Revenue Columns and the Revenue string

plot_bar_chart(data_reduced, ['S-P1', 'S-P2', 'S-P3', 'S-P4'], 'Total Revenue', 'Year', 'sum')

plot_bar_chart(data_reduced, ['S-P1', 'S-P2', 'S-P3', 'S-P4'], 'Mean Revenue', 'Year', 'mean')


We can observe that P3 brought in the most revenue. This could be as a result of multiple
things:

- P3 was sold for higher than the rest, as it had the second highest unit sales for each year.

We can observe than P1 AND P2 brought in similar revenues for each year. With P2
bringing in slightly more.

- P1 despite having the most unit sold, brought in the second lowest revenue each year.

#Trend in sales of all four products during certain months

# Create a figure and axis

def month_plot():

fig, ax = plt.subplots()

# Plot the sales data for each product by month

data_reduced.groupby('Month')[['Q-P1', 'Q-P2', 'Q-P3', 'Q-P4']].sum().plot(ax=ax)

# Set the x-axis limits to only show up to December

ax.set_xlim(left=0, right=13)

# Set the axis labels and title

ax.set_xlabel('Month')

ax.set_ylabel('Total unit sales')

ax.set_title('Trend in sales of all four products by month')

# Show the plot


plt.show()

month_plot()

We can observe that Feb and Dec have the lowest sales for each product

For P1 We can observe Mar - Jul having the highest unit sales

For P2 We can observe Jan, Mar - Aug having the highest unit sales

For P3 We can observe May & Sep having the highest unit sales

For P4 We can observe uniform sales from Jan - Dec

Conclusion

# Unit Sales 2011 - 2022

P1 has the highest unit sales for each year. And it's highest is in year 2014.
We can observe that P4 has the lowest unit sales of all the products.

# Revenues 2011 - 2022

We can observe that P3 brought in the most revenue. This could be as a result of multiple
things:
- P3 was sold for higher than the rest, as it had the second highest unit sales for each year.

We can observe than P1 and P2 brought in similar revenues for each year. With P2 bringing
in slightly more.

- P1 despite having the most unit sold, brought in the second lowest revenue each year.

# Average Month Sales 2011 - 2022

We can observe that all Products unit sales drop in Feb.

We can observe that Feb and Dec have the lowest sales for each product

For P1 We can observe Mar - Jul having the highest unit sales

For P2 We can observe Jan, Mar - Aug having the highest unit sales

For P3 We can observe May & Sep having the highest unit sales

For P4 We can observe uniform sales from Jan - Dec

Dataset 2: Marketing Analytics Dataset


STEP 1: BUSINESS ISSUE UNDERSTANDING

Defining Business Objectives

Define the business objectives of the data analysis project: To identify weaknesses in the
Marketing strategy and provide solutions to address those weaknesses.

 Identify specific questions that the data needs to answer in order to achieve the
business objectives:

 Which sales regions should have more focused marketing efforts?


 Which communication channels should be prioritized to advertise products to
customers?

 Which products are the best and worst sellers?

 What adjustments should be made to the Marketing campaign to drive sales?

 Which customer segments buy the most (analyze by age, gender, country,
etc.)?

Collecting Initial Data

The data is collected from the website:

https://www.kaggle.com/code/xleong3/business-analysis-on-marketing-data/
notebook#Introduction

Determining Appropriate Data Analysis Methods

Based on the objective, type of data, and characteristics, choose an appropriate analysis
method. Refer to summary tables or instructional materials to compare different methods.

Some common methods include:

Descriptive analysis: Using statistical indices such as measures of central tendency


like mean, median, mode, and measures of variability like minimum, maximum,
frequency, percentile, etc.

Exploratory analysis: Calculating and determining indices such as standard deviation,


normal distribution, variance, mean, etc. Drawing charts like Histograms, Boxplots,
Scatter plots, and Heatmaps.

Predictive analysis: Predicting values or events in the future based on historical data.

 Classification Model: A model that categorizes data using machine learning


and AI to make predictions for the future. In Marketing, classification models
can be used to categorize users into different customer groups based on factors
like age, gender, interests, etc.

 Time Series Model: This forecasting and analysis method is highly preferred
by brands to identify patterns over time. In Marketing, time series models are
used for data visualization. They provide marketers with detailed and useful
information about seasonal trends and cyclic behavior. This model can be used
to predict potential changes in data.

 Anomaly Detection Model: This model is used to identify abnormal or outlier


data points in a dataset.
 Forecasting Model: Often seen as an extension of classification models, this
machine learning model predicts future values based on past data. In
Marketing, forecasting models can manage multiple parameters
simultaneously, making them the most widely used predictive analysis models.
They are used to predict website traffic, download numbers, app revenue, etc.,
even when historical data is unavailable.

 Clustering Model: A machine learning model that divides data into groups
with high similarity based on specific characteristics and attributes. In
Marketing, clustering models can be used to analyze data about customer
groups with similar interests, common characteristics, etc. This analysis helps
advertisers identify common characteristics of a specific customer group,
enabling them to devise suitable advertising strategies.

Machine Learning Analysis: Using machine learning algorithms to automatically


learn from data and make predictions or decisions.

 Recommendation Engines: These programs analyze your preferences and


recommend movies, products, etc., making you happier and more engaged.

 Dynamic Pricing: These programs allow airlines and online retailers to adjust
prices based on customer demand, competition, and other factors, maximizing
profits.

 Customer Segmentation: Companies use intelligent systems to analyze


customer data and segment them. This helps them prepare and deliver
information that each group desires, attracting customers.

 Image and Video Recognition: These tools can scan images and videos on the
web to identify references to brands. They can recognize logos in videos,
informing companies about their popularity levels.

Scope of Work:

 The project will be carried out within a one-month timeframe (from July 2024 to
August 2024).
 The project will be based on a data file named "business analysis on marketing data"
 Identify the necessary tools and resources to complete the project: Use PowerBI,
Excel, Visual Studio, SSMS, SSIS, etc.

STEP 2: DATA UNDERSTANDING

Description:

Customer
ID Customer's unique identifier

Year_Birth Customer's birth year

Education Customer's education level

Marital_Status Customer's marital status

Income Customer's yearly household income

Kidhome Number of children in customer's household

Teenhome Number of teenagers in customer's household

Dt_Customer Date of customer's enrollment with the company

Recency Number of days since customer's last purchase

MntWines Amount spent on wine in the last 2 years

MntFruits Amount spent on fruits in the last 2 years

MntMeatProducts Amount spent on meat in the last 2 years

MntFishProducts Amount spent on fish in the last 2 years

MntSweetProducts Amount spent on sweets in the last 2 years

MntGoldProds Amount spent on gold in the last 2 years

NumDealPurchases Number of purchases made with a discount

NumWebPurchases Number of purchases made through the company's web site

NumCatalogPurchases Number of purchases made using a catalogue

NumStorePurchases Number of purchases made directly in stores

NumWebVisitsMonth Number of visits to company's web site in the last month

AcceptedCmp3 1 if customer accepted the offer in the 3rd campaign, 0


otherwise

AcceptedCmp4 1 if customer accepted the offer in the 4th campaign, 0


otherwise

AcceptedCmp5 1 if customer accepted the offer in the 5th campaign, 0


otherwise
AcceptedCmp1 1 if customer accepted the offer in the 1st campaign, 0 otherwise

AcceptedCmp2 1 if customer accepted the offer in the 2nd campaign, 0


otherwise

Response 1 if customer accepted the offer in the last campaign, 0


otherwise

Complain 1 if customer complained in the last 2 years, 0 otherwise

Country Customer's location

Understanding:

This data set provides a comprehensive picture of the customers of company XYZ. By
analyzing the variables, we can draw many important insights about customer behavior and
characteristics.

From a demographic perspective, data on year of birth, education level, and marital status
helps identify different customer segments. For example, customers with high income, good
education, and who are married may have different needs and shopping behaviors compared
to younger, single customers with lower incomes. Analyzing these demographic
characteristics in depth will help the company design marketing strategies and products
tailored to each segment.

In addition, data on the number of children and adolescents in the customer's household also
provides important information about the household structure and its influence on shopping
behavior. Families with more children may have different needs compared to families
without children or with only one child.

Data on the number of purchases through different channels (web, store, catalog) along with
the monthly website visit count shows the most effective customer access channels. This will
help company XYZ optimize the distribution channels and customer experience.

Finally, information on the amount spent on different product groups (alcohol, fruits, meat,
fish, sweets, gold products) provides deep insights into customer consumption behavior. This
data can be used to identify products/services suitable for each customer segment, thereby
enhancing the effectiveness of marketing and sales campaigns.

You might also like