Group9 DataProject
It often follows descriptive analytics, which focuses on what has happened in the past.
Diagnostic analytics relies on information from descriptive analytics to proceed, as we need
to know what happened before we can ask why it happened. This is followed in turn by
prescriptive analytics, which focuses on what to do in the future. Following the order of
“what?” then “why?” then “what next?” is a sensible way to do data analytics, as we need to
know what happened and why before we can decide what to do next.
One example of diagnostic analytics is a marketing funnel analysis. This means looking at
the set of steps that a user might take before reaching a final goal, such as a conversion or a
sale, and understanding why they do or don’t complete each step. For example, before a user
reaches the goal of a purchase, they may reach a series of intermediate goals such as visiting
our website, adding an item to their shopping cart, and clicking the “checkout” button.
Diagnostic analytics allows us to analyze why people are not converting or purchasing by
looking at which steps they were at when they dropped off, and inferring why.
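The drop-off idea above can be sketched in a few lines of Python. This is only an illustration: the step names, the `funnel_report` helper, and the user data are all made up, and it assumes we already know the furthest step each user reached.

```python
from collections import Counter

# Hypothetical funnel steps, in order; the names are illustrative only.
FUNNEL_STEPS = ["visit", "add_to_cart", "checkout", "purchase"]

def funnel_report(furthest_step_per_user):
    """Return (step, users_reached, drop_off_rate_to_next_step) rows."""
    furthest = Counter(furthest_step_per_user)
    # A user whose furthest step is later in the funnel also reached every earlier step.
    reached = []
    remaining = sum(furthest.values())
    for step in FUNNEL_STEPS:
        reached.append(remaining)
        remaining -= furthest[step]
    report = []
    for i, step in enumerate(FUNNEL_STEPS):
        if i + 1 < len(FUNNEL_STEPS) and reached[i] > 0:
            drop = 1 - reached[i + 1] / reached[i]
        else:
            drop = None  # no next step to drop off to
        report.append((step, reached[i], drop))
    return report

# Made-up data: the furthest step each of 10 users reached.
users = ["visit"] * 4 + ["add_to_cart"] * 3 + ["checkout"] * 1 + ["purchase"] * 2
report = funnel_report(users)
for step, count, drop in report:
    print(step, count, drop)
```

The step with the largest drop-off rate is where the "why?" question is usually worth asking first; the numbers alone do not say why users left.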
Why do we use this method?
We use diagnostic research methods to gain insights into the reasons behind sales trends and
the root causes of issues within the business. Additionally, this method helps identify
weaknesses in the sales process, thereby enhancing operational efficiency and enabling more
accurate analyses and predictions. Furthermore, we can strategize and create a competitive
advantage to make the most precise and informed decisions, optimizing revenue and profit.
Diagnostic analytics can’t predict the future or suggest what should be done. It can only
explain why something happened. Any further information has to come either from a
knowledgeable person making educated guesses, or from predictive analytics (what is likely
to happen) and prescriptive analytics (what we should do).
Diagnostic analytics doesn’t give definitive answers. It can’t tell us that A definitely caused
B, only that a certain percentage of people who encountered event A did (or did not)
encounter event B. The accuracy of outcomes can be improved, however, with better-quality
data, larger data sets, and the involvement of domain experts in interpreting the data.
Next, we will need to prepare our data by cleaning it (which may involve removing defective
data or duplicates), transforming it into a useful format, and loading it into a single location
such as a data warehouse. We can also filter the data so that only what is relevant is left for
the analysis, or do data drilling, which involves looking at hierarchical data at a higher or
lower level — so “drilling down” is when we access data at a deeper, more granular level
than before. For example, when looking at how potential customers have responded to a
particular marketing campaign, we might drill down to see how those who live in a particular
region responded. Once our data is prepared, we can use one of the diagnostic analytics
techniques below.
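As a rough illustration of drilling down, the sketch below computes a campaign response rate at a high level and then at the more granular regional level. The table, its column names (`campaign`, `region`, `responded`) and its values are assumptions made up for this example.

```python
import pandas as pd

# Made-up campaign responses; column names are assumptions for illustration.
responses = pd.DataFrame({
    "campaign": ["spring"] * 6,
    "region": ["North", "North", "North", "South", "South", "South"],
    "responded": [1, 0, 1, 0, 0, 1],
})

# Higher level: response rate for the whole campaign.
overall = responses.groupby("campaign")["responded"].mean()

# Drill down: the same measure, split by region within the campaign.
by_region = responses.groupby(["campaign", "region"])["responded"].mean()

print(overall)
print(by_region)
```

Here the overall rate hides the fact that the two regions responded differently, which is the kind of detail drilling down is meant to expose.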
Finally, we will need to create some data visualizations to use when communicating our
results to any interested stakeholders. If our analytics need to be run regularly, we should
automate the above steps and run them regularly against our production data, which is known as
operationalizing our analytics. As we formalize our diagnostic analytics steps, it will be
useful to refer to the data analytics lifecycle, which covers all the necessary steps including
operationalizing our analytics.
Once we are comfortable posing questions, forming hypotheses, and using our data to
support or disprove them, we can get creative. Less-proven data sets, or data from third
parties, can be introduced to see if they can yield any additional depth or experimental
insights from our diagnostic analytics process.
Statistical analysis:
Statistical analysis is the process of finding trends and patterns in data through the use of
statistical models. These models are also used to help work out which relationships between
variables are the most important. Some common statistical models for diagnostic analytics
are:
Correlation analysis: finding out if there is a relationship between two or more specified
variables or data sets and investigating how strong that relationship is.
Regression analysis: working out which variables affect the thing we’re interested in the
most (for example, sales).
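A minimal sketch of both models, using NumPy only. The variables (`ad_spend`, `discount`, `sales`) and their values are invented, and `sales` is constructed from a known formula so that the regression result can be checked by eye; real data would of course be noisier.

```python
import numpy as np

# Made-up monthly figures, for illustration only.
ad_spend = np.array([10.0, 12.0, 15.0, 18.0, 20.0, 25.0])
discount = np.array([0.0, 5.0, 0.0, 10.0, 5.0, 10.0])
# Constructed so the true effects are known: 3.0 per unit of ad spend, 0.5 per unit of discount.
sales = 3.0 * ad_spend + 0.5 * discount + 20.0

# Correlation analysis: Pearson correlation between each variable and sales.
corr_ad = np.corrcoef(ad_spend, sales)[0, 1]
corr_discount = np.corrcoef(discount, sales)[0, 1]

# Regression analysis: ordinary least squares with an intercept column.
X = np.column_stack([np.ones_like(ad_spend), ad_spend, discount])
coefficients, *_ = np.linalg.lstsq(X, sales, rcond=None)
intercept, ad_effect, discount_effect = coefficients

print(corr_ad, corr_discount)
print(intercept, ad_effect, discount_effect)
```

Correlation tells us how strongly each variable moves with sales on its own, while the regression coefficients estimate each variable's effect with the other held fixed.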
Machine learning:
Machine learning algorithms can also be used in diagnostic analytics, for example:
Binary classification: making a decision about something that only has two possible
answers. The answers to questions like “is the weather affecting sales or not?” and
“do customers like our new product or not?” can only be yes or no.
Time-series analysis: looking at trends (or data) that change over time. This could be
to answer a question such as why sales decreased in the previous quarter.
While machine learning techniques are useful, humans with domain-specific
knowledge are still needed to provide context to the outcomes of diagnostic analytics.
For example, an expert might realize that a credit card customer making regular
withdrawals of a similar amount suggests that they may be using one credit card to
pay down another, and are thus a risk, whereas a machine might not have the context
to be able to understand what this unusual pattern means.
Use Python libraries such as Pandas to load the data into a DataFrame.
Check for missing values, data types, and inconsistencies.
Handle missing values appropriately (e.g., imputation, removal).
Convert date fields to datetime objects.
Ensure numerical fields (Q1, Q2, Q3, Q4, S1, S2, S3, S4) are correctly typed.
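The preparation steps above might look roughly like this in Pandas. The sketch reads a tiny inline CSV instead of the real file, uses only the `Date`, `Q1` and `S1` columns, and assumes dates are written as DD-MM-YYYY; apart from the first row, the values are made up.

```python
import io
import pandas as pd

# Tiny inline stand-in for the real file (assumption: same column names and date format).
raw = io.StringIO(
    "Date,Q1,S1\n"
    "13-06-2010,5422,17187.74\n"
    "14-06-2010,,\n"
    "15-06-2010,6100,19337.00\n"
    "15-06-2010,6100,19337.00\n"
)
df = pd.read_csv(raw)

# Check for missing values and data types.
missing_per_column = df.isna().sum()
print(missing_per_column)
print(df.dtypes)

# Handle missing values: here incomplete rows are removed, along with exact duplicates.
# Imputation, e.g. df.fillna(df.median(numeric_only=True)), is the alternative.
df = df.dropna().drop_duplicates()

# Convert the date field to datetime; unparseable dates become NaT instead of raising.
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y", errors="coerce")

# Ensure the numerical fields are numerically typed.
numeric_cols = ["Q1", "S1"]
df[numeric_cols] = df[numeric_cols].apply(pd.to_numeric, errors="coerce")
```

Whether to remove or impute missing rows depends on how many there are and on whether gaps in a daily series would distort later time-series steps.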
Analyze historical sales and revenue data to identify trends and patterns.
Use time series forecasting methods (e.g., ARIMA, Prophet) to predict future sales and
revenue.
Provide a yearly estimate for sales and revenue for 2024, including confidence intervals for
accuracy.
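ARIMA and Prophet are the methods named above. The sketch below deliberately swaps in something much simpler, a straight-line trend fitted with NumPy, only to show the shape of "a yearly estimate plus an interval". The `linear_trend_forecast` helper and the yearly totals are hypothetical, and the band is a rough one (1.96 residual standard deviations), not a proper prediction interval.

```python
import numpy as np

def linear_trend_forecast(years, totals, target_year, z=1.96):
    """Extrapolate a straight-line trend; return (estimate, lower, upper)."""
    years = np.asarray(years, dtype=float)
    totals = np.asarray(totals, dtype=float)
    # Centre the years so the fit is numerically well behaved.
    centre = years.mean()
    slope, intercept = np.polyfit(years - centre, totals, deg=1)
    residuals = totals - (slope * (years - centre) + intercept)
    # Rough band: z times the residual standard deviation (ddof=2 for two fitted parameters).
    spread = z * residuals.std(ddof=2)
    estimate = slope * (target_year - centre) + intercept
    return estimate, estimate - spread, estimate + spread

# Made-up yearly totals; only complete years should be used for a yearly trend.
years = [2019, 2020, 2021, 2022]
totals = [100.0, 110.0, 125.0, 130.0]
estimate, lower, upper = linear_trend_forecast(years, totals, 2024)
print(estimate, lower, upper)
```

A straight line ignores seasonality and autocorrelation, which is exactly what ARIMA or Prophet would add on the daily data.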
Identify the tools and resources needed to complete the project: Use PowerBI,
Excel, Visual Studio, SSMS, SSIS, etc.
Python: For overall data processing and analysis.
Pandas: For data manipulation and analysis.
Matplotlib/Seaborn: For data visualization.
Scikit-learn/Statsmodels: For statistical analysis and modeling.
Prophet: For time series forecasting.
2. DATA UNDERSTANDING
2.1 Data information
The data set records daily information on 4 products, from 13-06-2010 to 02-02-2023.
The data file contains 8 numerical parameters:
Q1- Total unit sales of product 1
Q2- Total unit sales of product 2
Q3- Total unit sales of product 3
Q4- Total unit sales of product 4
S1- Total revenue from product 1
S2- Total revenue from product 2
S3- Total revenue from product 3
S4- Total revenue from product 4
Example :
On 13-06-2010, 5422 units of product 1 were sold, generating INR 17187.74 in revenue
from product 1.
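import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('statsfinal.csv')  # load the data set
data.head(-1)  # show every row except the last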
4599 rows × 9 columns
The first entry in the data is dated 13-06-2010, so the data for 2010 is not complete.
The last entry in the data is dated 02-02-2023, so the data for 2023 is also not complete.
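One way to spot incomplete years programmatically is to count the days present in each year. The sketch below uses a full daily calendar over the stated date range as a stand-in for the real `Date` column, which is an assumption; the actual file may have gaps.

```python
import pandas as pd

# Stand-in for the parsed 'Date' column: every day in the stated range.
dates = pd.date_range("2010-06-13", "2023-02-02", freq="D")

# Count how many days each year contributes.
days_per_year = pd.Series(dates.year).value_counts().sort_index()

# A year with fewer than 365 days of data is incomplete.
incomplete_years = days_per_year[days_per_year < 365].index.tolist()
print(incomplete_years)
```

Incomplete years are usually excluded from year-to-year comparisons of totals, since their sums are not comparable with those of full years.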
Step 4: EDA
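# Extract 'Day', 'Month' and 'Year' from the 'Date' column (DD-MM-YYYY) using lambda functions
# We need to get the year from the data to analyse sales year to year
data['Day'] = data['Date'].apply(lambda x: int(x.split('-')[0]))
data['Month'] = data['Date'].apply(lambda x: int(x.split('-')[1]))
data['Year'] = data['Date'].apply(lambda x: int(x.split('-')[2]))
data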
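# Graph the TOTAL and MEAN units sold for each product using a bar chart
# Create a function that allows us to plot a bar chart for the 4 products
# (signature assumed from the names used in the body: df, columns, stri, str1, val)
def plot_bar_chart(df, columns, stri, str1, val):
    # Aggregate each product column per year, either as a total or as a mean
    if val == 'sum':
        sales_by_year = df.groupby('Year')[columns].sum().reset_index()
    elif val == 'mean':
        sales_by_year = df.groupby('Year')[columns].mean().reset_index()
    else:
        raise ValueError("val must be 'sum' or 'mean'")
    plt.figure(figsize=(20,4))
    # Assumed plotting call: grouped bars, one bar per product for each year
    sales_by_year.plot(x='Year', y=columns, kind='bar', ax=plt.gca())
    plt.xlabel('Year')
    plt.ylabel(stri)
    plt.title(f'{stri} by {str1}')
    plt.xticks(rotation=45)
    plt.show()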
#use the plot_bar_chart function, enter the Unit Sales Columns and the Unit Sales string
We can observe that P4 has the lowest unit sales of all the products.
#use the plot_bar_chart function, enter the Revenue Columns and the Revenue string
We can observe that P3 brought in the most revenue. This could be the result of several
things:
- P3 was sold at a higher price than the rest, since it brought in the most revenue with only
the second-highest unit sales each year.
We can observe that P1 and P2 brought in similar revenues each year, with P2
bringing in slightly more.
- Despite having the most units sold, P1 brought in the second-lowest revenue each year.
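def month_plot():
    fig, ax = plt.subplots()
    # Assumed plotting call: total unit sales per month, one line per product
    # (column names Q1-Q4 as listed in section 2.1)
    data.groupby('Month')[['Q1', 'Q2', 'Q3', 'Q4']].sum().plot(ax=ax)
    ax.set_xlim(left=0, right=13)
    ax.set_xlabel('Month')
    plt.show()

month_plot()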
We can observe that Feb and Dec have the lowest sales for each product.
For P1, Mar to Jul have the highest unit sales.
For P2, Jan and Mar to Aug have the highest unit sales.
For P3, May and Sep have the highest unit sales.
Conclusion
P1 has the highest unit sales in every year, peaking in 2014.
P4 has the lowest unit sales of all the products.
P3 brought in the most revenue. This could be the result of several things:
- P3 was sold at a higher price than the rest, since it brought in the most revenue with only
the second-highest unit sales each year.
P1 and P2 brought in similar revenues each year, with P2 bringing in slightly more.
- Despite having the most units sold, P1 brought in the second-lowest revenue each year.
Feb and Dec have the lowest sales for each product.
For P1, Mar to Jul have the highest unit sales.
For P2, Jan and Mar to Aug have the highest unit sales.
For P3, May and Sep have the highest unit sales.
Define the business objectives of the data analysis project: To identify weaknesses in the
Marketing strategy and provide solutions to address those weaknesses.
Identify specific questions that the data needs to answer in order to achieve the
business objectives:
Which customer segments buy the most (analyze by age, gender, country,
etc.)?
https://www.kaggle.com/code/xleong3/business-analysis-on-marketing-data/notebook#Introduction
Based on the objective, type of data, and characteristics, choose an appropriate analysis
method. Refer to summary tables or instructional materials to compare different methods.
Predictive analysis: Predicting values or events in the future based on historical data.
Time Series Model: This forecasting and analysis method is widely used by
brands to identify patterns over time. In Marketing, time series models are
often combined with data visualization to give marketers detailed and useful
information about seasonal trends and cyclic behavior. This model can be used
to predict potential changes in data.
Clustering Model: A machine learning model that divides data into groups
with high similarity based on specific characteristics and attributes. In
Marketing, clustering models can be used to analyze data about customer
groups with similar interests, common characteristics, etc. This analysis helps
advertisers identify common characteristics of a specific customer group,
enabling them to devise suitable advertising strategies.
Dynamic Pricing: These programs allow airlines and online retailers to adjust
prices based on customer demand, competition, and other factors, maximizing
profits.
Image and Video Recognition: These tools can scan images and videos on the
web to identify references to brands. They can recognize logos in videos,
informing companies about their popularity levels.
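To make the clustering model in the list above more concrete, here is a small sketch of k-means (Lloyd's algorithm) written in plain NumPy. The `kmeans` helper, the two features and the customer values are made up for illustration; in practice a library implementation such as scikit-learn's `KMeans` would normally be used, and features on different scales would be standardized first.

```python
import numpy as np

def kmeans(points, k, n_iter=20):
    """Plain k-means (Lloyd's algorithm); returns (labels, centroids)."""
    points = np.asarray(points, dtype=float)
    # Deterministic start for this sketch: the first k points are the initial centroids.
    centroids = points[:k].copy()
    for _ in range(n_iter):
        # Assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Made-up customers: (yearly income in thousands, purchases per year).
customers = [[20, 2], [80, 15], [22, 3], [85, 14], [25, 2], [90, 16]]
labels, centroids = kmeans(customers, k=2)
print(labels)
print(centroids)
```

Each resulting group can then be described by its centroid, which gives advertisers a compact picture of the "typical" customer in that segment.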
Scope of Work:
The project will be carried out within a one-month timeframe (from July 2024 to
August 2024).
The project will be based on a data file named "business analysis on marketing data"
Identify the necessary tools and resources to complete the project: Use PowerBI,
Excel, Visual Studio, SSMS, SSIS, etc.
Description:
Customer
ID Customer's unique identifier
Understanding:
This data set provides a comprehensive picture of the customers of company XYZ. By
analyzing the variables, we can draw many important insights about customer behavior and
characteristics.
From a demographic perspective, data on year of birth, education level, and marital status
helps identify different customer segments. For example, customers with high income, good
education, and who are married may have different needs and shopping behaviors compared
to younger, single customers with lower incomes. Analyzing these demographic
characteristics in depth will help the company design marketing strategies and products
tailored to each segment.
In addition, data on the number of children and adolescents in the customer's household also
provides important information about the household structure and its influence on shopping
behavior. Families with more children may have different needs compared to families
without children or with only one child.
Data on the number of purchases through different channels (web, store, catalog) along with
the monthly website visit count shows the most effective customer access channels. This will
help company XYZ optimize the distribution channels and customer experience.
Finally, information on the amount spent on different product groups (alcohol, fruits, meat,
fish, sweets, gold products) provides deep insights into customer consumption behavior. This
data can be used to identify products/services suitable for each customer segment, thereby
enhancing the effectiveness of marketing and sales campaigns.
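The channel and spending questions above could be explored along the following lines. Every column name and value in this sketch is hypothetical, since the real column names of the data set are not listed here.

```python
import pandas as pd

# Made-up rows; every column name here is hypothetical, not taken from the real file.
customers = pd.DataFrame({
    "marital_status": ["Married", "Single", "Married", "Single"],
    "children": [2, 0, 1, 0],
    "web_purchases": [4, 6, 2, 8],
    "store_purchases": [10, 3, 8, 1],
    "spend_meat": [200.0, 50.0, 100.0, 30.0],
    "spend_sweets": [20.0, 40.0, 10.0, 60.0],
})

# Which purchase channel is used most overall?
channel_totals = customers[["web_purchases", "store_purchases"]].sum()
top_channel = channel_totals.idxmax()

# Average spend per product group for each demographic segment.
spend_by_segment = customers.groupby("marital_status")[["spend_meat", "spend_sweets"]].mean()

print(top_channel)
print(spend_by_segment)
```

The same pattern extends to other segment definitions, such as education level or number of children, by changing the grouping column.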