Marketing Analysis Project - PowerBI
Key Points
Reduced Customer Engagement – The number of customer interactions and engagement with the site and marketing content has declined.
Decreased Conversion Rates – Fewer site visitors are converting into paying customers.
High Marketing Expenses – Significant investments in marketing campaigns are not yielding expected returns.
Need for Customer Feedback Analysis – Understanding customer opinions about products and services is crucial for improving engagement and conversions.
Goals
Tools
SQL
Python
pandas
pyodbc
nltk
SentimentIntensityAnalyzer
vader_lexicon
PowerBI Desktop
PowerBI Service
Extract and Clean data with SQL
This step involves walking through the various SQL transformations applied to each dataset table, preparing our data for visualization in Power BI.
The following table shows us the tables we’ll be working with, their fields
and DataTypes.
Dimension table – Products
We’ll be dropping the Product Category column, as it has only one category – Sports.
We’ll be labeling the products by their price using CASE. The thresholds shown in the code below are illustrative (it would be worth checking whether a metric such as the average +/- one standard deviation would be a better way to label products). This new column may prove useful, so we simply add it.
There are other transformations and columns we could add, but we’ll leave it at this. It’s a matter of thinking it through and, if it makes sense, creating them.
SELECT
    ProductID,  -- Selects the unique identifier for each product
    ProductName,  -- Selects the name of each product
    Price,  -- Selects the price of each product
    -- Category, -- Dropped, as it holds only one category (Sports)
    CASE  -- Labels each product by its price range (NOTE: the thresholds here are illustrative, not the project's exact values)
        WHEN Price < 50 THEN 'Low'
        WHEN Price BETWEEN 50 AND 200 THEN 'Medium'
        ELSE 'High'
    END AS PriceCategory
FROM
    products;  -- Specifies the source table from which to select the data
Dimension table – Customers
SELECT
    c.CustomerID,  -- Selects the unique identifier for each customer
    c.CustomerName,  -- Selects the name of each customer
    c.Email,  -- Selects the email of each customer
    c.Gender,  -- Selects the gender of each customer
    c.Age,  -- Selects the age of each customer
    g.Country,  -- Selects the country from the geography table to enrich customer data
    g.City  -- Selects the city from the geography table to enrich customer data
FROM
    customers AS c  -- Specifies the alias 'c' for the dim_customers table
LEFT JOIN
    geography g  -- Specifies the alias 'g' for the dim_geography table
ON
    c.GeographyID = g.GeographyID;  -- Joins the two tables on the GeographyID field to match customers with their geographic information
Again, there are other transformations and columns we could think about creating that might make sense, like binning ages, but we won’t be doing that. We can also obtain that directly inside Power BI (best practice says that, for performance gains, new columns should be created in the source system whenever possible).
Facts table – Customer_Reviews
Here we identified that there is double spacing between words, and we’re removing the extra space using REPLACE.
NOTE: This is the table on which we’ll focus our sentiment analysis using Python. For the moment we’ll simply collapse the double spaces into single spaces.
This is because one of our objectives is to try and understand why customer engagement and satisfaction have been decreasing, so we’ll be enriching this table with data that can help explain those issues.
SELECT
    ReviewID,  -- Selects the unique identifier for each review
    CustomerID,  -- Selects the unique identifier for each customer
    ProductID,  -- Selects the unique identifier for each product
    ReviewDate,  -- Selects the date when the review was written
    Rating,  -- Selects the numerical rating given by the customer (e.g., 1 to 5 stars)
    -- Cleans up the ReviewText by replacing double spaces with single spaces to ensure the text is more readable and standardized
    REPLACE(ReviewText, '  ', ' ') AS ReviewText
FROM
    dbo.customer_reviews;  -- Specifies the source table from which to select the data
Facts table – Customer_Journey
Below is a view of 10 rows from the original table (we’ll include examples of some issues we found in the records).
Using ROW_NUMBER, we can retrieve all rows and, where duplicates exist, assign a sequential number to the duplicate rows (this is the approach used in the final query below).
A different way to find duplicates is to use the following query.
SELECT
    JourneyID,
    CustomerID,
    ProductID,
    VisitDate,
    Stage,
    Action,
    Duration,
    COUNT(*) AS DuplicateCount
FROM customer_journey
GROUP BY JourneyID, CustomerID, ProductID, VisitDate, Stage, Action, Duration
HAVING COUNT(*) > 1  -- Keeps only the groups that appear more than once, i.e. the duplicates
ORDER BY JourneyID;
Discovering the duplicates and addressing them are two distinct actions. As we’ll be importing the data into Power BI later, we have to find those duplicates and then ensure that we only import non-duplicate records. We’ll get there.
We also noticed that there are NULL values in the Duration column. For this exercise we’ll be replacing those values with a calculated value.
There are other approaches to addressing NULLs in our data, both different techniques to handle them and different metrics that, depending on our analysis, could make more sense. For this case we’ll proceed with replacing them with an average duration, calculated from the values we have available in our table.
In this case we opted to calculate the average duration of the visits for each visit date. This is the syntax used to calculate that average duration:

AVG(Duration) OVER (PARTITION BY VisitDate) AS avg_duration

Then we’ll replace the NULL values with this average using the following syntax:

COALESCE(Duration, avg_duration) AS Duration
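As an aside, the same per-date average imputation can also be expressed in pandas. A self-contained sketch (the dataframe name and sample values here are made up purely for illustration):

import pandas as pd

# Hypothetical dataframe mirroring the customer_journey table (illustrative values only)
customer_journey_df = pd.DataFrame({
    'VisitDate': ['2025-01-01', '2025-01-01', '2025-01-02'],
    'Duration': [120.0, None, 90.0],
})

# Fill missing durations with the average duration for the same visit date
customer_journey_df['Duration'] = customer_journey_df['Duration'].fillna(
    customer_journey_df.groupby('VisitDate')['Duration'].transform('mean')
)

print(customer_journey_df)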
These were the issues we found in this table that really needed to be addressed.
Putting it all together, we have the final SQL query that addresses them:
SELECT
    JourneyID,  -- Selects the unique identifier for each journey to ensure data traceability
    CustomerID,  -- Selects the unique identifier for each customer to link journeys to specific customers
    ProductID,  -- Selects the unique identifier for each product to analyze customer interactions with different products
    VisitDate,  -- Selects the date of the visit to understand the timeline of customer interactions
    Stage,  -- Uses the uppercased stage value from the subquery for consistency in analysis
    Action,  -- Selects the action taken by the customer (e.g., View, Click, Purchase)
    COALESCE(Duration, avg_duration) AS Duration  -- Replaces missing durations with the average duration for the corresponding date
FROM
    (
        -- Subquery to process and clean the data
        SELECT
            JourneyID,  -- Selects the unique identifier for each journey to ensure data traceability
            CustomerID,  -- Selects the unique identifier for each customer to link journeys to specific customers
            ProductID,  -- Selects the unique identifier for each product to analyze customer interactions with different products
            VisitDate,  -- Selects the date of the visit to understand the timeline of customer interactions
            UPPER(Stage) AS Stage,  -- Converts Stage values to uppercase for consistency in data analysis
            Action,  -- Selects the action taken by the customer (e.g., View, Click, Purchase)
            Duration,  -- Uses Duration directly, as it's already a numeric type
            AVG(Duration) OVER (PARTITION BY VisitDate) AS avg_duration,  -- Calculates the average duration for each date, using only numeric values
            ROW_NUMBER() OVER (
                PARTITION BY CustomerID, ProductID, VisitDate, UPPER(Stage), Action  -- Groups by these columns to identify duplicate records
                ORDER BY JourneyID  -- Orders by JourneyID to keep the first occurrence of each duplicate
            ) AS row_num  -- Assigns a row number to each row within the partition to identify duplicates
        FROM
            customer_journey  -- Specifies the source table from which to select the data
    ) AS subquery  -- Names the subquery for reference in the outer query
WHERE
    row_num = 1;  -- Keeps only the first occurrence of each duplicate group identified in the subquery
Sentiment Analysis with Python
For this project, we’ll utilize several Python libraries to perform sentiment analysis and connect to our data. Before importing these libraries into the project we need to install them (for example with pip).

import pandas as pd
import pyodbc
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the VADER lexicon for sentiment analysis if not already present.
# This is another component we'll need during the exercise. It's part of the nltk library, so it's just a matter of bringing it into play.
nltk.download('vader_lexicon')
# Define a function to fetch data from a SQL database using a SQL query
def fetch_data_from_sql():
    # Define the connection string with parameters for the database connection
    conn_str = (
        "Driver={ODBC Driver 17 for SQL Server};"  # Specify the driver for SQL Server
        "Server=DESKTOP-PC06D3B;"  # Specify your SQL Server instance
        "Database=PortfolioProject_MarketingAnalytics;"  # Specify the database name
        "Trusted_Connection=yes;"  # Use Windows Authentication for the connection
    )
    # Establish the connection to the database
    conn = pyodbc.connect(conn_str)
    # The query below is an assumption: it fetches the review columns cleaned in the SQL step above
    query = "SELECT ReviewID, CustomerID, ProductID, ReviewDate, Rating, ReviewText FROM dbo.customer_reviews"
    # Execute the query and load the results into a pandas dataframe
    df = pd.read_sql(query, conn)
    # Close the connection once the data has been fetched
    conn.close()
    return df

# Fetch the review data into our working dataframe
customer_reviews_df = fetch_data_from_sql()
As we can see, our dataframe is populated with the data we queried from
the database.
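If you’re following along without the screenshots, a quick way to confirm this is to preview the first rows of the dataframe:

# Preview the first few rows of the dataframe to confirm the data was loaded
print(customer_reviews_df.head())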
Create new sentiment analysis columns
Next we’ll create 3 new columns for our dataframe (table, if you wish). These 3 new columns are the ones that will enrich our data so we can perform further analysis. If you’re like me and quite new to Python, then looking at the final Python code may confuse you.
So we’ll decompose it and join the related parts so you can understand the process.
In Python, a common workflow is to first define the functions that
encapsulate the logic or calculations needed for data transformations. After
defining the functions, we apply them to the dataframe columns to create
new fields.
SentimentScore
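First we define the function that computes the score. The exact definition isn’t reproduced in this write-up; a minimal sketch, assuming it simply returns VADER’s compound polarity score (as described below), could look like this:

# Initialize the VADER sentiment analyzer (assumes the nltk imports and vader_lexicon download above)
sia = SentimentIntensityAnalyzer()

# Define a function to calculate the sentiment score of a review text
def calculate_sentiment(review):
    # Get the polarity scores for the review text
    sentiment = sia.polarity_scores(review)
    # Return the compound score, a normalized value between -1 (most negative) and 1 (most positive)
    return sentiment['compound']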
Only when creating the new column (field) SentimentScore do we use the function to populate its values:

customer_reviews_df['SentimentScore'] = customer_reviews_df['ReviewText'].apply(calculate_sentiment)
So we create the column and, at the same time, pass the ReviewText field from our dataframe as the parameter (review) that the function calculate_sentiment uses to return the polarity score.
The same method and sequence of actions are applied for the other two
functions and columns that we’ll be creating. We’ll go through both next.
SentimentCategory
Again, first we define the function that will compute the values to populate
the new SentimentCategory column.
This function uses the SentimentScore calculated before and the Rating coming from our customer_reviews table to attribute a category to the sentiment. It uses if statements to create the categories.
def categorize_sentiment(score, rating):
    # Use both the text sentiment score and the numerical rating to determine the sentiment category
    if score > 0.05:  # Positive sentiment score
        if rating >= 4:
            return 'Positive'  # High rating and positive sentiment
        elif rating == 3:
            return 'Mixed Positive'  # Neutral rating but positive sentiment
        else:
            return 'Mixed Negative'  # Low rating but positive sentiment
    elif score < -0.05:  # Negative sentiment score
        if rating <= 2:
            return 'Negative'  # Low rating and negative sentiment
        elif rating == 3:
            return 'Mixed Negative'  # Neutral rating but negative sentiment
        else:
            return 'Mixed Positive'  # High rating but negative sentiment
    else:  # Neutral sentiment score
        if rating >= 4:
            return 'Positive'  # High rating with neutral sentiment
        elif rating <= 2:
            return 'Negative'  # Low rating with neutral sentiment
        else:
            return 'Neutral'  # Neutral rating and neutral sentiment
Again, when creating the column in our dataframe, we also define how it will be calculated by assigning the function to its calculation.

customer_reviews_df['SentimentCategory'] = customer_reviews_df.apply(
    lambda row: categorize_sentiment(row['SentimentScore'], row['Rating']), axis=1)

Without going too deep into the code, the lambda function operates per row, and axis=1 confirms that the function is applied to rows, not columns.
SentimentBucket
This column is merely to further demonstrate how we can enrich our data with fields that we think make sense. It doesn’t mean it will be used or super useful, but the idea is that we can enrich our data with information that may either be required or that we think can be useful.
This column divides the sentiment into 4 buckets – similar to the previous column, but another take on categorizing the sentiment. This function only uses SentimentScore and if statements to categorize the sentiments.
First we define how the function will operate
def sentiment_bucket(score):
    if score >= 0.5:
        return '0.5 to 1.0'  # Strongly positive sentiment
    elif 0.0 <= score < 0.5:
        return '0.0 to 0.49'  # Mildly positive sentiment
    elif -0.5 <= score < 0.0:
        return '-0.49 to 0.0'  # Mildly negative sentiment
    else:
        return '-1.0 to -0.5'  # Strongly negative sentiment
and then, when creating the column, we do what we’ve done for the other 2 columns:

customer_reviews_df['SentimentBucket'] = customer_reviews_df['SentimentScore'].apply(sentiment_bucket)
This ends the Python part of our project. As we can see, Python is a powerful tool to transform and enrich data, both with methods native to Python and with methods that produce the same result as other, more conventional tools.
Here is our dataframe with the data we created, ready to export as a .csv file and then import into Power BI.
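The export step itself isn’t shown above; a minimal sketch (the file name is an assumption – use whatever name and path fit your setup):

# Export the enriched dataframe to a CSV file for later import into Power BI
# NOTE: the file name below is an assumption, not the project's actual file name
customer_reviews_df.to_csv('customer_reviews_with_sentiment.csv', index=False)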
Full python code
# Define a function to categorize sentiment using both sentiment score and review rating
# Here we're defining a function that uses two fields (score and rating). Score is a field created by VADER. Rating comes from our customer_reviews table.
def categorize_sentiment(score, rating):
    # Use both the text sentiment score and the numerical rating to determine the sentiment category
    if score > 0.05:  # Positive sentiment score
        if rating >= 4:
            return 'Positive'  # High rating and positive sentiment
        elif rating == 3:
            return 'Mixed Positive'  # Neutral rating but positive sentiment
        else:
            return 'Mixed Negative'  # Low rating but positive sentiment
    elif score < -0.05:  # Negative sentiment score
        if rating <= 2:
            return 'Negative'  # Low rating and negative sentiment
        elif rating == 3:
            return 'Mixed Negative'  # Neutral rating but negative sentiment
        else:
            return 'Mixed Positive'  # High rating but negative sentiment
    else:  # Neutral sentiment score
        if rating >= 4:
            return 'Positive'  # High rating with neutral sentiment
        elif rating <= 2:
            return 'Negative'  # Low rating with neutral sentiment
        else:
            return 'Neutral'  # Neutral rating and neutral sentiment

# Here we're defining a function to create buckets populated with values based only on the score given to the text.
def sentiment_bucket(score):
    if score >= 0.5:
        return '0.5 to 1.0'  # Strongly positive sentiment
    elif 0.0 <= score < 0.5:
        return '0.0 to 0.49'  # Mildly positive sentiment
    elif -0.5 <= score < 0.0:
        return '-0.49 to 0.0'  # Mildly negative sentiment
    else:
        return '-1.0 to -0.5'  # Strongly negative sentiment
Import data into Power BI
The same process applies to the other tables. This first import is meant to show how we can query the SQL database from inside Power Query. Let’s start:
In the right pane, “Query settings”, we can see the applied steps. Those steps resulted in the import of the data shown above.
If we remove the “Navigation” step we’ll end up with no data, just seeing the names of the tables in the database. Then, clicking on the settings wheel, we get the import window again.
Clicking on “Advanced options”, the area for us to insert the SQL query appears.
We then paste our query in that area and click “OK”.
And there we have it. The data is imported with the transformations we did in SQL before.
Now we’ll replicate this step for every table and bring our transformed and enriched data into Power Query. After importing all tables into Power Query, it’s very important to check the data types with which Power Query imported the fields.
From here we can see that:
VisitDate has been imported as Date/Time when no time stamp exists;
ProductID has been imported as Decimal when Whole Number is probably better.
NOTE: Our enriched table with the sentiment analysis will also be imported, but via the .csv file we generated.
We’ll now do these verifications and prepare our data so we can continue to Power BI Desktop and start building our model.
From the get-go we should state that, while our data resembles real marketing data, it is still demo data. We will be building a comprehensive report to try to find out why some of the issues pointed out at the beginning may be occurring, keeping in mind that our basis for that is demo data.
Performing an initial time analysis on several metrics, we can see that the demo data doesn’t fully allow us to answer the issues presented to us; still, we’ll be creating the report.
Having pointed this out, and with our tables imported and our data model defined, let’s start creating our report.
Needless to say, with the number of facts we have in our model, we could build a lot more measures, either directly from the data or derived from it. In this link you can see a table with possible measures and visuals we could create using the available data, but for the purpose of this project we’ll create the following measures.
Clicks = SUM(engagement_data[Clicks])

Likes = SUM(engagement_data[Likes])

Views = SUM(engagement_data[Views])

Conversion Rate =
VAR TotalVisitors =
    CALCULATE(COUNT(customer_journey[JourneyID]), customer_journey[Action] = "View")
VAR TotalPurchases =
    CALCULATE(COUNT(customer_journey[JourneyID]), customer_journey[Action] = "Purchase")
RETURN
    IF(TotalVisitors = 0, 0, DIVIDE(TotalPurchases, TotalVisitors))
As models can quickly scale, and possibly more than one person has access to the report with build permissions, it’s good practice to create a standalone table in the model for the measures created. It acts as a central repository for these objects, so others can quickly build more visuals using the existing measures without having to hunt for them.
Here they are, organized in the separate table created to accommodate them and accessible to other report contributors.
Creating the Report