Basics of Power BI Modeling
The fundamental lessons of building a data model that works best for Power BI
solutions
24 Riverhaven Drive,
Wade Heads,
Whangaparaoa 0932
New Zealand
Copyright © 2020 by RADACAD, Reza Rad
All rights reserved. No part of the contents of this book may be reproduced or
transmitted in any form or by any means without the written permission of the publisher.
Agenda
About the Author
Introduction: from the author
Part 1: Relationships
Chapter 1: Power BI Relationship Demystified
Chapter 2: Many-to-One or Many-to-Many? The Cardinality of Power BI
Relationship
Chapter 3: What is the Direction of Relationship in Power BI?
Chapter 4: UseRelationship or Role-Playing Dimension; Dealing with Inactive
Relationships in Power BI
Chapter 5: Relationship with Multiple Columns
Part 2: Dimensional Modeling
Chapter 6: Data Preparation; The First and Foremost Important Task in Power BI
Chapter 7: Why My Values are Not Showing Correctly in Power BI?
Chapter 8: What is a Dimension Table and Why Say No to a Single Big Table
Chapter 9: Fact Tables
Chapter 9: Do You Need a Date Dimension?
Chapter 10: Power BI Date Dimension, Default, or Custom?
Chapter 11: Script to Create Date Dimension in Power BI using Power Query
Chapter 12: Script for Creating TIME Table in Power BI with Hours, Minutes, and
Seconds Buckets
Chapter 13: Star Schema and How to Build it
Part 3: Star Schema in Action
Chapter 14: Combining Dimension Tables in Power BI using Power Query
Chapter 15: Creating a Shared Dimension in Power BI Using Power Query
Chapter 16: Combine Tables or Create Relationships?
Chapter 17: What Fields to Hide in Your Power BI Solution?
Chapter 18: Build Your First Star Schema Model in Action
Chapter 19: How to Use Time and Date Dimensions in a Power BI Model
Chapter 20: Budget vs. Actual: Zero Complexity Model in Power BI
Chapter 21: One Dimension Filters Another Dimension in Power BI
Part 4: Calculations
Chapter 22: M or DAX? That is the Question!
Chapter 23: Scenarios of Using Calculated Tables in Power BI
Chapter 24: Measure vs. Calculated Column
Part 5: Better Data Model
Chapter 25: Move Your Shared Tables to Dataflow; Build a Consistent Table in
Power BI
Chapter 26: Power BI Shared Datasets: What is it? How does it work? And Why
should you care?
Chapter 27: Reduce the size of the Power BI file
Chapter 28: Design Tip for Power BI: Be Careful of Dates
Other books from Reza Rad
About the Author
Reza Rad is a Microsoft Regional Director, an Author, Trainer, Speaker, and
Consultant. He has a BSc in Computer engineering; he has more than 20 years’
experience in data analysis, BI, databases, programming, and development, mostly on
Microsoft technologies. He has been a Microsoft Data Platform MVP for ten continuous years.
Reza is an active blogger and co-founder of RADACAD. Reza is also co-founder and
co-organizer of the Difinity conference in New Zealand.
His articles on different aspects of technologies, especially on MS BI, can be found on
his blog: https://radacad.com/blog.
He has written several books on Microsoft BI and Power BI topics. He has also been an
active member of online technical forums such as MSDN and Experts-Exchange, and was a
moderator of the MSDN SQL Server forums. He is an MCP, MCSE, and MCITP in BI. He
is the leader of the New Zealand Business Intelligence users group. He is also the
author of the very popular book Power BI from Rookie to Rock Star, which is free and has
more than 1,700 pages of content, and of Power BI Pro Architecture, published by
Apress.
He is an international speaker at Microsoft Ignite, Microsoft Business Applications
Summit, Data Insight Summit, PASS Summit, SQL Saturday events, and user groups.
Connect to Reza
LinkedIn: https://www.linkedin.com/in/rezarad/
LinkedIn: https://www.linkedin.com/company/radacad
Twitter: https://twitter.com/Rad_Reza
Twitter: https://twitter.com/RADACAD_COM
Twitter: https://twitter.com/DifinityConf
Introduction: from the author
I have been dealing with many Power BI challenges in my professional life as a Power
BI consultant and trainer. Challenges normally come as calculation or DAX questions,
or sometimes as performance questions. However, after digging deeper into the
problem, it soon becomes clear that the problem is related to a more fundamental
challenge: data modeling.
If you have a Power BI implementation with many calculation-related or performance-
related issues, I strongly suggest looking into your data model because that is where
most of the problems start.
A good data model is a solid foundation upon which you can build many stories of
calculations and analysis. A bad data model causes problems on every level that you
add on top of it, and might sometimes cause the whole solution to collapse.
Fortunately, data modeling is not rocket science. I explain the basic principles of
data modeling with examples in this book. Use this book as the learning path towards a
better data model. Most of the tips mentioned in this book are product-agnostic (such as
the star schema, dimension, and fact tables). However, this book is particularly designed
and developed for Power BI users.
This book is for you if you are building a Power BI solution. If your task is only
visualizing existing data, you might not need this book. However, what I have seen in
many cases is that the requirement starts with just visualizing the data, then more data
tables appear, and you get into the tunnel of data modeling without knowing its
principles. This book is a guide for you through that tunnel.
Most of the chapters of this book come from my blog articles and videos, with countless
comments from my readers applied to them. If you ever have a question that you can't
get the answer to through this book, feel free to contact me directly.
Download the files and codes for this book from here
[http://radacad.com/books/files/BookFiles-BasicsOfPowerBIModeling.zip].
Part 1: Relationships
The book starts with understanding the basics of a relationship in Power BI. In this part,
you will learn why relationships are needed. You will also learn what the types of
relationships are. This part continues with some advice about how to use the right
relationship in your model.
Chapter 1: Power BI Relationship Demystified
This chapter covers the basics of relationships: what a relationship is, why you need it,
and what happens logically behind the scenes of a relationship.
Sample Dataset
If you want to walk through the example of this chapter, create a new Power BI Desktop
file and get data from the Pubs.xlsx Excel file.
What is the Relationship in Power BI?
Let's go through the meaning of a relationship with an example. Say you have a
single table for Store, and you want to analyze it in Power BI. Our Store table looks
like this:
In the Power BI report, if you want to build a visualization showing the count of stores
per State, it would look like this:
As you can see, the State can filter the table to show the count of records for each state.
If I want to visualize the filtering, it would be like this:
Now, let's build another visualization: the total Qty of titles sold in each state:
As you can see in the above visualization, the total Qty for each state is 493! And the
total Qty in the Sales table is also 493! Obviously, this time the filtering didn't happen
correctly, because the Sales table only knows about the stor_id, not the state:
If we want to filter this table by store, the only field that gives us that ability is stor_id,
and that doesn't know anything about the state.
On the other hand, the Store table has the state field, but knows nothing about the Sales
and Qty! However, there is a stor_id field in this table too, which is the unique ID of the
store for each row (store);
In this example, we need both tables because they both contain fields that we need in
our visualization: State field from the Store table, and the Qty field from the Sales table.
The stor_id field is the field that can help us in looking up one value in the other table.
Let’s see that through an example:
The store Barnum's is located in Tustin, in the state of CA. The ID of this store is "7066".
The Sales table, on the other hand, tells us how many sales we had in store 7066 (which
is the Barnum's store);
These two tables can be connected to each other using stor_id. Or, to say it in other
words, these two tables can be RELATED to each other using the stor_id field.
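To see what that lookup means in practice, here is a minimal DAX sketch (the table names
Stores and Sales and the stor_id and state columns are assumed from this example); it adds
the state of each sale's store as a calculated column on the Sales table:

// calculated column on the Sales table; table and column names are assumed from this example
State of Store = LOOKUPVALUE ( Stores[state], Stores[stor_id], Sales[stor_id] )

This is exactly the kind of lookup that a relationship automates for us, so we rarely need to
write it by hand.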
Relationships
When you want to query data from two tables, and the data of those two tables should
be somehow connected, then you can create a relationship between them (if these two
tables are not related through other tables).
To create a relationship between two tables, you need a field that can link the two
tables together. We call this field a relationship field. You can simply drag that field
from one of the tables (it doesn't matter which is the source or destination) and drop it
on the same field in the other table.
After the drag and drop, you should see the relationship created like below;
The relationship line which is created might not visually connect exactly from
stor_id to stor_id. However, when you click on the relationship line, you can see that the
fields that are part of the relationship are highlighted.
Fields can Have Different Names
The two fields from the two tables can have the same name or different names. If they
have the same name (or, let's say, a similar name pattern), Power BI can automatically
detect the relationship. Remember that we turned off that feature earlier in this chapter
to make the learning process easier. By default, auto-detect relationships is on; it would
have detected a scenario like this, where we have stor_id in both tables, and would have
created the relationship.
If field names are different, then you have to create the relationship manually, like what
we have done above. Alternatively, there is another way to create a relationship too.
Creating a relationship using the Manage Relationship Menu Option
Under the Home tab in Power BI Desktop, you can click on Manage Relationship;
You will see the Manage Relationship window, which shows all existing
relationships and their "from" and "to" columns. You can create a new one by clicking
on New. You may also notice that there is an "Autodetect" option right beside "New",
just in case you want to use it.
You cannot see the relationship created in the previous step here because I removed it
manually so that I could create it again this way and show you the second method of
creating it.
After clicking on New, in the Create Relationship window, you will be able to set the
source table and the destination table; you also have to click on the relationship field in
each table to highlight it, as illustrated in the image below:
The choice of having the Stores table as the source or the Sales table as the source
makes no difference. However, the cardinality will make the process different. Power
BI is usually smart enough to detect the cardinality. However, cardinality itself is a
big topic in relationships, which I will talk about later in another chapter.
There are many other configurations in the Create Relationship window, and each of
them needs a full chapter of its own. In this chapter, we are only focusing on the basics
of relationships. After clicking on OK with the configuration above, you should get the
same relationship diagram as we did in the previous step:
Edit Relationship
In case you want to edit a relationship, you can either go through the Manage
Relationship section under the Home tab, find that relationship, and edit it,
or you can just double-click on the relationship line between the two tables.
Data Types Should Match
The two fields that you are connecting to each other should have the same data type, or
data types that can be matched to each other (for example, text to number); otherwise,
you get an error when you create the relationship.
Linking Field Should Exist in Both Tables
If you don't have the linking field in one of the tables, then obviously, you cannot create a
relationship between them. In the above example, the stor_id is our linking field, and it
HAS TO exist in both tables.
Relationship Means Filtering Across Tables
The whole purpose of creating relationships in Power BI is to be able to filter data by
the fields from two separate tables. Creating a relationship like the above example
means connecting the two tables like below;
Now, after creating the relationship, we can see the visualization works fine:
Behind the scenes, this means filtering like below:
The stor_id field acts behind the scenes to connect these two tables. The first table gets
filtered by state, and then the second table (through the relationship on stor_id) also
gets filtered for each state.
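To see the effect in a measure, consider a minimal sketch like the following (the Sales table
and its qty column are assumed from the Pubs sample used in this chapter):

// explicit measure over the Sales table; the column name is assumed from the Pubs sample
Total Qty = SUM ( Sales[qty] )

With the relationship in place, putting this measure next to State in a visual returns a
different Qty per state; without the relationship, every state would show the grand total of
493, as we saw earlier.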
Other Properties of a Relationship
There are some other important properties of the relationships, which I am going to
explain in the next few chapters, such as:
The cardinality of the relationship (1-1, 1-Many, Many-1, Many-Many)
The direction of the relationship
Active or inactive relationship
Creating a relationship based on multiple fields
Summary
Relationships are one of the most essential parts of Power BI modeling. If you are
coming from a database background, you already have a good understanding of
relationships. If you are coming from an Excel or business background, you can consider
each relationship like a lookup between tables.
Relationships are what make filtering across multiple tables possible. In this chapter,
you learned the basics of relationships: why we need them, how to create them, and
what happens logically behind the scenes of a relationship. In future chapters, you will
learn about other important features of a relationship.
Chapter 2: Many-to-One or Many-to-Many? The
Cardinality of Power BI Relationship
In the previous chapter, you learned the basics of relationships: why we need a
relationship and what its filtering impact is across multiple tables. In this chapter, you
will learn about one of the most important properties of a relationship, called
Cardinality. Understanding the meaning of 1-1, 1-Many, Many-1, and Many-Many
relationships is the purpose of this chapter.
Relationships in Power BI
Power BI relationships give us the ability to use fields from multiple tables and to filter
across multiple tables in the data model. A relationship is based on a field, which
connects two tables and filters one based on the other (or vice versa, depending on the
direction). As an example, we can filter the Qty of the Sales table by the State in the
Store table, as long as there is a relationship between the Sales and Store tables based
on stor_id;
And the relationship between the tables is as below;
To learn more about the details of the relationships, and why we need that, read chapter
one.
What is the Cardinality of the relationship?
When you create a relationship between two tables, you get an indicator on each end of
the relationship, which can be 1 or *. These indicators are called the Cardinality of the
relationship.
The value of 1 or * tells you how many instances of each value of the relationship field
exist per row in that table. Let's check it with an example;
In the Stores table, we have one unique value per stor_id per row.
So if this field participates in one side of a relationship, then that side will take 1 as
the cardinality indicator, which is called the ONE side of the relationship.
However, the stor_id in the Sales table is not unique per data row in that table. We
have multiple rows for each stor_id. Or, let's say, in each store, there are multiple sales
transactions happening (which is normal, of course);
So if the stor_id in the Sales table is part of a relationship, that side of the relationship
will become the *, or what we call the "MANY" side of the relationship.
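If you are ever unsure which side a field will take, a quick check in DAX (table and column
names assumed from this example) is to compare the row count with the distinct count of
that field:

// returns the ONE-side message only when every row has a unique stor_id
stor_id Cardinality Check =
IF (
    COUNTROWS ( Sales ) = DISTINCTCOUNT ( Sales[stor_id] ),
    "Unique per row: can be the ONE side",
    "Duplicates: will be the MANY side"
)

Evaluated against the Sales table, this returns the MANY-side message; swapping Sales for
Stores returns the ONE-side message.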
So based on what we know so far, if we create a relationship based on stor_id between
the two tables of Sales and Stores, here is the output:
This relationship can be read in two ways;
One-to-Many (1-*) relationship from the Stores table to the Sales table
Many-to-One (*-1) relationship from Sales table to Stores table
They are both the same, of course, and they will look exactly like each other in the
diagram view. Now that you know what the Cardinality is, let’s check all different types
of Cardinality.
Types of Cardinality
There are four types of cardinality, as below:
1-1: one-to-one
1-*: one-to-many
*-1: many-to-one
*-*: many-to-many
One-to-Many or Many-to-One Relationship
This type of relationship, although it is often used in many models, can always be the
subject of investigation for better modeling. In the ideal data model, you should NOT
have a relationship between two dimension tables directly. Let's check that with an
example;
Let's say there is a different model from what you have seen so far in this example: a
Sales table, a Product table, and two tables for the category and subcategory
information of the product, as below;
As you can see in the above relationship diagram, all relationships are many-to-one,
which is fine. However, if you want to slice and dice the data of the fact table
(SalesAmount, for example) by a field from the DimProductCategory table (the
ProductCategory name, for example), it requires three relationships to be processed;
This consumes more processing power, and it also brings a lot of other issues, such as
filtering from the "many" side of the relationship to the "one" side of it. This type of
relationship chain is not recommended. I have dedicated a whole chapter to the
discussion of why you need to combine dimension tables to avoid such a scenario.
A better model would be combining category and subcategory tables with the product
and having one single many-to-one relationship from the fact table to the DimProduct
table.
One-to-One Relationship
A one-to-one relationship happens in only one scenario: when the relationship field has
unique values per row in both tables. An example of such a scenario is when you have a
Titles table and a Titles Details table! They both have one row per title. So if I create a
relationship, it would look like this:
When you have a one-to-one relationship between two tables, they are a good candidate
to be combined with each other using Merge in Power Query
[https://radacad.com/append-vs-merge-in-power-bi-and-power-query]. Both tables have the
same number of rows most of the time, and even if one of them has fewer rows,
considering the compression method of the Power BI xVelocity engine, the memory
consumption would be about the same if you bring them into one table. So if you have a
one-to-one relationship, think seriously about combining those tables.
It would be better if we combined both tables above into one table that has everything
about the Title in it.
Many-to-Many Relationship: Weak Relationship
In a many-to-many relationship, you have multiple records for each value of the joining
field, in both tables. If you think about it for a second, you can see that this scenario
happens when you have tables that are related to each other with no shared dimension in
between!
Let's check one example; Let's say I have a Fact Inventory table and a Fact Sales table.
They both have multiple records per product, and I want to connect them together using
the Product ID. This has to be a many-to-many relationship because there is no product
ID field that has unique values in it.
What if you have more than one table with that scenario?
You will end up creating a many-to-many relationship between every two tables, and
then, because it causes circular filtering in the relationships, you end up with an inactive
relationship! There are tons of other issues with many-to-many relationships, and most
of the time, they are all the result of not having a shared dimension in between. I have
dedicated a whole chapter to what a shared dimension is, and how to create it to avoid a
modeling mistake like the above.
A better model for the above sample would be to use shared dimensions, as in the
diagram below;
Summary
The cardinality of the relationship means having unique or multiple instances per value
of the joining field between two tables. The most common type of cardinality is one-to-
many or many-to-one, which happens between fact and dimension tables. However, you
can find one-to-one relationships too. Tables in a one-to-one relationship are a good
candidate to be combined into one table. Sometimes, for some types of one-to-many
relationships, it is better to combine tables as well, to create a flattened dimension. The
type of relationship that you should be avoiding is the many-to-many relationship, which
can be resolved by creating a shared dimension.
Chapter 3: What is the Direction of Relationship
in Power BI?
Relationships in Power BI are a bit different from other database management systems.
In most systems, you just have a relationship, and there is no "Direction" for it. In
Power BI, however, there is a direction for the relationship. The direction of a
relationship plays a critical role in the way that filtering works in Power BI.
Understanding the direction of the relationship is an important step towards modeling in
Power BI. In this chapter, you will learn what the direction of the relationship is and
what the difference is between a both-directional and a single-directional relationship.
Sample Dataset
If you want to walk through the example of this chapter, create a new Power BI Desktop
file and get data from the AdventureWorksDW2012.xlsx Excel file and select these
tables; DimCustomer, DimProduct, FactInternetSales. Load the data into Power BI.
After loading data into Power BI, in the Relationship tab, you should see all three tables
related to each other. You can see the direction of the relationship, which is usually
from the "one" side of the relationship to the "many" side of it.
What is the Meaning of the Direction of the Relationship?
The most important question is: what does the direction of a relationship mean? The
answer is: it means Filtering. Whatever the direction of the relationship is, that is how
Power BI filters the data. In the above screenshot, you can see the direction of the
relationship is from DimCustomer to FactInternetSales. It means any column from
DimCustomer can filter the data in FactInternetSales. As an example, you can slice and
dice the SalesAmount (in the FactInternetSales table) using the EnglishEducation (in
DimCustomer), as below;
So far, nothing seems strange or weird. You can slice and dice the data of the fact table,
also using columns from DimProduct because there is a relationship direction from
DimProduct to FactInternetSales. The problem (or let’s say the strange part) comes
when you want to filter in a different direction than what is defined in the relationship.
Let’s see an example.
So, here you go; now you know what the direction of the relationship is, and what the
difference is between a both-directional and a single-directional relationship. Before
you go, however, you need to read one important critical note about the both-directional
relationship!
Be Careful! Performance Considerations for Both-Directional
Relationship
After you've done the example above, it seems that the both-directional relationship is
good! Then you may think: why shouldn't we use it all the time? If you were using
Power BI Desktop in its early days in the second half of 2015, the default type of
direction was both-directional. At that time, I got many calls and emails from people
saying that their model was slow! Why do you think that was? Because of the both-
directional relationship!
Chapter 4: UseRelationship or Role-Playing Dimension; Dealing
with Inactive Relationships in Power BI
In a Power BI model, relationships are important for passing filters. Filters propagate
through relationships. However, sometimes you create a relationship between two
tables, and the relationship is a dashed line. In this chapter, I'll explain to you
everything you need to know about a dashed relationship, or as it is called, an Inactive
relationship. I will explain two different methods that you can use to deal with this kind
of relationship. So, ready? Let's go through it.
Why Relationships in Power BI?
To start this chapter, the very first topic to discuss is why a relationship in Power BI is
important. A relationship in relational database systems is important for linking tables
to each other, but in Power BI, the relationship also plays another, even more important
role: Filtering.
To understand how the relationship works, let’s check this example:
I have a sample Power BI file getting data from the AdventureWorksDW2012.xlsx
Excel file, and I get information from two tables: FactInternetSales and DimDate.
These two tables are NOT related to each other at the beginning.
Now, let’s create a simple column chart with the SalesAmount from the
FactInternetSales table, and the FullDateAlternateKey from the DimDate table. Because
the FullDateAlternateKey is a date field, Power BI brings in the default date hierarchy,
and I see the visual slicing and dicing data by the highest level of the hierarchy, which
is Year.
But wait, it isn't actually slicing and dicing! It is showing the same SalesAmount for
every single year from 2005 to 2010! The value is very close to $30 million, which is
actually the grand total of the sales in my dataset. The fact is that the
FullDateAlternateKey field is NOT filtering the FactInternetSales table.
Relationship Means Filtering
Now, let’s create the relationship between these two tables, based on the OrderDateKey
in the FactInternetSales table and the DateKey in the DimDate table;
That’s it, let’s go and check the same visualization again:
As you can see, the same visual, this time, filters by the date field. Or, better to say,
DimDate can now FILTER the FactInternetSales table. All of that is because of the
relationship. Without a relationship, we cannot filter data across tables by itself;
you may need to write some DAX expressions instead.
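As a rough illustration of what such a DAX expression could look like, here is a hedged
sketch using the TREATAS virtual-relationship pattern (column names assumed from the
AdventureWorksDW sample); this is not the approach we take in the rest of this chapter:

// filters FactInternetSales by DimDate without a physical relationship
Sales without Relationship =
CALCULATE (
    SUM ( FactInternetSales[SalesAmount] ),
    TREATAS ( VALUES ( DimDate[DateKey] ), FactInternetSales[OrderDateKey] )
)

In almost every case, though, a proper physical relationship is the simpler and faster option.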
Now that you know relationships are for filtering, let's check out what the inactive
relationship is.
Inactive Relationship
The type of relationship you have seen above is called an active relationship. There is
another type of relationship called inactive. Let's see how an inactive relationship is
created. In the previous example, we sliced and diced data by the OrderDateKey field
because that was the field connected through the relationship to the DimDate table.
Now, let's say we want to slice and dice data by the ShipDateKey. The very simple
approach is to create another relationship between the DimDate table and the
FactInternetSales table, but this time on the ShipDateKey. Here is the result:
As you can see, this new type of relationship is different. It is a dashed line, compared
to the active, which was a solid line. This is an inactive relationship. You can only have
one active relationship between two tables. Any other relationships will become
inactive.
You can only have one active relationship between two tables.
Any additional relationships will become inactive.
An inactive relationship doesn’t pass filtering. It doesn’t do anything by itself. I still see
many people creating inactive relationships in their model, thinking that just the inactive
relationship by itself will do some filtering. It doesn't. If I use the FullDateAlternateKey
from the DimDate table to slice and dice the SalesAmount from the FactInternetSales
table, which field am I filtering based on? The field that is related through an active
relationship is the basis of the filtering. Here is the result (which is the same as what
you have seen in the previous example, because the inactive relationship doesn't do
anything; it is just the active relationship that passes the filter);
Inactive Relationship Doesn’t pass the filtering by itself. It needs
treatment!
Yes, the inactive relationship needs special treatment to work. Let's see how this can
work. I will explain two treatments for an inactive relationship: the Role-playing
dimension and the UseRelationship method.
Role-playing Dimension
A dimension that plays the role of multiple dimensions is called a role-playing
dimension in data warehousing terminology. In the above example, DimDate is going to
play the role of Order Date in some scenarios, the role of Ship Date in other scenarios,
and sometimes the role of Due Date at other times.
One method to deal with the inactive relationship is to remove the cause of creating it! If
having multiple relationships between two tables is causing the creation of an inactive
relationship, one way to avoid it is to create multiple instances of the same table; then
you would need only one relationship per pair of tables, not more than that.
Let's create a copy of DimDate. One way to create the copy is to use a calculated table
with the ALL DAX function in it;
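A minimal sketch of such a calculated table would be:

ShipDate = ALL ( DimDate )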
ALL is a function that gives you the entire table. In this case, we are creating a copy of
the DimDate table and calling it ShipDate. Now you can create a normal active
relationship between ShipDate and the FactInternetSales table (I have removed the
inactive relationship from the previous section);
And now, as a result, you can slice and dice by the ShipDate table as well as the Order
Date (or, let's say, the DimDate table);
The role-playing dimension is one of the ways that you can
handle an inactive relationship, but be careful of memory
consumption!
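UseRelationship
The second treatment keeps the inactive relationship in place and activates it inside a
measure with the UseRelationship (USERELATIONSHIP) function in DAX. A sketch of
such a measure, assuming the AdventureWorksDW column names used above, would be:

Sales by Ship Date =
CALCULATE (
    SUM ( FactInternetSales[SalesAmount] ),
    USERELATIONSHIP ( FactInternetSales[ShipDateKey], DimDate[DateKey] )
)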
This measure calculates the sum of sales by ship date. The whole secret is the usage of
the UseRelationship function. This is a really simple function to use. You just need to
provide two input columns to it: the two columns that are the two sides of the
relationship. Their order is not important.
UseRelationship (<column 1>, <column 2>)
The important tip to consider is that you HAVE to have an existing inactive relationship
for this function to work. Otherwise, you get the error below:
Summary
In this chapter, you learned about inactive relationships and how to handle them through
two methods: the Role-playing dimension and the UseRelationship function in DAX. The
role-playing dimension method is good for smaller tables, where the extra memory
consumption is not an issue. The UseRelationship method, on the other hand, can be a
good substitute when the tables are bigger. There are other benefits too, such as having
one table filter based on multiple fields at the same time, as you've seen.
Chapter 5: Relationship with Multiple Columns
You can create relationships in Power BI between tables. Relationships are useful for
some functions to work across multiple tables and produce a result. The relationship
between tables also makes visualization and report elements more efficient, because the
result of a selection in one chart can affect another chart from a different table.
However, there is a limitation in Power BI relationships: you can't create a relationship
based on more than one column. In other words, if you want to create a relationship
with two or more columns, you cannot! Fortunately, there is a workaround that I'll
explain in this chapter.
Defining the Problem
Assume that we have a budget table with the fiscal year, fiscal period, and budget
amount. Here is a screenshot of this table:
If I want to do date-based analysis and date-based calculations, it is best to create a
relationship between the budget table and a date dimension.
To join these two tables, I can add a day column to the budget table, and then join them
based on three columns: fiscal year, fiscal period, and day (day of the month). So here
is the budget table with the new day column added and the month value a bit polished to
remove “Mth” from the month label;
Now, if I want to create a relationship between the date dimension and the budget table
based on these three columns, I cannot! The Create Relationship dialog doesn't allow me
to select multiple columns, and with a single column, no key field can be identified, so
the relationship can't be created.
Workaround
The workaround for this problem is easy. Power BI doesn't allow a relationship in the
model based on multiple columns, but Power Query can join tables on as many columns
as we want. So what I can do as a workaround is to join the budget table to the date
dimension in Power Query and fetch the date key. Then I'll use the date key as a single-
field relationship in the Power BI modeling section.
First, I open Merge Queries from the Combine section of the Home tab;
Here is how I join two tables based on multiple columns: I can hold the CTRL key and
select the columns one by one (in the right order of joining).
Then I'll have the new table embedded as a result of the join;
So I'll just pick the key field from the embedded table;
After saving and closing the Query Editor window, I can create a relationship in the
Power BI model based on a single column;
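As a side note, if you prefer to stay in the model layer, a similar single-field relationship
can be approximated by building a combined key column with DAX on both tables. This is
only a hedged sketch with hypothetical column names, and the text formats on both sides
must match exactly:

// hypothetical calculated column on the Budget table
Budget Join Key = Budget[Fiscal Year] & "|" & Budget[Fiscal Period] & "|" & Budget[Day]

// matching hypothetical calculated column on the date dimension
Date Join Key = DateDim[Fiscal Year] & "|" & DateDim[Fiscal Period] & "|" & DateDim[Day]

The Power Query merge shown above is usually the cleaner option, because it reuses the
existing date key instead of adding new calculated columns.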
Part 2: Dimensional Modeling
In this part, you will learn about the principles of dimensional modeling.
You will learn why a star schema is needed, what its components are, and
what fact and dimension tables are. You will learn how to build a star
schema from an operational data store.
Chapter 6: Data Preparation; The First and
Foremost Important Task in Power BI
I've talked about data preparation many times at conferences such as the PASS Summit,
the BA Conference, and many others. Every time I talk about this, I realize how much
more I need to explain it. Data preparation tips are basic but very important. In my
opinion, this is the most important task in building a BI system. In this chapter, I'll
explain why data preparation is necessary and the five basic steps you need to be aware
of when building a data model with Power BI (or any other BI tool). This chapter is
totally conceptual, and you can apply these rules to any tool.
Why Data Preparation?
All of us have seen many data models, like the screenshot below. A transactional
database by nature has many tables and relationships between tables. Transactional
databases are built for CRUD operations (Create, Retrieve, Update, Delete rows).
Because of this single purpose, transactional databases are built in a Normalized way to
reduce redundancy and increase the consistency of the data. For example, there should
be one table for Product Category with a Key related to a Product Table because then
whenever a product category name changes, there is only one record to update, and all
the products related to that will be updated automatically (because they are just using
the key). There are books to read if you are interested in how database normalization
works.
I'm not going to talk about how to build these databases. In fact, for building a data
model for a BI system, you need to avoid this type of modeling! This model works
perfectly for a transactional database (where there are systems and operators doing data
entry and modifications). However, this model is not good for a BI system. There are
several reasons for that; here are the two most important ones:
1. The model is hard to understand for a report user.
2. Too many tables and many relationships between tables make a reporting query
(one that might use 20 of these tables at once) very slow and inefficient.
You never want to wait for hours for a report to respond. The response time of the
reports should be fast. You also can't expect that the report users (or self-service report
users) understand the schema above. It is sometimes even hard for a database developer
to understand how this works in the first few hours! You need to make your model
simpler, with fewer tables and relationships. Your first and most important job as a BI
developer should be transforming the above schema into something like below;
This model is far simpler and faster. There is one table that keeps sales information
(Fact_Sales) and a few other tables which keep descriptions (such as Product
description, name, brand, and category). There is one relationship that connects the
table in the middle (Fact) to each of the tables around it (Dimensions). This is the best
model to work with in a BI system. This model is called a Star Schema. Building a star
schema, or dimensional modeling, is your most important task as a BI developer.
How to Design a Star Schema?
To build a star schema for your data model, I strongly suggest you take one step at a
time. What I mean by that is choosing one or a few use cases and starting to build the
model for those. For example, instead of building a data model for the whole Dynamics
AX or CRM, which might have thousands of tables, just choose the Sales side of it, or
Purchasing, or GL. After choosing one subject area (for example, Sales), start building
the model for it, considering what is required for the analysis.
Fact Table
Fact tables are tables that normally hold numeric and additive data. Examples of these
numeric and additive values are quantity sold, sales amount, discount, cost, and things
like that. These values are numeric and can be aggregated. Here is an example of a fact
table for sales;
Dimension Table
Any descriptive information will be kept in dimension tables. For example, customer
name, customer age, customer geographic information, customer contact information,
customer job, customer ID, and any other customer-related information will be kept in a
table named Customer Dimension.
Each dimension table should contain a key column. This column should be numeric
(integer or big integer, depending on the size of the dimension) and auto-incrementing
(Identity in SQL Server terminology). This column should be unique for each row in the
dimension table. This column will be the primary key of this table and will be used in
the fact table for the relationship. This column SHOULDN'T be the ID of the source
system. There are several reasons why. This column is called the Surrogate Key. Here
are a few reasons why you need to have a surrogate key:
Codes (or IDs) in the source system might be Text, not Integer.
Short Int, Int, or Big Int are the best data types for the surrogate key because these
are the values that will be used in the fact table. Because the fact table is the largest
table in the data set, it is important to keep it as small as possible (using Int data
types for dimension foreign keys is one of the main ways of doing that).
Codes (or IDs) might be recycled.
You might want to keep track of changes (Slowly Changing Dimension), so one
ID or Code might be used for multiple rows.
…
The surrogate key of the dimension should be used in the fact table as a foreign key.
Here is an example;
Other dimensions should also be added in the same way. In the example below, a Date
Dimension and a Product Dimension are also created. You can easily see in the
screenshot below why it is called a star schema: the Fact table is in the middle, and all
the other dimensions are around it, with one relationship from the fact table to each
dimension.
Design Tips
Building a star schema, or dimensional modeling, takes more reading than what you had
in this chapter. That is why there are more chapters in this book to explain things in
more detail. However, it would be great to leave some tips here for you to get started
towards better modeling. These tips are simple but easy to overlook. The number of BI
solutions that I have seen suffer from not obeying these rules is countless. These are
rules that, if you do not follow, will soon take you far away from proper data modeling,
and you will have to spend ten times more effort to build your model properly from the
beginning. Here are the tips:
Tip 1: DO NOT add tables/files as is
Why?
Tables can be joined together to create a flatter and simpler structure.
Solution: Create a flattened structure for tables (especially dimensions).
Summary
This was just a very quick introduction to data preparation with some tips. This is the
beginning of this part about dimensional modeling. Other chapters in this part will help
you to understand the concepts in more detail.
Chapter 7: Why My Values are Not Showing
Correctly in Power BI?
Both visualizations above are showing slicing and dicing of the data of orders (one is
the SalesAmount and the other is the count of orders) by EnglishEducation, which is
from the DimCustomer table.
Values that can be fetched from Multiple Tables
The problem happens when there is a value that you can fetch from multiple tables.
Let’s talk about one scenario as an example:
We have the FullName (from the DimCustomer table) of customers in the table below,
and we want to add the count of products they have purchased beside them as a value
column. In a scenario like this, one of the very first actions for many developers is to go
and search for something about Product in the Fields pane.
A search for product will bring up a lot of fields; if we narrow it down using
ProductKey, we see that it is available in two tables: FactInternetSales and
DimProduct.
In a properly designed data model, the ProductKey has to be hidden because it is a
technical column. If you use the ProductKey from the DimProduct table, then you get this
output:
If you wonder why the count of ProductKey values in the visual above is 606 for every
row, I have explained it fully in detail in chapter three, about the direction of the
relationship in Power BI. As you can see in the model diagram below, the DimCustomer
table cannot filter the values in the DimProduct table because of the single-directional
relationship between the DimProduct and FactInternetSales tables.
You might then think of changing the relationship to both-directional between the
DimProduct and FactInternetSales table, which is a big modeling mistake. Your model
should always look like this:
So, how should you achieve that? Well, by choosing the value from the correct table:
the FactInternetSales table.
If I do the same but get the value from the FactInternetSales table, then we get the
correct value;
Depending on the logic of your calculation, you might need the Count of that field or the
Count (Distinct) of it (because there are duplicate ProductKey values in the
FactInternetSales table; a product can be sold multiple times, of course). As you see in
the above visualization, the value is shown correctly.
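A measure version of that logic, as a minimal sketch using the AdventureWorksDW names,
would be a distinct count over the fact table:

// counts each product only once within the current filter context
Products Purchased = DISTINCTCOUNT ( FactInternetSales[ProductKey] )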
How Do I Know Which Table Should I Get the Value from?
Well, we solved the above scenario. However, in a more complicated model, where you
have more than two or three tables, finding the right table to get the value from can be a
challenge. Here is the golden rule for selecting values: values should come from the
fact table, and the fields used for slicing and dicing should come from dimension tables.
Examples of these values are SalesAmount, Order Quantity, Count of Products sold,
Count of Customers who purchased specific products, etc. The Count of ProductKey
from the FactInternetSales table that you saw above was one of these examples;
What If I don't have Fact or Dimension tables designed properly?
At this point, you might say that it is easy in a proper star-schema design, where there
are fact and dimension tables designed properly. However, in my model, everything is
messed up. What should I do about that? That is a great question. My answer is this:
Having a proper Power BI model design is the building block of all the other steps
afterward. If you don't have the right data model, everything is going to be messed up
on top of it. You cannot build a house on top of a foundation that is not solid.
I'd say, whatever stage or step of Power BI development you are at, seriously consider
re-designing your model. There are simple things that you can do to change your model
into a good data model; it will take time, but it will be well worth it.
Summary
Values in your visuals should come from Fact tables, and slicers/filters/axis/column or
row headers should come from Dimension tables. For that, you do need to have proper
dimension and fact table design. However, that is not hard to implement.
Chapter 8: What is a Dimension Table and Why
Say No to a Single Big Table
Now is a good time to explain the two types of tables we deal with every day in a
reporting data model: Fact and Dimension tables. In this chapter, I will explain what a
Dimension table is, give examples of it, and show how it can be created. Although the
examples in this chapter all use Power BI and Power Query, the basics of modeling are
valid regardless of the tool that you use. If you have heard about dimensions and have
no idea what they are, or if you haven't heard about them and you want to build a data
model for reporting, this chapter is what you need to read.
Two Types of Tables
In a reporting data model, there are two types of tables. Before I go further on this, I
have to point out that there is a big difference between a data model built for a
transactional or operational system and a data model built for a reporting system. In this
chapter, we are focusing on a data model for reporting systems. In a reporting data
model, we have two types of tables: the Fact table and the Dimension table;
A Fact table is a table that keeps numeric data that might be aggregated in the reporting
visualizations.
A Dimension table is a table that keeps descriptive information that can slice and dice
the data of the fact table.
The definition above, although correct, can lead to creating heaps of different types of
tables and calling them Dimension or Fact tables. You do need to understand where and
how a Dimension table can be created, what the rules of thumb for creating such a table
are, and what makes a good or a bad dimension. And that is what we are focusing on in
this chapter.
A Type of Field: Descriptive
A Dimension table is a table that has descriptive information, which is or can be used
for slicing and dicing. A Dimension table has fields that are descriptive. Descriptive
fields can be of datatype text, date and time, or even a number. Let’s have a look at
some examples;
The table below is a table full of information about customers, and they are all text data
type information;
This is a Dimension table because you use the data of this table mainly in visualizations
like this:
Most of the time, you use values from this table in the axis or legend of a chart
visualization, but not as values. (You might use these as values in the table visual, but
that is an exception, because the table visual will show you everything.) In the
screenshot above, the value is SalesAmount (coming from a different table), and it is
always sliced and diced by a field from the Customer table: Gender, EnglishEducation,
Lastname, etc. That brings us to the first part of the definition of the dimension table;
Fields from a Dimension table are always used for slicing and
dicing data of other tables (either through filters or slicers or
through the axis or legend of the same visual). We call these fields
Descriptive fields.
You may also have noticed that the fields of the Customer table above were all text.
That can be different. Let's check it out here; we have two new fields in the Customer
table now;
The date field (BirthDate) and the numeric field (YearlyIncome) will also be used for
slicing and dicing, as the screenshot below shows;
No matter whether you have date and time fields or even numeric fields in the customer
table, they are used in the same way as text fields: they are used for slicing and dicing.
The data type of the descriptive fields can be text, date and time,
number, or anything else. The data type is not the important
property of these fields; the nature of their use case is what is
important, which is slicing and dicing.
SalesAmount is not used for slicing and dicing. It is used as a value. In fact,
SalesAmount is sliced and diced by the descriptive fields of Gender, Education, and
Lastname. SalesAmount is not a descriptive field; it is a field that is used as a value and
can be aggregated sometimes (like what you see in SalesAmount by Gender or
SalesAmount by EnglishEducation). SalesAmount is a FACT field.
Fields that are the VALUE part of a visualization (except for the table
and a few other visuals), are sliced and diced by descriptive
fields, and are often aggregated, are FACT fields.
Golden Rule: Fact and Descriptive Fields Not in the Same Table!
Now that you know what a Fact field is and what a Descriptive field is, it is time to
explain the first golden rule: keep these two types of fields away from each other, in
separate tables. Fact fields should be in a table that does not have Descriptive fields in
it, and vice versa. Your first question, I guess, is Why?! Well, let's see.
Let’s assume that we have a table with both types of fields; descriptive fields and also
fact fields, like below;
All of the first seven fields are descriptive fields, and the last two (SalesAmount and
Order Quantity) are fact fields. Using it in a model like the above works fine, and you
can slice and dice the data of the fact fields by the descriptive fields. However, what if
you bring in another set of descriptive fields, which are not part of this table? For
example, what if you want to slice and dice the data of the fact fields by some fields
from the Product table, such as ProductName, Color, etc.? Then you end up bringing all
those into this table too, and end up with a super large table, which has customers,
products, and sales values! What if, later on, you want to bring in dates of the
transactions, sales promotions, or store details? Each of these brings a whole lot of
fields with it.
Say No to a Single Table Including Everything!
One of the other golden rules of data modeling is to say no to a scenario like the above.
Do not bring everything into one large table with hundreds of columns. There are
several reasons for that; I'll just point out a few:
Maintenance of such a big table would always be a challenge. Every time you
bring a new set of fields, you have to combine them into this table.
The level of detail stored in this table is dependent on the fields in it. If you
bring new fields, you need to store more details! This is also a maintenance
problem.
Combining is not always possible! What if you also want to bring a budget into
the same table? Most of the time, it is not possible because you may not have the
budget data at the level of the customer. You may have it only at the level of the
month and product. Then you end up with multiple very large tables.
Maintenance Challenge
Every time you add new fields, your table needs to be combined with the new table to
have everything in one place!
Level of Detail Challenge
The level of detail is dependent on the fields in the table! If you add more fields, you
need to store more details.
Combining is Not Always Possible
If the level of detail in one or two fact fields is different, then you cannot store them in
one table easily. You have to split them into two;
This can become another big table, adding dates and more details about
the product, etc., and then you need to create a "many-to-many"
relationship between the two tables, which causes heaps of other issues!
So it is not possible to have one table with everything in it, and as you have realized, it
is better to decouple the fact fields from the dimension fields, because then expanding
the model is much easier, and it won't need a really high-maintenance practice.
Dimension Table
Now it is time to go back to the definition of the dimension table;
This table needs to be related to other tables (the tables that include Fact fields) to slice
and dice their data. So this table needs to have a field that is the unique identifier for
each row. In database design practice, this field is called the Primary Key. The primary
key is a single field, or a combination of fields, in the dimension table that can uniquely
identify each row. For example, the Customer table can be identified with the
CustomerKey column as below;
This is the field that will be used for the one-to-many relationships between the
dimension table and fact tables.
The relationship between a dimension table and fact tables, most of the time, is one-to-
many from the dimension to the fact table. There are instances of a one-to-one
relationship as well, but not very often.
So, we need to change our definition a bit now;
If you have more than one key field (this is also called a compound key), then note that
you cannot create relationships between tables in Power BI using multiple fields, so you
need to use the workaround I mentioned in chapter five to create a single field for the
relationship.
When the Primary Key is Not a Primary Key!
The Primary Key is a column (let's assume it is only one column for this example) that
is the row identifier. However, what if it isn't?! Do you recall, in your use cases, a
scenario where something has to be a primary key, but it can't be? Let me give you an
example: in a reporting model for bank information, we have a table of credit cards.
Each credit card has a unique number, so it can be used as the primary key;
However, if you look in more detail, a credit card will be renewed every few years,
with a new expiry date, but the same number!
This is an example of a field that should somehow be a primary key (or maybe it is
already a primary key in the source operational system), but not in our model. We call
these fields the Alternate Key, the Business Key, or the Primary Key of the source
system. And then, we need to create a new field in the Dimension table, which can be a
real primary key. For this reason, the primary key of the dimension table is called the
Surrogate Key, and most of the time, it can be generated as an auto-number index value;
There are other scenarios where you might end up creating an auto-number surrogate
key for your dimension as well. This was just an example to clarify things for you. If
scenarios like this happen, most of the time, you also need to keep the Alternate Key
beside the Surrogate Key in the dimension table for slicing and dicing.
Dimension Tables are Wide
Another golden rule in designing dimension tables is to make them wide. Add fields that
are related to that dimension; more fields in the dimension table mean more power for
slicing and dicing the data. If you have just the three fields of Customer first name, last name, and
gender, then it means you can only slice and dice data by these three. However, if you
have a customer table with all columns below, then you will have much more power in
slicing and dicing;
Dimension tables have many fields, and that is why they are Wide. And wider means
more slicing and dicing power.
Dimension Tables are Shallow
Dimension tables are wide, but often, they are shallow: they do not have many rows.
They often have fewer than a million rows. However, there are always exceptions. For
example, if Facebook wanted to create a customer dimension, it would have hundreds of
millions of rows for sure. A dimension table is not storing transactional data; it is
storing descriptive information. Transactional data is what makes the number of rows
grow.
Examples of Dimension Tables
Based on what you have learned so far, you can now name some dimension tables in
your scenario. A table might be a dimension in one model and, with some difference in
the set of fields, a fact table in another model. Here are some samples of dimensions and
their fields;
Customer: CustomerKey, First name, Last name, Full name, Age, Gender, Job title, etc.
Product: ProductKey, Product Number (Alternate key), Product Name, Color, Size, etc.
Store: StoreKey, Name, Address, etc.
Summary
In this chapter, you learned that there are two main types of fields: Descriptive and Fact
fields. These two types of fields should not be in the same table. That is why we have
tables with Descriptive fields and call them Dimension tables. Dimension tables also
have a key field, called the surrogate key, as a row identifier. The relationship between
a dimension table and fact tables is one-to-many. In the next chapter, I will explain what
the fact table is and how it connects to a dimension table.
Chapter 9: Fact Tables
Fact tables are the core of analysis in a data model. In the previous chapter, I explained
what a dimension table is and why we cannot have everything in one big table. In this
chapter, you will learn about the fact table and how it is positioned in a data model; you
will also learn how fact tables and dimension tables are related to each other to build a
proper data model. The examples in this chapter are built using Power BI. However, all
of these concepts can be used regardless of the technology.
What is a Fact Table?
A fact table is a table full of Facts! If you have read the definition of Fact in the previous chapter, you know that a fact is a numeric field that usually needs to be aggregated and will be set as the value part of visualizations. Examples of facts are Sales Amount, Order Quantity, Profit, Cost, etc. A fact table is a table full of those fields. So you may think a fact table looks like the below screenshot.
The screenshot above can be a representation of a fact table based on the definition; however, it lacks something very important. The data in a fact table should be sliced and diced by the data of dimensions. For example, you should be able to see what the Sales Amount was for each product category, for each client, in each store, etc. With the structure of the table above, you cannot do that. The fact table has another set of fields too: keys from the Dimension tables.
The screenshot below shows a fact table with keys from dimension tables.
As you can see in the above screenshot, a fact table includes two types of fields: fields from Dimension tables (the keys, i.e. the surrogate keys from the dimension tables), and Facts (numeric and aggregatable fields).
Usually, fact tables are named based on their main entity of analysis. For example, if the table above is analyzing sales data, then it can be called FactSales, or just simply Sales.
The Grain of the Fact Table
In the below screenshot, I have two Sales fact tables, which are slightly different! The first one has three dimension keys: ProductKey, OrderDateKey, and CustomerKey. The second one has more dimension keys; it also has PromotionKey and SalesTerritoryKey.
The first fact table gives me the power to slice and dice the sales data by the Product, Date, or Customer dimension. However, the second fact table, in addition to those dimensions, will allow me to slice and dice by Promotion and Sales Territory too. The grain of a fact table is the level of detail stored in the fact table.
The grain for the first fact table is one record per combination of Product, Order Date, and Customer.
The grain for the second fact table is one record per combination of Product, Order Date, Customer, Promotion, and Sales Territory.
The more fields you have in the grain of your fact table, the more dimensions you are connected to, and that means more power for slicing and dicing. On the other hand, more fields also mean the number of rows will increase, and you will need more memory to store the data.
If you build your fact table at the lowest grain (the most detailed list of dimensions), you always keep the ability to expand easily in the future. Although you should keep in mind that it means your fact table will have more data rows.
Relationship Between Fact Table and Dimension Tables
Every fact table will be related to one or more dimensions in the model. These dimension tables are the tables whose surrogate key (or primary key) is part of the fact table. For example, the FactInternetSales in the below data model is related to DimProduct, DimCustomer, DimDate, and DimSalesTerritory.
A fact table has a "many-to-one" relationship to every dimension. The filter direction of the relationship is from the dimension table to the fact table. In other words, every dimension can slice and dice the data of the fact table.
Star Schema: The Golden Schema for a Data Model
The relationships between the fact table and the dimension tables around it form the shape of a star. That is why this model is called the Star Schema: the fact table is in the center, and the dimensions are around it.
Star Schema is one of the most important concepts of a data model for reporting. Based on experience and research, the Star Schema is the best type of data model for reporting. Star Schema means that the fact table and every dimension around it have one single direct relationship. If a dimension is related to another dimension, and the fact table is not directly connected to that final dimension, then this is not a star schema anymore; it is called the snowflake model. The discussion of Star Schema is a big topic by itself, and I explain it in one of the future chapters. For now, this is enough to understand what the star schema is.
Different Types of Fact Tables
Fact tables come in different types, although most of the time, we all use one type of them. However, in some scenarios, other types can be very helpful. Let's check them out. The types of fact tables that I explain in this chapter are: Transactional, Periodic Snapshot, Accumulating Snapshot, and Factless fact tables.
Transactional Fact Table
This is the most common type of fact table. When a fact table is built from a transactional data table, the outcome is a transactional fact table. In a transactional fact table, we have one row per transaction. The grain of this fact table is at the transaction level.
An example of a transactional fact table is the FactSales that you have seen in the above star schema. This table includes one transaction per row.
Periodic Snapshot Fact Table
A periodic snapshot fact table stores the data as a snapshot per period, rather than one row per transaction. For example, in the stock exchange market, values and rates are changing every millisecond. You may not need all those details in the data model; you may create reports on data by minutes or by hours. In that case, you can create a periodic snapshot fact table at the desired period.
Here is an example of a snapshot fact table that is aggregated at the month level.
Accumulating Snapshot Fact Table
An accumulating snapshot is created through a process. For example, to create a work order, first, the work order request has to be raised, then it should be sent to the appropriate department and manager, then the manager should approve or reject it, and then, depending on that action, some other steps might occur. In such a case, every row in the accumulating snapshot fact table will have the details of one work order.
The reason this table is called an accumulating snapshot is that part of the data comes into the table later. At the time of processing the table, we might not have the entire data for a workflow or process.
The below screenshot is an example of an accumulating snapshot fact table. As you can see, not all columns for every row are filled; some parts of the data might arrive later. There are columns showing the different stages of the process.
Factless Fact Table
A factless fact table doesn't have any facts in it. It just has key fields from dimensions. Below is an example of a factless fact table for promotions on products on different dates.
As you can see, this table doesn't have any facts in it at all. It is just a relationship table that relates the three tables of Product, Promotion, and Date. That is why these tables are also called Bridge Tables.
Summary
Fact tables store facts and also the keys to the dimension tables around them. The relationship between the fact table and the dimension tables is "many-to-one" and allows dimension tables to filter the data in the fact table. The relationships between the fact and dimension tables create a star-shaped schema called a star schema. However, there are lots of details about the Star Schema, which I will talk about in the coming chapters. You also learned that there are different types of fact tables, such as transactional, periodic snapshot, accumulating snapshot, and factless fact tables.
Chapter 9: Do You Need a Date Dimension?
I’ve heard this question many times: “Do you think we need a date dimension?”. Some years ago, the date dimension was used more often; nowadays, less! The main reason is that some tools use their own built-in date hierarchy. As an example, Power BI has a built-in date hierarchy that is enabled by default on top of all date fields, which gives you a simple hierarchy of year, quarter, month, and day. Having this hierarchy often brings up the question: do I still need a date dimension? Or can I simply use this date hierarchy? In this chapter, I’ll answer this question and explain the reasons.
What is the Date Dimension?
The Date Dimension is a table that has one record per day, no more, no less! Depending on the period used in the business, you can define the start and the end of the date dimension. For example, your date dimension can run from the 1st of January 1980 to the 31st of December 2030. For every year, you will normally have 365 records (one record per day), except leap years with 366 records. Here is an example screenshot of date dimension records.
The Date Dimension is not a big dimension, as it would be only ~3,650 records for ten years, or even ~36,500 rows for 100 years (and you might never want to go beyond that). The columns will normally be all the descriptive information about the date, such as the Date itself, year, month, quarter, half-year, day of the month, day of the year, etc.
The Date Dimension will normally be loaded once and used many times afterwards, so it shouldn’t be part of your every-night ETL or data load process.
Why Date Dimension?
The Date Dimension is useful for the scenarios mentioned below:
1. It gives you the ability to slice and dice values by many date attributes such as week number, half-year, day of the year, etc.
2. It makes the analysis consistent.
3. It gives you the ability to do analysis based on public holidays (Easter Monday, Good Friday, etc.)
4. Some BI tools have extended functions that need a Date Dimension to work.
The fact is, when you use the simple hierarchy of Year, Quarter, Month, and Day, you are limited to slicing and dicing by only these fields. If you want to step beyond that, then you have to create additional fields to do it. That’s when a date dimension with all of these fields is handy. Here are some fields in a Date Dimension that you can slice and dice by (and this is just a very short list; a real date dimension might have twice, three times, or four times more fields than this! Yes, more fields means more power in slicing and dicing).
You can answer the questions above without a date dimension, of course. For example, to answer the first question above, you will need to add the Weekday Name; and because you would need to sort it based on the Weekday Number, you have to bring that column as well. To answer the second question, you need to bring the week number of the year. For the third question, you need to bring fiscal columns, and after a while, you will have heaps of date-related columns in your fact table!
And the worst part is that this is only one fact table (let’s say the Sales fact table). Next month you’ll bring in the Inventory fact table, and the story begins again with the date fields in the Inventory fact table. This simply leads us to the second reason for the date dimension: Consistency.
Consistency in Analysis
You don’t want to be able to slice and dice your Sales data by the week number of the year, but not have this feature for the Inventory data! If you repeat these columns in all fact tables, you will have big fact tables that are still not consistent; if you add a new column in one place, you have to add it in all the other tables. What about a change in the calculation? Believe me; you never want to do that.
The Date Dimension, on the other hand, is a dimension that is shared between all fact tables. You add all the fields and calculations here, and the fact tables are only related to this one table. This is a consistent approach. Here is a simple view of a data warehouse schema with a date dimension shared between two fact tables.
Chapter 10: Power BI Date Dimension, Default, or Custom?
If you have worked with Power BI for some time, you know that there are two types of Date dimension: Custom, or built-in/Default. It is always confusing for people which date dimension is good to use, and what the difference is between these two approaches. Also, based on the type of Date dimension you select, your DAX calculations may differ a little bit. In this chapter, I’ll explain all you need to know about the default and custom date dimension in Power BI and help you to choose the right one for your Power BI solution.
Power BI Default Date Dimension
Power BI’s default date dimension is still a bit of a mystery for some people. Many people are still not aware that there is a default or built-in date dimension. This date dimension didn’t exist at the very beginning of Power BI Desktop (which was released in July 2015); it came a bit after the party. The purpose of this default date dimension is to ease the way time intelligence calculations work in Power BI and make the whole experience easier for the user. Let’s see how this dimension works.
Power BI Creates a Default Date Dimension for every single date
field in your dataset
By default (you can change the default configuration), Power BI creates a date dimension for every single date field in your dataset. It is a generic date dimension with the usual fields, such as Year, Quarter, Month, and Day. The configuration in Power BI Desktop that allows the model to create the default date dimension is here:
In Power BI Desktop: File menu -> Options and Settings -> Options
In the Options window, under Current File, Data Load: Time Intelligence: Auto Date/Time
If you have this item enabled, it means that Power BI automatically creates a hidden date table for each field in the model that has a date or date/time data type. The reason it is not easy to find the default Date table is that it is HIDDEN! The main reason for having this table hidden is that users would face too much confusion if they saw a Date table for each date-type column. Here is a model created with the default Date dimension:
In the screenshot above, you see that there are three date fields in the FactInternetSales: DueDate, OrderDate, and ShipDate. There is a date dimension TABLE for each of these fields that you cannot see here. But if you have that option enabled in the Power BI settings, then you can see the Date hierarchy under each of these fields, which shows that there is a table behind the scenes.
Why a Date table for every single Date Field?
Well, now that you know there is a default Date dimension, your second question might be: why does Power BI create it multiple times, once for each date field?! The reason is that the Date usually acts like a role-playing dimension. Sometimes it might be the Due Date, sometimes the Ship Date, and sometimes the Order Date. All of these are date fields, but their values and behavior are different. If you have read chapter four of this book, you know that one of the ways to handle a role-playing dimension in Power BI is to create copies of the Date dimension. This is what Power BI does behind the scenes automatically. Power BI creates a template for the Date table first, and then copies it for every single date or date/time field in the model. Here is a glance behind the scenes of the Power BI model:
Note that you cannot see the view above in the Power BI Desktop. You can use tools
like Power BI Helper [https://radacad.com/power-bi-helper] to get to that information,
though. In the screenshot above, you can see that there is a DateTableTemplate, and then
there are three date tables copied from that template.
What does the Default Date table look like?
There are many variations of the Date dimension in the world, and you may wonder what the default Date table looks like. What columns are available there, and what columns are not? Here is the list of columns:
The view above can be generated with tools such as Power BI Helper. But if you are interested in seeing the list of these column names in Power BI Desktop, one way is to see it when you write a DAX expression; after typing a dot (.) after the Date or Date/Time field name, you get the list of fields.
This might have been the mystery for many people when they write a DAX statement: what is the list of fields that comes after the dot (.) in front of a date field name? Now you know the answer.
The Date column is NOT a column from the Power BI model
point of view. It is a TABLE, a hidden table, and because of that,
you can choose which column in that table you want to use
within your expression.
If you don’t use the “.[Date]”, then you won’t get the correct result, because Time Intelligence calculations need a DATE column to work with, and with the “.[Date]”, you choose the date column in the default hidden Date table.
As you can see, the calculation doesn’t work if you do not include the “.[Date]” in the expression. But if you include it, then all is good. Writing time-intelligence expressions using the default Date dimension is very easy, as you’ve seen here.
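For example, a measure like the one below (a minimal sketch, reusing the FactInternetSales table and OrderDate field mentioned earlier) works against the hidden default Date table only because of the “.[Date]” reference:
Sales YTD Default = TOTALYTD(
    SUM(FactInternetSales[SalesAmount]),
    FactInternetSales[OrderDate].[Date])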
Default Date Dimension has a Built-in Date Hierarchy
One of the main benefits of using the default Date dimension is the built-in Date hierarchy of Year/Quarter/Month/Day it provides. Whenever you drag a Date field, Power BI automatically shows the hierarchy in the visual, because there is a hidden Date table with the built-in hierarchy behind the scenes.
Custom Date Dimension: Mark it as a Date Table
If you bring your own custom Date table into the model, you need to mark it as a Date Table (right-click on the table and choose Mark as Date Table). You also have to select a full date column as the Date column of the Date table. Usually, after this change, if you look at the icon of the date fields under your custom Date table, you will see them displayed differently (without the default Date hierarchy), which shows the table is now successfully marked as a Date table.
Writing Time Intelligence Calculations with Custom Date Dimension
The prerequisite for this step is to mark the table as a Date Table, which we have done in the previous step. Now you can write a DAX expression as easily as below:
Sales YTD = TOTALYTD(
SUM(FactInternetSales[SalesAmount]),
DimDate[FullDateAlternateKey])
As you can see, in this expression, we do not need “.[Date]” to get to the date field.
Because there is no hidden Date table for this column, you just refer to the column to get
the Date field. In fact, if you use the “.[Date]” here, you get an error, because there is no
hidden Date table here.
Default or Custom Date Dimension?
Now that you know all about the default Date dimension and the custom Date dimension, it is time for the main question: What is the difference? When should you choose which? The answer, as in many other situations, is: it depends! If you want generic date-based analysis and you want to build a model more easily and quickly, then the default Date dimension is very helpful in those scenarios. But if you want to do special date-based analysis (for example, public-holiday-based analysis), then a custom Date dimension will give you more power. Let’s look at the differences in detail:
Modeling with the default Date Dimension is Much Easier
It is true that using a custom Date table means taking care of relationships, marking it as a date table, creating the hierarchy, etc. If you use the custom Date dimension, you have to spend more time doing all these configurations, whereas the default Date table takes care of everything for you.
If you have a Date field, and you are not using it, Remove It!
When you have a date field, Power BI automatically creates a table for it, creates the relationship of that table to the date field, and the table consumes memory! If you do not intend to use that field, then remove it to save memory. This is also important for any other fields that you do not use in the model; remove them for better memory consumption. But it is even more important for Date fields because, behind the scenes, you will have not just a field, but also a table.
Customized and Complex Analysis is Easier with Custom Date
Dimension
While the default Date dimension is easier for generic date-based analysis, the custom Date dimension is a very powerful option when it comes to customized analysis. As an example, let’s say you want to create a date-based analysis based on public holidays. How can you do that with the default Date dimension? Well, the answer is to create a list of public holidays as a table and merge or join it to the date field, which means, in a way, creating your own custom Date dimension. In this blog post [https://radacad.com/custom-functions-made-easy-in-power-bi-desktop], I explained an example of fetching public holidays for a custom Date dimension in Power BI.
Another example is when you need more than the default hierarchy: when you want to create a weekly hierarchy, a financial calendar hierarchy, and many other scenarios. The default Date dimension is good for generic analysis, but not when it comes to more customization.
Common Mistake to Avoid!
One big common mistake that I’ve seen a lot, and it comes from the confusion of the default Date dimension vs. the custom Date table, is seeing both in one model! Let’s see what I mean:
Let’s say you want to use a custom Date dimension, so you add it to your model, you create the relationship to the date field in the fact table, and then you DO NOT mark it as a Date Table! That is where all the mistakes start! When you do not mark it as a Date Table, you allow Power BI to create the default date dimension for the date fields even under this table (assuming that the creation of default date tables is enabled in the options, which is ON by default). As a result, you have a custom Date table, but you also have a default Date dimension for the date field under your custom date table! It means you are consuming extra memory twice! And your DAX expressions also become inconsistent, mixing the custom Date table with the hidden default Date tables.
Chapter 11: Script to Create Date Dimension in Power BI using Power Query
Sometimes, you want to create a date dimension very fast, and you want just a script to copy and paste. That is what this chapter is about: here you will find the full script to create the date dimension in Power BI using Power Query, which you can simply copy and paste into your solution.
Columns Included in the Date Dimension
This date dimension (or you might call it a calendar table) includes all the columns related to the calendar year and the financial year, as below:
How to use the Script?
Create a new blank query in Power BI:
In the Power Query Editor window, go to the View tab, and click on Advanced Editor
Copy and paste the entire script here (replacing the existing text in the Advanced Editor window):
Script
You can download the script for the date dimension from the book materials.
Configuration
You need to configure the Date table based on your needs. The first few lines of the script are the configuration values that you can set; a simplified sketch of the script's overall pattern follows.
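The full script is too long to print here, but this heavily simplified Power Query (M) sketch shows the general pattern it follows, with only a handful of columns; the start and end years at the top stand in for the configuration values you would change.
// Simplified sketch of a date dimension (the full script has many more columns)
let
    // configuration
    StartYear = 2015,
    EndYear = 2025,
    StartDate = #date(StartYear, 1, 1),
    EndDate = #date(EndYear, 12, 31),
    // one record per day
    DayCount = Duration.Days(EndDate - StartDate) + 1,
    DateList = List.Dates(StartDate, DayCount, #duration(1, 0, 0, 0)),
    DateTable = Table.FromList(DateList, Splitter.SplitByNothing(), {"Date"}),
    Typed = Table.TransformColumnTypes(DateTable, {{"Date", type date}}),
    // a few descriptive columns
    AddYear = Table.AddColumn(Typed, "Year", each Date.Year([Date]), Int64.Type),
    AddQuarter = Table.AddColumn(AddYear, "Quarter", each Date.QuarterOfYear([Date]), Int64.Type),
    AddMonth = Table.AddColumn(AddQuarter, "Month", each Date.Month([Date]), Int64.Type),
    AddMonthName = Table.AddColumn(AddMonth, "Month Name", each Date.MonthName([Date]), type text),
    AddDay = Table.AddColumn(AddMonthName, "Day", each Date.Day([Date]), Int64.Type)
in
    AddDay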
Sample Records
Here are some sample records of this table:
Chapter 12: Script for Creating TIME Table in Power BI with Hours, Minutes, and Seconds Buckets
This chapter provides a script that you can simply copy and paste into your Power BI solution to get a Time table with all the commonly needed columns.
Why Time Table?
The date dimension gives you the ability to slice and dice your data by different date attributes, such as year, quarter, month, day, fiscal columns, etc. The time dimension, on the other hand, gives you the ability to slice and dice data at the level of hours, minutes, seconds, and related buckets, such as every 30 minutes, every 15 minutes, etc.
The Time table SHOULD NOT be combined with the Date table. The main reason is the huge size of the combined result. Let's say your date table, which includes one record per day, has ten years of data in it, which means 3,650 rows. Now, if you have a Time table with a row for every second, that is 24*60*60 = 86,400 rows just for the time table. If you combine the date and time tables, you will have 3,650*86,400 = 315,360,000 rows. 315 million rows in a table are not good for a dimension table. Even if you store one record per minute in your time table, you would still end up with over 5 million rows.
So don't combine the Date and Time tables. These two should be two different tables, and both can have a relationship to the fact table.
Columns Included in the Time Dimension
This time dimension (or you might call it a time table) includes all the columns listed below.
How to use the Script?
Create a new blank query in Power BI.
In the Power Query Editor window, go to the View tab, and click on Advanced Editor.
Copy and paste the entire script here (replacing the existing text in the Advanced Editor window):
Script
You can download the script for the Time dimension from the book materials.
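If you just want the general idea, the Power Query (M) sketch below shows a heavily simplified version: one row per second plus a few columns and one example bucket. The full script in the book materials has many more buckets.
// Simplified sketch of a time table: one row per second (86,400 rows)
let
    TimeList = List.Times(#time(0, 0, 0), 24 * 60 * 60, #duration(0, 0, 0, 1)),
    TimeTable = Table.FromList(TimeList, Splitter.SplitByNothing(), {"Time"}),
    Typed = Table.TransformColumnTypes(TimeTable, {{"Time", type time}}),
    AddHour = Table.AddColumn(Typed, "Hour", each Time.Hour([Time]), Int64.Type),
    AddMinute = Table.AddColumn(AddHour, "Minute", each Time.Minute([Time]), Int64.Type),
    AddSecond = Table.AddColumn(AddMinute, "Second", each Time.Second([Time]), Int64.Type),
    // example bucket: 30-minute buckets labelled by their starting time, e.g. "13:00" or "13:30"
    Add30MinBucket = Table.AddColumn(AddSecond, "30 Min Bucket",
        each Text.PadStart(Text.From(Time.Hour([Time])), 2, "0") & ":" &
            (if Time.Minute([Time]) < 30 then "00" else "30"), type text)
in
    Add30MinBucket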
Sample Records
Here are some sample records of this table:
Best Practice Suggestion
Because the Time table is a table that is needed not only in one Power BI file but in
many others, I suggest creating a dataflow entity for the Time table.
Considerations
There are a few things you need to consider if you are using this script:
This Time dimension is at the grain of a second, which means the table has one record per second.
There are some time buckets already created, including hour buckets of 12, 8, 6, 4, 3, and 2 hours, and minute buckets of 30, 15, and 10 minutes. I will add more buckets to this in the future.
If you want to use a Date dimension as well, use the script from the previous chapter to generate the Date dimension.
There might be an up-to-date version of the script with more columns in the time dimension. You can download the latest version from here [https://radacad.com/script-for-creating-time-table-in-power-bi-with-hours-minutes-and-seconds-buckets].
Here is an example of data analyzed by a couple of columns from this Time table;
Chapter 13: Star Schema and How to Build it
You might have heard about dimensional modeling or the star schema, but you may have vague ideas about what it is; or alternatively, you might have used it but are not sure if you are doing it the right way. In this chapter of the back-to-basics series, I'm going to explain what the Star Schema is, how you can build it, and what the best and worst practices for it are.
Dimensional Modeling
For every BI or reporting system, you have a process of designing your tables and building them based on that design. This process is called dimensional modeling. Some call it data warehouse design, which is the same thing. Dimensional modeling is the process of thinking about and designing the data model, including the tables and their relationships. As you see, there is no technology involved in the process of dimensional modeling. It all happens in your head and ends up with sketching diagrams on paper. Dimensional modeling is not the diagram in which tables are connected to each other; it is the process of producing that diagram.
There are multiple methods of building the diagram of tables and their relationships.
Let’s check these, one by one.
Transactional Database diagram
This is a diagram only useful for transactional systems, where operations such as insert, update, delete, and retrieve happen often. In such systems, it is important that the model is designed in a way that minimizes the effort of doing an insert, update, or delete. We are not talking about that method here. In a reporting system, the transactional database design won't be helpful; it will make the reports slow. So what are the designs for a reporting model?
Snowflake Schema
The diagram of tables can be in all shapes; however, there are two big categories when it comes to designing a diagram for reporting systems: Snowflake and Star Schema. Snowflake is when there are many relationships between tables, and when you have to pass through multiple relationships to get from one table to another. Here is an example:
Getting from the FactInternetSales table (which is the fact table) to DimProductCategory (which is a dimension table) requires passing through three relationships. More relationships will slow down the performance, and they will also bring the need for both-directional relationships. The diagram above is a snowflake schema. In a snowflake schema, there are many intermediate tables to get from one table to another. To remind you what a snowflake looks like, here is a view of it:
Dimension tables are not directly connected to the fact table; they are related through other dimensions in between.
Star Schema
Star Schema is when there is a direct relationship between the fact table and each dimension. Each fact table can have a connection (or relationship) to multiple dimensions, so it ends up with a view like this:
You cannot unsee the star in the layout above. The fact table is the heart of the star, and all dimensions are around it as the points of the star. The star schema might not have exactly five points (a pentagram); it might have more or fewer. The important part of this design is not the number of points of the star; it is the direct relationship between the fact table and every dimension.
Star Schema means the fact table is in the heart of the star, with a single relationship to each dimension around it as the points of the star.
If we want to re-design the model you have seen in the snowflake example, it should look like this:
In this example, DimProduct includes all the details of the Subcategory and Category in it too, and it is directly related to the fact table.
Star Schema Can Include Multiple Fact Tables
A common misunderstanding is that you should have only one fact table in a star schema. Well, that is somewhat true, but the fact is that you can have a combination of star schemas to build a data model. Look at the example below:
We have two fact tables: FactInternetSales and FactResellerSales. They have some shared dimensions between them (DimProduct and DimDate), and they also have their separate dimensions (DimCustomer for FactInternetSales, and DimReseller for FactResellerSales). The model above is a perfect star schema design, because each fact table still has one single direct relationship to each dimension. A model like the above gives us the ability to filter both fact tables based on the shared dimensions between them.
The Importance of the Star Schema
Star Schema is the best way of designing a data model for reporting. You will get the best performance and also flexibility using such a model. A lot of the modeling challenges and issues in Power BI implementations these days can be resolved with a good star schema design. The Star Schema is like the conductor of an orchestra. An orchestra without a conductor can play music, like a Power BI implementation without a star schema. However, the music won't be synchronized, it won't be the best it can be, and it won't be what the audience is after. However, if the Power BI implementation leverages the Star Schema (like the orchestra with a good conductor), it will play at its full capacity, answer all the requirements, and stun the audience.
[image reference:
https://i2.wp.com/www.irishtimes.com/polopoly_fs/1.1664308.1390496586!/image/image.jpg_gen/derivatives/b
ox_620_330/image.jpg]
Now that you have learned about the principles of data modeling and the star schema, you are ready to see some sample models created using the principles mentioned.
Chapter 14: Combining Dimension Tables in
Power BI using Power Query
Is it good to have too many dimension tables? Can you combine some of those tables
together to build one flatten dimension table? How much should you flatten it? Should
you end up with one huge table, including everything? In this chapter, I’m answering all
of these questions and explaining the scenarios of combining dimensions.
Sample Dataset
The dataset for this model is the AdventureWorksDW2012.xlsx Excel file. For this
example, we need these tables: DimProduct, DimProductCategory,
DimProductSubcategory, and FactInternetSales.
In the dataset, we have three tables related to Product information; DimProduct has the
details of every single product;
DimProductCategory has information about all categories of products;
Design Challenge
Let’s say we load all of the tables above (plus the FactInternetSales table, which is our transactional sales table) into a Power BI model. This is what the relationship diagram looks like:
Note that Power BI automatically creates a couple of these relationships, but you need to create the one which is highlighted in yellow yourself, connecting ProductSubcategoryKey from DimProduct to ProductSubcategoryKey in the DimProductSubcategory table.
As you can see, the visual above isn’t helpful. Product Subcategory cannot filter the ProductCategory table, and the reason is obvious: the relationship between these tables is single-directional, from the category to the subcategory.
Sometimes, depending on the requirements, the design above leads to the need for both-directional relationships, which is not recommended.
In the Power Query Editor window, click on the DimProduct table, and from the Combine section, select Merge Queries.
In the Merge Queries window, check that DimProduct is the first table, and select ProductSubcategoryKey from it; then select DimProductSubcategory as the second table, and the ProductSubcategoryKey column in there too, and then click OK.
The Merge operation flattens two tables together based on matching field(s). I have written in full detail about what Merge is and how it differs from Append [https://radacad.com/append-vs-merge-in-power-bi-and-power-query], and also about the different join kinds in merge operations [https://radacad.com/choose-the-right-merge-join-type-in-power-bi]. In this operation, the default join kind will work for us, but make sure to read about the difference between all the kinds of joins and their output samples here [https://radacad.com/choose-the-right-merge-join-type-in-power-bi].
The Merge operation creates a new column in the DimProduct table (at the end of all columns), which can then be expanded into the details coming from DimProductSubcategory. This way, we are combining the product and subcategory together.
As you can see, the two columns above are now part of the Product table.
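For reference, the steps the Merge UI generates are roughly equivalent to the Power Query (M) sketch below; it assumes the queries are named DimProduct and DimProductSubcategory, and it expands the subcategory name plus the ProductCategoryKey needed for the next merge.
// Sketch of the merge-and-expand steps
let
    Source = DimProduct,
    // merge DimProductSubcategory into DimProduct on ProductSubcategoryKey (default left outer join)
    Merged = Table.NestedJoin(
        Source, {"ProductSubcategoryKey"},
        DimProductSubcategory, {"ProductSubcategoryKey"},
        "DimProductSubcategory", JoinKind.LeftOuter),
    // expand only the columns we need from the subcategory table
    Expanded = Table.ExpandTableColumn(
        Merged, "DimProductSubcategory",
        {"EnglishProductSubcategoryName", "ProductCategoryKey"},
        {"EnglishProductSubcategoryName", "ProductCategoryKey"})
in
    Expanded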
We do the Merge operation one more time, and this time select DimProductCategory as the second table, using the ProductCategoryKey in both tables to merge.
Then we expand it to the details of DimProductCategory, which in this case is only the ProductCategoryName.
Finally, we have all the category and subcategory details in DimProduct, as below.
You can see that after the second merge, I also deleted the ProductCategoryKey column, which is not needed anymore.
Now, if you load the data into the Power BI model, you will have a simple model such
as below;
You can easily get slicing and dicing working correctly;
You can also build a model hierarchy and use it everywhere you want in the report;
How about other tables? Should I flatten all dimensions into one?
The method above was helpful, so you might think: let’s do that for the other tables too. Why not combine customers and products together, as an example? Well, flattening has a golden rule: flatten tables as long as the entities have a meaningful relationship with each other! The product category and the product have a meaningful relationship; every product belongs to a category. However, products and customers don’t have a relationship unless sales transactions are made! We already have a sales transaction table, which is the heart of this model, and flattening dimension tables into it would make it huge; it would also reduce our flexibility if we want to connect another table to product or customer later on. So in this scenario, we will not flatten customers and products. We will keep them as their own dimensions, with relationships to the fact table, building a star schema.
Summary
In summary, flattening dimension tables will avoid a lot of challenges in the reporting system in the future. It is recommended to flatten attributes into one big dimension for each entity. The Product dimension can have all the information about the color, size, brand, product number, category, etc., and the Customer dimension can have everything about the customer's job title, education, age, etc. Through this chapter, you learned what some of the challenges are when dimensions are not flattened, and you learned how you can use the Power Query Merge command to combine them together.
Chapter 15: Creating a Shared Dimension in
Power BI Using Power Query
What is a shared dimension, and why do you need one in your Power BI model? In this chapter, I will explain how it can prevent many issues, including the need for both-directional or many-to-many relationships.
Sample Dataset
To follow the example in this chapter, use the Excel file named “Shared dimension data source.xlsx”. In this sample dataset, we have three tables, as below:
The Inventory table shows the inventory details for each day in each warehouse.
The Sales table shows sales transactions.
The Manufacturing table shows summarized information about the cost of producing each product on each date.
Design Challenge
When you load the three tables above into a model, you can see that they are not related to each other.
Another issue with the both-directional relationship is that you cannot apply it to all the relationships in your model, because it might create a circular reference scenario!
Master List Does Not Exist!
The third design issue with the three tables above is that there is no master list! There are some products in each table, and we do not necessarily have all products in each table. Likewise, there are some dates in each table, and we do not necessarily have all dates in each table (Power BI will create an auto date dimension, which can resolve this issue only for Date fields, but what about other fields such as Product?).
To explain the issue, I've created both-directional relationships between all three tables to make sure that they all filter each other. All of the relationships are based on the Date fields.
Then I created a slicer with the Date field from the Sales table, and three table visuals, one from each table. The date slicer should be able to filter all three tables based on the Date selection.
If I select a value in the date slicer, it will filter all three tables (because of the both-directional relationships). However, if you look closely, the two dates mentioned in the above screenshot, and some other dates in the Inventory and Manufacturing tables, don't exist in the slicer. That is because the date slicer comes from the Sales table, and the Sales table doesn't have those dates in it! We would have the same challenge with a Product slicer if we added one.
If you use a field as a slicer that doesn’t have all possible values
in it, then it cannot show the right data from all tables. It is not
recommended to design it this way. Read the rest of the chapter
to learn how a shared dimension can fix this challenge.
Right-click on the Inventory (2) table and uncheck the Enable Load option for it. This is to save performance and avoid loading extra tables into the memory of Power BI Desktop.
Now do the same process for the other two tables, Manufacturing and Sales:
Create a reference from each table
Keep only the Product column and remove the other columns
Uncheck Enable Load on the new query
You should now have three new tables, each with only one Product column:
Set all column names to be the same
The next step is to make sure the column names are exactly the same. Because we are going to append the three tables, having different names would create extra columns. Names should be an exact match, and remember that Power Query is case-sensitive; "product" is different from "Product" in the Power Query world. In our sample model, two of the tables, Inventory and Manufacturing, have the column named Product, but in the Sales table, it is called Product Name; rename it to Product.
Append the three queries; the output of the append will be one table including all Product values. You can rename this query to Product.
Because this is a table you want to be loaded into Power BI Desktop, make sure the Enable Load option of this table is checked. This table is our master list, including all product values. However, there are duplicate values in it that have to be removed.
Remove Duplicates
A dimension should have a unique list of values, so we need to remove the duplicates of the key field here.
Before using Remove Duplicates, make sure to read this article [https://radacad.com/remove-duplicate-doesnt-work-in-power-query-for-power-bi-here-is-the-solution] about the important tips you need to know before applying Remove Duplicates in Power Query. In a nutshell, because Power Query is case-sensitive, and because spaces at the end of text values and other special characters may cause duplicate values to be kept, this is how you would remove duplicates in a few steps (see the sketch after this list):
Clean transformation
Trim transformation
Transform to Upper Case
Remove Duplicates
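Put together, the shared dimension can be expressed in Power Query (M) roughly as below; the three source query names are hypothetical placeholders for the references created earlier, each reduced to a single Product column.
// Sketch: build the shared Product dimension from the three reference queries
let
    // append the three one-column queries into one list of products
    Combined = Table.Combine({InventoryProducts, ManufacturingProducts, SalesProducts}),
    // clean, trim, and upper-case so near-duplicates are treated as the same value
    Cleaned = Table.TransformColumns(Combined, {{"Product", each Text.Upper(Text.Trim(Text.Clean(_))), type text}}),
    // keep one row per product
    Deduplicated = Table.Distinct(Cleaned)
in
    Deduplicated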
Now you have your Product shared dimension ready! Repeat this process for any other
shared dimensions with the relevant fields. However, for the Date table, we would do it
differently.
Date Dimension
You learned in an earlier chapter how to create a date dimension using Power Query
script.
Best Practice Design: Star Schema and Shared Dimensions
After loading the tables above, you can create one-to-many, single-directional relationships from the Product and Date tables to all the fact tables. This is the final design based on our example:
The above design uses two shared dimensions and avoids all the challenges mentioned:
It doesn't need both-directional relationships
It doesn't need many-to-many relationships
The Product and Date tables are master tables which can be the source of any slicing and dicing
To make sure that you won't use incorrect fields for slicing and dicing, hide the Date and Product columns in all three fact tables, as below.
This solution can now deliver proper reporting capabilities, as below.
Summary
Nothing is worse than a bad data model design. A bad data model design causes the use of relationships that decrease performance, it causes you to write a lot of unnecessary DAX expressions to compensate for the wrong design, and at the end of the day, it performs slowly. In this example, you learned one of the most basic, but also most fundamental, parts of designing Power BI data models. Using shared dimensions in your model will avoid both-directional and many-to-many relationships. You learned how easy it is to create such a dimension. This method can always be used in your Power BI data models.
Chapter 16: Combine Tables or Create
Relationships?
Despite all these chapters about the importance of the star schema and the design principles of fact and dimension tables, this might still be a question you ask: "When is it good to combine tables, versus loading them into Power BI and creating a relationship between them?" This chapter is going to answer that question.
Relationships
In Power BI models, you can load tables into the model and create relationships
between tables. The sample below is an example of that type of relationship;
Combining tables
You can also combine some of those tables together and have a simpler model with
fewer relationships, like the example below;
To combine tables with each other, you can use the Merge [https://radacad.com/append-
vs-merge-in-power-bi-and-power-query] transformation in Power Query.
Relationship or Combined?
So, the question for this chapter: should you combine the Product, ProductCategory, and ProductSubcategory tables into one Product dimension? Or should you leave them as they are and create relationships between the tables in Power BI? Let's talk about the pros and cons of each method.
Performance
One of the distinguishing features of a reporting system is its performance. If you are building a reporting system and loading a report takes five minutes, then what is the benefit of the report over doing the same thing in the operational system? It would have taken about that long in the operational system anyway!
When you build a reporting system, you should aim to create reports that load fast. Having extra relationships doesn't help with performance; the DAX queries behind the scenes would be longer.
If I have a visualization like this:
The DAX query generated by this visual is different when the tables are combined versus when they are separately connected using relationships.
This is the DAX query when the product tables are separate and connected with relationships:
// DAX Query
DEFINE
    VAR __DS0Core =
        SELECTCOLUMNS(
            KEEPFILTERS(
                FILTER(
                    KEEPFILTERS(
                        SUMMARIZECOLUMNS(
                            'DimProductCategory'[EnglishProductCategoryName],
                            'DimProductSubcategory'[EnglishProductSubcategoryName],
                            'DimProduct'[EnglishProductName],
                            "CountRowsDimProduct", CALCULATE(COUNTROWS('DimProduct'))
                        )
                    ),
                    OR(
                        OR(
                            NOT(ISBLANK('DimProductCategory'[EnglishProductCategoryName])),
                            NOT(ISBLANK('DimProductSubcategory'[EnglishProductSubcategoryName]))
                        ),
                        NOT(ISBLANK('DimProduct'[EnglishProductName]))
                    )
                )
            ),
            "'DimProductCategory'[EnglishProductCategoryName]",
            'DimProductCategory'[EnglishProductCategoryName],
            "'DimProductSubcategory'[EnglishProductSubcategoryName]",
            'DimProductSubcategory'[EnglishProductSubcategoryName],
            "'DimProduct'[EnglishProductName]", 'DimProduct'[EnglishProductName]
        )
    VAR __DS0PrimaryWindowed =
        TOPN(
            501,
            __DS0Core,
            'DimProductCategory'[EnglishProductCategoryName],
            1,
            'DimProductSubcategory'[EnglishProductSubcategoryName],
            1,
            'DimProduct'[EnglishProductName],
            1
        )
EVALUATE
    __DS0PrimaryWindowed
ORDER BY
    'DimProductCategory'[EnglishProductCategoryName],
    'DimProductSubcategory'[EnglishProductSubcategoryName],
    'DimProduct'[EnglishProductName]
If tables are combined, the code is like below:
// DAX Query
DEFINE
    VAR __DS0Core =
        SUMMARIZE(
            'DimProduct',
            'DimProduct'[EnglishProductCategoryName],
            'DimProduct'[EnglishProductSubcategoryName],
            'DimProduct'[EnglishProductName]
        )
    VAR __DS0PrimaryWindowed =
        TOPN(
            501,
            __DS0Core,
            'DimProduct'[EnglishProductCategoryName],
            1,
            'DimProduct'[EnglishProductSubcategoryName],
            1,
            'DimProduct'[EnglishProductName],
            1
        )
EVALUATE
    __DS0PrimaryWindowed
ORDER BY
    'DimProduct'[EnglishProductCategoryName],
    'DimProduct'[EnglishProductSubcategoryName],
    'DimProduct'[EnglishProductName]
The second query is much simpler, and it would perform faster when you have many large tables.
Simplicity vs. a Mess
Another big determinant of a good reporting model is how easy it is for data visualizers to understand. If a data visualizer has 500 inter-related tables in a model, he or she will have a hard time understanding the model.
Having a few tables and relationships always helps to keep the model simple to understand.
The complexity of the relationship
Relationships come with a cost of complexity. You might sometimes need a both-directional relationship, or you might run into the problem of having an inactive relationship and trying to resolve that. Fewer relationships mean fewer problems of this type when they are unnecessary.
Chapter 17: What Fields to Hide in Your Power BI Solution?
There are some ways to tidy up a Power BI solution, and one of them is hiding fields from the report view. This option, although simple, has a significant impact on making your Power BI solution tidy, clean, and easier to maintain. However, you might wonder which fields have to be hidden in Power BI. In this chapter, I will explain how to hide a field from the report view, and then which fields are good candidates for this option.
Hide in Report View
When you have too many fields in the Power BI model, the Report tab (or visualization tab) is a nightmare to navigate. You might look for a field and find multiple instances of it, or, on the other hand, find fields that are not important for visualization purposes anyway. Here is an example of seeing fields like that:
There are two ProductKey fields, which will confuse users about which one to use for the visualization. Or there are many key columns in the below report, which don't really make sense for the user: what actually is a key column, and what is it used for?
When a field is not needed in a visualization, we can simply hide it from the Report view. This can be done by just right-clicking on the field and choosing Hide in Report View.
This option is available in all tabs in Power BI Desktop (the Report, Data, or Model tab). However, the label for the option in some tabs is "Hide in report view"; in others, it is just "Hide." When you use that option, you will still see the field in the Model and Data tabs, but greyed out as below:
However, you won't see that hidden field in the Report tab anymore:
This is how you can use the Hide option to clean up your model.
Hidden Field can be used in a Calculation
It is important to know that a hidden field can still be used in a calculation. So if you don't use the field directly, you can hide it and use the calculations generated from it in the visualization.
Hiding Multiple Fields
You can also easily hide multiple fields in Power BI Desktop. The best place to do that is in the Model tab. Select all the fields that you want to hide (using Ctrl or Shift with the selection), and then, in the Properties pane beside it, turn the Is Hidden property to On.
This option was not previously available, so you had to use Shift+F10 to select multiple
items and then set this property for all of them, but luckily the new Model tab is capable
of doing that.
What Fields Have to be Hidden?
Now that you know what hiding fields is, and how to hide one or multiple fields, let's see which fields are good candidates to be hidden. Let's first start with the fields that should NOT be hidden!
Do NOT Hide
Obviously, if you are using a field in any visualization, then you should not hide it!
That brings us to the conclusion that if you do not use a field in visualization, then you
can hide it. Well, there is a big exception:
Consider Remove before hiding
If you have a field that you do not use in any visualization, and you also don't use it in any calculations, relationships, or anything at all in the Power BI report, then that field should not just be hidden; it should be removed from the model. You can use Power Query to remove any unnecessary fields. If there is a field that you do not use in the model at all, don't load it into the model, because every field consumes memory; making those fields hidden won't help the performance of your model.
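If you prefer doing it in code, removing fields in Power Query is a single step; the column names below are hypothetical placeholders for the fields you have decided not to load.
// Sketch: remove columns that are not used anywhere in the model
let
    Source = FactInternetSales,
    Removed = Table.RemoveColumns(Source, {"UnusedColumn1", "UnusedColumn2"})
    // or keep only what you need, e.g. Table.SelectColumns(Source, {"ProductKey", "OrderDate", "SalesAmount"})
in
    Removed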
Hide These Fields
If there is a field that you need to bring into the model for some reason, but you won't use it in a visualization, then it is a good candidate to be hidden. Here are some of these types of fields:
Fields used in a Relationship
Fields used to Sort Other Fields
Fields used in Hierarchy
Fields used in DAX Calculations, but not in visuals directly
Fields used in a Relationship
In the example below, DimProduct and FactInternetSales are related to each other using ProductKey:
We cannot remove the ProductKey from this model, because we need it to connect the two tables. However, for users, the ProductKey is unnecessary, because they will use other Product columns such as Product Number, Name, Size, Color, etc. So these fields are good candidates to be hidden from the report view (in both tables).
Fields used to Sort Other Fields
For example, in the below Date table, the EnglishMonthName column is sorted by the MonthNumberOfYear column.
The MonthNumberOfYear column is needed in the model for sorting, but most probably not for the visual itself, so it is a good candidate to be hidden.
Fields used in Hierarchy
When you have a hierarchy of fields, most of the time, you don't need the fields individually, because you can drag and drop them individually from the hierarchy if you want. Having them not hidden will just confuse users in the report view:
Fields used in DAX Calculations, but not in visuals directly
Many times you don't use a field directly in a visual, but you create variations of calculations from it and then use those calculations in visuals in Power BI. Those fields are good candidates to be hidden.
Tools Can Help: Power BI Helper
Power BI Helper [https://radacad.com/power-bi-helper] is a FREE tool that can help you determine which fields are used in the visuals of a Power BI report, so that you can hide the rest; it helps you to tidy up your model.
Summary
Hiding fields in Power BI is very simple but effective for cleaning up your data model. In this chapter, you learned how you can hide one or multiple fields in Power BI. You also learned which fields are good candidates to be hidden. Now it's time to go and check your Power BI model to see if you have any fields that need to be hidden. You can always use a tool such as Power BI Helper to find them more easily.
Chapter 18: Build Your First Star Schema Model
in Action
Dimension and fact tables are the two essential parts of building a data model, and their relationship in the form of a star schema is the best practice for doing the modeling. However, these are all theories until you really experience them in a real-world example. In this chapter, I want to put everything you learned in the past few chapters into practice and show you how we can design a star schema for a dataset. I will be using a movie dataset for this example.
Sample Dataset
The sample dataset used here comes from two sources:
The movies' sales information is fetched from the Box Office Mojo website [https://www.boxofficemojo.com/chart/top_lifetime_gross/?area=XWW].
The movies' rating information is fetched from IMDB [https://www.imdb.com/chart/top/].
If you are interested in how to do the data mashup and preparation for this dataset, read my blog here [https://radacad.com/get-started-with-power-query-movies-data-mash-up].
However, you don't need to do the data preparation steps; this dataset is ready to use. Start with the Power BI file in the book materials named "Movies.pbix."
The Movies Sales information is a data table like this:
There is no all-in-one single-table model. You have to build the model for it.
Fact Table
Let’s break our Movies Sales table into facts and dimensions. For reference, here is the Movies Sales table again.
Facts are all the numeric values that might be aggregated at some stage. The facts in this table are Lifetime Gross (sales) and Rank. Rank can be a fact, but it can also be a dimension; let’s leave it in the fact table for now. The fact table is usually built based on the existing main table that we have, so we can call the Movies Sales table FactMoviesSales.
FactMoviesSales will have these facts in it:
Lifetime Gross
Rank
Dimension Tables
On the other hand, there are fields that we use as the slicer, filter, or axis of our visuals, such as the Year, the Studio, and the Title of the movie. Now that you know what descriptive fields are, you can derive dimensions from them. However, not all of those fields go into one single dimension, and not every single field goes into its own dimension either; both are wrong. You should find out which fields together can build a dimension.
The descriptive fields are:
Year
Studio
Title
If a movie is made in multiple years, then there will be multiple instances of that movie, one per year. On the other hand, the year of production of a movie is part of the movie's attributes anyway. So we need a Dimension table with these fields in it. Please note that you don't always get all the descriptive fields combined into one dimension; this depends on the scenario.
To build the Movie dimension, you can create another copy of the fact table. This can be done using Duplicate (not Reference, because we might need to merge this table back into the fact table later on).
The new Movie dimension, which I'm going to just call Movie, will have only the three descriptive fields. You can remove all the other fields.
You also need to make sure that your dimension table doesn't have duplicate entries. So select the combination of all three fields, and then Remove Duplicates.
Note that in Power Query, Remove Duplicates is sensitive to upper case and lower case, and also to characters such as spaces at the beginning or end of text values.
After doing all these actions, you have the Movie dimension ready. However, there is still one more step left.
Linking Dimension(s) and Fact Tables
When you bring multiple tables into Power BI, you need to create relationships between them, and a relationship can only be created based on one field in each table. In our case, we have three fields to connect on (Title, Studio, and Year), so we need to create a single field for the connection. There are two methods to do that: one is to create a Key column for the dimension table, which can be an auto-number or Index column; another is to create a concatenated field. Because Power BI, at the end of the day, applies compression on the fields, there is not much difference between the two methods in this example. However, if you have really long text fields, where concatenation would cause some issues, then the key column can be the better approach.
Concatenating fields is easy. So I show you the method of Key Column instead. I add a
new Index column to the Movie table;
This column is what we call a surrogate key in the dimension table; let’s call it
MovieKey.
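As a sketch, the Index step in M is a single Table.AddIndexColumn call; the inline table below stands in for the deduplicated Movie dimension from the previous sketch:
let
    // Stand-in for the deduplicated Movie dimension
    Source = #table(
        type table [Title = text, Studio = text, Year = number],
        {{"Titanic", "Paramount", 1997}, {"Avatar", "Fox", 2009}}
    ),
    // Add a 1-based index column and use it as the surrogate key (MovieKey)
    WithKey = Table.AddIndexColumn(Source, "MovieKey", 1, 1)
in
    WithKey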
The next step is to bring the MovieKey into the FactMoviesSales table too. For that, we
use a Merge from the FactMoviesSales table to the Movie table, based on the three
fields of Title, Studio, and Year (in the same order in both tables);
After the Merge, we can expand the merged column and fetch the MovieKey from the
Movie table.
And then, you can remove the fields you used for the join from the FactMoviesSales
table (see the M sketch below). This is what your FactMoviesSales will look like;
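Here is a minimal M sketch of the Merge, Expand, and column-removal steps. The inline Source table is only a stand-in for the fact query’s earlier steps, and I’m assuming a dimension query named Movie that already has the MovieKey column:
let
    // Stand-in for the fact query's earlier steps
    Source = #table(
        type table [Title = text, Studio = text, Year = number, #"Lifetime Gross" = number],
        {{"Titanic", "Paramount", 1997, 2200000000}}
    ),
    // Merge with the Movie dimension on the three shared fields
    Merged = Table.NestedJoin(
        Source, {"Title", "Studio", "Year"},
        Movie, {"Title", "Studio", "Year"},
        "Movie", JoinKind.LeftOuter
    ),
    // Expand only the surrogate key from the joined Movie table
    WithKey = Table.ExpandTableColumn(Merged, "Movie", {"MovieKey"}),
    // The join fields are no longer needed in the fact table
    Result = Table.RemoveColumns(WithKey, {"Title", "Studio", "Year"})
in
    Result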
Note that there are a lot of details that I skipped in this example to simplify the
process. For example, you have to take care of all of these:
Making sure you have unique combinations. If you get multiple records from the
Movie table for one row in the fact table, then you need to investigate why, and
you might need to change the combination of fields in the dimension table.
If there are blank/null values in the MovieKey column, you need to investigate
why it didn’t find a matching row.
When you combine this with other tables, you need to make sure that the Movie
table is maintained as a master list, which is a big topic in itself.
…
Now you can load these two tables into Power BI and create a relationship between
them based on the MovieKey. As you can notice in the screenshot below, MovieKey is
hidden because it is a technical field; the user doesn’t need to know about it at all.
Make sure the relationships between the fact table and the dimension tables are always
one-to-many and single-directional from the dimension table to the fact table.
So here you go; now you have a star schema. At the moment, it is just one fact table and
one dimension, but it is a star schema nonetheless.
Adding More Tables
The main benefit of the star schema becomes visible when you add more tables to it.
For example, if I now bring in the Movies Rating table, which looks like the below:
Then we can connect it to the Movie dimension. However, because the Movie
dimension is not yet a master list and doesn’t have all movies in it, you might need to
merge the list of movies into it and create one master list. I explained a bit about a
process like this in the chapter about the shared dimension in this book.
Bringing the MovieKey into the FactMoviesRating table would make it look like the
below screenshot;
Loading this table into Power BI as well results in another fact table related to the
Movie dimension.
You can then add more dimensions and fact tables to this model later, and extend the
model so it always remains a star schema. Now, with the power of the star schema and
shared dimensions, I can easily build visualizations like the ones below without the
need for bi-directional or many-to-many relationships or extra calculations.
Summary
Building a star schema design is not an easy job; it takes time. Probably 70 percent of
your time developing a Power BI solution is spent building your data model using
Power Query. This chapter was a quick run through some of the principles and
concepts that you learned in the previous set of chapters of Back to Basics of Power BI
Modeling. As you have noticed, it is not possible to explain every single step in full
detail; it might end up being ten or more chapters for even such a small example.
However, I explained the main topics and the order of things you need to follow to build
your model in the right way.
Chapter 19: How to Use Time and Date
Dimensions in a Power BI Model
You have learned about date and time dimensions separately. You may wonder how they
will be used together in a data model. In this chapter, I’ll explain that.
Date Dimension
The date dimension or calendar dimension is one of the most common tables in a
reporting data model. This is a table with one row per date with all different date
attributes.
Time Dimension
Because combining date and time in a single table would make that table very big, the
time dimension is a separate table with one record per time bucket required for the analysis.
Time and Date Tables in a Data Model
The time and the date tables should not be related to each other; each of them should be
related to the fact table. Below, you can see that the fact table has a combined date-and-
time column (not separated);
To connect this table to the time and date tables, you have to separate the date and time,
which you can do easily in the Power Query Editor.
Separate Date and Time
Go to Transform data;
In the Power Query Editor, click on Add Column, and then under Date, select Date
Only.
Do a similar thing with Time, selecting Time Only;
Now that you have the date and time columns separately, you can remove the combined
date-and-time column;
Now you can Close & Apply the Power Query Editor window and create the
relationship between this table and the Date and Time dimensions; a minimal sketch of
these steps in Power Query is shown below.
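As a sketch, the three Power Query steps above look like this in M. The query and column names here are hypothetical; I’m assuming the combined column is called “Transaction DateTime”:
let
    // Stand-in for the fact query with a combined date-and-time column
    Source = #table(
        type table [#"Transaction DateTime" = datetime, Amount = number],
        {{#datetime(2020, 9, 1, 14, 35, 0), 100}}
    ),
    // Date Only
    WithDate = Table.AddColumn(Source, "Date", each DateTime.Date([#"Transaction DateTime"]), type date),
    // Time Only
    WithTime = Table.AddColumn(WithDate, "Time", each DateTime.Time([#"Transaction DateTime"]), type time),
    // The combined column is no longer needed
    Result = Table.RemoveColumns(WithTime, {"Transaction DateTime"})
in
    Result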
What if we want to have budgets and actuals in the same data model? What if there are
two tables with different granularities? That can also be solved with a zero-complexity
model. In this chapter, I’m going to explain, through an example, how two tables with
different granularity (such as budget and actual) can be modeled easily in Power BI.
Sample Dataset
The sample dataset used here is the AdventureWorksDW2012.xlsx Excel file.
Sales Star Schema
A fact table is a table that records actions or events; it keeps numeric, aggregatable
data in it. An example of a fact table is the FactResellerSales table in the model below,
which is connected to DimProduct, DimDate, DimSalesTerritory, and DimEmployee.
To create this model;
Get data from the AdventureWorksDW2012.xlsx Excel file data source mentioned above
Select all the tables mentioned
Remove the relationship between DimSalesTerritory and DimEmployee
Set the relationship between FactResellerSales and DimEmployee as Active
Create a relationship between the OrderDateKey in the FactResellerSales table
and the DateKey in the DimDate table
The model above is a perfect representation of a star schema. This model can easily
answer all questions regarding sales. However, it seems difficult to think about the star
schema when you want to bring in another table with a different granularity.
Sales Quota: Different Granularity
For this example, instead of budget data, I have Sales Quota, which has a different
granularity. The concept and principles remain the same: we have a table with a
different granularity that we want to connect to this model. The Sales Quota table has
the quota of sales for each employee in each quarter. The granularity of this table is per
employee and per quarter, while the granularity of the sales table is per employee, per
day, and also per combination of SalesTerritory and Product.
If you get the FactSalesQuota table from the AdventureWorksDW2012.xlsx Excel file,
you will notice that there is a Date column and a DateKey column, which I have
removed in my example below because they make it harder to see that the table shows
quarterly data (we will calculate the DateKey later in this chapter).
Here is what the data in the Sales Quota table looks like (FactSalesQuota in the
AdventureWorksDW2012.xlsx Excel file):
I’m going to explain an easy method to solve it, and then it will be a proper star schema
again. Let’s check it out.
Connect SalesQuota to the Date Table
Although the sales quota information is quarterly based, you can still connect it to the
date dimension. This would avoid creating an extra dimension for the quarter and the
snowflake connection between the quarter dimension and the date dimension. The only
thing to keep in mind is that you have to pick a specific date in each quarter as your
default value. For example, we can consider the first day of each quarter as our date
value.
To achieve this, we need to create a column in the SalesQuota table that has the
DateKey in it (the DateKey used in my sample is in the format YYYYMMDD, so it will
be something like 20190101, 20190401, etc.). This is what we are going to build in this
part:
To create that field, you can easily do these steps:
1. Create month number based on the quarter number
2. Create the first date of that month as a custom column
And then change the data type of the DateKey to Whole Number; you can also
remove these columns: CalendarYear, CalendarQuarter, and First Month of the Quarter
(a minimal Power Query sketch of these steps is shown below).
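Here is a minimal M sketch of these steps. I’m assuming the FactSalesQuota query has CalendarYear and CalendarQuarter columns as described above, and the sketch computes the numeric DateKey directly instead of going through an intermediate date column; the inline rows are only stand-ins:
let
    // Stand-in for the earlier steps of the FactSalesQuota query
    Source = #table(
        type table [EmployeeKey = number, CalendarYear = number, CalendarQuarter = number, SalesAmountQuota = number],
        {{272, 2019, 1, 28000}, {272, 2019, 2, 7000}}
    ),
    // 1. First month of the quarter: Q1 -> 1, Q2 -> 4, Q3 -> 7, Q4 -> 10
    FirstMonth = Table.AddColumn(Source, "First Month of the Quarter",
        each ([CalendarQuarter] - 1) * 3 + 1, Int64.Type),
    // 2. DateKey of the first day of that month, in YYYYMMDD format (e.g. 20190401)
    WithDateKey = Table.AddColumn(FirstMonth, "DateKey",
        each [CalendarYear] * 10000 + [First Month of the Quarter] * 100 + 1, Int64.Type),
    // Drop the helper columns
    Result = Table.RemoveColumns(WithDateKey,
        {"CalendarYear", "CalendarQuarter", "First Month of the Quarter"})
in
    Result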
I haven’t explained the details of each step, because your data might not be at the
quarter level, and the date key that you use might have a different format. You might
even use the Merge option in Power Query to connect this table to an aggregated
version of the date dimension, based on year and quarter, and then fetch the first date of
each quarter from the expanded column. No matter how you do it, the main point of all
these steps is to make sure you can connect your table to the date dimension without
needing an extra dimension.
Star Schema Everywhere
After adding the DateKey to the Sales Quota table, you can connect it to the date
dimension and also to the Employee table. As you can see, we still have a star schema
everywhere. Note that you can have multiple fact tables and multiple star schemas in
one model.
There is no need for a bi-directional relationship or connecting a dimension to another
dimension. This model can answer your budget vs. actual questions easily!
Sample Analysis
To show some sample analysis of the Sales Quota vs. the Reseller Sales in the above
scenario, I have created some visual reports. All these visuals work without needing
any complicated DAX calculation or bi-directional relationship. All of that is possible
because the data model is set up right!
Simple DAX Measures
I only created two measures for this calculation, and they are very simple. The first is Bud vs Act:
Bud vs Act =
VAR Act = SUM(FactResellerSales[SalesAmount])
VAR Bud = SUM(FactSalesQuota[SalesAmountQuota])
RETURN
    Bud - Act
and Bud vs Act %:
Bud vs Act % =
VAR Bud = SUM(FactSalesQuota[SalesAmountQuota])
VAR Act = SUM(FactResellerSales[SalesAmount])
RETURN
    DIVIDE(Act, Bud)
Sample Reports
And here are my reports:
By Employee:
And the details which are a drill through from the other two pages;
I often get this question after I explain the star schema: both of my dimensions are
filtering my fact table values, but how can I get one dimension to filter another
dimension? For example, how can I get only the list of products that have been
purchased by a given customer? In this chapter, I’ll explain how you can do that with
the star schema.
Sample dataset
The sample dataset for this example has only three tables below:
DimCustomer and DimProduct both have a one-to-many single directional relationship
to the FactInternetSales. This is a simple star-schema model.
We also have a report like below;
In the report above, with a selection in the FullName slicer (from the DimCustomer
table) or the Color slicer (from the DimProduct table), the table visual gets filtered. But
what about other scenarios?
Only showing Products that this customer purchased
What if your requirement is to filter the Color slicer based on only products that the
selected customer has purchased? In the above report, the Product slicer doesn’t get
filtered by the selection of the Customer slicer.
Why Doesn’t Filtering Happen Automatically?
You may ask why the Product slicer doesn’t get filtered. The reason is that the direction
of the relationship is from the Product table to the fact table, which means only the
Product table can filter the fact table’s data, not the other way around.
This is normally the case in a star schema: the fact table is in the middle and gets
filtered by all dimensions, but dimensions can’t filter each other.
How to Filter One Dimension By Another?
If you have the requirement that one dimension should get filtered by another
dimension, there are two ways to implement it. One is to use a bi-directional
relationship (which I don’t recommend). Another is to use measures for filtering.
Why not a bi-directional relationship?
A bi-directional relationship comes with two major problems. The first problem is
its significant performance impact. The second problem is that you cannot apply
bi-directional relationships all the time; they bring circular references and ambiguity to
your data model.
Using a Filter Measure
Another option is to use measures as a filter, which is what I called here as Filter
Measures. You can create a measure like this:
Sales = SUM(FactInternetSales[SalesAmount])
This is a very simple measure that calculates the sum of the SalesAmount from the fact
table.
You can use this measure in the Product slicer (the Color slicer) as a visual-level filter
and set the value to be greater than zero. This means only the Product Colors that have
sales, based on the current selection of the other slicers in the report, will be shown.
As a result, the Customer slicer can now filter the Product slicer without having a
bi-directional relationship in the data model.
You can also use this measure to filter the Customer slicer based on the Product slicer
if you want it the other way around.
This method is not just for slicers; it can work for any other visuals too.
If a simple expression like the one above doesn’t work, you can take it one step further
and create a measure using the CROSSFILTER function, such as the one below;
Filter Product =
CALCULATE(
    SUM(FactInternetSales[SalesAmount]),
    CROSSFILTER(
        FactInternetSales[ProductKey],
        DimProduct[ProductKey],
        Both
    )
)
However, using a measure like the one above is normally not necessary because, as you
have seen, the simple measure works like a charm.
Summary
In summary, the simple trick is to create a measure that calculates a value from the fact
table. Then use that measure as a visual-level filter on the visual, and set the filter to
be greater than zero (or whatever threshold indicates activity in the fact table).
Part 4: Calculations
This part talks about how calculations fit into Power BI modeling. Should a
calculation be part of the data preparation and built into the structure of the star
schema, or should it be done as a DAX calculation after the modeling? This part takes
you through some real-world examples to understand the concept.
Chapter 22: M or DAX? That is the Question!
What is the main difference between M and DAX? Why can we create a calculated column
in two different places? What are the pros and cons of each? Which one should I use for
creating a profit column? Why can’t I do all of it in only one of them, DAX or M? Why two
different languages? Why is the structure of these two so different? If any of these are
your questions, then you need to read this chapter. In this chapter, I’ll go through the
differences between these two languages and explain why, when, and where to use each.
What is M?
M is the scripting language behind the scenes for Power Query. M is the informal name
of this language; the formal name is Power Query Formula Language, which is really
long, and even Microsoft refers to it as M. M stands for many things, but the most
common one is Mashup, which means this language is built for data mashup and
transformation. M is a functional language, and the structure of an M script can be
similar to this:
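As an illustrative sketch (the file path, sheet name, and column names here are hypothetical), an M query typically has this shape:
let
    // Each step builds on the result of the previous one
    Source = Excel.Workbook(File.Contents("C:\Data\Sales.xlsx"), null, true),
    SalesSheet = Source{[Item = "Sales", Kind = "Sheet"]}[Data],
    PromotedHeaders = Table.PromoteHeaders(SalesSheet, [PromoteAllScalars = true]),
    FilteredRows = Table.SelectRows(PromotedHeaders, each [Amount] > 0)
in
    FilteredRows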
M has a step-by-step structure. Usually (not always), every line in an M script is a
data transformation step, and the next step uses the result of the previous one.
The structure of the M language is usually easy to follow for a programmer because of
its programming blocks, such as let and in, and other familiar programming language
features.
What is DAX?
DAX is the Data Analysis eXpression language. It is the common language between
SQL Server Analysis Services Tabular, Power BI, and Power Pivot in Excel. DAX is
an expression language, and unlike M, it is very similar to Excel functions. In fact, DAX
has many functions in common with Excel. However, DAX is much more powerful than
Excel formulas in many ways. Here is an example of what a DAX expression can look like:
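The expression below is only an illustrative sketch, using table and column names that appear in other examples in this book; the measure name is mine:
Red Product Sales =
CALCULATE(
    SUM(FactInternetSales[SalesAmount]),
    DimProduct[Color] = "Red"
)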
DAX calculations are built in a way that makes sense mostly for Excel users; normally,
Excel users are very comfortable with this language. Everything goes through functions.
DAX doesn’t have programming blocks in it; it is a combination of function calls,
filters, and expressions.
Example Usage of M
M can be used in many data transformation scenarios. As an example, it can be used to
pivot or unpivot data, or to group the data based on a number of columns. Here is how
a Pivot/Unpivot can work in Power Query;
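As a small illustration (the table and column names are hypothetical), unpivoting month columns into attribute/value pairs, and pivoting them back, can be done like this:
let
    // A small stand-in table with one column per month
    Source = #table(
        type table [Product = text, Jan = number, Feb = number],
        {{"Bike", 100, 120}, {"Helmet", 30, 45}}
    ),
    // Unpivot: every column except Product becomes a Month/Sales pair
    Unpivoted = Table.UnpivotOtherColumns(Source, {"Product"}, "Month", "Sales"),
    // Pivot back: one column per distinct Month value, aggregated with List.Sum
    Pivoted = Table.Pivot(Unpivoted, List.Distinct(Unpivoted[Month]), "Month", "Sales", List.Sum)
in
    Pivoted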
(Image: different types of kitchen knives — reference: http://forkitchen.blogspot.co.nz/2008/10/what-are-different-types-of-kitchen.html)
Almost every knife in the above picture can be used for cutting cheese except one of
them! So why are there so many knives if they can all cut cheese? The answer is that
these are not knives for cutting cheese! Each knife is good for one special purpose. For
cutting bread, the bread knife gives you the best result. For cutting a fillet, you normally
need another type of knife. But, as you’d agree, for some cases (such as cutting cheese!)
you can use many of these knives. Let’s now go back to the original question;
The name speaks for itself; these are tables created by a calculation. As these are in-
memory tables, their calculation is based on DAX. There are many benefits to using
calculated tables, such as using them for role-playing dimensions (for example, having
more than one date dimension in a model). There are some DAX functions and
expressions that return a table as a result, and using them as a table in your model is
sometimes really helpful. For example, you might want to create a table for the top 10
customers and then use that as the main source table in many reports. In this chapter,
I’ll explain some use cases of calculated tables.
Role-Playing Dimension
The very first functionality that comes to mind when we talk about calculated tables
is the ability to create role-playing dimensions. Role-playing dimensions are dimensions
with the same structure and data rows that play different roles in our data model. For
example, the Date dimension is a generic dimension. However, in a sales transaction
table, you might have more than one date column to relate to the date dimension. In the
example below, we have three date fields in the FactInternetSales table: Order Date,
Ship Date, and Due Date.
The three fields should be related to three different date dimensions (because the
Tabular model that Power BI is based on doesn’t support role-playing dimensions
built-in). So what you can do is load the date dimension only once. Here is the example
date dimension loaded in Power BI Desktop:
Now I can create role-playing dimensions by creating a Calculated table:
This will create an in-memory table for me and allow me to write the definition of the
table.
The language for the table definition is DAX. We want an exact copy of the DimDate
table here, so I can simply use the ALL function in DAX, as below:
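As a sketch, the calculated table definitions are one-liners; the table names below simply match the roles in this example:
Ship Date = ALL(DimDate)
Due Date = ALL(DimDate)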
For the relationship above, I have also created a Due Date dimension and renamed the
original DimDate to Order Date.
In-Memory Structure, Less Refresh Time
Calculated tables load into memory, so your Power BI file size will increase. However,
you don’t need to read them again from the external data source. Yes, you could create
multiple views in the source database and connect to them through the Get Data section
with Power Query. However, their data would need to be populated from the source
database every time a refresh happens (either scheduled or manual).
Without Calculated Tables
Here is an example of three date tables loaded from the external data source:
With Calculated Tables
Here is only one date dimension loaded (for the role-playing dimension example
above):
As you can see, this is much more efficient in terms of reducing the refresh time.
However, the memory consumption would be the same in both methods.
The date dimension was a narrow-dataset example. You might need role-playing for
much bigger tables, and in such cases, calculated tables will save you a lot of data
refresh time.
DAX Table Functions
There are some DAX functions that return a table, for example, the ALL function, which
I used in the role-playing sample above. ALL is a simple example of a DAX function
that returns a whole copy of the source table. Let’s have a look at some other
examples and see how this works in other scenarios.
Top 100 Customers as a Calculated Table
There are many cases where a business considers only the top 10 or top 20 customers
and filters the whole dashboard and set of reports down to them. Usually, the main
reason is that the top 10 or 20 customers bring most of the revenue to the business.
Fortunately, there is a TOPN function in DAX, which helps us build such
calculations. The TOPN function returns a table; with it, we can choose how many
rows we want in the result set, the table to take them from, and the expression (and sort
order) to rank by.
In this example, I want to show you how to use a calculated table to generate a list of
top 100 customers. As a business requirement, I want to visualize the total revenue from
the top 100 customers and compare it with the total revenue of the whole business.
There might be different ways to calculate and visualize it, but I want to do it with a
calculated table as a sample scenario.
Summarize
SUMMARIZE is a DAX function that generates a grouped list from a table; it works
similarly to GROUP BY in T-SQL. So, if I want to create a table with CustomerKeys
and their total sales amount, I can write an expression like the one below:
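A minimal sketch of such an expression, consistent with the parameter descriptions that follow, is:
Customer Sales =
SUMMARIZE(
    FactInternetSales,
    FactInternetSales[CustomerKey],
    "Total Sales", SUM(FactInternetSales[SalesAmount])
)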
Here are details about parameters I passed in the expression above to Summarize
function:
The first parameter is the source table. The FactInternetSales is the source table
that I want the group by (summarize) operation to be applied to it.
The second parameter is the Group-by column. The CustomerKey in the
FactInternetSales table is the column that I want to use as the key for grouping.
The third parameter is the output column name. I named the output calculated
column name as Total Sales.
The fourth parameter is the output column’s calculation. Here I write the
calculation for the output column, which is simply the sum of the SalesAmount
column.
TOPN
Now that we have a list of customers with their total sales, it is easy to get the top 100
customers. I can simply use the TOPN function like this to create another calculated
table (I could have done this example with only one calculated table instead of two, but I
used two tables to help you understand the logic better);
Top 100 Customers = TOPN(100, 'Customer Sales', 'Customer Sales'[Total Sales], DESC)
What is the difference between a Measure and a Calculated Column? In which situations
should we use each of them? In this chapter, I’m going to explain the difference
between a DAX calculated column and a measure.
Read this if you have any of the questions below
This chapter is written for you if you want to understand the difference between a
Calculated Column and a Measure in DAX and have any of the questions below;
What is a Calculated Column?
What is a Measure?
When should I write a calculated column or measure?
What is their difference in Performance?
What are operations that I cannot do with these?
And many other questions about the difference between these two types of
calculations in DAX.
What is a calculated column?
A calculated column is a column like any other column in the table. However, the
result of a calculated column comes from evaluating a DAX expression. The
calculated column’s DAX expression is evaluated for every row of the table, and the
result is stored in the new column.
Example: Profit as a calculated column
Consider a table that has sales and cost information in it. Calculating Profit in
such a table would simply be deducting costs from sales for every row, so this
would basically be a calculated column.
Expression:
Profit = FactInternetSales[SalesAmount] - FactInternetSales[TotalProductCost]
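By contrast, an aggregation over the whole table is typically written as a measure rather than a calculated column. As a minimal sketch (the measure name is mine; the column is the same SalesAmount used in the other examples in this chapter):
Sum of Sales = SUM(FactInternetSales[SalesAmount])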
Aggregation can be done with a number of functions in DAX, such as SUM, SUMX,
AVERAGE, CALCULATE, and heaps of other aggregation functions. Now, let’s answer
the most important question:
How to see the Value of the Measure?
Measures are calculated on the fly. This is, in fact, one of the most important conceptual
differences between a measure and a calculated column. Okay, measure values are
calculated on the fly, so how can you see the value? The answer is: by putting it into a
report!
If I drag the measure above into a report as a card visual, I get a result;
When there is no filter applied in the report, this will return the grand total of sales,
$29.36M. However, if I add a slicer in the report, and select a value in it, I’ll see a
different result;
Now the measure calculation only shows me the sum of sales for the year 2007, which
is $9.79M.
Filter Context
A measure evaluates on the fly. If there is a slicer value for 2007, then the calculation
will be done on the subset of data for 2007. If there is a table visual somewhere that
slices and dices the data by the Education category, the result of the measure will take
that into account as well. We can then say this:
A measure evaluates its value based on the subset of data selected
by filters, slicers, or the slicing and dicing components of visuals in
the report. This filtered dataset is called the Filter Context.
The filter context is basically the combination of all filters that affect the calculation of
a measure. There is much more to say about the filter context; however, this should be
enough to understand the rest of this chapter.
Measures do not consume RAM; they consume CPU
Based on what you’ve learned above, measure calculation is done on the fly. This
means the measure’s value is not stored in memory; the measure will not consume
memory (RAM) at all. On the other hand, measures consume CPU because their
calculation has to be done right at the time of visualizing them. If you change a filter or
slicer, the calculation has to be done again. Because the response time should be fast,
this calculation is done by the CPU.
What is the side effect?
If you have many measures in your report, and their calculations are complex, then with
every change of a filter or slicer you end up with a lot of spinning circles, which shows
the CPU is working hard to calculate all the values.
Measures highlights
Based on the above explanations, here are the highlights of a measure;
Calculated based on all filters: the Filter Context (usually, not always)
Is not stored and is not pre-calculated
Calculated on the fly when you put it on a report page, when you change a
slicer or filter, or when you click on a column chart or any other visual to
highlight, if that affects the measure’s value
Consumes CPU for its calculation
When to use Measure, and when to use Calculated Column
Now that you know about these two types of calculations, we come to the critical
question: when to use which? Do you need a measure for your calculation or a
calculated column? The answer depends on what you want to calculate. This is an
important question to ask yourself whenever you want to create a new calculation.
This dynamic nature of measure calculation in DAX is something that you cannot find
in many tools. That is why measures are so commonly used in DAX. In fact, 70 percent
of your time writing DAX is spent on writing measures, if not more!
Summary: Calculated Column vs. Measure in a nutshell
Let’s wrap it all up and go through a comparison of these two types of calculations in
DAX;
Part 5: Better Data Model
You can take your Power BI data model to the next level if you leverage other
components of the Power BI ecosystem. In this part, you will learn how these
components can make your model even better.
Chapter 25: Move Your Shared Tables to
Dataflow; Build a Consistent Table in Power BI
In this chapter, I’m going to explain the best practice for creating shared tables in detail.
The method described in this chapter ensures that the solution keeps a single
instance, one version of the truth, of your Power Query table through a dataflow, and
builds a consistent table across your Power BI solutions. Part of the content of this
chapter comes from a Microsoft documentation article that I wrote some time ago.
What is Power BI Dataflow?
A dataflow is a Power Query script running in the cloud, storing its result in Azure
Data Lake storage. The output of the dataflow is not dependent on a Power BI dataset
and can be stored separately.
Shared Table in Power Query
It often happens that you work with multiple Power BI (*.pbix) files in your day-to-day
job, and it also happens that you might need one table to be available in two or even
more separate files. Customer, Date, Product, Vendor, etc. are all examples of
tables that might be needed in multiple files. You might use the Product table in the
“Sales.pbix” file for sales analysis and also in the “Inventory.pbix” file for warehouse
inventory analysis. The diagram below shows a Date table that is used in two different
*.pbix files;
The date dimension is not the only shared table you may have. Product, Customer, Sales
transactions, Vendor, Warehouse, and many other tables might be needed in multiple
files, and as a result, can be shared tables. You need to know the right way to handle
these tables.
Any Power Query table that you may use in multiple *.pbix files is
a shared table.
The first two methods are the BI developer’s way of doing things, and the fact is that
probably 90 percent of Power BI users are not BI developers. On the other hand, some
of them involve other tools and services (such as SQL Server, ADF, or SSIS), which you
might not have. Dataflow, however, is a method that, despite being a good
solution, is also easy to implement, even for a business user with a little knowledge
of Power Query. And that is the method we are talking about here.
Dataflow for Shared Table
The highlight of the process is that the dataflow runs the Power Query script and
stores the output in Azure Data Lake storage; then, using Get Data from Dataflow in
Power BI, we source the data from that single table. You can then use that table in as
many *.pbix files as needed. If you want to change the transformation, there is only one
place to change it: the dataflow script of that table. This is a consistent,
low-maintenance, and robust solution.
The gateway isn't needed for data sources residing in the cloud, such as an Azure SQL
database.
Configure connection
In the next step, configure the connection to the data source using the Configure
connection option; enter credentials or anything else needed to connect to the data
source at this stage.
Verification
If you've done all the steps successfully, you should see a preview of the data in the
Power Query Editor.
Some Power Query Desktop functions require a gateway in Power
Query Online
Some of the functions might require a gateway, even if their source is not on-premises.
Among these are functions such as Web.BrowserContents and Web.Page. If this happens, you
might get an error message indicating which specific function isn't supported. The figure
below shows an example of one of these scenarios.
If a scenario like this happens, you have two options. You can set up the gateway for
that data source, or you can update the query in the dataflow's Power Query editor
using a set of steps that are supported without the need for the gateway.
Refresh the dataflow entities
After migrating your queries to the dataflow, you must refresh the dataflow to get the
data loaded into these entities. You can refresh a dataflow manually or configure an
automatic refresh based on a schedule of your choice.
You can schedule the refresh for the dataset and connect to on-premises sources
(through a gateway) or cloud-based sources.
Around April 2017, the ability to create a report from Power BI Desktop that points at
an existing dataset, with a live connection to it, became available.
In those days, it was called Get Data from Power BI service. Nowadays, this feature is
renamed to Get Data from Power BI dataset;
A shared dataset is a dataset that is shared between multiple reports; multiple reports
connect to one shared dataset. When that dataset gets refreshed, all of those
reports will have the new data. A shared dataset is one step closer to a multi-
developer tenant in the Power BI [https://radacad.com/power-bi-architecture-for-multi-
developer-tenant-using-dataflows-and-shared-datasets] environment.
Sharing Datasets Across Multiple Workspaces
For a long time, sharing datasets was only possible inside a workspace. You could not
use a dataset from workspace one as the source for a report in workspace two.
However, recently, this feature became available, and you can now share a dataset even
across multiple workspaces. This is an amazing update and changes the way Power
BI development will work in the future.
When you get data from a Power BI dataset through the Power BI Desktop, you have the
option to select which dataset you want to get data from;
How does a Shared Dataset work behind the scenes?
When you share a dataset in the same workspace, everything is clear. You have one
dataset to schedule the refresh of, and multiple reports connected to it. However, when
you use a dataset shared from another workspace, you get something that might look a
bit different.
Linked Dataset
When you get data from a Power BI dataset that is in workspace 1, and then save your
report in workspace 2, you get something that looks like a copy of your dataset in
workspace 2. This might look strange! You might say this is not a shared dataset; it is a
copied dataset. The fact is that what you see is just a link. Power BI brings a link to that
dataset into the new workspace; this link helps you see when the dataset was last
refreshed.
Here is what a linked dataset looks like; you can see the difference between it and
normal datasets.
You cannot manually refresh, or refresh on a schedule, a linked dataset. The
refresh action can only be configured on the main dataset. The linked dataset is just a
link; it shows you the date and time of the last refresh and gives you an easier way to
generate more reports from that dataset.
The labeling system helps Power BI developers see the level of certification a dataset
has for use as a shared dataset, and then select one accordingly;
Shared Dataset in the Power BI Architecture
I have another book about the architecture of Power BI. In one chapter specifically, I
explained how dataflows and shared datasets can play an important role in a multi-
developer tenant Power BI implementation [https://radacad.com/power-bi-
architecture-for-multi-developer-tenant-using-dataflows-and-shared-datasets].
In a nutshell, using the dataflow makes sure that you bring the data, well prepared, into
a central area, which you can call a centralized data warehouse, in Azure Data
Lake. And using shared datasets, you can build data marts that can be used by
multiple reports. Here is how the architecture works in the diagram view;
Instead of having silos of Power BI reports and files everywhere, you can build an
architecture that works well with multiple developers, has less redundancy in the data,
the code, and the logic, and is easier to maintain. I highly recommend reading the
article mentioned above to learn more about this architecture and how the shared
dataset sits in it as a key element.
Summary
Using a shared dataset, you can have centralized data models (data marts) that serve
multiple reports. You can reduce the maintenance time, the redundancy of the code, and
the redundancy of the data through this approach. The labeling system of certified or
promoted datasets is also a great way of putting some process and governance in place
to make sure the shared datasets have been through some process of testing and
reconciliation.
Chapter 27: Reduce the size of the Power BI file
The size of a Power BI file can be an important factor in the performance of the model.
In this chapter, you will learn a few ways to check what tables and columns are making
your Power BI file big and what to do about them.
Power BI Helper to determine the big columns
Doing proper modeling is always a challenge in Power BI, especially when you have been
working on a model for a while. One of the aspects of modeling is to make sure you do
not bring in very heavy, memory-intensive columns. Under the hood, DMVs are used to
find out the amount of memory that each column occupies in the model.
Here is a Power BI model which has the size of …
Wow! From 107 MB down to 11 MB, a saving of about 90% of the size just by removing
heavy but unnecessary columns.
This very simple feature can give you an idea of which columns are expensive for the
compression engine of Power BI, and then you can adjust the model based on that.
Chapter 28: Design Tip for Power BI: Be Careful
of Dates
I have written about the date dimension in previous chapters, and it makes sense now to
explain one of the most common performance challenges that I see in some models.
Sometimes your model size grows significantly, and you have no idea why! In this
chapter, I will show you one of the performance issues that can be caused by specific
date values, and how to fix it.
Sample Dataset
I have a sample dataset with one table and three rows, as below;
This is not a big table at all, as you can see. However, when I save the *.pbix file, the
size of the file seems a bit crazy!
Yes, you saw it right. You can try it yourself, and you would probably get the same
experience. Now let’s see why this happens and how to fix it.
Why? Investigating Using Power BI Helper
If you have a large *.pbix file, you can investigate which columns and tables cause the
highest storage consumption using Power BI Helper. You can download
Power BI Helper for free from here [https://radacad.com/power-bi-helper]. I opened
the file above in Power BI Helper, and in the Modeling Advice tab, this is what I see:
As you can see in the above output, the Date field in the Date table is the biggest column
in this dataset, taking 150 MB of runtime memory! And this is with only three distinct
values in the column! That seems a bit strange, doesn’t it? Let’s dig deeper into the
reason.
Why? Default Date Dimension
Power BI creates a default date dimension for every single date field in your dataset.
Values in that date table go from the minimum date value in that field (column) to the
maximum date value, with one row for each day. There is an option to disable the
default date dimension if you want, and create your own date dimension instead. However,
having the default date dimension makes many date calculations simpler and
easier, and that is why many people use it that way.
There are a few parts of the previous paragraph that you need to read more carefully:
This means that the Date field in our sample table is not just a field; behind the scenes,
it is also a table, which we call the default date dimension table.
Values in that date table would be from the minimum date value
in that field (column) to the maximum date value, with one row
for each day.
The minimum value in that column is the 1st of Jan 1990, and the maximum value is the
31st of Dec 9999! This means that in the default date dimension, we will have all days
(one row per day) from the 1st of Jan 1990 to the 31st of Dec 9999. That is about 8,009
years, which makes almost 3 million rows in the default date table. The date column of
that table will have about 3 million distinct values, and that is why this column
consumes so much memory;
Now that you know why the file size is so big, let’s see how we can fix it with a simple
solution.
Solution: Not Applicable Date Value
The 31st of Dec 9999 makes no sense when it comes to date calculations. This value
probably comes from an operational system that uses a far-future date as a default
(when a date is not entered, or to represent an end date that should never be reached),
thousands of years in the future.
If you are using the default date dimension, I recommend not keeping values like that
and replacing them with null instead. You can do that in Power Query with a simple
Replace Values step. Here is how to do it:
Open Edit Queries to go to the Power Query Editor window;
Then right-click on the column and choose Replace Values;
Replace the 31st of Dec 9999 with null (a minimal M sketch of this step is shown below);
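In M, the generated Replace Values step is a single Table.ReplaceValue call. Here is a minimal runnable sketch; the inline table is only a stand-in for your own query, and I’m assuming the column is called Date and is of type date (use #datetime(9999, 12, 31, 0, 0, 0) instead if it is a datetime):
let
    // Stand-in rows; in your file, Source would be the result of your earlier steps
    Source = #table(
        type table [ID = number, Date = date],
        {{1, #date(1990, 1, 1)}, {2, #date(9999, 12, 31)}}
    ),
    // Replace the "not applicable" far-future date with null
    Replaced = Table.ReplaceValue(Source, #date(9999, 12, 31), null, Replacer.ReplaceValue, {"Date"})
in
    Replaced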
Then click on Close and Apply in the Power Query Editor window. Our data table now
looks like the below (null values are shown as blank in Power BI):
And now, if we save the *.pbix file again, this is the new size:
The file size shrank from 26 MB to 66 KB! That is almost 400 times smaller, all with
such a simple change. And now all columns have a small in-memory storage
consumption;
Summary
Thanks to Rui for giving me the idea to write about this topic. As you have seen, the
default date table doesn’t know that the far-future date value you have in 9999 is just a
“not applicable” value. The default date table considers it as your biggest date value
and builds a whole table based on it, which may cause performance issues like what
you have seen in this example. Change your “not applicable” date values to null to fix
it very simply.
Book wrap up
Congratulations on finishing the book. I would like to thank you for taking this step
towards having a better data model in Power BI. Although this book is finished, the
journey towards a better data model continues.
To leverage the learnings from this book, I encourage you to start applying the practices
you learned right away in your Power BI implementations. If you are concerned or have
a question about a particular scenario, feel free to reach out to me directly through the
RADACAD [https://radacad.com] website; I’d be more than happy to look into your
question.
Wishing you the best
Reza Rad
September 2020
Other books from Reza Rad
Power BI from Rookie to Rock Star
This is a series of four books, over 1,200 pages, available to download for free from
here [https://radacad.com/online-book-power-bi-from-rookie-to-rockstar].