SQL, MDX, DAX – the languages of data

Ramblings of a retired data architect

Let me start by saying that I have been working with data for over thirty years. I think that just means I am old. Anyway, I have written blog posts, delivered presentations, and authored books on these languages through the years. Understanding and using these languages have grown and shaped my career through the years. I thought it would be fun to discuss my thoughts on each one. This is my take and some of my thoughts are definitely “tongue in cheek.” So, enjoy the ride and feel free to share your take in the comments.

  • SQL – ubiquitous and relational
  • MDX – complex and dimensional
  • DAX – formulaic and columnar
  • Thoughts and musings

SQL, structured query language

SQL is the oldest of the languages. It was designed to support relational databases (RDBMS). It is built on math principles to improve performance and optimize storage. Normalization rules were established to guide developers on the preferred approaches to building databases.

Why is SQL ubiquitous?

SQL is everywhere. SQL is the query language of choice for enterprise data platforms such as Microsoft SQL Server and Oracle. Open-source data platforms like MySQL and PostgreSQL are also built to use SQL.

How is this possible? SQL is an ANSI standard. This means that the core of the language is managed by a governing body. If you learn SQL, you should be able to write queries in all these databases, right? Sort of. You should be able to write a query like this in all the databases: SELECT field1, field2 FROM table WHERE field3 = 50.

However, vendors often implement their own variations of SQL to meet needs in their platform design or to provide nonstandard functionality to their users. (Engineering outpaces standards development.) For example, Microsoft created TSQL and Oracle created PL/SQL. One of my first experiences with this was returning a single row in query. I used TOP 1 in SQL Server, but there was no TOP keyword in Oracle.

Code examples

SQL Server

SELECT TOP 1 column1, column2 
FROM table_name
WHERE condition;

Oracle

SELECT column1, column2
FROM table_name
WHERE condition
AND ROWNUM = 1;

PostgreSQL / MySQL

SELECT column1, column2
FROM table_name
WHERE condition
LIMIT 1;

What does it all mean?

You can learn SQL and be efficient querying data from multiple data platforms. Whether you are a data engineer or a data analyst, you must know SQL if you are to be taken seriously as a data professional.

MDX, multidimensional expressions

I was introduced to MDX by SQL Server Analysis Services (SSAS). For me, it just clicked. More about that in a bit. MDX, like SQL, is heavily based on math. Whereas SQL is two dimensional (column and rows), MDX can theoretically use an unlimited number of dimensions. The number is limited in practice by the capability of the data platform. MDX was primarily used by two vendors, Microsoft and Hyperion.

One key difference between the two platforms is their purpose. Relational databases are optimized for transactions and small result sets. Multidimensional databases are built for analysis across huge datasets.

Why is MDX considered complex?

The toughest part for most data professionals is visualizing multidimensional datasets in their minds. Relational data is easy to visualize. It looks like a spreadsheet. Multidimensional data is not that simple. We call it a cube, but that is a simplistic representation with only three dimensions. It is a cool name though.

Earlier in my career I coached data consultants on their transition to BI consultants. As I helped a consultant with MDX, I told him at some point he would “get it.” I told him to call me when he did. Six months later he called me, told me that he got it, and hung up on me. Many consultants didn’t get it and either just forced their way through it or went back to relational.

MDX was designed to traverse dimensions, build sets, and aggregate values across those sets. I mentioned earlier that MDX made sense to me right away. The first time I was exposed to MDX, I learned about the various functions and methods to work with dimensions including child, parent, descendants, and ancestors. You could think of dimensions like family trees. I took a class in college about familial relationships which used similar concepts. My degree is in cultural anthropology.

  • Another difficult concept to master is context. You must understand the set or slice of data you are working with in a query.
  • Once you understand context, you need to realize that every measure is an aggregate of every dimension whether a part of the query or not.
  • Results can be shaped in multiple dimensions. Three or more dimensions cannot be visualized. If you want to visualize the data in a report, it needs to be formed into columns and rows.

Multidimensional databases and MDX are extremely powerful but complex. I enjoyed working with them and became one of the few experts in the technology. However, multidimensional databases and MDX are rarely used today. Microsoft is not advancing the technology, instead promoting columnar data structures.

DAX, data analysis expressions

My first experience with DAX was when PowerPivot was released with Excel. It was then that I saw the writing on the wall for MDX. DAX is simpler and more approachable than MDX. Microsoft then added tabular models built on the same data engine, Vertipaq. Eventually culminating in the Power BI model. The underlying data engine is a highly optimized columnar data structure.

Admittedly, I have the least amount of hands-on experience with DAX. However, I disliked it early on. Unlike SQL and MDX, DAX is not built around math principles and is not a query language. It is built with expressions. Instead of SELECT, it starts with an equal sign (=). This is more intuitive for Excel users. Early on, it was very frustrating for me.

DAX is continually being improved. Microsoft is also continuing to improve the underlying data engine and storage subsystem. Power BI models are one of the foundational building blocks of Microsoft Fabric.

Should you learn DAX?

If your business uses Power BI, then yes. DAX is used to aggregate, shape, and format data for usage by end users. It is not necessary for data engineers who don’t present data to end users.

Thoughts and musings

My first recommendation is to learn SQL if you want to be taken seriously as a data professional. It has been around since the beginning and will be around for a while to come. Relational data platforms are integrating columnar data storage technology which gives SQL users access to the performance available in Power BI models.

While I contend that MDX is more powerful, I concur that DAX is more approachable. As MDX goes the way of COBOL, SQL remains the powerhouse. Learn DAX if you intend to use Power BI, otherwise don’t bother.

That is my 2 cents. Have a different opinion? Sound off in the comments below.

SQL Saturday Atlanta BI Recap

https://sqlsaturday.com/2025-03-08-sqlsaturday1102/#schedule

SQL Saturday Atlanta BI is one of my favorite SQL Saturdays of the year. This year was especially sweet to see a lot of the SML (Saturday Morning Learning) crew plus even got to see a couple of first-time presenters! Pretty amazing to see this community continue to grow and shape more careers.
Thank you to everyone who attended my session 🙂 It was super fun to present to a post-lunch full room! If you weren’t able to attend (there were a ton of amazing sessions at the same time!), or want to refer back to the code we covered, please refer to this github folder for all the resources used during the demos. Thanks again to everyone who made this conference possible!
https://github.com/Anytsirk12/DataOnWheels/tree/main/2025%20SQL%20Saturday%20ATL%20BI

Power Hour- Introduction to Power BI by Christopher Small

https://www.linkedin.com/in/christopher-small-training/
https://github.com/Chris-JN/Power-Hour

Chris is a natural speaker, and it was great to see so many people eager to learn how to use Power BI for the first time! Check out his github link above for all the files used in the session and rebuild his demos!

Automated Testing with DAX Query View and Git Integration by John Kerski and Roshini Damodaran

https://www.linkedin.com/in/john-kerski/
https://www.linkedin.com/in/roshini-damodaran/

Why test?
– Find errors before customers do
– Never see the same error
– Tests are your safety net
DataOps = https://www.kerski.tech/

https://github.com/kerski/fabric-dataops-patterns/blob/main/DAX%20Query%20View%20Testing%20Pattern/dax-query-view-testing-pattern.md

DAX query view – don’t be afraid to use quick queries to easily generate DAX code for things like column stats. Column stats will give you DAX queries for a lot of information on columns/measures, including NULL counts, data types, etc.

Use performance analyzer to grab DAX queries for visuals to test.

PBIX Pattern:

  • Standardize schema and naming conventions (keep it simple)
  • Build Tests
    • Test content
    • Test calcs
    • Test schema and data types

One interesting thing to remember – in the online PBI/Fabric service, the DAX queries don’t stay with the model. That means you will need to store tests in a git repo if you want to reference them in notebooks and/or deployment pipelines.

Pretty mind blowing! Check out the git repo for some incredible code snippets to use it!

DAX 201: CALCULATE, Unfolding the Swiss Army Knife of DAX Functions by Mark Walter & Michael Hewitt

https://storybi.com/2020/10/01/learning/
https://www.linkedin.com/in/markwaltercpa/
https://www.linkedin.com/in/mdhewitt83/

Objectives:

  • Unfold CALCULATE
  • Understand where to use CALCULATE
  • What is CALCULATE doing behind the scenes

Order of evaluation is below. Each subsequent step can overwrite the step before it.
Steps 1 and 2 happen automatically. 3 – 5 are built by DAX devs within CALCULATE itself.

  1. Original report filters
  2. Table rows
    • Context transition
  3. CALCULATE modifiers
    • REMOVEFILTERS
    • USERELATIONSHIP
    • CROSSFILTER
  4. Add explicit filters
    • Table functions like ALL(), VALUES(), DISTINCT, color = Red, etc.
  5. KEEPFILTERS
    • Standalone protection. This can turn off the overwrite functionality that happens in steps 2-4 and makes 1 the priority.

Challenge #1 – USERELATIONSHIP vs TREATAS()

  • USERELATIONSHIP
    • You cannot use this on a table that leverages RLS because that only works on a single active relationship
    • Leverages inactive relationships
  • TREATAS()
    • This creates a virtual relationship and works similar to USERELATIONSHIP, but not the same as USERELATIONSHIP
    • The TREATAS() method will combine the active relationship with this virtual relationship so it will yield unexpected results at times because it effectively creates an AND filter.
    • You can use TREATAS() with REMOVEFILTERS() to allow the virtual relationship to be the only relationship that’s taken into consideration. The combination of TREATAS() and REMOVEFILTERS() will return results like USERELATIONSHIP

Challenge #2 – Slicer not filtering large orders

  • When users filter on a field that’s included in our CALCULATE statement, our filter overwrites so they don’t end up really filtering.
  • To solve, use a KEEPFILTERS() so that it’s more of an AND versus only taking the filter context from CALCULATE

CALCULATE in slow motion

Original measure that you make and pull into a visual: Sales = SUM(SalesFact[Sales]

  1. Under the hood iterator: SUMX(SalesFact,SalesFact[Sales])
    • Row context
  2. Measure sent to total sales: CALCULATE(SUMX(SalesFact,[Sales]),<Filters>)
    • Filter context
  3. CALCULATE filters the dim tables: CALCULATE([Total Sales, Customer Name = “Chris”, Vendor = “KARPARTS”, Cal Year = 2024)
    • this comes from the visual and the page filters
  4. Now our fact table will only contain values that match the filters that have been sent and will return the correct metric

CALCULATE = filter context.

Most important concept in DAX – Row Context and Filter Context.

  • Row Context: evaluates row by row, does not propagate over filters. Horizontal.
  • Filter Context: vertical. Filters propagate through relationships. Aggregating columns.
  • These work together.

Look at the slide deck, some killer examples in there about different options for syntax sugar vs different functionality. For example, CALCULATE([Sales], color = “Red”) is the same as CALCULATE([Sales], FILTER(ALL(Colors),color = “Red”) but CALCULATE([Sales], FILTER(VALUES(Color), color = “Red”) is different because the value will not display whenever red is not included in the filter/row context. Super interesting.

TREATAS() is pretty amazing. Look up more info here: https://dax.guide/treatas/

KEEPFILTERS builds a barrier around levels 2-4, preventing competing filters. You can also use KEEPFILTERS with a multiple nested calculates to respect the outer calculate filter context.

Microsoft Fabric Pipelines – Metadata-driven ETL patterns by Mike Diehl

Code and slides: https://github.com/xhead/SqlSatATL-2025
t-sql.dk/2025/02/etl-orchestration-air-traffic-control-for-data/

Why use metadata for Fabric Piplines?

  • Reduces ETL development, increased velocity compared to tools like SSIS
  • Lakehouse tables – schema evolution (vs SQL databases)

Scenario: metadata-driven approach Target: Lakehouse in OneLake

Set of tasks:

  • Get data from source
    • load all data, overwrite all data
    • load some data in staging then incrementally merge into target
  • Dependencies
    • Load dims first then facts that depend on dims
  • Develop Pipelines at the data source type level
    • SQL server, oracle, plus auth type
    • File Source (file system, sharepoint, azure) and type (XML, JSON, CSV)

Pretty incredible process that Mike has put together, check out his slides for all the awesome illustrations of the architecture and data processing flows (https://github.com/xhead/SqlSatATL-2025).

Dynamically Unpivot columns in SQL

Picture this, your data ingestion team has created a table that has the sales for each month year split into different columns. At first glance, you may think “what’s the big deal? Should be pretty easy, right? All I need to do is unpivot these columns in Power BI and I’m good to go.” So you go that route, and the report works for one month. Next month, you get an urgent email from your stakeholders saying they can’t see this month’s numbers. That’s when you realize that this table will grow with new columns every month. That means that any report you make needs a schema refresh every single month. Unfortunately, Power BI will not grab new columns from a table once it’s published into the online service. The only way for the Power Query to pivot the new columns is for you to open the report in your desktop, go to Power Query, and refresh the preview to get all the columns in that table.

If you’re like me, you may not have time (or the will) to manually download, fix, and reupload this report each month. But have no fear! SQL is here! The unpivot function in SQL can feel like a black box, designed for only black belts in SQL to use, but we will break it down together and create a dynamic unpivot script.

What does unpivot in SQL accomplish? Unpivot takes multiple rows and converts them to columns. The general syntax is as follows:

SELECT 
originalColumn1, originalColumnHeaders, originalValues
FROM 
(SELECT originalColumn1, originalColumn2, originialColumn3, originalColumn4 FROM dbo.Table) T 
UNPIVOT (originalColumnHeaders FOR originalValues in (originalColumn2, originialColumn3, originalColumn4)) as UP

Let’s break this down.

The first SELECT statement contains the column we don’t want to unpivot plus the two names of what we want our final columns to be. For example, using our table below, we want to keep the country column as is, but we want to unpivot those sales months and the sales values. So, we may name the originalColumnHeaders as SalesMonth and originalValues as Sales. You can use whatever you’d like, but it has to match what’s in the UNPIVOT statement.

The SELECT after FROM tells the query where to get data from. By default, SQL requires this to be hard-coded, but don’t worry, we have a way for that to be dynamic.

The UNPIVOT statement requires us to provide two new column names, the first will be what you want the column full of the old column headers to be called. In our case, this will be SalesMonth. The second will be the column full of the old values within those columns (aka Sales).

The hard-coded version of our example would look like this:

SELECT 
country, SalesMonth, Sales
FROM 
(SELECT country, sales_jan_2024, sales_feb_2024, sales_mar_2024, sales_apr_2024, sales_may_2024 FROM dbo.Table) T 
UNPIVOT (SalesMonth FOR Sales in (sales_jan_2024, sales_feb_2024, sales_mar_2024, sales_apr_2024, sales_may_2024)) as UP

The result of this code looks pretty good! Now we can easily pull the month and year out from SalesMonth and have some data we can trend. But we still have that pesky hard coded problem.

To get around the hard coding, we have to do some SQL jujitsu. We will use a couple of variables that can be dynamically populated then execute the SQL as a variable using sp_ExecuteQueries. This is a bit easier to explain with in-code notes, so feel free to look through this query and read the notes that go along with it for details on how it works.


--Use this script to create the table for the demo
/* --Comment this line out to run the create table code
DROP TABLE IF EXISTS dbo.PivotedSales
CREATE TABLE dbo.PivotedSales (
	country VARCHAR(250),
	sales_jan_2024 DECIMAL(17,3),
	sales_feb_2024 DECIMAL(17,3),
	sales_mar_2024 DECIMAL(17,3),
	sales_apr_2024 DECIMAL(17,3),
	sales_may_2024 DECIMAL(17,3)
);
INSERT INTO dbo.PivotedSales (country, sales_jan_2024, sales_feb_2024, sales_mar_2024, sales_apr_2024, sales_may_2024)
VALUES 
	('South Africa', 111.22, 222.33, 333.44, 444.55, 555.66),
	('Canada', 112.22, 227.33, 332.44, 400.55, 500.66),
	('United States', 113.22, 228.33, 330.44, 401.55, 501.66),
	('Mexico', 114.22, 229.33, 334.44, 404.55, 504.66),
	('Ireland', 115.22, 230.33, 335.44, 409.55, 509.66),
	('Germany', 116.22, 231.33, 336.44, 499.55, 599.66),
	('South Africa', 1011.22, 2022.33, 3303.44, 4044.55, 5505.66),
	('Canada', 1102.22, 2027.33, 3302.44, 4000.55, 5000.66),
	('United States', 1103.22, 2280.33, 3030.44, 4001.55, 5001.66),
	('Mexico', 1104.22, 2209.33, 3034.44, 4004.55, 5004.66),
	('Ireland', 1105.22, 2300.33, 3305.44, 4009.55, 5009.66),
	('Germany', 1106.22, 2310.33, 3036.44, 4909.55, 5909.66);

--*/

--Original table
SELECT * FROM PivotedSales


--Set up parameters
;DECLARE @UnpivotColumns NVARCHAR(MAX), @FilterPefix NVARCHAR(MAX) 
-- FilterPrefix is the prefix that all the columns we want to pivot have. This will allow us to dynamically grab any new columns as they get created
SET @FilterPefix = 'sales%' --Note, you can adjust this to be a suffix or a middle string if needed by moving the % (wildcard)


--This section sets our @Unpivot column variable to be a comma separated list of the columns in our table with the FilterPrefix
SELECT @UnpivotColumns = STRING_AGG(CONVERT(NVARCHAR(MAX), C.name),',')
FROM sys.columns C 
INNER JOIN sys.types T ON T.user_type_id = C.user_type_id
WHERE C.object_id = object_id('dbo.PivotedSales') --this ensures we only get columns from our table
	AND C.name LIKE ''+@FilterPefix+'' --this makes sure only columns with the filter prefix are returned
	AND (T.name = 'decimal' OR T.name LIKE '%int%') --this ensures we only grab columns with a decimal or int type. You can adjust this if needed

SELECT @UnpivotColumns AS 'Unpivot Columns'


--This section creates a dynamic SQL statement using the comma separated list we just generated
DECLARE @ExecuteSQL NVARCHAR(MAX)
SET @ExecuteSQL = '
SELECT Country, SalesMonth, Sales
FROM 
	(SELECT Country, ' + @UnpivotColumns + ' FROM dbo.PivotedSales) P
	UNPIVOT
	(Sales FOR SalesMonth IN (' + @UnpivotColumns + ')) as S'
SELECT @ExecuteSQL AS 'Dynamic SQL Script' --this will show you the SQL statement we've generated


--Finally, we will use the system stored proc sp_executesql to execute our dynamic sql script
EXECUTE sp_executesql @ExecuteSQL

Now that is a lot of code, so be sure to look at each chunk and make sure you adjust it for your use case. Happy coding folks!

Huge credit for this solution goes Colin Fitzgerald and Miranda Lochner! Thank you both for sharing this with me!

SQL Saturday Atlanta 2024

Hey there happy coders! Last weekend I had the pleasure of speaking at the SQL Saturday Atlanta event in Georgia! It was an awesome time of seeing data friends and getting to make some new friends. If you live near a SQL Saturday event and are looking for a great way to learn new things, I can’t recommend SQL Saturday’s enough. They are free and an excellent way to meet people who can help you face challenges in a new way. Below are my notes from various sessions attended as well as the materials from my own session. Enjoy!

Link to the event – https://sqlsaturday.com/2024-04-20-sqlsaturday1072/#schedule

My session – Real-Time Analytics in Fabric

Thank you so much to everyone who came out to see my first ever session on Real-Time Analytics in Fabric! We had a couple of glitches in the Logic App creation, but had an excellent time troubleshooting together as a group. Please check out my GitHub for the slide deck as well as all the code used in the demonstration. Happy coding!

Link to my GitHub – https://github.com/Anytsirk12/DataOnWheels/tree/main/Real-Time%20Analytics%20in%20Fabric

Introduction to SQL Server Essential Concepts by Bradley Ball

Goal is to understand the core things about databases to enable deeper understanding.

ACID = atomicity, consistency, isolation, and durability.
Atomicity = either all of it commits or none of it commits. Consistency = my data must be in a consistent state before a transaction completes. Isolation = transaction must operate independently from other transactions. Durability = complex logging that is the transaction log. The log has all the commit history and ensures accurate data.

Transaction isolation levels – serializable, read committed (SQL server default), read uncommitted, repeatable read, snapshot isolation. You can set some of these at the db level. Serializable = blocks anything trying to get data on the same page, you can set this at the transaction level. Read committed = I can’t read data if a transaction is currently occurring. Read uncommitted = a dirty read, grabs data that isn’t committed. Repeatable read = nobody uses this lol. It’s a shared lock that holds for a longer period of time than the typical micro second. Shared lock means everyone can read the data. Snapshot isolation = Oracle and postgres use this. Every thing is an append and update. If I have 4 transactions, 1 update and 3 read, usually update would block reads, but this would redirect readers to a copy of the data in the TempDB at the point in time they requested to read it (aka before the update commits).

DMV = dynamic management view. Bradley used one that allows you to see active sessions in your database. We can see that a read query is blocked by an uncommitted transaction. We can see the wait_type = LCK_M_S and the blocking_session_id which is our uncommitted transaction. To get around this, he can run the read script and get a dirty read by setting the isolation level to read uncommitted. To unblock the original request, he can use ROLLBACK TRANSACTION to allow it to unlock that page of data.

How does SQL Server work on the inside? We have a relational engine and a storage engine. User interacts with a SNI which translates the request to the relational engine. User > SNI > Relational Engine [command parser > optimizer (if not in planned cache otherwise goes straight to storage engine) > query executer] > Storage Engine [access methods (knows where all data is) > buffer manager (checks the data cache but if not found then goes to the disk and pulls that into the buffer pool data cache). This gets extremely complicated for other processes like in-memory OLTP. The SQL OS is what orchestrates all these items.

SQL OS – pre-emptive scheduling (operating system) & cooperative pre-emptive scheduling (accumulates wait stats to identify why something is running slower).

Locks, latches, waits. Locks are like a stop light (row, page, and table escalation). If you lock a row, it will lock a page. Latches are who watches the locks/watchmen. It’s a lock for locks. Waits are cooperative scheduling. If a query takes too long, it will give up it’s place in line willingly. That creates a signal wait which signals there’s too much lined up.

SQL data hierarchy. Records are a row. Records are on a data page (8 k). Extents are 8 pages (64 k). It’s faster to read extents than pages. Allocation bit maps are 1s and 0s that signify data on a data page that enables even faster data reads – allows governing on 400 GB of data on 1 8KB page. IAM chains and allocation units allows quick navigation of pages. Every table is divided into in row data, row overflow data (larger than 8064 k), and lob data (large object like VARCHAR max and images).

Allocation units are made of 3 types:
1. IN_ROW_DATA (also known as HoBTs or Heap or B-Trees)
2. LOB_DATA (also known as LOBs or large object data)
3. ROW_OVERFLOW_DATA (also known as SLOBs, small large object data)

Heaps vs Tables. Oracle stores data as a heap which is super fast to insert. In SQL, these have bad performance due to clustered indexes and inserting new data. This is very situational. A table is either heap or clustered index, cannot be both. But heaps can have non-clustered indexes.

B-Tree allows you to get to the record with less reads by following a logic tree (think h is before j so we don’t need to read records after j). Heaps create a 100% table scan without a clustered index. Adding the clustered index dropped that significantly to only 1 read instead of the 8000 reads.

Recovery models – full, bulk logged, simple (on-prem). In the cloud everything is full by default. Full means everything is backed up. Bulk means you can’t recover the data but you can rerun the input process. Simple means you can get a snapshot but you can’t do any point in time restore. This will largely be determined by any SLAs you have.

Transaction log. This will constantly be overwritten. Your log should be at least 2.5 as large as your largest cluster. DBCC SQLPERF(logspace) will get you all the space available for logs in the various dbs. Selecting from the log is always not recommended since it creates a lock and logs are always running, so don’t do this in prod lol. Rebuilding indexes will grow your transaction log massively. To free up space in the transaction log, you have to a backup log operation which is why those are super important.

Fun tip, when creating a table you can put DEFAULT ‘some value’ at the end of a column name to provide it a default value if one is not provided. Pretty cool.

You can use file group or piecemeal restores to restore hot data much faster then go back and restore older, cold data afterward. To restore, you must have zero locks on the db. While restoring, the database is not online. Note, if you do a file group restore, you cannot query data that is in a unrestored file group so queries like SELECT * will not work.

Tales from the field has a ton of YouTube videos on these subjects as well.

Lessons Learned using Fabric Data Factory dataflow by Belinda Allen

What are dataflows? Dataflows are a low-code interface tool for ingesting data from hundreds of data sources, transforming your data using 300+ data transformations. The goal is to allow for more people to manipulate and use data within your organization. At the heart, it’s Power Query.

Why start with dataflows as a citizen developer? It’s power query and you know that. It’s low-code data transformation. Excellent for migrating Power BI reports to Fabric.

Lots of great discussion about when it makes sense to use a dataflow gen2.

You can copy and paste power query from Power BI by going into the advanced editor OR you can hold shift and select all the queries you want then ctrl c then go to power query for a dataflow gen2 in the online service and hit ctrl v and it will populate with all your tables! Pretty neat. You can also make your relationships within the online portal.

DBA Back to Basics: Getting Started with Performance Tuning by John Sterrett

For the code visit: https://johnsterrett.com/blog-series/sql-server-performance-root-cause-analysis/

Goal of today’s session – arm anyone who is new to performance tuning with processes and sills to solve common problems.

Basic query runtime. SQL has something called wait stats that tells you what caused the query to be slow. When you run a massive query, it will go into a suspended state which will require reading from disc instead of from memory cache (T1). After that, you’re in a runable state (T2). Finally, you get to run it (T3).

Basic bottlenecks = memory, disk, CPU, network, locking blocking & deadlocks. Adding memory is typically the fastest way to improve performance.

Identify performance problems happening right now:

EXEC sp_whoisactive. This is an open source script that gives you insight into who’s running what right now. You can get this from https://whoisactive.com. The cool thing about this is there are more ways to run it than just EXEC sp_whoisactive. Identify what’s consuming the most CPU from the column. There’s also some parameters you can use like @sort_order. EXEC sp_whoIsActive @sort_order = ‘[CPU] DESC’, @get_task_info = 2. The task info parameter will give more information in a wait_info column. The best command is exec sp_whoIsActive @help = 1. This provides ALL the documentation on what it does. Adam (the creator) also has a 30 day blog series on everything it can do for you! One option to make things run faster is to kill the process causing the issue lol.

How to handle blocking. You can do explicit transactions with BEGIN TRANSACTION which will lock the table. At the end, you need to either COMMIT or ROLLBACK or else that lock holds. SQL uses pessimistic locking as default so it won’t let you read data that’s locked – it will simply wait and spin until that lock is removed. You can use exec sp_whoisactive @get_plans = 1 to get the execution plan. Be careful, the wait_info can be deceptive since the thing that takes the most time may not be the problem. It may be blocked by something else, check the blocking_session_id to ve sure. Also check the status and open_tran_count to see if something is sleeping and not committed. Keep in mind that the sql_text will only show you the last thing that ran in that session. SO if you run a select in the same session (query window) as the original update script, it won’t be blocked and can run and THAT query will show up in the who is active which can be super confusing. To resolve this issue, you can use ROLLBACK in that session to drop that UPDATE statement.

To find blocking queries use EXEC sp_whoIsActive @find_block_leaders = 1, @sort_order = ‘[blocked_session_count] DESC’.

Identifying top offenders over the long term:

There’s a feature in SQL 2016 forward called Query Store which persists data for you even after you restart data. It’s essentially a black box for SQL. Query Store is on by default in SQL 2022 and online servers. It’s available for express edition as well. Be sure to triple check this is on, because if you migrated servers it will keep the original settings from the old server. If you right click on the DB, you can navigate to query store and turn it on via Operation Mode (Requested) to Read write. Out of the box is pretty good, but you can adjust how often it refreshes and for how much history. To see if it’s enabled, you should see Query Store as a folder under the db in SSMS.

Under query store, you can select Top Resource Consuming queries. There’s lots of configuration options including time interval and what metric. SQL Server 2017 and newer have a Query Wait Statistics report as well to see what was causing pain. It’ll show you what queries were running in the blocking session. You won’t get who ran the query from query store, but you can write sp_whoisactive to a table that automatically loops (very cool). This will have overhead on top of your db, so be mindful of that.

Intro to execution plans:

Keep in mind, SQL’s goal is to get you a “good enough” plan, not necessarily the best plan. Follow the thick lines. That’s where things are happening. Cost will tell you the percentage of the total time taken.

Key lookups. It’s a fancy way to say you have an index, so we can skip the table and go straight to the data you have indexed. BUT if there’s a nest loop, then there’s an additional columns in the select statement so it’s doing that key lookup for every value. More indexes can make your select statements worse if it’s using the wrong index that isn’t best for your query.

Index tuning process.
1. Identify tables in query
2. Identify columns being selected
3. Identify filters (JOIN and WHERE)
4. Find total rows for each table in the query
5. Find selectivity (rows with filter/table rows)
6. Enable statistics io, time, and the actual execution plan
7. Run the query and document your findings
8. Review existed indexes for filters and columns selected
9. Add index for lowest selectivity adding the selected columns as included columns
10. Run the query again and document findings
11. Compare findings with baseline (step 7)
12. Repeat last 5 steps as needed

To see existing indexes, you can run sp_help ‘tableName’. In the example, there’s an index key on OnlineSalesKey but that field is not used in our filter context (joins and where statements) in the query. Order of fields in indexes do matter because it looks in that order.

Brent Ozar made a SP you can use called sp_blitzIndex that will give you a ton of info on an index for a table including stats, usage, and compression. It also includes Create TSQL and Drop TSQL for that index to alter the table.

To turn on stats, use SET STATISTICS IO, TIME ON at the beginning of the query. Be sure to also include the actual execution plan (estimated doesn’t always match what actually happened). Now we can benchmark. Use SET STATISTICS IO OFF and SET STATISTICS TIME OFF. Create an non clustered index with our filter context columns.

Derby City Data Days

It was awesome to see the Kentucky data community come out for the first Derby City Data Days in Louisville, KY! Bringing together communities from Ohio, Tennessee, and Kentucky, the Derby City Data Days event was an excellent follow-up to Data Tune in March and deepened relationships and networks made at the Nashville event. In case you missed it, below are my notes from the sessions I attended as well as the resources for my session. Be sure to check out these speakers if you see them speaking at a conference near you!

Building Self-Service Data Models in Power BI by John Ecken

What and why: we need to get out of the way of business insights. If you try to build one size fits all, it fits none. Make sure you keep your data models simple and streamlined.

Security is paramount to a self-service data model. RLS is a great option so folks only see their own data. You can provide access to the underlying data model for read and build which enables them to create their own reports off the data they have access to. If you give your user contributor access, then RLS will go away for that user. Keep in mind, business users need pro license OR need to be in a premium workspace.

One really great option is for people to use the analyze in Excel option to interact with the most popular BI tool – Excel. This allows them to build pivot tables that can refresh whenever needed. You can also directly connect to Power BI datasets from their organization! You can set up the display field option as well to get information about the record you connect to. Pretty slick! With this, security still applies from RLS which is awesome.

Data modeling basics – clean up your model by hiding or removing unnecessary columns (ie sorting columns). Relationships matter. Configure your data types intentionally. Appropriate naming is vital to business user success. Keep in mind where to do your transformations – SQL vs DAX (think Roche’s Maxum). Be sure to default your aggregations logically as well (year shouldn’t be summed up).

Power BI Measures – creations, quick create, measure context, time-based functions. Whenever possible, make explicit measures (using DAX) and hide the column that it was created off of so people utilize the measure you intended. Make sure you add descriptions, synonyms (for Copilot and QA), featured tables, and folders of measures. The functionality of featured tables makes it wise to use folders of measures within your fact tables.

John likes to use LOOKUP to pull dims back into the fact table so he ends up with as few tables as possible. There are drawbacks to this such as slower performance and model bloat, but the goal is for end users who don’t have data modeling experience or understanding. Not sure I agree with this method since it’s not scalable at all and destroys the purpose of a data model. Make sure you hide columns you don’t want end users to interact with.

To turn on feature table, go to the model view then go to Properties pane and toggle that is featured table button. It will require a description, the label that will populate, and the key column (cannot be hidden) that the user will put in excel as a reference for the business user to call records off of.

The PIVOT() TSQL Operations by Jeff Foushee

GitHub: https://github.com/jbfoushee/MyPresentations/tree/main/TSQL_Pivot_Operators

Be sure to look at his GitHub for the awesome source code!

Come to Lousiville on May 9th to see a presentation on JSON and TSQL.

The goal of this is to avoid FULL OUTER JOINs. This is extremely unscalable since maintenance would be terrible. We will avoid this by using pivot. Pivot means less rows, more columns. PIVOT promotes data to column headers.

You get to decide how the tuple that’s created on the pivot is aggregated (count, min, max, sum, avg, etc.). Exactly one aggregate can be applied, one column can be aggregates, and one columns values can be promoted into the column header.

PIVOT ( SUM(Col1) FOR [ID] IN ([ID_value_1], [ID_value_2], etc.)
SUM = the aggregate, ID = the column that will become more columns, the IN values = the column values from ID that will be promoted into the column header.

Time for a 3 column pivot. For this, we are doing a two column pivot and ignoring one of the fields. You can even pivot on computed fields but make sure you include the values in that inclusion clause. Be careful about adding unnecessary data.

How do you manage the VTCs (the column values that end up as column headers)? Option 1 – don’t. Option 2 – explicitly request the ones of interest and provision for future extras. Option 3 – use dynamic SQL! You can use cursor, XML, etc. Check out his ppt deck from github for code samples!

An n-column PIVOT works by essentially creating a 2-column pivot (at the end of the day it’s only two columns that ever get pivoted) and knowing which you want split into new columns.

Ugly side of PIVOT = lookups. The more fields you need to add in from additional tables, the worse performance will be. Your best option there would be to do a group by, the pivot. Another limitation is you can’t use a function in your pivot aggregation (SUM() vs SUM() *10). Get your raw data clean then pivot.

Time for UNPIVOT!

Unpivot = convert horizontal data to vertical. Less columns, more rows. Unpivot demotes column headers back into data.

Be very very careful with your data type your about to create. Remember lowest common denominator, all the values must be able to fit in one common datatype without overflow, truncation, or collation.

UNPIVOT( newColPropertyValue FOR newColPropertyName IN ([originalCol1], [originalCol2],etc.)

You need to make sure all your original columns have the same datatype. NULLs get automatically dropped out. If they are needed, you can convert them using an ISNULL function to a string value or int value depending on your need.

There’s also an option for XML-based unpivot.

Cross Apply = acquires a subset of data for each row found by the outer query. Behaves like an inner join, if no subset is found then the outer row disappears from the result set. Outer Apply is similar but it’s more like a left join. Cross Apply does keep your NULL values. You can also use a STRING_SPLIT with cross apply.

Multi-Unpivot normalizes hard-core denormalized data. You just add more UNPIVOT lines after your initial FROM statement! Make sure you have a WHERE statement to drop any records that don’t align at a column level. Something like WHERE LEFT(element1,8) = LEFT(element2, 8).

My Session – Time for Power BI To Git CI/CD

Thanks to everyone that attended my session! Had some great questions and conversations! Here’s the link to my github with the slide deck from today: https://github.com/Anytsirk12/DataOnWheels/tree/main/Power%20BI%20CICD

Medallion Architecture for Fabric by Steve Hughes

This session was awesome! We were able to watch a series of Fabric 5 minute videos and had an amazing discussion between them about options for building out a Fabric infrastructure using medallion techniques. Check out Steve’s YouTube channel for his Fabric 5 playlist and to learn more about his experience working with ALS.