PASSMN June 2020 – Data Classification with SQL Server and Azure

I presented at the virtual Minnesota SQL Server User Group meeting on June 16, 2020. The topic was data classification with SQL Server 2019 and Azure SQL Database.

Data Classification Basics

Data classification in both SQL Server and Azure allows you to discover and label data based on information type and sensitivity. Information type is a way to describe the content of the data at a high level. This includes types such as Address, Name, Networking, and Credit Card. By tagging your columns with types, you will be able to easily see the types of data stored in your tables. You can also label the sensitivity. This includes labels such as Confidential and Confidential - GDPR.

Using SQL Server 2019 and SSMS 18.4+

For on-premises implementations, you can use SQL Server Management Studio. I would recommend that you use SSMS 18.4 or greater, which has the most capability. SQL Server 2019 includes the sys.sensitivity_classifications system catalog view, so you can query it to see which fields have been labeled.

To get started, open up SSMS. Right-click the database and choose Tasks > Data Discovery and Classification > Classify Data.

Finding the Data Discovery and Classification Options in SSMS

This will allow you to view the Data Classification window in SQL Server. You will get a list of recommendations and the ability to add custom classifications in your SQL Server database.

The Data Classification view in SSMS

Once you have classified some of your data, you are able to view a report that shows the coverage of the classification work you have done.

Data Classification Report in SSMS

Adding Data Classification in Azure SQL Database

Azure SQL Database supports similar functionality for discovering and classifying data. The primary differences are (1) it requires Advanced Data Security which costs $15/month per server and (2) audit logging support is built in.

You can find this in the Azure portal with your SQL Database.

Advanced Data Security in Azure SQL Database

As you can see above, you get a visual here initially. Click the Data Discovery & Classification panel to open a classification window similar to the one we see in SSMS. This will allow you to discover and classify your data.

The key difference is turning on auditing and logging information about people querying the classified data. In the Security section in your SQL Database view in the Azure portal, choose Auditing. You can now add auditing to your server or database. (Click here for information about setting up Auditing.) I chose to use Log Analytics which is in preview. Log Analytics has a dashboard which shows activity in your database with this data.

Log Analytics Dashboard which Shows Access to Sensitive Data

You can click into the dashboard to dig into details. You can also use the Log Analytics query features to build your own queries to further analyze the data. The details contain who accessed the information, their IP address, and what was accessed. You can build more reports from that information to support more sophisticated auditing.

Final Thoughts

I think that there is still work to be done on SQL Server to better support auditing. Azure is ahead of the game in this area. More importantly, Azure logging is a platform level solution. You should be able to integrate your logging from the applications to the database in Azure.

You do have the ability to update the policy in SQL Server with a JSON file. I recommend you export the file and modify it. In Azure, you can update the information policy in the Security Center. Updating this policy allows you to discover data or information that you want to classify based on rules you set up. This should be part of your data governance plan.

One other follow-up from the meeting. The question was raised about Visual Studio support in database projects. The answer is “sort of”. First, you need to make sure your project is targeting SQL Server 2019 or Azure SQL Database. Once that is set, you can use the following code to add the classification manually, or you can apply it to your database and do a Schema Compare to bring it in.

ADD SENSITIVITY CLASSIFICATION TO
    [SalesLT].[Customer].[FirstName]
    WITH (LABEL = 'Confidential - GDPR', LABEL_ID = 'fe62dcde-72c0-475c-b1af-fb8de4c8fc7e', INFORMATION_TYPE = 'Name', INFORMATION_TYPE_ID = '57845286-7598-22f5-9659-15b24aeb125e', RANK = MEDIUM);

You will need to know the GUIDs for the labels and types in your solution to do this manually. However, once this is done, you can see the information in the Properties window for the field as well.

Data Classification Properties in Visual Studio

The key thing to be aware of is that the properties are read only. You have to use the code to change them or do the changes in the database and use Schema Compare to bring them in.
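Because the changes have to go through code, it can help to generate the statements rather than type them by hand. The following is a minimal sketch in Python, not part of the original demo: it only builds the T-SQL text and does not connect to a database. The helper names are hypothetical, the GUIDs are the ones from the statement above, and the LastName column is an assumed second column used purely for illustration.

```python
# Hypothetical helper that builds ADD SENSITIVITY CLASSIFICATION statements.
# It only produces text; review the output before running it in your project.

VALID_RANKS = {"NONE", "LOW", "MEDIUM", "HIGH", "CRITICAL"}


def quote_name(name):
    """Wrap an identifier in brackets, escaping any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def quote_text(value):
    """Wrap a value in single quotes, doubling embedded quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_classification_sql(schema, table, column, label, label_id,
                             info_type, info_type_id, rank="MEDIUM"):
    """Return one ADD SENSITIVITY CLASSIFICATION statement for one column."""
    if rank not in VALID_RANKS:
        raise ValueError(f"unsupported rank: {rank}")
    target = ".".join(quote_name(part) for part in (schema, table, column))
    return (
        f"ADD SENSITIVITY CLASSIFICATION TO {target} "
        f"WITH (LABEL = {quote_text(label)}, LABEL_ID = {quote_text(label_id)}, "
        f"INFORMATION_TYPE = {quote_text(info_type)}, "
        f"INFORMATION_TYPE_ID = {quote_text(info_type_id)}, RANK = {rank});"
    )


# GUIDs copied from the statement shown earlier in this post.
LABEL_ID = "fe62dcde-72c0-475c-b1af-fb8de4c8fc7e"
NAME_TYPE_ID = "57845286-7598-22f5-9659-15b24aeb125e"

# "LastName" is an assumed column, included only to show the loop.
for column_name in ("FirstName", "LastName"):
    print(build_classification_sql(
        "SalesLT", "Customer", column_name,
        "Confidential - GDPR", LABEL_ID, "Name", NAME_TYPE_ID))
```

The generated text can be pasted into the database project or run against the database and then pulled in with Schema Compare.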

Thanks again to those of you who joined us at the meeting. Here is the slide deck from that meeting. I look forward to sharing more with all of you later.

Azure Data Relational Services

Today I’d like to talk about the Azure Relational Data Services Platform. This is an important foundational component for many things that are being built on Azure Platform as a Service related to databases.

One of the key PaaS offerings when Microsoft started with Azure was Azure SQL Database. Moving forward, changes were made to this and Azure SQL DW was released. Recently, Microsoft released a preview of the Azure SQL Database Managed Instance option. This is significant as it is a v-core plus storage option intended to have parity with the on-premises version of SQL Server, and it is a key step toward separating compute and storage for Azure SQL Databases as well.


This is important since it allows Microsoft to standardize their relational database support pattern for other databases as well. This has existed for Azure DW for some time and was also improved in Gen 2. Check out more about this in some previous posts in this series.

Azure’s Relational Database platform supports Azure DW’s MPP platform, Azure SQL Database or SQL Server as PaaS, and Azure Database for MySQL and PostgreSQL. So, open source databases are supported on the same relational data services platform. Azure Database for MariaDB is coming by the end of 2018.

You may be thinking, why is all this important and what does a common platform include?

  • First, Azure storage services as a foundation for all databases and all the data on the Azure platform. All data stored here, as well as Azure Databases, whether open source or SQL, are encrypted at rest.
  • Azure storage also manages high availability by keeping three copies of data available for the platform at all times. So high availability is built in along with encryption at rest: secure and available.
  • Azure compute is the VMs supporting the compute needs of the databases. This is where you pick the cores that you want to provide scale up function. However, you’re not managing VMs, you’re managing capacity. Microsoft has taken on the task of understanding what you need from a capacity standpoint, like how do you want to scale up or down or how many v-cores do you want to set aside.
  • A key component of many things in Azure is that we can scale compute separate from storage. The database services platform sits on top of Azure storage and compute, so its strength is that the core of the solution lives in those 2 platforms. It allows support of MPP, open source and SQL databases with PaaS.
  • Database services is where the next tier happens (or all the cool stuff). On top of the foundation, Microsoft adds a set of common components that are used across all these databases.
  • It’s a trusted platform with things like backup and restore, security, audit and isolation all managed in this service. This allows you to trust the platform and build databases with confidence in the security.
  • It’s flexible, enabling scalability and resource management within the platform. This includes features like scaling up or down on demand and adding storage as you need, giving flexibility to the platform. This is hard to do if you build this for yourself or use an IaaS solution.
  • It’s intelligent. We see big benefits in the fact that it provides monitoring, automated tuning and advisors to the platform. These are built in to make your databases better, so you can rely on good performance and know what is happening in your database when you need to.
  • Think of the third tier (after storage/compute and database services) as each unique database platform and the features each brings to your application. Whether you’re using an open source product that’s using MySQL or a SQL Server, their feature sets come forward in PaaS.

Another advantage to mention is that by supporting standard SQL and managed instances, as well as MySQL and PostgreSQL Community Edition, the platform makes moving to the cloud much easier. This opens up opportunities for you to migrate in a clean fashion using all the capabilities of a system you’re familiar with.

Do You Need a Relational Data Warehouse?

Are you looking to do a major update to your data warehouse or looking to modernize? Many technologies have come about that are changing the landscape of what data warehouses are made of. In this Azure Every Day session, I’d like to talk about 3 new technologies in Azure and HDInsight that break the rules.

1. HDInsight (with Spark in particular)

You don’t have to use SSIS to get ETL into your big data storage. HDInsight with Spark can give us ETL, as well as bring to light a lot of machine learning and other technologies, so we no longer have a dependency on SSIS. Earlier this year, Microsoft released Azure Databricks, which improves on what is in HDInsight with expanded Spark capabilities.

2. Azure Data Lake

This is a great place to store your data, and no, it won’t cost you a lot of money and it’s not hard to work with. Azure Data Lake gives you the ability to store all your data, regardless of where it comes from or how it looks, in that space – including real time data.

Have you ever considered putting real time data into your data warehouse? Streaming data into your data warehouse would break everything we consider conventional. So, don’t put it in a data warehouse, move it into a Hadoop structure in Azure Data Lake, and avoid the issue in a structured data warehouse of: I got it in there, how do I get it out?


3. Interactive Hive and Spark SQL (also part of HDInsight)

The everyday report writers and users are not going to learn how to do MapReduce or all those other technologies that make us the cool big data/data science people. These users just need to build a report.

These technologies bring us to a place where we can write SQL against those data structures and not care where the data came from, how it got there, or the type of data format it’s in. We apply the schema after the fact and it means the same data that we put in our data lake, can be used in multiple scenarios.
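To make "apply the schema after the fact" concrete, here is a small sketch in plain Python rather than Hive or Spark SQL. The raw text, column names, and the read_with_schema helper are all assumptions made up for illustration. The point is only that the same stored data can be read with different schemas depending on who is consuming it.

```python
import csv
import io

# Raw data as it might land in the lake: no types, no column names.
# The values are invented for this example.
RAW_ORDERS = "1001,2018-09-14,19.99\n1002,2018-09-15,5.00\n"


def read_with_schema(raw_text, schema):
    """Apply a schema at read time.

    schema is a list of (column_name, converter) pairs, in column order.
    """
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        rows.append({name: convert(value)
                     for (name, convert), value in zip(schema, record)})
    return rows


# A reporting consumer wants typed values.
reporting_schema = [("order_id", int), ("order_date", str), ("amount", float)]

# An audit consumer wants everything exactly as stored.
audit_schema = [("id", str), ("date", str), ("raw_amount", str)]

for row in read_with_schema(RAW_ORDERS, reporting_schema):
    print(row)
for row in read_with_schema(RAW_ORDERS, audit_schema):
    print(row)
```

Neither consumer changed the stored data; each one only changed how it was read.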

Azure Data Warehouse and 60 Servers

In today’s segment of Azure Every Day, I’d like to talk about Azure Data Warehouse and to help you understand some of the underlying architecture and its impact on the workloads you’re planning to put into Azure. Azure Data Warehouse is a parallel data warehousing solution in the cloud, useful for large data workloads.

When you turn on Azure Data Warehouse for a workload, it will instantiate 60 SQL Server databases underneath, which means it’s looking at how to parallelize your workload across 60 databases. This is the standard Platform as a Service (PaaS) implementation. Unlike an appliance or APS, you don’t have control over the number of servers that you run. It is designed to handle this without you having to do anything to manage that workload.

So, when starting to use Azure Data Warehouse, there are 3 aspects that you should understand as to how it will affect how you plan and decide if the workload makes sense in Azure DW.

1. Tables Use Clustered Columnstore Indexes by Default

The first thing to realize is that all the tables going into Azure DW are, by default, clustered columnstore indexes. These indexes are great for compression and performance, but typically need a million rows in them before they become compressed. So, for example, if you think about a distributed workload across 60 databases, you need 60 million rows before you can see an effective compression solution or take advantage of the compression utilities in a clustered columnstore.

2. Tables Are Distributed, Use Partitions Sparingly

Be aware when you’re partitioning a table, that the table is already distributed. Hence, if you have a scenario where a table already has 100 partitions, that creates 6000 partitions across the Azure instance. To break this down further, you’re going to need about 6 billion rows before you can take advantage of the columnstore indexes, including compression. Therefore, if you’re not running a large system, partitioning may not give you the same benefit as on a SQL Server or Azure SQL database.
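The arithmetic behind those numbers is simple enough to script when you are sizing a table. This is a rough sketch in Python using only the figures from the text (60 databases and roughly a million rows per columnstore segment); the function names are my own, and the million-row figure is a rule of thumb, not a hard limit.

```python
DISTRIBUTIONS = 60               # databases underneath Azure DW, per the text
ROWS_FOR_COMPRESSION = 1_000_000  # rule-of-thumb rows per columnstore segment


def rows_needed(partitions=1):
    """Rows needed before every distribution/partition slice has ~1M rows."""
    return DISTRIBUTIONS * partitions * ROWS_FOR_COMPRESSION


def rows_per_slice(total_rows, partitions=1):
    """Average rows landing in each slice, assuming an even spread."""
    return total_rows / (DISTRIBUTIONS * partitions)


print(rows_needed())       # 60000000 (unpartitioned table)
print(rows_needed(100))    # 6000000000 (100 partitions)
print(rows_per_slice(600_000_000, partitions=100))  # 100000.0
```

The last line shows the trap: a 600 million row table sounds large, but with 100 partitions each slice averages only 100,000 rows, well short of the million-row mark.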

3. Parallelized Data Loads

Lastly, realize how the data is loaded into the data warehouse. The Azure Data Warehouse, with its 60 databases, is a great platform for loading data because you can parallelize the data load across those 60 databases. One tool that helps you is CTAS (CREATE TABLE AS SELECT), a SQL statement designed for use within Azure DW that can distribute workloads across partitions and load in parallel, resulting in a very fast load.

Used in conjunction with PolyBase, you can pull data from storage, like Azure Blob Storage or Azure Data Lake, and use CTAS to write that data out. A final thought on this scenario: you can use Azure Data Factory and its capabilities to take further advantage of that parallelism. With this, you can load a staging table or make sure it sends 60 files at a time so every server is being used.
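As a sketch of the "60 files at a time" idea, the snippet below groups a list of files into batches of 60. It is plain Python and does not call Azure Data Factory or any Azure API; the file names and the batches helper are hypothetical, and the actual load step is left out.

```python
DISTRIBUTIONS = 60  # one file per underlying database, per the text


def batches(items, size=DISTRIBUTIONS):
    """Yield consecutive groups of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical export files waiting in storage.
files = [f"sales_{n:03d}.csv" for n in range(150)]

for number, batch in enumerate(batches(files), start=1):
    # A real pipeline would hand each batch to the load step here.
    print(f"batch {number}: {len(batch)} files, first = {batch[0]}")
```

With 150 files this produces two full batches of 60 and a final batch of 30, so only the last batch leaves some of the 60 databases idle.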

Consumption Based Architecture for Modern Data Analytics

Throughout many years of working with BI solutions and data warehouse solutions, we have strived to put all the data in one location so it could be easily consumed by reporting and analysis tools from SQL Server Reporting Services to Microsoft Excel. We have followed the dimensional modeling processes promoted by Ralph Kimball and others. These techniques were developed to turn relational data platforms into viable and well-performing reporting platforms. They worked.

Throughout the years, I have built many star schemas, enterprise data warehouses, and reporting databases based on these techniques. However, they were not without their flaws. In the early days, it would take years to create the enterprise data warehouse. When it was done, the business had already moved on. So, we started creating data marts which were departmentally focused. This allowed us to shorten the development lifecycle to more quickly meet the needs of the business. But the speed of business continued to outpace IT’s ability to deliver effective BI solutions when they were needed.

During this same time, the world’s most ubiquitous BI tool, Microsoft Excel, ran more and more businesses, both large and small. Why was Excel so popular? It allowed the user, not IT, to do the analysis and produce results when the business needed them. Self-service BI is not new; we just refuse to accept Excel as a “real” BI tool in the industry. Inevitably, no matter how good your BI or reporting tool is, users want to know, “How do I export this to Excel?” I found it humorous that Microsoft suffered the same problem with Power BI. The ability to extract the data is hugely important to users. Why? So they can do this on their own. They feel empowered with Excel.

In today’s world, we are also witnessing a shift to a more mobile, tech savvy group of users. As my teenage and young adult children begin to enter the workforce, I still see a disconnect with enterprise BI solutions in most cases. They want the data at their fingertips and easily consumable to solve the question they have now, not in three weeks. That brings us to the architecture I have been promoting for a few years. It’s not new, but I needed a way to talk about modern data and BI solutions that focused on one of the most significant needs in the business – consumable data.

The Consumption Based Architecture is based on the following key concepts:

  1. Keep the data close to the source
  2. Data interfaces should be easy to use
  3. Modern, in-memory tools make this possible

Keep the data close to the source

In the normal enterprise data warehouse solution, we process data to clean it up, reshape it, and generally make it “better”. However, the reality is that the users don’t see it as better if it does not match the source. The only “source of truth” that matters in reality is the one that is as close to the data entry as possible. If that data is wrong, it needs to be corrected there. Why? Because users will always verify that the data warehouse or any BI solution is correct by checking the system of record.

Besides the source of truth issue, any time data is transformed or shaped differently than the source, documentation and maintenance are required. In most cases, documentation is lacking in data warehouse solutions. It’s hard and changes often. Furthermore, the amount of maintenance required to support a traditional data warehouse solution is a problem. When the business needs to change a field or the source changes, how do you measure how long it will take to get that in the data warehouse? Days? Weeks? Months? Years? By the time we have the change in place, tested and ready for use, the business has moved on.

In Consumption Based Architecture, the goal is to move the data as needed and transform it as little as possible. Operational Data Stores, which are replicas or copies of the data from the source systems, are the best mechanisms to move the data to a reporting area. Only transform if absolutely necessary. In most cases, our transforms will involve data type cleanup or, in cases where fields have changed use over time, maybe adding a column to clarify the data. Each time you move data or transform data, it should be to make the data more consumable.

The primary exception to this rule is dimensional data. Dimensional data has a special place in the architecture. While it is not necessary to transform this data, the overall solution is improved by combining similar dimensions and using natural keys for relationships. For example, if you have a customer record in CRM and shipping data, you can create a “golden record” or conformed dimension which contains both natural keys. This will allow you to cross reference data easily in both solutions. Master data solutions help with this but are not required to be successful in this architecture.
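Here is a minimal sketch of that “golden record” idea in Python. The sample records, field names, and the choice to match on a lower-cased email address are all assumptions for illustration; a real solution would use whatever matching rule your data governance process agrees on. What matters is that the conformed row carries both natural keys.

```python
# Invented sample data standing in for a CRM system and a shipping system.
crm_customers = [
    {"crm_id": "C-17", "name": "Contoso Ltd", "email": "ap@contoso.example"},
    {"crm_id": "C-18", "name": "Fabrikam", "email": "billing@fabrikam.example"},
]
shipping_customers = [
    {"ship_id": 9001, "name": "Contoso Limited", "email": "AP@contoso.example"},
]


def build_conformed_customers(crm_rows, shipping_rows):
    """Return one row per CRM customer carrying both natural keys.

    Matching on lower-cased email is an assumed rule for this sketch.
    """
    shipping_by_email = {row["email"].lower(): row for row in shipping_rows}
    conformed = []
    for crm_row in crm_rows:
        match = shipping_by_email.get(crm_row["email"].lower())
        conformed.append({
            "customer_name": crm_row["name"],
            "crm_id": crm_row["crm_id"],
            # None means the customer has no shipping record yet.
            "ship_id": match["ship_id"] if match else None,
        })
    return conformed


for row in build_conformed_customers(crm_customers, shipping_customers):
    print(row)
```

With both keys on one row, a report can join back to either source system without transforming the source data itself.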

Data Interfaces Should Be Easy

Having done a lot of work in the application development field where interfaces are used to simplify programming, I think we should have a similar concept in working with data. We have dabbled in the area for some time. I know that I have used views with schemas (or users in Oracle) that were specifically designed to support SQL Server Analysis Services. We used views to support a consistent data set to the cubes for processing and allow changes in the back end data as needed. The views operated as an interface between the data warehouse and the cube.

In a similar fashion, these interfaces need to be identified and used throughout the Consumption Based Architecture. If you are referencing a relational data structure, views continue to make sense. They allow consumers to interact with the data in a known fashion. They are also logical constructs which can be deprecated with a timeline for users to move off of them to the newer versions.
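The sketch below shows a view acting as an interface. It uses Python's built-in sqlite3 module as a stand-in for SQL Server so it can run anywhere; the table, column, and view names are made up. The back-end table changes shape, but the view keeps presenting the same columns, so the consumer query does not change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Original back end plus the view that consumers are told to use.
conn.executescript("""
    CREATE TABLE customer_v1 (id INTEGER, full_name TEXT);
    INSERT INTO customer_v1 VALUES (1, 'Ada Lovelace');
    CREATE VIEW reporting_customer AS
        SELECT id, full_name FROM customer_v1;
""")

CONSUMER_QUERY = "SELECT id, full_name FROM reporting_customer"
before = conn.execute(CONSUMER_QUERY).fetchall()

# The back end changes: the name is now stored in two columns.
# Only the view definition is updated to match.
conn.executescript("""
    CREATE TABLE customer_v2 (id INTEGER, first_name TEXT, last_name TEXT);
    INSERT INTO customer_v2 VALUES (1, 'Ada', 'Lovelace');
    DROP VIEW reporting_customer;
    CREATE VIEW reporting_customer AS
        SELECT id, first_name || ' ' || last_name AS full_name
        FROM customer_v2;
""")

after = conn.execute(CONSUMER_QUERY).fetchall()
print(before)
print(after)
print(before == after)
```

When the interface itself needs to change, a second view can be published alongside the first, with the old one retired on an announced timeline.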

Analytic models such as those found in Power BI, Qlik, and SQL Server Analysis Services create a similar tool for consumers. Usually those models can be used in the tool itself or even in other tools to produce reports and dashboards. Depending on how the model is created, they are often a table based view of data. For instance, Power BI can turn a folder of files into a table structure for easy consumption in Power BI reports. Power Pivot models created in Excel can be shared in SharePoint and in Power BI. These are just a few examples of using analytic models as interfaces.

The third option is data virtualization. This tends to be fairly expensive. The two that I am aware of are Cisco Data Virtualization and RedHat OpenShift virtualization. The concept of data virtualization is perfect for Consumption Based Architecture; however, I have not seen this used much due to cost implications. Microsoft may be changing the landscape of data virtualization. During PASS Summit 2016, they announced expanding the use of PolyBase to reference other data sources such as Oracle and Teradata. Currently, PolyBase supports Hadoop data, but this change could allow it to become a virtualization tool that is cost effective (included with SQL Server) and simple to use. I will be reviewing these tools in a later post as I get more information on them.

Modern, In-Memory Tools

The most significant technological improvement in the past few years to make this architecture really possible is in-memory data tools. While this revolution has occurred in more technologies than Microsoft, Microsoft tools are what I am most familiar with. In the Microsoft arena, this started with Power Pivot in Excel. The ability to mashup various data sources using in-memory models is awesome.

When Power Pivot came on the scene, I was doing a lot of work with SQL Server Analysis Services cubes. While these cubes provided a great analytic layer, they were very fragile in my experience. However, Power Pivot allowed us to create better performing models more quickly. At that point, I knew a shift was coming. Two of the biggest pain points with cube design were tackled – speed of development and ease of use. We could solve problems quicker and easier with Power Pivot.


Since the release of Power Pivot, Microsoft has also improved their overall in-memory solution set by adding SQL Server Analysis Services Tabular Models and OLTP in-memory and columnstore functionality in SQL Server. These improvements continue to make it easier than ever to build out consumable models in memory. You can now use columnstore in SQL Server without an analytics model, simplifying your architecture without performance penalties. With the release of SQL Server 2016, Microsoft has created a data engine capable of in-memory OLTP to improve transactional loads while also supporting columnstore indexes for reporting loads in the same database. While your mileage may vary on implementation, it is easy to see that Consumption Based Architectures are best able to take advantage of these advances and make data more easily accessible and consumable to our business users.

What’s Next

Over the next few weeks, I will be digging in on some of the topics that support Consumption Based Architecture. The goal is to help you begin to take advantage of this architecture in your business and build out a flexible, easily consumed data and analytics platform.


Upcoming Topics:

  • Interface Layers
  • Data Dictionaries
  • Dimensional or Master Data
  • Moving and Transforming Data
  • Modeling in MSBI
  • Reporting with Consumption Based Architecture