End-to-End Azure Data Engineering Project | Medallion Architecture with ADF, Databricks & Synapse
This is an end-to-end Azure Data Engineering project in which I ingest data from a GitHub HTTP source into the bronze layer through Azure Data Factory, transform the data in the silver layer using Databricks, and serve the data to downstream consumers (Analytics, Data Science) using Azure Synapse Analytics.
I have followed the Medallion architecture, where:
• Raw data is uploaded to the bronze layer using parameters and dynamic pipelines.
• Raw data is transformed in the silver layer using Databricks.
Finally, the transformed data is served into the gold layer, where views of the data are created for downstream analysis.
ABOUT THE DATASET:
This is an open-source dataset from Kaggle named Adventure Works.
Here is a quick look at the files in the dataset:
AdventureWorks_Calendar.csv
AdventureWorks_Customers.csv
AdventureWorks_Product_Categories.csv
AdventureWorks_Product_Subcategories.csv
AdventureWorks_Products.csv
AdventureWorks_Returns.csv
AdventureWorks_Sales_2015.csv
AdventureWorks_Sales_2016.csv
AdventureWorks_Sales_2017.csv
AdventureWorks_Territories.csv
I have used Azure Data Lake Storage (ADLS) to store the data.
Why ADLS Gen 2?
Storage in Blob vs. ADLS
In ADLS, we can store data in the form of hierarchies. While creating the storage account, make sure to enable “Hierarchical Namespace” to create ADLS; otherwise, a Blob Storage account will be created by default.
Containers in Storage Account
I have created 3 containers to follow the Medallion architecture (bronze, silver and gold).
THE BRONZE LAYER:
Now, I will load the data from GitHub into the bronze layer (raw data). I will first create a static load of a single file from GitHub to bronze, and then a dynamic pipeline that pulls all files from a folder into bronze.
HTTP Linked Service
I have created an HTTP Linked Service by giving it the base URL of the products.csv file from GitHub's raw content.
Now, I will create an Azure Data Lake Storage Linked Service to connect to ADLS.
ADLS Linked Service
COPY ACTIVITY:
The Copy Activity requires the following information:
Source: the name of the source dataset, its linked service, and the relative path of the source file. (For GitHub, we use the HTTP Linked Service.)
Sink: the name of the sink dataset, its linked service, and the destination path. (To store the data in ADLS, we use the ADLS Linked Service.)
Source Dataset from git
Sink Dataset for ADLS
After connecting the Copy Activity to the source and sink, I will run the Debug option to execute this pipeline.
Successful Execution of Pipeline
Data Successfully Copied to the bronze folder
Now, I will be creating a Copy Activity which will be parameterised.
I created a new Pipeline where I dragged a Copy Activity.
The source will be parameterised in the following way:
Parameterised Source Dataset
The Copy Activity's source receives a p_rel_url input (each relative URL will come from a ForEach Activity), which changes dynamically on each copy iteration.
The sink will be parameterised in the following way:
Parameterised Sink Dataset
Now, I will create a JSON file which will contain the following fields:
p_rel_url
p_sink_folder
p_sink_file
Sample json file snippet:
json file with Parameters
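Each entry in git.json might look roughly like the sketch below (the relative URLs and folder/file values are placeholders, not the exact ones used in the project):

    [
      {
        "p_rel_url": "<github-user>/<repo>/main/Data/AdventureWorks_Products.csv",
        "p_sink_folder": "AdventureWorks_Products",
        "p_sink_file": "AdventureWorks_Products.csv"
      },
      {
        "p_rel_url": "<github-user>/<repo>/main/Data/AdventureWorks_Customers.csv",
        "p_sink_folder": "AdventureWorks_Customers",
        "p_sink_file": "AdventureWorks_Customers.csv"
      }
    ]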
This git.json file will be uploaded to the data lake under a parameters folder.
I will be creating a Lookup Activity (to get the information inside the git.json file).
LookUp Activity to Read git.json
Now, the output of this Lookup Activity will be directed to a ForEach Activity. (Make sure to uncheck “First row only” so that the entire JSON file is read.)
Output of LookUp Activity
I have now connected the Lookupjson Activity to ForEach1 so that its output (all the file parameters) is fed to the ForEach Activity (the Lookup output's value array).
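In the ForEach Activity's Items field, this is typically the dynamic content expression @activity('Lookupjson').output.value (the activity name Lookupjson is taken from the pipeline above).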
Inside the ForEach Activity canvas, I will paste the Copy Activity which I created earlier and supply the parameters as follows:
Source: Input of p_rel_url from ForEach Item
Sink: Input of p_folder_name and p_file_name from ForEach Item
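Inside the ForEach, each dataset parameter is typically filled with a dynamic content expression such as @item().p_rel_url for the source, and @item().p_sink_folder / @item().p_sink_file for the sink parameters (assuming the field names defined in git.json).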
Now, I will exit the ForEach Canvas and will run the debug option.
The Pipeline is Executed Successfully
All the Files are now Uploaded in the bronze container
The BRONZE layer of the Medallion architecture is now completed with a parameterised approach.
THE SILVER LAYER:
Now I will be creating a Databricks Resource under the Resource Group (Azure-Project).
The compute information of the Databricks Cluster is as follows:
Compute Information of Cluster
I will now create a notebook (silver_notebook) under a folder (Databricks) inside the Workspace tab.
I will now register an application in Microsoft Entra ID so that I can connect Databricks to my Storage Account by giving it the application's credentials.
Application Details to connect Databricks and ADLS
Here is a simple flow diagram to connect Azure Databricks with ADLS:
Connection of Azure Databricks and ADLS
I have completed the following steps for the connection:
• Copy the information regarding the application registration.
• Create a secret and copy the secret's value for the secure connection.
• Grant the Storage Blob Data Contributor role on the Storage Account through IAM, adding the application as a member.
• Apply the connection credentials in the Databricks notebook.
Code for Connecting Azure DataBricks with ADLS
Sample Loading of Data
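The actual code is shown in the screenshots above; as a hedged sketch, the service-principal (OAuth) configuration and a sample load might look like the following, where the storage account name, secret scope, and file paths are placeholders:

    # Application (service principal) credentials - placeholders, ideally read from a secret scope
    client_id     = dbutils.secrets.get(scope="azure-project", key="app-client-id")
    client_secret = dbutils.secrets.get(scope="azure-project", key="app-client-secret")
    tenant_id     = dbutils.secrets.get(scope="azure-project", key="tenant-id")

    storage_account = "<storage-account-name>"

    # OAuth configuration so Spark can authenticate to ADLS Gen2 through the app registration
    spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
                   "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
    spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
                   f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

    # Sample load of one of the bronze CSV files
    df_cal = (spark.read.format("csv")
              .option("header", True)
              .option("inferSchema", True)
              .load(f"abfss://bronze@{storage_account}.dfs.core.windows.net/AdventureWorks_Calendar"))
    df_cal.display()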
If the code runs properly but you see red underlines in your editor, it likely means there are syntax-checking issues within the editor/IDE. These do not affect the execution of the code.
Now I will be performing some transformations and will push the transformed data into the silver layer.
PYSPARK TRANSFORMATIONS:
I have created new columns called ‘Month’ and ‘Year’ by extracting the month and year from the Date column.
The withColumn function is used to create or modify a column. It:
• Creates a new column if we provide a different column name.
• Modifies an existing column if we provide the same column name.
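A minimal sketch of this step, assuming the calendar dataframe is df_cal and the date column is named Date:

    from pyspark.sql.functions import col, month, year

    # Derive Month and Year as new columns from the existing Date column
    df_cal = (df_cal
              .withColumn("Month", month(col("Date")))
              .withColumn("Year", year(col("Date"))))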
Now, I will write the df_cal (Transformed data) into the silver container.
Load the data to silver container
Data Successfully Loaded into the silver container
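The write itself might look roughly like this; the silver folder name and the Parquet format are assumptions based on how the data is queried later:

    # Write the transformed calendar data to the silver container as Parquet
    (df_cal.write
        .format("parquet")
        .mode("overwrite")
        .save("abfss://silver@<storage-account-name>.dfs.core.windows.net/AdventureWorks_Calendar"))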
Now I will be performing Transformations on df_cus data.
Creating FullName Column using concat() function
Here I have used the following functions:
• concat: concatenates different columns.
• lit: since (‘ ‘) is a space (i.e., a constant literal), I have used the lit function for it.
• drop: I have dropped the prefix, first-name and last-name columns because I have combined them into a single column.
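A hedged sketch of this transformation, assuming the customer dataframe is df_cus and the columns are named Prefix, FirstName and LastName:

    from pyspark.sql.functions import col, concat, lit

    # Build FullName from the three name columns, then drop the originals
    df_cus = (df_cus
              .withColumn("FullName",
                          concat(col("Prefix"), lit(" "), col("FirstName"), lit(" "), col("LastName")))
              .drop("Prefix", "FirstName", "LastName"))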
I will now write the df_cus data into the silver container; the code is the same as for df_cal, except that only the dataframe name must be changed.
Now, I will Transform df_pro data.
Sample Snippet of df_pro Data
I will perform a transformation that retrieves the size of a product from ProductSKU.
Notice that some of the records in ProductSKU don't have any size in them; in that case we get NULL values. I will fill the NULLs with the value Not Applicable.
Created ProductSize and filled NULL with Not Applicable
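As a rough sketch, the size can be pulled out of ProductSKU with a string split and the NULLs filled afterwards; the exact split logic is an assumption:

    from pyspark.sql.functions import col, split, when

    # Assume the size is the last '-'-separated token of ProductSKU (e.g. 'FR-R92B-58' -> '58');
    # SKUs without that token yield NULL
    df_pro = df_pro.withColumn("ProductSize", split(col("ProductSKU"), "-").getItem(2))

    # Replace the NULLs with 'Not Applicable'
    df_pro = df_pro.withColumn("ProductSize",
                               when(col("ProductSize").isNull(), "Not Applicable")
                               .otherwise(col("ProductSize")))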
I will now write the df_pro into the silver container.
I will analyse data in df_sales using some Transformations.
GroupBy Function
GroupBy Aggregation:
• I grouped the data by TerritoryKey to see the performance of orders based on territories.
• I performed a sum aggregation on the OrderQuantity column to see the total number of orders.
• Finally, I used the alias function to rename the sum(OrderQuantity) column.
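A minimal sketch of this aggregation (the alias name TotalOrderQuantity is my own choice):

    from pyspark.sql.functions import sum as _sum  # avoid shadowing Python's built-in sum

    (df_sales
        .groupBy("TerritoryKey")
        .agg(_sum("OrderQuantity").alias("TotalOrderQuantity"))
        .orderBy("TotalOrderQuantity", ascending=False)
        .display())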
I have made the following Transformations to df_sales data:
• The regexp_replace function replaces a string with the string we want. In this case, I have replaced SO with ON (Order Number).
• Performed an arithmetic operation on columns, where OrderLineColumn is obtained by multiplying OrderLineItem and OrderQuantity.
• Sometimes downstream professionals require date information as timestamps, so I have converted OrderDate into a timestamp.
• Finally, I have used the select function to show only the changed and modified columns instead of the whole data frame.
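A hedged sketch of these transformations; column names such as OrderNumber are assumptions:

    from pyspark.sql.functions import col, regexp_replace, to_timestamp

    df_sales = (df_sales
        # Replace the 'SO' prefix of the order number with 'ON'
        .withColumn("OrderNumber", regexp_replace(col("OrderNumber"), "SO", "ON"))
        # Arithmetic on columns: OrderLineItem * OrderQuantity
        .withColumn("OrderLineColumn", col("OrderLineItem") * col("OrderQuantity"))
        # Downstream consumers often prefer timestamps to plain dates
        .withColumn("OrderDate", to_timestamp(col("OrderDate"))))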
Now, I will write all the remaining data frames to the silver container.
Transformed data in silver container
VISUALISATION IN DATABRICKS:
Scenario 1:
I have done a territory-wise analysis where we can understand which territory is best performing in sales.
Scenario 2:
Month-wise and Year-wise Sales Analysis
Here I have performed a groupBy on both OrderYear and OrderMonth so that stakeholders can easily hover over a specific month and see the total quantity sold by the distributors.
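The aggregation behind this chart might look roughly like the sketch below, assuming OrderYear and OrderMonth are derived from OrderDate:

    from pyspark.sql.functions import col, month, year, sum as _sum

    (df_sales
        .withColumn("OrderYear", year(col("OrderDate")))
        .withColumn("OrderMonth", month(col("OrderDate")))
        .groupBy("OrderYear", "OrderMonth")
        .agg(_sum("OrderQuantity").alias("TotalQuantity"))
        .orderBy("OrderYear", "OrderMonth")
        .display())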
Scenario 3:
Here, we can see that most of the stock manufacturing in the year 2004 was done in the first few months, while the stock manufacturing in 2003 was done in the last months of the year.
The SILVER layer of the Medallion architecture is now complete: all the transformation and visualisation is done in Databricks, and the results are written into the silver container.
For the gold layer presentation, I will be using Azure Synapse Analytics.
Creation of Synapse Workspace
I have created a default Storage Account and default file system, which will be used by Synapse Analytics.
Creation of SQL Server Login
Creation of Managed Identity
Now, I have created a managed identity and granted it the Storage Blob Data Contributor role.
A managed identity helps in linking Azure resources with each other; now Synapse Analytics can access data from ADLS.
Azure Synapse Analytics helps us implement the lakehouse concept.
The above diagram represents an abstraction layer over Azure Data Lake Storage, allowing SQL Server and users to query data using metadata.
Synapse Analytics retrieves data from Azure Data Lake Storage (ADLS) and enables SQL queries, advanced analytics, and other functionality, making it easier to process, analyse, and visualise large-scale data efficiently.
IAM to Query the Data as a User
I have now queried the data in ADLS using SQL syntax and displayed the Parquet-format data in the form of a table.
The OPENROWSET() function creates an abstraction layer on top of ADLS and lets me run SQL queries against the Parquet files.
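As a hedged sketch, querying one of the Parquet folders in the silver container from the Synapse serverless SQL pool looks roughly like this (the storage account name and folder path are placeholders):

    SELECT TOP 100 *
    FROM OPENROWSET(
            BULK 'https://<storage-account-name>.dfs.core.windows.net/silver/AdventureWorks_Sales/',
            FORMAT = 'PARQUET'
         ) AS sales;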
Now, I will create a gold schema in which I will store all the views.
Successfully Created Views for all Data in the silver Layer
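A minimal sketch of how one of these gold views might be defined (the view name and path are placeholders):

    CREATE SCHEMA gold;
    GO

    CREATE VIEW gold.sales
    AS
    SELECT *
    FROM OPENROWSET(
            BULK 'https://<storage-account-name>.dfs.core.windows.net/silver/AdventureWorks_Sales/',
            FORMAT = 'PARQUET'
         ) AS sales_data;
    GO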
Now stakeholders, managers or data analysts can query the data as if it were a SQL Server database, but in reality all the data is retrieved from ADLS through an abstraction layer.
Now, I have successfully created the gold schema in Azure Synapse Analytics and completed the requirements in accordance with the Medallion architecture.
Azure Synapse Analytics simplifies data processing by allowing you to query large datasets directly from Azure Data Lake without the need for expensive SQL databases. Synapse makes it easy to run SQL queries on big data, providing fast insights while keeping expenses low, making it a powerful tool for modern businesses.