End-to-End Azure Data Engineering Project | Medallion Architecture with ADF, Databricks & Synapse
This is an end-to-end Azure Data Engineering project in which I ingest data from a GitHub HTTP source into the bronze layer through Azure Data Factory, transform the data in the silver layer using Databricks, and serve the data to downstream consumers (Analytics, Data Science) using Azure Synapse Analytics.
I have followed the Medallion architecture, where:
• Raw data is uploaded to the bronze layer using parameters and dynamic pipelines.
• Raw data is transformed in the silver layer using Databricks.
Finally, the transformed data is served into the gold layer, where views of the data are created for downstream analysis.
ABOUT THE DATASET:
This is an open-source dataset from Kaggle named Adventure Works.
Here is a quick look at the files in the dataset:
AdventureWorks_Calendar.csv
AdventureWorks_Customers.csv
AdventureWorks_Product_Categories.csv
AdventureWorks_Product_Subcategories.csv
AdventureWorks_Products.csv
AdventureWorks_Returns.csv
AdventureWorks_Sales_2015.csv
AdventureWorks_Sales_2016.csv
AdventureWorks_Sales_2017.csv
AdventureWorks_Territories.csv
I have used Azure Data Lake Storage (ADLS) to store the data.
Why ADLS Gen 2?
Storage in Blob vs. ADLS
In ADLS, we can store data in the form of hierarchies. While creating the storage account, make sure to enable “Hierarchical Namespace” to create ADLS; otherwise, a Blob Storage account will be created by default.
Containers in Storage Account
I have created 3 containers to follow the Medallion architecture (bronze, silver and gold).
THE BRONZE LAYER:
Now, I will load the data from GitHub into the bronze layer (raw data). I will first create a static load of a single file from GitHub to bronze, and then a dynamic pipeline that pulls all files from a folder into bronze.
HTTP Linked Service
I have created an HTTP Linked Service by giving it the base URL of the products.csv file from GitHub's raw content.
Now, I will create an Azure Data Lake Storage Linked Service to connect to ADLS.
ADLS Linked Service
COPY ACTIVITY:
The Copy Activity requires the following information:
Source: the name of the source dataset, its linked service, and the relative path of the source file. (For GitHub, we use the HTTP Linked Service.)
Sink: the name of the sink dataset, its linked service, and the destination path. (To store the data in ADLS, we use the ADLS Linked Service.)
Source Dataset from git
Sink Dataset for ADLS
After connecting the Copy Activity to the source and sink, I will run the Debug option to execute this pipeline.
Successful Execution of Pipeline
Data Successfully Copied to the bronze folder
Now, I will be creating a Copy Activity which will be parameterised.
I created a new Pipeline where I dragged a Copy Activity.
The source will be parameterised in the following way:
Parameterised Source Dataset
The Copy Activity's source receives a p_rel_url input (each relative URL will come from a ForEach Activity), which changes dynamically on each copy iteration.
The sink will be parameterised in the following way:
Parameterised Sink Dataset
Now, I will create a JSON file which will contain the following fields:
p_rel_url
p_sink_folder
p_sink_file
Sample json file snippet:
json file with Parameters
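Each entry in git.json might look roughly like the sketch below (the relative URLs and folder/file values are placeholders, not the exact ones used in the project):

    [
      {
        "p_rel_url": "<github-user>/<repo>/main/Data/AdventureWorks_Products.csv",
        "p_sink_folder": "AdventureWorks_Products",
        "p_sink_file": "AdventureWorks_Products.csv"
      },
      {
        "p_rel_url": "<github-user>/<repo>/main/Data/AdventureWorks_Customers.csv",
        "p_sink_folder": "AdventureWorks_Customers",
        "p_sink_file": "AdventureWorks_Customers.csv"
      }
    ]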
This git.json file will be uploaded to the data lake under a parameters folder.
I will be creating a Lookup Activity (to get the information inside the git.json file).
LookUp Activity to Read git.json
Now, the output of this Lookup Activity will be directed to a ForEach Activity. (Make sure to uncheck “First row only” so that the entire JSON file is read.)
Output of LookUp Activity
I have now connected the Lookupjson Activity to ForEach1 so that its output (all the file parameters) is fed to the ForEach Activity (the Lookup output's value array).
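In the ForEach Activity's Items field, this is typically the dynamic content expression @activity('Lookupjson').output.value (the activity name Lookupjson is taken from the pipeline above).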
Inside the ForEach Activity canvas, I will paste the Copy Activity which I created earlier and supply the parameters as follows:
Source: Input of p_rel_url from ForEach Item
Sink: Input of p_folder_name and p_file_name from ForEach Item
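Inside the ForEach, each dataset parameter is typically filled with a dynamic content expression such as @item().p_rel_url for the source, and @item().p_sink_folder / @item().p_sink_file for the sink parameters (assuming the field names defined in git.json).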
Now, I will exit the ForEach Canvas and will run the debug option.
The Pipeline is Executed Successfully
All the Files are now Uploaded in the bronze container
The BRONZE layer of the Medallion architecture is now completed with a parameterised approach.
THE SILVER LAYER:
Now I will be creating a Databricks Resource under the Resource Group (Azure-Project).
The compute information of the Databricks Cluster is as follows:
Compute Information of Cluster
I will now create a notebook (silver_notebook) under a folder (Databricks) inside the Workspace tab.
I will now register an application in Microsoft Entra ID so that I can connect Databricks to my Storage Account by giving it the application's credentials.
Application Details to connect Databricks and ADLS
Here is a simple flow diagram to connect Azure Databricks with ADLS:
Connection of Azure Databricks and ADLS
I have completed the following steps for the connection:
• Copy the information regarding the application registration.
• Create a secret and copy the secret's value for the secure connection.
• Grant the Storage Blob Data Contributor role on the Storage Account through IAM, adding the application as a member.
• Apply the connection credentials in the Databricks notebook.
Code for Connecting Azure DataBricks with ADLS
Sample Loading of Data
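The actual code is shown in the screenshots above; as a hedged sketch, the service-principal (OAuth) configuration and a sample load might look like the following, where the storage account name, secret scope, and file paths are placeholders:

    # Application (service principal) credentials - placeholders, ideally read from a secret scope
    client_id     = dbutils.secrets.get(scope="azure-project", key="app-client-id")
    client_secret = dbutils.secrets.get(scope="azure-project", key="app-client-secret")
    tenant_id     = dbutils.secrets.get(scope="azure-project", key="tenant-id")

    storage_account = "<storage-account-name>"

    # OAuth configuration so Spark can authenticate to ADLS Gen2 through the app registration
    spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
    spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
                   "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
    spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
    spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
    spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
                   f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

    # Sample load of one of the bronze CSV files
    df_cal = (spark.read.format("csv")
              .option("header", True)
              .option("inferSchema", True)
              .load(f"abfss://bronze@{storage_account}.dfs.core.windows.net/AdventureWorks_Calendar"))
    df_cal.display()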
If the code runs properly but you see red underlines in your editor, it likely means there are syntax-checking issues within the editor/IDE. These do not affect the execution of the code.
Now I will be performing some transformations and will push the transformed data into the silver layer.
PYSPARK TRANSFORMATIONS:
I have created new columns called ‘Month’ and ‘Year’ by extracting the month and year from the Date column.
The withColumn function is used to create or modify a column. It:
• Creates a new column if we provide a different column name.
• Modifies an existing column if we provide the same column name.
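A minimal sketch of this step, assuming the calendar dataframe is df_cal and the date column is named Date:

    from pyspark.sql.functions import col, month, year

    # Derive Month and Year as new columns from the existing Date column
    df_cal = (df_cal
              .withColumn("Month", month(col("Date")))
              .withColumn("Year", year(col("Date"))))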
Now, I will write the df_cal (Transformed data) into the silver container.
Load the data to silver container
Data Successfully Loaded into the silver container
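The write itself might look roughly like this; the silver folder name and the Parquet format are assumptions based on how the data is queried later:

    # Write the transformed calendar data to the silver container as Parquet
    (df_cal.write
        .format("parquet")
        .mode("overwrite")
        .save("abfss://silver@<storage-account-name>.dfs.core.windows.net/AdventureWorks_Calendar"))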
Now I will be performing Transformations on df_cus data.
Creating FullName Column using concat() function
Here I have used the following functions:
• concat: concatenates different columns.
• lit: since (‘ ‘) is a space (i.e., a constant literal), I have used the lit function for it.
• drop: I have dropped the prefix, first-name and last-name columns because I have combined them into a single column.
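A hedged sketch of this transformation, assuming the customer dataframe is df_cus and the columns are named Prefix, FirstName and LastName:

    from pyspark.sql.functions import col, concat, lit

    # Build FullName from the three name columns, then drop the originals
    df_cus = (df_cus
              .withColumn("FullName",
                          concat(col("Prefix"), lit(" "), col("FirstName"), lit(" "), col("LastName")))
              .drop("Prefix", "FirstName", "LastName"))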
I will now write the df_cus data into the silver container; the code is the same as for df_cal, except that only the dataframe name must be changed.
Now, I will Transform df_pro data.
Sample Snippet of df_pro Data
I will perform a transformation that retrieves the size of a product from ProductSKU.
Notice that some of the records in ProductSKU don't have any size in them; in that case we get NULL values. I will fill the NULLs with the value Not Applicable.
Created ProductSize and filled NULL with Not Applicable
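As a rough sketch, the size can be pulled out of ProductSKU with a string split and the NULLs filled afterwards; the exact split logic is an assumption:

    from pyspark.sql.functions import col, split, when

    # Assume the size is the last '-'-separated token of ProductSKU (e.g. 'FR-R92B-58' -> '58');
    # SKUs without that token yield NULL
    df_pro = df_pro.withColumn("ProductSize", split(col("ProductSKU"), "-").getItem(2))

    # Replace the NULLs with 'Not Applicable'
    df_pro = df_pro.withColumn("ProductSize",
                               when(col("ProductSize").isNull(), "Not Applicable")
                               .otherwise(col("ProductSize")))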
I will now write the df_pro into the silver container.
I will analyse data in df_sales using some Transformations.
GroupBy Function
GroupBy Aggregation:
• I grouped the data by TerritoryKey to see the performance of orders based on territories.
• I performed a sum aggregation on the OrderQuantity column to see the total number of orders.
• Finally, I used the alias function to rename the sum(OrderQuantity) column.
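A minimal sketch of this aggregation (the alias name TotalOrderQuantity is my own choice):

    from pyspark.sql.functions import sum as _sum  # avoid shadowing Python's built-in sum

    (df_sales
        .groupBy("TerritoryKey")
        .agg(_sum("OrderQuantity").alias("TotalOrderQuantity"))
        .orderBy("TotalOrderQuantity", ascending=False)
        .display())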
I have made the following Transformations to df_sales data:
• The regexp_replace function replaces a string with the string we want. In this case, I have replaced SO with ON (Order Number).
• Performed an arithmetic operation on columns, where OrderLineColumn is obtained by multiplying OrderLineItem and OrderQuantity.
• Sometimes downstream professionals require date information as timestamps, so I have converted OrderDate into a timestamp.
• Finally, I have used the select function to show only the changed and modified columns instead of the whole data frame.
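A hedged sketch of these transformations; column names such as OrderNumber are assumptions:

    from pyspark.sql.functions import col, regexp_replace, to_timestamp

    df_sales = (df_sales
        # Replace the 'SO' prefix of the order number with 'ON'
        .withColumn("OrderNumber", regexp_replace(col("OrderNumber"), "SO", "ON"))
        # Arithmetic on columns: OrderLineItem * OrderQuantity
        .withColumn("OrderLineColumn", col("OrderLineItem") * col("OrderQuantity"))
        # Downstream consumers often prefer timestamps to plain dates
        .withColumn("OrderDate", to_timestamp(col("OrderDate"))))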
Now, I will write all the remaining data frames to the silver container.
Transformed data in silver container
VISUALISATION IN DATABRICKS:
Scenario 1:
I have done a territory-wise analysis where we can understand which territory is best performing in sales.
Scenario 2:
Month-wise and Year-wise Sales Analysis
Here I have performed a groupBy on both OrderYear and OrderMonth so that stakeholders can easily hover over a specific month and see the total quantity sold by the distributors.
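The aggregation behind this chart might look roughly like the sketch below, assuming OrderYear and OrderMonth are derived from OrderDate:

    from pyspark.sql.functions import col, month, year, sum as _sum

    (df_sales
        .withColumn("OrderYear", year(col("OrderDate")))
        .withColumn("OrderMonth", month(col("OrderDate")))
        .groupBy("OrderYear", "OrderMonth")
        .agg(_sum("OrderQuantity").alias("TotalQuantity"))
        .orderBy("OrderYear", "OrderMonth")
        .display())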
Scenario 3:
Here, we can see that most of the stock manufacturing in the year 2004 was done in the first few months, while the stock manufacturing in 2003 was done in the last months of the year.
The SILVER layer of the Medallion architecture is now complete: all the transformation and visualisation is done in Databricks, and the results are written into the silver container.
For the gold layer presentation, I will be using Azure Synapse Analytics.
Creation of Synapse Workspace
I have created a default Storage Account and default file system, which will be used by Synapse Analytics.
Creation of SQL Server Login
Creation of Managed Identity
Now, I have created a managed identity and granted it the Storage Blob Data Contributor role.
A managed identity helps in linking Azure resources with each other; now Synapse Analytics can access data from ADLS.
Azure Synapse Analytics helps us implement the lakehouse concept.
The above diagram represents an abstraction layer over Azure Data Lake Storage, allowing SQL Server and users to query data using metadata.
Synapse Analytics retrieves data from Azure Data Lake Storage (ADLS) and enables SQL queries, advanced analytics, and other functionality, making it easier to process, analyse, and visualise large-scale data efficiently.
IAM to Query the Data as a User
I have now queried the data in ADLS using SQL syntax and displayed the Parquet-format data in the form of a table.
The OPENROWSET() function creates an abstraction layer on top of ADLS and lets me run SQL queries against the Parquet files.
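As a hedged sketch, querying one of the Parquet folders in the silver container from the Synapse serverless SQL pool looks roughly like this (the storage account name and folder path are placeholders):

    SELECT TOP 100 *
    FROM OPENROWSET(
            BULK 'https://<storage-account-name>.dfs.core.windows.net/silver/AdventureWorks_Sales/',
            FORMAT = 'PARQUET'
         ) AS sales;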
Now, I will create a gold schema in which I will store all the views.
Successfully Created Views for all Data in the silver Layer
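A minimal sketch of how one of these gold views might be defined (the view name and path are placeholders):

    CREATE SCHEMA gold;
    GO

    CREATE VIEW gold.sales
    AS
    SELECT *
    FROM OPENROWSET(
            BULK 'https://<storage-account-name>.dfs.core.windows.net/silver/AdventureWorks_Sales/',
            FORMAT = 'PARQUET'
         ) AS sales_data;
    GO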
Now stakeholders, managers or data analysts can query the data as if it were a SQL Server database, but in reality all the data is retrieved from ADLS through an abstraction layer.
Now, I have successfully created the gold schema in Azure Synapse Analytics and completed the requirements in accordance with the Medallion architecture.
Azure Synapse Analytics simplifies data processing by allowing you to query large datasets directly from Azure Data Lake without the need for expensive SQL databases. Synapse makes it easy to run SQL queries on big data, providing fast insights while keeping expenses low, making it a powerful tool for modern businesses.