PROJECT 1 For Python
Real-Time Data Lakehouse for Traffic and Roads Analytics
Project Overview
This project builds a real-time data pipeline using Azure Data Lake and Delta Lake to process and
analyze traffic and roads datasets. Data is manually loaded into a landing zone to simulate
incremental ingestion. It then flows into the bronze layer using Spark Structured Streaming with
Auto Loader, capturing all raw records.
From the bronze layer, only new data is transformed and moved to the silver layer for cleaning and
enrichment. The final curated data is stored in the gold layer as optimized tables or views, ready
for reporting and data science.
Key technologies include:
• Delta Lake for reliable storage and ACID transactions
• Auto Loader for incremental data ingestion
• Power BI for reporting and visualization
• Azure DevOps for CI/CD
• Access controls to simulate enterprise-grade security
This end-to-end pipeline supports real-time analytics in a scalable, cloud-native environment.
Project Architecture
The project uses a Medallion Architecture (Bronze → Silver → Gold) on Azure Databricks with Delta
Lake and Unity Catalog for governance. Data flows through the following stages:
• Landing Zone: Raw traffic and roads data are manually ingested into Azure Data Lake
Storage Gen2 (/landing folder) to simulate streaming input.
• Bronze Layer: Ingested incrementally using Auto Loader with Structured Streaming. Raw
data is stored as Delta tables (raw_traffic, raw_roads) under the bronze schema.
• Silver Layer: Transforms include renaming columns, deriving Electric_Vehicles_Count,
Motor_Vehicles_Count, and categorizing road types. Cleaned datasets (silver_traffic,
silver_roads) are stored in the silver schema.
• Gold Layer: Final business logic is applied (e.g., Vehicle_Intensity), and tables are
optimized for reporting (gold_traffic, gold_roads). These are consumed in Power BI for
insights.
• Governance: Managed via Unity Catalog with a three-tier namespace
(catalog.schema.table) and fine-grained access control.
• CI/CD: Implemented using Azure DevOps for deploying code and configurations across
environments (Dev → UAT → Prod).
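To make the three-level namespace and fine-grained access control described in the Governance point concrete, here is a minimal illustration. The catalog, table, and group names (dev_catalog.gold.gold_traffic, data_analysts) are examples consistent with this document, not the project's actual grants.

```python
# Query a table through the three-level namespace (catalog.schema.table)
spark.sql("SELECT COUNT(*) FROM dev_catalog.gold.gold_traffic").show()

# Fine-grained access control: grant read access on a single table to a group
spark.sql("GRANT SELECT ON TABLE dev_catalog.gold.gold_traffic TO `data_analysts`")
```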
Project Setup
This project is organized using Azure Data Lake Storage Gen2, structured into containers and
folders representing different stages of the data pipeline.
We have 3 containers in Azure Data Lake Storage:
1. Landing:
   a. Holds the raw input data.
   b. Contains 2 sub folders: raw_traffic (for the traffic dataset) and raw_roads (for the roads dataset).
2. Medallion:
   a. Holds the bronze, silver, and gold layer folders.
3. Checkpoints:
   a. Stores the Auto Loader and Structured Streaming checkpoint data.
External Locations
The following external locations are registered in Unity Catalog to enable secure and governed access
to data:
1. Landing
2. Checkpoints
3. Bronze
4. Silver
5. Gold
Each location corresponds to a specific path in the data lake and is linked with the appropriate storage credential.
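As a rough sketch of how one of these external locations could be registered, the snippet below uses spark.sql from a notebook; the storage account placeholder and the credential name are illustrative assumptions, not the project's actual values.

```python
# Register the landing external location in Unity Catalog; the URL and the
# storage credential name below are placeholders, not the project's values.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS landing
    URL 'abfss://landing@<storage_account>.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL db_access_connector_credential)
""")
```

The other four locations (checkpoints, bronze, silver, gold) would be registered the same way, each pointing at its own container or folder.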
Data Sources
Traffic Dataset
The Raw Traffic Dataset is one of the core inputs for this project. It contains actual data collected by trained enumerators to feed into road traffic estimates, with structured information gathered from traffic monitoring points across UK roads. The dataset holds the raw count of vehicles of each type that flowed past each count point, broken down by direction and hour of day.
It covers pedal cycles, two-wheeled motor vehicles, buses and coaches, LGVs (Light Goods Vehicles), HGVs (Heavy Goods Vehicles), and electric vehicles. So we need to find out how many vehicles are recorded in the raw traffic count dataset at a given time within an hour, and analyze the types of vehicles travelling at a given point along with the roads.
Source and Storage
• The dataset is manually placed in the /landing/raw_traffic folder of Azure Data Lake Storage Gen2.
• It is ingested incrementally into the bronze layer using Spark Structured Streaming with Auto Loader.
• Once ingested, it is first stored in the bronze layer as the table raw_traffic.
• Provides the raw measurements of traffic flow, needed to calculate derived metrics.
• Enables tracking of hourly, daily, and yearly trends in vehicle movement.
• Acts as a base for data enrichment and transformation in the silver layer, eventually powering analytical dashboards and reports on the gold layer.
Data Dictionary
The data dictionary lists all column names and defines the information each column contains.
Raw Roads Dataset
The raw roads dataset defines the road category. It provides essential metadata about the road
network across different regions. It includes classifications, measurements, and summaries of
road segments, which are later used for enriching traffic data and performing spatial analysis.
• It simulates external data ingestion and is incrementally loaded into the bronze layer
using Spark Structured Streaming with Auto Loader.
• Stored as a Delta table named raw_roads in the bronze schema.
• Helps join with traffic datasets using common region or road category IDs.
• Supports the derivation of road type attributes used in visualization and aggregation in the
gold layer.
The common column linking both datasets is Road Category.
Access Connector
The Access Connector for Azure Databricks connects the workspace to Azure Data Lake Storage Gen2, where all data layers (landing, bronze, silver, gold, checkpoints) are stored. It enables Databricks to read the raw traffic and raw roads datasets and write Delta tables without using storage keys, by leveraging managed identity authentication. This ensures secure, role-based access control and is required for integrating with Unity Catalog's external locations.
Metastore Creation
A metastore is the top-level container for data in Unity Catalog. Within a metastore, Unity Catalog provides a three-level namespace for organizing data (catalogs, schemas, tables/views).
If we do not assign the workspace to a metastore, we cannot create a catalog or schema. After assigning the workspace to the metastore, we need to enable Unity Catalog.
We have created 3 schemas in the dev catalog (bronze, silver, gold).
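A minimal sketch of that setup from a notebook is shown below; the names follow this document, but the exact statements are illustrative rather than the project's actual code.

```python
# Create the dev catalog and its three medallion schemas
spark.sql("CREATE CATALOG IF NOT EXISTS dev_catalog")
for schema in ("bronze", "silver", "gold"):
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS dev_catalog.{schema}")
```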
Loading Data into Bronze Layer
The bronze layer holds raw data ingested from the landing zone and stored as Delta tables. It serves as the source of truth, capturing unprocessed data exactly as received.
Ingestion Process
• Source:
   o Data is manually placed in:
      ▪ /landing/raw_traffic (for traffic data)
      ▪ /landing/raw_roads (for road data)
• Auto Loader:
   o Detects new files in the landing zone, only processing newly arrived data.
   o Used to incrementally ingest raw traffic and road CSV files from the /landing folder.
   o We have created 2 Auto Loader streams: one for raw_roads and the other for raw_traffic.
   o Enables real-time data ingestion into the bronze layer using Structured Streaming.
   o Reads data using .format("cloudFiles") with cloudFiles.format = "csv".
   o Stores schema information using cloudFiles.schemaLocation for schema inference.
   o Tracks progress using a checkpoint directory to ensure fault tolerance and supports automatic detection of new files, eliminating the need for manual triggers.
   o Loads data into Delta tables: bronze.raw_traffic and bronze.raw_roads (a minimal sketch follows this list).
• Schema: Bronze
• Tables created:
   o raw_traffic
   o raw_roads
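As a hedged illustration of the ingestion described above, the following PySpark sketch shows what the Auto Loader stream for the traffic files could look like. The storage paths, the <storage_account> placeholder, the abbreviated schema, and the availableNow trigger are assumptions, not the project's exact code.

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Assumed landing and checkpoint paths, following the layout described above
landing_path = "abfss://landing@<storage_account>.dfs.core.windows.net/raw_traffic/"
checkpoint_path = "abfss://checkpoints@<storage_account>.dfs.core.windows.net/raw_traffic/"
schema_path = checkpoint_path + "schema/"

# Illustrative schema: only a few of the real columns are shown here
traffic_schema = StructType([
    StructField("Record_ID", IntegerType()),
    StructField("Count_point_id", IntegerType()),
    StructField("Direction_of_travel", StringType()),
])

# Incremental read of newly arrived CSV files with Auto Loader
traffic_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", schema_path)
    .option("header", "true")
    .schema(traffic_schema)
    .load(landing_path)
)

# Append new records to the bronze Delta table; the checkpoint tracks what was already loaded
(
    traffic_stream.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .outputMode("append")
    .trigger(availableNow=True)  # process all new files once, then stop
    .toTable("dev_catalog.bronze.raw_traffic")
)
```

The roads stream would follow the same pattern with its own landing path, checkpoint location, and target table.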
We have ingested the raw_roads table into the bronze schema within the dev_catalog in the Databricks workspace (databricks_dev_ws).
After defining the schema and reading the raw_traffic CSV file from the landing zone using Auto Loader, the data was successfully written to the bronze layer in Delta format.
Data written for raw_roads:
After defining the schema and reading the raw_roads CSV file from the landing zone using Auto
Loader, the data was written to the bronze layer in Delta format.
• Similarly, we can say that the last record_id column value would also be the same.
• The count changed from 18546 to 37092. We can compare the timestamps of the last few previously loaded record_id values with those of the newly added records.
• This shows that the data is loaded in micro-batches. For each micro-batch, Auto Loader processes all records that are available and writes the last-loaded file information to the checkpoint. When we upload more data, it consults the checkpoint to see exactly where the previous load ended, compares against that, and then performs the next load.
This proves that Auto Loader is capable of incremental loading.
• To process newly added data, the notebook needs to be run whenever a new file becomes available. This can also be scheduled, for example twice or three times a day. For now, we have simply run the notebook manually to check the results (a query sketch follows this list).
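As a rough illustration of how the counts and the record_id boundary mentioned above could be checked after a run, one might query the bronze table as below; the Record_ID and Extract_Time column names are assumptions based on this document.

```python
# Row count of the bronze traffic table (expected to grow with each landing-zone file)
spark.sql("SELECT COUNT(*) AS row_count FROM dev_catalog.bronze.raw_traffic").show()

# Look at the latest record_ids and their extract times to see where the new load begins
spark.sql("""
    SELECT Record_ID, Extract_Time
    FROM dev_catalog.bronze.raw_traffic
    ORDER BY Record_ID DESC
    LIMIT 20
""").show()
```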
Transforming Data into Silver Layer
Raw Traffic and Raw Roads data from the bronze layer is cleaned and enriched here.
➢ Schema: Silver
➢ Tables: silver_traffic, silver_roads
• Renamed columns for easier querying and readability (e.g., Count point id → Count_point_id).
• Removed duplicates.
• Created Electric_Vehicles_Count = EV_Car + EV_Bike. It combines electric vehicle types to get the total EV presence at a location.
• Created Motor_Vehicles_Count = Two_wheeled_motor_vehicles + Cars_and_taxis + Buses_and_coaches + LGV_Type + HGV_Type + Electric_Vehicles_Count. It calculates the total number of motorized vehicles for a given record.
• Derived Vehicle_Intensity = Motor_Vehicles_Count / Link_length_km to measure traffic density.
• Added timestamp columns like Extract_Time (from bronze) to track ingestion time (a sketch of these transformations follows this list).
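A minimal PySpark sketch of these derivations is shown below; the column names come from the list above, while the function itself and the Transformed_Time column name are illustrative assumptions.

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def transform_traffic(df: DataFrame) -> DataFrame:
    """Apply the silver-layer derivations described above to a traffic DataFrame."""
    return (
        df.dropDuplicates()
        .withColumn("Electric_Vehicles_Count", F.col("EV_Car") + F.col("EV_Bike"))
        .withColumn(
            "Motor_Vehicles_Count",
            F.col("Two_wheeled_motor_vehicles")
            + F.col("Cars_and_taxis")
            + F.col("Buses_and_coaches")
            + F.col("LGV_Type")
            + F.col("HGV_Type")
            + F.col("Electric_Vehicles_Count"),
        )
        .withColumn("Vehicle_Intensity", F.col("Motor_Vehicles_Count") / F.col("Link_length_km"))
        .withColumn("Transformed_Time", F.current_timestamp())
    )
```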
Transformations on Incrementally Loaded Data
Here we can see that only the newly added records were processed. The bronze tables can grow with every load, because incremental data is appended from the landing zone to the bronze table, and at some point the table may hold millions of records. When we apply the silver layer transformations, such as creating new columns, they should not be applied to the entire dataset each time; they should be applied only to the changed data, which means the rows that were newly added.
• Checking the count of the current records: it is 37092 after adding the 2nd traffic file.
• When we add the 3rd traffic file, the count changes to 55638. When we query the silver transformed traffic data, we can see that the transformed time changed only for the new records; the previously loaded data still shows the time at which it was originally transformed. This proves that the data was transformed incrementally and the old data was not re-transformed. This is possible because we are using Spark Structured Streaming and the table is backed by Delta Lake (see the sketch below).
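The incremental behaviour described above comes from streaming the bronze Delta table into the silver table with its own checkpoint. A minimal sketch, reusing the transform_traffic function from the earlier example and an assumed checkpoint path:

```python
# Stream changes from the bronze Delta table, apply the silver transformations,
# and append only the new rows to the silver table. The checkpoint records which
# bronze data has already been processed.
silver_checkpoint = "abfss://checkpoints@<storage_account>.dfs.core.windows.net/silver_traffic/"

bronze_stream = spark.readStream.table("dev_catalog.bronze.raw_traffic")

(
    transform_traffic(bronze_stream)
    .writeStream
    .format("delta")
    .option("checkpointLocation", silver_checkpoint)
    .outputMode("append")
    .trigger(availableNow=True)
    .toTable("dev_catalog.silver.silver_traffic")
)
```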
Loading Data into Gold Layer
The gold layer is the final layer in the medallion architecture, designed to provide high-value datasets for reporting, dashboards, and advanced analytics. It serves as the consumption layer for Power BI and data science use cases. It combines enriched traffic and road data to support business insights and is optimized for reporting and analytical queries.
• Created a Load_Time column – to record the time when the data was loaded into the table.
• Stored as Delta tables/views – the final datasets in the gold layer are created as gold_traffic and gold_roads (a sketch follows below).
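A rough sketch of the gold load under the same assumptions as the earlier snippets (checkpoint path and trigger are illustrative):

```python
from pyspark.sql import functions as F

gold_checkpoint = "abfss://checkpoints@<storage_account>.dfs.core.windows.net/gold_traffic/"

# Stream the silver table, stamp each batch with a Load_Time, and append to the gold table
silver_stream = spark.readStream.table("dev_catalog.silver.silver_traffic")

(
    silver_stream
    .withColumn("Load_Time", F.current_timestamp())
    .writeStream
    .format("delta")
    .option("checkpointLocation", gold_checkpoint)
    .outputMode("append")
    .trigger(availableNow=True)
    .toTable("dev_catalog.gold.gold_traffic")
)
```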
• Certain notebooks need to run daily to get the data, while others do not. We also need to load data into the bronze tables, run the silver layer transformations, and run the gold transformations, so all of these notebooks need to be executed one after another. Based on the cadence at which the data arrives, we need to run these notebooks as a flow.
• To run these notebooks as a flow, we create a job. A job orchestrates all of these notebooks and runs them in sequence.
• Created a playground notebook that has the count of all the records. Counts in the gold layer for both datasets:
   o gold_traffic: 55638
   o gold_roads: 76
• Created a job named ETL Flow having various tasks:
   o Task Name: Load_to_Bronze
      Cluster Used: Job Cluster (once completed, it terminates)
   o Task Name: Silver_Traffic
• We added new CSV files in the landing zone for raw_traffic and raw_roads. Now we just need to run this ETL Flow workflow and check the counts; the new records will be added.
• Every minute, the job checks for new files because the file arrival trigger is enabled. When we upload a new CSV file for raw_traffic to the landing zone, it automatically runs the ETL Flow job. In the Launched section we can see that the job was started by the file arrival.
• Added a notification email on failure.
• The roads data does not change as frequently as the traffic data (example – road length, type of road). We cloned the ETL Flow workflow and changed the trigger to a scheduled type that runs every month at a particular time, because we cannot add 2 triggers within the same workflow.
Reporting Data to PowerBI
PowerBI is used as the reporting and visualization tool to consume and present the Gold Layer
data stored in Azure Data Lake via Azure Databricks. It provides interactive dashboards and
data-driven insights from traffic and roads data. This will help to support decision-makers in
analyzing traffic patterns, road utilization, and vehicle trends across different regions and
timeframes.
PowerBI connection with Azure Databricks
To get the gold data, select the ‘Get Data’ in PowerBI and search for Azure Databricks. PowerBI
connects to the Gold Layer Delta tables (gold_traffic, gold_roads).
Using the Databricks compute details we connect PowerBI to Azure Databricks as shown below.
This PowerBI dashboard presents key insights from the gold layer of the project using curated traffic
and road data.
• Date filters: Users can filter data by count date and view when the data was last ingested.
• Extract Time: Shows when the dashboard was refreshed.
• KPI Tiles: Show total counts of:
   o Pedal Cycles
   o Electric Vehicles
   o Motor Vehicles
• Direction of Travel (Donut Chart): Shows vehicle distribution by travel direction (N, S, E, W).
• Electric Vehicles by Region (Bar Chart): Highlights regional EV usage, with Southeast and London leading.
• Motor Vehicles by Road Category: Displays total vehicle count across road types.
Since this is a sample project, we have created the UAT workspace and implemented CI/CD, so that everything from the dev workspace is promoted to the UAT workspace.
We have created the catalog and dynamically created all the schemas to represent the medallion architecture (a sketch follows below).
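A minimal sketch of how the catalog and schemas could be created dynamically per environment is shown below; the widget name and default value are assumptions, not the project's actual code.

```python
# Parameterize the target catalog so the same notebook works for dev and UAT
dbutils.widgets.text("catalog_name", "uat_catalog")
catalog = dbutils.widgets.get("catalog_name")

spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")

# Create the medallion schemas dynamically
for schema in ("bronze", "silver", "gold"):
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")
```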
Continuous Integration
• The main branch holds all the changes made to the project. Whatever we work on in notebooks is stored in a centralized place called the main branch.
• Continuous Integration lets multiple developers work in parallel, and all their changes are merged into the main branch.
Continuous Deployment
Release Pipeline: It deploys all the changes in the live folder to the UAT workspace. This happens only after the approval step has been completed.
UAT Resources
• Resource Group: databricks-uat-rg
• Databricks workspace: databricks-uat-wsp
• Storage account: databricksuatstge
• Provided a Storage Blob Data Contributor role assignment to db-access-connector (the Access Connector for Azure Databricks) using its managed identity. Now the UAT workspace will be part of Unity Catalog.
Created uat_catalog with all the schemas.
   o bronze-uat
   o silver-uat
   o gold-uat
   o checkpoints-uat
   o landing-uat
Continuous Integration:
We can see the main branch in our repository (dbproject) as shown below. This is the place where all the code will be copied; it is used as the central repository.
• Link your Azure Databricks dev workspace to Azure DevOps Services (Azure Active Directory) in User Settings.
• Created our own repository in Azure Databricks so that we can integrate Azure Databricks with Azure DevOps. We cloned the new project created in Azure DevOps by copying its HTTPS link and pasting it while creating a Repo, to pull in all its details.
• In Azure DevOps we have set the minimum number of reviewers to 1, so when a pull request is raised it needs to be approved by a technical person on the project. Most projects have more than one reviewer, so more people would need to approve. Since I am the only one working here, I will approve my own pull request.
• We can see that we have a main branch and that it is currently empty. We need to have all our code in the main branch, so we have created a feature branch from which we will create a pull request.
• After creating our feature branch, we moved all the code from the user's workspace to our own repository in the feature branch (feature-addnotebooks) for this project.
• We need to save all these changes to our feature branch. So, we will commit and push
the changes to our feature branch as shown below.
• Since we committed and pushed to our feature branch, Azure DevOps notifies us on the main branch that a new branch with commits has been created. Based on those commits we create the pull request.
• Now we can see that all our code is available in the main branch in Azure DevOps and in the repo in Azure Databricks.
• Made separate folders for CICD and all notebooks. Committed and pushed the changes
to the feature branch and created and completed a pull request in Azure DevOps.
• The CI/CD YAML file specifies that when there is a change in the main branch, the CI pipeline is triggered and the notebooks are deployed to the live folder.
• Created a library (dev-cicd-grp) and added all the required variables in the library.
• Gave the new pipeline permission to use the library that we created.
• Added pipeline permission to the environment and service connection in their security
option.
• To test the pipeline that we created in Azure DevOps, we created a test notebook in our feature branch and merged it into the main branch. We can see that the pipeline starts running on its own as soon as there is any change in the main branch.
• After running the pipeline, we can see that a live folder is created in Azure Databricks which has all the notebooks. So, every time someone pushes their changes to the main branch, the live folder is updated with the latest notebooks.
• Previously we added a group that holds the credentials of the dev workspace. Since we want to deploy to UAT, we need to create a group that holds all the variables of the UAT environment, and then create variables for the UAT environment just like we did for dev.
• Tested with a change: we can see that the main CI pipeline started running and the change was deployed to the dev environment.
• After deploying to dev, the pipeline waits for approval before it can deploy to the UAT environment. We can confirm this and provide the approval from the required user.
• After deploying to UAT, we can see the live folder with all the code in our UAT Azure Databricks workspace as well.
• After running all the notebooks, we have all the data in the UAT environment.
Delta Live Table
Delta Live Tables (DLT) is used in this project to orchestrate, automate, and manage the data
pipeline from raw data ingestion to the creation of gold-level analytics tables. It automates the
creation of bronze, silver, and gold tables with built-in data quality checks and lineage tracking. It
reduces the complexity of manual job scheduling and notebook chaining.
• Development: It retains the cluster for 2 hours and does not retry on failure.
• Production: It invokes the cluster, stops the cluster once the execution is completed, and also retries on failure.
• Through this DLT pipeline we need only 1 final table, and we will join both tables after they pass the data quality checks.
• These are the steps taken by Delta Live Tables.
• Data quality metrics are also visible in DLT. We can see the constraint named valid, and it shows the percentage of failed records.
• In the final gold table, we need only a few selected columns, so we select them and join both tables in DLT (a sketch follows this list).
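A minimal Python sketch of what such a DLT definition could look like is shown below; the join key (Road_Category_ID), the selected columns, and the expectation conditions are illustrative assumptions rather than the project's actual pipeline code.

```python
import dlt
from pyspark.sql import functions as F

@dlt.view(comment="Silver traffic records that pass the quality check")
@dlt.expect_or_drop("valid", "Record_ID IS NOT NULL")
def traffic_checked():
    return spark.read.table("dev_catalog.silver.silver_traffic")

@dlt.view(comment="Silver roads records that pass the quality check")
@dlt.expect_or_drop("valid", "Road_Category_ID IS NOT NULL")
def roads_checked():
    return spark.read.table("dev_catalog.silver.silver_roads")

@dlt.table(comment="Final gold table joining traffic and roads data")
def gold_traffic_roads():
    return (
        dlt.read("traffic_checked")
        .join(dlt.read("roads_checked"), on="Road_Category_ID", how="inner")
        .select(
            "Count_point_id",
            "Direction_of_travel",
            "Electric_Vehicles_Count",
            "Motor_Vehicles_Count",
            "Vehicle_Intensity",
            "Road_Category_ID",
            F.current_timestamp().alias("Load_Time"),
        )
    )
```

The expectation named valid matches the constraint name visible in the DLT data quality metrics described above; dropped records are what drive the reported failure percentage.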
Conclusion
This project successfully demonstrates the design and implementation of a modern data lakehouse architecture using Azure Databricks, Delta Lake, and Azure Data Lake Storage Gen2 for managing and analyzing traffic and road datasets.
By following the medallion architecture (Bronze → Silver → Gold), we ensured a structured and
scalable data pipeline that supports both batch and real-time data ingestion using Spark
Structured Streaming with Auto Loader.
Key achievements include:
• Secure and governed access using Access Connector and Unity Catalog.
• Cleaned and transformed data in the Silver Layer with derived metrics like
Electric_Vehicles_Count and Vehicle_Intensity.
• Creation of business-ready Gold Layer tables optimized for analytics and reporting.
• Integration with Power BI to generate interactive dashboards that provide insights into
traffic patterns, road usage, and electric vehicle trends.
• Implementation of CI/CD pipelines using Azure DevOps, enabling version control and
automated deployments across environments.
This end-to-end pipeline not only enables real-time analytics but also serves as a reusable
framework for building similar data-driven solutions in the transportation or smart city domain.