PROJECT 1 For Python
Real-Time Data Lakehouse for Traffic and Roads Analytics
Project Overview
This project builds a real-time data pipeline using Azure Data Lake and Delta Lake to process and
analyze traffic and roads datasets. Data is manually loaded into a landing zone to simulate
incremental ingestion. It then flows into the bronze layer using Spark Structured Streaming with
Auto Loader, capturing all raw records.
From the bronze layer, only new data is transformed and moved to the silver layer for cleaning and
enrichment. The final curated data is stored in the gold layer as optimized tables or views, ready
for reporting and data science.
Key technologies include:
• Delta Lake for reliable storage and ACID transactions
• Auto Loader for incremental data ingestion
• Power BI for reporting and visualization
• Azure DevOps for CI/CD
• Access controls to simulate enterprise-grade security
This end-to-end pipeline supports real-time analytics in a scalable, cloud-native environment.
Project Architecture
The project uses a Medallion Architecture (Bronze → Silver → Gold) on Azure Databricks with Delta
Lake and Unity Catalog for governance. Data flows through the following stages:
• Landing Zone: Raw traffic and roads data are manually ingested into Azure Data Lake
Storage Gen2 (/landing folder) to simulate streaming input.
• Bronze Layer: Ingested incrementally using Auto Loader with Structured Streaming. Raw
data is stored as Delta tables (raw_traffic, raw_roads) under the bronze schema.
• Silver Layer: Transforms include renaming columns, deriving Electric_Vehicles_Count,
Motor_Vehicles_Count, and categorizing road types. Cleaned datasets (silver_traffic,
silver_roads) are stored in the silver schema.
• Gold Layer: Final business logic is applied (e.g., Vehicle_Intensity), and tables are
optimized for reporting (gold_traffic, gold_roads). These are consumed in Power BI for
insights.
• Governance: Managed via Unity Catalog with a three-tier namespace
(catalog.schema.table) and fine-grained access control.
• CI/CD: Implemented using Azure DevOps for deploying code and configurations across
environments (Dev → UAT → Prod).
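To make the three-level namespace and fine-grained access control described in the Governance point concrete, here is a minimal illustration. The catalog, table, and group names (dev_catalog.gold.gold_traffic, data_analysts) are examples consistent with this document, not the project's actual grants.

```python
# Query a table through the three-level namespace (catalog.schema.table)
spark.sql("SELECT COUNT(*) FROM dev_catalog.gold.gold_traffic").show()

# Fine-grained access control: grant read access on a single table to a group
spark.sql("GRANT SELECT ON TABLE dev_catalog.gold.gold_traffic TO `data_analysts`")
```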
Project Setup
This project is organized using Azure Data Lake Storage Gen2, structured into containers and
folders representing different stages of the data pipeline.
We have 3 containers in Azure Data Lake Storage:
1. Landing:
   a. Holds the raw input data.
   b. Contains 2 sub folders: raw_traffic (for the traffic dataset) and raw_roads (for the roads dataset).
2. Medallion:
   a. Holds the bronze, silver, and gold layer folders.
3. Checkpoints:
   a. Stores the Auto Loader and Structured Streaming checkpoint data.
External Locations
The following external locations are registered in Unity Catalog to enable secure and governed access
to data:
1. Landing
2. Checkpoints
3. Bronze
4. Silver
5. Gold
Each location corresponds to a specific path in the data lake and is linked with the appropriate storage credential.
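As a rough sketch of how one of these external locations could be registered, the snippet below uses spark.sql from a notebook; the storage account placeholder and the credential name are illustrative assumptions, not the project's actual values.

```python
# Register the landing external location in Unity Catalog; the URL and the
# storage credential name below are placeholders, not the project's values.
spark.sql("""
    CREATE EXTERNAL LOCATION IF NOT EXISTS landing
    URL 'abfss://landing@<storage_account>.dfs.core.windows.net/'
    WITH (STORAGE CREDENTIAL db_access_connector_credential)
""")
```

The other four locations (checkpoints, bronze, silver, gold) would be registered the same way, each pointing at its own container or folder.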
Data Sources
Traffic Dataset
The Raw Traffic Dataset is one of the core inputs for this project. It contains actual data collected by trained enumerators to feed into road traffic estimates, with structured information gathered from traffic monitoring points across UK roads. The dataset holds the raw count of vehicles of each type that flowed past each count point, broken down by direction and hour of day.
It covers pedal cycles, two-wheeled motor vehicles, buses and coaches, LGVs (Light Goods Vehicles), HGVs (Heavy Goods Vehicles), and electric vehicles. So we need to find out how many vehicles are recorded in the raw traffic count dataset at a given time within an hour, and analyze the types of vehicles travelling at a given point along with the roads.
Source and Storage
• The dataset is manually placed in the /landing/raw_traffic folder of Azure Data Lake Storage Gen2.
• It is ingested incrementally into the bronze layer using Spark Structured Streaming with Auto Loader.
• Once ingested, it is first stored in the bronze layer as the table raw_traffic.
• Provides the raw measurements of traffic flow, needed to calculate derived metrics.
• Enables tracking of hourly, daily, and yearly trends in vehicle movement.
• Acts as a base for data enrichment and transformation in the silver layer, eventually powering analytical dashboards and reports on the gold layer.
Data Dictionary
The data dictionary lists all column names and defines the information each column contains.
Raw Roads Dataset
The raw roads dataset defines the road category. It provides essential metadata about the road
network across different regions. It includes classifications, measurements, and summaries of
road segments, which are later used for enriching traffic data and performing spatial analysis.
• It simulates external data ingestion and is incrementally loaded into the bronze layer
using Spark Structured Streaming with Auto Loader.
• Stored as a Delta table named raw_roads in the bronze schema.
• Helps join with traffic datasets using common region or road category IDs.
• Supports the derivation of road type attributes used in visualization and aggregation in the
gold layer.
The common column linking both datasets is Road Category.
Access Connector
The Access Connector for Azure Databricks connects the workspace to Azure Data Lake Storage Gen2, where all data layers (landing, bronze, silver, gold, checkpoints) are stored. It enables Databricks to read the raw traffic and raw roads datasets and write Delta tables without using storage keys, by leveraging managed identity authentication. This ensures secure, role-based access control and is required for integrating with Unity Catalog's external locations.
Metastore Creation
A metastore is the top-level container for data in Unity Catalog. Within a metastore, Unity Catalog provides a three-level namespace for organizing data (catalogs, schemas, tables/views).
If we do not assign the workspace to a metastore, we cannot create a catalog or schema. After assigning the workspace to the metastore, we need to enable Unity Catalog.
We have created 3 schemas in the dev catalog (bronze, silver, gold).
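A minimal sketch of that setup from a notebook is shown below; the names follow this document, but the exact statements are illustrative rather than the project's actual code.

```python
# Create the dev catalog and its three medallion schemas
spark.sql("CREATE CATALOG IF NOT EXISTS dev_catalog")
for schema in ("bronze", "silver", "gold"):
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS dev_catalog.{schema}")
```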
Loading Data into Bronze Layer
The bronze layer holds raw data ingested from the landing zone and stored as Delta tables. It serves as the source of truth, capturing unprocessed data exactly as received.
Ingestion Process
• Source:
   o Data is manually placed in:
      ▪ /landing/raw_traffic (for traffic data)
      ▪ /landing/raw_roads (for road data)
• Auto Loader:
   o Detects new files in the landing zone, only processing newly arrived data.
   o Used to incrementally ingest raw traffic and road CSV files from the /landing folder.
   o We have created 2 Auto Loader streams: one for raw_roads and the other for raw_traffic.
   o Enables real-time data ingestion into the bronze layer using Structured Streaming.
   o Reads data using .format("cloudFiles") with cloudFiles.format = "csv".
   o Stores schema information using cloudFiles.schemaLocation for schema inference.
   o Tracks progress using a checkpoint directory to ensure fault tolerance and supports automatic detection of new files, eliminating the need for manual triggers.
   o Loads data into Delta tables: bronze.raw_traffic and bronze.raw_roads (a minimal sketch follows this list).
• Schema: Bronze
• Tables created:
   o raw_traffic
   o raw_roads
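As a hedged illustration of the ingestion described above, the following PySpark sketch shows what the Auto Loader stream for the traffic files could look like. The storage paths, the <storage_account> placeholder, the abbreviated schema, and the availableNow trigger are assumptions, not the project's exact code.

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Assumed landing and checkpoint paths, following the layout described above
landing_path = "abfss://landing@<storage_account>.dfs.core.windows.net/raw_traffic/"
checkpoint_path = "abfss://checkpoints@<storage_account>.dfs.core.windows.net/raw_traffic/"
schema_path = checkpoint_path + "schema/"

# Illustrative schema: only a few of the real columns are shown here
traffic_schema = StructType([
    StructField("Record_ID", IntegerType()),
    StructField("Count_point_id", IntegerType()),
    StructField("Direction_of_travel", StringType()),
])

# Incremental read of newly arrived CSV files with Auto Loader
traffic_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", schema_path)
    .option("header", "true")
    .schema(traffic_schema)
    .load(landing_path)
)

# Append new records to the bronze Delta table; the checkpoint tracks what was already loaded
(
    traffic_stream.writeStream
    .format("delta")
    .option("checkpointLocation", checkpoint_path)
    .outputMode("append")
    .trigger(availableNow=True)  # process all new files once, then stop
    .toTable("dev_catalog.bronze.raw_traffic")
)
```

The roads stream would follow the same pattern with its own landing path, checkpoint location, and target table.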
We have ingested the raw_roads table into the bronze schema within the dev_catalog in the Databricks workspace (databricks_dev_ws).
After defining the schema and reading the raw_traffic CSV file from the landing zone using Auto Loader, the data was successfully written to the bronze layer in Delta format.
Data written for raw_roads:
After defining the schema and reading the raw_roads CSV file from the landing zone using Auto
Loader, the data was written to the bronze layer in Delta format.
• Similarly, we can say that the last record_id column value would also be the same.
• The count changed from 18546 to 37092. We can compare the timestamps of the last few previously loaded record_id values with those of the newly added records.
• This shows that the data is loaded in micro-batches. For each micro-batch, Auto Loader processes all records that are available and writes the last-loaded file information to the checkpoint. When we upload more data, it consults the checkpoint to see exactly where the previous load ended, compares against that, and then performs the next load.
This proves that Auto Loader is capable of incremental loading.
• To process newly added data, the notebook needs to be run whenever a new file becomes available. This can also be scheduled, for example twice or three times a day. For now, we have simply run the notebook manually to check the results (a query sketch follows this list).
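As a rough illustration of how the counts and the record_id boundary mentioned above could be checked after a run, one might query the bronze table as below; the Record_ID and Extract_Time column names are assumptions based on this document.

```python
# Row count of the bronze traffic table (expected to grow with each landing-zone file)
spark.sql("SELECT COUNT(*) AS row_count FROM dev_catalog.bronze.raw_traffic").show()

# Look at the latest record_ids and their extract times to see where the new load begins
spark.sql("""
    SELECT Record_ID, Extract_Time
    FROM dev_catalog.bronze.raw_traffic
    ORDER BY Record_ID DESC
    LIMIT 20
""").show()
```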
Transforming Data into Silver Layer
Raw Traffic and Raw Roads data from the bronze layer is cleaned and enriched here.
➢ Schema: Silver
➢ Tables: silver_traffic, silver_roads
• Renamed columns for easier querying and readability (e.g., Count point id → Count_point_id).
• Removed duplicates.
• Created Electric_Vehicles_Count = EV_Car + EV_Bike. It combines electric vehicle types to get the total EV presence at a location.
• Created Motor_Vehicles_Count = Two_wheeled_motor_vehicles + Cars_and_taxis + Buses_and_coaches + LGV_Type + HGV_Type + Electric_Vehicles_Count. It calculates the total number of motorized vehicles for a given record.
• Derived Vehicle_Intensity = Motor_Vehicles_Count / Link_length_km to measure traffic density.
• Added timestamp columns like Extract_Time (from bronze) to track ingestion time (a sketch of these transformations follows this list).
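A minimal PySpark sketch of these derivations is shown below; the column names come from the list above, while the function itself and the Transformed_Time column name are illustrative assumptions.

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def transform_traffic(df: DataFrame) -> DataFrame:
    """Apply the silver-layer derivations described above to a traffic DataFrame."""
    return (
        df.dropDuplicates()
        .withColumn("Electric_Vehicles_Count", F.col("EV_Car") + F.col("EV_Bike"))
        .withColumn(
            "Motor_Vehicles_Count",
            F.col("Two_wheeled_motor_vehicles")
            + F.col("Cars_and_taxis")
            + F.col("Buses_and_coaches")
            + F.col("LGV_Type")
            + F.col("HGV_Type")
            + F.col("Electric_Vehicles_Count"),
        )
        .withColumn("Vehicle_Intensity", F.col("Motor_Vehicles_Count") / F.col("Link_length_km"))
        .withColumn("Transformed_Time", F.current_timestamp())
    )
```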
Transformations on Incrementally Loaded Data
Here we can see that only the newly added records were processed. The bronze tables can grow with every load, because incremental data is appended from the landing zone to the bronze table, and at some point the table may hold millions of records. When we apply the silver layer transformations, such as creating new columns, they should not be applied to the entire dataset each time; they should be applied only to the changed data, which means the rows that were newly added.
• Checking the count of the current records: it is 37092 after adding the 2nd traffic file.
• When we add the 3rd traffic file, the count changes to 55638. When we query the silver transformed traffic data, we can see that the transformed time changed only for the new records; the previously loaded data still shows the time at which it was originally transformed. This proves that the data was transformed incrementally and the old data was not re-transformed. This is possible because we are using Spark Structured Streaming and the table is backed by Delta Lake (see the sketch below).
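The incremental behaviour described above comes from streaming the bronze Delta table into the silver table with its own checkpoint. A minimal sketch, reusing the transform_traffic function from the earlier example and an assumed checkpoint path:

```python
# Stream changes from the bronze Delta table, apply the silver transformations,
# and append only the new rows to the silver table. The checkpoint records which
# bronze data has already been processed.
silver_checkpoint = "abfss://checkpoints@<storage_account>.dfs.core.windows.net/silver_traffic/"

bronze_stream = spark.readStream.table("dev_catalog.bronze.raw_traffic")

(
    transform_traffic(bronze_stream)
    .writeStream
    .format("delta")
    .option("checkpointLocation", silver_checkpoint)
    .outputMode("append")
    .trigger(availableNow=True)
    .toTable("dev_catalog.silver.silver_traffic")
)
```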
Loading Data into Gold Layer
The gold layer is the final layer in the medallion architecture, designed to provide high-value datasets for reporting, dashboards, and advanced analytics. It serves as the consumption layer for Power BI and data science use cases. It combines enriched traffic and road data to support business insights and is optimized for reporting and analytical queries.
• Created a Load_Time column – to record the time when the data was loaded into the table.
• Stored as Delta tables/views – the final datasets in the gold layer are created as gold_traffic and gold_roads (a sketch follows below).
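A rough sketch of the gold load under the same assumptions as the earlier snippets (checkpoint path and trigger are illustrative):

```python
from pyspark.sql import functions as F

gold_checkpoint = "abfss://checkpoints@<storage_account>.dfs.core.windows.net/gold_traffic/"

# Stream the silver table, stamp each batch with a Load_Time, and append to the gold table
silver_stream = spark.readStream.table("dev_catalog.silver.silver_traffic")

(
    silver_stream
    .withColumn("Load_Time", F.current_timestamp())
    .writeStream
    .format("delta")
    .option("checkpointLocation", gold_checkpoint)
    .outputMode("append")
    .trigger(availableNow=True)
    .toTable("dev_catalog.gold.gold_traffic")
)
```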
• Certain notebooks need to run daily to get the data, while others do not. We also need to load data into the bronze tables, run the silver layer transformations, and run the gold transformations, so all of these notebooks need to be executed one after another. Based on the cadence at which the data arrives, we need to run these notebooks as a flow.
• To run these notebooks as a flow, we create a job. A job orchestrates all of these notebooks and runs them in sequence.
• Created a playground notebook that has the count of all the records. Counts in the gold layer for both datasets:
   o gold_traffic: 55638
   o gold_roads: 76
• Created a job named ETL Flow having various tasks:
   o Task Name: Load_to_Bronze
      Cluster Used: Job Cluster (once completed, it terminates)
   o Task Name: Silver_Traffic
• We added new CSV files in the landing zone for raw_traffic and raw_roads. Now we just need to run this ETL Flow workflow and check the counts; the new records will be added.
• Every minute, the job checks for new files because the file arrival trigger is enabled. When we upload a new CSV file for raw_traffic to the landing zone, it automatically runs the ETL Flow job. In the Launched section we can see that the job was started by the file arrival.
• Added a notification email on failure.
• The roads data does not change as frequently as the traffic data (example – road length, type of road). We cloned the ETL Flow workflow and changed the trigger to a scheduled type that runs every month at a particular time, because we cannot add 2 triggers within the same workflow.
Reporting Data to PowerBI
PowerBI is used as the reporting and visualization tool to consume and present the Gold Layer
data stored in Azure Data Lake via Azure Databricks. It provides interactive dashboards and
data-driven insights from traffic and roads data. This will help to support decision-makers in
analyzing traffic patterns, road utilization, and vehicle trends across different regions and
timeframes.
PowerBI connection with Azure Databricks
To get the gold data, select the ‘Get Data’ in PowerBI and search for Azure Databricks. PowerBI
connects to the Gold Layer Delta tables (gold_traffic, gold_roads).
Using the Databricks compute details we connect PowerBI to Azure Databricks as shown below.
This PowerBI dashboard presents key insights from the gold layer of the project using curated traffic
and road data.
• Date filters: Users can filter data by count date and view when the data was last ingested.
• Extract Time: Shows when the dashboard was refreshed.
• KPI Tiles: Show total counts of:
   o Pedal Cycles
   o Electric Vehicles
   o Motor Vehicles
• Direction of Travel (Donut Chart): Shows vehicle distribution by travel direction (N, S, E, W).
• Electric Vehicles by Region (Bar Chart): Highlights regional EV usage, with Southeast and London leading.
• Motor Vehicles by Road Category: Displays total vehicle count across road types.
Since this is a sample project, we have created the UAT workspace and implemented CI/CD, so that everything from the dev workspace is promoted to the UAT workspace.
We have created the catalog and dynamically created all the schemas to represent the medallion architecture (a sketch follows below).
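A minimal sketch of how the catalog and schemas could be created dynamically per environment is shown below; the widget name and default value are assumptions, not the project's actual code.

```python
# Parameterize the target catalog so the same notebook works for dev and UAT
dbutils.widgets.text("catalog_name", "uat_catalog")
catalog = dbutils.widgets.get("catalog_name")

spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog}")

# Create the medallion schemas dynamically
for schema in ("bronze", "silver", "gold"):
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog}.{schema}")
```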
Continuous Integration
• The main branch holds all the changes made to the project. Whatever we work on in notebooks is stored in a centralized place called the main branch.
• Continuous Integration lets multiple developers work in parallel, and all their changes are merged into the main branch.
Continuous Deployment
Release Pipeline: It deploys all the changes in the live folder to the UAT workspace. This happens only after the approval step has been completed.
UAT Resources
• Resource Group: databricks-uat-rg
• Databricks workspace: databricks-uat-wsp
• Storage account: databricksuatstge
• Provided a Storage Blob Data Contributor role assignment to db-access-connector (the Access Connector for Azure Databricks) using its managed identity. Now the UAT workspace will be part of Unity Catalog.
Created uat_catalog with all the schemas.
   o bronze-uat
   o silver-uat
   o gold-uat
   o checkpoints-uat
   o landing-uat
Continuous Integration:
We can see the main branch in our repository (dbproject) as shown below. This is the place where all the code will be copied; it is used as the central repository.
• Link your Azure Databricks dev workspace to Azure DevOps Services (Azure Active Directory) in User Settings.
• Created our own repository in Azure Databricks so that we can integrate Azure Databricks with Azure DevOps. We cloned the new project created in Azure DevOps by copying its HTTPS link and pasting it while creating a Repo, to pull in all its details.
• In Azure DevOps we have set the minimum number of reviewers to 1, so when a pull request is raised it needs to be approved by a technical person on the project. Most projects have more than one reviewer, so more people would need to approve. Since I am the only one working here, I will approve my own pull request.
• We can see that we have a main branch and that it is currently empty. We need to have all our code in the main branch, so we have created a feature branch from which we will create a pull request.
• After creating our feature branch, we moved all the code from the user's workspace to our own repository in the feature branch (feature-addnotebooks) for this project.
• We need to save all these changes to our feature branch. So, we will commit and push
the changes to our feature branch as shown below.
• Since we committed and pushed to our feature branch, Azure DevOps notifies us on the main branch that a new branch with commits has been created. Based on those commits we create the pull request.
• Now we can see that all our code is available in the main branch in Azure DevOps and in the repo in Azure Databricks.
• Made separate folders for CICD and all notebooks. Committed and pushed the changes
to the feature branch and created and completed a pull request in Azure DevOps.
• The CI/CD YAML file specifies that when there is a change in the main branch, the CI pipeline is triggered and the notebooks are deployed to the live folder.
• Created a library (dev-cicd-grp) and added all the required variables in the library.
• Gave the new pipeline permission to use the library that we created.
• Added pipeline permission to the environment and service connection in their security
option.
• To test the pipeline that we created in Azure DevOps, we created a test notebook in our feature branch and merged it into the main branch. We can see that the pipeline starts running on its own as soon as there is any change in the main branch.
• After running the pipeline, we can see that a live folder is created in Azure Databricks which has all the notebooks. So, every time someone pushes their changes to the main branch, the live folder is updated with the latest notebooks.
• Previously we added a group that holds the credentials of the dev workspace. Since we want to deploy to UAT, we need to create a group that holds all the variables of the UAT environment, and then create variables for the UAT environment just like we did for dev.
• Tested with a change: we can see that the main CI pipeline started running and the change was deployed to the dev environment.
• After deploying to dev, the pipeline waits for approval before it can deploy to the UAT environment. We can confirm this and provide the approval from the required user.
• After deploying to UAT, we can see the live folder with all the code in our UAT Azure Databricks workspace as well.
• After running all the notebooks, we have all the data in the UAT environment.
Delta Live Table
Delta Live Tables (DLT) is used in this project to orchestrate, automate, and manage the data
pipeline from raw data ingestion to the creation of gold-level analytics tables. It automates the
creation of bronze, silver, and gold tables with built-in data quality checks and lineage tracking. It
reduces the complexity of manual job scheduling and notebook chaining.
• Development: It retains the cluster for 2 hours and does not retry on failure.
• Production: It invokes the cluster, stops the cluster once the execution is completed, and also retries on failure.
• Through this DLT pipeline we need only 1 final table, and we will join both tables after they pass the data quality checks.
• These are the steps taken by Delta Live Tables.
• Data quality metrics are also visible in DLT. We can see the constraint named valid, and it shows the percentage of failed records.
• In the final gold table, we need only a few selected columns, so we select them and join both tables in DLT (a sketch follows this list).
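A minimal Python sketch of what such a DLT definition could look like is shown below; the join key (Road_Category_ID), the selected columns, and the expectation conditions are illustrative assumptions rather than the project's actual pipeline code.

```python
import dlt
from pyspark.sql import functions as F

@dlt.view(comment="Silver traffic records that pass the quality check")
@dlt.expect_or_drop("valid", "Record_ID IS NOT NULL")
def traffic_checked():
    return spark.read.table("dev_catalog.silver.silver_traffic")

@dlt.view(comment="Silver roads records that pass the quality check")
@dlt.expect_or_drop("valid", "Road_Category_ID IS NOT NULL")
def roads_checked():
    return spark.read.table("dev_catalog.silver.silver_roads")

@dlt.table(comment="Final gold table joining traffic and roads data")
def gold_traffic_roads():
    return (
        dlt.read("traffic_checked")
        .join(dlt.read("roads_checked"), on="Road_Category_ID", how="inner")
        .select(
            "Count_point_id",
            "Direction_of_travel",
            "Electric_Vehicles_Count",
            "Motor_Vehicles_Count",
            "Vehicle_Intensity",
            "Road_Category_ID",
            F.current_timestamp().alias("Load_Time"),
        )
    )
```

The expectation named valid matches the constraint name visible in the DLT data quality metrics described above; dropped records are what drive the reported failure percentage.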
Conclusion
This project successfully demonstrates the design and implementation of a modern data lakehouse architecture using Azure Databricks, Delta Lake, and Azure Data Lake Storage Gen2 for managing and analyzing traffic and road datasets.
By following the medallion architecture (Bronze → Silver → Gold), we ensured a structured and
scalable data pipeline that supports both batch and real-time data ingestion using Spark
Structured Streaming with Auto Loader.
Key achievements include:
• Secure and governed access using Access Connector and Unity Catalog.
• Cleaned and transformed data in the Silver Layer with derived metrics like
Electric_Vehicles_Count and Vehicle_Intensity.
• Creation of business-ready Gold Layer tables optimized for analytics and reporting.
• Integration with Power BI to generate interactive dashboards that provide insights into
traffic patterns, road usage, and electric vehicle trends.
• Implementation of CI/CD pipelines using Azure DevOps, enabling version control and
automated deployments across environments.
This end-to-end pipeline not only enables real-time analytics but also serves as a reusable
framework for building similar data-driven solutions in the transportation or smart city domain.