Financial Data Lakehouse on Azure with
Medallion Architecture
Objective:
To build a scalable, maintainable, and analytics-ready financial data platform based on the Medallion architecture (Bronze → Silver → Gold) on Azure Data Lake Storage Gen2, using Apache Spark (PySpark), with support for SCD Type 2, data quality validation, and dimensional modeling.
Architecture
Pipeline
1. Bronze Layer – Raw Ingestion
Purpose: Store raw, unprocessed CSV files from source systems.
Sources Ingested:
• raw_customer_data_consistent.csv
• raw_stock_data.csv
• raw_transaction_data_consistent.csv
Raw to Bronze source and sink screenshots:
Sink dataset:
Files in the bronze folder (ADLS Gen2):
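For reference, a minimal PySpark sketch of the raw-to-bronze copy is shown below. In this project the copy is performed by a copy activity with the source and sink datasets shown in the screenshots; the "raw" landing container name here is an assumption, and a Spark write produces a folder of part files rather than the single CSV files the downstream Silver code reads, which is why a copy activity is the better fit for this step.
# Illustrative raw-to-bronze ingestion sketch (assumption: source CSVs land in a
# "raw" container of the same storage account; the project itself uses a copy
# activity for this step).
raw_path = "abfss://[email protected]/"
bronze_path = "abfss://[email protected]/"

for file_name in ["raw_customer_data_consistent.csv",
                  "raw_stock_data.csv",
                  "raw_transaction_data_consistent.csv"]:
    # Read the raw CSV as-is (header only, no schema inference) so the bronze
    # copy stays a faithful record of the source file.
    df_raw = spark.read.format("csv").option("header", "true").load(raw_path + file_name)
    # Write unchanged to the bronze container, one folder per source file.
    # Note: this produces a folder of part files, not a single CSV file.
    df_raw.write.mode("overwrite").format("csv").option("header", "true") \
        .save(bronze_path + file_name.replace(".csv", ""))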
2. Silver Layer – Data Cleaning & Standardization
Purpose: Apply data cleansing, transformation, and standard formatting for analytics
readiness.
Transformations Done:
Customer Data:
• Remove duplicates based on Customer_ID.
• Trim and clean Name, Email, and Phone.
• Normalize date format for DOB.
• Lowercase and title-case where appropriate.
• Validate presence of critical fields.
Transaction Data:
• Filter transactions with valid Quantity, Price.
• Parse Transaction_Date.
• Fill missing values and remove nulls in critical fields.
• Deduplicate on Transaction_ID.
Stock Data:
• Deduplicate on Date + Stock_Symbol.
• Normalize Date field format.
Output Format:
• Saved as Parquet files to ADLS Gen2 silver container.
Bronze to Silver transformation in PySpark:
# Import libraries
from pyspark.sql.functions import *
from pyspark.sql.window import *
from delta.tables import *

# Read raw CSV files from the bronze container in ADLS Gen2
df_br_cust = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("abfss://[email protected]/raw_customer_data_consistent.csv")
df_br_stock = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("abfss://[email protected]/raw_stock_data.csv")
df_br_trans = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("abfss://[email protected]/raw_transaction_data_consistent.csv")
# transactions clean up
df_tran_clean = df_br_trans \
.withColumn("Transaction_Date", to_date(col("Transaction_Date"), "yyyy-MM-dd")) \
.filter(col("Quantity") > 0) \
.filter(col("Price") > 0) \
.fillna("Unknown",subset=["Customer_ID","Product"]) \
.dropna(subset=["Transaction_Type"]) \
.dropDuplicates(["Transaction_ID"])
# Clean up stocks
df_stock_clean = df_br_stock \
    .withColumn("Date", to_date(col("Date"), "yyyy-MM-dd")) \
    .dropDuplicates(["Date", "Stock_Symbol"])
#cleaning customer data
df_cust_clean = df_br_cust.dropDuplicates(["Customer_ID"]) \
.withColumn("Name", trim("Name")) \
.withColumn("Email", lower(trim("Email"))) \
.withColumn("Phone", trim("Phone")) \
.withColumn("DOB", to_date("DOB", "yyyy-MM-dd")) \
.filter(col("DOB").isNotNull()) \
.dropna(subset=["Customer_ID", "Name", "DOB", "Email"]) \
.withColumn("Name", initcap("Name"))
# Write cleaned files to the ADLS Gen2 silver container
df_tran_clean.write.mode("overwrite").format("parquet").save("abfss://[email protected]/transactions")
df_stock_clean.write.mode("overwrite").format("parquet").save("abfss://[email protected]/stocks")
df_cust_clean.write.mode("overwrite").format("parquet").save("abfss://[email protected]/customers")
Files in the Silver layer (ADLS Gen2):
Customer file in the Silver layer:
Stocks file in the Silver layer:
Transaction file in the Silver layer:
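The objective also calls for data quality validation. Below is a minimal sketch of a post-write check on the Silver outputs; the critical-column lists mirror the clean-up rules above, but the check itself and its failure behaviour are illustrative assumptions, not part of the pipeline.
# Minimal data-quality check on the Silver outputs (illustrative sketch).
silver_path = "abfss://[email protected]/"

checks = {
    "customers":    ["Customer_ID", "Name", "DOB", "Email"],
    "transactions": ["Transaction_ID", "Customer_ID", "Quantity", "Price"],
    "stocks":       ["Date", "Stock_Symbol"],
}

for table, critical_cols in checks.items():
    df_check = spark.read.format("parquet").load(silver_path + table)
    total = df_check.count()
    # Critical columns must be fully populated after the Silver clean-up.
    for c in critical_cols:
        nulls = df_check.filter(col(c).isNull()).count()
        if nulls > 0:
            raise ValueError(f"{table}.{c}: {nulls} null rows out of {total}")
    print(f"{table}: {total} rows passed null checks on {critical_cols}")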
3. Gold Layer – Dimensional Modeling & Business Logic
Purpose: Create analytics-ready dimensional models with Slowly Changing Dimensions (SCD)
support.
Dimensions:
Customer Dimension (dim_customer):
• Implements SCD Type 2 using Delta Lake.
• Uses SHA-256 record_hash to detect changes.
• Tracks is_current, start_date, end_date.
Stock Dimension (dim_stock):
• Tracks only the latest available metadata (Latest_Metadata_Date) using a window
function.
Fact Table:
• Transaction Fact (fact_transaction):
• Enriched with customer and stock dimension lookups.
• Adds derived columns: Year, Month, Day from Transaction_Date.
• Supports time-based analytics and joins with dimensions.
Output Format:
• Saved as Delta Lake format to ADLS Gen2 gold container.
Silver to Gold transformations:
# Read files from the silver container
silver_path = "abfss://[email protected]/"
df_g_cust = spark.read.format("parquet").load(f"{silver_path}customers")
df_g_transactions = spark.read.format("parquet").load(f"{silver_path}transactions")
df_g_stocks = spark.read.format("parquet").load(f"{silver_path}stocks")

# Add SCD Type 2 metadata columns to incoming data
df_cust_transformed = df_g_cust.withColumn("record_hash", sha2(concat_ws("||", *df_g_cust.columns), 256)) \
    .withColumn("is_current", lit(True)) \
    .withColumn("start_date", current_timestamp()) \
    .withColumn("end_date", lit(None).cast("timestamp"))
# Define target table path for Gold dim_customer
gold_cust_path = "abfss://[email protected]/dim_customer/"

# Check if the Gold table already exists
if DeltaTable.isDeltaTable(spark, gold_cust_path):
    delta_gold = DeltaTable.forPath(spark, gold_cust_path)
    df_existing = delta_gold.toDF().filter("is_current = true")

    # Join on the business key (Customer_ID) and keep rows whose hash changed
    join_cond = [df_existing["Customer_ID"] == df_cust_transformed["Customer_ID"]]
    df_changes = df_existing.join(df_cust_transformed, join_cond, "inner") \
        .filter(df_existing["record_hash"] != df_cust_transformed["record_hash"]) \
        .drop(df_existing["Customer_ID"],
              df_existing["Name"],
              df_existing["DOB"],
              df_existing["Email"],
              df_existing["Phone"],
              df_existing["record_hash"],
              df_existing["is_current"],
              df_existing["start_date"],
              df_existing["end_date"])

    if df_changes.count() > 0:
        # Expire old records
        delta_gold.alias("tgt").merge(
            df_changes.alias("src"),
            "tgt.Customer_ID = src.Customer_ID AND tgt.is_current = true"
        ).whenMatchedUpdate(set={
            "is_current": lit(False),
            "end_date": current_timestamp()
        }).execute()

    # Insert brand-new customers plus the new versions of changed customers
    df_cust_transformed.alias("new_data") \
        .join(df_existing.select("Customer_ID"), "Customer_ID", "left_anti") \
        .unionByName(df_changes) \
        .write.format("delta").mode("append").save(gold_cust_path)
else:
    # First-time load
    df_cust_transformed.write.format("delta").mode("overwrite").save(gold_cust_path)
# 1. Get latest metadata per Stock_Symbol based on Date
window_spec = Window.partitionBy("Stock_Symbol").orderBy(col("Date").desc())
df_dim_stock = df_g_stocks.withColumn("row_num", row_number().over(window_spec)) \
.filter("row_num = 1") \
.select("Stock_Symbol", "Date") \
.withColumnRenamed("Date", "Latest_Metadata_Date")
# 2. Save to Gold layer
df_dim_stock.write.format("delta").mode("overwrite") \
.save("abfss://[email protected]/dim_stock/")
# Read Gold-layer customer and stock dimensions (latest SCD state)
df_dim_customer = spark.read.format("delta") \
    .load("abfss://[email protected]/dim_customer/") \
    .filter("is_current = true")
df_dim_stock = spark.read.format("delta") \
    .load("abfss://[email protected]/dim_stock/")
# Join with dimension tables using business keys
df_fact = df_g_transactions \
.join(df_dim_customer.select("Customer_ID"), on="Customer_ID", how="inner") \
.join(df_dim_stock.select("Stock_Symbol"), on="Stock_Symbol", how="inner") \
.withColumn("Year", year("Transaction_Date")) \
.withColumn("Month", month("Transaction_Date")) \
.withColumn("Day", dayofmonth("Transaction_Date"))
# Select fact table schema
df_fact_selected = df_fact.select(
    "Transaction_ID",
    "Customer_ID",
    "Stock_Symbol",
    "Transaction_Date",
    "Transaction_Type",
    "Quantity",
    "Price",
    "Product",
    "Year",
    "Month",
    "Day"
)

# Write fact table to Gold layer in Delta format
df_fact_selected.write.format("delta").mode("overwrite") \
    .save("abfss://[email protected]/fact_transaction/")
Files in the Gold layer:
Customer file in the Gold layer:
Stock file in the Gold layer:
Transactions file in the Gold layer:
Next day: new customer load
• Some existing records were updated.
• New records were added.
For existing records that changed, the previous version was expired: is_current was set to
False and end_date was set to the processing date.
Brand-new records were inserted with start_date set to the ingestion date, is_current set to True,
and end_date left null.
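A quick way to verify this behaviour is to inspect the history of one customer in dim_customer; the sketch below is illustrative, and the sample Customer_ID value is hypothetical.
# Inspect the SCD Type 2 history for one customer after the next-day load.
# "C001" is a hypothetical Customer_ID used only for illustration.
df_hist = spark.read.format("delta") \
    .load("abfss://[email protected]/dim_customer/") \
    .filter(col("Customer_ID") == "C001") \
    .select("Customer_ID", "Name", "Email", "is_current", "start_date", "end_date") \
    .orderBy("start_date")

# Expected: the superseded version shows is_current = false with a populated
# end_date; the latest version shows is_current = true with end_date = null.
df_hist.show(truncate=False)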
Key Features
• Clean separation of layers using Medallion Architecture.
• Robust SCD Type 2 handling in Customer dimension.
• Transformations ready for daily incremental loads (idempotent, with hash-based change detection).
• Modular and extensible ETL logic.
• Stored in Delta & Parquet formats optimized for analytical workloads.