
Data Engineering

What is Data Engineering?

 Definition: Data Engineering is the discipline of collecting raw, messy data from many
sources, processing and transforming it into clean, structured, reliable formats, and
making that processed data available to stakeholders (analysts, ML models, dashboards,
applications).
 High-level flow:
Raw data (generation) → Ingestion → Storage (lake/warehouse/OLTP) →
Processing & Transformation (ETL/ELT) → Serving → Stakeholders

Workflow of Data Engineering — the three pillars

1. Data Production / Generation


o Where data originates: user actions (clicks, searches), devices/sensors,
applications, APIs, third-party partners, logs, and streaming platforms.
o Examples: OTP events, search queries, clickstreams, IoT sensor telemetry,
payment transactions.
2. Data Transformation (Processing)
o Cleaning, validating, enriching, and structuring raw data.
o Common tasks: schema normalization/standardization, deduplication, imputation,
parsing, type casting, joining, aggregations.
o Tools/engines: Apache Spark, Flink, Beam.
o Orchestration: Airflow, Prefect.
o Patterns: ETL (Extract → Transform → Load) and ELT (Extract → Load →
Transform); ELT is common when you land raw data in a data lake and
transform inside a data warehouse or compute engine (a minimal PySpark
sketch of the ETL pattern follows this list).
3. Data Serving
o Exposing processed data to consumers: BI dashboards, reports, ML models,
downstream applications, or APIs.
o Serving layers include data warehouses (fast analytics), serving databases (low-
latency lookups), model feature stores, and REST APIs.
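
A minimal sketch of the ETL pattern described above, assuming PySpark; the paths, column names, and table names (raw_orders, analytics.daily_revenue) are illustrative, not from any specific project:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("orders_etl").getOrCreate()

    # Extract: read raw, semi-structured data from the landing zone (illustrative path)
    raw = spark.read.json("s3a://landing-zone/raw_orders/")

    # Transform: deduplicate, drop bad rows, enforce types, aggregate
    orders = (raw.dropDuplicates(["order_id"])
                 .filter(F.col("order_id").isNotNull())
                 .withColumn("amount", F.col("amount").cast("double")))
    daily_revenue = orders.groupBy("order_date").agg(F.sum("amount").alias("revenue"))

    # Load: write the curated result where analysts / BI tools can query it
    daily_revenue.write.mode("overwrite").saveAsTable("analytics.daily_revenue")

In an ELT variant, the raw data would be loaded first and the same transformations expressed inside the warehouse or compute engine.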

Upstream and Downstream

 Upstream: Systems that produce and send data (APIs, web/mobile apps, devices, logs,
3rd-party feeds).
 Downstream: Systems and users that consume processed data (dashboards, analysts, ML
models, business teams).
OLTP vs OLAP (corrected and expanded)

 OLTP — Online Transaction Processing


o Optimized for writes and transactional integrity (many small, frequent
transactions).
o Schema is often normalized to avoid redundancy and ensure consistency (ACID
properties).
o Typical tech: MySQL, PostgreSQL, MongoDB (for certain transactional NoSQL),
Oracle.
o Use cases: banking transactions, order entry, user account updates.
o Managed by DBAs (Database Administrators) — DBAs are responsible for
schema design, backups, availability, and performance tuning.

 Points
 Full form: Online Transactional Processing — this is the transactional/source
database that stores individual transactions (writes/updates).
 Purpose: optimized for high write and update performance (storing transactions as
they occur).
 Ownership/management: typically managed by DBAs (and sometimes software
engineers in smaller orgs); OLTP is treated as the source system for downstream
processing.
 Data modeling: uses relational modeling and normalization (1NF → 3NF commonly)
to avoid redundancy.
 Typical examples: relational databases such as PostgreSQL, MySQL, MS SQL
Server, and Oracle are commonly cited OLTP systems.

Normalization

 OLTP uses relational database modeling and normalization rules.

 Normalization ensures data is stored efficiently by avoiding redundancy and organizing into
related tables.

 The common normal forms:

 1NF (First Normal Form)


 2NF (Second Normal Form)
 3NF (Third Normal Form)

 While higher normal forms exist (up to 6NF), in practice most designs stop at 3NF, since beyond
that the model becomes too complex to manage.

 Normalization is handled by DBAs, not data engineers; as a data engineer you
treat OLTP as a source system, but you should still know what normalization is.
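
To make the redundancy point concrete, here is a small sketch (assuming pandas; the flat orders table and its columns are invented) that splits one denormalized table into separate customer, product, and order tables in the spirit of 3NF:

    import pandas as pd

    # Denormalized table: customer and product details repeat on every order row
    flat = pd.DataFrame([
        {"order_id": 1, "cust_id": 10, "cust_name": "Asha", "prod_id": 100, "prod_name": "Comb", "qty": 2},
        {"order_id": 2, "cust_id": 10, "cust_name": "Asha", "prod_id": 101, "prod_name": "Shirt", "qty": 1},
        {"order_id": 3, "cust_id": 11, "cust_name": "Ravi", "prod_id": 100, "prod_name": "Comb", "qty": 5},
    ])

    # Normalized tables: each fact is stored once and referenced by a key
    customers = flat[["cust_id", "cust_name"]].drop_duplicates()
    products  = flat[["prod_id", "prod_name"]].drop_duplicates()
    orders    = flat[["order_id", "cust_id", "prod_id", "qty"]]  # only keys + measures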
 OLAP — Online Analytical Processing (Data Warehouse)
o Optimized for read-heavy analytical queries across large volumes of data.
o Schema is often denormalized (star/snowflake dimensional models) for fast joins
and aggregations.
o Typical tech: Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse.
o Use cases: BI dashboards, historical analytics, aggregated reporting, ML training
datasets.
o Dimensional modeling: organizing data into facts (measurable events) and
dimensions (context), e.g. sales fact + product/time/customer dimensions.

 Points

o Full form / alias: Online Analytical Processing — in practice this refers to the
data warehouse built for analytics and reporting.
o Purpose: optimized for fast reads and complex queries (reporting, analytics)
rather than frequent small writes.
o Data modeling: uses dimensional modeling (fact tables + dimension tables)
rather than normalized OLTP models; common patterns are star schema (most
used) and snowflake schema (less common).
o Architecture/layers: data usually flows via ETL from OLTP → staging layer →
core layer (facts & dimensions). Staging can be transient (temporary) or
persistent (keeps history).
o Examples / tools: common cloud/on-DW tools mentioned include Snowflake,
Redshift, Synapse (data-warehouse platforms).

(Note/correction): OLTP and OLAP are not mutually exclusive; data often flows from OLTP
systems into OLAP systems (via ETL/ELT) for analytics.

Data Ingestion and Storage (additional important pieces)

 Ingestion patterns: streaming (Apache Kafka, Kinesis), micro-batch, batch (file uploads,
DB dumps).
 Landing zones: raw data often lands in a data lake (S3, HDFS) for inexpensive,
schema-on-read storage.
 Cold vs hot storage: hot (frequent reads, low-latency), cold (archive).
 Schema-on-read vs schema-on-write: data lakes often use schema-on-read (accept raw
data, interpret later); OLTP/warehouses use schema-on-write (structured at write time).
 Change Data Capture (CDC): important for near-real-time replication (Debezium,
AWS DMS).

Data Quality, Governance & Observability (must-have)


 Data quality checks: completeness, uniqueness, validity, consistency, freshness.
 Lineage & metadata: keep track of data origin and transformations (helps debugging
and audits).
 Governance & security: access control, encryption at rest/in transit, PII masking, policy
enforcement.
 Monitoring & alerting: pipeline health, job failures, lag metrics, SLA tracking.
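
A minimal sketch of such checks, assuming PySpark and an illustrative staging.orders table (the column names order_id and updated_at are assumptions):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    orders = spark.table("staging.orders")   # illustrative table name

    total = orders.count()

    # Completeness: no missing order ids
    missing_ids = orders.filter(F.col("order_id").isNull()).count()

    # Uniqueness: order_id should be a key
    duplicate_ids = total - orders.select("order_id").distinct().count()

    # Freshness: the newest record should be recent
    latest = orders.agg(F.max("updated_at").alias("latest")).collect()[0]["latest"]

    assert missing_ids == 0, f"{missing_ids} rows missing order_id"
    assert duplicate_ids == 0, f"{duplicate_ids} duplicate order_id values"
    print(f"rows={total}, latest update={latest}")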

Orchestration & Scheduling

 Workflow orchestration tools: Apache Airflow, Dagster, Prefect; they schedule and
manage ETL/ELT jobs, retries, and dependencies.
 Idempotency & retries: design jobs to be safe to rerun and able to handle partial
failures.
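
A minimal sketch of an orchestrated pipeline with retries, assuming Apache Airflow 2.4+ and hypothetical extract/transform/load callables; idempotency comes from overwriting the partition for the run date rather than blindly appending:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract(): ...    # hypothetical: pull the run date's data from the source
    def transform(): ...  # hypothetical: clean and aggregate
    def load(): ...       # hypothetical: idempotent write (overwrite that date's partition)

    with DAG(
        dag_id="daily_orders_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
        default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        t1 = PythonOperator(task_id="extract", python_callable=extract)
        t2 = PythonOperator(task_id="transform", python_callable=transform)
        t3 = PythonOperator(task_id="load", python_callable=load)
        t1 >> t2 >> t3   # dependencies: extract -> transform -> load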

Common Tools & Tech Stack (examples)

 Ingestion/Streaming: Kafka, Kinesis, Flink, NiFi


 Processing/Transform: Spark, Flink, Beam
 Storage: PostgreSQL/MySQL (OLTP), S3/HDFS (data lake),
Snowflake/Redshift/BigQuery (DW)
 Orchestration: Airflow, Prefect
 Monitoring: Prometheus, Grafana, Datadog
 Feature stores / Serving: Feast, Redis
 Schema & governance: Apache Atlas, Amundsen, Data Catalogs

Quick checklist for a Data Engineering pipeline design

 Where does the data come from? (source, volume, velocity)


 Do you need real-time or batch processing?
 Where will you store raw and transformed data?
 How will you ensure data quality and schema evolution?
 How will you secure the data and manage access?
 How will you monitor and alert on pipeline health?
 How will you support downstream consumers (APIs, BI, ML)?
Diagrams

1. Data Engineering: End-to-end Workflow: shows Generation → Ingestion → Storage →
Processing → Serving.
2. OLTP vs OLAP (Comparison Table): clear side-by-side comparison of aspects.


ETL
 Why we need ETL
o To transform the extracted data and build the data warehouse from it.
o We cannot redo that work manually again and again.
o For that reason we design an ETL pipeline.
o Our pipeline will extract the data, transform the data, and load the
data into the data warehouse.
o Once the pipeline is in place, the work does not have to be repeated by hand.

Data Warehouse and Layers


Layers
Staging layer
 When we pull data from the source we first put it in the staging layer; in that
staging layer we transform the data and then push it to the core layer, which
holds the fact and dimension tables.
 Why a staging layer first?
o You should not query the source data directly for the core layer's
transformations, because it slows the source application; for that reason we
create the staging layer and use that data to build the core layer.
o The staging layer itself has a breakdown:
 Transient layer -> temporary layer [stores data only temporarily] [most
used layer]
 Persistent layer -> used if you want to preserve something [e.g. keep a
record's history]
Incremental loading: load only the new data since the last run instead of re-pulling everything.

 suppose you have data from Jan 2023 to Jan 01, 2024 already loaded into your staging and
core layers.

 Next day, the source system has new data for Jan 02, 2024.

 Instead of reloading the whole year again, you only load the new day’s data (Jan 02) into
staging → then to core.

 This avoids pulling old data repeatedly, saving time, computation, and cost.

 The cycle keeps running daily — only new data is incrementally added.

 This is standard practice in industry and a favorite interview
question: “What is incremental loading and how do you apply it?”
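
A minimal sketch of incremental (watermark-based) loading, assuming PySpark; the table names and the load_date watermark column are illustrative:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # 1. Find the high-water mark: the latest date already loaded into staging
    last_loaded = (spark.table("staging.orders")
                        .agg(F.max("load_date").alias("mx"))
                        .collect()[0]["mx"])          # e.g. 2024-01-01

    # 2. Pull only the new rows from the source (everything after the watermark)
    new_rows = spark.table("source.orders").filter(F.col("load_date") > F.lit(last_loaded))

    # 3. Append just the increment to staging; the core layer is then refreshed from staging
    new_rows.write.mode("append").saveAsTable("staging.orders")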
Dimensional Modeling
It is a modeling technique in which we store our data as facts and dimensions.
In a dimensional model, we create a data model with one fact
table and dimension tables that are connected to the fact table.
In the fact table we store only numeric values [bills, MRP, weights].
Everything else goes into the dimensions [names, descriptive information, etc.].
If attributes cluster together, we can create a separate dimension for each cluster.
Schemas
 We have two types of schema:
 Star schema: a central fact table joined directly to its dimension tables [most used].
 Snowflake schema
Snowflake Schema
 Used when we have a hierarchy in the dimensions, so dimensions are split into sub-dimensions.
 This creates a “branching” structure like a snowflake.
 Harder to manage, so it’s less common.
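
A minimal sketch of querying a star schema, assuming Spark SQL and invented fact/dimension names (fact_sales, dim_product, dim_date); the measure lives in the fact table and the descriptive context in the dimensions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The fact table holds measures + foreign keys; the dimensions hold the context
    revenue_by_category = spark.sql("""
        SELECT d.category,
               t.year,
               SUM(f.amount) AS revenue          -- measure from the fact table
        FROM fact_sales f
        JOIN dim_product d ON f.product_key = d.product_key
        JOIN dim_date    t ON f.date_key    = t.date_key
        GROUP BY d.category, t.year
    """)
    revenue_by_category.show()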
SLOWLY CHANGING DIMENSIONS
 We are going to use this most of the time in projects.
 Type 0 -> we assume that the dimension table does not change.
 Example scenario: in the dimension table, the product Comb was mistakenly recorded
under Clothing instead of Hair. To handle this kind of change we have three approaches:
o Type 1
o Type 2
o Type 3
 Type – 1 [UPSERT] -> [Update + Insert]

o We overwrite the old value.
o Now the Clothing category is gone for Comb; only the corrected value remains.
o Most used method.
o Step by step:
 Spot the change in the source [product “Comb” changed category
from Clothing to Hair].
 Land the new records and decide to “upsert”.
 Type 1 = UPSERT (UPDATE + INSERT); it is literally called
an “upsert”.
 Match on the business key and act:
 If the key is new → INSERT a new dimension row.
 If the key exists → UPDATE the row and overwrite the
changed attributes.
 Don’t keep history.
 You replace the old value; facts that join to this
dimension will see the new category/value going forward.
 It is used most of the time (≈70–80%) because it is simple and
performant: there are no history columns to manage.
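
A minimal sketch of a Type 1 upsert using a Delta Lake MERGE (assuming delta-spark on Databricks or Spark; the table and column names are illustrative):

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()

    dim = DeltaTable.forName(spark, "core.dim_product")       # existing dimension
    updates = spark.table("staging.product_updates")          # e.g. Comb: Clothing -> Hair

    (dim.alias("d")
        .merge(updates.alias("u"), "d.product_id = u.product_id")  # match on the business key
        .whenMatchedUpdate(set={"category": "u.category"})          # overwrite, no history
        .whenNotMatchedInsertAll()                                  # new keys become new rows
        .execute())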
 Type – 2 [History]
o To tackle the drawback of the Type 1 approach, i.e. that it does not store history.
o Store both the old (expired) row and the updated row.
o The old row is no longer current; queries use the updated row.
o Steps:
 Detect a change in the source [product Comb changes category
from Clothing → Hair].
 Unlike Type 1, we do not overwrite the old category.
 Insert a new row instead of updating
 Use surrogate keys & versioning:
 Each row gets its own surrogate key (e.g., DimProductID).
 A version column or date range (Start_Date,
End_Date) marks which row was valid when.
 Expire the old row
 The previous record is updated with an End_Date (or
marked inactive).
 The new record starts with a Start_Date and is marked
current (Is_Current = Y).
 Preserves full history of attribute changes.
 Common when the business requires trend analysis, e.g., “What
were sales when Comb was still under Clothing?”
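
A minimal sketch of the Type 2 pattern, again with Delta (assuming delta-spark; the columns start_date, end_date, is_current and the staging table are illustrative, and surrogate-key generation is omitted):

    from pyspark.sql import SparkSession, functions as F
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()

    # Assume staging.product_updates holds only rows whose attributes really changed
    changes = spark.table("staging.product_updates")      # e.g. Comb: Clothing -> Hair
    dim = DeltaTable.forName(spark, "core.dim_product")

    # Step 1: expire the currently active row for each changed business key
    (dim.alias("d")
        .merge(changes.alias("u"),
               "d.product_id = u.product_id AND d.is_current = true")
        .whenMatchedUpdate(set={"end_date": "current_date()", "is_current": "false"})
        .execute())

    # Step 2: insert the new version as the current row
    # (columns must line up with the dimension table; surrogate keys omitted here)
    new_rows = (changes
                .withColumn("start_date", F.current_date())
                .withColumn("end_date", F.lit(None).cast("date"))
                .withColumn("is_current", F.lit(True)))
    new_rows.write.format("delta").mode("append").saveAsTable("core.dim_product")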
 Type 3 [Previous Value]

o Preserve the previous value in an extra column (e.g., Previous_Category) next to the
current value, so only limited history (one prior value) is kept.


 Data lake

o A data warehouse is designed to cover only structured data

[tables, columns, rows].
o Data is generated rapidly.
o It is very difficult to store all of it in a structured format.
o Here the data lake comes into the picture: it can store unstructured data.
o JSON, CSV, etc.
o We can deal with different file formats.
 LAKEHOUSE [Lake + Warehouse]

o A Lakehouse is a modern data architecture that combines features


of a Data Lake and a Data Warehouse in one platform.
o Data Lakes are cheap and store raw data of any type, but lack
structure and governance.
o Data Warehouses give strong governance, fast SQL analytics, but are
expensive and limited to structured data.
o A Lakehouse merges both worlds.
o Store data in a data lake (low cost, scalable, raw + semi-
structured).
o Add warehouse-like features: ACID transactions, schema
enforcement, indexing, and BI/ML support.
o Databricks Lakehouse (Delta Lake)
o Snowflake (with Unistore)
o Unified storage: one copy of data for multiple workloads (BI, ML,
streaming).
o Lower cost than running separate lake + warehouse.
o When you need both analytics and ML/AI on the same data. ->
LakeHouse
o A Lakehouse is a hybrid architecture that combines the low-cost
storage and flexibility of a Data Lake with the reliability and
performance of a Data Warehouse, enabling both analytics and ML on
the same platform.
FILE FORMAT

1] Row based
-> CSV
-> Avro
2] Column based
-> Parquet
-> ORC
The main difference between them is how the data is stored on disk.
 If my data is in CSV format:
 Rows are stored one after another [efficient for writing and updating].
 On disk: 1 Honey Food | 2 Shirt Clothing | 3 Comb Clothing
 More suited to OLTP.

 In a column-based format, the values of each column are stored together:
 1, 2, 3
 then Honey, Shirt, Comb
 then Food, Clothing, Clothing
 Example of why reading is fast:
 SELECT Name FROM table only needs to scan the Name column instead of every full row.
 Column-based formats are mostly used in OLAP.
 The on-disk layout in the column format is:
 1 2 3 Honey Shirt Comb Food Clothing Clothing
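
A small sketch of the row-vs-column difference in practice, assuming pandas with a Parquet engine (pyarrow) installed; reading one column from Parquet touches only that column, while CSV must parse whole rows:

    import pandas as pd

    df = pd.DataFrame({
        "id": [1, 2, 3],
        "name": ["Honey", "Shirt", "Comb"],
        "category": ["Food", "Clothing", "Clothing"],
    })

    df.to_csv("products.csv", index=False)   # row-based: each line is a full record
    df.to_parquet("products.parquet")        # column-based: values stored per column

    # Column projection: Parquet can read just the 'name' column from disk,
    # whereas the CSV reader still has to parse every row end to end.
    names = pd.read_parquet("products.parquet", columns=["name"])
    print(names)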

DELTA FORMAT
 This is an open table format.
 Built on top of the Parquet format.
 Parquet files already give high query-read performance.
 Put simply:
o Parquet file -> the Delta file format is built on top of that Parquet file.
o Why do we need the Delta format?

-> It adds something new: a transaction log.
-> Compared to a plain Parquet table, where the metadata is stored in each individual file,
-> the Delta format stores all metadata in the transaction log.
-> Because of that we can do versioning.
Example:
Apply an update.
Apply a delete.
Now we can still go back to the first version of the table.
This is version control.
 open-source storage layer built on top of data lakes.
 stores data in Parquet files but adds transaction logs for reliability.
 Raw data lakes are cheap but messy → no ACID transactions, schema issues,
duplicates, hard to manage.
 Delta fixes this by bringing database-like features to data lakes.
 ACID transactions (Atomicity, Consistency, Isolation, Durability).
 Schema enforcement & evolution (ensures data matches expected
structure, allows safe changes).
 A special _delta_log folder stores JSON + checkpoint files with metadata (like
a transaction log).
 Query engines (Spark, Databricks, etc.) read both Parquet + logs to give
consistent views.
 Supported in Databricks, Spark, Delta-RS, Presto, Trino.
 Competes with Apache Iceberg and Apache Hudi (all are “table formats”
for lakes).
 Reliable, consistent data lake (“Lakehouse” foundation).
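
A minimal sketch of Delta versioning (time travel), assuming a SparkSession configured with the Delta Lake extensions; the path and data are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()   # assumes Delta extensions are configured
    path = "/tmp/delta/products"                 # illustrative location

    df = spark.createDataFrame(
        [(1, "Honey", "Food"), (2, "Shirt", "Clothing"), (3, "Comb", "Clothing")],
        ["id", "name", "category"],
    )
    df.write.format("delta").mode("overwrite").save(path)                  # version 0

    # Updates and deletes create new versions; metadata is appended to _delta_log
    spark.sql(f"UPDATE delta.`{path}` SET category = 'Hair' WHERE name = 'Comb'")   # version 1

    # Time travel: read the table as it looked before the update
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
    v0.show()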
Big Data Engineering

Databricks is a management layer for Apache Spark that manages its clusters.
Spark
Distributed computing
 We ask multiple machines to do the work.
 Spark architecture:
o The cluster processes the data in parallel.
o Spark has something called the driver node.
o The driver node orchestrates the work.
o It just tells the other machines (worker/executor nodes) what to do.
o The rest of the machines process the data, and the processed data comes back
to the cluster, and from the cluster to the client if the client wants it.
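
A minimal sketch of this driver/worker split, assuming PySpark; the driver only builds and coordinates the plan, while executors process partitions of the data in parallel (the log path is illustrative):

    from pyspark.sql import SparkSession

    # The driver program: it orchestrates the job, it does not crunch the data itself
    spark = SparkSession.builder.appName("word_count").getOrCreate()

    lines = spark.read.text("s3a://landing-zone/logs/*.txt")   # illustrative path

    # Each transformation is distributed across the worker (executor) nodes
    words = lines.selectExpr("explode(split(value, ' ')) AS word")
    counts = words.groupBy("word").count()

    # Only when the client asks for results are they shipped back to the driver
    counts.orderBy("count", ascending=False).show(10)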

Cloud Data Engineering


 Around 90% of the data in today’s world is stored in the cloud.
 The biggest player in Cloud
o Azure
o AWS
o Google Cloud
o Oracle
o IBM
o Heroku
Cloud computing means delivering computing resources (servers, storage, databases,
networking, software, AI, analytics) over the internet (“the cloud”) on demand.

Instead of owning physical hardware, you rent what you need from a cloud provider.

 Key features

 On-demand self-service → resources available instantly.


 Scalability & elasticity → scale up/down as needed.
 Pay-as-you-go pricing → pay only for what you use.
 Global access → available anywhere over the internet.
 Managed services → provider handles infra, updates, and reliability.

 Service models

 IaaS (Infrastructure as a Service) → raw servers, storage, networks (e.g., AWS EC2).
 PaaS (Platform as a Service) → managed platforms to build apps (e.g., Google App
Engine).
 SaaS (Software as a Service) → ready-to-use apps (e.g., Gmail, Salesforce).

 Deployment models

 Public cloud (AWS, Azure, GCP),


 Private cloud (on-premise for one company),
 Hybrid cloud (mix of both).

 Advantages

 Cost-efficient (no upfront infra cost).


 Flexible and scalable.
 Reliable with backup, disaster recovery, and high availability.
 Secure with enterprise-grade compliance.

 Examples

 AWS, Microsoft Azure, Google Cloud, IBM Cloud.


Medallion Architecture
 This is the data architecture that we currently follow.
 It is a data design pattern used in Lakehouse systems (popularized by
Databricks).
 It Organizes data into three layers to improve quality step by step
o Bronze
o Silver
o Gold
 Bronze Layer
o Push data as-is from the source, without any transformations or changes.
o May contain duplicates, nulls, or errors.
o Used as the single source of truth for all downstream processing.
 Silver layer
o Data is cleaned, validated, deduplicated, and enriched.
o We then apply transformations, aggregations, and cleaning, and push
the data to the gold layer.
o Supports reporting and exploratory analytics.
 Gold
o Aggregated, business-ready data.
o Delivers continuously updated, clean data to downstream users and
apps.
o Supports BI, ML, and business decision-making.
 Ensures progressive data quality (raw → clean → curated).
 We apply this pattern on cloud architectures (a minimal PySpark sketch follows).
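
A minimal sketch of the three layers, assuming PySpark with Delta; the storage paths and column names are illustrative:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Bronze: land the source data as-is (duplicates, nulls and all) — illustrative path
    bronze = spark.read.json("abfss://lake@account.dfs.core.windows.net/bronze/orders/")

    # Silver: cleaned, validated, deduplicated
    silver = (bronze.dropDuplicates(["order_id"])
                    .filter(F.col("order_id").isNotNull())
                    .withColumn("amount", F.col("amount").cast("double")))
    silver.write.format("delta").mode("overwrite").save("/lake/silver/orders")

    # Gold: aggregated, business-ready
    gold = silver.groupBy("order_date").agg(F.sum("amount").alias("daily_revenue"))
    gold.write.format("delta").mode("overwrite").save("/lake/gold/daily_revenue")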
Azure Event Hubs

 Used when we have streaming data.

 Applications that emit data continuously can send that data into Azure Event Hubs.

Cloud SQL Database (Azure SQL Database)

 A fully managed relational database service by Microsoft Azure.


 Based on the SQL Server engine, but runs in the cloud.

 PaaS (Platform as a Service): no need to manage servers, patching, backups, or


upgrades.

 Scalable: can auto-scale compute and storage depending on workload.

 Highly available: built-in replication and automatic failover across regions.

 Secure: supports encryption (TDE), firewall rules, authentication (Azure AD).

 Intelligent: built-in performance tuning, monitoring, and anomaly detection.

 Hosting OLTP workloads (transactional apps).

 Backend database for web and mobile apps.

 Can use standard SQL statements:


o Select
o Update
o Delete
o DDL
DATA LAKE

 Stores structured, semi-structured, and unstructured data.

 In Azure, the data lake comes under a storage account.
 We create a storage account.
 Four services come under a storage account:

 File Storage → Managed SMB/NFS file shares (like a network drive in the cloud).

 Queue Storage → Simple message queuing between app components.

 Table Storage → NoSQL key-value store for structured, non-relational data.

 Data Lake Storage Gen2 (ADLS Gen2) → Big Data analytics storage, built on Blob
but adds hierarchical namespace, POSIX permissions, and Hadoop compatibility for
analytics/ML.

 In the data lake we get a hierarchical namespace.

 In data lakes we have containers.
 Within a container we can create folders nested within folders.

 When we are working with big data we need to store folders within folders.

Cloud ETL tool

Azure Data Factory

 Low-code tool.
 API → Azure Data Factory → store in a SQL database.
 Can move and transform data between almost any sources and destinations.

Big Data Transformation

Databricks

 Used when we want to process the data.

 Raw data in the bronze layer → Databricks → transformed data for the silver layer.

Cloud Data Warehousing

Azure Synapse Analytics

 Equivalent to Snowflake, Redshift, and BigQuery.

 A cloud data warehousing solution; we can build data warehouses in the cloud using it.
Data Reporting

Power BI

 Raw data → transformation → gold layer → the gold layer is used for decision-
making, i.e. the data goes to Power BI.
 Power BI turns this data into visualizations.

Flow

IoT Hub sends a stream → Azure Event Hubs → Azure Data Factory → data lands in the
Medallion architecture, stored in Azure Data Lake Storage → Spark reads from Bronze
and pushes the data to the Silver layer → data is sent to the Gold layer [consumed on two
sides: Data Scientists, and Data Analysts via Power BI] → Azure Synapse Analytics
[build star and snowflake schemas]
Data Governance team
Azure DevOps
 CI/CD

Azure Key Vault
 To store secrets


 Azure Purview (Microsoft Purview): Data governance tool to discover, catalog, and track
data with lineage and compliance.

 Azure DevOps: End-to-end DevOps suite with Repos, Pipelines, Boards, Artifacts, and Test
Plans.

 Azure Key Vault: Securely stores secrets, keys, certificates, and connection strings.

 Microsoft Entra ID (Azure AD): Cloud identity and access management with SSO, MFA,
and conditional access.

 Azure Monitor: End-to-end monitoring solution — collects logs/metrics, builds dashboards,


sets alerts.

 Microsoft Cost Management: Helps track, analyze, and optimize Azure spending with
budgets and recommendations.
