
Data Engineering

What is Data Engineering?

 Definition: Data Engineering is the discipline of collecting raw, messy data from many
sources, processing and transforming it into clean, structured, reliable formats, and
making that processed data available to stakeholders (analysts, ML models, dashboards,
applications).
 High-level flow:
Raw data (generation) → Ingestion → Storage (lake/warehouse/OLTP) →
Processing & Transformation (ETL/ELT) → Serving → Stakeholders

Workflow of Data Engineering — the three pillars

1. Data Production / Generation


o Where data originates: user actions (clicks, searches), devices/sensors,
applications, APIs, third-party partners, logs, and streaming platforms.
o Examples: OTP events, search queries, clickstreams, IoT sensor telemetry,
payment transactions.
2. Data Transformation (Processing)
o Cleaning, validating, enriching, and structuring raw data.
o Common tasks: schema normalization/standardization, deduplication, imputation,
parsing, type casting, joining, aggregations.
o Tools/engines: Apache Spark, Flink, Beam.
o Orchestration: Airflow, Prefect.
o Patterns: ETL (Extract → Transform → Load) and ELT (Extract → Load →
Transform); ELT is common when you land raw data in a data lake and
transform inside a data warehouse or compute engine (a minimal PySpark
sketch of the ETL pattern follows this list).
3. Data Serving
o Exposing processed data to consumers: BI dashboards, reports, ML models,
downstream applications, or APIs.
o Serving layers include data warehouses (fast analytics), serving databases (low-
latency lookups), model feature stores, and REST APIs.
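
A minimal sketch of the ETL pattern described above, assuming PySpark; the paths, column names, and table names (raw_orders, analytics.daily_revenue) are illustrative, not from any specific project:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("orders_etl").getOrCreate()

    # Extract: read raw, semi-structured data from the landing zone (illustrative path)
    raw = spark.read.json("s3a://landing-zone/raw_orders/")

    # Transform: deduplicate, drop bad rows, enforce types, aggregate
    orders = (raw.dropDuplicates(["order_id"])
                 .filter(F.col("order_id").isNotNull())
                 .withColumn("amount", F.col("amount").cast("double")))
    daily_revenue = orders.groupBy("order_date").agg(F.sum("amount").alias("revenue"))

    # Load: write the curated result where analysts / BI tools can query it
    daily_revenue.write.mode("overwrite").saveAsTable("analytics.daily_revenue")

In an ELT variant, the raw data would be loaded first and the same transformations expressed inside the warehouse or compute engine.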

Upstream and Downstream

 Upstream: Systems that produce and send data (APIs, web/mobile apps, devices, logs,
3rd-party feeds).
 Downstream: Systems and users that consume processed data (dashboards, analysts, ML
models, business teams).
OLTP vs OLAP (corrected and expanded)

 OLTP — Online Transaction Processing


o Optimized for writes and transactional integrity (many small, frequent
transactions).
o Schema is often normalized to avoid redundancy and ensure consistency (ACID
properties).
o Typical tech: MySQL, PostgreSQL, MongoDB (for certain transactional NoSQL),
Oracle.
o Use cases: banking transactions, order entry, user account updates.
o Managed by DBAs (Database Administrators) — DBAs are responsible for
schema design, backups, availability, and performance tuning.

 Points
 Full form: Online Transactional Processing — this is the transactional/source
database that stores individual transactions (writes/updates).
 Purpose: optimized for high write and update performance (storing transactions as
they occur).
 Ownership/management: typically managed by DBAs (and sometimes software
engineers in smaller orgs); OLTP is treated as the source system for downstream
processing.
 Data modeling: uses relational modeling and normalization (1NF → 3NF commonly)
to avoid redundancy.
 Typical examples: relational databases such as PostgreSQL, MySQL, MS SQL
Server, and Oracle are commonly cited OLTP systems.

Normalization

 OLTP uses relational database modeling and normalization rules.

 Normalization ensures data is stored efficiently by avoiding redundancy and organizing into
related tables.

 The common normal forms:

 1NF (First Normal Form)


 2NF (Second Normal Form)
 3NF (Third Normal Form)

 While higher normal forms exist (up to 6NF), in practice most designs stop at 3NF, since beyond
that the model becomes too complex to manage.

 Normalization is handled by DBAs, not data engineers; as a data engineer you
treat OLTP as a source system, but you should still know what normalization is.
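
To make the redundancy point concrete, here is a small sketch (assuming pandas; the flat orders table and its columns are invented) that splits one denormalized table into separate customer, product, and order tables in the spirit of 3NF:

    import pandas as pd

    # Denormalized table: customer and product details repeat on every order row
    flat = pd.DataFrame([
        {"order_id": 1, "cust_id": 10, "cust_name": "Asha", "prod_id": 100, "prod_name": "Comb", "qty": 2},
        {"order_id": 2, "cust_id": 10, "cust_name": "Asha", "prod_id": 101, "prod_name": "Shirt", "qty": 1},
        {"order_id": 3, "cust_id": 11, "cust_name": "Ravi", "prod_id": 100, "prod_name": "Comb", "qty": 5},
    ])

    # Normalized tables: each fact is stored once and referenced by a key
    customers = flat[["cust_id", "cust_name"]].drop_duplicates()
    products  = flat[["prod_id", "prod_name"]].drop_duplicates()
    orders    = flat[["order_id", "cust_id", "prod_id", "qty"]]  # only keys + measures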
 OLAP — Online Analytical Processing (Data Warehouse)
o Optimized for read-heavy analytical queries across large volumes of data.
o Schema is often denormalized (star/snowflake dimensional models) for fast joins
and aggregations.
o Typical tech: Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse.
o Use cases: BI dashboards, historical analytics, aggregated reporting, ML training
datasets.
o Dimensional modeling: organizing data into facts (measurable events) and
dimensions (context), e.g. sales fact + product/time/customer dimensions.

 Points

o Full form / alias: Online Analytical Processing — in practice this refers to the
data warehouse built for analytics and reporting.
o Purpose: optimized for fast reads and complex queries (reporting, analytics)
rather than frequent small writes.
o Data modeling: uses dimensional modeling (fact tables + dimension tables)
rather than normalized OLTP models; common patterns are star schema (most
used) and snowflake schema (less common).
o Architecture/layers: data usually flows via ETL from OLTP → staging layer →
core layer (facts & dimensions). Staging can be transient (temporary) or
persistent (keeps history).
o Examples / tools: common cloud/on-DW tools mentioned include Snowflake,
Redshift, Synapse (data-warehouse platforms).

(Note/correction): OLTP and OLAP are not mutually exclusive; data often flows from OLTP
systems into OLAP systems (via ETL/ELT) for analytics.

Data Ingestion and Storage (additional important pieces)

 Ingestion patterns: streaming (Apache Kafka, Kinesis), micro-batch, batch (file uploads,
DB dumps).
 Landing zones: raw data often lands in a data lake (S3, HDFS) for inexpensive,
schema-on-read storage.
 Cold vs hot storage: hot (frequent reads, low-latency), cold (archive).
 Schema-on-read vs schema-on-write: data lakes often use schema-on-read (accept raw
data, interpret later); OLTP/warehouses use schema-on-write (structured at write time).
 Change Data Capture (CDC): important for near-real-time replication (Debezium,
AWS DMS).

Data Quality, Governance & Observability (must-have)


 Data quality checks: completeness, uniqueness, validity, consistency, freshness.
 Lineage & metadata: keep track of data origin and transformations (helps debugging
and audits).
 Governance & security: access control, encryption at rest/in transit, PII masking, policy
enforcement.
 Monitoring & alerting: pipeline health, job failures, lag metrics, SLA tracking.
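
A minimal sketch of such checks, assuming PySpark and an illustrative staging.orders table (the column names order_id and updated_at are assumptions):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    orders = spark.table("staging.orders")   # illustrative table name

    total = orders.count()

    # Completeness: no missing order ids
    missing_ids = orders.filter(F.col("order_id").isNull()).count()

    # Uniqueness: order_id should be a key
    duplicate_ids = total - orders.select("order_id").distinct().count()

    # Freshness: the newest record should be recent
    latest = orders.agg(F.max("updated_at").alias("latest")).collect()[0]["latest"]

    assert missing_ids == 0, f"{missing_ids} rows missing order_id"
    assert duplicate_ids == 0, f"{duplicate_ids} duplicate order_id values"
    print(f"rows={total}, latest update={latest}")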

Orchestration & Scheduling

 Workflow orchestration tools: Apache Airflow, Dagster, Prefect; they schedule and
manage ETL/ELT jobs, retries, and dependencies.
 Idempotency & retries: design jobs to be safe to rerun and able to handle partial
failures.
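
A minimal sketch of an orchestrated pipeline with retries, assuming Apache Airflow 2.4+ and hypothetical extract/transform/load callables; idempotency comes from overwriting the partition for the run date rather than blindly appending:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract(): ...    # hypothetical: pull the run date's data from the source
    def transform(): ...  # hypothetical: clean and aggregate
    def load(): ...       # hypothetical: idempotent write (overwrite that date's partition)

    with DAG(
        dag_id="daily_orders_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
        default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
    ) as dag:
        t1 = PythonOperator(task_id="extract", python_callable=extract)
        t2 = PythonOperator(task_id="transform", python_callable=transform)
        t3 = PythonOperator(task_id="load", python_callable=load)
        t1 >> t2 >> t3   # dependencies: extract -> transform -> load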

Common Tools & Tech Stack (examples)

 Ingestion/Streaming: Kafka, Kinesis, Flink, NiFi


 Processing/Transform: Spark, Flink, Beam
 Storage: PostgreSQL/MySQL (OLTP), S3/HDFS (data lake),
Snowflake/Redshift/BigQuery (DW)
 Orchestration: Airflow, Prefect
 Monitoring: Prometheus, Grafana, Datadog
 Feature stores / Serving: Feast, Redis
 Schema & governance: Apache Atlas, Amundsen, Data Catalogs

Quick checklist for a Data Engineering pipeline design

 Where does the data come from? (source, volume, velocity)


 Do you need real-time or batch processing?
 Where will you store raw and transformed data?
 How will you ensure data quality and schema evolution?
 How will you secure the data and manage access?
 How will you monitor and alert on pipeline health?
 How will you support downstream consumers (APIs, BI, ML)?
Diagrams

1. Data Engineering: End-to-end Workflow: shows Generation → Ingestion → Storage →
Processing → Serving.
2. OLTP vs OLAP (Comparison Table): clear side-by-side comparison of aspects.


ETL
 Why we need ETL
o To transform the extracted data and build the data warehouse from it.
o We cannot redo that work manually again and again.
o For that reason we design an ETL pipeline.
o Our pipeline will extract the data, transform the data, and load the
data into the data warehouse.
o Once the pipeline is in place, the work does not have to be repeated by hand.

Data Warehouse and Layers


Layers
Staging layer
 When we pull data from the source we first put it in the staging layer; in that
staging layer we transform the data and then push it to the core layer, which
holds the fact and dimension tables.
 Why a staging layer first?
o You should not query the source data directly for the core layer's
transformations, because it slows the source application; for that reason we
create the staging layer and use that data to build the core layer.
o The staging layer itself has a breakdown:
 Transient layer -> temporary layer [stores data only temporarily] [most
used layer]
 Persistent layer -> used if you want to preserve something [e.g. keep a
record's history]
Incremental loading: load only the new data since the last run instead of re-pulling everything.

 suppose you have data from Jan 2023 to Jan 01, 2024 already loaded into your staging and
core layers.

 Next day, the source system has new data for Jan 02, 2024.

 Instead of reloading the whole year again, you only load the new day’s data (Jan 02) into
staging → then to core.

 This avoids pulling old data repeatedly, saving time, computation, and cost.

 The cycle keeps running daily — only new data is incrementally added.

 This is standard practice in industry and a favorite interview
question: “What is incremental loading and how do you apply it?”
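
A minimal sketch of incremental (watermark-based) loading, assuming PySpark; the table names and the load_date watermark column are illustrative:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # 1. Find the high-water mark: the latest date already loaded into staging
    last_loaded = (spark.table("staging.orders")
                        .agg(F.max("load_date").alias("mx"))
                        .collect()[0]["mx"])          # e.g. 2024-01-01

    # 2. Pull only the new rows from the source (everything after the watermark)
    new_rows = spark.table("source.orders").filter(F.col("load_date") > F.lit(last_loaded))

    # 3. Append just the increment to staging; the core layer is then refreshed from staging
    new_rows.write.mode("append").saveAsTable("staging.orders")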
Dimensional Modeling
It is a modeling technique in which we store our data as facts and dimensions.
In a dimensional model, we create a data model with one fact
table and dimension tables that are connected to the fact table.
In the fact table we store only numeric values [bills, MRP, weights].
Everything else goes into the dimensions [names, descriptive information, etc.].
If attributes cluster together, we can create a separate dimension for each cluster.
Schemas
 We have two types of schema:
 Star schema: a central fact table joined directly to its dimension tables [most used].
 Snowflake schema
Snowflake Schema
 Used when we have a hierarchy in the dimensions, so dimensions are split into sub-dimensions.
 This creates a “branching” structure like a snowflake.
 Harder to manage, so it’s less common.
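
A minimal sketch of querying a star schema, assuming Spark SQL and invented fact/dimension names (fact_sales, dim_product, dim_date); the measure lives in the fact table and the descriptive context in the dimensions:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # The fact table holds measures + foreign keys; the dimensions hold the context
    revenue_by_category = spark.sql("""
        SELECT d.category,
               t.year,
               SUM(f.amount) AS revenue          -- measure from the fact table
        FROM fact_sales f
        JOIN dim_product d ON f.product_key = d.product_key
        JOIN dim_date    t ON f.date_key    = t.date_key
        GROUP BY d.category, t.year
    """)
    revenue_by_category.show()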
SLOWLY CHANGING DIMENSIONS
 We are going to use this most of the time in projects.
 Type 0 -> we assume that the dimension table does not change.
 Example scenario: in the dimension table, the product Comb was mistakenly recorded
under Clothing instead of Hair. To handle this kind of change we have three approaches:
o Type 1
o Type 2
o Type 3
 Type – 1 [UPSERT] -> [Update + Insert]

o We overwrite the old value.
o Now the Clothing category is gone for Comb; only the corrected value remains.
o Most used method.
o Step by step:
 Spot the change in the source [product “Comb” changed category
from Clothing to Hair].
 Land the new records and decide to “upsert”.
 Type 1 = UPSERT (UPDATE + INSERT); it is literally called
an “upsert”.
 Match on the business key and act:
 If the key is new → INSERT a new dimension row.
 If the key exists → UPDATE the row and overwrite the
changed attributes.
 Don’t keep history.
 You replace the old value; facts that join to this
dimension will see the new category/value going forward.
 It is used most of the time (≈70–80%) because it is simple and
performant: there are no history columns to manage.
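
A minimal sketch of a Type 1 upsert using a Delta Lake MERGE (assuming delta-spark on Databricks or Spark; the table and column names are illustrative):

    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()

    dim = DeltaTable.forName(spark, "core.dim_product")       # existing dimension
    updates = spark.table("staging.product_updates")          # e.g. Comb: Clothing -> Hair

    (dim.alias("d")
        .merge(updates.alias("u"), "d.product_id = u.product_id")  # match on the business key
        .whenMatchedUpdate(set={"category": "u.category"})          # overwrite, no history
        .whenNotMatchedInsertAll()                                  # new keys become new rows
        .execute())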
 Type – 2 [History]
o To tackle the drawback of the Type 1 approach, i.e. that it does not store history.
o Store both the old (expired) row and the updated row.
o The old row is no longer current; queries use the updated row.
o Steps:
 Detect a change in the source [product Comb changes category
from Clothing → Hair].
 Unlike Type 1, we do not overwrite the old category.
 Insert a new row instead of updating
 Use surrogate keys & versioning:
 Each row gets its own surrogate key (e.g., DimProductID).
 A version column or date range (Start_Date,
End_Date) marks which row was valid when.
 Expire the old row
 The previous record is updated with an End_Date (or
marked inactive).
 The new record starts with a Start_Date and is marked
current (Is_Current = Y).
 Preserves full history of attribute changes.
 Common when the business requires trend analysis, e.g., “What
were sales when Comb was still under Clothing?”
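
A minimal sketch of the Type 2 pattern, again with Delta (assuming delta-spark; the columns start_date, end_date, is_current and the staging table are illustrative, and surrogate-key generation is omitted):

    from pyspark.sql import SparkSession, functions as F
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()

    # Assume staging.product_updates holds only rows whose attributes really changed
    changes = spark.table("staging.product_updates")      # e.g. Comb: Clothing -> Hair
    dim = DeltaTable.forName(spark, "core.dim_product")

    # Step 1: expire the currently active row for each changed business key
    (dim.alias("d")
        .merge(changes.alias("u"),
               "d.product_id = u.product_id AND d.is_current = true")
        .whenMatchedUpdate(set={"end_date": "current_date()", "is_current": "false"})
        .execute())

    # Step 2: insert the new version as the current row
    # (columns must line up with the dimension table; surrogate keys omitted here)
    new_rows = (changes
                .withColumn("start_date", F.current_date())
                .withColumn("end_date", F.lit(None).cast("date"))
                .withColumn("is_current", F.lit(True)))
    new_rows.write.format("delta").mode("append").saveAsTable("core.dim_product")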
 Type 3 [Previous Value]

o Preserve the previous value in an extra column (e.g., Previous_Category) next to the
current value, so only limited history (one prior value) is kept.


 Data lake

o A data warehouse is designed to cover only structured data

[tables, columns, rows].
o Data is generated rapidly.
o It is very difficult to store all of it in a structured format.
o Here the data lake comes into the picture: it can store unstructured data.
o JSON, CSV, etc.
o We can deal with different file formats.
 LAKEHOUSE [Lake + Warehouse]

o A Lakehouse is a modern data architecture that combines features


of a Data Lake and a Data Warehouse in one platform.
o Data Lakes are cheap and store raw data of any type, but lack
structure and governance.
o Data Warehouses give strong governance, fast SQL analytics, but are
expensive and limited to structured data.
o A Lakehouse merges both worlds.
o Store data in a data lake (low cost, scalable, raw + semi-
structured).
o Add warehouse-like features: ACID transactions, schema
enforcement, indexing, and BI/ML support.
o Databricks Lakehouse (Delta Lake)
o Snowflake (with Unistore)
o Unified storage: one copy of data for multiple workloads (BI, ML,
streaming).
o Lower cost than running separate lake + warehouse.
o When you need both analytics and ML/AI on the same data. ->
LakeHouse
o A Lakehouse is a hybrid architecture that combines the low-cost
storage and flexibility of a Data Lake with the reliability and
performance of a Data Warehouse, enabling both analytics and ML on
the same platform.
FILE FORMAT

1] Row based
-> CSV
-> Avro
2] Column based
-> Parquet
-> ORC
The main difference between them is how the data is stored on disk.
 If my data is in CSV format:
 Rows are stored one after another [efficient for writing and updating].
 On disk: 1 Honey Food | 2 Shirt Clothing | 3 Comb Clothing
 More suited to OLTP.

 In a column-based format, the values of each column are stored together:
 1, 2, 3
 then Honey, Shirt, Comb
 then Food, Clothing, Clothing
 Example of why reading is fast:
 SELECT Name FROM table only needs to scan the Name column instead of every full row.
 Column-based formats are mostly used in OLAP.
 The on-disk layout in the column format is:
 1 2 3 Honey Shirt Comb Food Clothing Clothing
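
A small sketch of the row-vs-column difference in practice, assuming pandas with a Parquet engine (pyarrow) installed; reading one column from Parquet touches only that column, while CSV must parse whole rows:

    import pandas as pd

    df = pd.DataFrame({
        "id": [1, 2, 3],
        "name": ["Honey", "Shirt", "Comb"],
        "category": ["Food", "Clothing", "Clothing"],
    })

    df.to_csv("products.csv", index=False)   # row-based: each line is a full record
    df.to_parquet("products.parquet")        # column-based: values stored per column

    # Column projection: Parquet can read just the 'name' column from disk,
    # whereas the CSV reader still has to parse every row end to end.
    names = pd.read_parquet("products.parquet", columns=["name"])
    print(names)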

DELTA FORMAT
 This is an open table format.
 Built on top of the Parquet format.
 Parquet files already give high query-read performance.
 Put simply:
o Parquet file -> the Delta file format is built on top of that Parquet file.
o Why do we need the Delta format?

-> It adds something new: a transaction log.
-> Compared to a plain Parquet table, where the metadata is stored in each individual file,
-> the Delta format stores all metadata in the transaction log.
-> Because of that we can do versioning.
Example:
Apply an update.
Apply a delete.
Now we can still go back to the first version of the table.
This is version control.
 open-source storage layer built on top of data lakes.
 stores data in Parquet files but adds transaction logs for reliability.
 Raw data lakes are cheap but messy → no ACID transactions, schema issues,
duplicates, hard to manage.
 Delta fixes this by bringing database-like features to data lakes.
 ACID transactions (Atomicity, Consistency, Isolation, Durability).
 Schema enforcement & evolution (ensures data matches expected
structure, allows safe changes).
 A special _delta_log folder stores JSON + checkpoint files with metadata (like
a transaction log).
 Query engines (Spark, Databricks, etc.) read both Parquet + logs to give
consistent views.
 Supported in Databricks, Spark, Delta-RS, Presto, Trino.
 Competes with Apache Iceberg and Apache Hudi (all are “table formats”
for lakes).
 Reliable, consistent data lake (“Lakehouse” foundation).
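
A minimal sketch of Delta versioning (time travel), assuming a SparkSession configured with the Delta Lake extensions; the path and data are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()   # assumes Delta extensions are configured
    path = "/tmp/delta/products"                 # illustrative location

    df = spark.createDataFrame(
        [(1, "Honey", "Food"), (2, "Shirt", "Clothing"), (3, "Comb", "Clothing")],
        ["id", "name", "category"],
    )
    df.write.format("delta").mode("overwrite").save(path)                  # version 0

    # Updates and deletes create new versions; metadata is appended to _delta_log
    spark.sql(f"UPDATE delta.`{path}` SET category = 'Hair' WHERE name = 'Comb'")   # version 1

    # Time travel: read the table as it looked before the update
    v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)
    v0.show()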
Big Data Engineering

Databricks is a management layer for Apache Spark that manages its clusters.
Spark
Distributed computing
 We ask multiple machines to do the work.
 Spark architecture:
o The cluster processes the data in parallel.
o Spark has something called the driver node.
o The driver node orchestrates the work.
o It just tells the other machines (worker/executor nodes) what to do.
o The rest of the machines process the data, and the processed data comes back
to the cluster, and from the cluster to the client if the client wants it.
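
A minimal sketch of this driver/worker split, assuming PySpark; the driver only builds and coordinates the plan, while executors process partitions of the data in parallel (the log path is illustrative):

    from pyspark.sql import SparkSession

    # The driver program: it orchestrates the job, it does not crunch the data itself
    spark = SparkSession.builder.appName("word_count").getOrCreate()

    lines = spark.read.text("s3a://landing-zone/logs/*.txt")   # illustrative path

    # Each transformation is distributed across the worker (executor) nodes
    words = lines.selectExpr("explode(split(value, ' ')) AS word")
    counts = words.groupBy("word").count()

    # Only when the client asks for results are they shipped back to the driver
    counts.orderBy("count", ascending=False).show(10)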

Cloud Data Engineering


 Around 90% of the data in today’s world is stored in the cloud.
 The biggest player in Cloud
o Azure
o AWS
o Google Cloud
o Oracle
o IBM
o Heroku
Cloud computing means delivering computing resources (servers, storage, databases,
networking, software, AI, analytics) over the internet (“the cloud”) on demand.

Instead of owning physical hardware, you rent what you need from a cloud provider.

 Key features

 On-demand self-service → resources available instantly.


 Scalability & elasticity → scale up/down as needed.
 Pay-as-you-go pricing → pay only for what you use.
 Global access → available anywhere over the internet.
 Managed services → provider handles infra, updates, and reliability.

 Service models

 IaaS (Infrastructure as a Service) → raw servers, storage, networks (e.g., AWS EC2).
 PaaS (Platform as a Service) → managed platforms to build apps (e.g., Google App
Engine).
 SaaS (Software as a Service) → ready-to-use apps (e.g., Gmail, Salesforce).

 Deployment models

 Public cloud (AWS, Azure, GCP),


 Private cloud (on-premise for one company),
 Hybrid cloud (mix of both).

 Advantages

 Cost-efficient (no upfront infra cost).


 Flexible and scalable.
 Reliable with backup, disaster recovery, and high availability.
 Secure with enterprise-grade compliance.

 Examples

 AWS, Microsoft Azure, Google Cloud, IBM Cloud.


Medallion Architecture
 This is the data architecture that we currently follow.
 It is a data design pattern used in Lakehouse systems (popularized by
Databricks).
 It Organizes data into three layers to improve quality step by step
o Bronze
o Silver
o Gold
 Bronze Layer
o Push data as-is from the source, without any transformations or changes.
o May contain duplicates, nulls, or errors.
o Used as the single source of truth for all downstream processing.
 Silver layer
o Data is cleaned, validated, deduplicated, and enriched.
o We then apply transformations, aggregations, and cleaning, and push
the data to the gold layer.
o Supports reporting and exploratory analytics.
 Gold
o Aggregated, business-ready data.
o Delivers continuously updated, clean data to downstream users and
apps.
o Supports BI, ML, and business decision-making.
 Ensures progressive data quality (raw → clean → curated).
 We apply this pattern on cloud architectures (a minimal PySpark sketch follows).
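
A minimal sketch of the three layers, assuming PySpark with Delta; the storage paths and column names are illustrative:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Bronze: land the source data as-is (duplicates, nulls and all) — illustrative path
    bronze = spark.read.json("abfss://lake@account.dfs.core.windows.net/bronze/orders/")

    # Silver: cleaned, validated, deduplicated
    silver = (bronze.dropDuplicates(["order_id"])
                    .filter(F.col("order_id").isNotNull())
                    .withColumn("amount", F.col("amount").cast("double")))
    silver.write.format("delta").mode("overwrite").save("/lake/silver/orders")

    # Gold: aggregated, business-ready
    gold = silver.groupBy("order_date").agg(F.sum("amount").alias("daily_revenue"))
    gold.write.format("delta").mode("overwrite").save("/lake/gold/daily_revenue")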
Azure Event Hubs

 Used when we have streaming data.

 Applications that emit data continuously can send that data into Azure Event Hubs.

Cloud SQL Database (Azure SQL Database)

 A fully managed relational database service by Microsoft Azure.


 Based on the SQL Server engine, but runs in the cloud.

 PaaS (Platform as a Service): no need to manage servers, patching, backups, or


upgrades.

 Scalable: can auto-scale compute and storage depending on workload.

 Highly available: built-in replication and automatic failover across regions.

 Secure: supports encryption (TDE), firewall rules, authentication (Azure AD).

 Intelligent: built-in performance tuning, monitoring, and anomaly detection.

 Hosting OLTP workloads (transactional apps).

 Backend database for web and mobile apps.

 Can use standard SQL statements:


o Select
o Update
o Delete
o DDL
DATA LAKE

 Stores structured, semi-structured, and unstructured data.

 In Azure, the data lake comes under a storage account.
 We create a storage account.
 Four services come under a storage account:

 File Storage → Managed SMB/NFS file shares (like a network drive in the cloud).

 Queue Storage → Simple message queuing between app components.

 Table Storage → NoSQL key-value store for structured, non-relational data.

 Data Lake Storage Gen2 (ADLS Gen2) → Big Data analytics storage, built on Blob
but adds hierarchical namespace, POSIX permissions, and Hadoop compatibility for
analytics/ML.

 In the data lake we get a hierarchical namespace.

 In data lakes we have containers.
 Within a container we can create folders nested within folders.

 When we are working with big data we need to store folders within folders.

Cloud ETL tool

Azure Data Factory

 Low-code tool.
 API → Azure Data Factory → store in a SQL database.
 Can move and transform data between almost any sources and destinations.

Big Data Transformation

Databricks

 Used when we want to process the data.

 Raw data in the bronze layer → Databricks → transformed data for the silver layer.

Cloud Data Warehousing

Azure Synapse Analytics

 Equivalent to Snowflake, Redshift, and BigQuery.

 A cloud data warehousing solution; we can build data warehouses in the cloud using it.
Data Reporting

Power BI

 Raw data → transformation → gold layer → the gold layer is used for decision-
making, i.e. the data goes to Power BI.
 Power BI turns this data into visualizations.

Flow

IoT Hub sends a stream → Azure Event Hubs → Azure Data Factory → data lands in the
Medallion architecture, stored in Azure Data Lake Storage → Spark reads from Bronze
and pushes the data to the Silver layer → data is sent to the Gold layer [consumed on two
sides: Data Scientists, and Data Analysts via Power BI] → Azure Synapse Analytics
[build star and snowflake schemas]
Data Governance team
Azure DevOps
 CI/CD

Azure Key Vault
 To store secrets


 Azure Purview (Microsoft Purview): Data governance tool to discover, catalog, and track
data with lineage and compliance.

 Azure DevOps: End-to-end DevOps suite with Repos, Pipelines, Boards, Artifacts, and Test
Plans.

 Azure Key Vault: Securely stores secrets, keys, certificates, and connection strings.

 Microsoft Entra ID (Azure AD): Cloud identity and access management with SSO, MFA,
and conditional access.

 Azure Monitor: End-to-end monitoring solution — collects logs/metrics, builds dashboards,


sets alerts.

 Microsoft Cost Management: Helps track, analyze, and optimize Azure spending with
budgets and recommendations.
