UNIT – II THE DATA ENGINEERING LIFECYCLE
The Data Life Cycle vs the Data Engineering Life Cycle – Major Undercurrents across
the Data Engineering Lifecycle – Security – Data Management – DataOps – Data
Architecture – Software Engineering – Principles of Good Data Architecture
Data Engineering Life Cycle
Data Life Cycle versus Data Engineering Life Cycle
Data Life Cycle
The Data Life Cycle refers to the various stages data goes through from its creation to
its disposal. This cycle is more conceptual and focuses on the management, use, and
governance of data.
Stages in the Data Life Cycle:
1. Data Creation/Collection:
o Data is generated or collected from various sources such as sensors,
applications, or user interactions.
2. Data Storage:
o Data is stored in databases, data lakes, or other storage systems for future
use.
3. Data Processing:
o Data is cleaned, transformed, and prepared for analysis.
4. Data Analysis/Usage:
o Data is analysed to derive insights, trends, and patterns.
5. Data Sharing/Distribution:
o Data is shared with stakeholders, other systems, or users.
6. Data Archival:
o Older or less frequently used data is archived for long-term storage.
7. Data Deletion:
o Data is securely deleted when it is no longer needed or after meeting
regulatory requirements.
Focus:
• Data management
• Governance and compliance
• Business and analytical insights
Data Engineering Life Cycle
The Data Engineering Life Cycle is a more technical framework that focuses
on building, maintaining, and optimizing systems for collecting, storing, and
processing data efficiently.
Stages in the Data Engineering Life Cycle:
1. Data Ingestion:
o Data engineers set up pipelines to gather data from various sources (APIs,
databases, logs, etc.).
2. Data Storage Design:
o Decisions are made regarding data models, schemas, and storage systems
(e.g., relational databases, NoSQL databases, data lakes).
3. Data Transformation (ETL/ELT):
o Data is extracted, transformed, and loaded into systems for analysis or
further processing.
4. Data Quality and Validation:
o Ensuring data accuracy, consistency, and reliability through validation
processes.
5. Data Orchestration:
o Automating workflows and ensuring efficient execution of data pipelines.
6. Data Optimization:
o Enhancing performance through partitioning, indexing, caching, or other
techniques.
7. Data Monitoring and Maintenance:
o Monitoring systems to ensure reliability and resolving issues in data
pipelines.
8. Scaling and Upgrading:
o Adapting to growing data volumes and implementing new tools or
methodologies.
Focus:
• Building and optimizing pipelines
• Scalability, efficiency, and performance
• Infrastructure and tooling
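To make these stages concrete, the following minimal Python sketch walks a small batch of data through ingestion, transformation, a quality check, and loading. It uses pandas and SQLite purely as examples; the file orders.csv, the column names, and the table name are illustrative assumptions, not part of any particular system.

    # Minimal end-to-end pipeline sketch: ingest -> transform -> validate -> load.
    import sqlite3
    import pandas as pd

    def ingest(path: str) -> pd.DataFrame:
        # Ingestion: read raw records from a source file.
        return pd.read_csv(path)

    def transform(df: pd.DataFrame) -> pd.DataFrame:
        # Transformation: clean and standardize the raw records.
        df = df.dropna(subset=["order_id"])        # drop rows missing the key
        df["amount"] = df["amount"].astype(float)  # enforce a numeric type
        return df

    def validate(df: pd.DataFrame) -> None:
        # Data quality check: fail fast if duplicate keys slip through.
        assert df["order_id"].is_unique, "duplicate order_id values found"

    def load(df: pd.DataFrame, db: str = "warehouse.db") -> None:
        # Loading: write the cleaned data into a storage/serving system.
        with sqlite3.connect(db) as conn:
            df.to_sql("orders", conn, if_exists="replace", index=False)

    if __name__ == "__main__":
        data = transform(ingest("orders.csv"))
        validate(data)
        load(data)

In practice each of these steps would be a separate, monitored task in an orchestrated pipeline rather than a single script.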
Data Life Cycle Vs Data Engineering Life Cycle
The Data Life Cycle and Data Engineering Life Cycle are related but focus on
different aspects of handling data. Here’s a breakdown of their differences:
Aspect | Data Life Cycle | Data Engineering Life Cycle
Definition | The stages data goes through from creation to disposal | The process of designing, building, and maintaining data infrastructure
Focus | Data management and usage over time | Data infrastructure and processing workflows
Stages/Phases | 1. Data Creation 2. Data Collection 3. Data Processing 4. Data Storage 5. Data Analysis 6. Data Sharing 7. Data Archiving/Deletion | 1. Requirements Gathering 2. Data Ingestion 3. Data Storage Design 4. Data Processing 5. Data Monitoring & Optimization
Objectives | Ensuring data is useful and properly handled throughout its life | Ensuring data pipelines and infrastructure are efficient, reliable, and scalable
Stakeholders | Business users, data scientists, compliance officers | Data engineers, software developers, IT administrators
Tools Used | Excel, SQL, BI tools (Power BI, Tableau) | ETL tools (Apache NiFi, Airflow), databases (Snowflake, Redshift), big data tools (Hadoop, Spark)
End Goal | Meaningful insights and compliance | Efficient data pipelines and infrastructure
Generation: Source System
A source system is the origin of data used throughout the data engineering process.
Examples include IoT devices, application message queues, and transactional
databases. Data engineers consume data from these systems but do not typically own
or control them. They must understand how source systems generate data, the speed
and frequency of data flow, and the variety of data types involved.
Maintaining communication with source system owners is crucial to handle changes
that may affect data pipelines and analytics. Changes such as modifications to
application code or migration to a new database can impact data structures and flows.
Examples of Source Systems
1. Traditional Source System: Application Database
Consists of application servers connected to a relational database
management system (RDBMS).
This pattern has been in use since the 1980s and remains popular today
with microservices architecture, where each service has its own database.
Example: An e-commerce platform where product and customer data are
stored in a MySQL or PostgreSQL database.
2. Modern Source System: IoT Swarm
Comprises numerous IoT devices (e.g., sensors, smart appliances) sending
data to a central system via message queues or cloud services.
These systems generate high-velocity, real-time data streams that need
processing and analysis.
Example: A network of weather sensors sending temperature and humidity
data to a cloud-based collection system.
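As a rough illustration of such a swarm, the sketch below simulates a few sensors publishing JSON readings to a queue; in a real deployment the queue would be a message broker or cloud ingestion service, and the sensor IDs and field names here are purely illustrative.

    import json
    import random
    import time
    from queue import Queue

    # The in-memory Queue stands in for a real message broker or cloud endpoint.
    message_queue: Queue = Queue()

    def emit_reading(sensor_id: str) -> None:
        # Each device periodically publishes a small JSON payload.
        reading = {
            "sensor_id": sensor_id,
            "temperature_c": round(random.uniform(15, 35), 2),
            "humidity_pct": round(random.uniform(30, 90), 2),
            "ts": time.time(),
        }
        message_queue.put(json.dumps(reading))

    # Simulate a small swarm of weather sensors producing events.
    for i in range(3):
        emit_reading(f"sensor-{i}")

    # Downstream, the ingestion layer drains the queue and processes each event.
    while not message_queue.empty():
        event = json.loads(message_queue.get())
        print(event["sensor_id"], event["temperature_c"])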
Key Considerations for Evaluating Source Systems
When working with a source system, data engineers should ask:
1. What type of system is it?
Is it an application, IoT devices, or something else?
2. How is data stored?
Is the data stored long-term or deleted after a short time?
3. How fast is data generated?
Are there millions of events per second, or a few per hour?
4. Is the data reliable?
Are there missing values, incorrect formats, or duplicates?
5. How often do errors occur?
Does the system have frequent failures?
6. Does data arrive late?
Some messages might be delayed due to network issues.
7. What is the data structure (schema)?
Is the data spread across multiple tables or systems?
How are schema changes handled?
8. How often should data be collected?
Is data collected in real-time or at fixed intervals?
9. Will reading data slow down the system?
Extracting data could impact system performance.
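One common way to limit the load that extraction puts on a source system (questions 8 and 9 above) is incremental extraction against a watermark column, pulling only rows changed since the last run. The sketch below uses SQLite for self-containment; the table events and the column updated_at are illustrative assumptions.

    import sqlite3

    def extract_incremental(conn: sqlite3.Connection, last_watermark: str):
        # Pull only rows changed since the last run instead of scanning the whole table.
        cursor = conn.execute(
            "SELECT id, payload, updated_at FROM events "
            "WHERE updated_at > ? ORDER BY updated_at",
            (last_watermark,),
        )
        rows = cursor.fetchall()
        # The new watermark is the latest timestamp seen; persist it between runs.
        new_watermark = rows[-1][2] if rows else last_watermark
        return rows, new_watermark

    if __name__ == "__main__":
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE events (id INTEGER, payload TEXT, updated_at TEXT)")
        conn.executemany(
            "INSERT INTO events VALUES (?, ?, ?)",
            [(1, "a", "2024-01-01"), (2, "b", "2024-01-02")],
        )
        rows, watermark = extract_incremental(conn, "2024-01-01")
        print(rows, watermark)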
Understanding Source System Limits
Each source system has unique data volume and frequency characteristics. A
data engineer should know how data is generated and any specific quirks of the
system. It’s also crucial to understand the system’s limitations, such as whether
running analytical queries could slow down its performance.
One of the most challenging variations of source data is the schema. The schema
defines the structure of data, from the overall system down to individual tables and
fields. Handling schema correctly is crucial, as the way data is structured impacts how
data is ingested and processed. There are two common approaches:
1. Schemaless Systems: In these systems, the schema is dynamic and defined as
data is written. This is often the case with NoSQL databases like MongoDB or
when data is written to a messaging queue or blob storage.
2. Fixed Schema Systems: In more traditional relational database systems, the
schema is predefined, and any data written to the database must conform to it.
Data engineers need to adapt to schema evolution, as source systems may change
over time. For example, in an agile development process, the schema may evolve to
accommodate new requirements, and data engineers must ensure that their data
pipelines can handle these changes without disrupting downstream analytics.
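A simple defensive pattern for schema evolution is to conform each incoming record to a known target schema before loading, defaulting missing fields and setting aside unexpected ones for review. The field names below are illustrative assumptions.

    # Conform records to a target schema so the pipeline tolerates source changes.
    TARGET_SCHEMA = {"user_id": None, "email": None, "signup_date": None}

    def conform(record: dict) -> dict:
        # Keep known fields (defaulting any that are missing) and set aside new,
        # unmapped fields so schema changes can be reviewed instead of breaking loads.
        conformed = {key: record.get(key, default) for key, default in TARGET_SCHEMA.items()}
        extras = {k: v for k, v in record.items() if k not in TARGET_SCHEMA}
        if extras:
            conformed["_unmapped"] = extras
        return conformed

    # Example: the source later adds a "plan" field and drops "signup_date".
    print(conform({"user_id": 42, "email": "a@example.com", "plan": "pro"}))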
Storage in Data Engineering
Once data is ingested, it must be stored appropriately. Selecting the right storage
solution is crucial for the success of the entire data life cycle, yet it is one of the most
complex stages due to several factors.
1. Complexity of Storage Solutions:
Cloud architectures often use multiple storage solutions.
Many storage solutions offer not only storage but also data transformation
capabilities (e.g., Amazon S3 Select).
Storage overlaps with other lifecycle stages such as ingestion,
transformation, and serving.
2. Impact across the Data Life Cycle:
Storage occurs at multiple points in a data pipeline and affects processes
across the life cycle.
Cloud data warehouses can store, process, and serve data.
Streaming platforms like Apache Kafka and Pulsar serve as ingestion,
storage, and query systems, with object storage as a common data
transmission layer.
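As a brief sketch of how a streaming platform blurs ingestion and storage, the snippet below appends an event to a Kafka topic, where it is retained and can later be replayed by any consumer. It assumes the kafka-python client and a broker at localhost:9092; the topic name is illustrative.

    import json
    from kafka import KafkaProducer  # assumes the kafka-python package is installed

    # Producers append events to a topic; the topic retains them so consumers can
    # read (and re-read) from any stored offset later.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # assumed local broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("sensor-readings", {"sensor_id": "sensor-1", "temperature_c": 21.4})
    producer.flush()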
Key Considerations for Evaluating Storage Systems:
When selecting a storage system for a data warehouse, lakehouse, database, or object
storage, key questions include:
Is this storage solution compatible with the architecture’s required write and
read speeds?
Will storage create a bottleneck for downstream processes?
Do you understand how this storage technology works? Are you utilizing the
storage system optimally or committing unnatural acts?
Will this storage system handle anticipated future scale?
Will downstream users and processes be able to retrieve data within the required
service-level agreement (SLA)?
Are you capturing metadata about schema evolution, data flows, data lineage,
and so forth?
Is this a pure storage solution (object storage), or does it support complex
query patterns (i.e., a cloud data warehouse)?
Is the storage system schema-agnostic, flexible schema, or enforced schema?
How are you tracking master data, golden records, data quality, and data
lineage for governance?
Understanding Data Access Frequency (“Data Temperatures”)
Data is accessed at different frequencies, leading to classification based on
“temperature”:
Hot Data: Frequently accessed, often multiple times per day or second. Stored
for fast retrieval, suitable for real-time systems.
Lukewarm Data: Accessed occasionally, such as weekly or monthly.
Cold Data: Rarely accessed, typically stored for compliance or backup
purposes. Traditionally stored on tapes, but cloud vendors now offer low-cost
archival options with high retrieval costs.
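Cloud object stores let these temperatures be managed automatically with lifecycle rules that move objects to cheaper storage classes as they age. The sketch below uses boto3 against Amazon S3; the bucket name, prefix, and day thresholds are illustrative assumptions.

    import boto3  # assumes AWS credentials are configured

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-analytics-bucket",       # illustrative bucket name
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "tier-by-age",
                    "Filter": {"Prefix": "events/"},
                    "Status": "Enabled",
                    "Transitions": [
                        {"Days": 30, "StorageClass": "STANDARD_IA"},  # lukewarm tier
                        {"Days": 365, "StorageClass": "GLACIER"},     # cold archive
                    ],
                }
            ]
        },
    )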
Selecting the Right Storage Solution
The choice of storage depends on various factors:
Use Cases: Different storage types suit different needs.
Data Volume: Large volumes may require scalable solutions.
Ingestion Frequency: High-frequency ingestion may need specialized
storage.
Data Format and Size: The structure and size influence storage decisions.
There is no universal storage solution – each technology comes with trade-offs, and
the choice should align with the specific needs of the data architecture.
Data Ingestion
Data Ingestion is the second phase of the data engineering lifecycle, involving the
collection of data from various source systems. After understanding the data sources
and their characteristics, it becomes essential to ensure smooth and reliable data flow
into storage, processing, and serving systems.
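A minimal batch-ingestion sketch is shown below: it pulls a batch of JSON records from a REST endpoint and lands the raw response in a file, ready for downstream transformation. The endpoint URL and output path are placeholders, and the requests package is assumed to be available.

    import json
    import requests  # assumes the requests package is installed

    def ingest_batch(endpoint: str, out_path: str) -> int:
        # Pull a batch of records and land the raw payload in storage.
        response = requests.get(endpoint, timeout=30)
        response.raise_for_status()  # surface ingestion failures early
        records = response.json()
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(records, f)
        return len(records)

    # Example call (hypothetical endpoint):
    # ingest_batch("https://api.example.com/v1/orders", "raw_orders.json")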