UNIT – II THE DATA ENGINEERING LIFECYCLE
The Data Life Cycle vs the Data Engineering Life Cycle – Major Undercurrents across
the Data Engineering Lifecycle – Security – Data Management – DataOps – Data
Architecture – Software Engineering – Principles of Good Data Architecture
Data Engineering Life Cycle
Data Life Cycle versus Data Engineering Life Cycle
Data Life Cycle
The Data Life Cycle refers to the various stages data goes through from its creation to
its disposal. This cycle is more conceptual and focuses on the management, use, and
governance of data.
Stages in the Data Life Cycle:
1. Data Creation/Collection:
o Data is generated or collected from various sources such as sensors,
applications, or user interactions.
2. Data Storage:
o Data is stored in databases, data lakes, or other storage systems for future
use.
3. Data Processing:
o Data is cleaned, transformed, and prepared for analysis.
4. Data Analysis/Usage:
o Data is analysed to derive insights, trends, and patterns.
5. Data Sharing/Distribution:
o Data is shared with stakeholders, other systems, or users.
6. Data Archival:
o Older or less frequently used data is archived for long-term storage.
7. Data Deletion:
o Data is securely deleted when it is no longer needed or after meeting
regulatory requirements.
Focus:
• Data management
• Governance and compliance
• Business and analytical insights
Data Engineering Life Cycle
The Data Engineering Life Cycle is a more technical framework that focuses
on building, maintaining, and optimizing systems for collecting, storing, and
processing data efficiently.
Stages in the Data Engineering Life Cycle:
1. Data Ingestion:
o Data engineers set up pipelines to gather data from various sources (APIs,
databases, logs, etc.).
2. Data Storage Design:
o Decisions are made regarding data models, schemas, and storage systems
(e.g., relational databases, NoSQL databases, data lakes).
3. Data Transformation (ETL/ELT):
o Data is extracted, transformed, and loaded into systems for analysis or
further processing.
4. Data Quality and Validation:
o Ensuring data accuracy, consistency, and reliability through validation
processes.
5. Data Orchestration:
o Automating workflows and ensuring efficient execution of data pipelines.
6. Data Optimization:
o Enhancing performance through partitioning, indexing, caching, or other
techniques.
7. Data Monitoring and Maintenance:
o Monitoring systems to ensure reliability and resolving issues in data
pipelines.
8. Scaling and Upgrading:
o Adapting to growing data volumes and implementing new tools or
methodologies.
Focus:
• Building and optimizing pipelines
• Scalability, efficiency, and performance
• Infrastructure and tooling
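To make these stages concrete, the following minimal Python sketch walks a small batch of data through ingestion, transformation, a quality check, and loading. It uses pandas and SQLite purely as examples; the file orders.csv, the column names, and the table name are illustrative assumptions, not part of any particular system.

    # Minimal end-to-end pipeline sketch: ingest -> transform -> validate -> load.
    import sqlite3
    import pandas as pd

    def ingest(path: str) -> pd.DataFrame:
        # Ingestion: read raw records from a source file.
        return pd.read_csv(path)

    def transform(df: pd.DataFrame) -> pd.DataFrame:
        # Transformation: clean and standardize the raw records.
        df = df.dropna(subset=["order_id"])        # drop rows missing the key
        df["amount"] = df["amount"].astype(float)  # enforce a numeric type
        return df

    def validate(df: pd.DataFrame) -> None:
        # Data quality check: fail fast if duplicate keys slip through.
        assert df["order_id"].is_unique, "duplicate order_id values found"

    def load(df: pd.DataFrame, db: str = "warehouse.db") -> None:
        # Loading: write the cleaned data into a storage/serving system.
        with sqlite3.connect(db) as conn:
            df.to_sql("orders", conn, if_exists="replace", index=False)

    if __name__ == "__main__":
        data = transform(ingest("orders.csv"))
        validate(data)
        load(data)

In practice each of these steps would be a separate, monitored task in an orchestrated pipeline rather than a single script.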
Data Life Cycle Vs Data Engineering Life Cycle
The Data Life Cycle and Data Engineering Life Cycle are related but focus on
different aspects of handling data. Here’s a breakdown of their differences:
Aspect | Data Life Cycle | Data Engineering Life Cycle
Definition | The stages data goes through from creation to disposal | The process of designing, building, and maintaining data infrastructure
Focus | Data management and usage over time | Data infrastructure and processing workflows
Stages/Phases | 1. Data Creation 2. Data Collection 3. Data Processing 4. Data Storage 5. Data Analysis 6. Data Sharing 7. Data Archiving/Deletion | 1. Requirements Gathering 2. Data Ingestion 3. Data Storage Design 4. Data Processing 5. Data Monitoring & Optimization
Objectives | Ensuring data is useful and properly handled throughout its life | Ensuring data pipelines and infrastructure are efficient, reliable, and scalable
Stakeholders | Business users, data scientists, compliance officers | Data engineers, software developers, IT administrators
Tools Used | Excel, SQL, BI tools (Power BI, Tableau) | ETL tools (Apache NiFi, Airflow), databases (Snowflake, Redshift), big data tools (Hadoop, Spark)
End Goal | Meaningful insights and compliance | Efficient data pipelines and infrastructure
Generation: Source System
A source system is the origin of data used throughout the data engineering process.
Examples include IoT devices, application message queues, and transactional
databases. Data engineers consume data from these systems but do not typically own
or control them. They must understand how source systems generate data, the speed
and frequency of data flow, and the variety of data types involved.
Maintaining communication with source system owners is crucial to handle changes
that may affect data pipelines and analytics. Changes such as modifications to
application code or migration to a new database can impact data structures and flows.
Examples of Source Systems
1. Traditional Source System: Application Database
Consists of application servers connected to a relational database
management system (RDBMS).
This pattern has been in use since the 1980s and remains popular today
with microservices architecture, where each service has its own database.
Example: An e-commerce platform where product and customer data are
stored in a MySQL or PostgreSQL database.
2. Modern Source System: IoT Swarm
Comprises numerous IoT devices (e.g., sensors, smart appliances) sending
data to a central system via message queues or cloud services.
These systems generate high-velocity, real-time data streams that need
processing and analysis.
Example: A network of weather sensors sending temperature and humidity
data to a cloud-based collection system.
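As a rough illustration of such a swarm, the sketch below simulates a few sensors publishing JSON readings to a queue; in a real deployment the queue would be a message broker or cloud ingestion service, and the sensor IDs and field names here are purely illustrative.

    import json
    import random
    import time
    from queue import Queue

    # The in-memory Queue stands in for a real message broker or cloud endpoint.
    message_queue: Queue = Queue()

    def emit_reading(sensor_id: str) -> None:
        # Each device periodically publishes a small JSON payload.
        reading = {
            "sensor_id": sensor_id,
            "temperature_c": round(random.uniform(15, 35), 2),
            "humidity_pct": round(random.uniform(30, 90), 2),
            "ts": time.time(),
        }
        message_queue.put(json.dumps(reading))

    # Simulate a small swarm of weather sensors producing events.
    for i in range(3):
        emit_reading(f"sensor-{i}")

    # Downstream, the ingestion layer drains the queue and processes each event.
    while not message_queue.empty():
        event = json.loads(message_queue.get())
        print(event["sensor_id"], event["temperature_c"])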
Key Considerations for Evaluating Source Systems
When working with a source system, data engineers should ask:
1. What type of system is it?
Is it an application, IoT devices, or something else?
2. How is data stored?
Is the data stored long-term or deleted after a short time?
3. How fast is data generated?
Are there millions of events per second, or a few per hour?
4. Is the data reliable?
Are there missing values, incorrect formats, or duplicates?
5. How often do errors occur?
Does the system have frequent failures?
6. Does data arrive late?
Some messages might be delayed due to network issues.
7. What is the data structure (schema)?
Is the data spread across multiple tables or systems?
How are schema changes handled?
8. How often should data be collected?
Is data collected in real-time or at fixed intervals?
9. Will reading data slow down the system?
Extracting data could impact system performance.
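One common way to limit the load that extraction puts on a source system (questions 8 and 9 above) is incremental extraction against a watermark column, pulling only rows changed since the last run. The sketch below uses SQLite for self-containment; the table events and the column updated_at are illustrative assumptions.

    import sqlite3

    def extract_incremental(conn: sqlite3.Connection, last_watermark: str):
        # Pull only rows changed since the last run instead of scanning the whole table.
        cursor = conn.execute(
            "SELECT id, payload, updated_at FROM events "
            "WHERE updated_at > ? ORDER BY updated_at",
            (last_watermark,),
        )
        rows = cursor.fetchall()
        # The new watermark is the latest timestamp seen; persist it between runs.
        new_watermark = rows[-1][2] if rows else last_watermark
        return rows, new_watermark

    if __name__ == "__main__":
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE events (id INTEGER, payload TEXT, updated_at TEXT)")
        conn.executemany(
            "INSERT INTO events VALUES (?, ?, ?)",
            [(1, "a", "2024-01-01"), (2, "b", "2024-01-02")],
        )
        rows, watermark = extract_incremental(conn, "2024-01-01")
        print(rows, watermark)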
Understanding Source System Limits
Each source system has unique data volume and frequency characteristics. A
data engineer should know how data is generated and any specific quirks of the
system. It’s also crucial to understand the system’s limitations, such as whether
running analytical queries could slow down its performance.
One of the most challenging variations of source data is the schema. The schema
defines the structure of data, from the overall system down to individual tables and
fields. Handling schema correctly is crucial, as the way data is structured impacts how
data is ingested and processed. There are two common approaches:
1. Schemaless Systems: In these systems, the schema is dynamic and defined as
data is written. This is often the case with NoSQL databases like MongoDB or
when data is written to a messaging queue or blob storage.
2. Fixed Schema Systems: In more traditional relational database systems, the
schema is predefined, and any data written to the database must conform to it.
Data engineers need to adapt to schema evolution, as source systems may change
over time. For example, in an agile development process, the schema may evolve to
accommodate new requirements, and data engineers must ensure that their data
pipelines can handle these changes without disrupting downstream analytics.
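A simple defensive pattern for schema evolution is to conform each incoming record to a known target schema before loading, defaulting missing fields and setting aside unexpected ones for review. The field names below are illustrative assumptions.

    # Conform records to a target schema so the pipeline tolerates source changes.
    TARGET_SCHEMA = {"user_id": None, "email": None, "signup_date": None}

    def conform(record: dict) -> dict:
        # Keep known fields (defaulting any that are missing) and set aside new,
        # unmapped fields so schema changes can be reviewed instead of breaking loads.
        conformed = {key: record.get(key, default) for key, default in TARGET_SCHEMA.items()}
        extras = {k: v for k, v in record.items() if k not in TARGET_SCHEMA}
        if extras:
            conformed["_unmapped"] = extras
        return conformed

    # Example: the source later adds a "plan" field and drops "signup_date".
    print(conform({"user_id": 42, "email": "a@example.com", "plan": "pro"}))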
Storage in Data Engineering
Once data is ingested, it must be stored appropriately. Selecting the right storage
solution is crucial for the success of the entire data life cycle, yet it is one of the most
complex stages due to several factors.
1. Complexity of Storage Solutions:
Cloud architectures often use multiple storage solutions.
Many storage solutions offer not only storage but also data transformation
capabilities (e.g., Amazon S3 Select).
Storage overlaps with other lifecycle stages such as ingestion,
transformation, and serving.
2. Impact across the Data Life Cycle:
Storage occurs at multiple points in a data pipeline and affects processes
across the life cycle.
Cloud data warehouses can store, process, and serve data.
Streaming platforms like Apache Kafka and Pulsar serve as ingestion,
storage, and query systems, with object storage as a common data
transmission layer.
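As a brief sketch of how a streaming platform blurs ingestion and storage, the snippet below appends an event to a Kafka topic, where it is retained and can later be replayed by any consumer. It assumes the kafka-python client and a broker at localhost:9092; the topic name is illustrative.

    import json
    from kafka import KafkaProducer  # assumes the kafka-python package is installed

    # Producers append events to a topic; the topic retains them so consumers can
    # read (and re-read) from any stored offset later.
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",  # assumed local broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("sensor-readings", {"sensor_id": "sensor-1", "temperature_c": 21.4})
    producer.flush()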
Key Considerations for Evaluating Storage Systems:
When selecting a storage system for a data warehouse, lakehouse, database, or object
storage, key questions include:
Is this storage solution compatible with the architecture’s required write and
read speeds?
Will storage create a bottleneck for downstream processes?
Do you understand how this storage technology works? Are you utilizing the
storage system optimally or committing unnatural acts?
Will this storage system handle anticipated future scale?
Will downstream users and processes be able to retrieve data within the required
service-level agreement (SLA)?
Are you capturing metadata about schema evolution, data flows, data lineage,
and so forth?
Is this a pure storage solution (object storage), or does it support complex
query patterns (i.e., a cloud data warehouse)?
Is the storage system schema-agnostic, flexible schema, or enforced schema?
How are you tracking master data, golden records, data quality, and data
lineage for governance?
Understanding Data Access Frequency (“Data Temperatures”)
Data is accessed at different frequencies, leading to classification based on
“temperature”:
Hot Data: Frequently accessed, often multiple times per day or second. Stored
for fast retrieval, suitable for real-time systems.
Lukewarm Data: Accessed occasionally, such as weekly or monthly.
Cold Data: Rarely accessed, typically stored for compliance or backup
purposes. Traditionally stored on tapes, but cloud vendors now offer low-cost
archival options with high retrieval costs.
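Cloud object stores let these temperatures be managed automatically with lifecycle rules that move objects to cheaper storage classes as they age. The sketch below uses boto3 against Amazon S3; the bucket name, prefix, and day thresholds are illustrative assumptions.

    import boto3  # assumes AWS credentials are configured

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-analytics-bucket",       # illustrative bucket name
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "tier-by-age",
                    "Filter": {"Prefix": "events/"},
                    "Status": "Enabled",
                    "Transitions": [
                        {"Days": 30, "StorageClass": "STANDARD_IA"},  # lukewarm tier
                        {"Days": 365, "StorageClass": "GLACIER"},     # cold archive
                    ],
                }
            ]
        },
    )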
Selecting the Right Storage Solution
The choice of storage depends on various factors:
Use Cases: Different storage types suit different needs.
Data Volume: Large volumes may require scalable solutions.
Ingestion Frequency: High-frequency ingestion may need specialized
storage.
Data Format and Size: The structure and size influence storage decisions.
There is no universal storage solution – each technology comes with trade-offs, and
the choice should align with the specific needs of the data architecture.
Data Ingestion
Data Ingestion is the second phase of the data engineering lifecycle, involving the
collection of data from various source systems. After understanding the data sources
and their characteristics, it becomes essential to ensure smooth and reliable data flow
into storage, processing, and serving systems.
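A minimal batch-ingestion sketch is shown below: it pulls a batch of JSON records from a REST endpoint and lands the raw response in a file, ready for downstream transformation. The endpoint URL and output path are placeholders, and the requests package is assumed to be available.

    import json
    import requests  # assumes the requests package is installed

    def ingest_batch(endpoint: str, out_path: str) -> int:
        # Pull a batch of records and land the raw payload in storage.
        response = requests.get(endpoint, timeout=30)
        response.raise_for_status()  # surface ingestion failures early
        records = response.json()
        with open(out_path, "w", encoding="utf-8") as f:
            json.dump(records, f)
        return len(records)

    # Example call (hypothetical endpoint):
    # ingest_batch("https://api.example.com/v1/orders", "raw_orders.json")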