Data Engineering Unit 2 Notes
A data pipeline is an organised system that moves data from its source to its destination
while ensuring quality, reliability, and usability. As modern businesses rely heavily on
data-driven decision-making, designing effective pipelines has become a critical
requirement. A well-designed pipeline provides smooth extraction, cleaning, storage,
transformation, governance, delivery, and monitoring of data. The design principles and
architectural patterns serve as core guidelines for building systems that can handle
large-scale, real-time, and complex datasets without failure. This topic explores the
fundamental principles, the need for structured pipeline design, technical
considerations, and commonly used design and architectural patterns.
Data pipelines must ensure that raw data is collected efficiently, processed accurately,
and delivered to the right systems for analysis. Whether the purpose is analytics,
machine learning, reporting, business intelligence, or application-level integration, the
foundation lies in how the pipeline is designed. Organisations today process terabytes
of structured, semi-structured, and unstructured data originating from highly diverse
sources. Therefore, scalability, modularity, quality, security, and interoperability have
become the cornerstones of modern pipeline engineering.
1.1 Scalability
Scalability ensures that the pipeline can handle an increasing volume, variety, and
velocity of data. As data grows due to user activity, IoT devices, application logs, or
transactional systems, the pipeline must support parallel processing, distributed
computation, and elastic resource allocation. Horizontal scalability, where new
machines are added to share workloads, is considered more efficient than vertically
upgrading a single system.
1.2 Reliability and Fault Tolerance
Data pipelines cannot afford failures because they directly impact business operations.
Reliability ensures that data continues to flow even when components fail. Techniques
include retry logic, replication, checkpointing, distributed clusters, and redundant
storage. Fault-tolerant pipelines automatically resume processing after failures,
ensuring no data loss or duplication.
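A minimal sketch of the retry-with-backoff technique mentioned above; the helper, the delay values, and the read_from_source step are illustrative assumptions rather than a specific library API.

    import time

    def with_retries(task, max_attempts=3, base_delay=2):
        """Run a task, retrying with exponential backoff on transient failures."""
        for attempt in range(1, max_attempts + 1):
            try:
                return task()
            except Exception as exc:
                if attempt == max_attempts:
                    raise                      # give up after the final attempt
                wait = base_delay ** attempt   # 2s, 4s, 8s, ...
                print(f"Attempt {attempt} failed ({exc}); retrying in {wait}s")
                time.sleep(wait)

    # Hypothetical usage: wrap a flaky extraction step
    # records = with_retries(lambda: read_from_source("orders"))

In practice the same idea is combined with checkpointing, so a resumed job restarts from the last successful offset instead of reprocessing everything.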
1.3 Cost Optimisation
Cloud-based pipelines must balance performance and cost. Resources should scale
dynamically, idle components must shut down automatically, and storage tiers must be
optimised (hot, warm, cold storage). Understanding the trade-off between performance
and expense is a critical design element.
2.1 ETL Pattern
Traditional data pipelines first extract data, then transform it using business rules, and
finally load it into a data warehouse. ETL is best for structured data, strict governance,
and enterprise reporting. It ensures high-quality, cleaned, and enriched data before
storage.
2.2 ELT Pattern
Modern cloud systems prefer ELT, where data is extracted and loaded directly into cloud
storage or data lakes, and transformation occurs afterwards. It supports large datasets,
faster ingestion, and flexible analytics or machine learning use cases.
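The difference between the two patterns can be sketched in Python; the source, warehouse, lake, and engine objects and the clean() rule below are hypothetical placeholders meant only to show where the transformation step sits.

    def clean(record):
        # Illustrative business rule: trim strings and drop empty fields
        return {k: v.strip() if isinstance(v, str) else v
                for k, v in record.items() if v is not None}

    # ETL: transform inside the pipeline, then load only curated data
    def run_etl(source, warehouse):
        rows = source.extract()
        warehouse.load([clean(r) for r in rows])

    # ELT: land raw data first, transform later inside the storage platform
    def run_elt(source, lake, engine):
        lake.load_raw(source.extract())
        engine.run_sql("CREATE TABLE curated AS SELECT ... FROM raw")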
2.3 Batch Processing Pattern
Batch systems process large volumes of data at scheduled intervals. They are suitable for
daily reports, weekly aggregations, payroll processing, and analytics, offering stability,
cost efficiency, and simplicity, but they lack real-time capabilities.
2.4 Streaming Processing Pattern
Streaming patterns process data continuously in real time. They are essential for fraud
detection, IoT monitoring, log analytics, customer personalisation, and high-frequency
event processing. Streaming pipelines require low latency, distributed messaging
systems, and event-driven architecture.
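As an illustration of the event-driven style, the sketch below consumes events from a Kafka topic with the kafka-python client; the broker address, topic name, and the purchase rule are assumptions.

    import json
    from kafka import KafkaConsumer   # pip install kafka-python

    consumer = KafkaConsumer(
        "clickstream",                          # assumed topic
        bootstrap_servers="localhost:9092",     # assumed broker
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )

    for message in consumer:                    # yields events as they arrive
        event = message.value
        if event.get("action") == "purchase":   # simple real-time rule
            print("purchase event:", event)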
2.5 Lambda Architecture Pattern
Lambda architecture combines batch and streaming layers to provide both accuracy
and low latency. The batch layer ensures comprehensive historical data processing,
while the speed layer handles real-time events.
2.6 Microservices Pattern
Data pipelines built using microservices allow independent components for ingestion,
transformation, validation, enrichment, and delivery. Each microservice can be
deployed, scaled, or replaced individually.
2.7 Orchestration Pattern
Tools like Apache Airflow, AWS Step Functions, and Prefect orchestrate pipelines
through workflows. This pattern helps manage dependencies, automation, scheduling,
and monitoring.
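A minimal Apache Airflow sketch of this orchestration pattern; the DAG name, schedule, and the extract/transform callables are assumptions used only to show how dependencies and scheduling are declared.

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():        # placeholder extraction step
        print("extracting raw data")

    def transform():      # placeholder transformation step
        print("transforming data")

    with DAG(
        dag_id="daily_sales_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",      # run once per day
        catchup=False,
    ) as dag:
        extract_task = PythonOperator(task_id="extract", python_callable=extract)
        transform_task = PythonOperator(task_id="transform", python_callable=transform)
        extract_task >> transform_task   # transform runs only after extract succeeds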
3.1 Metadata Management
Metadata defines data origin, definitions, schema, freshness, and lineage. Pipelines
must support metadata catalogs to ensure traceability and governance.
3.2 Data Contracts
Contracts specify how producers and consumers interact. They define schema, format,
quality, frequency, and protocols, ensuring stability and compatibility between
components.
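A simple sketch of enforcing such a contract on the consumer side; the field names and types are assumed for illustration.

    # Hypothetical contract agreed between producer and consumer
    ORDER_CONTRACT = {
        "order_id": str,
        "amount": float,
        "currency": str,
        "created_at": str,    # ISO-8601 timestamp expected
    }

    def validate(record: dict, contract: dict) -> list:
        """Return a list of contract violations for one record."""
        errors = []
        for field, expected_type in contract.items():
            if field not in record:
                errors.append(f"missing field: {field}")
            elif not isinstance(record[field], expected_type):
                errors.append(f"wrong type for {field}")
        return errors

    print(validate({"order_id": "A1", "amount": 10.5, "currency": "INR"}, ORDER_CONTRACT))
    # -> ['missing field: created_at']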
3.3 Idempotency
Idempotent operations ensure that running a job multiple times produces the same
output. This prevents duplication, errors, and inconsistent results.
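A sketch of an idempotent write: the record's key is derived deterministically from its content, so re-running the same job upserts rather than duplicates. The in-memory dictionary stands in for a database table with a primary key.

    import hashlib
    import json

    store = {}   # stands in for a keyed table or object store

    def idempotent_write(record: dict):
        key = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
        store[key] = record    # writing the same record twice has no extra effect

    idempotent_write({"order_id": "A1", "amount": 10.5})
    idempotent_write({"order_id": "A1", "amount": 10.5})   # rerun: still one row
    print(len(store))          # 1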
3.4 Backpressure Handling
During high load, pipelines must regulate input traffic to avoid overwhelming
downstream components. Backpressure techniques improve stability and prevent
system crashes.
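One common backpressure technique is a bounded buffer that blocks fast producers until the consumer catches up; the buffer size and the simulated processing delay below are illustrative.

    import queue
    import threading
    import time

    buffer = queue.Queue(maxsize=100)    # bounded buffer between pipeline stages

    def consumer():
        while True:
            item = buffer.get()
            time.sleep(0.01)             # simulate slower downstream processing
            buffer.task_done()

    threading.Thread(target=consumer, daemon=True).start()

    for i in range(1000):
        buffer.put(i)                    # blocks when the buffer is full -> backpressure
    buffer.join()                        # wait until everything is processed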
Good design ensures accuracy, speed, reliability, and scalability. Poorly designed
pipelines lead to corrupted data, failed jobs, performance issues, and incorrect business
decisions. With growing data complexity, strong architectural principles and design
patterns are essential for modern data engineering.
The evolution of data architectures reflects how organisations have transformed their
approach to collecting, storing, processing, analysing, and governing data over the
decades. As data volumes expanded, technologies became more sophisticated, and
business needs shifted toward real-time intelligence, the corresponding data
architecture models adapted accordingly. From traditional single-server systems to
distributed cloud-native lakehouses and streaming ecosystems, the journey of data
architecture represents one of the most significant technological transitions in the IT
industry. Understanding this evolution is crucial for modern data engineers, architects,
and analysts because current systems still rely on principles that originated in earlier
architectures.
Data processing was strictly batch-oriented. Jobs ran overnight, and results were
available the next day. Data integration was minimal because each application
managed its own isolated dataset. Reporting capabilities were basic, limited to static
tabular formats. As business demands grew, these systems struggled to accommodate
increased workload, multiple sources, or complex analytical needs.
However, relational databases and enterprise data warehouses (EDWs) had limitations. They were optimized for structured data only, were expensive, required large upfront investment, and were not suitable for unstructured sources like emails, logs, images, or sensor data. As the web era began, data volumes exploded beyond RDBMS capabilities.
The 2000s witnessed the rise of massive data generated by online platforms, social
media, e-commerce, and IoT. Traditional systems could not scale to petabyte-level
datasets. This led to the emergence of distributed architectures, especially after
Google published its seminal papers on Google File System (GFS) and MapReduce.
These ideas inspired Hadoop, which introduced HDFS (distributed storage) and
MapReduce (distributed computation).
This era marked the transition from vertical scaling to horizontal scaling. Organisations
could now store huge volumes of raw data across clusters of inexpensive commodity
hardware. Instead of schema-on-write, big data systems adopted schema-on-read,
enabling flexible analysis. Processing evolved into batch-based parallel computation.
Though powerful, Hadoop systems were slow because MapReduce involved heavy disk
I/O and multiple job stages.
The big data era introduced new problems—data lakes storing ungoverned raw data,
slow batch analytics, and limited real-time processing. Still, it represented a major
milestone in handling variety and volume.
Data lakes became mainstream, stored on cloud object storage systems such as
Amazon S3, Azure Data Lake Storage, and Google Cloud Storage. Data lakes support
all data types—structured, semi-structured, unstructured—making them ideal for
machine learning and highly flexible analytics. ELT pipelines replaced ETL, allowing raw
data to be stored first and transformed later based on use-case.
The introduction of distributed query engines like Presto, Athena, BigQuery, and
Snowflake allowed high-performance SQL analytics on data lake storage. The
architecture further shifted from traditional warehouses toward lakehouse
architectures, blending the reliability of warehouses with the flexibility of lakes.
Modern organisations face problems of data silos, centralised bottlenecks, and scaling
challenges as they expand globally. As a result, newer architectural paradigms
emerged.
● Lakehouse architecture
● Data democratization
● Highly scalable ingestion systems driven by mobile apps, IoT sensors, clickstreams, and real-time logs
8. Summary of Evolution
The evolution reflects the movement from centralised, structured systems to flexible,
scalable, distributed, cloud-native, AI-driven ecosystems.
Cloud architectures separate storage from compute layers. This allows independent
scaling, reduces cost, and enhances flexibility. Data is stored in low-cost object storage
while compute clusters scale dynamically.
Cloud platforms support relational databases, NoSQL systems, streaming engines, data
warehouses, lakehouses, and AI platforms. This allows organisations to choose the
best tool for each workload.
Serverless services like AWS Lambda, Azure Functions, and GCP Cloud Functions automatically run code without provisioning or managing servers. Event-driven models allow instant processing based on triggers such as data arrival, file uploads, or message queue events.
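A sketch of this event-driven model as a hypothetical AWS Lambda handler that fires when a file lands in S3; the downstream processing step is an assumption, while the event structure follows the standard S3 notification format.

    import json

    def handler(event, context):
        """Triggered automatically when an object is uploaded to S3."""
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            print(f"new file arrived: s3://{bucket}/{key}")
            # validation, transformation, or loading would be invoked here
        return {"statusCode": 200, "body": json.dumps("processed")}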
1.6 Security, Governance, and Compliance
Amazon Web Services offers one of the most extensive data ecosystems.
AWS S3 is the foundation of the modern data lake architecture. It is durable, scalable, cost-efficient, and supports versioning, lifecycle policies, and object locking. AWS also offers Amazon Athena for SQL queries directly on S3, Redshift Spectrum for lake analytics, and Amazon QuickSight for BI dashboards. This enables flexible consumption and supports a lakehouse approach through AWS Lake Formation.
Azure Data Factory (ADF) serves as a central ingestion and orchestration service. It
supports:
● Batch ingestion
● Real-time ingestion via Azure Event Hubs
● IoT ingestion via Azure IoT Hub
ADF also provides ETL/ELT capabilities through Data Flow and pipeline orchestration.
Azure's core data storage and database services include:
● Azure Data Lake Storage (ADLS Gen2) for large-scale lake storage
● Azure SQL Database for relational workloads
● Cosmos DB for globally distributed NoSQL
● Azure Synapse SQL pools for warehousing
ADLS Gen2 is the default storage layer for modern architectures on Azure.
Azure Machine Learning integrates pipelines for model training, deployment, and
monitoring. Azure Synapse Studio provides an integrated environment for analytics,
visualization, and management of large volumes of data.
GCP services such as Pub/Sub and Dataflow enable real-time ingestion with automatic scaling and minimal operational load.
BigQuery separates compute and storage fully, enabling lightning-fast queries without
cluster management.
GCP integrates seamlessly with TensorFlow, AutoML, and Vertex AI for machine
learning applications.
BigQuery ML allows machine learning directly using SQL. Looker and Data Studio
support business intelligence and visual analytics.
Despite differences, AWS, Azure, and GCP share key characteristics in their modern
data architectures:
Most platforms now integrate the flexibility of lakes with the reliability of warehouses.
Stream processing engines and event-driven architectures are built into all major cloud
ecosystems.
5.3 Auto-Scaling and Elastic Compute
Workflow orchestration tools like AWS Step Functions, Azure Data Factory, and GCP
Composer streamline complex pipelines.
Cloud platforms also provide integrated machine learning services:
● Managed ML models
● Feature stores
● Model deployment capabilities
● Monitoring and retraining ecosystems
Cloud-native IAM systems, encryption, audit logs, and catalog services ensure
compliance.
Key benefits of cloud-based data architectures include:
● Unlimited scalability
● Lower operational cost
● Faster deployment timelines
● Integration with AI and advanced analytics
● Flexible compute and storage separation
● High reliability with built-in redundancy
● Simplified management through serverless services
Emerging trends in cloud data architecture include:
● Serverless analytics
● Unified data platforms (Databricks, Snowflake)
● Multi-cloud and hybrid-cloud strategies
● AI-driven governance
● In-memory processing engines
● Vector databases for generative AI
These trends reflect the shift towards systems that can support real-time
decision-making, massive global workloads, and advanced intelligent applications.
Modern data architecture focuses not only on collecting and storing data, but also on
transforming, enriching, analyzing, and operationalizing it in ways that support
organizational decisions. The Processing and Consumption layer is the heart of this
architecture because it converts raw data into business value. While ingestion and
storage ensure data availability, processing and consumption ensure data usability,
relevance, and intelligence. This stage integrates advanced analytics, real-time
computation, business intelligence, machine learning, and operational applications to
drive data-driven decisions.
A. Batch Processing
● Used for: financial reporting, customer segmentation, fraud pattern mining, ETL
jobs, historical trend analysis.
● Benefits: high throughput, optimized for large datasets, cost-effective.
● Technologies: Hadoop MapReduce, Spark batch jobs, AWS Batch, Azure Data
Factory pipelines.
Batch processing is essential when time sensitivity is lower and accuracy and
completeness are priorities. For example, end-of-day sales reports or monthly risk
evaluations require batch-oriented pipelines.
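A minimal PySpark sketch of such a batch job, producing an end-of-day sales summary; the input path, column names, and output location are assumptions.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("daily_sales_report").getOrCreate()

    orders = spark.read.parquet("s3://example-bucket/raw/orders/")   # assumed schema

    daily_sales = (
        orders
        .filter(F.col("order_date") == "2024-01-01")    # one day's data
        .groupBy("store_id")
        .agg(F.sum("amount").alias("total_sales"))
    )

    daily_sales.write.mode("overwrite").parquet("s3://example-bucket/reports/daily_sales/")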
B. Stream Processing
● Used for: anomaly detection, IoT monitoring, credit scoring in real-time, fraud prevention, recommendation engines.
● Benefits: low latency, instant event detection, supports operational decision
making.
● Technologies: Apache Kafka Streams, Apache Flink, Spark Structured
Streaming, AWS Kinesis, Azure Stream Analytics.
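A sketch of the streaming pattern using Spark Structured Streaming reading from Kafka (one of the technologies listed above); it assumes the Kafka connector package is available, and the broker, topic, and checkpoint path are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("transaction_stream").getOrCreate()

    events = (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")   # assumed broker
        .option("subscribe", "transactions")                   # assumed topic
        .load()
        .selectExpr("CAST(value AS STRING) AS value")          # raw event payload
    )

    query = (
        events.writeStream
        .format("console")                                     # illustrative sink
        .option("checkpointLocation", "/tmp/checkpoints/transactions")
        .start()
    )
    query.awaitTermination()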
C. Micro-Batch Processing
E. Data Transformation
Common transformations include:
● Standardization
● Deduplication
● Normalization
● Feature engineering
● Aggregations
● Time-series restructuring
● Joining datasets from multiple sources
● Data anonymization and masking for privacy
Transformation is often the most time-consuming component but also the most critical
for decision-making accuracy.
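A short pandas sketch of a few transformations from the list above (standardisation, deduplication, and masking); the column names and the hashing rule used for masking are assumptions.

    import hashlib
    import pandas as pd

    df = pd.DataFrame({
        "email": ["A@X.com", "a@x.com ", "b@y.com"],
        "amount": [100, 100, 250],
    })

    # Standardisation: trim whitespace and lower-case the email field
    df["email"] = df["email"].str.strip().str.lower()

    # Deduplication: drop repeated records after standardisation
    df = df.drop_duplicates()

    # Masking: replace the raw email with a one-way hash for privacy
    df["email"] = df["email"].apply(lambda e: hashlib.sha256(e.encode()).hexdigest())

    print(df)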
Machine learning and AI pipelines support:
● Predictive modeling
● Natural Language Processing
● Deep learning pipelines
● Recommendation systems
● Forecasting algorithms
● Anomaly detection models
● Reinforcement learning engines
These pipelines rely on feature stores, training datasets, model registries, MLOps
workflows, and automated retraining loops.
After processing, the next crucial step is delivering insights or data outputs to users,
applications, or systems. This is the consumption stage of the pipeline where
decision-makers actually interact with the results.
A. Business Intelligence (BI) Tools
● Power BI
● Tableau
● Looker
● Qlik Sense
● Google Data Studio
BI tools enable analysts and managers to explore metrics, generate queries, identify
patterns, and interpret KPIs.
B. Analytical Applications
Data scientists and analysts work in environments such as:
● Jupyter
● Databricks
● SageMaker
● Google Vertex AI
● Azure ML
Here, they run experiments, train models, and perform exploratory analysis using
curated datasets prepared by the pipeline.
C. Governed Data Access and Sharing
Consumption is governed through:
● Data catalogs
● Data marketplaces
● Controlled access layers
● Role-based permissions
● Auditing and lineage tracking
Modern consumption systems ensure that data users access the correct version of the
correct dataset with full lineage and security compliance.
A. Scalability
Must handle sudden spikes in data (e.g., sale events, viral traffic, IoT surges).
B. Low Latency
C. Fault Tolerance
Data pipelines must be highly resilient, recover from failures gracefully, and prevent
data loss.
D. Modularity
E. Interoperability
Modern processing and consumption pipelines are largely cloud-native. Each cloud
provider offers specialized tools:
● AWS
● Azure
● Google Cloud
● Snowflake / Databricks: unified cloud data platforms supporting SQL analytics, ML, streaming, and BI integration
Thus, modern enterprises consider this stage as the core engine powering digital
transformation.
Processing and consumption form the brain of modern data architecture. They
convert raw data into refined intelligence and ensure that the right stakeholders receive
the right information at the right time. The sophistication of a company’s processing and
consumption pipeline directly reflects its analytical maturity and competitiveness.
Organizations that excel in this domain unlock superior agility, predictive capabilities,
and customer value creation, positioning themselves strongly in data-driven markets.
A. Low Latency
Latency refers to the delay between data generation and insight production. In
streaming analytics, latency must be extremely low—often in milliseconds—so
decisions can be made instantly. Low latency is critical in areas like fraud detection,
emergency systems, stock trading, and IoT alerts.
C. Event-Driven Architecture
D. Scalability
Streaming systems must handle sudden spikes in incoming data—for example, when a
viral trend causes millions of new interactions or when IoT sensors send bursts of
readings. Cloud-native scaling enables automatic adjustment of compute resources.
E. Fault Tolerance
Because streaming never stops, failures must be handled gracefully. Systems use
checkpointing, replication, and recovery techniques to ensure no event is lost or
duplicated.
A. Streaming Data Sources
Sources such as IoT sensors, application logs, clickstreams, social media feeds, and transactional systems produce continuous event flows that must be rapidly captured.
B. Ingestion Layer
The ingestion layer collects streaming data and forwards it into processing engines. It
must be durable, fault-tolerant, and fast.
Common ingestion technologies include:
● Apache Kafka
● AWS Kinesis
● Azure Event Hubs
● Google Pub/Sub
● MQTT for IoT streams
This layer also supports partitioning, buffering, ordering, and retention of streaming
data.
C. Stream Processing Layer
This is the core computational engine of the pipeline, responsible for performing
real-time operations on incoming events.
Typical operations include:
● Filtering
● Windowing (e.g., sliding, tumbling, session windows)
● Aggregations and counts
● Joining streams with reference datasets
● Event correlation
● Machine learning inference on streams
● Real-time anomaly detection
Common stream processing engines include:
● Apache Flink
● Apache Spark Structured Streaming
● Kafka Streams
● Apache Storm
● Google Dataflow
● AWS Kinesis Data Analytics
● Azure Stream Analytics
These frameworks allow the creation of directed workflows where each event is
processed as it arrives.
D. Storage Layer
This layered storage architecture supports both real-time queries and long-term
analytics.
E. Machine Learning Integration
Modern streaming systems integrate machine learning models directly into the stream
for:
● Predictive analytics
● Fraud scoring
● Personalized recommendations
● Dynamic risk evaluations
● Real-time demand forecasting
Model inference often runs inside the stream processor itself, enabling instant
decision-making.
F. Consumption/Output Layer
A. Event Generation
Raw data events are produced by devices, applications, users, or operations. These
events typically include timestamps, identifiers, context, and metrics. The pipeline must
capture them the moment they occur.
B. Ingestion and Buffering
Events are temporarily stored in distributed messaging systems. This protects the
pipeline from failure and ensures reliable delivery. The ingestion layer acts as a buffer
that decouples producers and consumers.
C. Real-Time Transformation
Transformation ensures that data is usable and consistent before being analyzed.
The processing engine works under strict time constraints, using streaming windows to
compute metrics such as “sum over the last 10 seconds” or “average over the last 1
hour.”
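A sketch of such a windowed computation with Spark Structured Streaming: events are grouped into ten-second tumbling windows and summed. The built-in rate source is used so the example is self-contained; in a real pipeline the source would be Kafka or another event stream.

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import window, sum as sum_

    spark = SparkSession.builder.appName("windowed_metrics").getOrCreate()

    # The "rate" source emits (timestamp, value) rows, standing in for real events
    events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

    windowed = (
        events
        .withWatermark("timestamp", "30 seconds")          # tolerate late events
        .groupBy(window("timestamp", "10 seconds"))        # "sum over the last 10 seconds"
        .agg(sum_("value").alias("total"))
    )

    query = windowed.writeStream.outputMode("update").format("console").start()
    query.awaitTermination()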
F. Real-Time Action/Consumption
Websites track user clicks, searches, and scrolls in real-time. Streaming pipelines power
recommendation engines, dynamic pricing, and customer engagement strategies.
Brands use streaming analytics to monitor trending hashtags, customer sentiment, and
viral engagement in real-time.
Smart cities use streaming pipelines to analyze traffic flows, signal timings, congestion
patterns, and live GPS data.
Streaming analytics identifies unusual access patterns, DDoS attacks, failed login
anomalies, and suspicious network behavior instantly.
Organizations gain the ability to act immediately rather than react slowly.
Streaming analytics helps identify issues before they escalate into failures.
E. Continuous Intelligence
F. Competitive Advantage
Real-time engines, cloud compute, and scalable ingestion services can be expensive.
D. Latency-Sensitive Constraints
Applications requiring ultra-low latency demand specialized hardware or optimized
configurations.
These advancements will make streaming more efficient, automated, and widespread.
In modern organizations, data pipelines act as the backbone of all digital operations.
They continuously ingest, process, transform, store, and serve data for analytics,
machine learning, reporting, and business decision-making. As enterprises collect data
at massive scales from transactional systems, IoT sensors, mobile devices, and cloud
applications, the need for highly secure and extremely scalable data pipelines has
become unavoidable. Any weakness in the security of the pipeline can expose sensitive
business or customer information, whereas a lack of scalability can result in system
failures, bottlenecks, or complete pipeline breakdowns during peak loads. Hence,
securing and scaling a data pipeline are no longer optional—they are foundational
principles of modern data engineering.
A data pipeline that is both secure and scalable must address five major dimensions:
1. Cloud security, which protects data, networks, identities, and infrastructure on
cloud platforms.
2. Security of analytics workloads, ensuring dashboards, analytical queries, BI
usage, and cluster operations remain safe.
3. Machine Learning security, which protects model data, training pipelines,
inference endpoints, and model integrity.
4. Scaling the data pipeline, which ensures that ingestion, computation, storage,
and serving layers can grow seamlessly with demand.
5. Creating scalable infrastructure and scalable components, meaning every
part of the pipeline—data ingestion, message queues, ETL jobs, storage layers,
analytics engines, and ML systems—must be independently scalable.
Each of these areas forms a critical layer in the architecture of modern data systems,
and weaknesses in any layer can compromise the reliability of the entire pipeline. The
following sections explain each dimension in detail, along with diagrams for easy
retention and examination use.
As organizations increasingly adopt cloud platforms such as AWS, Microsoft Azure, and
Google Cloud Platform, cloud security becomes the foundation of pipeline protection.
Cloud providers offer physical security, network isolation, data encryption tools, and
automated monitoring, but the customer is responsible for securing data access,
managing user identities, defining network policies, handling encryption keys, and
enforcing compliance.
Cloud security begins with Identity and Access Management (IAM), which defines
who can access which resource and what actions they can perform. Strong IAM policies
prevent unauthorized access to storage buckets, database systems, message queues,
and compute clusters. A secure pipeline also uses key management services for
encryption, ensuring both data in transit and data at rest remain cryptographically
protected.
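A hedged boto3 sketch of encryption at rest: an object is written to S3 with server-side encryption under a customer-managed KMS key. The bucket name, object key, and key alias are assumptions.

    import boto3

    s3 = boto3.client("s3")

    s3.put_object(
        Bucket="example-pipeline-bucket",            # assumed bucket
        Key="curated/orders/2024-01-01.json",
        Body=b'{"order_id": "A1", "amount": 10.5}',
        ServerSideEncryption="aws:kms",              # request SSE-KMS
        SSEKMSKeyId="alias/example-pipeline-key",    # assumed customer-managed key
    )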
Network security in the cloud is achieved using Virtual Private Clouds (VPCs), private
subnets, network access control lists, firewalls, and service endpoints that restrict
internal communication to secure channels. In addition, cloud-native monitoring systems
like AWS CloudTrail, Google Cloud Audit Logs, and Azure Security Center detect
anomalies such as unauthorized access attempts, high-risk API calls, or unexpected
data transfers.
Cloud security also involves securing data services, such as managed databases,
data warehouses, and message brokers. Access to these services must go through
role-based authorization, encryption-enabled connections, and IP whitelisting. With
sensitive workloads, data masking, tokenization, and anonymization become crucial.
Analytics workloads include dashboard tools, query engines, distributed clusters, and
user-facing analytical applications. These components frequently access sensitive,
aggregated, or decrypted data, making them a major target for attacks. To secure
analytics workloads, organizations implement role-based access, ensuring users only
see data relevant to their duties.
Additionally, BI tools such as Power BI, Tableau, Looker, and Qlik must enforce
row-level security, column-level security, and data masking where needed.
Sensitive fields such as salary data, financial records, or personally identifiable
information (PII) must be masked or tokenized to prevent misuse.
Machine learning systems introduce a new security dimension because both models
and training data can be attacked. ML models rely on data integrity—any poisoning or
manipulation of input data leads to compromised model behavior. Attackers may submit
harmful inputs to training data, generate adversarial examples to fool models, or attempt
model extraction through repeated queries.
ML security begins with securing the training pipeline. Training should occur in
isolated, authenticated environments with validated input data. Access to training
datasets, model checkpoints, and hyperparameter configurations must be restricted.
The trained model must be stored in encrypted form with signature-based verification to
ensure it is not tampered with. During the inference phase, ML models deployed as
APIs must implement rate limiting, token-based authentication, and payload
validation to prevent adversarial requests.
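A framework-agnostic sketch of two of these controls, token-based authentication and rate limiting, applied before the model is invoked; the token value, limits, and the stand-in prediction are illustrative.

    import time

    API_TOKENS = {"secret-token-123"}   # assumed issued tokens
    RATE_LIMIT = 10                     # max requests per client per minute
    request_log = {}                    # client_id -> recent request timestamps

    def within_rate_limit(client_id: str) -> bool:
        now = time.time()
        recent = [t for t in request_log.get(client_id, []) if now - t < 60]
        request_log[client_id] = recent + [now]
        return len(recent) < RATE_LIMIT

    def handle_inference(client_id: str, token: str, payload: dict):
        if token not in API_TOKENS:
            return {"status": 401, "error": "invalid token"}
        if not within_rate_limit(client_id):
            return {"status": 429, "error": "rate limit exceeded"}
        if not isinstance(payload.get("features"), list):    # basic payload validation
            return {"status": 400, "error": "malformed payload"}
        return {"status": 200, "prediction": sum(payload["features"])}   # stand-in for model.predict

    print(handle_inference("client-1", "secret-token-123", {"features": [1.0, 2.0]}))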
Scaling a pipeline means enabling it to handle increasing data volume, higher user
loads, faster data velocities, and more complex workloads. Scaling can be horizontal
(adding more machines), vertical (adding more power to existing machines), or elastic
(automatic scaling based on load).
A scalable pipeline begins at the ingestion layer. Systems like Kafka, Kinesis,
Pub/Sub, or API gateways must support partitioning, replication, and load balancing.
The processing layer, built on Spark, Flink, Beam, or serverless engines, must
dynamically expand cluster size as workloads increase.
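A sketch of partitioned ingestion with a kafka-python producer: keying each event by user id keeps one user's events on the same partition while spreading overall load across partitions and brokers. The broker address and topic name are assumptions.

    import json
    from kafka import KafkaProducer   # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        key_serializer=lambda k: k.encode("utf-8"),
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    event = {"user_id": "u-42", "action": "click", "page": "/home"}

    # The key determines the partition, preserving per-user ordering
    producer.send("clickstream", key=event["user_id"], value=event)
    producer.flush()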
Storage layers must scale to petabytes. This is achieved using distributed file systems,
object storage (S3, Blob, GCS), NoSQL databases (Cassandra, DynamoDB), and data
warehouses that support massive parallelism.
Analytics workloads are also scaled using caching systems, query accelerators,
pre-computed aggregates, and distributed SQL engines that split queries across
multiple compute nodes.
A scalable infrastructure ensures that all pipeline components grow seamlessly without
disruptions. This infrastructure is usually built on:
● Distributed computing frameworks for large-scale processing
● Elastic cloud resources such as auto-scaling clusters, serverless functions, and
container orchestration (Kubernetes)
● High-throughput messaging systems like Kafka
● Fault-tolerant distributed storage
● Monitoring systems for resource optimization
Every pipeline layer must be built in a modular, loosely coupled manner. Scalable
components include: