Unit II
Data Engineering

Contents – Unit 2
• Chapter 3 - Designing Good Data Architecture
• Chapter 4 - Choosing Technologies Across the Data Engineering Lifecycle

Designing Good Data Architecture
Chapter 3
Contents – Chapter 3
• What Is Data Architecture?
• Principles of Good Data Architecture
• Major Architecture Concepts
• Examples and Types of Data Architecture
• Who’s Involved with Designing a Data Architecture?
Data Architecture
• Good data architecture provides seamless capabilities across every step of the data lifecycle and its undercurrents
• Subset of Enterprise Architecture
Different definitions of Enterprise
Architecture
• TOGAF (The Open Group Architecture Framework) Definition
• The term “enterprise” in the context of “enterprise architecture” can denote an
entire enterprise—encompassing all of its information and technology services,
processes, and infrastructure—or a specific domain within the enterprise. In both
cases, the architecture crosses multiple systems, and multiple functional groups
within the enterprise.
• Gartner’s Definition
• Enterprise architecture (EA) is a discipline for proactively and holistically leading
enterprise responses to disruptive forces by identifying and analyzing the execution
of change toward desired business vision and outcomes. EA delivers value by
presenting business and IT leaders with signature-ready recommendations for
adjusting policies and projects to achieve targeted business outcomes that capitalize
on relevant business disruptions.
• EABOK’s (Enterprise Architecture Book of Knowledge) Definition
• Enterprise Architecture (EA) is an organizational model; an abstract
representation of an Enterprise that aligns strategy, operations, and
technology to create a roadmap for success
• Definition
• Enterprise architecture is the design of systems to support change in the
enterprise, achieved by flexible and reversible decisions reached through
careful evaluation of trade-offs
• Enterprise architecture balances flexibility and trade-offs
Definitions – Data Architecture
• Data architecture is a subset of enterprise architecture, inheriting its
properties: processes, strategy, change management, and technology
• TOGAF’s definition
• A description of the structure and interaction of the enterprise’s major types
and sources of data, logical data assets, physical data assets, and data
management resources.
• DAMA’s definition
• Identifying the data needs of the enterprise (regardless of structure)
and designing and maintaining the master blueprints to meet those
needs. Using master blueprints to guide data integration, control data
assets, and align data investments with business strategy.
• Definition
• Data architecture is the design of systems to support the evolving data needs of an
enterprise, achieved by flexible and reversible decisions reached through a careful
evaluation of trade-offs
• Data engineering architecture is a subset of general data architecture.
• Data engineering architecture is the systems and frameworks that make up
the key sections of the data engineering lifecycle
• Aspects of data architecture
• Operational (what needs to be done)
• encompasses the functional requirements of what needs to happen related to people,
processes, and technology.
• Eg:
• what business processes does the data serve?
• How does the organization manage data quality?
• What is the latency requirement from when the data is produced to when it becomes available to
query?
• Technical (how it will happen)
• outlines how data is ingested, stored, transformed, and served along the data engineering
lifecycle.
• Eg:
• how will you move 10 TB of data every hour from a source database to your data lake?
Good Data Architecture
Grady Booch, “Architecture represents the significant design decisions that shape a system, where significant is
measured by cost of change.”

• Good data architecture
• serves business requirements with a common, widely reusable set of building blocks while maintaining flexibility and making appropriate trade-offs
• is flexible and easily maintainable
• evolves in response to changes within the business and new technologies and practices
• undercurrents of the data engineering lifecycle form the foundation
• Bad data architecture
• is authoritarian and tries to cram a bunch of one-size-fits-all decisions into a big ball of mud
• is tightly coupled, rigid, overly centralized, or uses the wrong tools for the job, hampering development and change management
Principles of Good Architecture
1. Choose common components wisely.
2. Plan for failure.
3. Architect for scalability.
4. Architecture is leadership.
5. Always be architecting.
6. Build loosely coupled systems.
7. Make reversible decisions.
8. Prioritize security.
9. Embrace FinOps
Principle 1 - Choose Common Components
Wisely
• enable agility within and across teams in conjunction with shared
knowledge and skills.
• can be anything that has broad applicability within an organization.
• include object storage, version-control systems, observability, monitoring
and orchestration systems, and processing engines.
• should be accessible to everyone with an appropriate use case, and teams
are encouraged to rely on common components already in use rather than
reinventing the wheel.
• support robust permissions and security to enable sharing of assets among
teams while preventing unauthorized access
• Cloud platforms are an ideal place to adopt common components
Principle 2 – Plan for Failure
• A few key terms for evaluating failure scenarios
• Availability
• The percentage of time an IT service or component is in an operable state.
• Reliability
• The system’s probability of meeting defined standards in performing its intended function
during a specified interval.
• Recovery time objective
• The maximum acceptable time for a service or system outage.
• The recovery time objective (RTO) is generally set by determining the business impact of an
outage
• Recovery point objective
• The acceptable state after recovery. In data systems, data is often lost during an outage.
• Here, the recovery point objective (RPO) refers to the maximum acceptable data loss.
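A minimal sketch (all figures hypothetical, not from the source) of how these terms translate into simple checks a team might automate:

```python
def availability_pct(uptime_hours: float, total_hours: float) -> float:
    """Availability = operable time / total time, as a percentage."""
    return 100.0 * uptime_hours / total_hours

# Hypothetical monthly figures: 730 hours in the month, 2 hours of outage.
print(f"Availability: {availability_pct(728, 730):.3f}%")   # ~99.726%

# Hypothetical objectives agreed with the business.
RTO_MINUTES = 60    # max acceptable outage duration
RPO_MINUTES = 15    # max acceptable data loss, e.g. replication lag

measured_outage_minutes = 42
measured_data_loss_minutes = 20

print("RTO met" if measured_outage_minutes <= RTO_MINUTES else "RTO violated")
print("RPO met" if measured_data_loss_minutes <= RPO_MINUTES else "RPO violated")
```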
Principle 3 - Architect for Scalability
• Scale up
• allows us to handle extreme loads temporarily
• Eg:
• large cluster to train a model on a petabyte of customer data
• scale out a streaming ingestion system to handle a transient load spike
• Scale down
• Once the load spike ebbs, automatically remove capacity to cut costs.
• An elastic system can scale dynamically in response to load, ideally in an
automated fashion
• Scale to zero
• shut down completely when not in use.
• Once the large model-training job completes, delete the cluster
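A minimal sketch of the scale-up / scale-down / scale-to-zero decision above, driven by a simple load metric such as queue depth; all thresholds are hypothetical:

```python
def desired_workers(queue_depth: int, current_workers: int) -> int:
    if queue_depth == 0:
        return 0                                   # scale to zero: nothing to do, shut down
    if queue_depth > current_workers * 100:
        return min(current_workers * 2 or 1, 64)   # scale up for a load spike
    if queue_depth < current_workers * 10:
        return max(current_workers // 2, 1)        # scale down as the spike ebbs
    return current_workers

for depth, workers in [(0, 8), (5000, 8), (50, 8)]:
    print(f"queue={depth}, workers={workers} -> {desired_workers(depth, workers)}")
```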
Principle 4 - Architecture Is Leadership
• Data architects
• responsible for technology decisions and architecture descriptions and disseminating these
choices through effective leadership and training
• should be highly technically competent
• Need of the hour - Strong leadership skills combined with high technical
competence
• Ideal data architects
• possess the technical skills of a data engineer but no longer practice data engineering day to day
• mentor current data engineers
• make careful technology choices in consultation with their organization
• disseminate expertise through training and leadership
• train engineers in best practices
• bring the company’s engineering resources together to pursue common goals in both
technology and business
Principle 5 - Always Be Architecting
• An architect should
• develop deep knowledge of the baseline architecture (current state)
• develop a target architecture
• map out a sequencing plan to determine priorities and the order of
architecture changes
• The target architecture becomes a moving target, adjusted in
response to business and technology changes internally and
worldwide.
• The sequencing plan determines immediate priorities for delivery
Principle 6 - Build Loosely Coupled Systems
• For software architecture, a loosely coupled system has the following
properties:
1. Systems are broken into many small components.
2. These systems interface with other services through abstraction layers,
such as a messaging bus or an API. These abstraction layers hide and
protect internal details of the service, such as a database backend or
internal classes and method calls.
3. Internal changes to a system component don’t require changes in other
parts. Details of code updates are hidden behind stable APIs. Each piece
can evolve and improve separately.
4. There is no waterfall, global release cycle for the whole system. Instead,
each component is updated separately as changes and improvements are
made
• Applying to organisational characteristics
1. Many small teams engineer a large, complex system. Each team is tasked
with engineering, maintaining, and improving some system components.
2. These teams publish the abstract details of their components to other
teams via API definitions, message schemas, etc.
1. Teams need not concern themselves with other teams’ components; they simply use
the published API or message specifications to call these components.
2. They iterate their part to improve their performance and capabilities over time.
3. They might also publish new capabilities as they are added or request new stuff from
other teams
4. Teams work together through loosely coupled communication
3. Each team can rapidly evolve and improve its component independently of
the work of other teams
4. Teams can release updates to their components with minimal downtime.
Teams release continuously during regular working hours to make code
changes and test them
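A minimal sketch of the loose-coupling idea above: a team exposes a stable contract (a plain Python abstraction standing in for an API or message schema) so internal details can change without breaking consumers. All class and method names are illustrative:

```python
from abc import ABC, abstractmethod

class OrderService(ABC):
    """Published contract other teams code against."""
    @abstractmethod
    def get_order(self, order_id: str) -> dict: ...

class PostgresOrderService(OrderService):
    """Internal detail: could be swapped for another backend without
    changing any consumer, because consumers only see OrderService."""
    def get_order(self, order_id: str) -> dict:
        return {"order_id": order_id, "status": "shipped"}  # stub for a DB query

def report_status(svc: OrderService, order_id: str) -> str:
    # Consumer depends only on the abstraction, not the implementation.
    return svc.get_order(order_id)["status"]

print(report_status(PostgresOrderService(), "o-123"))
```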
Principle 7- Make Reversible Decisions
• Aim for reversible decisions, to simplify your architecture and keep it
agile
• Strive to pick the best-of-breed solutions that work for today
• Be prepared to upgrade or adopt better practices as the landscape
evolves
Principle 8 - Prioritize Security
• Data engineers should assume responsibility for the security of the systems they build and maintain
• Focus now on two main ideas:
• Zero-trust security
• Perimeter-only security (the traditional approach)
• A hardened network perimeter with “trusted things” inside and “untrusted things”
outside.
• Drawback - vulnerable to insider attacks, as well as external threats such as spear
phishing
• In a cloud-native environment
• All assets are connected to the outside world to some degree.
• Virtual private cloud (VPC) networks can be defined with no external connectivity, but the API
control plane that engineers use to define these networks still faces the internet
Principle 8 - Prioritize Security
• Shared responsibility security model
• Typically used by cloud providers
• Divides security into the security of the cloud and security in the cloud
• Security OF THE CLOUD
• Eg: AWS responsible for security of the cloud
• protecting the infrastructure that runs AWS services in the AWS Cloud
• AWS also provides you with services that you can use securely
• Security IN THE CLOUD
• Responsibility of AWS users
• determined by the AWS service used
• also responsible for other factors including the sensitivity of the data, organization’s
requirements, and applicable laws and regulations
Principle 9 - Embrace FinOps
• On-premises setting
• data systems are generally acquired with a capital expenditure for a new system every few years
• have to balance budget against desired compute and storage capacity
• Overbuying entails wasted money
• Underbuying means hampering future data projects and driving significant personnel time to control system load and data size
• Underbuying may require faster technology refresh cycles, with associated extra costs
• In the cloud
• most data systems are pay-as-you-go and readily scalable
• Systems can run on a cost-per-query model, cost-per-processing-capacity model, or another variant of a pay-as-you-go model
• makes spending far more dynamic
• The new challenge for data leaders is to manage budgets, priorities, and efficiency
Principle 9 - Embrace FinOps
• With FinOps,
• engineers need to learn to think about the cost structures of cloud systems.
• Eg:
• What is the appropriate mix of AWS spot instances when running a distributed cluster?
• What is the most appropriate approach for running a sizable daily job in terms of cost-effectiveness and
performance?
• When should the company switch from a pay-per-query model to reserved capacity?
• monitor spending on an ongoing basis
• Eg:
• Rather than simply monitoring requests and CPU utilization for a web server, FinOps might monitor the ongoing cost of serverless functions handling traffic, and set alerts that trigger on spikes in spending
• Help address issues
• Eg:
• When sharing data publicly, data teams can address excessive downloads by setting requester-pays policies, or
simply monitoring for excessive data access spending and quickly removing access if spending begins to rise to
unacceptable levels
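A minimal sketch (hypothetical numbers) of FinOps-style spend monitoring: alert when today's serverless cost spikes well above the recent baseline:

```python
from statistics import mean

daily_cost_usd = [42.0, 39.5, 41.2, 40.8, 44.1, 43.0, 97.3]  # last value is "today"

baseline = mean(daily_cost_usd[:-1])
today = daily_cost_usd[-1]

SPIKE_FACTOR = 1.5  # hypothetical threshold agreed with data leaders

if today > SPIKE_FACTOR * baseline:
    print(f"ALERT: spend spike, today=${today:.2f} vs baseline=${baseline:.2f}")
else:
    print("Spend within expected range")
```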
Major Architecture Concepts
• Domains & Services
• A domain is the real-world subject area for which you’re architecting.
• A service is a set of functionality whose goal is to accomplish a task.
• Eg: a sales order-processing service whose task is to process orders as they
are created. The sales order-processing service’s only job is to process orders;
it doesn’t provide other functionality, such as inventory management or
updating user profiles.
• Eg: A sales domain with three services: orders, invoicing, and products. Each
service has particular tasks that support the sales domain. Other domains
may also share services
Major Architecture Concepts
• Distributed Systems, Scalability, and Designing for Failure
• Related to principles 2 & 3
• Scalability
• Allows us to increase the capacity of a system to improve performance and handle the
demand.
• Eg: we might want to scale a system to handle a high rate of queries or process a huge data
set.
• Elasticity
• The ability of a scalable system to scale dynamically; a highly elastic system can automatically
scale up and down based on the current workload.
• Scaling up is critical as demand increases, while scaling down saves money in a cloud
environment.
• Modern systems sometimes scale to zero, meaning they can automatically shut down when
idle.
Major Architecture Concepts
• Distributed Systems, Scalability, and Designing for Failure
• Availability
• The percentage of time an IT service or component is in an operable state.
• Reliability
• The system’s probability of meeting defined standards in performing its intended
function during a specified interval.
• low reliability can lead to low availability
• elasticity improves reliability
• Vertical Scaling
• A single machine can be scaled vertically; you can increase resources (CPU, disk,
memory, I/O)
Major Architecture Concepts
• Distributed Systems, Scalability, and Designing for Failure
• Horizontal Scaling
• allows you to add more machines to satisfy load and resource requirements
• Common horizontally scaled systems have a leader node that acts as the main point of
contact for the instantiation, progress, and completion of workloads.
• When a workload is started, the leader node distributes tasks to the worker nodes within its system, which complete the tasks and return the results to the leader node
• Typical modern distributed architectures also build in redundancy.
• Data is replicated so that if a machine dies, the other machines can pick up where the
missing server left off; the cluster may add more machines to restore capacity
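A toy, single-process sketch of the leader/worker pattern described above: the leader splits the workload, hands tasks to workers (a thread pool here), and gathers the results. Real horizontally scaled systems do this across machines:

```python
from concurrent.futures import ThreadPoolExecutor

def worker(task: list[int]) -> int:
    return sum(task)              # each worker processes its share of the data

data = list(range(1_000))
chunks = [data[i:i + 100] for i in range(0, len(data), 100)]  # leader splits the work

with ThreadPoolExecutor(max_workers=4) as pool:   # pool of workers
    partials = list(pool.map(worker, chunks))     # leader gathers partial results

print(sum(partials))  # 499500, same answer a single machine would give
```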
Major Architecture Concepts
• Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices
• Tightly coupled
• Extremely centralized dependencies and workflows
• Every part of a domain and service is vitally dependent upon every other domain and
service
• Loosely coupled
• decentralized domains and services that do not have strict dependence on each other
• easy for decentralized teams to build systems whose data may not be usable by their
peers
Major Architecture Concepts
• Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices
• Architecture Tiers
• architecture has layers—data, application, business logic, presentation, and so forth
• understanding of how to decouple these layers
• tight coupling of modalities presents obvious vulnerabilities
• structure the layers of your architecture to achieve maximum reliability and flexibility
• Single tier
• database and application are tightly coupled, residing on a single server
• This server could be your laptop or a single virtual machine (VM) in the cloud.
• The tightly coupled nature means if the server, the database, or the application fails, the
entire architecture fails.
• While single-tier architectures are good for prototyping and development, they are not
advised for production environments because of the obvious failure risks.
Major Architecture Concepts
• Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices
• Multitier
• decouple the data and application.
• A multitier (also known as n-tier) architecture is composed of separate layers: data,
application, business logic, presentation, etc.
• These layers are bottom-up and hierarchical, meaning the lower layer isn’t necessarily
dependent on the upper layers; the upper layers depend on the lower layers.
• The notion is to separate data from the application, and application from the
presentation.
• 3-tier commonly used
• consists of data, application logic, and presentation tiers
• Each tier is isolated from the other, allowing for separation of concerns.
• free to use whatever technologies you prefer within each tier without the need to be
monolithically focused.
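A minimal sketch of the 3-tier separation above: the presentation tier only calls the application tier, which is the only code that touches the data tier. Table, function names, and the tax rate are illustrative:

```python
import sqlite3

# Data tier: storage and retrieval only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT, total REAL)")
conn.execute("INSERT INTO orders VALUES ('o-1', 59.99)")

def fetch_order_total(order_id: str) -> float:     # data tier access
    row = conn.execute("SELECT total FROM orders WHERE id = ?", (order_id,)).fetchone()
    return row[0]

def order_summary(order_id: str) -> dict:          # application tier: business logic
    total = fetch_order_total(order_id)
    return {"order_id": order_id, "total": total, "tax": round(total * 0.18, 2)}

def render(order_id: str) -> str:                  # presentation tier: formatting only
    s = order_summary(order_id)
    return f"Order {s['order_id']}: total={s['total']}, tax={s['tax']}"

print(render("o-1"))
```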
Major Architecture Concepts
• Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices
• Shared-nothing architecture
• a single node handles each request, meaning other nodes do not share resources such as
memory, disk, or CPU with this node or with each other.
• Data and resources are isolated to the node
• Shared-disk architecture
• share the same disk and memory accessible by all nodes.
• common when you want shared resources if a random node failure occurs.
Major Architecture Concepts
• Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices
• Monolith
• consists of a single codebase running on a single machine that provides both the
application logic and user interface
• Coupling within monoliths can be viewed in two ways:
• technical coupling and domain coupling.
• Technical coupling
• refers to architectural tiers
• Domain coupling
• refers to the way domains are coupled together.
• attributes of a monolith—interwoven services, centralization, and tight coupling among
services
Major Architecture Concepts
• Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices
• Microservices
• comprises separate, decentralized, and loosely coupled services.
• Each service has a specific function and is decoupled from other services operating
within its domain.
• If one service temporarily goes down, it won’t affect the ability of other services to
continue functioning.
•Monolith = A giant department store: one big building, shared backroom. Renovating the shoe section
shuts down the whole store.
•Microservices = A shopping mall: each shop runs independently. If the shoe store remodels, the
electronics store keeps running.
•Tiers = How each store organizes front desk (presentation), staff workroom (application), and storeroom
(data).

Monolithic Architecture – The “All-in-One” Approach


•Scenario: The e-commerce company builds a single order management system that:
• Handles user interface (website + mobile app),
• Processes orders,
• Stores customer & inventory data,
• Generates sales reports.
•Tight Coupling: All parts share the same database and codebase.
If you change the database schema for inventory, you might break the checkout process.
•Pros: Easier to start, faster initial development.
•Cons: Scaling is hard (you have to scale everything even if only one part is under load), harder
to update without side effects.
Microservices Architecture – The “Small, Specialized Shops” Approach
•Scenario: The same e-commerce platform is broken into independent services:
• User Service: Manages profiles and authentication.
• Inventory Service: Manages stock levels.
• Order Service: Handles checkout and payment.
• Analytics Service: Generates reports.
•Each service has its own database and communicates via APIs or messaging queues.
•Loose Coupling: If you change the Inventory database structure, other services are
unaffected as long as the API contract remains the same.
•Pros: Independent scaling, easier to update parts without touching the whole system.
•Cons: More complexity in communication and data consistency.
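A minimal sketch of the loose coupling described above: the Order Service depends only on the Inventory Service's published contract (a function signature standing in for a hypothetical REST endpoint such as GET /inventory/{sku}); the Inventory team can change its own storage freely as long as the contract holds:

```python
def get_stock_level(sku: str) -> dict:
    """Published contract: returns {'sku': ..., 'available': int}."""
    # Internal detail of the Inventory Service; could query any database.
    fake_inventory = {"SKU-1": 12, "SKU-2": 0}
    return {"sku": sku, "available": fake_inventory.get(sku, 0)}

def can_checkout(sku: str, quantity: int) -> bool:
    # Order Service logic: uses only the contract, knows nothing about
    # how inventory is stored.
    return get_stock_level(sku)["available"] >= quantity

print(can_checkout("SKU-1", 3))   # True
print(can_checkout("SKU-2", 1))   # False
```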
Tiered Architecture – The “Layered Cake” Approach
•Scenario: Whether monolith or microservices, the system can still be 3-tier:
• Presentation Tier: Website UI, mobile app.
• Application Tier: Business logic (microservices or monolithic modules).
• Data Tier: Databases, data warehouses, data lakes.
•Tight or loose coupling here depends on how each tier interacts:
• Tightly Coupled: UI directly queries the database (bad practice, fragile).
• Loosely Coupled: UI calls APIs, which handle data access behind the scenes.

Example in E-Commerce Data


• Tight coupling: a monolith where analytics code directly queries production DB tables; changing table names breaks reports.
• Loose coupling: a microservice where analytics gets data via a published API or Kafka stream; schema changes are hidden behind the API.
Major Architecture Concepts
• User Access: Single Versus Multitenant
• All cloud services are multitenant
• this multitenancy occurs at various grains.
• Eg: a cloud compute instance is usually on a shared server, but the VM itself provides
some degree of isolation.
• Object storage is a multitenant system, but cloud vendors guarantee security and
isolation so long as customers configure their permissions correctly
• two factors to consider in multitenancy: performance and security
• will the system support consistent performance for all tenants
• data from different tenants must be properly isolated. When a company has multiple
external customer tenants, these tenants should not be aware of one another, and
engineers must prevent data leakage
Major Architecture Concepts
• Event-Driven Architecture
• Events
• getting a new customer, a new order from a customer, or an order for a product or service
• Cause a change in the state of something: a new order might be created by a customer, or a customer might later make an update to this order
• An event-driven workflow encompasses the ability to create, update, and
asynchronously move events across various parts of the data engineering lifecycle.
• This workflow boils down to three main areas: event production, routing, and
consumption.
• An event must be produced and routed to something that consumes it without tightly
coupled dependencies among the producer, event router, and consumer
• Advantage: distributes the state of an event across multiple services. This is helpful if a
service goes offline, a node fails in a distributed system, or you’d like multiple consumers
or services to access the same events
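A minimal sketch of the three areas above (event production, routing, and consumption), using an in-memory queue as a stand-in for a real event router such as a message bus; the producer and consumer never call each other directly:

```python
import json
import queue

event_router = queue.Queue()          # stand-in for a message bus / broker

def produce_order_event(order_id: str, status: str) -> None:
    event = {"type": "order_updated", "order_id": order_id, "status": status}
    event_router.put(json.dumps(event))          # producer only knows the router

def consume_events() -> None:
    while not event_router.empty():
        event = json.loads(event_router.get())   # consumer only knows the router
        print(f"handling {event['type']} for {event['order_id']}: {event['status']}")

produce_order_event("o-1001", "created")
produce_order_event("o-1001", "shipped")
consume_events()
```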
Major Architecture Concepts
• Brownfield Versus Greenfield Projects
• Brownfield projects
• involve refactoring and reorganizing an existing architecture
• constrained by the choices of the present and past
• require a thorough understanding of the legacy architecture and the interplay of various old and new
technologies
• New systems slowly and incrementally replace a legacy architecture’s components. Eventually, the legacy
architecture is completely replaced
• critical to demonstrate value on the new platform by gradually increasing its maturity to show evidence of
success and then follow an exit plan to shut down old systems

• Greenfield projects
• allows you to pioneer a fresh start, unconstrained by the history or legacy of a prior
architecture.
• easier than brownfield projects
• Always prioritize requirements over building something cool
Major Architecture Concepts
Greenfield vs Brownfield at a glance
• Starting point: Greenfield – no existing system; Brownfield – existing legacy system
• Freedom of design: Greenfield – very high; Brownfield – limited by legacy dependencies
• Risks: Greenfield – unproven design choices; Brownfield – complexity of integration & migration
• Timeline: Greenfield – shorter at first, but more unknowns; Brownfield – longer due to compatibility work
• Example: Greenfield – a startup’s new real-time data platform; Brownfield – a bank migrating mainframe reports to a cloud data warehouse
Major Architecture Concepts
Brownfield Project – “Upgrading What Already Exists”
Definition:
An upgrade, migration, or improvement to an existing data architecture. Must integrate with or replace
legacy systems while keeping business operations running.
Eg:
•Scenario: A large retail chain has:
• An on-premises Oracle data warehouse,
• 200+ scheduled ETL jobs,
• Reports generated in Excel from nightly batch loads.
•They want to migrate to Snowflake and enable real-time inventory tracking.
•Constraints:
• Must keep old reports running during migration,
• Migrate historical data without downtime,
• Some old ETL scripts must be refactored to work with the new pipeline.
•Why Brownfield? They’re modernizing an existing ecosystem, not starting fresh — legacy compatibility is a
major factor.
Analogy: Renovating an old railway station; you can’t shut it down completely, so you rebuild section by section
while trains still run.
Major Architecture Concepts
Greenfield Project – “Starting from Scratch”
Definition:
A new data architecture built without any constraints from existing systems — no legacy data formats, no old
ETL jobs, no outdated pipelines.
You have full freedom to design from the ground up.
Real-World Example:
•Scenario: A new fintech startup wants to build a real-time fraud detection platform.
•They design a modern cloud-native data stack:
• Streaming ingestion via Kafka,
• Data lakehouse on Databricks,
• Machine learning models deployed with MLflow,
• All in a fully serverless AWS architecture.
•Why Greenfield? There’s no legacy database or reporting system to integrate with — they can choose the
latest tools and best practices without worrying about breaking anything existing.
Analogy: Building a brand-new airport on empty land — you choose the runway layout, terminals, and tech
from scratch.
Examples and Types of Data Architecture
1. Data Warehouse
2. Data Lake
3. Convergence, Next-Generation Data Lakes, and the Data Platform
4. Modern Data Stack
5. Lambda Architecture
6. Kappa Architecture
7. The Dataflow Model and Unified Batch and Streaming
8. Architecture for IoT
9. Data Mesh
Data Warehouse
Definition:
A centralized repository for structured, processed data, optimized for reporting and analytics.
•Stores clean, transformed data from multiple sources.
•Uses a schema-on-write approach (data is structured before storage).
•Great for business intelligence (BI) dashboards and KPIs.
Real-World Example:
•Company: Amazon
•Amazon’s sales reporting system stores structured data from transactions, customer info, and inventory in Amazon Redshift
(a cloud data warehouse) to generate revenue reports and trend analysis.

Data Lake
Definition:
A large storage repository that can hold raw, unprocessed data — structured, semi-structured, and unstructured — at any
scale.
•Uses schema-on-read (data structure is applied only when accessed).
•Ideal for data science, machine learning, and advanced analytics.
•Can store JSON, images, logs, clickstreams, IoT sensor data, etc.
Real-World Example:
•Company: Netflix
•Netflix stores raw user activity logs, viewing history, and recommendation model inputs in an AWS S3-based data lake
before processing them for personalized recommendations.
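A minimal sketch contrasting schema-on-write (warehouse-style: validate structure before storing) with schema-on-read (lake-style: store the raw record, apply structure at query time); records and field names are illustrative:

```python
import json

raw_event = '{"user": "u42", "title": "Some Show", "seconds_watched": "310"}'

# Schema-on-write (warehouse-style): enforce types before storage.
def load_to_warehouse(line: str) -> dict:
    rec = json.loads(line)
    return {"user": str(rec["user"]),
            "title": str(rec["title"]),
            "seconds_watched": int(rec["seconds_watched"])}  # fails fast if invalid

# Schema-on-read (lake-style): store the raw line; interpret it when queried.
data_lake = [raw_event]                      # stored untouched
def read_watch_time(stored: str) -> int:
    return int(json.loads(stored)["seconds_watched"])

print(load_to_warehouse(raw_event))
print(read_watch_time(data_lake[0]))
```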
Data Mart
Definition:
A subset of a data warehouse designed for a specific business function or department.
•Contains subject-specific data for targeted analytics.
•Faster to query since it’s smaller and focused.
Real-World Example:
•Company: Walmart
•Walmart has a Sales Data Mart focused solely on POS transactions, promotions, and product
performance, separate from HR or logistics data, so store managers can quickly run sales
reports.
Data Warehouse vs Data Lake vs Data Mart at a glance
• Data type: Warehouse – structured; Lake – all types (structured, semi-structured, unstructured); Mart – structured
• Processing: Warehouse – schema-on-write; Lake – schema-on-read; Mart – schema-on-write
• Purpose: Warehouse – enterprise-wide analytics, BI; Lake – data science, ML, raw storage; Mart – department-level analytics
• Example company: Warehouse – Amazon (Redshift); Lake – Netflix (AWS S3); Mart – Walmart (Sales Mart)
• Users: Warehouse – analysts, BI teams; Lake – data scientists, engineers; Mart – specific departments (e.g., Sales, Marketing)
Examples and Types of Data Architecture
• Convergence, Next-Generation Data Lakes, and the Data Platform
• Data lakehouse
• a convergence between data lakes and data warehouses
• incorporates the controls, data management, and data structures found in a data
warehouse while still housing data in object storage and supporting a variety of query
and transformation engines
• Data Platforms
• combine data lake and data warehouse capabilities
Examples and Types of Data Architecture
• Modern Data Stack
• a trendy analytics architecture that highlights abstraction
• the main objective is to use cloud-based, plug-and-play, easy-to-use, off-the-
shelf components to create a modular and cost-effective data architecture.
• These components include data pipelines, storage, transformation, data
management/governance, monitoring, visualization, and exploration
• to reduce complexity and increase modularization
• Integrates with data platform
• Key outcomes of the modern data stack are self-service (analytics and
pipelines), agile data management, and using open source tools or simple
proprietary tools with clear pricing structures
Examples and Types of Data Architecture
• Lambda Architecture
• In the early to mid-2010s, the popularity of working with streaming data
exploded with the emergence of Kafka as a highly scalable message queue
and frameworks such as Apache Storm and Samza for streaming/real-time
analytics.
• These technologies allowed companies to perform new types of analytics and
modeling on large amounts of data, user aggregation and ranking, and
product recommendations.
• Problem - Data engineers needed to figure out how to reconcile batch and
streaming data into a single architecture.

• Lambda architecture offered a solution to this problem


Examples and Types of Data Architecture
• Lambda Architecture
• have systems operating independently of each other—batch, streaming, and
serving.
• The source system is ideally immutable and append-only, sending data to two
destinations for processing: stream, and batch. In-stream processing intends
to serve the data with the lowest possible latency in a “speed” layer, usually a
NoSQL database.
• In the batch layer, data is processed and transformed in a system such as a
data warehouse, creating precomputed and aggregated views of the data.
• The serving layer provides a combined view by aggregating query results from
the two layers.
Examples and Types of Data Architecture
• Lambda Architecture
• Drawbacks
• Managing multiple systems with different codebases is difficult
• Creates error-prone systems with code and data that are extremely difficult to reconcile
Case Study – Lambda Architecture - Uber’s Ride Pricing & ETA Prediction
Scenario:
Uber needs to show users instant ride prices and ETAs while also improving accuracy based on historical trends.
1. Batch Layer (Accuracy)
•Data Source: Historical trip data, driver locations, traffic patterns, surge history.
•Processing: Runs daily/weekly jobs on a data warehouse like Hadoop/Spark.
•Output: Accurate, large-scale models for pricing and ETA prediction.
2. Speed Layer (Low Latency)
•Data Source: Live GPS pings from drivers, live ride requests, current traffic feeds.
•Processing: Stream processing engine like Apache Kafka + Apache Flink/Spark Streaming.
•Output: Immediate estimates for ride price and ETA based on real-time conditions.
3. Serving Layer (Unified View)
•Merges real-time estimates (speed layer) with historically refined models (batch layer).
•Example:
• At 5:32 PM, a passenger in Mumbai requests a ride.
• The speed layer instantly calculates ETA: 8 minutes (based on current traffic).
• The batch layer model later updates the prediction to 7 minutes after including historical congestion
patterns for that route.
Why Uber Uses Lambda Architecture
Speed layer ensures the app responds in seconds.
Batch layer ensures predictions are accurate over time.
Merging both ensures users get fast and reliable estimates without waiting for full batch jobs to run.
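A minimal sketch (all numbers hypothetical) of the serving-layer idea in this case study: combine the low-latency estimate from the speed layer with the historically refined estimate from the batch layer:

```python
def speed_layer_eta(live_traffic_factor: float, base_minutes: float) -> float:
    # Computed instantly from current conditions.
    return base_minutes * live_traffic_factor

def batch_layer_eta(base_minutes: float, historical_adjustment: float) -> float:
    # Precomputed periodically from historical trip data.
    return base_minutes * historical_adjustment

def serving_layer_eta(base_minutes: float) -> float:
    speed = speed_layer_eta(live_traffic_factor=1.15, base_minutes=base_minutes)
    batch = batch_layer_eta(base_minutes, historical_adjustment=0.95)
    # One simple way to combine the two views: weight the fresher estimate higher.
    return 0.7 * speed + 0.3 * batch

print(f"ETA shown to rider: {serving_layer_eta(7.0):.1f} minutes")
```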
Examples and Types of Data Architecture
• Kappa Architecture
• Overcomes drawbacks of Lambda architecture
• uses a stream-processing platform as the backbone for all data handling—
ingestion, storage, and serving
• Real-time and batch processing can be applied seamlessly to the same data
by reading the live event stream directly and replaying large chunks of data
for batch processing
• Drawbacks
• Turns out to be complicated and expensive in practice
Case Study – Kappa Architecture - Netflix’s Real-Time Recommendations
Scenario:
Netflix constantly updates movie/show recommendations for users as they watch content.
1. Event Stream
•Data Source:
• User watch events (start, pause, stop, rewind)
• Search queries
• Ratings and likes
•All events are ingested in real time using Apache Kafka.
2. Stream Processing
•Tools: Apache Flink / Spark Structured Streaming.
•Logic:
• Analyze viewing patterns instantly.
• Update recommendation scores as soon as new events happen.
• Adjust content rankings dynamically.
3. Serving Layer
•Recommendations are stored in a fast data store like Cassandra or ElasticSearch.
•The UI instantly reflects changes — e.g., if you finish watching a thriller, the recommendations update within seconds.

Why Netflix Uses Kappa Architecture


Continuous data flow — recommendations must adapt instantly to user behavior.
No heavy batch jobs — processing is always happening in real time.
Easier maintenance — only one code path for real-time and historical data (just replay Kafka logs for reprocessing).
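A minimal sketch of the Kappa idea above: one stream-processing code path, with "batch" reprocessing done by replaying the same event log from offset zero. The in-memory list stands in for a Kafka topic; fields are illustrative:

```python
event_log = [
    {"user": "u1", "event": "watched", "genre": "thriller"},
    {"user": "u1", "event": "watched", "genre": "thriller"},
    {"user": "u1", "event": "watched", "genre": "comedy"},
]

def update_scores(scores: dict, event: dict) -> dict:
    scores[event["genre"]] = scores.get(event["genre"], 0) + 1
    return scores

# "Real-time" path: process events as they arrive.
live_scores: dict = {}
for ev in event_log:
    live_scores = update_scores(live_scores, ev)

# "Batch" path: the same code, just replaying the log from the beginning.
replayed_scores: dict = {}
for ev in event_log:
    replayed_scores = update_scores(replayed_scores, ev)

print(live_scores == replayed_scores, live_scores)  # True {'thriller': 2, 'comedy': 1}
```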
Choosing Technologies Across the
Data Engineering Lifecycle
Chapter 4
Architecture vs Tools

• Architecture
• Architecture is strategic
• Architecture is the what, why, and when
• Decide architecture first
• Tools
• Tools are tactical
• Tools make the architecture a reality; tools are the how
• Decide tools once the architecture is decided
Considerations for choosing data technologies
across the data engineering lifecycle
• Team size and capabilities
• Speed to market
• Interoperability
• Cost optimization and business value
• Today versus the future: immutable versus transitory technologies
• Location (cloud, on prem, hybrid cloud, multicloud)
• Build versus buy
• Monolith versus modular
• Serverless versus servers
• Optimization, performance and the benchmark wars.
• The undercurrents of the data engineering lifecycle
Team size and capabilities
• Small team vs big team
• A team’s size roughly determines the amount of bandwidth your team
can dedicate to complex solutions
• For small teams or teams with weaker technical chops, use as many
managed and SaaS tools as possible, and dedicate your limited
bandwidth to solving the complex problems that directly add value to
the business
• Stick with technologies and workflows with which the team is familiar
• Learning new technologies, languages, and tools is a considerable
time investment, so make these investments wisely
Speed to market
• Means choosing the right technologies that help you deliver features and
data faster while maintaining high-quality standards and security.
• It also means working in a tight feedback loop of launching, learning,
iterating, and making improvements.
• Deliver value early and often.
• Use what works.
• Team members will likely get better leverage with tools they already know.
• Avoid undifferentiated heavy lifting that engages your team in
unnecessarily complex work that adds little to no value.
• Choose tools that help you move quickly, reliably, safely, and securely.
Interoperability
• When choosing a technology or system, ensure that it interacts and
operates with other technologies.
• Interoperability describes how various technologies or systems connect,
exchange information, and interact.
• Vendors and open source projects will target specific platforms and
systems to interoperate.
• Most data ingestion and visualization tools have built-in integrations with
popular data warehouses and data lakes.
• Popular data-ingestion tools will integrate with common APIs and services,
such as CRMs, accounting software
• Almost all databases allow connections via Java Database Connectivity
(JDBC) or Open Database Connectivity (ODBC)
• Design for modularity and giving yourself the ability to easily swap out
technologies as new practices and alternatives become available.
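A minimal sketch of the interoperability point above: because most systems speak standard interfaces, swapping a backend is often a connection-string change rather than a rewrite. This assumes SQLAlchemy is available; the warehouse URL is a placeholder, not a real host or credential:

```python
from sqlalchemy import create_engine, text

# The same query code can target different systems via different URLs.
WAREHOUSE_URL = "postgresql://user:password@warehouse-host:5432/analytics"  # placeholder
LOCAL_URL = "sqlite:///:memory:"  # works out of the box for a quick test

engine = create_engine(LOCAL_URL)
with engine.connect() as conn:
    result = conn.execute(text("SELECT 1 AS ok"))
    print(result.fetchone())   # (1,)
```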
Cost Optimization and Business Value
• Budgets and time are finite, and the cost is a major constraint for
choosing the right data architectures and technologies
• Costs are seen through three main lenses: total cost of ownership,
opportunity cost, and FinOps
• Total cost of ownership (TCO) is the total estimated cost of an
initiative, including the direct and indirect costs of products and
services utilized.
• Direct costs can be directly attributed to an initiative.
• Eg: the salaries of a team working on the initiative or the AWS bill for all services
consumed.
• Indirect costs, also known as overhead, are independent of the initiative and
must be paid regardless of where they’re attributed
• how something is purchased impacts the way costs are accounted for.
Expenses fall into two big groups: capital expenses and operational
expenses.
• Capital expenses, also known as capex, require an up-front investment.
• a significant capital outlay with a long-term plan to achieve a positive ROI on the effort
and expense put forth
• Operational expenses, also known as opex, are the opposite of capex in
certain respects.
• Opex is gradual and spread out over time.
• Whereas capex is long-term focused, opex is short-term.
• Opex can be pay-as-you-go or similar and allows a lot of flexibility.
• Opex is closer to a direct cost, making it easier to attribute to a data project.
• Opex allows for a far greater ability for engineering teams to choose their software and
hardware.
• Cloud-based services let data engineers iterate quickly with various software and
technology configurations, often inexpensively
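A minimal sketch (all figures hypothetical) comparing an amortized capex purchase with usage-based opex over the same period, to show why the accounting model changes how costs are attributed to a project:

```python
CAPEX_SERVERS_USD = 300_000          # up-front purchase, used for 3 years
YEARS = 3
capex_per_month = CAPEX_SERVERS_USD / (YEARS * 12)

OPEX_PER_QUERY_USD = 0.002           # hypothetical pay-per-query price
queries_per_month = 3_000_000
opex_per_month = OPEX_PER_QUERY_USD * queries_per_month

print(f"Capex, amortized:  ${capex_per_month:,.0f}/month regardless of usage")
print(f"Opex, usage-based: ${opex_per_month:,.0f}/month, scales with queries")
```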
• Total opportunity cost of ownership (TOCO) is the cost of lost
opportunities that we incur in choosing a technology, an architecture,
or a process.
• FinOps
• The goal of FinOps is to fully operationalize financial accountability and
business value by applying the DevOps-like practices of monitoring and
dynamically adjusting systems
Today Versus the Future: Immutable Versus
Transitory Technologies
• Choose the best technology for the moment and near future, but in a way
that supports future unknowns and evolution
• Understand what is likely to change and what tends to stay the same
• Immutable technologies
• components that underpin the cloud or languages and paradigms that have stood
the test of time.
• Benefit from the Lindy effect: the longer a technology has been established, the
longer it will be used
• In the cloud, examples of immutable technologies are object storage, networking,
servers, and security.
• Object storage such as Amazon S3 and Azure Blob Storage will be around from today
until the end of the decade, and probably much longer.
• For languages, SQL and bash have been around for many decades
• Transitory technologies
• those that come and go.
• The typical trajectory begins with a lot of hype, followed by meteoric growth
in popularity, then a slow descent into obscurity
• Find the immutable technologies along the data engineering lifecycle,
and use those as your base.
• Build transitory tools around the immutables
• Consider how easy it is to transition from a chosen technology
Location
• Principal places to run your technology stack:
• On premises
• Cloud
• Hybrid cloud
• On-premises applications generate event data that can be pushed to the cloud essentially for free.
• The bulk of data remains in the cloud where it is analyzed, while smaller amounts of data are pushed
back to on premises for deploying models to applications, reverse ETL, etc
• Multicloud
• refers to deploying workloads to multiple public clouds.
• Companies may have several motivations for multicloud deployments.
• SaaS platforms often wish to offer services close to existing customer cloud workloads. Snowflake and Databricks
provide their SaaS offerings across multiple clouds for this reason. This is especially critical for data-intensive
applications, where network latency and bandwidth limitations hamper performance, and data egress costs can
be prohibitive
• Take advantage of the best services across several clouds. For example, a company might want to handle its
Google Ads and Analytics data on Google Cloud and deploy Kubernetes through GKE. And the company might
also adopt Azure specifically for Microsoft workloads. Also, the company may like AWS because it has several
best-in-class services (e.g., AWS Lambda) and enjoys huge mindshare, making it relatively easy to hire AWS-
proficient engineers
• Disadvantages of multicloud
• Data egress costs and networking bottlenecks are critical.
• Can introduce significant complexity.
• Companies must now manage a dizzying array of services across several clouds;
cross-cloud integration and security present a considerable challenge;
multicloud networking can be diabolically complicated
