
Data Platform Fundamentals
Data engineering with Dagster

Joe Naso and Colton Padden
Table of Contents
Chapter 01: Data Platforms
  Introduction
  An Open Data Platform
  Why do we need data platforms?
  The Control Plane
  Benefits of Data Platforms
  Observability and Developer Experience

Chapter 02: Architecting Your Data Platform
  Scaling with the Business
  Composability and Extensibility
  Common Data Architectures
    Extract-Transform-Load (ETL)
    Extract-Load-Transform (ELT)
    Data Lakehouses
    Event-Driven

Chapter 03: Design Patterns
  Design Patterns for Data Pipelines
    Push
    Pull
    Poll
  Idempotency

Chapter 04: Data Modeling
  Designing a Well-Configured Platform
  Must-have Features

Chapter 05: Data Quality
  The Dimensions of Data Quality
    Timeliness
    Completeness
    Accuracy
    Validity
    Uniqueness
    Consistency
  Tools and Technologies
  Enforcement of Data Quality
    The full data lifecycle
    A central control plane (orchestrator)

Chapter 06: Example Data Platforms
  The Lightweight Data Lake
  The GCP Stack
  The AWS Stack
  MDS by the Book
  Data Lakehouse
  Event Driven

Conclusion
Meet the authors
Chapter 01: Data Platforms

Introduction
Many companies follow a similar maturity curve in data engineering. Initial, small-scale automations grow to support large-scale orchestration needs. Eventually, organizations recognize the need for dedicated ownership—a team or individual tasked with building and maintaining a scalable platform to support their automation and data pipeline requirements.

While organizations may have very similar data needs, headcount, and workflows, it's common for the platforms they build to be significantly different. This is because no distinct solution exists for building a data platform. This can come down to decisions about their cloud provider, preference for open or closed-source software, and budget. Although the underlying technologies may be unique, these organizations often follow common, well-known patterns and principles to design their data platforms. We will review some of those patterns and principles in this e-book.

Dagster believes in an open data platform that is heterogeneous and centralized. It should support specific business use cases and accommodate various data storage and processing tools, all while providing a central, unified control plane across all of the processes in the organization.

Only one data platform should exist in an organization. While a single platform may require a larger upfront investment, the standardization it provides and the productivity boost for the team pay dividends as the platform grows.

An extensible platform provides better business alignment and less duplicated work. For more information on what Dagster believes about data platforms, see the "What Dagster Believes About Data Platforms" blog post.

This book will be a comprehensive guide for data platform owners looking to build a stable and scalable data platform, starting with the fundamentals:

• Architecting your Data Platform
• Design Patterns and Tools
• Observability
• Data Quality

Finally, we'll tie it together with real-world examples illustrating how different teams have built in-house data platforms for their businesses.


An Open Data Platform


You can find the Dagster team’s internal
Open Data Platform on GitHub. Throughout
this e-book, you’ll see how Dagster’s core
tenet of building open-source software can
help establish the standards for a scalable,
maintainable, and stable data platform.

This book is for data practitioners interested in building and maintaining a data platform.
You may be a team of one or part of an
established team with a growing area of
responsibility. In either case, common issues,
patterns, and needs are core to building a
data platform. This book will help you
navigate those topics and establish a data
platform that evolves alongside your business.

Why do we need data platforms?


Most data tools fall into two categories: all-in-one solutions or unbundled tools. One of the biggest criticisms of the Modern Data Stack is that its design over-indexes on unbundled tools, requiring teams to stitch together a large number of specialized tools to address problems historically managed by larger, managed platforms.

Is it really necessary to adopt a dozen tools to do what was previously managed by only a handful? The problem is not deciding which tools you need but how to integrate those tools into a cohesive whole.

Unbundling your entire stack reveals one major weakness: maintaining consistency and control over various tooling choices and access points. Addressing this flaw is the fundamental purpose of building a data platform.

A data platform helps us bridge the gap between a fully unbundled data stack and a singular, closed platform. And it all comes down to the control plane.


The Control Plane


The control plane is more than orchestration. For many orchestrators, the ability to schedule and initiate jobs is the bulk of what they provide. Data Platforms require more than that.

Traditionally, the control plane is used to configure and activate the data plane. This is perhaps the most critical piece of the data platform puzzle. It provides a standard of control and enables a single view of all inputs into and outputs of your platform. In many cases, this requires various tools, which must be integrated.

The control plane extends beyond creating and modifying datasets; it can connect assets across various systems, resources, and tools. Knowing that your scheduled job failed, or being alerted to a failed data quality check, is only somewhat useful. Knowing the downstream data lineage of that failure is far more important than the simple observation that "it failed."

This is the marriage of orchestration, metadata, and observability, and it is the core value-add for most data platforms.


Benefits of Data Platforms


The flexibility of an unbundled Data Stack can quickly evolve from a value-add initiative to a maintenance burden. Separate—though integrated—tools often develop opaque and brittle dependencies. An upstream change may have an unexpected impact far downstream.

Testing alone is not the solution to making sure various tools work well together. The true culprit is often disparate codebases and a disjointed developer experience. Many tools often mean many repos, poorly documented dependencies, and an increasingly complex system architecture, which leads to siloing within teams and organizations.

These are common problems for many companies and familiar headaches for all data engineers.

Thankfully, a centralized data platform helps address these issues holistically with your team's and system's common language: code.

In software engineering there is the concept of code encapsulation. This idea is also prevalent in establishing a data platform. While the organization operates within a single data platform, individual contributors, like data scientists, analysts, data engineers, and machine learning engineers, can operate with freedom from within their own subdivisions of the platform.

Note
In the context of data engineering, "code encapsulation" refers to the practice of bundling together the functions that operate on, and generate, data into a single unit, typically in the form of modules or classes. This practice helps manage code complexity, improve maintainability, and promote reusability.

Data platforms help manage and orchestrate workflows, so it is reasonable to expect "code locations" to include both the logical components of a workflow and the configuration related to that workflow. Managing a variety of jobs in disparate systems makes this incredibly difficult, and it only becomes harder as the complexity of those jobs grows. A centralized platform that introduces standardized workflow patterns and provides a shared interface for multiple jobs is a primary benefit of establishing a data platform. The enforcement of standards by the platform owner enables them to ensure improved validation and support of the platform as a whole.


Observability and Developer Experience
Observability and developer experience are two other key benefits of a data platform, alongside code locations and encapsulation. They are also the common threads that help elevate your pipelines from miscellaneous code sets to a stable and mature platform.

We'll explore these topics in later chapters, but observability can take multiple forms. At its core, observability is the ability to reasonably inspect what is happening on the platform. This includes changes to your system inputs, changes in the outputs over time, and other details about the inner workings of your platform. The Data Visibility Primer is another great resource on this topic.

These elements contribute to the developer's experience on the data platform. The biggest pitfall of data platforms - or any large data infrastructure initiative - is an inconsistent experience. This applies to both the creation of new features and datasets within the platform and the maintenance of existing features and datasets. Without the ability to share standards across your code locations and a clear view of the workings of the platform, your developer experience will suffer. As a result, your team's velocity will suffer, and your data platform will likely fail to provide the value you expect.

Chapter 02: Architecting Your Data Platform

Data Platforms often comprise many tools.

Independently, these tools solve discrete problems. Collectively, however, they provide the means to solve broader pain points common to data teams and technical organizations: inconvenient data silos, poor data quality, and ever-changing stakeholder requirements that can make delivering value difficult. But these challenges can be addressed holistically - and preemptively - when architecting your data platform.

Strong consideration for system design at the inception of your project is crucial, as early architecture decisions can influence the effectiveness of your platform: for example, risk of vendor lock-in, scalability limitations, and maintenance overhead. Design decisions and tooling selections should also scale with your team's and your company's growth. It can seem daunting when taken at face value, but established design patterns make this process much more manageable.

This chapter will review the configurations, tooling, and architectures that the data team should consider. Whether starting from square one or rearchitecting a legacy system, you can set yourself up for success by being intentional and thoughtful about the overall system.


Scaling with the Business


Data Platforms need to grow and evolve with the business. If data platform development stagnates while the business is still changing, it's likely a byproduct of poor architecture and overly rigid design.

A centralized data platform is the best way to ensure stability and consistency across all your data initiatives. Designing an extensible platform will allow you to keep up with the business's changing demands.

The architecture of your platform establishes which kinds of data access patterns are possible for the rest of the organization. The data platform, and data teams, must be able to meet the needs of the numerous stakeholders across organizational teams. These can manifest as any number of data deliverables, whether those be dashboards, data exports, or machine learning models. These expectations need to be understood early on. If you expect consumers to only interface with the platform through BI tools, you should facilitate a design that ensures data is readily available through that entry point.

If data is primarily shared via APIs or batch exports, your platform should support that as a primary feature. These entry points can evolve over time, but you should expect to support any platform's new entry point for the long term.

Making it easy for users to consume data through a BI tool can be a double-edged sword. Often, "self-service" is the goal, but without strong controls on structure and access, you introduce the possibility of misuse and misinterpretation of the data that's been made available. Centralized data platforms improve this through transparency of data usage, lineage to better understand data origin, and higher quality data through end-to-end pipeline validation.

These details must be considered when designing your platform's internals and the entry points for your customers and stakeholders.


Composability and Extensibility

Rapidly growing datasets, tightened SLAs, and prohibitive costs are common drivers for platform redesign and architecture overhauls.

A composable and extensible platform makes meeting these requirements possible without an expensive redesign. Requirements often change at organizations, and the data platform needs to be able to react. A startup may find that as they grow and their data volume increases, their initial set of tools is no longer able to handle the additional traffic.

As the lifespan of the tools within a data platform is tied to the needs of the business, the goal should not be to avoid platform changes but to design a platform that makes operating under this constraint manageable.


If you build an extensible and composable platform, you can continue to operate at a high level with minimal downtime. You also afford yourself the flexibility of updating your platform without migrating major system components.

Changing transformation tooling, pipeline design, and migrating data stores are common activities for any Data Platform Owner. The ease with which you can execute those changes depends on the platform's design.

What does an extensible platform look like in the real world? Often, it comes down to abstractions.

You have probably heard the phrase "prefer composition over inheritance" in the context of software application design. This applies to data platform design in much the same way, in that a data platform is typically composed of multiple purpose-built tools.

Often the composition of data tools is abstracted and hidden from the downstream analysts and data engineers, allowing these individuals to be more efficient in completing their tasks. Data engineering is software engineering after all, and the design and composition of tools, pipelines, and models should be aligned with software engineering best practices.

Composability and extensibility are key components of a well-architected data platform, and something we will cover in more detail later in these writings.

Note
In object-oriented design and software architecture, the principle to "prefer composition over inheritance" advocates for reusability and organization of code over heavy reliance on inheritance. With inheritance, code can become tightly coupled, leading to fragility and inflexibility. With composition, there is a loose coupling between code references, leading to improved maintenance and often clearer, more understandable code.


Common Data Architectures


Historically, data architectures were considerably simpler than what you see today. As the data needs of organizations have become more demanding, the underlying technologies have evolved to meet them: data needs to be more real time, with higher volume, and through more and more sophisticated modelling.

In recent years, a proliferation of tools has emerged to handle the specialized requirements of data processing and modelling, and a bundling of these tools has been defined with the name of the "Modern Data Stack". Many teams have become frustrated with the modern data stack and its many vendors, and critics often stress that these new tools introduce undue complexity, and some of those concerns are warranted.

But there is no denying that the wide range of data tooling has lent itself to a range of architectures previously unavailable. Let's look at some of those data architectures and critical components of each design:

01 Extract-Transform-Load (ETL)
02 Extract-Load-Transform (ELT)
03 Data Lakehouses
04 Event-Driven


Extract-Transform-Load (ETL)

This architecture is the simplest and oldest approach on this list. Data starts from an origin datastore, often a production database, and is transformed in transit before being written to a final storage mechanism. Typically, you'll see aggregations and other denormalization applied to these datasets before they are written to their final destination.

By design, the state of the data in its "final" form is different from what is extracted from its origin. This design is still commonplace, and can be quite effective at reducing compute usage at the destination. The tradeoff is that you may lose the granularity of data, and your downstream usage is limited by the logic applied in transit.
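To make the shape of the pattern concrete, here is a minimal, illustrative ETL sketch in plain Python. The source and destination are stubs standing in for a real database and warehouse; the point is only that aggregation happens in transit, so the destination never sees the raw rows.

Python

from collections import defaultdict

def extract_orders() -> list[dict]:
    # Stand-in for reading from an origin datastore (e.g. a production database).
    return [
        {"customer_id": 1, "amount": 25.00},
        {"customer_id": 1, "amount": 10.50},
        {"customer_id": 2, "amount": 99.99},
    ]

def transform(orders: list[dict]) -> list[dict]:
    # Aggregation happens in transit: only customer-level totals reach the destination.
    totals: dict[int, float] = defaultdict(float)
    for order in orders:
        totals[order["customer_id"]] += order["amount"]
    return [{"customer_id": cid, "total_spend": amt} for cid, amt in totals.items()]

def load(rows: list[dict]) -> None:
    # Stand-in for writing to the final storage mechanism (e.g. a warehouse table).
    for row in rows:
        print(row)

if __name__ == "__main__":
    load(transform(extract_orders()))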


Extract-Load-Transform (ELT)

This design has grown dramatically in popularity thanks to the Modern Data Stack and its modular design. It has also become commonplace thanks to the advent of Massively Parallel Processing databases and their modern counterparts.

Instead of applying transformations to your data before it lands in your warehouse, you retain a raw copy of that data as it arrives. This design pattern lends itself to various extraction methodologies, including change data capture (CDC) and ingestion of data from APIs or file servers. As an example, in the case of CDC, all of the operations that are performed on a dataset, like a table in Postgres, are captured and loaded into storage. Then, that data can be reconstructed in the transformation step, reconstituting the original dataset from the operations that took place. Compare this to ETL, where the raw change capture would be lost because the transformation takes place as the data is loaded into storage.

Because the raw data is retained, it is possible to recreate the transformations and models of this data, ultimately providing more flexibility. But this also results in additional computation and increased storage costs.
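Here is a minimal, hedged sketch of that idea: raw CDC-style events are landed untouched, then SQL reconstructs the current state. Python's built-in sqlite3 module stands in for a real warehouse purely for illustration (it assumes a SQLite build with the JSON1 functions, which ships with recent Python versions).

Python

import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_user_changes (payload TEXT)")

# "Load": change events are stored exactly as they arrive, nothing is discarded.
events = [
    {"op": "insert", "id": 1, "email": "augustus@example.com", "ts": 1},
    {"op": "update", "id": 1, "email": "gloop@example.com", "ts": 2},
]
conn.executemany(
    "INSERT INTO raw_user_changes (payload) VALUES (?)",
    [(json.dumps(e),) for e in events],
)

# "Transform": rebuild the latest state of each user from the raw events.
# (SQLite documents that bare columns in a MAX() aggregate come from the row
# that produced the maximum, so `email` here is the most recent value per id.)
conn.execute("""
    CREATE TABLE users AS
    SELECT json_extract(payload, '$.id')    AS id,
           json_extract(payload, '$.email') AS email,
           MAX(json_extract(payload, '$.ts')) AS last_changed_at
    FROM raw_user_changes
    GROUP BY json_extract(payload, '$.id')
""")
print(conn.execute("SELECT id, email FROM users").fetchall())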


Data Lakehouses

The Data Lakehouse architecture is relatively new, but its underlying components have been around for many years.

Data Lakehouses provide the scale of data lakes with the structure of data warehouses. Their design allows for large-scale data processing and storage, with most of the "heavy lifting" done outside the data warehouse.

Typically, data is written to columnar file formats like Parquet and stored in cloud storage like S3 or GCS. Recently, Iceberg, Delta, and Hudi formats have also been gaining popularity. In the case of Delta, this is an additional layer on top of Parquet in which additional metadata and transaction logs are included, enabling support for ACID (atomicity, consistency, isolation, durability) transactions.

These files typically follow a Medallion Architecture, which we'll cover in more detail later in this book. These systems often use specialized analytics engines, such as Spark, to transform the data in these files into clean and consumable datasets. The final output is often written for use in a data warehouse or otherwise exposed to downstream consumers.

See a term you're not familiar with? Be sure to check out the Dagster data engineering glossary.


Event-Driven

Streaming and "real-time" data platforms blend event-driven design patterns with some of the components of the architectures mentioned above.

In an event-driven paradigm, services exchange data and initiate their workloads through external triggers rather than relying on the coordination provided by a single orchestrator. Some examples of triggers include messages landing in a Kafka message queue, a call to a webhook, or an S3 event notification.

These architectures often use specialized tools to enable stream processing for real-time and rolling-window transformations. This approach is often paired with the Lakehouse architecture to serve analytical workflows for a wide range of SLAs.

Often streaming can be paired with batch processing, in that rolling computations can take place on the event stream, and persisted events landing in cloud storage can later be batch processed for additional reporting.

Some considerations of stream-based processing are the expected cost of long-running compute, the additional complexity of event processing, and the requirement for specialized tools that may not conform to the existing tools being used at your organization. However, when real-time analytics are required for insights that need to happen fast, like fraud detection, streaming is a fantastic option.
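As a small illustration of one of the triggers mentioned above, here is a hedged sketch of an AWS Lambda handler reacting to an S3 event notification. What the handler does with each new object is left as a stub; the bucket and downstream processing are assumptions, not a prescribed design.

Python

import json
import urllib.parse

def handler(event, context):
    """Minimal event-driven trigger: invoked by S3 when a new object lands."""
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        # Hand the new object off to the processing step of your choice,
        # e.g. append it to a stream or load it into the lakehouse.
        print(json.dumps({"bucket": bucket, "key": key}))
    return {"processed": len(records)}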

Chapter 03: Design Patterns

Design Patterns for Data Pipelines
As we know, addressing your business's data needs is at the core of your data platform. To do this, we need to capture and process data. Pipeline designs may be constrained by the tools you use within your platform, but don't be fooled by shiny or new technologies. There are only a finite number of data pipeline patterns that you can use to find, ingest, and process new datasets for use downstream.

Countless tools can help you process new data, but only a handful of fundamental patterns exist.

Whether your data platform is entirely nascent or quite mature, the pipelines powering your platform can fall into one of these three categories:

01 Push
02 Pull
03 Poll

There may be layers of abstraction built on top of these patterns, but the fundamental implementations will fall into one of these approaches. We won't be going into much depth in this chapter about pipeline abstractions, but you can find more details about Dagster's approach to abstractions here.

These categories are not unique to data engineering; any software engineer who has worked on system integrations, consumed webhooks from a third-party application, or interacted with a REST API should find these familiar. However, certain patterns can be found quite frequently within a data platform, which we have seen through our experiences implementing data platforms at various organizations, and through relationships with data tooling customers.


Push

For pipelines that are based on the push methodology, the consumer is required to wait for data to be pushed from a source system; the destination could be a location in cloud storage, an API endpoint, or a database.

For push-based systems, the consumer can dictate the schema of the data that is being received; however, a common complexity in these systems is synchronization of schema definitions between producers and consumers. In these cases, schema registries can come in handy for defining a single source of truth for the expected format of data.

An example pipeline that follows the push access pattern may look like this:

01 A process for the producer starts at a scheduled time or preset interval
02 The producer performs some form of processing or creation of data
03 The producer pushes their datasets to a storage location like S3, GCS, or an SFTP server
04 The consumer ingests that new dataset on their own schedule
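A minimal sketch of the producer's side of this pattern might look like the following. The bucket name and object key layout are hypothetical, and boto3 assumes AWS credentials are already configured; the only point is that the producer drops its dataset in a well-known location and the consumer picks it up on its own schedule.

Python

import csv
import io
from datetime import date

import boto3

def push_daily_export(bucket: str = "example-producer-drop-zone") -> None:
    """Steps 02 and 03 above: create a dataset and push it to cloud storage."""
    # 02: the producer creates some data
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["order_id", "amount"])
    writer.writerow(["1001", "25.00"])

    # 03: push the dataset to a location the consumer watches
    key = f"exports/orders/{date.today().isoformat()}.csv"
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())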

Pull

Pipelines using a pull methodology have a slightly different definition of ownership. Most notably, the consumer dictates the cadence of ingestion. Typically, this happens on a predefined schedule, though it can also be extended to use an event-driven trigger. We'll look at how this would work next.

With pull-based data access patterns, the consumer pulls the data based on the access pattern of the producer. For example, with a REST API, the consumer may have to perform pagination and supply query parameters in a very specific structure. For replication of data in a database, the consumer would be responsible for defining the SQL query for pulling data, along with the logic for incremental loads. Note that tools like dlt and Sling support this behavior out-of-the-box.

An example pipeline that follows the pull access pattern may look like this:

01 The process starts at a scheduled time or preset interval
02 The consumer job fetches the new data available
03 The consumer processes the new data as needed for integration into the platform
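Here is a hedged sketch of a pull-based consumer paging through a REST API. The endpoint and its page-based pagination scheme are hypothetical; real APIs vary (cursors, offsets, link headers), which is exactly the producer-specific detail the consumer has to absorb in this pattern.

Python

import requests

def pull_new_records(base_url: str = "https://api.example.com/v1/orders") -> list[dict]:
    """Fetch all pages of records from a (hypothetical) paginated endpoint."""
    records: list[dict] = []
    page = 1
    while True:
        response = requests.get(base_url, params={"page": page, "per_page": 100}, timeout=30)
        response.raise_for_status()
        batch = response.json()
        if not batch:
            break  # no more pages to pull
        records.extend(batch)
        page += 1
    return records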

Poll

The final flavor of pipeline uses a polling paradigm. It can look similar to the pull pattern, but this implementation typically pairs well with event-driven architectures. It requires the consumer to maintain state in ways not necessarily needed with other designs.

Polling implementations transfer the cadence of ingestion to the consumer. Maintaining a stateful pipeline - one knowledgeable of what data it has processed - is critical! Without that knowledge, subsequent runs of the pipeline may result in duplicate processing or unexpected downstream changes.

Here is a typical poll-based pipeline reading from an external message broker:

01 Messages are frequently being published to a Kafka message broker
02 The consumer polls this message broker and processes the records
03 It keeps an identifier or "cursor" of the record that has most recently been collected
04 The consumer then polls again, at a regular interval, providing the "cursor" to only retrieve new records
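The defining feature is the persisted cursor. The sketch below is deliberately generic: fetch_since is a stand-in for whatever broker or API you poll, and the cursor is persisted to a local file only for illustration (in practice it would live somewhere durable, such as your orchestrator's state or a database).

Python

import json
import time
from pathlib import Path

CURSOR_FILE = Path("cursor.json")  # illustrative; real state should be durable

def fetch_since(cursor: int) -> list[dict]:
    # Stand-in for reading from a broker or API; returns records newer than
    # `cursor`, each carrying a monotonically increasing "offset" field.
    return []

def poll_once() -> None:
    cursor = json.loads(CURSOR_FILE.read_text())["offset"] if CURSOR_FILE.exists() else 0
    records = fetch_since(cursor)
    for record in records:
        ...  # process the record
    if records:
        # Persist the cursor so the next poll only retrieves new records.
        CURSOR_FILE.write_text(json.dumps({"offset": records[-1]["offset"]}))

if __name__ == "__main__":
    while True:
        poll_once()
        time.sleep(60)  # poll again at a regular interval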


Idempotency
Many of the design patterns that were discussed above benefit from following an idempotent design. Idempotency is the idea that an operation, for example a pipeline, produces the same output each time the same set of parameters is provided. This is often a great practice in designing data pipelines as it creates consistency in processing jobs, but it is especially important for the poll-based data access pattern.

Determinism in software engineering is the expectation that the output of a function is based only on its inputs. There are no hidden state changes as a result of calling that function.

For example, let's look at two snippets of Python code, one which is idempotent, and the other which is not.

In this example, non_idempotent_function takes a location and destination parameter and writes the current temperature to a file. Each time this function is called, the output will be different, as it relies on an always-changing metric being returned from the MyWeatherService.

Now, let's compare that to something that is idempotent.

Not Idempotent

Python

def current_temperature(location: str) -> int:
    # MyWeatherService is a placeholder for any external weather API client.
    return MyWeatherService().get_current_temperature(location)

def non_idempotent_function(location: str, destination: str) -> None:
    with open(destination, "w") as f:
        f.write(str(current_temperature(location)))

Idempotent

Python

def get_temperature(timestamp: str, location: str) -> int:
    return MyWeatherService().get_temperature(location, timestamp)

def idempotent_function(timestamp: str, location: str, destination: str) -> None:
    with open(destination, "w") as f:
        f.write(str(get_temperature(timestamp, location)))


This example differs in that our idempotent_function requires a timestamp parameter. So instead of getting the temperature at a given location at the time the function is called, we are responsible for explicitly providing the time, and each time this function is called with that specific timestamp, it should return the same value.

Practically, you want your pipelines to produce the same result given the same context. Making this happen - or failing to do so - is often at the root of pipeline complexity.

The means of doing this can be relatively simple. At its most basic, this may mean recording IDs, filenames, or anything that uniquely identifies your input data that has already been processed by the data pipeline. This can often be done by leveraging the context of an orchestrator, for example, the trigger time of a pipeline, or the categorical partition data that can be used as a parameter to your processing code.

For instance, when running a batch pipeline daily, each execution would not look at the last 24 hours of data. Instead, it would use a specific 24-hour period between two specific "bookends."

By using an explicit window of time rather than something relative to the execution of the job, you remove one potential cause of pipeline drift. Even the best orchestrators can be resource-constrained. Also, reducing your exposure to slightly delayed jobs is a small but essential implementation detail as your platform grows.

Python

from dagster import AssetExecutionContext, DailyPartitionsDefinition, asset

@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2024-10-01")
)
def daily_partitioned_events(context: AssetExecutionContext) -> None:
    """Each execution of this asset is configured to be tied to a specific
    day, thanks to the DailyPartitionsDefinition. Subsequent materializations
    of this asset are each tied to a specific day, meaning executing them more
    than once should result in the same output.
    """
    partition_as_str = context.partition_key  # e.g. "2024-10-01"
    # do work on the data relevant for this specific partition
    # this may include using this partition key within a WHERE clause

Chapter 04: Data Modeling
Some patterns are universal when
it comes to data architecture and
data modeling. Both topics are
incredibly vast - and beyond the
scope of this chapter. But whether
you are implementing a Data
Lakehouse, Data Lake, or Data
Warehouse as the central data store
for your platform, you will find many
similarities in their structures.


You've probably heard of the Medallion Architecture. While the term may be most commonly associated with Data Lakehouse architectures, the core concepts are essentially the same as those commonly found in Data Warehouses that follow an ELT design.

A Medallion Architecture's fundamental concept is to ensure data isolation at specific points in its lifecycle. You'll often hear the terms "bronze", "silver" and "gold" used to describe this lifecycle. In a data warehouse context, you may see "staged", "core" and "data mart". Others use "raw", "transformed", and "production". Regardless of the terminology used, the buckets have semantic and functional significance.

Bronze
• Raw data is stored as it is received. Ingested data may be schemaless
• Often called a "Landing Zone"

Silver
• Cleaned and structured datasets derived from the Bronze layer
• Primary validation layer for data quality

Gold
• Consumer-ready datasets
• May include pre-processed aggregates and metrics
• Consumers may be business users, applications, or other data stores

Consumers typically want clean, well-structured data. This may not be the case when working with LLMs or other flavors of ML models, but business use cases always expect clean and structured data.

A Medallion Architecture makes this relatively easy to manage and understand. This general pattern is very good at reducing the downstream impact of reporting through separation of data ingestion, cleaning, and consumer-ready datasets; it also pairs with data governance and data security. Compartmentalization is a core concept.

Lakehouse architectures typically apply this structure to files stored in cloud storage services like S3 or GCS. But this structure is not always necessary, and can be overly complex for many businesses. Instead, you can apply these same logical distinctions within your data warehouse. Companies following a Lakehouse architecture often push the Gold layer into a data warehouse for easier consumption by BI tools and business users.
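To make the layering concrete, here is a minimal, hypothetical sketch of a Medallion-style dependency chain expressed as Dagster assets. The asset names and the in-memory "transformations" are ours for illustration; a real platform would read from and write to actual storage at each layer.

Python

from dagster import asset

@asset
def bronze_events() -> list[dict]:
    # Bronze: raw, possibly schemaless records stored as received (stubbed here).
    return [
        {"user": "ada", "event": "signup", "ts": "2024-10-01"},
        {"user": None, "event": "signup", "ts": "2024-10-01"},
    ]

@asset
def silver_events(bronze_events: list[dict]) -> list[dict]:
    # Silver: cleaned and validated records derived from the Bronze layer.
    return [e for e in bronze_events if e["user"] is not None]

@asset
def gold_daily_signups(silver_events: list[dict]) -> dict:
    # Gold: a consumer-ready aggregate for BI tools and business users.
    counts: dict[str, int] = {}
    for e in silver_events:
        counts[e["ts"]] = counts.get(e["ts"], 0) + 1
    return counts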

Where the Lakehouse architecture uses structured cloud storage service buckets, a data warehouse would instead segregate these datasets by schema.

Modern warehouses like BigQuery, Databricks, Snowflake, and ClickHouse are blurring the boundaries between Data Lakehouse and Data Warehouse. Regardless of your technology choice, segregating your data based on how well it has been validated, cleaned, and transformed is always a good decision.

A bonus of this structure is the separation of reads and writes. Though it adds some overhead, separating these two operations allows for a more stable platform for a few reasons. This separation sets the foundation for more complex CI/CD workflows like Blue/Green Deployments. It also enables our platform to be resilient to upstream changes. You'll be hard-pressed to find a data engineer who has not dealt with unexpected source data changes or has not had to combine "legacy" and "modern" datasets into one table. Doing so without separating reading data and writing data is very difficult.

Hopefully, it will seem obvious that raw data can be captured in one place and materialized as clean data elsewhere. This single design decision will save you many hours of future headaches. Remember, your current data architecture has a finite lifespan and will evolve. You also should expect your input data to change over time. This is especially true when consuming data from a transactional database. Isolating raw data allows you to abstract away the processing needed for handling miscellaneous edge cases or combining legacy datasets with their current versions. You cannot always rely on upstream systems to provide consistent input data; instead, you should design your data platform to accommodate these inevitable changes.


Designing a Well-Configured Platform
A well-configured platform makes building a composable and extensible data platform easier, but what constitutes "well-configured"?

Regardless of your data platform's stage of maturity, you should be designing a system that addresses specific use cases for your team. This will require supporting common functionality typically found within mature software products. Some of that functionality may be bespoke, some from OSS tools, and some from paid vendor tooling.

Well-configured platforms avoid vendor lock-in. To a lesser degree, they may also avoid "tooling" lock-in. In the extreme case, they may even limit your dependency on any single tool in the entire system.

Avoiding lock-in applies to paid software solutions and open-source tooling choices. There are plenty of "modern" data tools, and every tool category has many options. Maintaining your ability to change data pipeline vendors, transformation tooling, or reporting and metrics layers provides a reasonable level of "future-proofing" without requiring a large upfront investment.

So, what components should we consider when designing our future-proofed data platform?

At a high level, your data platform is likely to incorporate tooling for the following components:

01 Orchestration
02 Data Cataloging
03 Storage
04 Replication
05 Ingestion
06 Transformation
07 Observability
08 Data Versioning


You might find that a single tool covers more than one of these categories. But you can certainly find specialized tools for each component as well.

Additionally, as platforms evolve, you may find that multiple solutions are required for any given category. For example, you may require storage in multiple cloud providers, various compute platforms to handle the transformation of data with differing volumes, or different technologies for replication of data from varying sources.

Each of the components of our well-configured platform has many options. Let's run through some popular choices for each category:

01 Orchestration

Orchestrators vary greatly, but they all allow you to automatically execute a series
of steps and track the results. Typically, they coordinate the inputs and outputs of
your platform.

Popular
Choices:

02 Data Cataloging

Observation and tagging of your data is crucial so that stakeholders and analysts can
better understand the data that they use day-to-day. It is common for the catalog
to be a joint effort between the data platform team, and the data governance team,
ensuring that cataloging is standard, secure, and compliant to organizational policies.

Popular
choices:


03 Storage

Storage can take many shapes, including databases, warehouses, blob storage, and various file formats. Depending on your compute engine and how you want to perform transformations, some options are cheaper and better suited.

Popular
Warehouses:

Popular
Cloud Storage:

Popular
File Formats:

Popular
Table Formats:

04 Replication

If you want to get data out of one storage location and make it available in another, you'll need some replication mechanism. You may be able to do this with some custom code run on your Orchestrator, but there are many purpose-built options to use instead. Change-Data-Capture is a common choice for database replication.

Popular
Choices:


05 Ingestion

Many replication tools also work well for general-purpose ingestion, but it's important that your ingestion tooling supports a wide range of inputs - APIs, different file formats, and webhooks are only a small sample.

Popular
Choices:

Note
The lines can blur between replication and ingestion tools, and some tools may be
suitable for both!

06 Transformation

The data coming in from your Replication and Ingestion tools will most likely need to
be cleaned, denormalized, and made ready for business users. Transformation tooling
is responsible for this. While any language is appropriate, SQL is the most common.

Popular
Choices:

07 Observability

Your Orchestrator should provide some observability features, but purpose-built tooling goes a layer deeper. These tools are typically used to proactively monitor your input and output data and alert when anomalies or expensive access patterns are found.

Popular
Choices:


08 Data Versioning

When working in different environments, or as your data evolves, it can be nice to maintain a history of the changes being made. Data versioning tools can provide a layer on top of your data storage to enable this functionality.

Popular
Choices:

It's important to note that a given data platform may not need a tool for every one of these categories, but all data platforms utilize tooling - either vendored, open-source, or home-grown - to address these areas of concern.

These tools, however, are not terribly useful on their own. By themselves, they are specialized technology choices that may not fit well within your organization. But, when combined into a data platform, they form a mutually exclusive and collectively exhaustive set of features that provide a strong foundation for platform development.

The key to going from just a collection of tools to an effective platform is incorporating these tools within well-understood design patterns and architectures. And that topic is coming up shortly.


Must-have Features


There are many ways to design a data platform, and many pieces that make up that whole. And, though there are plenty of tools available, some features should be considered standard.

These quality-of-life features should be required within all data systems, and are often the differentiator between a robust data platform and a collection of one-off scripts:

• Logging
• Retries
• Backfilling
• Self-healing pipelines

These features can drastically improve the reliability of your data platform: some upstream data producers or teams can be unpredictable, and outages can occur for services or compute environments. Without these common features, the trust in your data and data platform can be significantly tarnished.

Restarting failed pipelines and services, inspecting logs, and backfilling datasets are common, repeated tasks familiar to every data engineer (a minimal retry helper is sketched at the end of this section). Before you jump into designing how you'll process your data inputs, you need to address these fundamental pieces. Often these fall under the purview of your orchestrator, but not always.

These tools help you support the evolution of your technology stack without compromising the standard of software your team expects. You would seldom release an application to production without runtime logging or observability tooling. Data platforms are no different.

It should be clear by this point that there are many parts that make up the "whole" of a data platform. Whether tools, pipeline design, or general data architectures, there is no shortage of options or decisions to make.

But don't get confused - many of the tools and topics we covered in this book complement one another. The data pipeline designs we covered are not mutually exclusive and can be used alongside one another. Most of these patterns are just specific names for familiar and common activities found in countless software applications.

Designing an effective data platform from scratch can be challenging, and there are many contradictory examples online. But rather than focusing on tool selection and losing the forest for the trees, it's critical that you recognize the components of your data platform are tools to solve various problems.

Often, teams in and across organizations are solving very similar problems. While the domains and datasets may be different, there is a common pattern to the architectures and design principles that are used to build a robust data platform.
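As one concrete example of the features above, here is a minimal, illustrative retry helper with exponential backoff. It is a generic sketch rather than a recommendation of a specific library; most orchestrators (Dagster included) ship retry policies out of the box, which you should prefer when available.

Python

import time
from functools import wraps

def retry(max_attempts: int = 3, base_delay_seconds: float = 1.0):
    """Retry a flaky callable with exponential backoff between attempts."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # retries exhausted; surface the original error
                    time.sleep(base_delay_seconds * 2 ** (attempt - 1))
        return wrapper
    return decorator

@retry(max_attempts=5)
def fetch_upstream_export() -> bytes:
    # Stand-in for a call to an unpredictable upstream producer.
    raise TimeoutError("upstream not ready yet")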

Chapter 05: Data Quality
A core component of data platforms is data quality, or
more appropriately, a framework for enforcing data quality
standards on the data produced by the pipelines in the data
platform. This is because data quality metrics are indicators
that are used to determine if data is suitable for use in
decision-making.

"Data is used to both run and improve the ways that the organization achieves their business objectives … [data quality ensures that] data is of sufficient quality to meet the business needs."

David Loshin, The Practitioner's Guide to Data Quality Improvement

Having a framework defined for how to enforce data quality, along with a set of standards for what is considered acceptable, is a crucial step in building a data platform that an organization can trust. Defining these standards often originates with the governance team. However, well-defined data quality standards are usually the result of cross-organization, multi-team collaboration. Many rules come from the stakeholders who have expertise in the underlying data and reporting needs, but it is common for these stakeholders not to have a full understanding of the origin of their data. This is why it's important for the definitions of data quality to be the result of a team effort.


The Dimensions of Data Quality

There exist six commonly defined dimensions for defining rules of data quality: timeliness, completeness, accuracy, validity, uniqueness, and consistency. These dimensions can be referenced when defining rules of data quality and implementing quality checks.

Timeliness

Timeliness refers to how up-to-date data is. If data is being produced by an upstream system every hour, and the downstream models haven't been updated in the past week, then this report is not timely. On the contrary, if you have data being produced hourly, and the replication and modeling of that data is triggered just after it is produced, then that is a timely pipeline.

Example: An organization expects a financial report to land in an S3 bucket on a weekly basis; however, this report hasn't been received for the past month.

Impact: Data not received within the expected latency window can result in operational impacts and inaccuracies in analysis and reporting.

date_week_end | date_report_received | status
2025-01-01    | 2025-01-01           | ON_TIME
2025-01-08    | 2025-01-08           | ON_TIME
2025-01-15    | 2025-01-15           | ON_TIME
2025-01-22    | NULL                 | MISSING
2025-01-29    | NULL                 | MISSING


Completeness

No required field should be missing for a record of data, whether that be a column in a database table or an attribute of newline-delimited JSON. If an attribute is indicated as optional, then completeness may not apply; however, validation of these optional fields may be more complex, as completeness still applies to them, but only when they are populated.

Example: A customer record — first name, last name, e-mail address.

Impact: Missing customer information can result in skewed analysis.

user_id | first_name | last_name   | email
1       | Augustus   | Gloop       | [email protected]
2       | Violet     | Beauregarde | NULL
3       | Charlie    | Bucket      | [email protected]

Accuracy

Data values should be accurate to the expected numerical or categorical values for a given data point.

Example: A medical record indicates the allergies of a given individual; this data should be consistent and accurate across systems, and be an accurate representation of the real world.

Impact: Inaccurate data can significantly impact downstream reporting and real-world processes.

first_name | last_name   | allergies
Augustus   | Gloop       | Chocolate
Violet     | Beauregarde | Blueberries
Charlie    | Bucket      | -

Validity

Categorical data entries must adhere to an expected list of values, and values must match the expected structure or format, for example, through pattern matching or schema validation.

Example: A bank maintains a table of customer accounts with an account type field that can be either checking or savings, but a value is incorrectly entered as loan.

Impact: Failures and errors in transactional processing.

user_id                  | account_type | balance
13ed536c-851e-41ae-8cf1- | checking     | 530.44
4d5829a4-700f-419b-8b2f- | checking     | 2.50
ac3ea5af-92d6-48fe-a81a- | loan         | 3.75

Uniqueness

Data values must be free of duplicates where none are expected, for example, for values that are primary keys in a relational database.

Example: An online retailer maintains a list of products with the expectation that Stock Keeping Units (SKUs) are unique, but there is a duplicate entry.

Impact: Duplicate SKUs could result in inventory tracking errors, incorrect product listings, or order fulfillment mistakes.

sku     | name                 | price_usd | weight_lbs
CW21001 | Fizzy Lifting Drinks | 12.99     | 0
CW21001 | Everlasting          | 4.95      | 0.05
CW21002 | Magnet Bites         | 6.72      | 0.10
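Checks for dimensions like uniqueness and completeness are often only a few lines of code. Here is a hedged pandas sketch over a small product table like the one above; in a real platform the same assertions would typically run inside your quality framework or warehouse rather than in ad hoc scripts.

Python

import pandas as pd

products = pd.DataFrame(
    {
        "sku": ["CW21001", "CW21001", "CW21002"],
        "name": ["Fizzy Lifting Drinks", "Everlasting", "Magnet Bites"],
        "price_usd": [12.99, 4.95, 6.72],
    }
)

# Uniqueness: primary-key style columns should contain no duplicates.
duplicates = products[products["sku"].duplicated(keep=False)]
if not duplicates.empty:
    # With this sample data the duplicate SKU above is flagged here.
    print(f"Uniqueness check failed: {len(duplicates)} rows share a SKU")

# Completeness: required fields should never be null.
missing_names = int(products["name"].isna().sum())
if missing_names:
    print(f"Completeness check failed: {missing_names} products missing a name")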


Consistency

Data is to be aligned across systems and sources – often tied to replication of data from a source system to a data warehouse.

Example: A data team replicates a Postgres table to Delta Live Tables for analytics; however, the schema of the replicated table incorrectly uses an integer type whereas the upstream data uses floats.

Impact: Incorrect reporting and analysis of data can occur, resulting in invalid metrics.

Source dataset:
measurement_time | temperature | humidity_pct | pressure_inhg
1722885084       | 92.50       | 78.44        | 33.40
1722898737       | 92.55       | 78.78        | 33.75

Replicated dataset:
measurement_time | temperature | humidity_pct | pressure_inhg
1722885084       | 92          | 78           | 33
1722898737       | 92          | 78           | 33


Tools and Technologies


In enforcing data quality standards on a data platform, there are a wide variety of tools available within the data ecosystem. Some tools are better suited to specific data processing frameworks; for example, the Deequ framework works particularly well with Apache Spark, whereas other tools are more general, such as Great Expectations, which may work with a number of data tools and formats. However, the enforcement of rules should remain consistent across tools.

A non-exhaustive list of tools that we've found to be prevalent in data platforms includes:

• Soda: data quality testing for SQL-, Spark-, and Pandas-accessible data
• Great Expectations: a general-purpose data validation tool with built-in rulesets
• Deequ: a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets
• dbt tests: assertions you make about your models and other resources in your dbt project
• Dagster asset checks

A benefit of tools like Soda and Great Expectations is that they provide a large set of rules out-of-the-box. Instead of having to write custom logic for validating the six data quality dimensions that were previously mentioned, it is possible to apply the logic of already existing rules.

Ultimately, it's important to choose the data quality framework that is most appropriate for the other tools in your data platform, and it is perfectly acceptable to compose multiple tools that suit your needs. But it's important to design your platform in such a way that you have a central observation layer over these disparate tools and frameworks, likely done at the orchestrator, so that the stakeholders can have a strong understanding of and confidence in their data.
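As an illustration of the last item, here is a minimal, hypothetical Dagster asset check. The CSV path and column names are stand-ins; the same shape applies whether the check body is custom pandas logic, as here, or a call into a framework like Soda or Great Expectations.

Python

import pandas as pd
from dagster import AssetCheckResult, asset, asset_check

@asset
def orders() -> None:
    # Stand-in for a pipeline step that produces the orders.csv extract.
    pd.DataFrame({"order_id": [1, 2, 3], "amount": [9.99, 24.50, 3.75]}).to_csv(
        "orders.csv", index=False
    )

@asset_check(asset=orders)
def order_id_has_no_nulls() -> AssetCheckResult:
    # A completeness-style check evaluated after the asset materializes.
    df = pd.read_csv("orders.csv")
    num_nulls = int(df["order_id"].isna().sum())
    return AssetCheckResult(passed=num_nulls == 0, metadata={"num_nulls": num_nulls})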


Enforcement of Data Quality


The full data lifecycle

It should be noted that the enforcement of data quality should occur throughout all stages of the data lifecycle. This includes the application layer, data replication, the analytical layer, and reporting.

To provide an example, imagine you are a SaaS company that maintains a list of users. The e-mail address that is provided by the user should be verified at the application layer, on both the client side and the server side.

When this data is replicated into an analytical warehouse, validation should also occur to ensure that the data that is present in the warehouse matches what exists in the source database table. It is important to ensure that the record exists and that no duplicate records have been created by some faulty replication logic. Additionally, while the e-mail address should already be structured in a valid format, for defensive data quality enforcement, the structure of the address should also be validated at the warehouse layer.

In the analytical layer, it is common for many downstream models to be produced that reference application data, like this user table. It's important to validate that the representation of this user is accurate and that there are no faults in how these models are being formed or how this data is being joined with other datasets. This can be particularly tricky as data from many datasets comes together.

Finally, these models will likely be surfaced in some way, whether that be through dashboards, some kind of alerting, or through reverse-ETL loads back into the application database. It is important to validate that the data being surfaced meets the expectations of the business.


A central control plane (orchestrator)

When constructing a data platform, having a central control plane, like an orchestrator, makes it much easier to enforce data quality standards at a high level and throughout all stages of the data lifecycle.

There are several benefits to this approach, one being that the data platform owner can act as the overseer of the standards. If the organization has a number of teams, each with its own set of rules, these can be consolidated into a single playbook for enforcing standards by the data platform owner through the orchestrator.

Additionally, because the orchestrator is responsible for determining which jobs should run, if any data quality issues arise it has control over the downstream jobs that may surface data to a dashboard or application. By having full visibility into the quality of data at the various points of its life in the orchestrator, reporting of this problematic data can be prevented.

Finally, the orchestrator is also home to system-wide alerting, along with updating of data in data catalogs. The orchestrator can keep the engineers and platform owners informed of problematic data in a standard way for on-call duties.

Chapter 06: Example Data Platforms
To better understand how one
might architect a data platform,
let’s walk through several example
data platform stacks. In these
examples, we outline the origin
of data, the warehouse, and the
orchestrator, how data is ingested,
how it’s transformed, and what is
used for reporting. There are many
ways to architect a data platform;
however, these examples provide
a referential starting point.


The Lightweight Data Lake

Source Data: S3, various third-party
Warehouse: Snowflake
Orchestrator: Dagster
Ingestion: Stitch, Sling
Transformation Framework: dbt
Reporting: Holistics

A hypothetical SaaS company is used by thousands of e-commerce stores – their core application uses MySQL for transactional data, and transactional events are captured and written to S3 via AWS Kinesis.

The company used Stitch to ingest third-party data, but relied on a custom pipeline orchestrated with Dagster to process and ingest the data in S3.

This traditional E-T-L method allowed for in-transit aggregation, reducing the compute load on the warehouse and limiting exposure of PII in the pipeline. The decision to aggregate in-process was intentional, as downstream users were only interested in aggregate trends based on the event data.

Third-party data was processed in the typical E-L-T pattern.


The GCP Stack

Source Data: Postgres (Cloud SQL)
Warehouse: BigQuery
Orchestrator: Airflow (Cloud Composer)
Ingestion: Custom, Datastream
Transformation Framework: Dataform
Reporting: Looker

This company opted for a simple GCP-native architecture due to its existing usage of GCP. This stack leans heavily toward simplicity and convenience, favoring readily available tools rather than third-party vendors.

A common practice is to use Airflow (Cloud Composer, in this case) for simple E/L operations that rely on simple SQL queries. This works fine for scrappy, fast-paced development but can quickly get out of hand. The move to Datastream is a prudent one, specifically for replicating transactional database tables.


The AWS Stack

Source Data: Amazon RDS, DynamoDB
Warehouse: Redshift
Orchestrator: Airflow (MWAA)
Ingestion: Lambda, DMS
Transformation Framework: AWS Glue
Reporting: Amazon QuickSight

In this stack we've elected to use the services explicitly provided by Amazon, with source data coming from DynamoDB and Amazon RDS, and replication occurring through custom Lambda processing and AWS DMS. Data is stored in S3 cloud storage and transformed using AWS Glue. Finally, analysis and reporting occur with Amazon Athena and Amazon QuickSight.


MDS by the Book

Source Data: Postgres
Warehouse: Snowflake
Orchestrator: Dagster
Ingestion: Fivetran, dlt, Sling
Transformation Framework: SDF
Reporting: Looker, Hex

The "Modern Data Stack" is popular and used by thousands of companies; it is often a combination of cloud provider tools, open source frameworks, and data services. A common theme of the modern data stack is that it is a composition of several smaller purpose-built tools, with an orchestrator acting as a unified control plane and single observation layer.

This stack falls squarely into the E-L-T pattern, with heavy reliance on third-party vendors. It is very fast to get started.


Data Lakehouse

Source Data: Postgres, S3
Warehouse: -
Orchestrator: AWS Step Functions
Ingestion: AWS Glue
Transformation Framework: Spark, Athena
Reporting: Looker

If you are dealing with extremely large datasets, you may be interested in foregoing a traditional data warehouse in favor of a compute engine like Spark. This company processed many millions of events on a daily basis and relied heavily on Spark for large-scale computation.

In lieu of a warehouse, data in S3 was indexed by AWS Glue, transformed with AWS EMR (Spark), and consumed downstream through Athena.

This architecture is not complicated, but it is a significant deviation from the data stacks relying on cloud warehouses for storage and computing.


Event Driven

Source Data: SFTP, S3
Warehouse: BigQuery
Orchestrator: -
Ingestion: AWS Lambda, AWS Fargate
Transformation Framework: dbt
Reporting: Omni

Event-driven design does not necessarily need an orchestrator; the events themselves "orchestrate" the execution of the pipeline. In this company's case, files landed in SFTP and were eventually copied to S3.

S3 provides native functionality to trigger AWS Lambda functions when new files arrive. As an alternative, long-running processes were also executed via AWS Fargate when the AWS Lambda runtime limits were not sufficient. The key consideration is that most of these components run within Docker containers.

These files were ingested into BigQuery and ultimately transformed with dbt. End users consumed the data through Omni dashboards.

Event-driven architectures are a way to process data in a near-real-time fashion, but they have their downsides. Re-execution of the pipeline can be tedious, and missed or dropped events may go unnoticed without sufficient checks in place.

Conclusion
That concludes the first edition of Fundamentals of Data Platforms. We've covered many topics, from an introduction to what a data platform is, to common architecture and design patterns, to important concepts like data quality, and we even dove into some examples of tools and architectures.

The data ecosystem is broad, but hopefully this introduction gives you the information that you need to get started in designing and building a data platform for your organization, or even if you're just tinkering with a side project.

Also keep in mind that tools in the data ecosystem are always evolving, but don't fret, as many of the core architectures and design patterns covered in this book have stood the test of time.

Thank you for taking the time to read Fundamentals of Data Platforms, and we hope you learned something new.

Meet the authors

Joe Naso
Joe is a Fractional Data Engineer. He helps SaaS companies build reliable data platforms and monetize their data. He's previously managed multiple data teams at different startups. You can find him talking about the intersection of data and business on Substack, or you can give him a shout on LinkedIn.

Colton Padden
Colton is a data engineer and developer advocate with experience building data platforms at Fortune 500 institutions, government agencies, and startups. He is currently employed by Dagster Labs, helping build and educate engineers on the future of data orchestration, and is a strong advocate for open source software and community.
