Data Platform Fundamentals
Joe Naso and Colton Padden
Table of Contents

Chapter 01: Data Platforms
  Introduction
  An Open Data Platform
  Why do we need data platforms?
  The Control Plane
  Benefits of Data Platforms
  Observability and Developer Experience
  Tools and Technologies

Chapter 02: Architecting Your Data Platform
  Scaling with the Business
  Composability and Extensibility
  Common Data Architectures
    Extract-Transform-Load (ETL)
    Extract-Load-Transform (ELT)
    Data Lakehouses
    Event-Driven

Chapter 03: Design Patterns
  Design Patterns for Data Pipelines
    Push
    Pull
    Poll
  Idempotency

Chapter 04: Data Modeling

Chapter 05: Data Quality
  The Dimensions of Data Quality
    Timeliness
    Completeness
    Accuracy
    Validity
    Uniqueness
    Consistency
  Tools and Technologies
  Enforcement of Data Quality
    The full data lifecycle
    A central control plane (orchestrator)

Chapter 06: Example Data Platforms
  The Lightweight Data Lake
  The GCP Stack
  The AWS Stack
  MDS by the Book
  Data Lakehouse
  Event Driven

Conclusion
Meet the authors
Chapter 01: Data Platforms
Introduction
Many companies follow a similar maturity curve in data engineering. Initial, small-scale automations grow to support large-scale orchestration needs. Eventually, organizations recognize the need for dedicated ownership—a team or individual tasked with building and maintaining a scalable platform to support their automation and data pipeline requirements.

While organizations may have very similar data needs, headcount, and workflows, it's common for the platforms they build to be significantly different. This is because no distinct solution exists for building a data platform. This can come down to decisions about their cloud provider, preference for open or closed-source software, and budget. Although the underlying technologies may be unique, these organizations often follow common, well-known patterns and principles to design their data platforms. We will review some of those patterns and principles in this e-book.

Dagster believes in an open data platform that is heterogeneous and centralized. It should support specific business use cases and accommodate various data storage and processing tools, all while providing a central unified control plane across all of the processes in the organization.

Only one data platform should exist in an organization. While a single platform may require a larger upfront investment, the standardization it provides and productivity boost for the team pay dividends as the platform grows.

An extensible platform provides better business alignment and less duplicated work. For more information on what Dagster believes about data platforms, see the "What Dagster Believes About Data Platforms" blog post.

This book will be a comprehensive guide for data platform owners looking to build a stable and scalable data platform, starting with the fundamentals:

Architecting your Data Platform
Design Patterns and Tools
Observability
Data Quality

Finally, we'll tie it together with real-world examples illustrating how different teams have built in-house data platforms for their businesses.
Chapter 02: Architecting Your Data Platform
Composability and Extensibility
Rapidly growing datasets, tightened SLAs, and prohibitive costs are common drivers for platform redesign and architecture overhauls. A composable and extensible platform makes meeting these requirements possible without an expensive redesign.

Requirements often change at organizations, and the data platform needs to be able to react. A startup may find that as it grows and its data volume increases, its initial set of tools is no longer able to handle the additional traffic.

As the lifespan of the tools within a data platform is tied to the needs of the business, the goal should not be to avoid platform changes but to design a platform that makes operating under this constraint manageable.
If you build an extensible and composable platform, you can continue to operate at a high level with minimal downtime. You also afford yourself the flexibility of updating your platform without migrating major system components.

Changing transformation tooling, pipeline design, and migrating data stores are common activities for any Data Platform Owner. The ease with which you can execute those changes depends on the platform's design.

What does an extensible platform look like in the real world? Often, it comes down to abstractions.

You have probably heard the phrase "prefer composition over inheritance" in the context of software application design. This applies to data platform design in much the same way, in that a data platform typically comprises multiple purpose-built tools.

Often the composition of data tools is abstracted and hidden from the downstream analysts and data engineers, allowing these individuals to be more efficient in completing their tasks. Data engineering is software engineering after all, and the design and composition of tools, pipelines, and models should be aligned with software engineering best practices.
Common Data Architectures
Extract-Transform-Load (ETL)
This architecture is the simplest and oldest approach on this list. Data starts from an origin datastore, often a production database, and is transformed in transit before being written to a final storage mechanism. Typically, you'll see aggregations and other denormalization applied to these datasets before they are written to their final destination.

By design, the state of the data in its "final" form is different from what is extracted from its origin. This design is still commonplace, and can be quite effective at reducing compute usage at the destination. The tradeoff is that you may lose the granularity of data, and your downstream usage is limited by the logic applied in transit.
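To make this concrete, here is a minimal ETL sketch in Python. The source table, column names, and CSV destination are hypothetical stand-ins for a production database and a final storage location; the point is that only the aggregated result, not the raw rows, reaches the destination.

Python

import csv
import sqlite3
from collections import defaultdict


def extract(source: sqlite3.Connection) -> list[tuple]:
    # Pull raw rows from the origin datastore (a production DB in practice).
    return source.execute("SELECT customer_id, amount FROM orders").fetchall()


def transform(rows: list[tuple]) -> dict[int, float]:
    # Transform in transit: aggregate order amounts per customer.
    totals: dict[int, float] = defaultdict(float)
    for customer_id, amount in rows:
        totals[customer_id] += amount
    return totals


def load(totals: dict[int, float], destination: str) -> None:
    # Write only the aggregated, denormalized result to final storage.
    with open(destination, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["customer_id", "total_amount"])
        writer.writerows(sorted(totals.items()))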
Extract-Load-Transform (ELT)
This design has grown dramatically in popularity thanks to the Modern Data Stack and its modular design. It has also become commonplace thanks to the advent of Massively Parallel Processing databases and their modern counterparts.

Instead of applying transformations to your data before it lands in your warehouse, you retain a raw copy of that data as it arrives. This design pattern lends itself to various extraction methodologies, including change data capture (CDC) and ingestion of data from APIs or file servers. As an example, in the case of CDC, all of the operations that are performed on a dataset, like a table in Postgres, are captured and loaded into storage. Then, that data can be reconstructed in the transformation step, reconstituting the original dataset from the operations that took place. Compare this to ETL, where the raw change capture would be lost because the transformation takes place as the data is loaded into storage.

Because the raw data is retained, it is possible to recreate the transformations and models of this data, ultimately providing more flexibility. But this also results in additional computation and increased storage costs.
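By contrast, here is a minimal ELT sketch under the same assumptions, using sqlite3 as a lightweight stand-in for both the source database and the warehouse: the raw rows land untouched, and the aggregation happens afterwards inside the warehouse, where the raw copy remains available for remodeling.

Python

import sqlite3


def extract(source: sqlite3.Connection) -> list[tuple]:
    # Pull raw rows from the origin datastore.
    return source.execute("SELECT customer_id, amount FROM orders").fetchall()


def load_raw(warehouse: sqlite3.Connection, rows: list[tuple]) -> None:
    # Land the data as-is in a raw table; no logic is applied in transit.
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders (customer_id INTEGER, amount REAL)"
    )
    warehouse.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)


def transform(warehouse: sqlite3.Connection) -> None:
    # Transform inside the warehouse; the raw copy stays available for
    # remodeling later, at the cost of extra storage and compute.
    warehouse.execute(
        """
        CREATE TABLE IF NOT EXISTS customer_totals AS
        SELECT customer_id, SUM(amount) AS total_amount
        FROM raw_orders
        GROUP BY customer_id
        """
    )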
Data Lakehouses
Typically, data is written to columnar file formats like Parquet and stored in cloud storage like S3 or GCS. Recently, Iceberg, Delta, and Hudi formats have also been gaining popularity. In the case of Delta, this is an additional layer on top of Parquet in which additional metadata and transaction logs are included, enabling support for ACID (atomicity, consistency, isolation, durability) transactions.

These files typically follow a Medallion Architecture, which we'll cover in more detail later in this book. These systems often use specialized analytics engines, such as Spark, to transform the data in these files into clean and consumable datasets. The final output is often written for use in a data warehouse or otherwise exposed to downstream consumers.
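As a small illustration of the storage layer, here is a sketch, assuming the pyarrow library, of raw events written as a Parquet file; the event schema is hypothetical, and a local filename stands in for an object-store path on S3 or GCS.

Python

import pyarrow as pa
import pyarrow.parquet as pq

# Raw events destined for the landing area of the lake.
events = pa.table(
    {
        "event_id": [1, 2, 3],
        "event_type": ["click", "view", "click"],
        "occurred_at": [
            "2024-10-01T00:01:00Z",
            "2024-10-01T00:02:30Z",
            "2024-10-01T00:05:10Z",
        ],
    }
)

# In practice the destination would be an object-store path (for example a
# dated prefix in an S3 bucket), often with a table format such as Delta,
# Iceberg, or Hudi layered on top of the Parquet files.
pq.write_table(events, "events_2024-10-01.parquet")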
Event-Driven
Streaming and "real-time" data platforms blend event-driven design patterns with some of the components of the architectures mentioned above.

In an event-driven paradigm, services exchange data and initiate their workload through external triggers rather than relying on the coordination provided by a single orchestrator. Some examples of triggers include messages landing in a Kafka message queue, a call to a webhook, or an S3 event notification.

These architectures often use specialized tools to enable stream processing for real-time and rolling-window transformations. This approach is often paired with the Lakehouse architecture to serve analytical workflows for a wide range of SLAs.

Often streaming can be paired with batch processing, in that rolling computations can take place on the event stream, and persisted events landing in cloud storage can be later batch processed for additional reporting.

Some considerations of stream-based processing are the expected cost of long-running compute, the additional complexity of event processing, and the requirement for specialized tools that may not conform to the existing tools being used at your organization. However, when real-time analytics are required for insights that need to happen fast, like fraud detection, then streaming is a fantastic option.
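As a small illustration of the trigger-driven style, here is a sketch of an AWS Lambda handler reacting to an S3 event notification; the downstream processing function is a hypothetical placeholder.

Python

import json
import urllib.parse


def process_new_file(bucket: str, key: str) -> None:
    # Hypothetical downstream work: parse the file, append it to the lake,
    # or kick off the next stage of processing.
    print(f"processing s3://{bucket}/{key}")


def handler(event: dict, context) -> dict:
    # An S3 event notification delivers one record per created object.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        process_new_file(bucket, key)
    return {"statusCode": 200, "body": json.dumps("ok")}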
Chapter 03: Design Patterns
Design Patterns for Data Pipelines

01 Push
02 Pull
03 Poll
For push-based systems, the consumer can dictate the schema of the data that is being received; however, a common complexity in these systems is the synchronization of schema definitions between producers and consumers. In these cases, schema registries can come in handy for defining a single source of truth for the expected format of data.

02 The producer performs some form of processing or creation of data
03 The producer pushes their datasets to a storage location like S3, GCS, or an SFTP Server
04 The consumer ingests that new dataset on their own schedule
Idempotency
Many of the design patterns that were discussed above benefit from following an idempotent design. Idempotency is the idea that an operation, for example a pipeline, produces the same output each time the same set of parameters is provided. This is often a great practice in designing data pipelines, as it creates consistency in processing jobs, but it is especially important for the poll-based data access pattern.

Determinism in software engineering is the expectation that the output of a function is based only on its inputs: there are no hidden state changes as a result of calling that function.

For example, let's look at two snippets of Python code: one that is not idempotent, and one that is. In the first example, non_deterministic_function takes a location and a destination parameter and writes the current temperature to a file. Each time this function is called, the output will be different, as it relies on an always-changing metric returned from the MyWeatherService.
Not Idempotent
Python
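# A sketch of the function described above; MyWeatherService is a hypothetical
# client, stubbed out here so the example runs on its own.
import random


class MyWeatherService:
    """Stand-in for an external weather API."""

    def get_current_temperature(self, location: str) -> float:
        # A real client would call an API; a random value mimics an
        # always-changing reading.
        return round(random.uniform(-10.0, 35.0), 1)


def non_deterministic_function(location: str, destination: str) -> None:
    # Writes the *current* temperature to a file. Calling it twice with the
    # same arguments can produce different output, because the result depends
    # on hidden, always-changing state rather than only on the inputs.
    temperature = MyWeatherService().get_current_temperature(location)
    with open(destination, "w") as f:
        f.write(f"{location},{temperature}\n")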
Now, let's compare that to something that is idempotent.

Idempotent

Python
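# A sketch of the idempotent counterpart described in the text; it extends the
# hypothetical MyWeatherService with a historical lookup so the same inputs
# always yield the same value.
from datetime import datetime, timezone


class MyWeatherService:
    """Stand-in for an external weather API with historical lookups."""

    def get_temperature_at(self, location: str, timestamp: datetime) -> float:
        # A real client would query historical data; this deterministic
        # formula stands in, so identical inputs always return the same value.
        return float((sum(ord(c) for c in location) + timestamp.toordinal()) % 40 - 5)


def idempotent_function(location: str, destination: str, timestamp: datetime) -> None:
    # The point in time is an explicit parameter rather than "now", so every
    # call with the same arguments writes the same contents to the file.
    temperature = MyWeatherService().get_temperature_at(location, timestamp)
    with open(destination, "w") as f:
        f.write(f"{location},{timestamp.isoformat()},{temperature}\n")


# Repeated calls with the same timestamp produce identical output files.
idempotent_function(
    "Boston", "boston_temperature.csv", datetime(2024, 10, 1, tzinfo=timezone.utc)
)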
This example differs in that our idempotent_function requires a timestamp parameter. So instead of getting the temperature at a given location at the time the function is called, we are responsible for explicitly providing the time, and each time this function is called with that specific timestamp, it should return the same value.

Practically, you want your pipelines to produce the same result given the same context. Making this happen - or failing to do so - is often at the root of pipeline complexity.

The means of doing this can be relatively simple. At its most basic, this may mean recording IDs, filenames, or anything that uniquely identifies your input data that has already been processed by the data pipeline. This can often be done by leveraging the context of an orchestrator, for example, the trigger time of a pipeline, or the categorical partition data that can be used as a parameter to your processing code.

For instance, when running a batch pipeline daily, each execution would not look at the last 24 hours of data. Instead, it would use a specific 24-hour period between two specific "bookends."

By using an explicit window of time rather than something relative to the execution of the job, you remove one potential cause of pipeline drift. Even the best orchestrators can be resource-constrained. Also, reducing your exposure to slightly delayed jobs is a small but essential implementation detail as your platform grows.
Python

from dagster import AssetExecutionContext, DailyPartitionsDefinition, asset


@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2024-10-01")
)
def daily_partitioned_events(context: AssetExecutionContext):
    """Each execution of this asset is configured to be tied to a specific
    day, thanks to the DailyPartitionsDefinition. Subsequent
    materializations of this asset are each tied to a specific day,
    meaning executing them more than once should result in the same
    output.
    """
    partition_as_str = context.partition_key  # 2024-10-01
    # do work on the data relevant for this specific partition
    # this may include using this partition key within a WHERE clause
Chapter 04: Data Modeling
Some patterns are universal when
it comes to data architecture and
data modeling. Both topics are
incredibly vast - and beyond the
scope of this chapter. But whether
you are implementing a Data
Lakehouse, Data Lake, or Data
Warehouse as the central data store
for your platform, you will find many
similarities in their structures.
You've probably heard of the Medallion Architecture. While the term may be most commonly associated with Data Lakehouse architectures, the core concepts are essentially the same as those commonly found in Data Warehouses that follow an ELT design.

A Medallion Architecture's fundamental concept is to ensure data isolation at specific points in its lifecycle. You'll often hear the terms "bronze", "silver" and "gold" used to describe this lifecycle. In a data warehouse context, you may see "staged", "core" and "data mart". Some others use "raw", "transformed", and "production". Regardless of the terminology used, the buckets have semantic and functional significance.

Bronze
- Raw data is stored as it is received. Ingested data may be schemaless
- Often called a "Landing Zone"

Silver
- Cleaned and structured datasets derived from the Bronze layer
- Primary validation layer for data quality

Gold
- Consumer-ready datasets, modeled for reporting and business use

Consumers typically want clean, well-structured data. This may not be the case when working with LLMs or other flavors of ML models, but business use cases always expect clean and structured data.

A Medallion Architecture makes this relatively easy to manage and understand. This general pattern is very good at reducing the downstream impact of reporting through separation of data ingestion, cleaning, and consumer-ready datasets; it also pairs well with data governance and data security. Compartmentalization is a core concept.

Lakehouse architectures typically apply this structure to files stored in cloud storage services like S3 or GCS. But this structure is not always necessary, and can be overly complex for many businesses. Instead, you can apply these same logical distinctions within your data warehouse. Companies following a Lakehouse architecture often push the Gold layer into a data warehouse for easier consumption by BI Tools and business users.
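To make the layering concrete, here is a minimal sketch, assuming Dagster and pandas are available, of how the three layers can map onto a simple asset graph; the table names and cleaning rules are hypothetical.

Python

import pandas as pd
from dagster import asset


@asset
def bronze_orders() -> pd.DataFrame:
    # Landing zone: raw data stored exactly as received (a stand-in dataframe
    # here; in practice this is the output of ingestion or replication).
    return pd.DataFrame(
        {"order_id": [1, 1, 2], "amount": ["10.00", "10.00", None]}
    )


@asset
def silver_orders(bronze_orders: pd.DataFrame) -> pd.DataFrame:
    # Cleaned and structured: drop duplicates, remove incomplete rows,
    # enforce types. This is the primary validation layer.
    cleaned = bronze_orders.drop_duplicates().dropna(subset=["amount"]).copy()
    cleaned["amount"] = cleaned["amount"].astype(float)
    return cleaned


@asset
def gold_revenue(silver_orders: pd.DataFrame) -> pd.DataFrame:
    # Consumer-ready: an aggregate that BI tools and business users consume.
    return pd.DataFrame({"total_revenue": [silver_orders["amount"].sum()]})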
Where the Lakehouse architecture uses structured cloud storage service buckets, a data warehouse would instead segregate these datasets by schema.

Modern warehouses like BigQuery, Databricks, Snowflake, and Clickhouse are blurring the boundaries between Data Lakehouse and Data Warehouse. Regardless of your technology choice, segregating your data based on how well it has been validated, cleaned, and transformed is always a good decision.

A bonus of this structure is the separation of reads and writes. Though it adds some overhead, separating these two operations allows for a more stable platform for a few reasons. This separation sets the foundation for more complex CI/CD workflows like Blue/Green Deployments. It also enables our platform to be resilient to upstream changes. You'll be hard-pressed to find a data engineer who has not dealt with unexpected source data changes or has not had to combine "legacy" and "modern" datasets into one table. Doing so without separating reading data and writing data is very difficult.

Hopefully, it will seem obvious that raw data can be captured in one place and materialized as clean data elsewhere. This single design decision will save you many hours of future headaches. Remember, your current data architecture has a finite lifespan and will evolve. You also should expect your input data to change over time. This is especially true when consuming data from a transactional database. Isolating raw data allows you to abstract away the processing needed for handling miscellaneous edge cases or combining legacy datasets with their current versions. You cannot always rely on upstream systems to provide consistent input data; instead, you should design your data platform to accommodate these inevitable changes.
Designing a Well-Configured Platform
A well-configured platform makes building a composable and extensible data platform easier, but what constitutes "well-configured"?

Regardless of your data platform's stage of maturity, you should be designing a system that addresses specific use cases for your team. This will require supporting common functionality typically found within mature software products. Some of that functionality may be bespoke, some from OSS tools, and some from paid vendor tooling. They may even limit your dependency on any single tool in the entire system.

Avoiding lock-in applies to paid software solutions and open-source tooling choices. There are plenty of "modern" data tools, and every tool category has many options. Maintaining your ability to change data pipeline vendors, transformation tooling, or reporting and metrics layers provides a reasonable level of "future-proofing" without requiring a large upfront investment.
At a high level, your data platform is likely to incorporate tooling for the following components:

01 Orchestration
02 Data Cataloging
03 Storage
04 Replication
05 Ingestion
06 Transformation
07 Observability
08 Data Versioning
You might find that a single tool covers more than one of these categories. But you can certainly find specialized tools for each component as well.

Additionally, as platforms evolve, you may find that multiple solutions are required for any given category. For example, you may require storage in multiple cloud providers, various compute platforms to handle the transformation of data with differing volumes, or different technologies for replication of data from varying sources.

Each of the components of our well-configured platform has many options. Let's run through some popular choices for each category:
01 Orchestration
Orchestrators vary greatly, but they all allow you to automatically execute a series
of steps and track the results. Typically, they coordinate the inputs and outputs of
your platform.
Popular Choices:
02 Data Cataloging
Observation and tagging of your data is crucial so that stakeholders and analysts can better understand the data that they use day-to-day. It is common for the catalog to be a joint effort between the data platform team and the data governance team, ensuring that cataloging is standard, secure, and compliant with organizational policies.
Popular choices:
03 Storage
Popular Warehouses:
Popular Cloud Storage:
Popular File Formats:
Popular Table Formats:
04 Replication
If you want to get data out of one storage location and make it available in another, you'll need some replication mechanism. You may be able to do this with some custom code run on your Orchestrator, but there are many purpose-built options to use instead. Change-Data-Capture is a common choice for database replication.
Popular Choices:
05 Ingestion
Many Replication tools also work well for general-purpose Ingestion, but it's important that your ingestion tooling supports a wide range of inputs - APIs, different file formats, and webhooks are only a small sample.
Popular Choices:
Note
The lines can blur between replication and ingestion tools, and some tools may be
suitable for both!
06 Transformation
The data coming in from your Replication and Ingestion tools will most likely need to
be cleaned, denormalized, and made ready for business users. Transformation tooling
is responsible for this. While any language is appropriate, SQL is the most common.
Popular Choices:
07 Observability
Popular Choices:
08 Data Versioning
Popular Choices:
It's important to note that a given data platform may not need a tool for every one of these categories, but all data platforms utilize tooling - either vendored, open-source, or home-grown - to address these areas of concern.

These tools, however, are not terribly useful on their own. By themselves, they are specialized technology choices that may not fit well within your organization. But, when combined into a data platform, they form a mutually exclusive and collectively exhaustive set of features that provide a strong foundation for platform development.

The key to going from just a collection of tools to an effective platform is incorporating these tools within well-understood design patterns and architectures. And that topic is coming up shortly.
Chapter 05: Data Quality
A core component of data platforms is data quality, or
more appropriately, a framework for enforcing data quality
standards on the data produced by the pipelines in the data
platform. This is because data quality metrics are indicators
that are used to determine if data is suitable for use in
decision-making.
The Dimensions of Data Quality
Timeliness
Timeliness refers to how up-to-date data is. If data is being produced by an upstream system every hour, and the downstream models haven't been updated in the past week, then that data is not timely. On the contrary, if you have data being produced hourly, and the replication and modeling of that data is triggered just after it is produced, then that is a timely pipeline.

Example: An organization expects a financial report to land in an S3 bucket on a weekly basis; however, this report hasn't been received for the past month.

Impact: Data not received within the expected latency window can result in operational impacts and inaccuracies in analysis and reporting.
Completeness
No required field should be missing for a record of data, whether that be a column in a database table or an attribute of newline-delimited JSON. If an attribute is indicated as optional, then completeness may not apply to every record; however, validation of these optional attributes may be more complex, as completeness still applies to them, but only when they are populated.

Example: A customer record with first name, last name, and e-mail address fields.

Impact: Missing customer information can result in skewed analysis.
Accuracy
Data values should be accurate to the expected numerical or categorical values for a given data point.

Example: A medical record indicates the allergies of a given individual; this data should be consistent and accurate across systems, and be an accurate representation of the real world.
Validity
Categorical data entries must adhere to an expected list of values, and values must match the expected structure or format, for example through pattern matching or schema validation.

Example: A bank maintains a table of customer accounts with an account type field that can be either checking or savings, but a value is incorrectly entered as loan.

Impact: Failures and errors in transactional processing.
Uniqueness
Data values must be free of duplicates when there are not expected to be any, for example for values that are primary keys in a relational database.

Example: An online retailer maintains a list of products with the expectation that Stock Keeping Units (SKUs) are unique, but there is a duplicate entry.

Impact: Duplicate SKUs could result in inventory tracking errors, incorrect product listings, or order fulfillment mistakes.
Consistency
Data should be aligned across systems and sources - this is often tied to the replication of data from a source system to a data warehouse.

Example: A data team replicates a Postgres table to Delta Live Tables for analytics; however, the schema of the replicated table incorrectly uses an integer type whereas the upstream data uses floats.
Tools and Technologies
Asset checks
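A minimal sketch of a Dagster asset check, assuming pandas and a hypothetical users asset with an email column:

Python

import pandas as pd
from dagster import AssetCheckResult, asset, asset_check


@asset
def users() -> pd.DataFrame:
    # Hypothetical upstream asset; in practice this would come from
    # replication or ingestion rather than an inline dataframe.
    return pd.DataFrame(
        {"id": [1, 2], "email": ["a@example.com", "b@example.com"]}
    )


@asset_check(asset=users)
def users_have_valid_emails(users: pd.DataFrame) -> AssetCheckResult:
    # A simple validity check: every e-mail must match a basic pattern.
    valid = users["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    return AssetCheckResult(
        passed=bool(valid.all()),
        metadata={"invalid_rows": int((~valid).sum())},
    )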
Enforcement of Data Quality

It should be noted that the enforcement of data quality should occur throughout all stages of the data lifecycle. This includes the application layer, data replication, the analytical layer, and reporting.

To provide an example, imagine you are a SaaS company that maintains a list of its users. The e-mail address that is provided by the user should be verified at the application layer, on both the client side and the server side.

When this data is replicated into an analytical warehouse, validation should also occur to ensure that the data present in the warehouse matches what exists in the source database table. It is important to ensure that each record exists and that no duplicate records have been created by faulty replication logic. Additionally, while the e-mail address should already be structured in a valid format, for defensive data quality enforcement the structure of the address should also be validated at the warehouse layer.

It's important to validate that the representation of this user is accurate and that there are no faults in how these models are being formed or how this data is being joined with other datasets. This can be particularly tricky as data from many datasets comes together.

Finally, these models will likely be surfaced in some way, whether that be through dashboards, some kind of alerting, or through reverse-ETL loads back into the application database. It is important to validate that the data being surfaced meets the expectations of the business.
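A minimal sketch of the replication-layer checks described above, using sqlite3 as a stand-in for both the source database and the warehouse; the users table and email column are hypothetical.

Python

import re
import sqlite3

EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_replicated_users(
    source: sqlite3.Connection, warehouse: sqlite3.Connection
) -> list[str]:
    """Return a list of data quality failures found in the warehouse copy."""
    failures = []

    # Row counts should match: nothing missing, nothing duplicated.
    src_count = source.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    wh_count = warehouse.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    if src_count != wh_count:
        failures.append(f"row count mismatch: source={src_count} warehouse={wh_count}")

    # Defensive validity check: re-validate e-mail structure at the warehouse layer.
    for (email,) in warehouse.execute("SELECT email FROM users"):
        if not email or not EMAIL_PATTERN.match(email):
            failures.append(f"invalid e-mail address: {email!r}")

    return failures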
Chapter 06: Example Data Platforms
To better understand how one
might architect a data platform,
let’s walk through several example
data platform stacks. In these
examples, we outline the origin
of data, the warehouse, and the
orchestrator, how data is ingested,
how it’s transformed, and what is
used for reporting. There are many
ways to architect a data platform;
however, these examples provide
a referential starting point.
The Lightweight Data Lake

Origin: S3, various third-party
Warehouse: Snowflake
Ingestion: Stitch, Sling
Reporting: Holistics
The GCP Stack

Origin: Postgres (Cloud SQL)
Warehouse: BigQuery
Reporting: Looker
This company opted for a simple GCP-native architecture due to its existing usage of GCP. This stack leans heavily toward simplicity and convenience, favoring readily available tools rather than third-party vendors.

A common practice is to use Airflow (Cloud Composer, in this case) for simple E/L operations that rely on simple SQL queries. This works fine for scrappy, fast-paced development but can quickly get out of hand. The move to DataStream is a prudent one, specifically for replicating transactional database tables.
The AWS Stack

Origin: Amazon RDS, DynamoDB
Warehouse: Redshift
In this stack we've elected to use the services explicitly provided by Amazon, with source data coming from DynamoDB and Amazon RDS, and replication occurring through custom Lambda processing and AWS DMS. Data is stored in S3 cloud storage and transformed using AWS Glue. Finally, analysis and reporting occur with Amazon Athena and Amazon QuickSight.
MDS by the Book

Origin: Postgres
Warehouse: Snowflake
The "Modern Data Stack" is popular and used by thousands of companies; it is often a combination of cloud provider tools, open source frameworks, and data services. A common theme of the modern data stack is that it is a composition of several smaller purpose-built tools with an orchestrator acting as a unified control plane and single observation layer.

This stack falls squarely into the E-L-T pattern, with heavy reliance on third-party vendors. It is very fast to get started.
Data Lakehouse

Origin: Postgres, S3
Reporting: Looker
If you are dealing with extremely large data sets, you may be interested in foregoing a traditional data warehouse in favor of a compute engine like Spark. This company processed many millions of events on a daily basis and relied heavily on Spark for large-scale computation.

In lieu of a warehouse, data in S3 was indexed by AWS Glue, transformed with AWS EMR (Spark), and consumed downstream through Athena.

This architecture is not complicated, but it is a significant deviation from the data stacks relying on cloud warehouses for storage and computing.
Event Driven

Origin: SFTP, S3
Warehouse: BigQuery
Orchestrator: -
Reporting: Omni
Event-driven design does not necessarily need an orchestrator; the events themselves "orchestrate" the execution of the pipeline. In this company's case, files landed in SFTP and were eventually copied to S3.

S3 provides native functionality to trigger AWS Lambda functions when new files arrive. As an alternative, long-running processes were also executed via AWS Fargate when the AWS Lambda runtime limits were not sufficient. The key consideration is that most of these components are run within a Docker container.

These files were ingested into BigQuery and ultimately transformed with dbt. End users consumed the data through Omni dashboards.

Event-driven architectures are a way to process data in a near-real-time fashion, but they have their downsides. Re-execution of the pipeline can be tedious, and missed or dropped events may go unnoticed without sufficient checks in place.
Conclusion
That concludes the first edition of Fundamentals of Data Platforms. We've covered many topics, from an introduction to what a data platform is, to common architecture and design patterns, to important concepts like data quality, and we even dove into some examples of tools and architectures.

The data ecosystem is broad, but hopefully this introduction gives you the information that you need to get started in designing and building a data platform for your organization, or even if you're just tinkering with a side project.

Also keep in mind that tools in the data ecosystem are always evolving, but don't fret, as many of the core architectures and design patterns covered in this book have stood the test of time.

Thank you for taking the time to read Fundamentals of Data Platforms, and we hope you learned something new.