Data Engineering With Python
Daniel Beach
This book is for sale at http://leanpub.com/dataengineeringwithpython
This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and
many iterations to get reader feedback, pivot until you have the right book and build traction once
you do.
Contents

Introduction
Knowledge and Experience
What are the topics we will cover?
Summary
Chapter 4 - Storage
Access Patterns
SQL/NoSQL Databases vs Files
File Types
Row vs Columnar Storage
Common file types in data engineering
Parquet
Avro
Orc
CSV / Flat-file
JSON
Compression
Storage location
Partitions
Conclusion
Introduction
This book focuses on the crucial task of moving data, specifically building data pipelines, and how to
become an exceptional Data Engineer. In today’s age of Business Intelligence, Data Science, Machine
Learning, and the ever-growing need for companies to gather as much data as possible, designing
data pipelines has become an invaluable skill.
Despite occasional layoffs in the tech industry, Data Engineers remain highly sought after, making it
a recession-resistant career choice. Data engineering is a unique combination of technical and non-
technical skills, setting it apart from traditional software engineering disciplines. This book provides
a comprehensive overview of the fundamental topics and highlights the essential skills necessary for
a successful Data Engineer.
Despite the increasing demand for skilled Data Engineers, finding qualified individuals for the role
remains a challenge. Acquiring the necessary skills as an individual can also be daunting, given the
lag in training and education offerings that cater to real-world Data Engineering knowledge. It can
be overwhelming for beginners to navigate the vast array of tools, programming languages, and
skills required to excel in this field.
This book aims to bridge this gap by providing comprehensive coverage of the essential topics in
Data Engineering. As someone who has experienced the challenges of starting out in this field, I
understand the difficulties that come with knowing where to start and which topics to prioritize.
What To Expect
In this book, I want to give you the skills and knowledge, especially the underlying theory, to write
beautiful, fast, scalable data pipelines. It's impossible to teach everything and cover every topic, but I at least want you to know what you should focus on. Hopefully, you will discover many topics that you can dive into at your leisure. But I also want to be clear about what this book is not.
This book isn’t about how to write code.
Data pipelines vary widely in their structure based on the technology stacks being used, but most of the concepts are the same. Some people wrongly assume that becoming a great coder should be their first priority; sure, that is helpful, especially in the beginning. But as you grow in your career you will quickly realize that it's other skills that enable you to be a good Data Engineer.
What I don’t want to teach you is how to write code. You will see me using Python in my examples,
and that is just for the ease of code readability. I expect you are a smart and savvy person; you are reading this book, after all.
The theory and ideas behind many data engineering topics are more important than how well you
write code, which comes with time and experience. Let's take a preview of what is coming in the chapters ahead.
Chapters
Here are the chapters and topics you can expect to encounter.
• Data Modeling
• Data Quality
• DevOps
In each chapter, we will try to introduce the topic at a high level, and help you understand why it
is important. Many times we will then dive into more specifics and examples to help cement those
ideas in.
I encourage you to read the chapters in the order they are given as well; there is a certain rhythm to life and Data Engineering that is better served if done in order.
I’m going to give you the headstart you need to help you surpass all your contemporaries and learn
the skills that are central to becoming a successful data engineer. The best part is, you can do all
this with Python, in which most of our examples will be written, but the choice of language doesn’t
matter as much as the skill sets and thought processes.
Never let someone tell you X language is better than Y.
Focus on learning and growth, the tools are secondary to the underlying skillset.
I’ve personally built a successful career as a Senior Data Engineer, never have taken a Computer
Science class in my life, and used Python for 90%+ of my professional life. This proves that like
Introduction 4
many things in life, focusing on the basics will get you much farther in the long run, than focusing
on semantics or tying yourself to a particular tool or language.
I want to share those experiences, tips, and tricks in this book to jump-start you into building reliable,
scalable data pipelines.
Chapter 1 - The Theory

Theory can be dry and I will do my best to be to the point. Having a high-level understanding of what we are trying to accomplish with our data pipelines should make writing them a little bit more manageable.

At its most basic, a data pipeline has three components:
• A source: This is the data that is being processed by the pipeline. The source can be a database,
a file system, or an application.
• A processing step: This is the step that transforms the data from the source into a format
that can be used by the destination. Processing steps can include filtering, cleaning, and
transforming data.
• A destination: This is the final location where the data is stored. The destination can be a
database, a file system, or an application.
Data pipelines are typically automated using a variety of tools and technologies.
Data pipelines are an essential part of the modern data landscape. They allow businesses to efficiently
and reliably move data from a variety of sources to a variety of destinations, where it can be used
for a variety of purposes.
You might be tempted to think that it’s simply the movement of data, but it’s more complicated than
that. I would rather say that a good data pipeline is about …
“facilitating the movement, storage, and access to data in a repeatable, resilient, and scalable
manner.”
These are the fundamentals you should think about when it comes to data engineering and data
pipelines.
• Movement
• Storage
• Access
• Repeatable
• Resilient
• Scalable
A little more than you thought? Maybe you think I’m nitpicking and throwing out jargon? We will
dig into these 6 fundamentals shortly.
I would argue the difference between understanding a data pipeline as "just the movement of data" and as "facilitating the movement, storage, and access to data in a repeatable, resilient, and scalable manner" is the difference between an amateur, broken, and unworthy pipeline and one built by a professional that will be in place for years.
Data pipelines built with passion and creativity are the ones that are most likely to succeed. When
you build a data pipeline with passion, you are more likely to be invested in its success. You are
also more likely to be creative in your approach to solving problems. This can lead to innovative
solutions that can help your business grow.
There are a few things you can do to build data pipelines with passion and creativity:
• Start with a clear goal in mind. What do you want to achieve with your data pipeline? Once
you know your goal, you can start to think about how you can achieve it.
• Be creative in your approach to solving problems. Don’t be afraid to try new things. The best
way to find innovative solutions is to experiment.
• Be passionate about your work. If you are passionate about what you do, it will show in your
work. Your passion will motivate you to work hard and to come up with creative solutions.
Building data pipelines with passion and creativity is essential for the success of your business.
All this can lead to data engineering jobs being very stressful and time-consuming. If you have a passion for working with data, it will help offset the burdens that can come with the job.
Passion for data engineering should lead to excellence in the code we write and help us design more
creative long-term solutions.
Creativity is the spice of life, both in coding and in leisure. Data engineering is a hard profession to master; refusing to think outside the box and being rigid will curtail your career and results. Complex problems require creative solutions, and some of the best data engineers come from non-software backgrounds.
Truly good data pipelines are built with heart and love. You have to see it as more than just another
task that is going to make your boss happy. If we see the movement of data in a system as a simple,
meaningless, and overhead task, the obvious result is going to be lackluster and the codebase will
fail at a critical moment.
Worse, it will compromise trust from business units and lead to incorrect results. Let’s examine our
theory of what a data pipeline is and how it will affect the code we will write later.
There can be an attitude in tech that Data Engineering is somehow beneath Software Engineering and need not be held to the same standards. This is a crucial mistake. Data engineers should hold themselves and the complex pipelines they design to the same high standards.
There are a few key data engineering standards that are important to follow in order to build reliable
and efficient data pipelines. These standards include:
• Data quality: Data quality is essential for any data pipeline. Data should be clean, accurate,
and consistent. There are a number of tools and techniques that can be used to improve data
quality, such as data cleansing, data validation, and data profiling.
• Data scalability: Data pipelines should be able to handle large volumes of data. This means that
they should be able to scale up or down to meet the needs of the business. There are a number
of ways to improve data scalability, such as using distributed computing technologies like Hadoop and Spark.
• Data reliability: Data pipelines should be reliable. This means that they should be able to run
without errors or failures. There are a number of ways to improve data reliability, such as using fault-tolerant techniques like replication and load balancing.
• Data efficiency: Data pipelines should be efficient. This means that they should be able to
process data quickly and use resources efficiently. There are a number of ways to improve data
efficiency, such as using optimized algorithms and data structures.
Following these standards can help you build reliable and efficient data pipelines that can meet the needs of the business.
Let’s dive into more of what this means and looks like day-to-day.
Movement
The movement of data is basic to the data pipeline. We are by definition picking up data from
somewhere and dropping it off at “home”, with some number of stops in between. Therefore we
can’t brush past the topic of data movement. In Data Engineering some movements will be complex
and some will be simple.
For example, reading a CSV and doing something with every row is about as basic as it gets: a very small and simple data pipeline.
import csv

def open_csv_file(file_location: str) -> object:
    with open(file_location) as f:
        csv_reader = csv.reader(f)
        for row in csv_reader:
            print(row)
Data movement generally falls into a few broad patterns:

• Streaming (real-time).
• Batch processing.
• Micro batch processing (near-real-time).
• ETL / ELT
When data engineers think about the movement of data, one of the first thoughts should be … “Is
this data pipeline streaming, batch, or something in between?”
This type of basic approach to data movement is critical because all downstream actions will most likely look very different depending on the answer: streaming or batch.
Don’t be fooled, the basic choice at the foundation of building data pipelines will have a huge impact
on how you design the system. Your code will take shape around the technology choice you make
here. It’s something you have to think about, “How will my data move point A to B?”
If you choose a messaging service like Pub/Sub, Kafka, Pulsar, etc, that code is going to look
completely different and have a different setup requirement than if you decide to push CSV files
to Parquet files with Spark. Those options couldn’t be more different.
When considering what the movement of the data should look like, remember complexity is a killer.
Do you already have numerous custom complex transformations that need to take place? Will adding
a complex data movement or storage option make the codebase nearly unapproachable? What are
the speed requirements of the project? Is some nominal data loss acceptable?
Data pipelines are critical for efficiently processing and analyzing large amounts of data in modern
businesses. However, as data pipelines grow more complex, they can become challenging to manage
and maintain, which can result in higher costs, longer processing times, and lower data quality.
Reducing complexity in data pipelines is essential to ensure they remain efficient and effective. Here are a few strategies that can help:
• Simplify the data pipeline architecture: Data pipelines can be simplified by reducing the
number of components, eliminating unnecessary steps, and streamlining the data flow.
• Adopt standardization: Establishing standardization across the data pipeline can reduce complexity and make the process more manageable. This can include standardizing data formats,
naming conventions, and data processing protocols.
• Automate repetitive tasks: Automating routine tasks, such as data cleansing and transformation, can reduce manual errors and simplify the data pipeline. This can also improve the
accuracy and consistency of data processing.
• Implement modular design: Building modular components in the data pipeline can make
it easier to manage and maintain. Each module can be developed, tested, and deployed
independently, making the overall process more manageable.
• Leverage cloud computing: Cloud computing can provide scalable, flexible, and cost-effective
infrastructure for data pipelines. This can reduce complexity by eliminating the need to manage
physical hardware and software.
• Monitor and optimize: Regular monitoring and optimization of the data pipeline can help
identify bottlenecks, errors, and inefficiencies. This can help prevent issues before they arise
and improve overall performance.
By adopting these strategies, businesses can reduce complexity in their data pipelines, which can
lead to better data quality, faster processing times, and lower costs.
If you're dealing with financial data, sure, a messaging system where it becomes hard to "lose" data might be a good choice. If you're dealing with terabytes of satellite imagery, missing one image won't have a meaningful effect on the outcome.
Sounds like a lot of jargon just for deciding the data movement stack you will choose for your
pipelines?
The point I’m trying to make is that your code is probably a pyramid of complexity. The foundational
decisions you make upfront will direct the type and scope of the code that is needed to implement
the rest of the project. Don’t take it lightly.
Again, the movement of data by code is fundamentally what Data Engineering is all about. We know
we have to move and act upon the data; it's in the details of how to implement that data movement that the Data Engineer provides value.
Storage

Very closely related to the data movement we just covered is storage, or data at rest. Obviously, with the rise of cloud storage options like S3, GCS, and Azure Blob, it's been much easier to ignore choices of file and data storage. But just because it's easy doesn't mean that data pipelines won't depend in many ways on the type of storage chosen.
The difference between a CSV file and a Parquet file is enormous (we will explore the details later), especially as data grows.
Again, many of the file storage choices can be hard to back out of once baked into the software
solution. Understanding the data that is flowing and how it is used will always give clues to the
storage option needed.
In data engineering, storage and file types are critical components that play a vital role in the overall
data processing and analysis pipeline. Here’s a brief explanation of these concepts:
Storage:
Storage refers to the physical or virtual location where data is stored. Data storage can be classified
into two categories: primary storage and secondary storage. Primary storage, also known as volatile
storage, is temporary storage used by the computer’s memory to hold data that is currently being
processed. On the other hand, secondary storage, also known as non-volatile storage, is long-term
storage that is used to hold data that is not currently being used. Examples of secondary storage
include hard drives, solid-state drives, cloud storage, and network-attached storage.
File Types:
File types refer to the different formats in which data can be stored. Each file type has a specific
structure, encoding, and extension that determines how the data is stored, processed, and accessed.
Here are some common file types used in data engineering:
CSV (Comma Separated Values): This is a text file format used to store tabular data, such as
spreadsheets. Each row of data is represented as a separate line in the file, and the values in each
row are separated by commas.
JSON (JavaScript Object Notation): This is a lightweight data interchange format that is easy to
read and write. JSON files store data as key-value pairs, and the data is represented in a hierarchical
format.
Parquet: This is a columnar storage file format designed to improve the efficiency of data processing.
Parquet files store data in columns instead of rows, which can improve query performance and
reduce storage costs.
Avro: This is a data serialization system that is used to exchange data between different systems.
Avro files are compact and support schema evolution, which allows changes to the data structure
over time without breaking compatibility.
ORC (Optimized Row Columnar): This is another columnar storage file format that is designed to
optimize performance and storage efficiency. ORC files use compression and encoding techniques
to reduce file size and improve query performance.
Choosing the right storage and file types for a particular use case is critical for efficient data
processing, storage, and analysis. Understanding the strengths and weaknesses of each option can
help data engineers make informed decisions that lead to better data outcomes.
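To make this concrete, here is a minimal sketch of moving the same data from a row-oriented CSV file into Parquet with pandas; it assumes pandas and pyarrow are installed, and events.csv is just a placeholder file name.

import pandas as pd

# Read a row-oriented CSV file into memory.
df = pd.read_csv("events.csv")

# Write the same data back out as compressed, columnar Parquet.
df.to_parquet("events.parquet", engine="pyarrow", compression="snappy")

Nothing about the data changes here, only how it is laid out on disk, which is exactly why these decisions are so easy to overlook early on.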
I’ve seen tens of millions of individual JSON files stored in cloud buckets, and then a year later
analytics need to be done. Of course, at this point, even a massive AWS EMR cluster was having
trouble reading tens of millions of individual JSON files to retrieve needed business analytics.
Obviously, for that project, someone chose JSON because it was easy, and the consequences were
large and required a sizable project to regain insights into the data that could have been avoided by
proper storage upfront.
Storage can be seen as an inconsequential decision when building data pipelines, but this is rarely the case once the project is popular and starts to scale.
Storage and file types have a significant impact on technology and code in data engineering. Here
are some ways in which storage and files affect technology and code:
Data processing and analysis:
Different storage and file types require different processing and analysis techniques. For example,
columnar storage file formats like Parquet and ORC require specialized processing techniques to
efficiently access and analyze the data. In contrast, text-based file formats like CSV and JSON require
different processing techniques. As a result, technology and code used in data engineering must be
tailored to the specific storage and file types used in a particular use case.
Scalability:
The choice of storage and file types can impact the scalability of a data pipeline. For example, storing
large amounts of data in a traditional relational database may not be practical due to scalability
limitations. In contrast, using a distributed storage system like Hadoop Distributed File System
(HDFS) can provide greater scalability and performance. Similarly, using columnar storage file
formats like Parquet can significantly reduce storage costs and improve query performance for large
datasets.
Data integration:
Integrating data from multiple sources requires compatibility between the storage and file types
used in each source. For example, if one source uses CSV files and another uses JSON files, the
data integration process may require additional code to transform the data into a compatible format.
Choosing storage and file types that are compatible with the majority of data sources can simplify
the data integration process and reduce the amount of code required.
Code efficiency:
The choice of storage and file types can impact the efficiency of code used in data engineering.
For example, using a columnar storage file format like Parquet can reduce the amount of data that
needs to be read from disk, which can improve the performance of data processing and analysis
code. Similarly, using a distributed storage system like HDFS can allow code to be parallelized across
multiple nodes, improving overall code efficiency.
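As a rough sketch of that last point, a columnar read can be limited to just the columns a job actually needs; the file and column names here are hypothetical placeholders.

import pandas as pd

# Only the listed columns are read from the Parquet file,
# instead of scanning every column of every row.
orders = pd.read_parquet("events.parquet", columns=["customer_id", "order_total"])
print(orders.head())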
In summary, storage and file types have a significant impact on technology and code in data
engineering. Choosing the right storage and file types for a particular use case can have a profound
impact on the scalability, efficiency, and effectiveness of data processing and analysis pipelines.
Access
Continuing on with our definition of data pipeline, we come to data access.
What does access have to do with data pipelines? Everything. The data output from a pipeline or
intermediate steps should be easily and quickly accessible. Why is this so important?
In data engineering, data access patterns refer to the different ways in which data is accessed,
processed, and analyzed. Understanding data access patterns is crucial for designing efficient and
effective data pipelines. Here are some common data access patterns in data engineering:
Batch processing:
In this data access pattern, data is processed in large batches at regular intervals, such as daily
or weekly. Batch processing is commonly used for analyzing large datasets that are stored in data
warehouses or data lakes. Batch processing is useful for data that is not time-sensitive, and the
analysis does not need to be performed in real-time.
Real-time processing:
In this data access pattern, data is processed as it is generated or received, in real-time. Real-time
processing is commonly used for applications that require immediate feedback or response, such
as fraud detection or recommendation engines. Real-time processing requires low latency and high
throughput to ensure that data is processed quickly and accurately.
Stream processing:
In this data access pattern, data is processed in real-time as it is generated, but the processing is
performed in small batches, rather than continuously. Stream processing is commonly used for
applications that require continuous data processing, such as log analysis or sensor data processing.
Stream processing requires high throughput and low latency to ensure that data is processed quickly
and accurately.
Interactive processing:
In this data access pattern, data is processed interactively, allowing users to query and analyze data
in real-time. Interactive processing is commonly used for data visualization, dashboarding, and ad
hoc analysis. Interactive processing requires low latency and high interactivity to ensure that users
can query and analyze data quickly and easily.
Near-line processing:
In this data access pattern, data is processed on-demand, but the processing is not performed in real-time. Near-line processing is commonly used for applications that can tolerate moderate latency and throughput, such as data archiving or backup, while still ensuring that data is processed in a timely and efficient manner.
Each data access pattern has its strengths and weaknesses, and choosing the right pattern depends on
the specific use case and requirements. Understanding data access patterns is critical for designing
efficient and effective data pipelines that can meet the needs of the business.
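To ground the batch versus micro-batch distinction in something concrete, here is a hedged sketch using pandas; events.csv and the chunk size are placeholders.

import pandas as pd

# Batch: pull the whole dataset into memory in one pass.
batch_df = pd.read_csv("events.csv")
print(len(batch_df))

# Micro-batch: process the same file in fixed-size chunks, so memory
# stays flat as the data grows and results can land sooner.
for chunk in pd.read_csv("events.csv", chunksize=10_000):
    print(len(chunk))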
Well, what happens when the inevitable bug or business requirements change and you need to
troubleshoot? What happens when the pipeline has scaled to hundreds of gigabytes or even terabytes of data?
You will care about data access then.
Simply put, how do people or machines consume the data in question? Data is no good if it cannot
be used.
I’ve seen complex projects completed by some very smart people, with code that looks like it should
be admired. And those codebases solved very hard problems. But they failed in the end with no
adoption even within the same engineering group.
Why? The data access layer was impossible to understand and interact with. It made the whole
project useless. Don’t let your project or pipeline fall into that hole.
When you design a data pipeline, never forget about how the data will be accessed and explored. This
comes down to how and where the data is stored at rest, and then presented for use by consuming
users or applications.
Moral of the story? Choosing storage and file types is one of the most important decisions, with far-reaching consequences, when designing data pipelines.
Repeatable
Repeatability is arguably one of the most important pieces of any data pipeline. Every pipeline codebase must meet this criterion: it has to be repeatable. What good is the code if the author is the only one who understands it, can troubleshoot it, or, even worse, is the only one who can run the pipeline? This is more common than you think.
A pipeline breaks, and everyone is running around like a chicken with their head cut off because that "one person" is on vacation today, and no one else knows what to do.
Repeatability is a crucial topic in data pipelines because it ensures that the data pipeline can be run
consistently and reliably over time. In data engineering, repeatability refers to the ability to recreate
the same output from a given input, even if the pipeline is run multiple times. Here are some reasons
why repeatability is an essential topic in data pipelines:
Quality assurance:
Data pipelines are often used to generate business-critical reports, insights, and analytics. Ensuring
that these outputs are accurate and reliable is essential. Repeatability allows data engineers to verify
that the output of the data pipeline is consistent across different runs, ensuring that the data is
accurate and reliable.
Debugging:
Inevitably, data pipelines will encounter errors or bugs that need to be fixed. When this happens, it
is essential to be able to rerun the pipeline with the same inputs to reproduce the error and debug the
issue. Repeatability ensures that the data pipeline can be rerun with the same inputs to reproduce
errors consistently.
Auditability:
In regulated industries, data pipelines must be auditable to ensure compliance with legal and
regulatory requirements. Repeatability enables data engineers to provide an audit trail that shows
how the data pipeline was run, what inputs were used, and what outputs were generated.
Collaboration:
Collaborating on data pipelines requires a high level of coordination and communication between
team members. Repeatability enables team members to share code, data, and outputs confidently,
knowing that they will get the same results when they run the pipeline.
Ensuring repeatability in data pipelines involves several best practices, including version control
for code and data, containerization, and automated testing. By following these best practices, data
engineers can build robust and reliable data pipelines that provide consistent and accurate results
over time.
Engineers will many times bake assumptions about the world their data lives in into their code. This makes the code hard to follow and run. Anyone should be able to clone the code repository, look at the README, and with a few lines at the console, kick off the pipeline.
Anything more inherently means the pipeline is not repeatable. It shouldn’t take more than one
button click or command line argument issued to re-start or run any data pipeline.
There should not be a myriad of configurations, flags, and files that must be staged in certain
ambiguous places for the data pipeline to run.
A good rule of thumb is this: if you can't schedule a CRON job to run a single Python file with a main() function and a few arguments, which are noted in the README, then the code is probably not repeatable.
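Here is a minimal sketch of what that might look like; the run_pipeline function and the --run-date argument are hypothetical stand-ins for whatever your pipeline actually needs.

>> main.py
import argparse

def run_pipeline(run_date: str) -> None:
    # Hypothetical stages; each would live in its own module.
    print(f"Running pipeline for {run_date}")

def main() -> None:
    parser = argparse.ArgumentParser(description="Run the data pipeline.")
    parser.add_argument("--run-date", required=True, help="Date to process, e.g. 2023-01-31")
    args = parser.parse_args()
    run_pipeline(args.run_date)

if __name__ == '__main__':
    main()

A CRON entry then only needs something along the lines of python main.py --run-date 2023-01-31, and that one command belongs in the README.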
If no one but the person who wrote the code can run it… that is a problem. This is another important
subject and we will explore it more later.
Resilient
This might seem obvious to some, but to others, it isn’t something they’ve had experience with.
What do I mean when I say a data pipeline must be resilient? I mean the codebase must be written
in such a way that it isn’t fragile.
There should not be many instances of "hardcoded" values, like dates or numbers, in the code. Hardcoding inevitably means that an engineer will have to make a code change just to run the pipeline for some historic data. Are there no try/except blocks? That probably means the author didn't think about the pipeline in a resilient way.
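A small sketch of the difference, assuming a hypothetical landing/ directory of daily CSV files: the date arrives as an argument rather than a hardcoded value, and the failure case is handled explicitly.

from datetime import date

def load_file(process_date: date) -> None:
    # The date comes in as a parameter, so re-running for historic
    # data never requires a code change.
    file_location = f"landing/{process_date.isoformat()}.csv"
    try:
        with open(file_location) as f:
            for row in f:
                print(row.strip())
    except FileNotFoundError:
        # Decide explicitly what a missing file means instead of
        # letting the whole pipeline fall over.
        print(f"No file found for {process_date}, skipping.")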
Scalable
Last but not least, and probably the most important, scalability is the name of the game. A data
pipeline that isn’t scalable isn’t a pipeline at all,
it’s just a one-time script that is doomed from the beginning.
One of the most common mistakes and time wastes I've seen is someone taking a few weeks to write a codebase to solve a problem, testing and testing, and releasing the code, only to realize in production, once it's scaled up, that it either crashes or is so slow as to be useless.
You can rarely write a pipeline to deal with one small piece of data and have it scale up without significant changes. Scalability has to be one of the main tenets driving the creation of the pipeline from the beginning.
When you're thinking about a piece of data working its way through the pipeline, ask yourself: what if this record has 10 million friends? Scalability affects file storage options, the processing framework, and everything in between.
Data pipelines can be made more scalable by implementing strategies that allow them to handle an
increasing amount of data and processing tasks without compromising performance or reliability.
Here are some ways to make data pipelines more scalable:
Distributed processing:
Distributing processing tasks across multiple nodes or machines can help increase the pipeline’s
capacity to handle larger volumes of data. This can involve using technologies such as Apache Spark,
Apache Flink, or Hadoop to distribute processing tasks across a cluster of nodes.
Horizontal scaling:
Scaling the pipeline horizontally by adding more processing nodes can help increase its capacity to
handle more data and processing tasks. This can involve using cloud-based services such as Amazon
Web Services (AWS) or Microsoft Azure to add more compute resources to the pipeline as needed.
Partitioning:
Partitioning data into smaller chunks can help improve processing efficiency and enable parallel
processing. This can involve using techniques such as range partitioning or hash partitioning to
divide data into smaller chunks that can be processed in parallel.
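As a small illustration of partitioning at the storage layer, here is a hedged sketch that writes partitioned Parquet with pandas; pyarrow is assumed to be installed, and the column names are placeholders.

import pandas as pd

df = pd.DataFrame({
    "event_date": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "order_total": [10.0, 25.5, 7.25],
})

# partition_cols produces one sub-directory per event_date, so a query
# for a single day only has to touch that day's files.
df.to_parquet("orders", partition_cols=["event_date"])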
Caching:
Caching frequently accessed data in memory can help reduce the amount of data that needs to be
processed, improve processing efficiency, and reduce the load on data stores. This can involve using
in-memory caching technologies such as Redis or Memcached.
Load balancing:
Implementing load balancing mechanisms can help evenly distribute processing tasks across
multiple processing nodes, preventing overload and improving performance. This can involve using
load balancers or distributed queuing systems such as Apache Kafka or RabbitMQ.
Auto-scaling:
Implementing auto-scaling mechanisms can help the pipeline automatically scale up or down based
on changes in processing demands. This can involve using cloud-based services that provide auto-
scaling capabilities such as AWS Auto Scaling or Azure Autoscale.
In summary, making data pipelines more scalable involves implementing strategies that enable them
to handle increasing amounts of data and processing tasks without compromising performance or
reliability. By adopting these strategies, data engineers can ensure that the pipeline can scale to meet
changing business needs and processing demands.
In Summary
“Facilitating the movement, storage, and access to data in a repeatable, resilient, and scalable
manner.”
It’s easy to get caught up in the mundane and day-to-day data engineering work of writing the next
ETL script. But, I suggest you take a different approach, looking at problems and resolving to make
good decisions, not to rush design, and to think through problems.
• Movement
• Storage
• Access
• Repeatable
• Resilient
• Scalable
You should always consider the above 6 points when approaching a new problem or pipeline.
Grasping the importance of these theories and then putting them into practice will put you ahead of most engineers.
Reminding yourself daily that your pipelines need to move data efficiently, and that the file types you choose do matter, will save you headaches down the road. Ensuring every pipeline is easily repeatable and resilient from the start will come in handy when something breaks at night or on the weekend.
Of course, when the business starts to scale and your data needs to keep up, you won’t be left holding
the bag! It’s very tempting to get caught up in this and that new technology, chasing the shiny new
toy. It’s easy to say “been there, done that,” when it comes to the next pipeline.
I encourage you to remember these basic ideas we discussed and use them to drive all the decisions
you make when tackling the next problem.
Chapter 2 - Data Pipeline Basics
Finally, it’s time to dive into the inside workings of data pipelines, what we’ve all been waiting for!
Let’s start by talking about the basics of building data pipelines. There are a few tips and tricks that
will help you generalize all the pipeline projects you work on.
I consider these things basic 101 level requirements. Chances are if you don’t follow these approaches
or some variation of them, you’re going to struggle to produce reliable and repeatable pipelines that
you can be proud of.
A data pipeline is a system or process that moves data from one or multiple sources to one or
multiple destinations in a structured and automated way. The primary purpose of a data pipeline is to
transform, process, and integrate data to provide useful insights, make informed business decisions,
and drive data-driven actions.
Data pipeline basics typically involve the following components:
Data Sources:
These are the systems or applications from where the data originates. The sources can be a database,
file storage, web service, or any other source that can provide the necessary data.
Data Ingestion:
This involves the process of extracting the data from the sources and moving it to a target system
or storage. The data ingestion process can involve different techniques such as batch processing,
streaming, or event-driven processing.
Data Transformation:
This involves the process of modifying, enriching, or cleansing the data to ensure that it meets
the desired quality standards and is in a format that is usable by downstream applications.
Transformation can be performed using various techniques such as filtering, aggregation, mapping,
or data validation.
Data Storage:
This involves storing the data in a storage system that can efficiently handle large volumes of data
and provide fast access to it. The storage system can be a traditional database, data warehouse, data
lake, or any other storage system that is appropriate for the data volume and use case.
Data Processing:
This involves performing complex computations on the data to extract useful insights and generate
reports or dashboards. The data processing can be performed using various technologies such as
machine learning, data mining, or data analytics.
Data Delivery:
This involves delivering the data to the intended end-users or applications in a timely and efficient
manner. The delivery can be performed using various techniques such as APIs, messaging queues,
or web services.
In summary, data pipeline basics involve a set of components that work together to extract,
transform, process, and deliver data in a structured and automated way. By implementing a data
pipeline, organizations can ensure that their data is accurate, accessible, and can provide valuable
insights to drive business decisions.
• Project Structure.
• Testing.
• Documentation.
• Containerization.
• Architecture First.
These are just some basics that will assist us going forward, as they are the core of what will drive
the design and layout of all your data pipeline projects. I know you might be getting tired of the
theory, but this is where the theory starts to meet the reality of writing code to move data.
This isn’t a book about how to write code, so write code as you like, but adopting a few well-accepted
styles can have a positive impact on your success at delivering highly scalable and efficient pipelines.
Project Structure
When starting our data pipelines in Python, or any language, we should follow a clear and concise
design pattern that shows any reader the flow of the data through the code. Just because something is complex doesn't mean it's acceptable to have our program and data flow hidden in obscurity.
The project structure of a data pipeline project typically involves several components and directories
that are organized in a logical and modular way. Here are some of the common components that
are included in a data pipeline project structure:
Configuration:
This directory contains configuration files that define the settings and parameters used by the data
pipeline, such as database connection strings, API keys, and environment variables.
Data:
This directory contains the data used by the data pipeline, including input data, intermediate data,
and output data.
Documentation:
This directory contains any relevant documentation related to the data pipeline, such as user
manuals, data dictionaries, and technical specifications.
Logs:
This directory contains logs generated by the data pipeline, including error logs, debug logs, and
performance logs.
Scripts:
This directory contains the scripts used by the data pipeline, including data ingestion scripts, data
transformation scripts, and data processing scripts.
Source:
This directory contains the source code of the data pipeline, including any custom libraries, modules,
and dependencies.
Tests:
This directory contains the tests used to verify the functionality of the data pipeline, including unit
tests, integration tests, and end-to-end tests.
In addition to these components, the project structure may also include a Makefile or a build script
that automates the process of building, testing, and deploying the data pipeline. The project structure
may also include a README file that provides a brief overview of the project, including its purpose,
functionality, and usage.
In summary, the project structure of a data pipeline project is designed to organize the various
components of the project in a logical and modular way, making it easier to manage, maintain, and
scale the data pipeline over time. By following a standardized project structure, data engineers can
ensure that their data pipeline projects are organized, consistent, and easy to understand by other
team members or stakeholders.
Be short and to the point when writing code; the data should flow through the code in an obvious and human-readable way. That's why we are using Python!
We will dive into DevOps later, but anytime you start a new codebase/pipeline, or work on an old one, you should ensure a few basics and always start by creating the core project resources.
Let’s see what a very basic project structure might look like.
It’s a contrived and simple example in the above picture, but it’s obvious the main.py is where we
should look for the code, test probably has our tests, there is a Dockerfile. More than anything we
are looking for a project to be clear.
Here is an example of what you don’t want to see when looking at a new pipeline codebase.
run_database.py
general_functions.py
data_file.csv
read_methods.py
configurations.json
You don’t know where to start, what to look at, what is important or what isn’t. It’s just messy and
not obvious, be clean and organized in your code as well the project structure itself.
What would a better project structure look like applied above?
README.md
Dockerfile
requirements.txt
utilities/general_functions.py
configs/configurations.json
sample_data/data_file.csv
database/run_database.py
read_methods.py
You might not think this has that much to do with data engineering pipelines, but it does. A solid
foundation is a key to the future success of any project. Taking the guessing game out of your
pipelines is one of the best things you can do.
There is nothing worse than going to a repository in GitHub and finding random files and folders
littered everywhere with no apparent consistency. Remember what your mom always said … “clean
your room!”
Before writing the so-called guts of our codebase, large or small, it’s important to get a few things
right. I do understand that there are many different styles of writing code, but there is no excuse for
sloppy and dirty code and organization.
Pipeline code that is thrown together haphazardly is probably code that was also written poorly, not
tested, and will break often. Taking ownership and pride in a codebase is important.
Let’s dive a little deeper into structuring the actual codebase of a pipeline.
Even worse than that is a bunch of files all in the same directory, each with a different name, where you have no idea where the entry point of the codebase is!
You should always start with something along these lines.
>> main.py
if __name__ == '__main__':
    main()
I mean call it entry_point() if you have to. Just give the obvious indication of where in the world
to start.
I can’t stress this enough and I want to make a point here. This is a book about coding practices,
but you can’t be a successful Data Engineer without embracing a mindset of clean and concise code
structures that are easy to read and use.
Your data pipeline is only going to be as good as the code you write. If it’s hard to read and follow,
then it will leave a bad taste and impression, regardless of how good a job you think you did.
Simple, but important. We should continue this pattern of obvious code and data flow. Your pipeline
code should follow the data flow in a sense. The methods and functions should be named to indicate
what they are doing and should be called inside your main() method in an obvious pattern.
In a data pipeline, the code should flow with the data to ensure that data is processed in a logical
and efficient manner. This means that the code should be organized and structured in a way that
follows the flow of data through the pipeline.
Here are some ways to ensure that the code flows with the data in a data pipeline:
Start with input data:
The code should begin by retrieving the input data from its source, whether it is a database, a file,
or an API. This input data should be processed and transformed in a series of steps that correspond
to the processing stages of the pipeline.
Use data structures that fit the data:
The code should use appropriate data structures that fit the type and format of the data being
processed. For example, if the data is in a tabular format, a DataFrame or a relational database
may be used.
Avoid data duplication:
The code should avoid duplicating data unnecessarily. Instead, it should process data in a sequential
manner, passing data from one step of the pipeline to the next without duplicating or storing the
same data multiple times.
Keep track of data flow:
The code should keep track of the data flow at each stage of the pipeline. This may involve logging
data or using a data lineage tool to track the flow of data from input to output.
End with output data:
The code should end by writing the output data to its destination, whether it is a database, a file, or
an API. The output data should be in the desired format and contain all the necessary information
for downstream processing or analysis.
By ensuring that the code flows with the data in a data pipeline, data engineers can develop pipelines
that are efficient, maintainable, and scalable. This can lead to better performance, faster development
time, and improved data quality.
As the data flows, the code flows, and vice versa.
Simply reading a main() function should tell you what's happening to the data, where it is coming from, and where it is going.
Example
Let’s say we have a very simple pipeline to read CSV files, get some metrics from the files, and store
those results into a SQL database.
It can be tempting to start thinking about the details or writing code right away. We should shy away
from this practice. Let’s sketch our code the way we would sketch a drawing. It helps to understand
the work that lies ahead of us and will help keep us on the straight and narrow as we develop the
pipeline.
It doesn’t matter very much if you write Object Oriented Programming (OOP) or Functional
programming with lots of little methods. If you are writing ETL or data pipelines, your code should
show the flow of the data when read.
In the above example, we talked about a simple program to open CSV files, get metrics, and put
those metrics into the database. The code should be framed in such a way that this is obvious at a cursory glance.
>> main.py
def main():
    calculate_workload()
    download_csv_files_in_parallel()
    stream_files_to_memory()
    calculate_metrics()
    save_metrics_to_database()
Why write code like this? Because it’s obvious how the data flows through the program. It also
gives a high-level understanding of what the pipeline is going to look like before every bit of code
is written.
This is a great advantage because it will help us think about how scalable, efficient, and extensible
this pipeline will be. I can't emphasize this enough. If you just started working on calculate_workload() without thinking about what comes after or before, you're on the road to a life of hardship.
You get caught up in "how am I going to read this data" and write the code straight away, missing details about what comes next that could change how you write the code you're working on!
Code readability and organization are crucial aspects of developing a high-quality and maintainable
data pipeline. Here are some best practices for improving code readability and organization in a data
pipeline:
Use descriptive names:
Use descriptive variable names, function names, and comments to make the code easier to
understand. This will also make it easier to maintain the code over time.
Follow coding standards:
Follow standard coding conventions and guidelines to ensure that the code is consistent and easy
to read. This includes using appropriate indentation, using whitespace effectively, and following
naming conventions.
Use modular code:
Break the code down into small, reusable functions that perform a specific task. This will make the
code easier to read and debug, and it will also make it easier to reuse code in other parts of the
pipeline.
Use version control:
Use a version control system to manage changes to the code over time. This will make it easier to
track changes, collaborate with other developers, and roll back changes if necessary.
Document the code:
Use comments and documentation to explain how the code works, what it does, and how it should
be used. This will make it easier for other developers to understand the code and contribute to the
project.
Use testing frameworks:
Use testing frameworks to test the code and ensure that it works as expected.
This will help catch errors and bugs early on in the development process, making it easier to fix issues
before they become bigger problems.
By following these best practices, data engineers can improve the readability and organization of
their code, making it easier to maintain and update the data pipeline over time. This will help ensure
that the data pipeline remains efficient, reliable, and scalable, even as data volumes and processing
requirements change over time.
So perhaps in the above example, streaming files into memory could be a few methods/functions.
>> stream_csv_files.py
import csv

def open_csv_file(file_location: str) -> object:
    with open(file_location) as f:
        csv_reader = csv.reader(f)
        for row in csv_reader:
            yield row
This is a simple but powerful idea. Don't fall into the bad habit of writing data pipelines all in a single file. It isn't helpful and will most certainly leave you and others confused as the code grows. Group your code into logical units.

You should not write functions that are 50 lines long. That means you're not breaking your code up into small enough logical units.
Functions and methods that are working on data and contain too many lines of business or
transformation logic are going to be prone to error and impossible to maintain or troubleshoot.
Tests.
We will talk more about testing in the DevOps chapter, but this topic is so important it’s worth a
cursory overview now.
You need to have tests for every data pipeline. Why? Because if you want to be a cut above the rest, you will follow best practices every step of the way, even when you don't have to.
Think about it, if you change code how are you supposed to know it works and that you have not
broken something you don’t know about, especially if you are new to a codebase?
Tests are an essential aspect of developing a high-quality and reliable data pipeline. Here are some
reasons why tests are important in a data pipeline:
Ensure correctness:
Tests can ensure that the data pipeline is functioning correctly and producing the expected results.
This is especially important when dealing with large volumes of data, complex transformations, and
multiple data sources.
Catch errors early:
Tests can catch errors early on in the development process, making it easier to fix issues before they
become bigger problems. This can save time and resources by reducing the amount of time spent on
debugging and troubleshooting.
Facilitate collaboration:
Tests can help facilitate collaboration among developers by providing a common language and
framework for discussing the code. Tests can also help ensure that changes made by one developer
do not break the code or data pipeline for others.
Support scalability:
Tests can support scalability by helping to identify bottlenecks or performance issues before they
become major problems. This can help ensure that the data pipeline can handle increasing volumes
of data and processing requirements.
Improve confidence:
Tests can improve the confidence of data pipeline users and stakeholders by providing evidence that
the pipeline is reliable and producing accurate results. This can help build trust and credibility for
the data pipeline and its users.
In summary, tests are critical to developing a high-quality and reliable data pipeline. They can
help ensure correctness, catch errors early, facilitate collaboration, support scalability, and improve
confidence in the pipeline.
Unit Testing
Unit tests are the first line of defense. No one should have to run a pipeline to be able to know if it’s
working or not. Changing the business or transformation logic acting upon data should be tested
in such a way that anyone can attempt the change and run tests with a reasonable assumption that
unit tests will catch any problems.
Unit tests will not catch every bug, they never will, but they will catch some, and they will protect
you from those “silly” mistakes we are all prone to making.
In our above example, we would have at least …
tests()
test_download_csv_files_in_parallel()
test_stream_files_to_memory()
test_calculate_metrics()
test_save_metrics_to_database()
Every language has its own popular testing frameworks, for Python that would be pytest. Any
Google search or YouTube video can introduce you to the basics of testing for any language and
have you writing your own in an hour or less.
Many times it’s critical when testing data pipeline code to have example and sample files to process,
and the sooner you generate them and start writing tests the better.
It’s not a good idea to wait until you are done before thinking about tests.
Many times writing unit tests for the method or function you are writing right away will cause you
to refactor your code, to make it simpler and more testable … this is a good thing.
import csv

def open_csv_file(file_location: str) -> object:
    with open(file_location) as f:
        csv_reader = csv.reader(f)
        for row in csv_reader:
            return row

def test_open_csv_file():
    test_row = open_csv_file(file_location='sample_file.csv')
    # csv.reader yields each row as a list of strings.
    assert test_row == ['1', '2', '3']
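The sketch above assumes a sample_file.csv is already staged next to the tests. A variation, using pytest's built-in tmp_path fixture, generates the sample data inside the test itself; the reader here is just a copy of the generator from earlier in this chapter so the example stays self-contained.

>> test_stream_csv_files.py
import csv

def open_csv_file(file_location: str):
    with open(file_location) as f:
        for row in csv.reader(f):
            yield row

def test_open_csv_file(tmp_path):
    # tmp_path is a throwaway directory pytest creates for each test,
    # so no sample files need to be staged ahead of time.
    sample = tmp_path / "sample_file.csv"
    sample.write_text("1,2,3\n4,5,6\n")

    rows = list(open_csv_file(str(sample)))

    assert rows == [["1", "2", "3"], ["4", "5", "6"]]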
If you’re new to testing just take some time to read some articles and watch some videos for your
language of choice. Once you get into the habit of writing unit tests you won’t look back.
Documentation
Having a README.md file in every ETL project isn’t asking that much. No matter how well we think we
write code, no one, including yourself six months later, will remember every nuance and the reasoning that caused code to be written in a certain way with certain assumptions.
Documentation is a crucial aspect of developing a data pipeline. Here are some reasons why
documentation is important in a data pipeline:
Facilitate understanding:
Documentation can help developers and stakeholders understand how the data pipeline works, what
it does, and how it should be used. This is especially important for complex data pipelines that
involve multiple data sources and complex transformations.
Support maintenance:
Documentation can help support the maintenance of the data pipeline by providing guidance on how
to modify, update, and troubleshoot the pipeline over time. This can help ensure that the pipeline
remains efficient, reliable, and scalable as data volumes and processing requirements change over
time.
Foster collaboration:
Documentation can help facilitate collaboration among developers and stakeholders by providing
a common language and framework for discussing the data pipeline. This can help ensure that
everyone is on the same page and working towards the same goals.
Improve efficiency:
Documentation can improve efficiency by reducing the amount of time and effort required to
understand, use, and modify the data pipeline. This can help ensure that developers and stakeholders
can focus on other important tasks, such as data analysis and modeling.
Mitigate risk:
Documentation can help mitigate risk by providing evidence that the data pipeline is reliable,
accurate, and compliant with relevant regulations and policies. This can help protect against data
breaches, regulatory fines, and other risks associated with data management.
Documentation is critical to developing a high-quality and reliable data pipeline. It can help facilitate
understanding, support maintenance, foster collaboration, improve efficiency, and mitigate risk.
By investing in documentation, data engineers can ensure that the data pipeline remains efficient,
reliable, and scalable over time.
The README.md is your chance to get these important ideas across.
If your company uses something like Confluence, even better! But everyone has an opportunity to
add a README.md file to a new or existing pipeline project you are working on.
A good place to start with README.md contents is here …
Even a small amount of context given at the right time to some new engineer who’s about to debug
code they’ve never seen is going to be a lifesaver. It will make you a hero.
Let’s look at an example README.md for a Data Lake project. What we want to do is provide new
technical folks a good overview of what they are dealing with.
README example
>> README.md
Data Lake Repo (Databricks and Delta Lake)

Docker and Tests
First build the docker image: docker build --tag delta-warehouse . You should only have to do this once, unless you change the Dockerfile.

Use the command docker-compose up test to run all unit-tests. The container will be tied to your local volume from which you are running the command, and will pick up your changes.

Storage
There are three s3 buckets located in us-east-1 used by the DeltaLake and Databricks warehouse.

warehouse-production
warehouse-development
warehouse-integration-testing

The Development bucket should be used for exploratory and development work as needed.

The Integration bucket should not be used for development, but only end-to-end automated integration tests. (sample data that will be static)

The Production bucket should be for production use only.

Table Maintenance
We run two different types of table maintenance. See data-warehouse/maintenance/.

Optimize (reduce number of small files)
Vacuum (delete unused data files)

DataBricks Scripts
The folder databricks-scripts holds the files that Airflow will submit via an s3 uri in a DataBricks Job call.

This folder is auto pushed via CircleCI on push to the master branch to s3://warehouse-production/databricks-scripts

Warehouse Layout
The warehouse is made up of raw files in an s3 bucket, which are moved into Databricks DeltaLake tables via DataBricks, with ETL orchestrated by Airflow.
I know documentation can seem like a boring topic, but it’s better to have something rather than
nothing. It’s better to read a few sentences before digging into a project, to get the general idea of
what’s happening, rather than having to guess.
Containerization
You’re just going to have to learn to use Docker, containers are the future and are the reality today.
This is just as true for a data engineer as it is for a software engineer. There is no easier way to
make a data pipeline repeatable and reliable than to have a Dockerfile to run the code on. It removes
all system ambiguity.
Containers are a great equalizer when it comes to systems, requirements, and dependencies. You
can essentially freeze every detail about where and how your code runs, accessible to yourself or
anyone coming on later to work on a project. I hope this importance doesn’t escape you.
Containerization is an important technique in data engineering pipelines that involves packaging
an application and its dependencies into a container, which can then be deployed across different
environments with consistent behavior. Here are some reasons why containerization is important
in data engineering pipelines:
Consistency:
Containerization ensures that the data pipeline is consistent across different environments, such as
development, testing, and production. This helps ensure that the pipeline performs the same way in
all environments and reduces the risk of errors caused by differences in environment configurations.
Portability:
Containers are portable, meaning they can be deployed across different operating systems and cloud
platforms. This enables data engineers to move the data pipeline between different environments
with ease, without worrying about compatibility issues.
Scalability:
Containers are lightweight and can be scaled up or down quickly and easily to handle changing vol-
umes of data. This helps ensure that the data pipeline can handle increased processing requirements
and data volumes without the need for significant changes to the infrastructure.
Efficiency:
Containers use resources more efficiently than traditional virtual machines, as they share the host operating system's kernel rather than running a full guest OS. This enables data engineers to run multiple containers on the same
hardware, which can lead to significant cost savings.
Security:
Containers provide a secure environment for running applications, as they are isolated from the
host operating system and other containers. This reduces the risk of security breaches caused by
vulnerabilities in the application or the operating system.
Containerization is an important technique in data engineering pipelines that provides consistency,
portability, scalability, efficiency, and security. By containerizing the data pipeline, data engineers
can ensure that it performs consistently across different environments, can be easily deployed and
scaled, and is secure and cost-efficient.
I know that some things can feel like a hill too big to climb; especially if you have never used Docker before, containers can seem like a little bit of black magic. But, I promise it's not that bad.
What does a Dockerfile do for a data engineering project or pipeline?
Let’s look at a quick example of how Dockerfiles and containerization can remove ambiguity and
level the playing field in data engineering. When coming to a new code base for the first time it can
be helpful just to review the Dockerfile to see what tools and systems are installed.
Just doing that simple task will usually give you a good idea of what’s to come!
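Here is a minimal sketch of the kind of Dockerfile in question. The base image, package names, and the Spark version and download URL below are illustrative assumptions, not the exact listing from any real project.

>> Dockerfile
FROM ubuntu:20.04

# Java and Scala are needed by Spark; Python runs the pipeline code itself.
RUN apt-get update && DEBIAN_FRONTEND=noninteractive \
    apt-get install -y openjdk-11-jdk scala python3 python3-pip wget

# Pull down and unpack a Spark distribution (version is illustrative).
RUN wget -q https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz && \
    tar -xzf spark-3.3.2-bin-hadoop3.tgz -C /opt && \
    rm spark-3.3.2-bin-hadoop3.tgz

ENV SPARK_HOME=/opt/spark-3.3.2-bin-hadoop3
ENV PATH=$PATH:$SPARK_HOME/bin

# Project-specific Python dependencies.
COPY requirements.txt .
RUN pip3 install -r requirements.txt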
We can see from the above simple example of a Dockerfile that we have an Ubuntu image that gets Java, Scala, and Python installed. We can also see Spark being installed, which gives us a great idea of what this project will be all about: Spark!
• It’s clear from the Docker container what tools and tech are being used.
Also, it levels the playing field in the sense that the next Developer who comes to work on this project doesn't have to spend a day fighting their machine getting Spark installed before they can start work.
I’m not going to teach you how to become an expert with Dockerfiles, that isn’t the point of this
book. But I suggest you spend some time learning how to write tests and wrap your projects in a
Dockerfile. There are plenty of good resources to help you get started, just Google it!
We will dive more into some high-level Docker examples for data engineering in the DevOps section
later.
Architecture First
Surprise, surprise we are talking about architecture, are your eyes glazing over yet? I know you
might be rolling your eyes and saying “get on with it already.” Diving into writing code right away
without working through the architecture first is a big mistake.
We all know that code is written with certain assumptions and around certain tech stacks. Code that is going to run on a Lambda vs code that will be running via a Cron on an EC2 box is probably
going to be different. So, we should probably think through the differences first before writing code.
Architecture First is an approach to data engineering that prioritizes designing the overall architec-
ture of the data pipeline before implementing any code. Here are some reasons why Architecture
First is important in data engineering:
Flexibility:
An Architecture First approach enables data engineers to design a data pipeline that is flexible and
can adapt to changing business requirements. By considering the overall architecture of the pipeline
first, data engineers can ensure that it can support new data sources, transformations, and processing
requirements in the future.
Integration:
An Architecture First approach enables data engineers to design a data pipeline that integrates
seamlessly with other systems and applications. By considering the overall architecture of the
pipeline first, data engineers can ensure that it can be easily integrated with other systems and
applications, reducing the risk of data silos and enabling more comprehensive data analysis.
Reduced risk:
An Architecture First approach can help reduce the risk of errors and inefficiencies in the data
pipeline. By designing the architecture first, data engineers can identify potential issues and
challenges early in the development process, reducing the risk of costly errors and rework.
Architecture First is an important approach to data engineering that prioritizes designing the overall
architecture of the data pipeline before implementing any code. By considering the architecture first,
data engineers can ensure that the pipeline is scalable, flexible, efficient, and integrated with other
systems, while reducing the risk of errors and inefficiencies.
This is more than how the code is laid out in the files as we talked about earlier. This is a thought
process that begins before a single line of code is written, and that is usually where projects live or die.
I look at the topic of software and data pipeline architecture like that old saying your parents
preached at you … “think before you act.”
We are comfortable with certain design patterns and ways of solving problems. We usually rely on our history, on how we've done something similar in the past. This can be an ok approach, but probably not the best.
It’s best when starting a data pipeline or ETL project to take the 10,000-foot view. Ask yourself a few
questions.
These are just a few questions to get you started. The best thing you can do is try to poke holes in
your ideas and design.
Being able to “step back” from the problem and think high level about pipeline and data architecture
seems like a lost art. Too many people rush into decisions and architecture because it’s familiar and
safe. Don’t do that.
The decisions we make upfront have a big impact downstream.
Review
Let’s do a quick review of our data pipeline basics.
• Project Structure
• Testing
• Documentation
• Containerization
• Architecture First
These are some of the core tenets of developing top-tier data pipelines that some probably see as
“not the point” or as “mere peripheries” of the development work. I can assure you this is a bad
stance to take. Getting certain core principles and practices in place before the real work begins will
change the trajectory of the entire project.
Most people say “I will get to that later,” and “we can come back to get that.” This rarely happens.
Typically when a project or codebase starts with a mess of files and no project structure it sets a precedent that the code is no good either.
• Tech debt and bad design happen when you say “I will fix that later.”
• Being messy with project structure upfront will ensure things only get worse.
• When tests are not a first-class citizen of your data pipelines, this ensures failure.
• Not taking the time to containerize the project with Docker makes development difficult.
• Skipping architecture even on simple projects is a mistake.
When you first discover a data pipeline that doesn’t have a single test written for it you will be
scared to touch it, who knows if something will break?
Also, when you find no documentation or even README it’s easy to get overwhelmed, how do
you know where to even start? What is the entry point for the code, what is the background,
what technologies are used? The amount of questions that can be answered with some simple
documentation is amazing.
Never forget containerization with Dockerfiles and how it puts everyone on the same page when it
comes to development and requirements. Without containerization, you just rely on people getting
things installed on their machine, Apple, Linux, Windows ... who knows all the headaches that will
happen?
Of course, it all begins and ends with architecture first. The classic mistake most developers fall into
is just jumping in feet first writing code without looking back. Do yourself and others a favor and
save time and headaches by working through the ins and outs of the technical details before writing
the actual code. It will make your life easier in the long run.
Chapter 3 - Pipeline Architecture
Data pipeline and platform architecture is an important topic that many data engineers shy away
from. We all fall into the pit of going straight to writing code without thinking about the big picture.
This is what architecture is all about, the big picture.
It doesn’t matter what your title is, it doesn’t need to have an architect in front of it, it’s something
you should do and a skill you should work on every day. You might feel like your projects are too
small to worry about such lofty ideas, but you are wrong.
Data Engineers, especially those at the senior levels, depending on the company, are given great
leeway in deciding what tools are right for the job. Many times everyone will look to the Data Engineers for recommendations about which pipeline tools and technologies are available, and which are best suited to a specific need.
When approaching architecture in data engineering, data engineers should consider the following
key steps:
Define the problem:
The first step in approaching architecture is to clearly define the problem the data pipeline is intended
to solve. This involves understanding the business requirements and data sources, as well as any
technical constraints or limitations.
Identify the data flow:
Once the problem has been defined, the data engineer should identify the data flow through the
pipeline. This involves mapping out the data sources, transformations, and destinations, as well as
any required data storage or processing steps.
Choose the right technologies:
The next step is to choose the right technologies for the data pipeline. This involves evaluating
different technologies based on their scalability, performance, cost, and compatibility with existing
systems.
Design for scalability:
When designing the data pipeline, data engineers should prioritize scalability to ensure that the
pipeline can handle increasing volumes of data and processing requirements over time. This involves
considering factors such as data partitioning, load balancing, and parallel processing.
Ensure data quality:
Data quality is a critical aspect of data engineering architecture. Data engineers should design the
pipeline to ensure that data is validated, cleaned, and enriched to meet business requirements and
avoid downstream issues.
Incorporate security and compliance:
Security and compliance are also important considerations in data engineering architecture. Data
engineers should ensure that the pipeline is designed to meet regulatory requirements, such as GDPR
or HIPAA, and includes appropriate security measures, such as encryption and access controls.
Test and iterate:
Finally, data engineers should test and iterate on the data pipeline to ensure that it meets the business
requirements and performs as expected. This involves testing different scenarios, identifying issues,
and making improvements based on feedback and performance data.
Data engineers should approach architecture by defining the problem, identifying the data flow,
choosing the right technologies, designing for scalability and data quality, incorporating security
and compliance, and testing and iterating to ensure that the pipeline meets business requirements
and performs as expected.
Data pipeline architectures raise a lot of important questions. Being able to answer them, and allow the data pipelines to flow free and fast, requires some basic architecture skills and thought processes.
Architecture isn’t some high and lofty ideas that has no grounding in reality. The best architects use
their knowledge, experience, and research to understand the technical details of what has to happen.
They pick the best long-term and simplest solution to get the desired result.
Many times it’s about trying to discover problems and roadblocks before they happen.
All that might sound scary, but it’s thinking about your past experiences, and the experiences of
others and applying them to the future. It’s about trying to objectively poke holes in software and
solutions before they are chosen or written. Think about what could go wrong, where bottlenecks
might exist, and about worst-case scenarios.
Taking a step back and trying to think about all the possible scenarios and solutions is very helpful
in data engineering. It’s trying not to make a decision and start designing a pipeline in your mind
based around some piece of technology that you have a bias for.
Try to avoid swinging too far into over-engineering a solution. The majority of the time the simple
solution will out-perform and cause fewer headaches than the fancy new one.
Many times simply drawing or sketching out the data pipeline that you want helps you think through problems and helps others see your vision.
Data engineers will often have to think about the coupling or connection between systems, or
pieces of the pipe. We might have one tool that orchestrates and manages pipeline dependencies
like Airflow. We might have another tool like Databricks Spark that runs the transformations of the
data. These tools must work together, and such concerns are to be considered during the architecture
planning phase of most data engineering projects.
Before we jump into the different steps and details of architecture, let’s take a look at an example
problem, to help us make sense of some common architecture thought processes.
Example Project
Let’s work through how we can think about pipeline architecture by using an example. Here is the
background.
You work for a medium-sized manufacturing company that makes all sorts of widgets. It’s a small
IT department of about 15 people, with only 3 data engineers. The company recently purchased
some new manufacturing equipment. These new widget machines are fancy and can connect to the company network, sending out a signal/data each time a widget is produced, in a TXT or flat-file format.
Instead of manually counting inventory, the company would like the data engineering team to
capture this widget data being output by the new machines and feed it back into the inventory
system as well as provide dashboards of machine throughput to the business.
The IT department is still in the middle of a transition to the cloud; they have some infrastructure in AWS, and some local servers still running that host the manufacturing system.
As the Senior Data Engineer, you’ve been tasked with deciding how to implement this system. Let’s
walk through the architecture thought process of how we might approach the different ways to
implement a solution.
The first part of any architecture decisions on data pipelines should be examining the requirements
of the business or as much of them as you can get. Of course in the real world, it might take some
digging and you only find out half of what you need, but the answers are important nonetheless.
Understanding what the final expectations are is key to making technology and pipeline design
decisions. Let’s say in our case we find that the business expects a dashboard to be updated every
few hours with production numbers from the machines, but you also find out that they want the
inventory system to be updated every hour at least.
Sketching out the current state and the high-level data flow needed is always a great place to start.
Here is a sketch of our current example.
This already starts to tell us a story, real-time information is not expected for this project. That
means when choosing the simplest technology layout to solve our problem, we can probably toss out
streaming or message queues. It’s always helpful to understand where we can get rid of complexity!
Let’s review what we discover about the above requirements. When working on architecture it
always helps to make a list that boils down to what we need.
We can see from the above list that our possible options for the pipeline are becoming more obvious,
we can rule in or out certain technologies. We don’t need streaming, we need batch. We need code
that ingests TXT files, which is fairly straightforward. We also know we need to build a warehouse
to house the accumulated data for dashboard analytics.
Now, these are things that data engineers love to work on!
Let’s continue to play along with our contrived example, and dig into each step we listed in the
introduction as some of the core tenets and questions we ask and work on during data pipeline
architecture.
How much data, how quickly will it grow, and what data format(s)?
For our example, we talked to the manufacturing folks in charge of setting up and installing the
machines. They have documentation from the vendor that says the new machines will write out
plain text tab-delimited files to a drive on any IP address via ssh.
A new file will be written every 10 minutes in the following format.
We know we have 5 new machines, which will output files every 10 minutes, they are text files, so
the data will be relatively small and easy to deal with. It appears even if more machines are added,
the data will not grow overwhelmingly fast.
As a data engineer, this is not a ton of information, but it allows us to sit down at this point and start
making some high-level decisions.
Based on what we know we calculate that each file put out from the 5 machines will be 4KB in size,
every 10 minutes. We know the plant will be running 10 hours per day …
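To make the rough math concrete, here is the simple back-of-the-envelope calculation using only the numbers stated above (a sketch, not a precise sizing).

# Rough daily and yearly data volume for the widget machines.
machines = 5
files_per_hour = 6          # one file every 10 minutes
hours_per_day = 10
file_size_kb = 4

daily_kb = machines * files_per_hour * hours_per_day * file_size_kb
print(daily_kb)                      # 1200 KB, roughly 1.2 MB per day
print(daily_kb * 365 / 1024 / 1024)  # roughly 0.4 GB per year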
This is not much data … even if we are half wrong and our pipeline needs to handle double the data.
This is what we do in architecture planning, we break down what we know into bite-sized chunks,
we ask questions, and start making estimates and design decisions based on what we do know.
Let’s make some architectural decisions. When we put a few facts together some things become
clear. The data is needed on a local server to populate the inventory system, the data is not big nor
is it real-time.
While it might have been tempting when first hearing about this project to select new and shiny
technology that would have done the job, it probably would have proved to be complicated and
expensive, another over-engineered project.
Option 1
Cloud-based streaming/messaging system to ingest files, transform and write files to cloud-based
storage. Also, write results back into locally hosted servers to populate the inventory system. Maybe
Kafka/Pulsar/Apache Spark streaming to a Delta Table and Tableau as a Dashboard.
Option 2
Keep everything local, the files are written to a local server, process with locally hosted Python script
on the same server, backup files to the cloud.
Maybe we present our leader with the following updated proposal. This is all done before any code
is written or decisions are made. It allows no work to be wasted and problems to be identified early.
Example Continued.
We also have to think about tradeoffs; as data engineers who love the cloud and managed services we want to gravitate towards the fun cool stuff, regardless of whether the complexity matches the situation or not. But we have to be realistic when architecting data pipelines.
If we worked at a manufacturing plant that was double the size, and say had 100 machines running, all producing these files, that might change the discussion.
• Context is important.
• Drawing out the current state and proposed future states is helpful.
• Calculating rough data size and usage helps us make decisions.
In this case, we could present to our boss some actual numbers and our recommendation. The cost
of storing our data in the cloud is known from our rough size calculations, and the cost of running messaging services along with Spark or Beam would be fairly easy to price out.
This would be compared to the cost of managing and hosting a local server with our custom code
to process the file(s).
This is of course a simple example, and we glossed over many details but I think you get the picture.
I also think it’s quite common for this type of exercise to never even take place. Usually, it’s just
up to an engineer who may just start choosing technology regardless of thinking through all the
options available.
Taking a good step back and thinking through the data flow and requirements is a critical step to not
ending up with an expensive over-engineered solution, or on the opposite side, a simplistic solution
that breaks later because of failure to scale.
I want to point out a few key areas we covered in this example that apply to many data pipeline
projects when it comes to architecture decisions.
Analytics:
Data size and velocity can impact the types of analytics that can be performed on the data. Real-time
data requires different analytics approaches than batch processing, and large data sets may require
different analytical tools or techniques than smaller ones. Data engineers must design pipelines that
support the required analytics based on the size and velocity of the data.
Data size and velocity are important factors in data engineering pipelines that impact performance,
scalability, resource management, and analytics. Data engineers must design pipelines that can
handle large data volumes and high data velocity while ensuring performance, scalability, and cost-
effectiveness.
Let’s dive a little deeper.
Data Size
Why should we understand data size? Because usually, it requires us to go through a few exercises
that help a person better understand what a pipeline needs to look like.
Let’s consider another simple example. We find out we have new data sources, someone sends us a
sample CSV file that is comma-delimited and contains some 20 million records, about 9GB in size.
They tell us we will be getting two of these files per day.
If we are in a Python shop a few things might come to mind. We have a few options for processing
these files…
Most data engineers will understand that processing files of that size without memory issues is probably going to require Dask or PySpark, libraries that are good at lazy evaluation.
What else can we understand from data size? Well, the size of the file is also going to tell us
generally about what type of technology is best suited for this particular problem. Most of us
probably understand that CSV files that are 9GB in size are not suited well to any sort of streaming
or messaging system.
But let’s say we found our data files were JSON files just a few KB’s in size, delivered at the rate of
300,000 per day, well then we might start thinking about a streaming service that seems to fit that
data size pattern a little better.
• Think carefully about which technologies would be best suited for the size of data your pipeline
is handling.
Data Velocity
Just like data size, the amount of incoming data (velocity) is another critical aspect of data pipeline
architecture, as we discussed earlier. It makes a very sizeable difference if we are receiving something
in only a handful of “batches”, or if we will be receiving hundreds of thousands or even millions of
files.
There are technology stacks that are built to deal with file volumes on each end of this spectrum.
The nice thing about understanding data velocity during the architecture phase is that it informs
what some of the rest of the pipeline might have to look like and handle.
If we are looking at a high volume of incoming files, the growth of data may require some sort of
compaction or aggregation steps in the pipeline. If we expect to receive only a few files a day, we
will most likely not have to think about these steps of compaction or aggregation very much.
So, calculating data size and velocity upfront, even if you end up being half wrong is incredibly
helpful. It’s almost a given that going through this exercise will bring out the obvious winners and
losers of which platforms were made to handle your data needs.
• With rough data sizes, break the work up into logical batches.
• Determine based on transformations and possible tooling, the required CPU and RAM
requirements.
• Do the simple math.
Resource Requirement = (processing unit size x (number of CPU + RAM)) x number of batches
This will roughly give you an idea of what will be required. Let’s walk through a simple example.
We are architecting a new data pipeline and will be receiving two raw CSV files per day, each with
about 15 million records that will need a few different transformations. After the transformations,
we will be storing the data in compressed parquet files. Each CSV file uncompressed is about 10GBs.
We have decided that using Apache Spark would be a good option for processing the CSV files, with minor transformations, into a parquet data lake in S3.
Right away we can understand that this process will be both CPU bound and RAM bound. But
knowing that we will only be processing 10GBs at a time, twice a day, getting the rough estimate of
resource requirements is very easy.
We know that Spark is a lazy evaluation framework and that the transformations on our 15 million
records doesn’t require all data to be held in memory.
Knowing these facts the compute resource calculations become straightforward.
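As a hedged sketch of what that back-of-the-envelope estimate can look like (the ratios and node size below are illustrative assumptions, not numbers from the book):

# Rough compute estimate for one 10GB CSV processed with Spark.
file_size_gb = 10
files_per_day = 2

# Spark's lazy evaluation means we never hold the whole file in memory;
# assume a worker needs roughly half the file size in RAM as headroom.
ram_needed_gb = file_size_gb * 0.5
cpus_needed = 4   # plenty of parallelism for a few minor transformations

print(ram_needed_gb, cpus_needed)  # one modest worker node, twice a day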
This information along with a guesstimate of how long it might take, say 15 minutes to process such
a file, and armed with the cost per hour information from AWS/GCP/Azure we can quickly calculate
roughly what it will cost to run this pipeline for a day.
In our above example, we know that two files of 10GB each will be ingested and turned into
compressed parquet files in S3. A quick Google search and some testing will show you that we
can probably conservatively get 70% storage savings moving from CSV to Parquet.
yearly storage = 2 files per day x 10GB per file x 365 days in a year x 0.30 (size of compressed parquets) = ~2.2TBs of data.
Many times understanding basic compute and storage requirements during the architecture phase
of pipeline work might change some of the decisions we make.
In the example discussed above after thinking about cost and complexity we might decide that
processing those files on a Spark cluster is overkill. Spark is more expensive to run in the cloud and
our 10GB per file could easily be processed with our Apache Airflow framework we are already
running for example.
While all these calculations might be simple, going through the exercise of understanding roughly
the storage and compute requirements might change our architecture decisions about what tool(s)
to use and our approach to the problem.
• Step back and understand where the data is going to be, and what it needs to look like.
• End results will often change the way we get those results.
Getting a clear picture of the output and usage of the data needed from the pipeline will inform or
change some of the choices we make upfront. Knowing if the data is needed for a dashboard or a
web application at the very least is going to change how we store the data, and how we store the
data might have an impact on what tool we choose to process and output that data.
It’s all connected and should not be overlooked before jumping into designing a data pipeline.
In data engineering, complexity and simplicity are often in tension with each other. On one hand,
complex data pipelines can handle more sophisticated use cases and support more advanced data
processing capabilities. On the other hand, simple data pipelines are often easier to build, maintain,
and troubleshoot.
There are several factors that contribute to the complexity of a data pipeline:
Data sources:
The more data sources you have, the more complex your pipeline becomes.
Data transformations:
As you add more transformations to your pipeline, the logic becomes more complex.
Dependencies:
The more dependencies your pipeline has, the more complex it becomes.
Error handling:
The more error handling you have in your pipeline, the more complex it becomes.
Scale:
The larger the scale of your pipeline, the more complex it becomes.
While complexity can offer more advanced capabilities and better results, there are several down-
sides to overly complex data pipelines:
Difficulty in maintenance:
Complex pipelines are more difficult to maintain, troubleshoot and modify.
More prone to errors:
Complex pipelines have more potential points of failure and are more prone to errors.
Longer development times:
Complex pipelines often take longer to develop and require more specialized expertise.
Higher costs:
Complex pipelines often require more resources and infrastructure, leading to higher costs.
On the other hand, simple data pipelines have several advantages:
Easier to maintain:
Simple pipelines are easier to maintain, troubleshoot, and modify.
More reliable:
Simple pipelines have fewer potential points of failure, and thus are more reliable.
Shorter development times:
Simple pipelines often take less time to develop and can be implemented more quickly.
Lower costs:
Simple pipelines require fewer resources and infrastructure, leading to lower costs.
There is a tradeoff between complexity and simplicity in data pipelines. While complex pipelines
offer more advanced capabilities, they are often more difficult and expensive to maintain. Simple
pipelines, on the other hand, are easier to develop and maintain, and can be more reliable and
cost-effective. Data engineers should carefully consider the tradeoffs and choose the right balance
between complexity and simplicity based on the specific needs of their organization.
There are few things in the world of data engineering that will cause problems like over-complexity.
It takes many forms and has terrible side effects.
I am not saying that no system should ever be complex; some problems and data are inherently hard to work on and require complex solutions.
On the reverse side, overly simplistic solutions have their own set of problems.
Both extremes should be avoided and have many pitfalls. But good engineers know how to strike
that delicate balance. Choosing new tools and technology simply because they are new is a bad habit
and happens too often. It’s important to remember when architecting pipelines that we should let
the data and requirements drive the complexity, not the other way around.
Forcing simplicity or complexity onto a problem is the wrong approach.
Data engineers should try to be balanced when designing pipelines. They should take into consider-
ation all that we have talked about, and then step back and evaluate the options they have chosen.
Does the data and situation warrant the complexity, or on the other hand the simplicity?
Will future developers and data/business changes break the solution? There are always reasons to
incorporate simplicity and complexity depending on the situation. The job of the data engineer is
to balance all these requirements and find the middle ground that can be reliable, scalable, and
approachable.
Knowing that we will most likely have to come back to a data pipeline to make changes, to add data or functionality, should give us pause when architecting systems.
Understanding Cost
A topic that is important to many businesses is trying to get a handle on costs. I would argue that
this might even be one of the easiest steps to complete if all of the other steps above have already
been completed.
If you understand the size, volume, and types of data that will be coming, along with what
technology choices are likely to be the best fit, figuring out the cost is usually not that hard.
Knowing the size and growth of storage requirements, along with looking at pricing documentation
for say s3 or azure storage will be an easy calculation.
The same applies to compute cost, once we understand our data size and volume we can usually
figure out what size of machine(s) are needed to process that data. Again with today’s cloud
platforms, it’s very straightforward to find per hour pricing on computing and do some simple and
rough calculations.
Again, when picking cloud offerings for certain technologies that might be required, say like AWS
EMR, pricing is always available for these decisions and can be added to the cost.
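For illustration, here is a hedged sketch of that kind of cost math; the per-GB and per-hour prices below are placeholders and should be swapped for the provider's current published pricing.

# Rough monthly cost sketch using placeholder prices.
storage_gb = 2.2 * 1024                 # yearly storage estimate from earlier
price_per_gb_month = 0.023              # placeholder object-storage price
storage_cost = storage_gb * price_per_gb_month

cluster_hours_per_month = 0.5 * 30      # assumes half a cluster-hour per day
price_per_cluster_hour = 2.50           # placeholder for a small managed cluster
compute_cost = cluster_hours_per_month * price_per_cluster_hour

print(round(storage_cost, 2), round(compute_cost, 2))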
Code Architecture
Another area I want to touch on briefly that is closely related to what we’ve just discussed is the
architecture of our code. This is another good topic to think about after we’ve completed the steps
above and we are ready to dive into writing pipeline code.
Once you’ve decided to move down a path towards a solution, you’ve set up Dockerfiles, tests, some
documentation, the only thing left is to write code.
Everyone has different styles of code writing and usually strong opinions about it. Some people
practice TDD (Test Driven Development) to the nth degree, others do not.
Some people write in an OOP fashion, others with a more functional approach. It doesn’t matter
what style you choose, and there are gobs of books written on software architecture. I'm not here to
write a book on that, but I do think making a few points will be helpful.
I’m generally just advocating that you think carefully and take pride in the data pipeline code you
write. Try to stick to normal software engineering best practices while writing your pipeline.
Whether you’re using Scala, Python, or Java, just take the time to understand what best practices
are and follow them.
There are only two approaches you can take with data pipelines: batch or streaming. These require two separate technology stacks and approaches to the pipelines and architecture.
Batch processing involves processing data in discrete chunks or batches, typically over a period
of time, such as hourly, daily, or weekly. Batch processing is often used for non-real-time, offline
processing of data, such as generating reports or data analysis.
Streaming processing, on the other hand, involves processing data in real-time, as it is generated or
received. Streaming processing is often used for real-time analysis, monitoring, and alerting.
There are several differences between batch and streaming processing:
Data latency:
Batch processing has higher latency as it processes data in discrete chunks, while streaming
processing has lower latency as it processes data in real-time.
Processing model:
Batch processing uses a pull-based model, where the data is pulled from a source and processed,
while streaming processing uses a push-based model, where data is pushed to the processing system.
Data processing:
Batch processing involves processing large volumes of data in a single batch, while streaming
processing involves processing small amounts of data in real-time.
Error handling:
Batch processing typically involves processing and handling errors after the batch is complete, while
streaming processing involves handling errors in real-time.
Infrastructure requirements:
Batch processing requires less infrastructure to manage and process large amounts of data, while
streaming processing requires a more robust and scalable infrastructure to handle real-time data
processing.
Both batch and streaming architectures have their own advantages and disadvantages, and the
choice between the two depends on the specific needs and use case of the organization. Batch
processing is better suited for non-real-time, offline data processing, while streaming processing
is better suited for real-time data processing and analysis.
Streaming
Streaming use cases are fairly obvious, and many times the only choice. If we are tasked with
ingesting high volumes of high velocity and relatively small data points, streaming is of course
probably your only and best option.
Tools like Kafka, Pulsar, Spark streaming, and the like, as well as the popularity of microservices
that produce and send messages, have made streaming data pipelines very common. Typically the
data isn’t very flat or relational, not like a CSV, by more ragged and fits JSON data structures better.
• High-velocity data.
• Smaller data record sizes.
• Increases complexity of pipeline architecture.
It’s best to think about streaming data pipeline architecture like a garden hose of water. Lots of small
messages flying about. Typically in a use-case, this will become quite clear.
Batch
Usually batch is the default and most common type of data pipeline. When real-time isn't that necessary, and data points are produced in batches, together, in a very flattened form such as a file, batch processing will suit you just fine.
This is typically the big difference between streaming and batch systems. Batch data pipelines will
be working on ingesting file(s), while streaming pipelines will be working on individual messages
or data points.
Most of what you will learn about data engineering applies to both batch and streaming systems.
While of course, the technology behind each is different, technologies usually aren’t the hard part
to learn, it’s more about the nuances of the particular data sets that provide most of the challenges.
Puzzle Pieces
Before I close the chapter on architecture, I want to talk a little more about the big picture. Data
engineering can be a difficult subject to master, mostly because of the wide range of technology
used.
Many times one must be versed in every cloud provider product, streaming, batch, data storage,
command line, servers, networks, database, programming languages, and the list goes on. One thing
will become more clear as you gain more experience, architecture is about the big picture, and putting
puzzle pieces together.
Because of the numerous sets of tooling, and the fact that one tool can never provide end-to-end
modern data pipelines, the ability to fit puzzle pieces together is key.
Understanding how storage solutions work together with data transformation and analytics tools,
and how those tools work with orchestration tools, these types of decisions are what architecture is
really about.
Many times tools are picked for data pipelines and the other puzzle pieces are ignored until the
implementation phase happens, at which time roadblocks and other unknowns arise, causing project
delays and problems. How do you avoid and overcome such obstacles?
Remember when building data pipelines, no detail is too small to plan for. Think about all aspects of
the data pipeline, all the pieces, and parts, choose the solution for each puzzle piece, and then test or
explore each one. Many times simply skimming the technical documentation for each puzzle piece will reveal pitfalls or the pros and cons of such a solution.
This is really what data pipeline architecture boils down to. Being able to design a simple, elegant,
cost-effective way to stitch technology together to produce the desired result.
Summary
What makes an exceptional data engineer? Sure coding skills have a lot to do with it, having
experience and know-how with different technology stacks is key. But, some of the best engineers
I have met are the ones who take their time.
Don’t underestimate the value of getting into the habit of doing simple architecture planning on
even a small data pipeline project. Planning in all areas of our life always pays back big dividends,
the world of data engineering is no different.
Thinking through the pros and cons of approaching problems in different ways is so helpful for our
critical thinking skills, and we will always learn something on the journey of trying to understand
and plan for that problem at a deeper level.
Chapter 4 - Storage
Probably one of the most overlooked pieces of the data engineering puzzle is storage. In the world
of moving data around and transforming it, it’s hard to overstate the important role that file types
and storage play in our data pipelines.
If you stop and think about data engineering in general, what is it that you do as a data engineer,
you will conclude that it starts and stops with file storage and file types, especially in the age of big
data.
File and data storage are critical components of data engineering pipelines, as they enable the
efficient and effective storage, retrieval, and manipulation of large volumes of data. Here are some
key reasons why file and data storage are important in data engineering:
Data storage enables efficient data processing:
Without data storage, data processing would be slow and inefficient, as data would have to be
constantly reloaded and reprocessed. By storing data in a structured and organized manner, data
engineers can quickly access and process data, leading to faster and more efficient data pipelines.
Data storage enables data analysis:
Data storage allows for the storage and retrieval of large volumes of data, which is essential for
performing data analysis. By storing data in a way that enables easy querying and analysis, data
engineers can quickly generate insights and make data-driven decisions.
Data storage enables data sharing:
Storing data in a centralized location enables multiple teams and stakeholders to access and use the
same data, leading to greater collaboration and more effective decision-making.
Data storage enables data security:
Storing data in a secure manner is critical to protect sensitive information and ensure compliance
with data privacy regulations. By storing data in secure locations, such as encrypted databases or
cloud storage, data engineers can minimize the risk of data breaches and ensure the confidentiality
of sensitive information.
Pipelines typically start as some sort of file(s), and the result usually ends up as a file(s). While this
could be database files, in the new era of big data it’s becoming more common to store data outside
of traditional database systems.
Understanding what types of files and compression best fit certain types of data and use cases is
a very undervalued skill. Choosing Parquet vs Avro vs JSON vs CSV vs HDF5 vs RDBMS. These
decisions should be driven by the data needs and access patterns and should be explored before just
“picking” something because it’s easy.
With the shift towards a new type of data warehousing using data lakes and file storage instead
of the traditional relational database systems, a data engineer who knows how to best layout file
storage is priceless.
Many people gloss over the storage aspect of pipelines, not realizing how much there is to learn and
the productivity and speed gains that can be found in properly picking file storage and partitioning
strategies.
We will talk through the six basic storage fundamentals that every data engineer needs to take
seriously when working on pipelines. While these don’t cover every topic, this is a good introduction
to what is important for a data engineer to understand.
• Access patterns.
• SQL/NoSQL vs files.
• File types.
• Compression.
• Storage location.
• Partitions.
First, we will talk about data access patterns and how they affect our decision-making when thinking
about storage. What do I mean when I say access patterns?
Access Patterns
Data access patterns are about how files need to be read and searched by our data pipelines, and
those systems that ingest the results of our data pipelines. This includes a few topics you should
always keep in mind.
• How many system(s) will need access to the data storage layer?
• How often will the above system(s) be accessing the data storage?
• How much data will these system(s) be reading?
• How much logic will be applied by those systems to the data?
• How does the system technically access the data?
It’s best practice to understand how your data will be consumed, this has ramifications around
design and tooling.
In general, and usually upfront, data engineers must decide if the storage layer is going to include a
relational database system of some kind.
Usually, this is a straightforward decision based on the requirements of the pipeline. If the data
warehouse or other storage layer already exists in say MySQL, Postgres, or SQLServer then you
may have no choice in the matter. But this is becoming less and less common today with the growth
of data, much of it unstructured or semi-structured.
Access patterns can also include how much of the data will be accessed at one time. Will it just be
small pieces or large chunks of data that are needed by outside systems? Having a storage system
that supports fast single lookups vs serving requests for GB’s or TB’s of data will change how we
think and choose technology.
SQL/NoSQL Databases vs Files
SQL: SQL (Structured Query Language) databases are relational systems that store data in structured tables with a predefined schema. They are ideal for structured data with a consistent schema, such as financial data or customer information. SQL databases are
known for their ACID (Atomicity, Consistency, Isolation, Durability) compliance, which ensures
data integrity and consistency.
NoSQL: NoSQL (Not only SQL) is a type of non-relational database management system that stores
data in a more flexible and dynamic manner compared to SQL databases. NoSQL databases are
ideal for storing unstructured or semi-structured data such as social media data, sensor data, or log
files. Unlike SQL databases, NoSQL databases do not require a predefined schema and can be scaled
horizontally to handle large volumes of data.
Files: Files are a type of storage system that stores data in a raw, unstructured format. Files are
commonly used for storing large volumes of unstructured data, such as images, videos, or log files.
Files can be stored in a variety of formats, including text files, CSV files, JSON files, and more.
In summary, SQL, NoSQL, and files are different types of storage systems used in data engineering
pipelines. SQL databases are ideal for structured data with a consistent schema, NoSQL databases
are ideal for unstructured or semi-structured data, and files are ideal for storing large volumes of
unstructured data. The choice of storage system depends on the specific needs and requirements of
the organization.
How can I know if my data should be in a SQL database or in some other filesystem or store?
While there is no way to know without exact specifications, there are plenty of rules of thumb you
can follow to help make that decision.
Is the data best suited to fit into multiple tables that have very straightforward relationships, and is the data very tabular in nature? More explicitly, can it easily be represented in a spreadsheet? If the answer is yes, you might consider a SQL database.
Also, think about how often the data arrives and how often the data is needed. If you are planning
on ingesting multi-millions of records a day into the data store and serving up just as many requests
for that data … it’s possible a relational database will work, but it won’t be your average setup!
Documents that are not very relational but are transactional are also great candidates for NoSQL systems like MongoDB and DynamoDB. The use cases for these types of data are usually pretty obvious. Many times it comes down to whether the data needs to be more on the analytical side or the transactional side.
Data that leans more into the key-value pair model, with possible built-in hierarchical, ragged structures, is prime for NoSQL data stores.
I’ve seen 10GB of data stored in DynamoDB because someone thought it was cool. Don’t be this
data engineer, take the time to understand that data and what storage best fits the need.
>> customer.json
{
  "customer_information": {
    "customer_id": 543267,
    "customer_name": "Billbo Baggins",
    "customer_address": "100 BagEnd, The Shire, Middle Earth"
  },
  "order_date": "2021-04-01 23:45:03:01",
  "product_info": {
    "product_id": 567894,
    "quantity": 5
  }
}
Something like MongoDb or DynamoDb would be a great obvious choice for such a use case. Being
able to store the incoming order information in its document structure without transforming it, right
into a database that can be connected to the customer portal is straightforward.
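As a sketch of how little work that takes (the connection string and the database/collection names below are assumptions), landing the document with Python and pymongo might look like this:

import json
from pymongo import MongoClient

# Hypothetical local MongoDB instance and database/collection names.
client = MongoClient('mongodb://localhost:27017')
orders = client.store.customer_orders

with open('customer.json') as f:
    order = json.load(f)

# The document is stored as-is; no up-front schema or relational modeling needed.
orders.insert_one(order)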
Of course, we could make this work easily in a relational database like Postgres, but again always try
to pick the simple solution. Evaluate which technology best fits the need in a manner that is simple
and supports a reasonable expansion of requirements and data.
Any sort of requirement for supporting transactional systems, from web apps to manufacturing systems where CRUD and ACID reign supreme, is a classic database use case. But, as time goes on, traditional database storage has given way to feature-rich file storage solutions.
I’m now going to side-step relational database systems, we can talk about that more in the SQL
chapter. Today’s big data is dominated by different file types.
File Types
Now that the data engineering world turns on storage systems like s3, along with the amazing
growth of data, it’s become less and less common to store data in traditional databases. Files have
become the defacto new data lake. There has been a lot of feature development around file storage,
especially when it comes to big data.
It doesn’t matter much if files are being stored in the cloud or not, the type of file chosen as the
storage layer has a massive impact on data pipeline performance and usability. At a high level, what
should you worry about when choosing file storage?
• Data compression.
• Smart files (predicate push down support, selective reads).
• File type usability by ingestion systems and code.
• Schema and data type support.
• Row vs columnar storage.
A data engineer needs to understand high-level file storage options and what use case warrants the
use of each file storage option. Most of the decisions are straightforward when it comes to choosing
file storage options. If you understand a few basics about file types, typically a use case will make
obvious what the best choice is.
We are going to try and cover a few of the more popular file options in the big data world for storage,
we will look at the most common use cases for each, and try them each with Python. By the end
you should have a basic understanding of each file storage option and when it’s best to use them.
Before we dive into the specifics and examples for each file type, let’s talk about a common topic
called row vs columnar storage that you are likely to run into.
Row vs Columnar Storage
Row-based formats store all the values of a single record together, while columnar formats store all the values of a single column together. This means at a high level the choice between columnar vs row storage is all about the access patterns of the data and how it is going to be used.
Traditional data warehouse or data lake patterns designed for analytical computations would fall into
the columnar storage system. High-volume transactional data being streamed from a clickstream
would make more sense as row-based storage.
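To make the difference concrete, here is a tiny illustrative sketch (the records and field names are made up) of the same three records laid out row-wise and column-wise using plain Python structures:

# three records from a hypothetical orders feed
orders = [
    {"order_id": 1, "customer_id": 543267, "order_amount": 15.05},
    {"order_id": 2, "customer_id": 543268, "order_amount": 9.99},
    {"order_id": 3, "customer_id": 543269, "order_amount": 42.00},
]

# row layout: each record's values are kept together (great for whole-record reads and writes)
row_layout = [tuple(order.values()) for order in orders]

# columnar layout: each column's values are kept together (great for scanning a single column)
column_layout = {
    "order_id": [o["order_id"] for o in orders],
    "customer_id": [o["customer_id"] for o in orders],
    "order_amount": [o["order_amount"] for o in orders],
}

# an analytical question like "total order amount" only needs to touch one column
print(sum(column_layout["order_amount"]))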
Let’s get back to some specific examples and walk through some of the more popular file storage
formats you will find in the data engineering world.
• parquet
• avro
• orc
• csv/flat-file
• json
Parquet.
Probably the most popular columnar storage file type is Parquet. Parquet was made popular by the rise of Hadoop/Apache Spark and the new breed of data warehouses and data lakes.
It’s nearly impossible today to be a data engineer without using parquet files. Many times you might
be using parquet files and not even know it. For example, the open-source data lake tool called Delta
Lake, which has been made popular by Databricks, uses parquet files underneath the hood to provide
many of its capabilities.
A few things have helped make parquet files popular as a storage format. For one, a parquet file carries a schema with well-defined data types, such as:
• string
• int
• decimal/float
• boolean
• binary
Parquet compression.
Parquet files also come with built-in compression that is no joke. This compression can turn into big savings when used with big data, once you start thinking about it over terabytes of data. It is not uncommon to see 70-80% storage savings when the same data is moved from CSV to parquet.
Parquet supports several compression codecs; these are the three you will see most often, and snappy will usually be the default.
• gzip
• snappy
• LZO
>> parquet.py
import pandas as pd

dataframe = pd.read_csv('sample.csv')
dataframe.to_parquet('sample.snappy.parquet', engine='auto', compression='snappy')
Projection reads.
>> parquet.py
import pyarrow.parquet as pq

data = pq.read_pandas('sample.snappy.parquet', columns=['ID', 'Date', 'Description']).to_pandas()
• Reading select columns from data sets is a game-changer for big data.
You can imagine how much performance can be gained from such a feature on large datasets. Unlike a CSV file, you don’t have to read everything; with parquet you can specify which columns you are interested in working with.
An example with PySpark, selecting only the two columns needed to sum a customer’s order amounts from a parquet dataset, might look like this.
data_set.select('customer_id', 'order_amount')
Parquet partitions.
Many times parquet files are partitioned into subfolders based on some column, usually a date. This
breaks the dataset up into smaller chunks.
Since parquet files are mostly used for big datasets, it’s rare to have parquet datasets that are not
partitioned.
Partitioning is an important topic we will cover a little later. What partitioning does for parquets
and other file storage systems is ensure your dataset is broken up into physical partitions.
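As a rough sketch of what that can look like in practice (the file path and column names here are assumptions, not from a real dataset), pandas with pyarrow can write a date-partitioned parquet dataset for you:

import pandas as pd

dataframe = pd.read_csv('orders.csv')
dataframe['order_date'] = pd.to_datetime(dataframe['order_date'])
dataframe['year'] = dataframe['order_date'].dt.year
dataframe['month'] = dataframe['order_date'].dt.month

# writes a folder structure like orders_parquet/year=2021/month=8/<part files>
dataframe.to_parquet('orders_parquet', partition_cols=['year', 'month'])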
Predicate pushdowns.
Predicate pushdown is another feature of parquet files that makes them fast and a great tool for data
engineers. Because of the metadata stored with parquet files, reading a dataset with filters that are
“pushed” down to the file can create huge performance gains over file formats like CSV that don’t
have this feature.
A pyarrow example, assuming parquet_dataset is a pyarrow dataset object, might look something like this.
import pyarrow.dataset as ds
parquet_dataset.to_table(filter=ds.field('order_amount') >= 7)
This means you can quickly read data that meets filters.
These topics just scratch the surface of what parquet files can do. I encourage you to remember the features and benefits of parquet files and to think hard about them before writing your next dataset as a CSV.
Avro.
The Avro file format has been around as long as Hadoop has. It is often described as a “data serialization” framework. It’s always touted as being able to handle flexible and wide-ranging data structures, even hierarchical ones within records. It’s also built to be row-oriented (different from Parquet).
I think of Avro as JSON on steroids: a compact, row-oriented format with the schema baked in.
Avro vs Parquet.
It’s interesting to note that the Avro format is used widely in the RPC (remote procedure call) space,
to communicate messages and data across networks. This is where it widely differs from parquet.
Where parquet will be found a lot in say large datasets used for data lakes and data warehousing,
Avro is going to be found in more highly transactional and messaging systems and architectures.
One of the first things you will notice when starting to read about or work with Avro is that the schema is an integral concept. You can’t write data to an Avro file without having or defining a schema first. Most commonly the schema is defined with JSON, which makes it very approachable, but it is very different from most file formats you are probably used to.
Avro Examples.
The first thing you will have to do when creating an Avro dataset is define the schema. In most cases, it’s easiest to just write a Python dictionary and keep it in memory or in a JSON file.
A couple of key points: below you will notice the parse_schema() method used to turn the JSON/dict into an Avro schema. Also of note, you will want some sort of iterator that contains your records/data.
Otherwise, fastavro in Python exposes a reader() and a writer() whose style you will most likely be familiar with. Here is my example of writing records to a file, then reading them back in.
import fastavro

# create the schema from a dictionary
avro_schema = {"namespace": "middle_earth.avro",
               "type": "record",
               "name": "MiddleEarth",
               "fields": [
                   {"name": "character", "type": "string"},
                   {"name": "position", "type": "string"},
                   {"name": "dragontreasure", "type": "int"}
               ]
               }

avro_schema = fastavro.parse_schema(avro_schema)  # turn the dict/json into an Avro schema

# some sort of records stream/iterable
heroes = [
    {'character': 'Gandalf', 'position': 'Wizard', 'dragontreasure': 50},
    {'character': 'Samwise', 'position': 'Hobbit', 'dragontreasure': 1},
    {'character': 'Gollum', 'position': 'Sneaker', 'dragontreasure': 5},
    {'character': 'Mr.Oakenshield', 'position': 'Dwarf', 'dragontreasure': 1000},
]

avro_file = 'middle_earth.avro'

# write the records to the file
with open(avro_file, 'wb') as write_file:
    fastavro.writer(write_file, avro_schema, heroes)

# read the file back
with open(avro_file, 'rb') as in_file:
    for record in fastavro.reader(in_file):
        print(record)
Here is a more realistic example: streaming records out of a CSV file, transforming them, and writing them to an Avro file.
>> avro.py
import fastavro
import csv

bike_share_file = 'Divvy_Trips_2019_Q3.csv'
avro_file = 'bike_share.avro'

# create the schema from a dictionary
avro_schema = {"namespace": "bike_share.avro",
               "type": "record",
               "name": "BikeShare",
               "fields": [
                   {"name": "trip_id", "type": "string"},
                   {"name": "bikeid", "type": "string"},
                   {"name": "from_station_id", "type": "string"},
                   {"name": "gender", "type": ["string", "null"]},
                   {"name": "birthyear", "type": ["string", "null"]}
               ]
               }

# turn the dict/json into an Avro schema
avro_schema = fastavro.parse_schema(avro_schema)


def stream_csv_records(file_location: str) -> iter:
    with open(file_location) as f:
        creader = csv.reader(f)
        next(creader)  # skip the header row
        for row in creader:
            yield row


def transformed_stream(record_stream: iter) -> iter:
    for record in record_stream:
        avro_record = {u"trip_id": str(record[0]),
                       u"bikeid": str(record[3]),
                       u"from_station_id": str(record[5]),
                       u"gender": str(record[10]),
                       u"birthyear": str(record[11])
                       }
        yield avro_record


records_stream = stream_csv_records(bike_share_file)
transformed_records = transformed_stream(records_stream)

# write the transformed records to the Avro file
with open(avro_file, 'wb') as write_file:
    fastavro.writer(write_file, avro_schema, transformed_records)
You’re probably less likely to run into Avro in the wild, but it could happen. It’s built more for serialized messaging systems and for complex schemas that maybe don’t work well in Parquet, for example.
Orc.
Another popular storage option in the big data and data engineering world is ORC. It is probably not as popular as parquet, but popular enough to have gained decent adoption in the community.
Stripes are integral to ORC. They are 64MB in size by default and are “independent” from each other, allowing distributed work to happen on a single file. Columns are stored separately from each other within a stripe, allowing only the needed data to be read.
The indexes in an ORC file allow read filters to be pushed down into the file, signaling which stripes need to be read. You will notice this is similar to parquet files.
It’s also worth noting that working with ORC is not as easy as working with parquet or other files. You can’t just work with ORC files with a simple Python setup; it was designed for Hive, and you will most likely not run into it unless you’re working on legacy systems.
You will have more luck working with ORC using Java or Scala, the JVM-based languages. Otherwise, you can use Spark to convert data into ORC format.
1 df = spark.read.csv('bikes/*.csv', header='true')
2 df.write.format("orc").save("bikes/orcs/orcy.orc")
If you read about ORC files you will probably run across information indicating that ORC is faster than Parquet. While this may be true, be warned: gaining a few seconds here and there is probably less important than how well the file type is supported by different tech stacks and the size of the community available to support it.
What I’m trying to say is that unless you have a good reason, choose parquet over ORC.
CSV / Flat-file.
These two file types are probably the most common formats used in data engineering; even some big data workloads, unwisely, use CSVs and flat files. These are file types that have been around a long time. What are they?
CSVs and flat files get nothing special: no compression, no built-in headers. They are easy and simple to use, which is why they are found so often in data engineering pipelines. If your data is small, it probably makes sense to use CSVs and flat files.
CSVs and flat files are just text files with some sort of separator between values; typically a comma (,) is used, although many times you will see pipes (|) or tabs used as the delimiter.
It also isn’t uncommon for values to be double-quoted. Here are some things to remember when working with or learning CSV and flat files.
• No compression.
• Data values are separated by a comma (,), but also by pipes (|) and other characters.
• No schema or data type support.
• Many times data points will be qualified, i.e. surrounded by quotes (“x”).
• Very easy to read and write in most languages.
>> example.csv
ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
A847FADBBC638E45,docked_bike,2020-04-26 17:45:14,2020-04-26 18:12:03,Eckhart Park,86,Lincoln Ave & Diversey Pkwy,152,41.8964,-87.661,41.9322,-87.6586,member
5405B80E996FF60D,docked_bike,2020-04-17 17:08:54,2020-04-17 17:17:03,Drake Ave & Fullerton Ave,503,Kosciuszko Park,499,41.9244,-87.7154,41.9306,-87.7238,member
Opening and working with CSV files couldn’t be easier in Python. Learning how to work with CSV
and flat-file records is key for any data engineer.
1 >> test.py
2 import csv
3
4 def open_csv_file(file_location: str) -> object:
5 with open(file_location) as f:
6 csv_reader = csv.reader(f)
7 for row in csv_reader:
8 print(row)
9
10 if __name__ == '__main__':
11 open_csv_file(file_location='PortfoliobyBorrowerLocation-Table 1.csv')
Most data engineers stay away from Pandas when working with data because it’s not a big data tool
that is scalable. Pandas is well known for having memory issues when working with data. But, that
being said, CSV and flat-files are typically not big and can easily be manipulated with Pandas, and
it should be used when it makes sense to do so.
>> pandas.py
import pandas

def open_csv_file(file_location: str) -> object:
    dataframe = pandas.read_csv(file_location)
    for index, row in dataframe.iterrows():
        print(row['Location'], row['Balance (in billions)'], row['Borrowers (in thousands)'])

if __name__ == '__main__':
    open_csv_file(file_location='PortfoliobyBorrowerLocation-Table 1.csv')
What should someone building pipelines remember about using CSV and flat files? They are
probably the easiest storage option, and they work great for data that can be represented as a table,
with rows and columns.
It’s important to remember that they provide no out-of-the-box compression options. If you have large datasets you will have to zip or gzip the files yourself. They don’t hold schema information either.
Also, you will struggle to support complex data types and structures in CSV and flat files. They are
meant to be a basic data file tool and should be kept that way.
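If you do stick with CSV, at least compress it. As a small sketch (the file names here are made up), pandas will gzip a CSV on write just by being told the compression type:

import pandas as pd

dataframe = pd.read_csv('trips.csv')

# write the same data back out gzip-compressed; most readers (pandas, Spark, etc.)
# can read a .csv.gz file directly
dataframe.to_csv('trips.csv.gz', index=False, compression='gzip')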
JSON
Another popular format found all over data pipelines is JSON. Many times it’s used for configuration,
and not as the main storage type, but you will find s3 buckets full of JSON files in your engineering
travels.
JSON data is best for ragged, hierarchical schemas. It acts just like a Python dictionary. It can contain lists and arrays, as well as depth in its schema. Its popularity rose to meteoric highs with its use in REST APIs as the data transmission format of choice.
>> sample.json
example_json = {
    "some_key": "some_value",
    "customer_id": 1234,
    "order_amount": 15.05,
    "products": [456, 987, 43]
}
How is JSON most likely to be used? This is an interesting question, and we have to go back to the
basics to understand when we should, or should not, use JSON.
We should not choose JSON if we are storing large amounts of data in a storage bucket. Why? Because JSON files don’t provide the partitioning, predicate pushdown, or compression of other file formats.
You do not want to end up with millions of data points spread across millions of JSON files unless they are later going to be ingested into another system for analytics.
JSON is a great tool for storing configurations for applications, as well as messages and individual data objects being passed around inside a data application. If your data represents tabular formats, don’t use JSON; if your data is more document-oriented, it could be a great choice.
Let’s look at a few quick examples of using JSON in Python, and how it might fit into a data pipeline
application.
>> json.py
import json

def read_configuration(config_uri: str = 'config.json') -> dict:
    """ read configurations file, data input and output locations """
    with open(config_uri) as cf:
        config = json.load(cf)
    return config
JSON makes a perfect data structure for holding configurations and passing information around inside your code and programs. It’s used a lot in APIs and the web world. It’s a great, versatile tool, but as a data engineer you have to remember its limited usability on big data platforms.
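As a quick illustrative sketch of that “passing information around” use case (the message fields are made up), json.dumps() and json.loads() move a small data object between pipeline steps:

import json

# a small message one pipeline step hands to the next
message = {"customer_id": 1234, "order_amount": 15.05, "status": "received"}

payload = json.dumps(message)    # serialize the dict to a JSON string
received = json.loads(payload)   # deserialize it back into a dict

print(received["customer_id"], received["status"])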
Compression.
The compression of files is an important topic when working with data. As the size of data grows,
and thus the cost, it’s important to remember size matters when storing files.
There are many types of file compression, but in data engineering, you will most likely see the
following types…
• gzip
• tar
• snappy
• zip
There are others but these are the ones you will run across daily. Honestly, when storing files there
should rarely be a case where those files are not compressed in some manner. It’s easy to compress
files, it saves size and money, and there isn’t much of a penalty to uncompress those files on read.
Many tools like Spark will automatically uncompress files on read for you!
When you can save 80% storage space per file just by compression, it is a no-brainer.
Most compression can be done easily on the command line; if you have a Linux flavor like Ubuntu, these packages are just an apt-get install away.
>> bash
# gzip a single file
gzip file1.txt
# gzip multiple files
gzip file1.txt file2.txt

# tar and gzip a directory of files
tar -zcvf example.tar.gz /my/source/data
Also, never forget that you can easily work with these compression types in your pipeline code. For example, say we are downloading a gzip-compressed file from AWS s3 with Python and need to decompress and read it.
>> read.py
obj = s3.get_object(Bucket='my-s3-bucket', Key='file-1.csv')
return gzip.GzipFile(fileobj=BytesIO(obj['Body'].read())).read().splitlines()
Or again, we have an object of data in Python and we want to write it to a gzip object.
>> gzip.py
import gzip
from io import BytesIO

my_data = "my, row, of, csv, data"
data_io = BytesIO()
with gzip.GzipFile(fileobj=data_io, mode="wb") as f:
    f.write(my_data.encode("utf-8") + b"\n")
The main takeaway for any data engineer on compression should be: use it. By nature we will be storing and warehousing large amounts of data, so it makes sense, for cost reasons alone, not to mention the ease of use of compression tools, to find a suitable compression for our files and use it.
Storage location.
Choosing the storage location for your data is a lot easier than it used to be. Actual HDFS and
Hadoop file systems are uncommon and legacy now. Cloud storage provided by AWS, GCP, and
Azure rules the day. These storage locations in the cloud are typically referred to as “buckets.”
This is great for a few reasons: high availability and access from anywhere are pretty much standard
operating procedure now. No more network mounts and drives to mess with. But, some complexity
is added with cloud storage.
You have to learn and become adept at using the command line interfaces provided by the cloud
providers to interact with their storage systems. These command-line tools like aws CLI and gsutil
are how bulk files are moved around.
It could be by copying, moving, deleting, or syncing, but you will have to learn how these operations
work.
Also, when coding against the storage system in data pipelines, you must learn the packages and libraries provided by the cloud companies, things like the Python package boto3 used to interact with s3. If you are using GCP, the google-cloud-storage package and gsutil will most likely have a place in your pipelines.
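Here is a minimal, hedged boto3 sketch (the bucket, prefixes, and key names are made up) of the kinds of storage interactions a pipeline typically needs: listing, downloading, and uploading objects in s3.

import boto3

s3 = boto3.client('s3')

# list objects under a prefix
response = s3.list_objects_v2(Bucket='my-data-bucket', Prefix='raw/2021/')
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])

# download a file to local disk, then upload a processed version back
s3.download_file('my-data-bucket', 'raw/2021/file_1.csv', 'file_1.csv')
s3.upload_file('file_1_clean.csv', 'my-data-bucket', 'clean/2021/file_1_clean.csv')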
These tools can add some complexity, but what they provide in return is usually worth the hassle. It’s also worth thinking about cost when it comes to storage. Many of the popular cloud storage options, like s3, only charge fractions of a cent per GB stored, but what they don’t mention is the cost to read and write that data.
This is something data engineers should think about. Reading and writing massive amounts of data is going to increase your charges; many times the devil is in the details.
Partitions.
Partitioning strategies are the greatest unsung heroes of data engineering and file storage that aren’t
talked about enough, nor is it taught with any consistency.
What is partitioning? It’s how the data files you are working with are physically stored and organized
on a disk or in the cloud.
How do you know if the data you are working with is partitioned or not? Just answer this question: are the files randomly located in any old place or folder … or is there structure to the madness?
One of the most common partitioning strategies in data engineering is date partitioning. So some
folder structure has a directory that might be broken down by year, month, and day. Here is an
example.
1 year=2021
2 month=08
3 day=15
4 ...file_1.txt
5 ...file_2.txt
As you can see, this makes it easy for people and for code to find data that is in a single year, or even a single month or a particular day. This is fundamentally what partitioning data is.
In another case, we might be working in an environment where we gather lots of information about
a few repeat customers or clients. We might run into a data partitioning strategy as follows.
1 client_number=123
2 invoices
3 year=2021
4 month=10
5 day=01
6 client_number=456
7 invoices
8 year=2021
9 month=10
10 day=01
What I’m trying to convey here is that data partitioning is a powerful strategy and is going to be
closely related to how your data is either produced or consumed. And when you have terabytes or
petabytes of data, you can’t read every file when a pipeline needs some data; that would take too long and make the system unusable.
This is the key to data partitioning: it allows for more fine-grained file system seeking and data locating. The program can home in on exactly where the needed data lives.
This topic is so important that even popular tools like Apache Spark have it built right into their
APIs.
1 dataframe.write.partitionBy('year', 'month').parquet('/mnt/my_data/')
The idea of grouping data and saving it in partitions is at the core of big data, and most tools now
and in the future that data engineers will be working with will have some form of this partitioning
idea.
Just writing data and files to a single folder, and letting that one folder collect thousands upon thousands of files, will give you or someone else a headache down the road. Think about how your ETL would need to interact with that data later. What kind of predicates or filtering statements could be applied? If you’re familiar with SQL, this is the same thinking you apply to the WHERE clause.
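As a hedged sketch of how that plays out with the Spark example above (the path is the same made-up one, and an existing SparkSession named spark is assumed), filtering on the partition columns lets Spark skip every folder that doesn’t match:

# Spark only lists and reads the folders under year=2021/month=8
# instead of scanning the entire dataset
dataframe = (
    spark.read.parquet('/mnt/my_data/')
    .where("year = 2021 AND month = 8")
)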
Data partitioning can, of course, get very complicated, and there may be certain nuances to consider when you are working with specific tools, but in the end, I want you to remember the big picture.
Never forget the importance of just organizing your data and files into commonsense patterns that
most likely will present themselves based on the use case of your particular data.
That covers the major storage topics of this chapter.
• Access patterns.
• SQL/NoSQL vs files.
• File types.
• Compression.
• Storage location.
• Partitions.
Data access patterns are at the core of data storage. How the data is used should drive our decisions around how and where we choose to store our data and files.
The question of SQL/NoSQL databases versus file-based storage will often come up at the beginning of most data projects. The two options require different approaches and tooling, leading to very different data architectures.
File types are always a fun topic to cover. We talked about the following file formats.
• parquet
• avro
• orc
• csv/flat-file
• json
Knowing the differences between these file types, and the situations each is suited for, is important and will make a difference in your data pipelines.
Lastly, we covered compression and partitioning. Compression will save time and money and is too often ignored. Data partitioning is at the heart of big data storage; if you are ever going to work with data of any size, it’s a topic you have to learn.
Chapter 5 - Compute and Resources
This chapter is all about managing resources. Believe it or not, this is a big part of what data engineers
do on a day-to-day basis and can affect every decision we make when writing code … and don’t
forget the bottom line.
At the start of this book, I mentioned our pipelines need to be scalable. Well, how do you think most
big data gets processed through ETL pipelines? This work happens on clusters and servers, often
referred to as compute in the distributed big data processing world.
That is the topic I want to bring to the forefront: compute, in the form of RAM and CPU. It’s quite common for data engineers at the start of their careers to struggle with Out Of Memory (OOM) issues, as well as performance problems that end up being related to resource usage.
CPU (Central Processing Unit) and memory are critical components in data engineering as they are
responsible for processing and storing data. Here are some key points to keep in mind regarding
CPU and memory in data engineering:
CPU:
The CPU is responsible for executing instructions and processing data. In data engineering, CPUs are
used to run data processing applications and algorithms. The speed and number of CPUs required
depends on the complexity and volume of the data being processed.
Memory:
Memory is used to store data temporarily while it is being processed. In data engineering, memory
is used to cache data, intermediate results, and program code. The amount of memory required
depends on the size of the data being processed and the complexity of the algorithms being used.
Memory hierarchy:
Memory is organized into a hierarchy, with the fastest and most expensive memory located closer to
the CPU. In data engineering, it is important to manage the memory hierarchy effectively to ensure
that data can be accessed and processed quickly.
Parallel processing:
Parallel processing is a technique used to improve the performance of data processing by dividing
the workload across multiple CPUs. This requires careful management of memory access and
synchronization to ensure that the data is processed correctly.
Optimization:
Optimizing the use of CPU and memory resources is critical in data engineering to ensure that the
data processing tasks are completed efficiently. This can involve techniques such as load balancing,
caching, and pipelining, which are used to minimize the amount of time that data spends waiting
for CPU or memory resources to become available.
The principles we will talk about apply even if you are writing what would be considered small data running on a single machine. Managing the resources and compute available to you is the key to a pipeline that is efficient and fast.
Overview
In the architecture chapter we covered some examples of calculating compute needs, but let’s get a
little more specific about what exactly a data engineer needs to take into consideration when writing
data pipelines.
• RAM
• CPU
• Storage
• Cluster/Node count (aka how many servers do I have to work with.)
Why should we care about such things? Honestly, because the above resource factors will affect how fast the code runs and how much data can be processed; we want to make sure we are not wasting compute and leaving it unused on the table; and of course, there is that little thing called cost!
Handling CPU and memory in data pipelines involves optimizing the use of resources to ensure that
data processing tasks are completed efficiently. Here are some strategies that can be used:
Optimize algorithms:
One of the most effective ways to improve CPU and memory usage is to optimize the data processing
algorithms used in the pipeline. This can involve using more efficient algorithms, reducing the
amount of data processed, or breaking up large tasks into smaller, more manageable pieces.
• Before writing pipelines, you should understand the resources that are available.
Once you have answered that question, you should do everything in your power when writing ETL and code to use every last scrap of RAM and CPU, leaving nothing on the table. This will probably take the form of two major topics, memory and CPU usage, regardless of the programming language of choice.
Some tools like Apache Spark and other distributed big data platforms take care of certain pieces
of that puzzle, resource consumption I mean, but there are probably many times when you will
just be writing custom code, and it will be up to you to make those decisions and write your code
accordingly.
How does this work out in real life? Let’s take a look at an example that has happened to me many
times in my career.
Example
In our example, you get assigned a new JIRA ticket. The gist of the ticket is that you will be receiving several raw text files that are fixed-width delimited (say each column of data is separated by 50 white spaces). Your task is to convert each file into a comma-delimited CSV file for processing by some downstream system.
Here are a few functions that you might write to accomplish this work.
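A minimal sketch of what those functions might look like, assuming the columns are separated by runs of whitespace as described above (the file locations and split logic here are illustrative):

import csv
from glob import glob


def convert_file(input_file: str, output_file: str) -> None:
    # read each fixed-width line, split it on the runs of whitespace,
    # and write the values back out comma-delimited
    with open(input_file) as infile, open(output_file, 'w', newline='') as outfile:
        writer = csv.writer(outfile)
        for line in infile:
            writer.writerow(line.split())


def convert_all(location: str = 'raw_files/*.txt') -> None:
    for input_file in glob(location):
        convert_file(input_file, input_file.replace('.txt', '.csv'))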
Easy enough, right? All goes well with your testing until you put it in production. You realize now that you receive about 40 files a day, ranging from a few MB up to 2GB in size … and the folder you are supposed to process has a few hundred historical files already in it.
You do the math and realize it’s going to take a month to process each file one by one. Also, the more you think about it, you realize the machine that will be processing these files on AWS is a c4.4xlarge, meaning it has 16 vCPUs and 30GB of RAM.
It’s easy to see that processing files one at a time is not only going to take forever, it’s also inefficient, leaving most of the compute resources on the machine wasted.
Data engineers should learn to think about processing more than one file, or record, at a time. Why settle for processing one file at a time in our example when we most likely have the resources available to work on many files at the same time?
This is something we as data engineers want to avoid. No matter if you are using Python or Scala, Spark or not, it’s clear that in the above situation we need to find a way to process more files concurrently and use every last bit of CPU and RAM to chew through the files.
RAM/Memory
This topic is probably the one that bites data engineers most often, yet there isn’t much training or material on how to manage memory consumption in everyday data engineering work.
There are plenty of software engineering resources that can teach you about different OOP methods and how certain data structures might be larger or smaller than others, but typically this type of material is directed toward general software engineering principles and is not that well suited to data engineering work.
Data engineers are usually thinking in terms of data pipelines.
If you think about what data engineers do, day-to-day, it’s transforming large amounts of data via pipelines. So when a data engineer thinks about memory usage it typically comes down to the following question: will the data I need to process fit into the memory I have available?
This can be a tricky tightrope to walk. OOM (Out Of Memory) errors are common in data engineering. One of the easiest steps to take is just to calculate the size of the data file, or the piece of data you are working on, compare it to the size of the resource, and do the simple division to see how many pieces of data you can fit into memory while still leaving a little room for overhead.
• Calculating the RAM usage of a data pipeline isn’t that hard, it just takes a little work.
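A hedged, back-of-the-napkin version of that division (the machine size, overhead factor, and file path are assumptions):

import os

available_ram_gb = 30      # e.g. the RAM on a single worker (an assumption)
overhead_factor = 0.25     # leave ~25% headroom for the OS and runtime (an assumption)

file_size_gb = os.path.getsize('trips/file_1.csv') / (1024 ** 3)
usable_ram_gb = available_ram_gb * (1 - overhead_factor)

files_at_once = int(usable_ram_gb // file_size_gb)
print(f'Roughly {files_at_once} files of this size fit in memory at one time')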
On the other hand, we all know that memory is faster than disk, and working on data in memory is
much faster and is a great option to use when available. Let’s take Python for example and explore
features for working with in-memory data that every data engineer should be aware of.
Even if Python isn’t your first language, the ideas should generally transfer to other languages and
approaches. What I’m trying to accomplish here is to introduce you to the idea of thinking about
memory usage when writing your code. Most beginners don’t do this, and it never comes up until
something is breaking.
• Concepts are key; here is how we can use RAM for faster pipelines, with a real-world example.
Think about it. If you are the typical data engineer, business intelligence engineer, report developer,
or data analyst, what is one of the first topics you learn? Probably opening a file, maybe a CSV file,
a text file, a JSON file, or maybe it’s even a zip file full of other files.
It usually involves file-io. You may have never thought about it, but reading and
writing files to disk will be a bottleneck in programs, especially if you are reading and writing the
same file multiple times. Say you’re downloading a file over HTTP to disk, then unzipping it, then
reading the file in. That’s a lot of disk-io.
This is where StringIO and BytesIO come in, they are in-memory data objects. One way to think
about these streams of data is that they act as a File Object. What does that mean? It means you can
treat and interact with these objects like you would any other file, they have the File API on top of
them. You can read them, write to them, etc. A file but not a file, get it? And, it’s all done in memory.
The best part of IO streams is that they live in memory, and that means fast. Of course, you need the
resources to do this on your machine or server, but that usually isn’t a problem. One minor detail to
remember about a StringIO/BytesIO is that when created, it acts like an already opened file.
Let’s look at some examples of how this could work. Simplistic, but to the point.
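Here is a minimal sketch matching the walkthrough that follows (the rows are made up): write CSV rows into an in-memory StringIO buffer, seek back to the start, and read them out again.

import csv
from io import StringIO

# an in-memory, file-like object
in_memory_file = StringIO()

csv_writer = csv.writer(in_memory_file)
csv_writer.writerows([
    ['trip_id', 'bikeid', 'member_casual'],
    ['A847FADBBC638E45', '86', 'member'],
])

# after writing, the stream is positioned at the end; rewind before reading
in_memory_file.seek(0)
for line in in_memory_file:
    print(line.strip())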
A CSV writer takes a file object, and we have one of those, don’t we! Next, we call .writerows() on our csv_writer object and write two separate rows. The next part may seem a little strange to you: seek(0). Since we are dealing with a stream, a file-like object, and we’ve written two lines, technically that opened file object is at the “end.” To read back out of that file object and print the lines we wrote, we need to be at the beginning.
Let’s take this just slightly further to show how StringIO/BytesIO could be useful.
Example of BytesIO
Let’s say we have a boring job and our boss asks us to download information about Livestock and
Meat International Trade data from the government, and insert relevant information into a database.
The typical workflow would be to download the zip file, unpack it, read in the relevant CSV file from
disk, find the data, and off to the races.
Well, we know better now, don’t we? That workflow sounds like a few spots of file-io: writing the zip to disk, unzipping the files to disk, then reading the CSV file from disk. But there is a more excellent way.
import requests
from io import BytesIO
from zipfile import ZipFile

url = 'https://www.ers.usda.gov/webdocs/DataFiles/81475/LivestockMeatTrade.zip'

try:
    response = requests.get(url)
except requests.exceptions.RequestException:
    print('Problem downloading zip file.')
    raise

if response.status_code == 200:
    in_memory_zip = BytesIO(response.content)  # hold the zip archive in memory, no disk write
    with ZipFile(in_memory_zip) as zippy:
        for item in zippy.infolist():
            if 'Exports' in item.filename:
                with zippy.open(item.filename) as export_file:
                    for row in export_file:
                        print(row.decode('utf-8'))
Easy! It’s fast because we are doing everything in memory, and the code is simple and straightforward. Really what I’m showing you here is that many of the packages and methods you use in Python can take a file-like object, as in with ZipFile(in_memory_zip) as zippy: in this example. It doesn’t matter if it’s an actual file sitting on your disk or a file-like object sitting in memory.
RAM/Memory Review
I know that might have seemed a little bit of a deep dive for a book on data engineering. What I
want to get you thinking about is the code that drives your pipelines, it could be PySpark or it could
be just plain Python or Scala. If you are working with data, that data is going to be in memory at
some point.
Data in-memory is going to be faster than data on disk, most of us have at least heard this in passing,
“in-memory is so much faster.” But what data engineers need to think about when writing pipeline
code is … how much data am I dealing with?
How much data is going to be in-memory in a worst-case scenario when doing this specific
transformation or calculation? OOM errors are one of the most common problems that data
engineers run into early on in their careers.
This is usually because people get in the habit of assuming something works, and the data grows,
more data comes in, or just code gets pushed to production, and everything breaks.
What I want you to take away as a data engineer is how to take advantage of RAM, because it is
fast, but also be aware of how easily you can cause OOM errors as things scale.
RAM is fast and useful, it’s also limited and easy to run out of. You don’t want to leave a bunch of
memory on the table if you have it as an available resource. You don’t want to overuse it and cause
memory leaks and other errors, use it wisely.
CPU/Cores
Now that we have talked a little bit about memory/RAM, and how we should pay attention and use
it well, let’s talk about CPU. The thought process here is the same as with our memory discussions.
In today’s world, it’s very unlikely that you will be working with resources, servers, or virtual servers,
that only have a single core. In most cases, you will probably have a few cores, so leaving those cores
unused is a waste that could significantly reduce code run times when used properly.
I will say this upfront, no matter what language you use, even Python, concurrency is a tricky topic
even for good Software Engineers. Many tears have been spilled on the internet arguing about this
topic. But, let’s just cover the basics and show a few examples in Python of how you as a data
engineer can easily use all CPUs available to supercharge your pipelines.
Again, this isn’t specifically about Python, this is meant to teach you that many of the machines
your data pipelines will run on have multiple cores and CPUs available, and you should use them
all. It will make pipelines run faster, reduce compute cost (because you’re doing more with what
you have), and generally make your life better.
import csv
from glob import glob
from datetime import datetime
from concurrent.futures import ProcessPoolExecutor


def main():
    files = gather_files()
    # spread the files across multiple processes instead of working them one at a time
    with ProcessPoolExecutor(max_workers=3) as Thready:
        Thready.map(work_file, files)
    # the serial, one-file-at-a-time version, for comparison:
    # for file in files:
    #     rows = read_file(file)
    #     for row in rows:
    #         filter_row(row)


def work_file(file_loc: str) -> None:
    rows = read_file(file_loc)
    for row in rows:
        filter_row(row)


def gather_files(loc: str = 'trips/*.csv') -> iter:
    files = glob(loc)
    for file in files:
        yield file


def read_file(file_location: str) -> iter:
    with open(file_location, 'r') as f:
        data = csv.reader(f)
        next(data)  # skip the header row
        for row in data:
            yield row


def filter_row(row: object) -> None:
    if row[12] == 'member':
        print('member ride found')


if __name__ == '__main__':
    t1 = datetime.now()
    main()
    t2 = datetime.now()
    x = t2 - t1
    print(f'It took {x} to process files')
The point here isn’t to teach you how to use ProcessPoolExecutors, but to teach you that such tools exist and are available for you to use where appropriate. The problem breaks down to this, regardless of tools or language.
You have a list of work that needs to be done. You probably have a method or function that does
that actual work, now you need a tool to spread the list of work across several CPUs so you can do
more work at one time with your code.
In our case, that’s exactly what the ProcessPoolExecutor’s map function provided for us. We passed
it our function work_file and our list of work files and the rest is done for us! All of a sudden we
have multiple files being worked on at once.
Take some time to study this example if you are not familiar with this type of data processing. It’s about changing your mindset: breaking down a large task into its components and figuring out how you can spread the work around to multiple cores.
Storage
Storage is probably one of the easiest resources to manage, and we won’t devote a lot of time to this.
With the advent of cloud storage services like Azure Blob, AWS s3, and Google Cloud Storage the
idea of running out of disk space has floated into the past.
It’s still an important topic to talk about and I’ve seen disk issues pop up many times in pipelines.
This is mostly the case simply because data engineers think they don’t have to pay attention to
storage anymore.
Most servers and resources come with either a set amount of storage attached, or possibly some
elastic storage service that is quite large, or pipelines will just read and write directly to the cloud.
Whatever the case it’s easy to get caught with disk out-of-space errors if you are not careful.
Writing intermediate results to disk but not cleaning up after yourself, or not knowing upfront what disk space is available, is a good way to get into trouble.
Cluster/Nodes
Last but not least I want to talk about a difficult topic … cluster size or node count. This can be
tricky because so much depends on the tools and technology stack being used. It could be Spark,
Kubernetes, Airflow … all these systems have multiple workers, each with their own resources.
A general rule of thumb before you start working on a data pipeline is just to understand the system and resources you are working with … what do you have available? Even if only for the simple fact that you could greatly speed up your data processing if you knew you had 10 workers available and not just 1 or 2.
Any pipeline that isn’t just running on a single machine … which is becoming less and less common,
no matter the tech, is going to depend on the number and size of the workers available to process
data.
• Understand the system you are working on, what resources are available.
You would be surprised at how just knowing that you have x number of workers, each with x CPU and x RAM, will allow you to calculate throughput and spot possible pipeline problems as the data grows.
Being able to calculate how much data you have, and what each worker could or should handle, is something a data engineer should become comfortable doing. It will save you from wondering why a pipeline has been running for more than a day or is getting OOM errors.
If you never attempt to calculate data size and what resources are available for processing that data, then you are taking a shot in the dark; it may work out, or it may not!
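A hedged example of that back-of-the-napkin math (every number here is an assumption):

total_data_gb = 2000            # data the pipeline must process (an assumption)
workers = 10                    # workers in the cluster (an assumption)
gb_per_worker_per_hour = 25     # observed or estimated throughput (an assumption)

data_per_worker_gb = total_data_gb / workers
estimated_hours = data_per_worker_gb / gb_per_worker_per_hour
print(f'Each worker handles ~{data_per_worker_gb:.0f} GB, roughly {estimated_hours:.1f} hours of work')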
A cluster with nodes is a group of connected computers that work together to perform a common task
or set of tasks. Each computer in the cluster is referred to as a node, and all nodes work together to
perform computational tasks that would be difficult or impossible for a single computer to perform
on its own.
In a data engineering context, clusters with nodes are commonly used to process and analyze large
datasets. The cluster distributes data processing tasks across the nodes, allowing the workload to be
shared and completed more quickly than would be possible on a single machine.
Nodes in a cluster are typically connected to each other through a high-speed network, allowing
them to communicate quickly and efficiently. Each node in the cluster typically has its own CPU,
memory, and storage resources, and these resources can be combined to provide a more powerful
and scalable computing platform.
A cluster-with-nodes architecture is commonly used with distributed computing frameworks like
Apache Hadoop or Apache Spark, which are designed to enable distributed data processing and
analysis across a cluster of nodes. These frameworks allow data engineers to write code that can
be distributed across the cluster, and automatically manage the distribution of workloads across the
nodes.
Chapter 6 - Mastering SQL
Introduction To SQL
I don’t intend to go very deep into SQL or Relational Databases in this chapter. My goal is just to
give insights and high-level thoughts to those who are maybe just dipping their feet into the water.
Sure, I will cover the basics of SQL and what you should learn; if you are unfamiliar with some of the topics, use this chapter as a guide to what you should dig into later.
Feel free to skip this chapter if you have been a DBA or have been writing SQL for the last 15 years,
although sometimes a reminder about the basics is helpful!
I’ve met my fair share of snooty people who poo-poo SQL and databases as second-class hand-me-downs. I still remember talking to an academic computer science graduate who explained to me how he refused to teach database classes; he was just too good for that. Whatever. We, data engineers, know better!
How 90% of companies manage to operate as data-driven businesses just isn’t important to some people. There is probably nothing more important in the tool belt of a data engineer than being above average at SQL and databases.
Tuning queries, writing queries, indexing, designing data warehouses. I’m sure some Hadoop data engineers skipped this step of the RDBMS world, but that is not the normal path of a data engineer.
Here are some of the reasons why SQL is important in data engineering:
Data Manipulation:
SQL provides a set of commands that can be used to manipulate data in a relational database,
including selecting, updating, inserting, and deleting data. This makes it easy to modify large
datasets, and to extract the data needed for analysis.
Data Analysis:
SQL provides a powerful set of tools for analyzing data in a relational database. These tools
include aggregation functions, grouping, sorting, filtering, and joining tables. With these tools,
data engineers can perform complex analysis on large datasets, and extract insights that can drive
business decisions.
Data Integration:
SQL can be used to integrate data from different sources into a single database. This makes it easier
to combine data from multiple sources, and to perform analysis on the combined dataset.
Data Security:
SQL provides a set of security features that can be used to protect sensitive data in a relational
database. This includes features like user authentication, role-based access control, and encryption.
Scalability:
SQL can be used to manage and manipulate large datasets, making it a powerful tool for scaling
data processing pipelines. With SQL, data engineers can easily manage and analyze data at scale,
without having to worry about the limitations of traditional data processing tools.
Overall, SQL is an essential tool for data engineers, providing a powerful set of tools for data
manipulation, analysis, integration, security, and scalability.
Think of it like working on your car: there are differences between models, and you might have to check the manual a few times. But once you figure out how to change the oil, the brakes, the air filter, and the light bulbs, you can work on almost any car.
All the angry DBAs right now can just send me emails that I will promptly delete. No, the type of RDBMS doesn’t matter when learning SQL and database fundamentals.
Choosing the right type of database is an important decision that can have a significant impact on
the performance and scalability of your data engineering pipelines. Here are some factors to consider
when choosing the right type of database:
Data Structure:
Consider the structure of your data. If your data is highly structured, with a consistent schema, a
relational database may be the best choice. If your data is unstructured, with varying schema or data
types, a NoSQL database may be a better fit.
Scalability:
Consider the scalability requirements of your data engineering pipelines. If you need to scale your
pipelines to handle large volumes of data, a distributed database or a cloud-based database may be
a better choice.
Performance:
Consider the performance requirements of your pipelines. If you need to process data in real-time, a
streaming database may be a better choice. If you need to perform complex queries on large datasets,
a relational database may be a better fit.
Cost:
Consider the cost of different types of databases, including licensing fees, hardware costs, and
maintenance costs. Cloud-based databases may be a more cost-effective option for small to medium-
sized pipelines, while on-premises databases may be more cost-effective for larger pipelines.
Security:
Consider the security requirements of your pipelines. Some types of databases offer more robust
security features than others, including user authentication, encryption, and access control.
Overall, choosing the right type of database requires careful consideration of the structure, scalabil-
ity, performance, cost, and security requirements of your data engineering pipelines. By carefully
evaluating these factors, you can choose a database that meets your needs and helps you build
efficient and effective pipelines.
OLTP (Online Transaction Processing) is a system designed for transaction processing, where the focus is on handling many small, concurrent reads and writes from operational applications. The main goal of OLTP systems is to ensure data consistency and accuracy, with a focus on real-time processing and availability.
On the other hand, OLAP is a system designed for analytical processing, where the focus is on
querying and analyzing large volumes of data to gain insights and make informed decisions. OLAP
systems are typically used in decision-making environments, such as business intelligence and data
warehousing. The main goal of OLAP systems is to provide a multidimensional view of data, with
a focus on historical data analysis, complex queries, and advanced data visualization.
To summarize, OLTP is focused on transaction processing for real-time operations, while OLAP
is focused on analytical processing for decision-making purposes. Both OLTP and OLAP have
different architectures, data models, and processing requirements, and they are often used together
in enterprise systems to support various business needs.
OLTP = highly transactional in nature.
OLAP = analytical in nature, think data warehousing, aggregation.
Transactions vs analytics are the difference and should drive differences in how you think about
modeling and handling the design of SQL tables. Highly transactional systems are very concerned
about allowing for large volumes of inserts. Being able to support analytical queries in an OLTP
system just isn’t in the cards.
A typical OLAP database uses a star schema: a few central aggregate (fact) tables with many connected dimension tables.
On the other hand, OLTP tables are wide and few, made to ingest data quickly.
Many times OLTP tables will be wide with lots of columns, designed to handle high volumes of
inserts. Think maybe an online order coming into a system, financial or banking transactions hitting
a table. OLAP on the other hand is designed for aggregating and summarizing data, think analytics,
and data warehousing. Hopefully, that makes sense to you.
For most data engineering work, we find ourselves working in the OLAP world and design. We think
about structuring our data in a way that can answer questions with the aggregation of records.
Table design/layout.
In databases, the design and layout of tables play a crucial role in organizing and storing data
efficiently. A well-designed table can improve query performance, reduce data redundancy, and
facilitate data analysis.
Here are some essential considerations for designing tables in databases:
Define Primary Key:
A primary key is a unique identifier for each row in a table. It helps to establish relationships between
tables and enforce data integrity. It is essential to choose a primary key that is simple, unique, and
does not change over time.
Normalize Data:
Normalization is the process of organizing data in a database to reduce redundancy and improve data
integrity. It involves breaking down tables into smaller, more manageable tables and establishing
relationships between them. This helps to avoid data duplication, improves data consistency, and
simplifies queries.
Use Appropriate Data Types:
Choosing the appropriate data types for table columns can help reduce storage requirements and
improve performance. For example, using integer data type for numeric values and varchar data
type for text strings can help to minimize storage space and improve query performance.
Indexing:
Indexes can be used to improve query performance by creating a sorted copy of a column. This
allows the database to quickly find the rows that match a particular value or range of values. It is
essential to choose the appropriate columns to index and to avoid over-indexing, which can slow
down updates and inserts.
Partitioning:
Partitioning is the process of dividing large tables into smaller, more manageable sections. This can
help to improve performance, manageability, and scalability. It is particularly useful for tables with
high write loads or that are too large to fit in memory.
Denormalization:
While normalization is useful for reducing redundancy, it can sometimes lead to complex queries
and slower performance. Denormalization involves introducing redundancy into a table to improve
performance. This should be used sparingly and only when it provides a significant performance
benefit.
Designing tables in databases requires careful consideration of various factors, including data types,
normalization, indexing, partitioning, and denormalization. A well-designed table can improve
performance, reduce redundancy, and facilitate data analysis.
I’m going to mostly talk about the table design and layout questions data engineers will run into. If you start reading a lot about database and table design you will probably run into people talking about normal forms: “you should design this in the third normal form.”
“Normalizing” a table is really about data de-duplication, and the extent that you take that concept
into your table designs. Without worrying about exactly which normal form level you should be
designing in your SQL tables, I suggest a more pragmatic and common-sense approach.
• Normalizing tables aims to de-duplicate, ensuring integrity, and logical data management.
• Normalization can be taken too far.
• Be logical in breaking up your data into tables.
If you don’t normalize your SQL tables and end up with just two large tables that hold all the information needed, there is a good chance you didn’t normalize your data model enough. If you end up with 15 tables for a relatively simple data set, you took normalization too far.
I suggest looking for the middle ground. The easiest approach is just to group the datasets into their
logical units. Think about a warehouse that holds information about customers and their orders.
What is a logical way to group and design the SQL tables in a relational database?
• Customer information.
• Order information.
• Product information.
It isn’t any harder than that. Complexity is always the enemy, especially if the situation doesn’t
warrant that complexity.
I do suggest you read up on this concept of database and table normalization if you plan on working
in and around classical relational databases and data warehouse environments. I believe table designs
are closely related to the OLTP and OLAP thoughts above. It isn’t that hard to avoid bad table design,
most of it is common sense. Here are some rules I’ve used in the past.
• You have to know the queries first (try to understand data access patterns, even at a high level, before trying to design table layouts).
• Break the data up into logical units (customer data, orders, products).
• Don’t put everything in one table (this use case is rare).
• When you’re designing a table, think about how it relates to other tables (e.g. how will it join to other data?).
• Simple is usually better than fancy (and most likely means faster queries in the end).
• Understand data types (if you haven’t thought through the data type of each value, then you aren’t ready to design the table yet).
One rule is so key to table design in relational databases that it bears repeating: what piece of data, or combination of pieces of data, makes a record unique in your data set?
This is the concept of a primary key.
The primary key is the first thing you should define when designing a table; if you can’t, then you don’t understand the data well enough yet, or your table has a poor design.
Let’s walk through a quick Middle Earth example. Say we need tables to track:
• Hobbits
• Quests
• Companions
• Adventures
Example
1 #example.sql
2 CREATE TABLE middle_earth.Hobbits
3 (hobbit_id INT NOT NULL,
4 first_name VARCHAR(150),
5 last_name VARCHAR(150),
6 age INT DEFAULT 1,
7 favorite_food VARCHAR(250)
8 );
9
10 CREATE TABLE middle_earth.Quests
11 (quest_id INT NOT NULL,
12 quest_name VARCHAR(150),
13 location VARCHAR(200),
14 hobbit_id INT,
15 treasure_value BIGINT
16 );
17
18 CREATE TABLE middle_earth.Hobbit_Companions
19 (companion_id INT NOT NULL,
20 first_name VARCHAR(150),
21 last_name VARCHAR(150),
22 power_level INT DEFAULT 10
23 );
24
25 CREATE TABLE middle_earth.Adventures
26 (hobbit_id INT NOT NULL,
27 companion_id INT NOT NULL,
28 quest_id INT NOT NULL
29 );
We have hobbits, so I made a Hobbits table. Hobbits always go on quests, so that is an obvious one. And of course, a hobbit will never leave his hobbit-hole without a companion, bingo.
Finally, they will set off on an adventure together, the last table. Just apply that simple logic to
database table design and layout. Think about them as pieces of a puzzle and consider how they
relate to each other.
So which hobbits have been going on adventures with whom? Can we answer this question with the tables designed above?
1 >> example.sql
2 SELECT h.first_name, h.last_name, c.first_name, c.last_name, q.quest_name
3 FROM Hobbits h
4 INNER JOIN Adventures a ON h.hobbit_id = a.hobbit_id
5 INNER JOIN Hobbit_Companions c ON c.companion_id = a.companion_id
6 INNER JOIN Quests q ON q.quest_id = a.quest_id AND q.hobbit_id = a.hobbit_id;
You get the idea. Table design and layout are half art and half science.
Here are good rules of thumb for Data Engineers when it comes to database table design.
• Keep it simple.
• Break tables up into logical units.
• Always think about primary keys (what makes something unique).
Every table should have indexes (I hope you are not surprised at this novel idea). Table join keys
should be the primary and most important index (how two tables link together and relate to each
other).
If you have to put 10 indexes on your table, go back and read the section on table design and layout, because you failed. Think about what makes each table unique. This is a critical topic in database design.
Indexing is a database feature that allows for faster data retrieval by creating a copy of a table’s
column(s) in a separate data structure. This data structure is then used to look up data based on the
values in the indexed column(s), instead of scanning the entire table.
Here are some basics of indexing:
Index Types:
There are different types of indexes that can be used based on the type of data and queries. Common
types include B-tree, hash, bitmap, and full-text indexes.
Index Columns:
An index is created on one or more columns of a table. It is essential to choose the columns to index
carefully, as indexing too many or the wrong columns can slow down the database’s performance.
Indexing Process:
Indexing involves creating a separate data structure that stores a sorted copy of the indexed
column(s) along with a reference to the corresponding row in the original table. This structure is
then used to look up data more efficiently.
Index Maintenance:
When data is inserted, updated, or deleted in a table, the index needs to be updated accordingly. This
can slow down write operations and increase the database’s storage requirements.
Indexing Strategies:
Indexing strategies involve choosing the appropriate columns to index based on the most frequently
used queries. Indexing can also be combined with other database features, such as partitioning and
clustering, to further improve performance.
Query Optimization:
Indexing can significantly improve query performance, but it is not a silver bullet. Properly
optimized queries, including the use of indexes, can improve query execution time, and reduce
resource consumption.
Indexing is an important feature in databases that can improve query performance and reduce the
time required for data retrieval. Properly designed indexes can make a significant difference in
database performance, but it is essential to choose the appropriate columns to index and maintain
the index regularly.
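To make that concrete, here is a small, hedged sketch using the hobbit tables from earlier in this chapter (the index names are invented, and the exact syntax varies slightly between databases):

>> example.sql
-- index the join keys of the Adventures table so lookups back to the related tables are fast
CREATE INDEX idx_adventures_hobbit_id ON middle_earth.Adventures (hobbit_id);
CREATE INDEX idx_adventures_quest_id ON middle_earth.Adventures (quest_id);
CREATE INDEX idx_adventures_companion_id ON middle_earth.Adventures (companion_id);

A handful of indexes on the join keys like this is usually all a well-designed table needs.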
Firstly, let’s talk about poorly written queries, again, regardless of the database system. A key concept
is that databases were designed to have you work on data in sets, groups of data, not row by row.
>> example.sql
SELECT table_A.*, table_B.*
FROM table_A
INNER JOIN table_B ON table_A.key = table_B.key
What possible reason is there to SELECT every single column from one or more tables? Needing every column is rare, yet this type of query is very common. It wastes resources and can double or triple the size of result sets that don’t need to be that big.
The database must process your query as submitted; if you want it to run fast, ask for only what you need. Another common anti-pattern is burying a correlated subquery in the SELECT list, which forces that lookup to run for every row returned.
>> example.sql
SELECT table_A.*, (SELECT b.column FROM table_B as b WHERE b.column = table_A.column)
FROM table_A
Compare that to a query that selects only the columns it needs and filters as early as possible.

>> example.sql
SELECT table_A.column_1, table_A.column_2, table_A.column_3
FROM table_A
WHERE table_A.column_1 > 100 AND table_A.column_2 > now() AND table_A.column_3 IN (1, 2, 3)
Views bad.
Stop using VIEWS; they suck and always will. These nasty little buggers have been a blight on databases for many years, yet you still see them used from time to time, and they often drag down the performance of queries.
Yes, they can be useful when used well, but it’s usually best just to stay away.
The same goes for overly complicated chains of joins like the one below.
>> example.sql
SELECT a.column
FROM table_A as a
INNER JOIN table_B as b ON a.key = b.key
LEFT OUTER JOIN table_C as c ON c.key = a.key
RIGHT OUTER JOIN table_D as d ON d.key = c.key
...
It’s best not to let things get this complicated; besides working on your ego, you are most likely to introduce bugs and confusion.
Summary
I could go on, but I’m sure you get the point. Just like good table design, good query design and tuning starts with the simple approach. Think about the very basic things you need to accomplish first, do that, and then add the detail later. Try to use SQL that applies to the entire dataset, like GROUP BY and WHERE, instead of using a UDF or a SELECT inside the SELECT statement.
Always try to pare the data down as much as possible before getting fancy.
• Check indexes first. Are you joining on some column with no index?
• Are the index stats being updated?
• Look for bad SQL. Do the joins and subqueries look reasonable?
• Inspect the complex parts of the query. If something is complex it’s probably like that for a
reason. Find out why.
• Learn how to read query execution plans and stats.
Some SQL query problems can only be solved by getting way more into the weeds than you want
to.
SQL has always been near and dear to my heart; being on data warehousing teams is how I cut my teeth working in IT. After years of working around data warehousing and business intelligence, watching software engineers who were otherwise the best programmers struggle with database design and queries, it became obvious to me that the simple things are what matter.
There are always cases where even the DBAs have problems, that’s just life. But most of the time, sticking to the fundamentals will get you most of the way there.
SQL Fundamentals
What I want to do in this section is just list the fundamentals of writing SQL. I’m not planning on teaching each of these topics in depth. I just want to bring them up and give you a list of SQL features that you should be comfortable with. If you are unfamiliar with any of these, use it as an indication that you need further study in this area.
Here are the SQL features I want to glance over at a high level.
• SELECT statements
• JOINS
• GROUP BY and WINDOW functions
• AGGREGATE functions
• WHERE clause
• Subqueries
• CTEs
SELECT statements
SELECT statements in SQL are probably one of the first topics you learn. What should a good data
engineer know about SELECT statements?
Not paying attention to SELECT statements is the culprit behind many poor-performing queries. When you SELECT * from three different tables in your query, it should be no surprise things are slow.
Also, putting complex logic in SELECT statements usually means that logic must run on each row individually. In some cases, like CASE WHEN, this is needed; in many cases it is not, and it is expensive.
JOINS
JOIN statements are another topic most seasoned engineers are very familiar with, yet it is surprising how many times the simple JOIN types can cause problems and bugs.
You should know the different JOIN types like the back of your hand, and when you should and shouldn’t use them.
• LEFT or RIGHT OUTER (keep everything from one side, and only the matching records from the other)
• FULL (give me everything on both sides)
• ANTI JOIN (give me records found on one side but not the other)
What it boils down to is realizing when you need to use an INNER JOIN vs any other join. INNER joins will almost always drop records (because it is unlikely that everything matches between two tables). This is a simple concept, yet it causes lots of problems in applications from Postgres to Spark SQL.
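As a small, hedged sketch (the customers and orders tables and their columns are invented for illustration), here is the difference between an INNER join, a LEFT OUTER join, and the classic anti-join pattern:

>> example.sql
-- INNER JOIN: only customers that have at least one order survive
SELECT c.customer_id, o.order_id
FROM customers c
INNER JOIN orders o ON o.customer_id = c.customer_id;

-- LEFT OUTER JOIN: every customer survives, order columns are NULL where there is no match
SELECT c.customer_id, o.order_id
FROM customers c
LEFT OUTER JOIN orders o ON o.customer_id = c.customer_id;

-- anti-join: customers with no orders at all
SELECT c.customer_id
FROM customers c
LEFT OUTER JOIN orders o ON o.customer_id = c.customer_id
WHERE o.order_id IS NULL;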
WHERE clause
WHERE clauses are arguably one of the most important features of SQL, in a relational database
or big data system like Spark. Why? Because any filtering or subsetting that can be done with a
WHERE clause, as early as possible in the process, is going to ensure the best performance and run
time of a query.
Figuring out WHERE clauses also typically gives great insight into how the data should be indexed or partitioned.
Many times it is advisable to spend time understanding common WHERE clauses and filters while
a data model is being built before anything goes to production. This is how important the WHERE
clause is.
Subqueries vs CTEs
Subqueries and CTEs are very much the same thing, although each tool that provides them may have its own nuances and performance issues.
CTEs
At a basic level, there isn’t anything that special about a CTE, other than the possibility of recursion, which only applies to about 0.01% of the people reading this.
Subquery
Also known as the infamous nested query, the subquery usually gets a bad rap. Why? It gets overused, just like anything else. Still, subqueries are a powerful way to slowly and methodically build up complex queries and logic.
There are two types of subqueries:
• Correlated.
• Uncorrelated.
What does that mean? A correlated subquery references columns from the query “above” or “over” it, while an uncorrelated subquery stands entirely on its own.
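A minimal, hedged sketch of both forms (the orders and customers tables are invented for illustration): the CTE builds up logic in a named step, while the correlated subquery references the row of the outer query.

>> example.sql
-- CTE: name an intermediate result, then query it
WITH big_orders AS (
    SELECT customer_id, order_id, amount
    FROM orders
    WHERE amount > 100
)
SELECT customer_id, COUNT(*) AS big_order_count
FROM big_orders
GROUP BY customer_id;

-- correlated subquery: evaluated against each row of the outer query
SELECT c.customer_id,
       (SELECT MAX(o.amount) FROM orders o WHERE o.customer_id = c.customer_id) AS largest_order
FROM customers c;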
Python + SQL
I want to take a short little side excursion and talk about how to work with Python and Postgres, a
popular relational database. I’m not going to get too in-depth, but I want to cover the basics, so you
can start to get some idea about what is involved in connecting code to a database.
We are going to use a popular Python package called psycopg2 to do some simple examples.
The first step you will need to complete is getting the URI/URL for your Postgres connection to the database and server. This gives a lot of people trouble the first few times they have to put one together. Here are the parameters that are commonly used to connect to most databases.
• Host
• Database
• Port
• User
• Password
Probably some of the confusion comes in because, depending on the setup, only some of the parameters are needed, and of course, there are different ways to specify the URI and/or parameters.
You will probably see people connecting in two different ways.
• URI
• Parameters
>> example.py
import psycopg2

host = 'localhost'
database = 'my_database'
user = 'postgres'
password = 'postgres'  # 'pass' is a reserved keyword in Python, so use 'password'

# connecting with params
try:
    conn = psycopg2.connect(host=host, database=database, user=user, password=password)
except psycopg2.Error as e:
    print(f'Had problem connecting with error {e}.')
Here is one note to make: never make a database call or connection without catching and raising errors. You always need to know what went wrong, which will happen often when sending queries to the database and pulling records back.
Also, never take a database connection for granted. Clean up after yourself; if you have a DBA, they will thank you. Databases often have connection limits, and you don’t want to leave tons of orphaned connections that the server has to clean up. If you make a connection object, in our case conn, call conn.close() when done.
Cursors
There is something else that can be confusing about Python psycopg2 connections: you will use the connection to generate a database cursor, which in turn will execute SQL statements, but committing or rolling back transactions is done with the connection object. Otherwise, look into “auto-commit” if you don’t like being in control of your own destiny.
Let’s make a cursor and create a database table. Remember, a connection gives you the pipe to your database; a cursor is needed to execute a query against it. The most important method of a database cursor is execute(), or executemany().
import psycopg2

host = 'localhost'
database = 'my_database'
user = 'postgres'
password = 'postgres'  # again, avoid the reserved keyword 'pass'

try:
    conn = psycopg2.connect(host=host, database=database, user=user, password=password)
except psycopg2.Error as e:
    print(f'Had problem connecting with error {e}.')

database_cursor = conn.cursor()
query = 'CREATE TABLE books (id INT NOT NULL, author varchar(150), title varchar(50));'
try:
    database_cursor.execute(query)
    conn.commit()
except psycopg2.Error as e:
    print(f'Problem executing query {query} with error {e}.')
    conn.rollback()
Notice how I wrapped the cursor execute() call in a try except block to catch errors. Also notice
how I used the database connection to commit() the transaction if no errors are caught, as well as
using the connection to rollback() the transaction if there is an error.
This is probably the next area that gives people the most trouble. Once you’ve learned how to connect to your database and execute a simple query, you will probably want to parameterize those queries at some point, passing in values from the Python script. I’m not going to get into security, but I’m sure you see the implications here, so do your own research.
It’s really not that hard: all you have to do in your query string is put the characters %s where you would like the parameter to be pulled in. Then, for psycopg2, you pass a Python tuple with those parameter(s). It would look something like this.
query = 'INSERT INTO books (id, author, title) VALUES (%s, %s, %s);'
data = (1, "St. Augustine", "Confessions")
try:
    database_cursor.execute(query, data)
    conn.commit()  # don't forget to commit the transaction
except psycopg2.Error as e:
    print(f'Had problem with query {query} with error {e}.')
    conn.rollback()
Once you have a few records in your database, the next thing to do would be to retrieve them. The concept is very similar to executing an insert. There are three methods you can call on a cursor that has executed a SQL SELECT statement to retrieve data.
• fetchone()
• fetchmany(number_of_records_to_fetch)
• fetchall()
Most likely you should never need to call fetchone() yourself in production; fetching one row at a time is the default behavior, and a cursor object is iterable in Python, so you can just do the following.
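The snippet the author refers to isn’t reproduced in this text; as a minimal, hedged sketch reusing the books table and database_cursor from above:

>> example.py
database_cursor.execute('SELECT id, author, title FROM books;')
for row in database_cursor:  # the cursor fetches rows as you iterate
    print(row)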
Also, don’t forget the important step of cleaning up after yourself in the scripts!
database_cursor.close()
conn.close()
What I really wanted to show you in the above section about Python and databases was the concepts, not so much the code. Most connections to databases, in pretty much every language, require the same sort of parameters we talked about at the beginning.
Also, opening connections and cursors to the database and pushing and pulling records are all good concepts to start to grasp as a data engineer. Remembering to clean up after yourself is an important step as well.
SQL Summary
You will notice in this chapter I took more time with the concepts than with teaching you how to JOIN or filter a dataset. The reason for this is that there are a myriad of other resources to teach you how to group and aggregate SQL queries.
It’s a topic that has been run into the ground. I would say, from my experience, it isn’t picking up the syntax that is particularly difficult for data engineers. It’s always the simple topics and concepts that end up causing problems: not understanding how to correctly design database tables, missing the basics around indexing, or not knowing how to write clean and simple queries. Mastering the high-level concepts is more important; becoming an expert at writing window functions will come with time and practice.
Chapter 7 - Data Warehousing / Data Lakes
Closely related to the last topic of SQL is Data Warehousing and Data Lakes. Many things have
changed since the days of Data Warehousing existing solely on relational databases. Now, many
data lakes and data warehouses exist in cloud storage like s3. Even with these large changes, at the
core, many of the same concepts still apply when designing classic RDBMS Data Warehousing or
Data Lakes in the cloud on s3. But let’s try to cover the basics upfront.
• The Data Warehouse / Lake / House is usually defined differently by different people.
• Kimball star schema design is still very popular and useful.
The actual definition of a Data Warehouse or a Data Lake depends on the context of the discussion. Often the more important question is: what is the technology underlying your Data Warehouse or Data Lake?
For years, a Data Warehouse was just another name for a relational database of some sort that had a Kimball or Inmon style design applied to it. It isn’t so clear-cut anymore. But there are some things that both Data Lakes and Data Warehouses have in common.
Both places are the dumping ground for an organization’s data. It’s supposed to be the single source
of truth to which all data flows and is captured. So what’s the difference between a Data Warehouse
and a Data Lake from there?
I don’t know if there is a correct answer to this question, as many people define each differently, but
it’s safe to assume it has a lot to do with the data model applied. So, the data model (which is many
times driven by the underlying technology) is the main difference between a Data Warehouse and
a Data Lake.
You are probably aware that Data Warehouses and Data Lakes are there for OLAP, aka aggregation purposes. They are mostly used to answer high-level questions across the business that often require years of data. So, it comes down to how that data is stored in its “tables” and what the relationship looks like between the different data sets.
Generally speaking, both Data Warehouses and Data Lakes seek to, at a minimum, de-duplicate data, clean it, and store it logically, many times capturing historical versions of that data as well, known as Slowly Changing Dimensions (SCD), aka data that changes over time.
But, if we have to make more clear distinctions between these concepts, let’s give it a shot.
Data Warehouse, Data Lake, and Lake House are all different approaches to storing and managing
large volumes of data.
Data Warehouse:
A data warehouse is a central repository of integrated data from various sources in an organization.
It is designed to support business intelligence (BI) activities such as reporting, data analysis,
and decision-making. Data warehouses typically use a structured schema and relational database
technology to store data. Data is organized into tables with defined relationships, and the schema is
optimized for analytical queries.
Data Lake:
A data lake is a centralized repository of raw data from various sources in its native format. It
is designed to support big data processing and analytics workloads, including machine learning
and artificial intelligence (AI). Data lakes typically use a distributed file system, such as Hadoop or
Amazon S3, to store data. Data is not pre-processed, and the schema is not predefined, making it
easier to store unstructured and semi-structured data.
Lake House:
A lake house is a hybrid approach that combines the benefits of both data warehouses and data
lakes. It is designed to support real-time data processing and analytics workloads by providing a
unified platform for structured, semi-structured, and unstructured data. Lake houses typically use a
distributed file system to store raw data and a relational database for structured data. This approach
allows for faster data processing, real-time analytics, and more advanced AI and machine learning
capabilities.
Data warehouses are ideal for structured data, data lakes are ideal for unstructured and semi-
structured data, and lake houses provide a unified platform for both types of data. Each approach
has its benefits and drawbacks, and the choice of approach depends on the organization’s specific
data management needs and goals.
With all that being said, let’s dive into some topics you will most certainly run across as a Data
Engineer, regardless of if you end up working with Data Warehouses or Data Lakes.
Why can’t I just use the same old Kimball data model?
Some people might ask the question, “Well, if they both use the same ANSI SQL interface, why does
it matter, can’t I just model the same way I always have?”
Yes and No.
The problem is that between typical relational SQL databases and new Data Lakes there are two major differences.
Since time immemorial, the Kimball star style schema for Data Warehouses has reigned supreme, and it still does in many instances. It consists of the following typical points …
So yes, sure, you don’t have to change your data model, but things probably won’t work that well … I mean, you can drive your Ford Geo over the Rocky Mountains, but do you really want to do that?
Data modeling in data lakes refers to the process of organizing and structuring data in a way
that makes it easier to analyze and query. Unlike traditional data warehouses, data lakes do not
require a predefined schema or structure, which can make it challenging to organize and model
data effectively. However, there are several approaches that can be used to model data in data lakes:
Schema-on-Read:
Schema-on-read is an approach where the data is stored in its raw form, without a predefined schema.
The schema is applied when the data is read, based on the specific query or analysis being performed.
This approach provides greater flexibility and agility, as new data can be added and analyzed quickly,
without requiring changes to the schema.
Schema-on-Write:
Schema-on-write is an approach where the schema is defined upfront, and the data is structured
and transformed before it is stored in the data lake. This approach can provide better performance
and consistency, as the data is already structured and optimized for analysis.
Hybrid Approach:
A hybrid approach combines the best of both schema-on-read and schema-on-write. It involves
defining a set of core schemas or data models upfront, while also allowing for additional data to
be stored in a more flexible format. This approach can provide the benefits of both schema-on-read
and schema-on-write, while also addressing some of the limitations of each.
Practically, what does this mean for Data Engineers?
Assertions about Data Models for Data Lakes and Lake Houses.
I’m going to make a few assertions about data models for Data Lakes and Lake Houses, then we will
dive into each one.
• The data model isn’t as normalized; you will have fewer tables/data sinks.
• The data model is driven by partitions.
• Data models will have fewer tables and sinks in Data Lakes and Lake Houses.
• One technical issue that folks who work with Spark and other Big Data technologies have dealt with for a long time is the “small file problem.”
Many of the Big Data processing tools have hardships with data sets that are made up of many small files. There are multiple reasons for this, but what it really boils down to is that the technology behind the clusters running our compute was designed for large data sets.
These tools scale well with big data … sure, they can work on data sets that are GBs in size, but that isn’t where they shine.
Think about it: if you have 50GB of data and you split it into 10 different tables of 5GB each, all your transforms, tables, and queries are working on small datasets, most likely split into small files (if you’re working with data this size, you’re probably ingesting files with KBs of data, and if you don’t know about partitioning or compaction, all the files stay tiny).
You think Spark will shine, and that designing a Data Lake for your small data will solve all your problems … but it won’t.
Example
In the classic relational SQL database model, you focused on data deduplication, data normalization,
and star schema to the extreme. Typically in the Data Lake and Lake House world, we would not
model our file-based data sink in this manner.
We would instead focus on keeping data sources intact through our process … transforming them
from raw to target ready to consume. For example … in the classic Data Warehouse world, we would
most likely receive customer information with address information. We might go from a single data
file to three or four tables where we commonize address information.
You will probably not find the above type of very minute and extreme normalization and de-
duplication in many Data Lakes. Never say never though.
Data Lakes and Houses will contain Accumulators and Descriptors, not necessarily Facts and
Dimensions.
The only reason I don’t like to call Fact tables Fact tables in Data Lakes, and Dimensions as
Dimensions, is because of the historical context around Fact and Dimensions and all that entails.
It gives a certain impression about the Data Model under discussion, and I like to make distinctions.
Accumulators
Accumulators in a Data Lake aggregate or accumulate the transactional records, pretty much like a Fact table did in the past. The main difference is that the Accumulator table will probably contain fewer “keys” pointing to other tables; those would most likely have been stripped out and put into a Dimension table in the classic SQL Kimball model.
1 >> example.sql
2 CREATE TABLE transaction_accumulator (
3 transaction_id BIGINT,
4 transaction_description STRING,
5 amount FLOAT,
6 product_id INT,
7 product_description STRING,
8 product_category INT,
9 customer_id INT,
10 customer_name STRING,
11 customer_address_1 STRING,
12 .. etc , etc.
13 )
The Accumulator above is similar to a Fact table, but it isn’t stripped of all its descriptors, with that data normalized out into a myriad of other tables.
Descriptors
The Descriptors in the new Data Lake or Lake House are exactly like Dimensions; they are just less broken up and normalized, and there are fewer of them.
If the business requires, say, some distinct list of addresses, that would just be done as an additional sub-table, probably built and filled much like a Data Mart or analytic table further downstream. It wouldn’t be built into the Descriptor tables and given keys that are referenced in an Accumulator table.
Facts and dimensions are the two main components of a dimensional data model, which is commonly
used in data warehousing and business intelligence (BI) applications.
Facts:
Facts are the measurable, numeric data that represent the business events or transactions that an
organization wants to analyze. Examples of facts include sales revenue, customer orders, product
quantities, and website traffic. Facts are typically stored in fact tables, which contain the detailed
information about the events or transactions that occurred. Fact tables also include foreign keys that
reference the dimension tables.
Dimensions:
Dimensions are the characteristics or attributes that provide context and describe the facts. Examples
of dimensions include time, geography, product, customer, and salesperson. Dimension tables
contain the descriptive information about the dimensions, such as the date, country, product
name, customer name, and salesperson name. Dimension tables also contain primary keys that are
referenced by the fact tables.
Facts and dimensions are linked together through a relationship. The relationship between the fact
table and dimension table is established through foreign keys. The foreign key in the fact table refers
to the primary key in the dimension table, which allows the fact table to retrieve the descriptive
information from the dimension table.
The dimensional data model provides a way to organize and analyze large amounts of data in a
way that is intuitive and easy to understand. The fact tables represent the events or transactions
that occurred, while the dimension tables provide context and allow for analysis at different levels
of detail. The dimensional data model is particularly useful for BI and reporting applications, as it
enables users to easily slice and dice the data to gain insights into business performance.
Fact Tables.
Fact tables are the central dataset that grows the quickest and is used to aggregate data. Many times
they have some sort of date or time series associated with them and are transactional in most cases.
Many times the data in a Fact table could also be known as an “event,” a record that something
happened. It could be a sales order, an invoice, a widget being produced, a user clicking on a button.
These are data points that would naturally be summarized or aggregated.
Dimension Tables.
Dimension tables are datasets that help describe and extend information in a Fact table. They are
ancillary, like the leaves on a branch. They help us understand and interpret certain values in a Fact
table and their primary goal is to be used as a “look-up” table.
What is a Data Warehouse or Data Lake all about? It’s all about aggregating values. Think about fact tables (the ones that will be aggregated) as the hub of a wheel, with all the spokes coming off it being the dimensions.
Let’s look at a straightforward example of a Fact table and Dimension table. Of course, we will use
hobbits as our example, we know that hobbits go on adventures and amass treasure for themselves.
A good fact table in a data warehouse should consist of some nondescript ids, maybe a date, and
the values you would want to aggregate on. Say like the amount of treasure a hobbit was able to
compile over different time intervals.
What about the hobbit dimension that would describe a hobbit for us?
The Hobbit table is a good example of a dimension, a table that has important information that better
describes and extends the data, but may not be critical for the calculation and aggregation of values.
How do we combine the two?
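The example tables aren’t reproduced in this text; as a hedged, minimal sketch (the fact_treasure table and its columns are invented, with the Hobbits table from the SQL chapter standing in as the dimension), combining the two might look like this:

>> example.sql
CREATE TABLE middle_earth.fact_treasure
(hobbit_id INT NOT NULL,
treasure_date DATE NOT NULL,
treasure_amount BIGINT
);

SELECT h.first_name, h.last_name, SUM(f.treasure_amount) AS total_treasure
FROM middle_earth.fact_treasure f
INNER JOIN middle_earth.Hobbits h ON h.hobbit_id = f.hobbit_id
GROUP BY h.first_name, h.last_name;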
So we can easily formulate how much treasure our hobbits are gathering over time. Remember
simple = better = faster, when it comes to database queries. And data warehouses and data lakes
tend to get large, so designing your Fact and Dimension tables correctly becomes important.
Constraints.
One of the best parts of the Data Warehouse is the ability to enforce constraints upon our data. It
gives the ability to expect uniform values in the most critical data points. This concept is so key to
the Data Lake and Data Warehouse, that even file-based storage systems like Delta Lake support
constraints.
Now depending on the underlying technology, different constraints are supported in different
systems, but it’s good to have a general understanding of what’s available.
Here are some common constraints that can be useful in ensuring data integrity.
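The original examples aren’t reproduced in this text; as a hedged sketch of the two most basic kinds, a NOT NULL and a CHECK constraint on a hypothetical customers table might look like this:

>> example.sql
CREATE TABLE customers
(customer_id BIGINT NOT NULL,              -- basic: the value must always be present
email VARCHAR(250) NOT NULL,
account_status VARCHAR(20),
CONSTRAINT chk_account_status CHECK (account_status IN ('active', 'closed'))  -- more complex: restrict allowed values
);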
These two basic constraints in themselves will go a long way toward ensuring the correctness of data entering the system. As you can see from the above example, you can be very basic or fairly complex in the constraints you add to your data points. Remember, each constraint has to run on INSERT, so don’t go overboard either.
Schemas.
Schemas are the other important topic when talking about Data Warehouses or Data Lakes. What
I’m mostly referring to when I talk about schemas in this context is the ability to control and enforce
column names and data types.
Part of the problem with ingesting data from many different sources is that the data is, well, different.
We are trying to take data points that are different and somehow commonize them into a single store
from which we can answer any question.
• Data types.
• Column names.
In data engineering, you will soon realize it’s always a few small but problematic issues that continue
to plague most data systems decade after decade.
Data Types.
A Data Warehouse or Data Lake allows you to define all the data types you expect for each data
point beforehand. Many times mixing of INTEGERS with STRINGS, say 1,2,3 with a,b,c will cause
major issues that need to be fixed downstream. How can you SUM a column if it includes both
INTEGERS and letters from the alphabet?
Column Names.
Another major area of concern in data engineering is the change of column names over time. It
happens and will always happen, for many reasons. It could be a bug or it could be the business
changing. Very few data sources stay static forever.
Example
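The example itself isn’t reproduced in this text; a hedged sketch of the kind of schema being discussed (column names taken from the text below, the rest invented) might be:

>> example.sql
CREATE TABLE orders
(order_id BIGINT NOT NULL,
order_amount DECIMAL(10, 2),   -- aggregated regularly, so it must stay numeric
order_type VARCHAR(50),
order_date DATE
);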
In the above example, we know that order_amount is going to be aggregated regularly, so we don’t want STRINGs ending up in that column. As for order_type, what if a new data set arrives with a column titled order_description instead? These are the types of problems that a Data Warehouse or Data Lake aims to solve in the long run.
Nothing is more confusing than the “same” data source with different “versions” of data. This
introduces complex transformations and logic that has to “combine” data sets so a full view can
be taken of the single data source.
Data Warehouses and Data Lakes aim to solve this problem using data modeling, among other
techniques.
• Uniqueness.
• Relationships.
It’s critical to understand that the data repository that is the central hub of truth for an organization needs to be just that, the source of truth. What do we mean by truth?
Uniqueness.
The data in the Lake or Warehouse is going to be de-duplicated and somewhat transformed. When
we talked about primary keys and uniqueness in our chapter on SQL, this same topic applies and is
even more important in the Warehouse or Lake world.
We can’t de-duplicate and have unique data records without having keys, primary or otherwise that
identify our data. How can we relate Fact tables to their dimensions? This will be done by ids either
in a SQL table or as hashes generated by code in a Data Lake.
It isn’t so important how our particular technology requires us to generate an id for each data row, but that we take that step in the first place. Why? The difference between data dumped into cloud storage and a Warehouse or Lake is that in the latter we’ve taken the time to understand and transform the data into a unique and usable dataset that can be related to other datasets easily.
Example
Let’s look at an example of some rows of data, and the challenge that uniqueness can pose.
1 # dataset 1
2 customer_id | order_amount | order_date | order_source
3 12234 10.00 2021-12-01 56
4 12234 10.00 2021-12-01 67
As you can see in the above example, if we had these records in an incoming data file, being ingested
into our data warehouse, we might not have thought about this problem beforehand. A customer
places 2 orders on the same day for the same amount.
This requires ids to be generated for each row of data, in a Fact or Dimension table. These could be primary and foreign keys in a relational database, or hashes of the unique column(s) produced by our code and stored along with the data in our Lake in s3.
Being able to uniquely identify each Fact and join to Dimension on ideally a single column will
require an id to be generated during the creation and loading of our Data Warehouse or Data Lake.
In Spark SQL code this might be simply creating a hash of the columns that make a record unique.
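As a minimal, hedged sketch in Spark SQL (column names borrowed from the sample rows above, the raw_orders source is assumed), hashing the combination of columns that makes a record unique could look like this:

>> example.sql
SELECT sha2(concat_ws('||', customer_id, order_date, order_source), 256) AS order_key,
       customer_id,
       order_amount,
       order_date,
       order_source
FROM raw_orders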
Relationships.
We already mentioned this in passing above, but relationships are the crux of a good Data Warehouse or Data Lake. You are going to have a few main datasets or tables in your data store, and how these datasets and tables relate to each other is the key.
That is why the keys or hashes we use in our table structures to uniquely identify each row are so
important. We need to be able to JOIN our datasets together, without producing duplicates in our
Data Lakes.
Data is really about relationships. Sure, we might be able to answer a few simple questions using
a single table, but most business requirements require datasets to be joined to derive deep analysis
that provides the best insights and answers.
When building out a Data Warehouse or Data Lake, much time should be spent on defining the
relationships between our data tables and stores. We will dive into this more during the Data
Modeling chapter.
This might make more sense with an example. Say we have a Dimension table of Customers or
Accounts in our Data Lake, periodically someone moves and changes their postal code and address.
How do we keep track that this customer’s information changed?
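The example table isn’t reproduced in this text; a hedged sketch of what such a record history might look like (values invented) follows.

# dim_customer
customer_id | postal_code | effective_start | effective_end | active_flag
1001          55555         2020-01-01        2022-06-30      0
1001          66666         2022-07-01        9999-12-31      1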
Of course, this is a contrived example, but it gives a good idea of how important a start and end
date/time, as well as an active flag, can be for tracking CDC changes over time in a Data Lake or
Warehouse.
Active flag.
The active flag is important because if you just want the “current” snapshot, the most recent version of each record, you can simply add a WHERE or FILTER saying active_flag = 1.
Effective date/times.
The effective start and end times of a record also provide a great history tracking of what “happened”
to a customer or any data record for that matter. It gives the ability to time travel as well, to see
“what this looked like on X date.”
Of course, there are many ways to do this; some people use MERGE statements, and technically you can combine steps 1 and 3. But generally speaking, you simply need to break up your source records into two groups: those that already exist in the target and those that do not. Then it’s simply a matter of end-dating the existing records in the target (via UPDATE) and inserting the new records.
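The numbered steps referenced above aren’t reproduced in this text; as a simplified, hedged sketch of the general pattern (table and column names are hypothetical, matching the dimension sketch earlier):

>> example.sql
-- end-date the target rows that are being replaced
UPDATE dim_customer
SET effective_end = current_date, active_flag = 0
WHERE active_flag = 1
  AND customer_id IN (SELECT customer_id FROM staging_customers);

-- insert the new versions of those records
INSERT INTO dim_customer (customer_id, postal_code, effective_start, effective_end, active_flag)
SELECT customer_id, postal_code, current_date, DATE '9999-12-31', 1
FROM staging_customers;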
Key Takeaways.
There are a few high-level concepts that should be kept in mind while designing Data Lakes or Data Warehouses. We glossed over a lot of topics, but I hope you have a better idea of what to expect when working on a Data Lake or Warehouse.
It doesn’t matter if you will be working on a Data Warehouse or a Data Lake; the nuances between them are small, and it’s the details that matter. You should identify where the transactional data will accumulate with the most growth, as these will be your fact tables. The dimensions will grow more slowly, with ancillary data that helps extend and describe your fact tables.
Don’t go overboard with data normalization (more on that later); the fewer the tables the better, as they will be large. You should expect to have a few fact tables in the middle, with dimension tables that “surround” your facts, extending and describing them.
Never forget that placing constraints and the correct data types in the schemas of your Data Warehouse or Lake will save you time and trouble and pay dividends in the end.
Validating inputs and commonizing the data is a key part of what a Data Lake provides and the value it brings. Spend the extra time to put value constraints on high-value targets, the columns that find themselves in WHERE clauses or filters.
Having the correct ids or keys in our Warehouse is of utmost importance. Not having duplicate records, and having a way to uniquely identify each record, solves many problems that can arise during a Data Lake project.
Ids and keys are critical to being able to reliably join different datasets, like Facts and Dimensions, together without producing duplicate and erroneous results.
Summary
This was a rather big chapter with a lot of topics. As Data Warehouses have transitioned over to Data Lakes and Lake Houses, things have gotten confusing. It doesn’t help that the constant drip of marketing material from vendors never stops either. Many times it just confuses the issues at hand.
It’s most important to recognize the obvious differences between the different systems, their technology, and their data models.
Facts, dimensions, change data capture, constraints, keys, and data models: these are difficult topics to master without a lot of experience, but you can set yourself up for success by getting a broad understanding of these different skills.
Chapter 8 - Data Modeling
It might seem like this chapter rehashes a few topics we have already talked about during Data
Lake/Warehousing and SQL chapters, and this is true because at the core of data modeling there are
a few ideas and techniques that seem to apply to 90% of the data modeling that you will do as a Data
Engineer.
You can use the same skills to model small sets of data stored in a few Postgres tables and hundreds
of TB’s of data stored in Parquet format in s3. Get these few basic data modeling skills right and
they will help you in almost every data project you work on.
Don’t ever forget that data modeling is half art and half science, many topics related to data modeling
are heavily dependent on the use case in question. We will call this concept “access patterns,” how
is the data used by the consumers. A data model should reflect how the data is going to be used.
Is the data in question going to be aggregating many values over specific time series, and products?
Then the data model should reflect that need and provide a schema for the data that can support
such needs.
Here are the highlights before we dig in.
• Datatype.
• Constraints.
• Definition and name.
Data types are used to define the type of data that can be stored in a field or column of a database
table. They specify the range of values that a field can take, the storage requirements, and the
operations that can be performed on the data. Choosing the appropriate data types is an important
aspect of data modeling, as it can affect the performance, accuracy, and usability of the data.
Schema, on the other hand, refers to the structure or organization of the database tables, fields, and
relationships. It defines the rules and constraints that govern how the data is stored, validated, and
accessed. The schema determines how the data is organized into logical units, such as tables, views,
and indexes. It also provides a framework for managing the relationships between tables, such as
primary keys, foreign keys, and referential integrity constraints.
In data modeling, data types and schema are closely related because the choice of data types can
affect the design of the schema. For example, if a field requires a large amount of text data, such as a
product description or customer feedback, a text data type may be more appropriate than a numeric
data type. This choice of data type may then affect the design of the schema, such as the size of the
field, the indexing strategy, and the data validation rules.
Data types and schema are important components of data modeling that are closely related. The
choice of data types can affect the design of the schema, while the schema provides a framework for
managing and organizing the data in a structured and meaningful way.
Data Types.
Data types are a way of categorizing the type of data that can be stored in a field or column of a
database table. They define the range of values that a field can take, the storage requirements, and
the operations that can be performed on the data.
Common data types in databases include:
Integer:
This data type is used to store whole numbers, such as 1, 2, 3, and so on. The size of the integer can
vary depending on the range of values that need to be stored.
Float:
This data type is used to store decimal numbers, such as 1.5, 2.3, and so on. The precision and size
of the float can vary depending on the level of accuracy required.
Date/Time:
This data type is used to store dates and times, such as 2022-03-28 or 15:30:00. The format and
precision of the date/time data type can vary depending on the requirements.
String/Text:
This data type is used to store text data, such as names, addresses, and descriptions. The size of the
string/text data type can vary depending on the length of the text to be stored.
Boolean:
This data type is used to store true/false values, such as yes/no or on/off.
There are also more specialized data types, such as binary, XML, and JSON, that are used for specific
types of data.
The choice of data type can affect the performance, accuracy, and usability of the data. For example,
using a numeric data type for a field that requires text data can lead to data truncation or errors.
Using a large data type for a field that requires small amounts of data can lead to wasted storage
space. Therefore, it is important to choose the appropriate data type for each field based on the
requirements of the data model.
Generally speaking, whatever data storage solution you are using, relational database or parquet,
basic data types don’t change very much. You should become familiar with a few of the most
common.
• STRING or CHAR
• INT or BIG INT
• DECIMAL or FLOAT
• ARRAY
• BOOLEAN or TRUE/FALSE
Example
Usually, it’s obvious from the column name what the data type should be, but it’s easy to fall into traps. We should think carefully about assigning data types: should an amount be INT or DECIMAL?
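The book’s example isn’t reproduced in this text; a hedged sketch of the kind of choices being described (a hypothetical sales table) might be:

>> example.sql
CREATE TABLE sales
(sale_id BIGINT NOT NULL,
sale_amount DECIMAL(12, 2),   -- not INT, amounts carry cents
quantity INT,                 -- whole units, INT is enough
is_returned BOOLEAN,
sale_date DATE
);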
Of course, the specific data types will change slightly if you’re using Postgres vs a parquet file, but generally you will be using some form of a string or char, int or decimal, true and false boolean values, and sometimes arrays or lists, or other more complex types.
Data Size.
Remember to consider data size when deciding which data type should hold which data points. It’s going to be a delicate balance: you don’t want to run out of room with a STRING or CHAR of a certain size, but you also don’t want to waste a lot of space by calling every column a STRING just to make things easy.
All data is stored on disk eventually, it could be cloud storage in s3 or just a local drive on a server.
Either way, data can add up over time, choosing the correct data type for your data points will have
a large impact on storage size.
Data Engineers should spend time with their source data, examining values, looking for the min and
max’s of each column, looking at the lengths of strings, and using good old common sense.
It can be easy to fall into the trap of just glossing over data types: assigning values to STRINGs just because you can, not understanding the DECIMAL precision needed for a numeric column, or not even knowing whether an INT needs to be a BIGINT.
Take your time to understand each data type in-depth, it will help you start with a solid data model
at its foundation.
Constraints.
After the data types have been identified, the next task is to understand the constraints that surround
each data point. This will be a familiar topic for those savvy SQL developers, but I want to extend
the idea a little further.
Constraints play an important role in ensuring data quality and consistency in a data warehouse. A
constraint is a rule that is applied to a database table to restrict the type or range of data that can be
stored in it. Constraints can be applied to individual columns or to the entire table.
Here are some common types of constraints used in a data warehouse:
Primary Key Constraint:
This constraint ensures that a unique value is stored in a specific column or combination of columns.
It is used to enforce data integrity by preventing duplicate values and making it easier to join tables.
Foreign Key Constraint:
This constraint ensures that the value in one column of a table is linked to the value in another
column of a different table. It is used to maintain referential integrity by preventing orphaned rows
and ensuring consistency in related data.
Not Null Constraint:
This constraint ensures that a column cannot have null values. It is used to enforce data integrity by
ensuring that required data is present.
Check Constraint:
This constraint ensures that the data stored in a column meets a specific condition or set of conditions.
It is used to enforce business rules and data quality standards.
Constraints are used to ensure that the data in a data warehouse is accurate, consistent, and
valid. They help to prevent data entry errors, enforce business rules, and maintain the integrity
of relationships between tables. By applying constraints to the data, data warehouse administrators
can ensure that the data is reliable and can be used for meaningful analysis and reporting.
Most people, especially those working in SQL systems, take a lackadaisical approach to defining constraints, both physical and theoretical. Even the easy and obvious constraints around data points are often ignored. But anyone with experience can tell you that having robust constraints around your data is the first step in data quality.
What type of constraints should be thought about for each data point?
Constraints are a great way in either Data Warehouses or Data Lakes to control the quality and
ensure the integrity of the data being stored. Dirty data is a classic problem, so catching problems
early can be accomplished easily with constraints. Think of constraints as your first line of defense.
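The example isn’t reproduced in this text; a hedged sketch of the sort of constraint being discussed (a hypothetical payment_responses table) might be:

>> example.sql
ALTER TABLE payment_responses
ADD CONSTRAINT chk_response CHECK (response IN ('approved', 'declined'));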
What if we didn’t have the constraint in place? Now we have a column with both declined and decline, and some unknowing analyst using our data source with a query that says WHERE response = 'declined' is now getting incorrect results.
Data Definitions.
This topic might seem a little basic, but it’s a very important one nonetheless. Having a definition of the data points is a two-fold task: readable column names that carry meaning, as well as descriptions of the data points.
Legible Names.
Ensure you define column names that tell someone what the data point is without too much problem.
Simply reading a column name should tell you a lot of what you need to know. If the column is a
code, call it ..._code, if it’s a description, call it ..._description. Simply calling something code
instead of response_code or description instead of product_description just causes ambiguity.
Descriptions.
Descriptions should also be kept for each data point. This is usually easiest to do when first creating the schema of the data. The more time that goes by, the less likely someone is to remember or know what a particular data set represents. Putting together documentation or comments in a codebase will come in handy later.
This topic seems boring and pointless until later when you are asked by the business what a definition
is or where to find some obscure data point.
A simple logical grouping of an organization’s data might look like this:
• Customer information
• Product information
• Order information
• Bill of material information
• Marketing information
• Accounting and Pricing information
This is a simple example of the logical data model. In your mind, you can already see the tables
forming that would fall into these different groups.
Grain of Data.
The idea of the “grain” of the data might be a new one to you, but it is an important thought process while data modeling. It becomes especially important during a Data Warehouse or Data Lake project, and even with some smaller relational database designs as well.
The grain of data is the lowest level of information available on a topic.
What does that mean in real life?
If you have a table that holds order information from customers, does the table hold a single record
describing the overall order, even if multiple items were ordered? Or does the data table hold a
record for each item that a customer orders, and to get the total order summary you have to roll up
those individual order lines?
1 # orders table
2 order_id | product_id | quantity|
3 456 4AB-C 3
4 456 DCA-01 1
In the above example, we might assume that the table holds one record for each order, but this is
not the case is it? The “grain” of this dataset is that each order can have multiple lines in the orders
table, one for each product ordered. This is an important distinction.
That might seem like a small difference, but it becomes very important when data modeling. Say you design a data model that relies on an Orders table having only a single line for each order, a summary of multiple items. What happens later when you realize you are missing important information that can only be obtained from a “lower” grain of data, meaning you need a record for each item ordered?
Knowing the lowest level of information that is required by the business logic and requirements that
are driving the data model is important. Identifying the lowest level of grain required should be one
of the first tasks when data modeling because it directly affects the final schema of the data model.
Uniqueness of Data.
Another data modeling technique that should always be thought about before writing any code is
the uniqueness of your data, and more specifically, each table or data sink you are trying to design
and model should have something that uniquely identifies each record or row.
This is important for several reasons. First, if you are unable to identify what makes a piece of data, or row, unique, then you most likely don’t understand enough about the data. Second, the uniqueness factor of the data ties directly into the primary keys that will eventually be needed for each record or join later.
Duplicates in data are one of the oldest and most common problems that have plagued databases
and data warehouses since the beginning of time. What was the problem, how did the duplicates
get there?
I would argue it was a failure during the data modeling process to identify for each logical group
of data, what made each record unique. It’s typically a combination of data points, sometimes even
combined with a date. Anyone working on a data model should ensure they understand what makes
the data unique, and what information might have to be added to make a data record unique, even
if that requires the addition of outside data like file name or insert date.
When duplicate records start to appear in data models, every join to other data sources starts to
compound the problem. And, once duplicate records are placed into the system, they are typically
painful to back out of and remove.
Access Patterns.
Access patterns are one of the least used data modeling techniques, mainly because they can be difficult to ascertain at the beginning of a project, when the data modeling is taking place.
• Understand queries.
• How do downstream systems consume the data.
• Understand business needs and requirements for the data.
What is a data access pattern? It’s how the data is queried and used in the end. When you’re just starting to work on a data model for a system, it can sometimes be hard to know exactly what type of queries will show up and how the data will be used. But usually, you can get at least a decent idea by talking to the downstream people and applications that will be using your data and data model.
Example
It doesn’t matter if you’re modeling parquet files in Delta Lake or tables in Postgres; if you can see the queries that will be run, or mock them up yourself, the access patterns become quite clear.
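The mocked-up queries aren’t reproduced in this text; a hedged sketch of the sort of query being described (a hypothetical sales table) might be:

>> example.sql
SELECT product_id, SUM(sale_amount) AS total_sales
FROM sales
WHERE year = 2023 AND month = 6 AND product_id IN (1, 2, 3)
GROUP BY product_id;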
In the above example, we can see that our indexes or partitions will most likely break down along
date parts like year and month, as well as product.
Normal Forms.
Many folks’ eyes glaze over when the topic of normal forms comes up. I am one of those people, but it’s important to take away some of the basics of normal form design when data modeling.
• De-duplication
• Join integrity
This topic was more important in the days of the classic relational data warehouse on SQL Server
or Oracle but has become less so today. You can loosely follow the levels of data normalization or
de-duplication and benefit from much of what they provide in a file-based Data Lake.
What are some of those old database normalization techniques that can be applied to even new Data
Lakes in the cloud? There are a few timeless classic database normalizations that should be followed.
De-Duplication of Data.
Duplicate data creates many problems in a data sink, some of them not as obvious as you think. I’m not talking about the duplicate data records we discussed earlier; this is a different problem. I’m referring to data definitions that exist in more than one place.
Let’s take an example of a database that holds information about specific products. What happens
if we make the easy mistake of having product descriptions in more than one spot?
This would be easy to miss. We might have a product table and an order table that both hold the product_id and its description. What’s the problem with this?
The product description changes; what do we do now? If we didn’t follow proper data normalization, we could have product descriptions in several if not many tables, and now we have inconsistent data. The process of updating product descriptions becomes a very painful and expensive operation.
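As a hedged sketch of the normalized alternative (hypothetical tables), the description lives in exactly one place and other tables only reference the product_id:

>> example.sql
CREATE TABLE products
(product_id INT NOT NULL,
product_description VARCHAR(250)   -- the only copy of the description
);

CREATE TABLE orders
(order_id BIGINT NOT NULL,
product_id INT NOT NULL,           -- reference the product, don't repeat its description
quantity INT
);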
As you can see, one of the main tenets of normalization, the removal of redundant data, is integral to data models of all kinds.
Join Integrity.
Probably one of the most complex and yet most important parts of data modeling is understanding how different data sets relate to each other, in other words, the joins.
Many times there can be many joins between many tables, giving all sorts of results. Part of normalizing your data is controlling how these joins occur and the type of results they will give back.
Will joining two datasets cause one dataset to duplicate many rows? Will joining another table cause records to drop? In real terms, can someone assume that every single record in your orders table that has a customer column will match one and only one record in the joined customer table?
Think closely about your data model and how these joins will affect everyone and every query
downstream.
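As a sketch of what can go wrong (the tables and keys here are made up), a duplicate key on the joined side silently multiplies rows in the result.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = spark.createDataFrame(
    [(100, "cust_1"), (101, "cust_2")],
    ["order_id", "customer_id"],
)

# customer_id should be unique here, but cust_1 appears twice.
customers = spark.createDataFrame(
    [("cust_1", "Bill"), ("cust_1", "Billy"), ("cust_2", "Bob")],
    ["customer_id", "name"],
)

joined = orders.join(customers, on="customer_id", how="inner")
# Order 100 now appears twice: 3 rows instead of the 2 someone downstream expects.
print(joined.count())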
• SQL models: many dimensions, with normalization and de-duplication taken to the extreme.
• File-based models: fewer dimensions, normalization in moderation.
• SQL database models: table size doesn't matter much, small or big.
• File-based data models: table size and file size matter a great deal.
• SQL database models are centered around indexing and indexes.
• File-based data models are centered around partitions and partitioning.
Relational SQL databases are blazing fast, and joining many small tables is no big deal. That's typically
why the classic Data Warehouse running on a SQL database had a look-up code and dimension for
everything. Data deduplication and normalization were taken to the extreme in these data models
… because they fit the technology.
Queries that join a fact table to many small dimension and look-up tables are quite common in the classic data warehouse world.
Partitions vs Indexes.
Another major difference between data models for the new Data Lakes and the old Data Warehouses is partitioning vs indexes. File-based Data Lakes are very much designed around the partitions applied to the data, typically at minimum some datetime series, and usually extending to other data attributes, like a customer, for example.
Because of the size of the data in Data Lake file systems, it's an absolute necessity that the data be "broken up," aka partitioned, according to one or more attributes in the data table. Otherwise, data of that size cannot be queried and accessed in a reasonable timeframe. Think of full scans running over every single file of a few-hundred-terabyte data source; of course that is not acceptable.
SQL relational databases on the other hand rely on primary and secondary keys to be able to seek,
join, and filter the datasets quickly. The data models of classic relational databases are typically
normalized, leading to many smaller tables with one or two large fact tables, typically designed
around the primary key, or uniqueness value of a data set.
A typical Spark SQL query on a Data Lake might look as follows.
SELECT fact.*
FROM fact.{s3://location}
INNER JOIN dim1.{s3://location}
    ON fact.dim1_key = dim1.key  -- join key shown for illustration
WHERE fact.date BETWEEN x AND y
  AND fact.partition_key = x
  AND dim1.partition_key = y
Chapter 9 - Data Quality
• Business Quality
• Data Value Quality
This chapter will be shorter than others. We will cover just some high-level basics about what to
look for and what you can try to achieve when implementing data quality yourself. This is a new
area of data engineering that has come into focus, so there is still a lot to learn and very few reliable
tools available.
Most Data Quality tools today are built and implemented in-house at organizations big and small. So, really, what we need is an understanding of what Data Quality is, how we can measure most data quality, and what some different implementations might look like.
Compliance:
Data quality is essential for regulatory compliance. Organizations must be able to demonstrate that
they have taken appropriate measures to ensure the accuracy and completeness of their data.
Efficiency:
Poor data quality can lead to wasted time and resources spent on cleaning and correcting data.
Ensuring good data quality can streamline processes and improve efficiency.
Cost Savings:
Good data quality can result in cost savings. It can help organizations identify areas for improvement,
reduce errors and redundancies, and avoid the costs of poor decision-making.
Competitive Advantage:
High-quality data can give organizations a competitive advantage by providing insights that others
may not have. Good data quality can help organizations make better decisions, develop better
products and services, and improve customer experiences.
Data quality is essential for making informed decisions and achieving business goals. It is important
for organizations to invest in data quality initiatives to ensure that their data is accurate, complete,
consistent, and timely.
But we can probably reason about some common Data Quality issues that come up over and over again, regardless of data source and type. Data Quality can probably be split into two main headings: the "business" quality standards and the "data value" qualities of each data source and sink. Let's dive into each of these a little more.
Data Quality is also extremely important from an end-user and business point of view. It's typically poor data quality that users of our data notice first, and it immediately reduces their trust in our data products. Once that trust is lost, it's very hard to gain back.
Good Data Quality can ensure that trust and long-term use of the data and pipelines engineers work
so hard on every day.
Business Quality
When you think about data quality, you might rightly think about a column that should not be NULL. But such simple constraints can easily be built into most data stores. And yes, these simple constraints are core to data quality, but they are usually not the problems that are difficult to solve.
What is at the core of data quality?
Example
To make this more concrete, consider what the business might care about for this example. Let’s say
we have a system that produces scores for each of our clients on a daily basis.
>> example.csv
client_id | scores
1234      | 15
4567      | 25
7899      | 0
Now maybe the engineering team put in checks to make sure no client gets a NULL score, because the business told us everyone should always have a score every day. But what happens when, in a panic, some Product Manager calls us up saying our most important client got a score of 0 today when they had a 20 yesterday?
Maybe the business logic requires not only that every client has a score every day, but also that the score has to be in the range of 5-30. We can see from our example that our data is of suspect quality from the business perspective, even though any engineer reviewing the data might see no NULL values and consider the data to be complete and accurate.
This difference is a very important concept to grasp.
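A minimal sketch of what an automated check for that business rule could look like, assuming a pandas-based pipeline; the 5-30 bounds come from the scenario above, and the DataFrame simply mirrors the example data.

import pandas as pd

# The same data as example.csv above, loaded into a DataFrame.
df = pd.DataFrame({"client_id": [1234, 4567, 7899], "scores": [15, 25, 0]})

# The business rule: every score must fall between 5 and 30.
out_of_range = df[(df["scores"] < 5) | (df["scores"] > 30)]
if not out_of_range.empty:
    raise ValueError(f"{len(out_of_range)} client(s) have out-of-range scores")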
Double meanings.
It’s also imperative in our quest for data quality to identify and deal with all kinds of misunderstand-
ings. Missunderstandings can come from many sources, it could be bad field names, data points that
are too similar, or just ambiguity.
Many times businesss’ struggle with data quality simply because there are no concrete quality
definitions and rules surrounding the data. Just like in life, in data, many people may refer to the
same data, but be talking about different things, or believe different things about that data.
It's a simple problem, but it shows the importance of business data quality and how we reason about the business values contained in each data point. It is an often overlooked part of data quality, and it leads to poor data quality and a user experience that casts a shadow over the reliability of our data.
If you have a description column, there is a good chance that the values will range all over the board, a mix of numbers and characters of varying lengths. On the other hand, you might have an amount column that holds dollar values. Some obvious constraints are that it should be a positive decimal number with two decimal places of precision.
When you break data values down, many times there is some sort of expected range of values for
each column or data point. Maybe all the values should be one from a list of strings like YES and
NO. It could be that it’s a numerical value that should always hold an integer in the range of 1 to
1000. Whatever the case is, as your data grows in size and complexity it becomes a very serious and
complex business to know what is happening over time to all those data points.
All of the above measures are core to ensuring long-term data quality, and the best part is they are all fairly easy to approach and check with code in an automated fashion. These factors should be checked on each data set before it's ingested, avoiding costly pipeline downtime and preventing bad data from being propagated downstream, where it causes trust and reliability issues.
There are a few common mistakes, and corresponding data quality checks that can be put in place, that save a lot of problems later on.
Mixing data types in a column is a very common problem in data engineering, and creates numerous
data quality issues later on.
name | amount
Bill | 5.55
Bob  | $4.99
This is a data quality issue that will break anything trying to do math or aggregations on the amount column, for example. It's a simple example, but extremely common; the lesson is to be firm about data types, because applications and code often rely on assumptions about the data.
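One cheap way to catch mixed types, sketched here with pandas; the column names follow the example above and the data is made up.

import pandas as pd

df = pd.DataFrame({"name": ["Bill", "Bob"], "amount": ["5.55", "$4.99"]})

# Anything that won't parse as a number becomes NaN, so we can flag it.
parsed = pd.to_numeric(df["amount"], errors="coerce")
bad_rows = df[parsed.isna()]
if not bad_rows.empty:
    raise ValueError(f"Non-numeric values found in amount: {bad_rows['amount'].tolist()}")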
For example, we know that logic based on unspoken engineering and business principles tells us we should always have an integer value associated with an order. But did anyone ever discuss whether 0 is an acceptable order quantity? What does it mean if someone has a 0 value? Does that mean there is a system bug, or is it a valid order?
These kinds of data-range and value-integrity questions seem easy on the surface, but once you review every data column you might be dealing with, things can get complicated quickly.
Null check: This check verifies that the values in a column are not NULL. It can be performed using the expect_column_values_to_not_be_null() function:

import great_expectations as ge

df = ge.read_csv('my_data.csv')
null_check = df.expect_column_values_to_not_be_null('column_name')
print(null_check)
Data type check: This check verifies whether the data type of a column matches an expected data type. It can be performed using the expect_column_values_to_be_of_type() function:
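A minimal sketch reusing the df created from ge.read_csv above; the column name and the 'int64' type string are placeholders.

# Verify the column holds the expected type; 'int64' is only an example.
type_check = df.expect_column_values_to_be_of_type('column_name', 'int64')
print(type_check)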
Range check: This check verifies whether the values in a column fall within a specified range. It can
be performed using the expect_column_values_to_be_between() function:
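A minimal sketch in the same style as the null check above, reusing the same df; the column name and the 5-30 bounds are placeholders.

# Verify every value falls inside the expected range.
range_check = df.expect_column_values_to_be_between('column_name', min_value=5, max_value=30)
print(range_check)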
Uniqueness check: This check verifies whether the values in a column are unique. It can be performed
using the expect_column_values_to_be_unique() function:
unique_check = df.expect_column_values_to_be_unique('column_name')
print(unique_check)
Pattern check: This check verifies whether the values in a column match a specified pattern or regular
expression. It can be performed using the expect_column_values_to_match_regex() function:
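A minimal sketch reusing the same df; the column name and the regex (a four-digit code) are placeholders.

# Verify every value matches the expected pattern.
pattern_check = df.expect_column_values_to_match_regex('column_name', r'^\d{4}$')
print(pattern_check)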
Completeness check: This check verifies whether a dataset contains all expected columns and whether the columns have non-null values. It can be performed using the expect_table_columns_to_match_ordered_list() and expect_column_values_to_not_be_null() functions, respectively:
columns_check = df.expect_table_columns_to_match_ordered_list(
    ['column_name_1', 'column_name_2', 'column_name_3'])
null_check = df.expect_column_values_to_not_be_null('column_name')
print(columns_check)
print(null_check)
Great Expectations provides a wide range of data quality checks, and also supports custom data
quality checks, allowing users to create their own checks based on their specific requirements.
Chapter 10 - DevOps for Data Engineers
DevOps for Data Engineers covers a lot of topics and technologies. Depending on the size of the company a Data Engineer works in, they may or may not get very involved in DevOps and CI/CD. Many companies have a team of Platform Engineers, and others, whose job it is to make development simple and easy for the rest of the engineers.
DevOps is a methodology or culture that aims to improve collaboration and communication
between software development and operations teams in order to increase efficiency and reduce
time-to-market for software products. The term “DevOps” is a combination of “development” and
“operations”.
Traditionally, development and operations teams have worked separately, often leading to conflicts
and delays in the software development life cycle. DevOps seeks to bridge this gap by creating a
collaborative and iterative process that involves all stakeholders in the software development process,
from planning and development to testing and deployment.
DevOps is based on a set of principles that include continuous integration, continuous delivery,
and continuous deployment. Continuous integration refers to the practice of regularly merging
code changes into a shared repository, while continuous delivery ensures that code changes are
automatically built, tested, and deployed to production. Continuous deployment takes this one step
further by automatically deploying changes to production without manual intervention.
DevOps also emphasizes automation and monitoring, using tools such as configuration management
and infrastructure as code to automate the deployment and management of software applications.
Monitoring tools are used to track performance and identify issues, allowing for rapid response and
continuous improvement.
Overall, DevOps is a culture of collaboration and continuous improvement that seeks to break down
silos and streamline the software development process. It helps organizations deliver software faster,
with higher quality, and with better alignment to business objectives.
But, many companies on the smaller side don’t have the resources to have a whole group of people
working in DevOps and CI/CD. What I want to do in this chapter is give a quick run-through of
some of the most important DevOps topics that will give you the biggest bang for your educational
bucks.
Dockerfiles level the playing field by pre-packing all the OS and system requirements and dependencies into a single easy-to-use source.
Docker has become an important tool for DevOps teams for several reasons:
Portability:
Docker containers are self-contained and can be deployed on any machine, regardless of the
underlying operating system. This makes it easy to move applications between development, testing,
and production environments.
Consistency:
Docker ensures that the application and its dependencies are packaged together and run consistently
across different environments. This eliminates the “works on my machine” problem and ensures that
the application behaves the same way in all environments.
Scalability:
Docker makes it easy to scale applications horizontally by adding more containers. This allows
applications to handle increased traffic or load without requiring changes to the underlying
infrastructure.
Versioning:
Docker allows developers to version their applications and dependencies, making it easy to roll back
to a previous version if there are issues with the current version.
Collaboration:
Docker makes it easy to share and collaborate on applications and dependencies. Developers can
share Docker images through registries, allowing other team members to quickly deploy and test
the application.
Overall, Docker plays a crucial role in DevOps by enabling developers to package and deploy appli-
cations quickly and consistently, regardless of the underlying infrastructure. This helps to streamline
the development process, increase efficiency, and improve collaboration between development and
operations teams.
The other option is to build your Dockerfile, based on whatever OS you want, with whatever
packages and tools you need, even layered on top of some Dockerfile from someone else.
Let’s take the example of someone who builds pipelines that run in AWS on Linux-based images.
You want a good development base that is as close as possible or exactly like production correct? So
you build a Dockerfile that has say Python and Spark-based on Linux with the aws cli installed.
FROM ubuntu:18.04

RUN apt-get update && \
    apt-get install -y default-jdk scala wget vim software-properties-common python3.8 python3-pip curl unzip libpq-dev build-essential libssl-dev libffi-dev

RUN wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz && \
    tar xvf spark-3.0.1-bin-hadoop3.2.tgz && \
    mv spark-3.0.1-bin-hadoop3.2/ /spark && \
    ln -s /spark spark

RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
    unzip awscliv2.zip && \
    ./aws/install

WORKDIR code
COPY . /code

RUN pip3 install -r requirements.txt

ENV MY_CODE=./code
It’s just an example but you get the point, defining a complex set of tools that won’t easily be
broken that all developers and users of the pipeline can use is a very simple and powerful way to
make development, testing, and code usage easy for all.
Usually, a Dockerfile written like this stored with the code can be built using a simple command.
Docker-compose.
Also, make sure to read up on docker-compose. It's a great way to automate running tests and bits of code. Docker-compose is simply a way to have multiple services and images all work and talk together, a great tool for more complex projects.
Dockerfile summary.
Dockerfiles are far from rocket science; they are probably one of the easiest things to learn, even as a new developer. Like anything else, they can get complicated when running multiple services, but basic usage of a Dockerfile will give you 80% of what you need up front.
I also believe Dockerfiles in general force a more rigorous development structure that is missing from a lot of data engineering code bases. Where you find Dockerfiles, you are more likely to find unit tests, documentation, requirements files, and generally better design patterns.
Unit Testing.
There are few things in life worse than cracking open some serious pipeline code, realizing there isn't a single function written to encapsulate the logic, and wondering if some change you are about to make will bring down the whole pipeline.
Why unit testing?
• increases learning
• leads to better quality code
• reduces bugs
• reduces long-term development time
• increases developer confidence
• can be automated into DevOps and CI/CD
Increase learning.
When you are new to a codebase, you don't know what you don't know; you have no backstory and are usually flying by the seat of your pants in the beginning. When you have no unit tests, usually the only other way to test changes to a pipeline is to run it, and that is sometimes easier said than done in a development environment. The first line of defense should be unit testing the entire pipeline.
Also, simply reading the unit tests can be a great way to start learning a codebase before trying to read the whole thing.
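Picture something like this made-up sketch, where the read, transform, and write logic all run inline at the top level of the script; the bucket paths and column names are only for illustration.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Everything happens inline: read, filter, aggregate, write.
df = spark.read.parquet("s3://some-bucket/input/")
df = df.where(F.col("status") == "active")
df = df.groupBy("customer_id").agg(F.sum("amount").alias("total"))
df.write.mode("overwrite").parquet("s3://some-bucket/output/")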
Now imagine a pipeline with line after line of this type of code, with not a single piece of logic encapsulated in a method or function. This is way more common than people would think; I would venture a guess that 80% of pipelines are written this way.
Instead, if we want testable code, we would do something like the following.
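The same made-up transform, wrapped in a function that a test can import and call with a small DataFrame.

import pyspark.sql.functions as F
from pyspark.sql import DataFrame

def total_active_customers(input_df: DataFrame) -> DataFrame:
    # All the transform logic lives here, so a test can pass in a tiny DataFrame.
    return (
        input_df.where(F.col("status") == "active")
        .groupBy("customer_id")
        .agg(F.sum("amount").alias("total"))
    )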
Now, this is code that can be called in a unit test and is reusable, making the code cleaner and more modular.
Reduce bugs.
An important part of DevOps is automating processes, and that reduces the number of bugs that make it into production. Automated unit tests are probably the number one way we can use DevOps to reduce the chances of bugs getting out.
By unit testing our code and automating the running of those tests, we reduce our risk of errors and protect developers from making human mistakes.
You might be new to testing in general, so I want to take this chance to do a deep dive example in
unit testing an imaginary PySpark pipeline.
Obviously, we need a test directory, a file to hold our unit tests test_pipeline.py, and a file specific
to pytest called conftest.py.
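The conftest.py might contain something like this minimal sketch; the fixture name spark and the local[*] master are assumptions.

# conftest.py
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # One local SparkSession shared across all tests in the session.
    session = (
        SparkSession.builder
        .master("local[*]")
        .appName("pipeline-unit-tests")
        .getOrCreate()
    )
    yield session
    session.stop()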
Now we have a fixture that will provide a Spark Session for each of our unit tests, so we can test
our code on actual Spark dataframes.
import pyspark.sql.functions as F
from pyspark.sql import DataFrame

def sample_transform(input_df: DataFrame) -> DataFrame:
    inter_df = input_df.where(
        input_df['that_column'] == F.lit('hobbit')
    ).groupBy('another_column').agg(F.sum('yet_another').alias('new_column'))
    output_df = inter_df.select(
        'another_column',
        'new_column',
        F.when(F.col('new_column') > 10, 'yes').otherwise('no').alias('indicator')
    ).where(F.col('indicator') == F.lit('yes'))
    return output_df
Now we actually have a function to unit test. So let's write the test; we will crack open our test file…
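A sketch of what test_pipeline.py might look like; the module name pipeline and the sample rows are assumptions chosen to line up with the output shown below.

# test_pipeline.py
from pipeline import sample_transform  # assumed module name holding sample_transform

def test_sample_transform(spark):
    # Small hand-built input: two 'hobbit' rows that should sum to 70, one row to filter out.
    input_df = spark.createDataFrame(
        [
            ("hobbit", "Billbo", 30),
            ("hobbit", "Billbo", 40),
            ("orc", "Azog", 500),
        ],
        ["that_column", "another_column", "yet_another"],
    )
    output_df = sample_transform(input_df)
    results = output_df.collect()
    assert len(results) == 1
    assert results[0]["another_column"] == "Billbo"
    assert results[0]["new_column"] == 70
    assert results[0]["indicator"] == "yes"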
If we run our function sample_transform against our sample dataframe the following is the output.
+--------------+----------+---------+
|another_column|new_column|indicator|
+--------------+----------+---------+
|        Billbo|        70|      yes|
+--------------+----------+---------+
This is what we are trying to validate: that our filter and switching logic are done correctly and we get the expected outcome.
To run these tests anywhere with one command, we need two more files:
• Dockerfile
• Docker compose file
The Dockerfile doesn't need to be rocket science, a little Ubuntu, Java, Python, Spark … Using the file below, run docker build --tag spark-test .
FROM ubuntu:18.04

RUN apt-get update && \
    apt-get install -y default-jdk scala wget vim software-properties-common python3.8 python3-pip curl unzip libpq-dev build-essential libssl-dev libffi-dev python3-dev && \
    apt-get clean

RUN wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz && \
    tar xvf spark-3.0.1-bin-hadoop3.2.tgz && \
    mv spark-3.0.1-bin-hadoop3.2/ /usr/local/spark && \
    ln -s /usr/local/spark spark

WORKDIR app
COPY . /app
RUN pip3 install cython==0.29.21 numpy==1.18.5 && pip3 install pytest pyspark pandas==1.0.5
ENV PYSPARK_PYTHON=python3
version: "3.9"
services:
  test:
    environment:
      - PYTHONPATH=./src
    image: "spark-test"
    volumes:
      - .:/app
    command: python3 -m pytest
Now if you build your Dockerfile in your project, you should be able to just run docker-compose up test and there you have it, your unit tests for PySpark running with one easy command.
Unit testing your pipeline code really isn't that hard, and it saves a lot of time later down the road. Being able to make changes to code and have some idea of whether you're breaking anything, without running the entire pipeline, is kinda nice. Data Engineers have gotten a bad rap over the years because of an unwillingness to incorporate basic software engineering principles like unit tests.
Obviously, from above it doesn’t take much effort to add unit tests for your Spark code. Encapsulate
your logic, write a few docker files, and you are off to the races.
CI/CD.
You may not like it, but CI/CD is a major part of DevOps and has become critical to good Data Engineering pipelines. Let's face it, we all know we do too many things manually, and it gets even harder when we have multiple people working in the same repositories.
At some point you cannot rely on some person, no matter how smart, to manually sync code or build a new Docker image. This is where CI/CD comes into play.
Now there is a myriad of tools that will probably continue to grow and expand: Jenkins, GitLab/GitHub CI/CD runners, etc. Many of these tools are built around taking actions when something happens in a code repository. This is something that all Data Engineers need to learn.