
Introduction to Data Engineering

Learn the skills needed to break into Data Engineering.

Daniel Beach
This book is for sale at http://leanpub.com/dataengineeringwithpython

This version was published on 2023-03-26

This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and
many iterations to get reader feedback, pivot until you have the right book and build traction once
you do.

© 2019 - 2023 Daniel Beach


Contents

Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Knowledge and Experience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
What are the topics we will cover? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

Chapter 1 - The Theory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5


What Is a Data Pipeline? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Data Pipelines built with Passion and Creativity . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Storage and File Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Repeatable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Resilient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
Scalable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
In Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

Chapter 2 - Data Pipeline Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22


Project Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Data Pipeline Code Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
Code Readability and Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Tests. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Containerization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
Architecture First . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

Chapter 3 - Pipeline Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41


Architecture Applied to Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Data Size and Velocity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
Calculating Compute Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Calculating Storage Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Understanding the End Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Understanding Cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Code Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Batch vs Streaming Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Puzzle Pieces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

Chapter 4 - Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Access Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
SQL/NoSQL Databases vs Files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
File Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Row vs Columnar Storage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Common file types in data engineering. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Parquet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Avro. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Orc. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
CSV / Flat-file. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
JSON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Compression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
Storage location. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Partitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

Chapter 5 - Compute and Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83


Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
RAM/Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
CPU/Cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
Cluster/Nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

Chapter 6 - Mastering SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95


Introduction To SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Does the type of database matter? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
The fundamentals of SQL/Databases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
OLTP vs. OLAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Table design/layout. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Table Design in Real Life. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Understanding Indexing Basics. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
How to write fast/tune queries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
Where to look for common problems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
SQL Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
Python + SQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
SQL Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

Chapter 7 - Data Warehousing / Data Lakes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115


Data Warehouse vs Data Lake vs Lake House . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
Data Modeling in Data Warehouses, Data Lakes, and Lake Houses. . . . . . . . . . . . . . . 117
Facts and Dimensions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
Constraints and Schema. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
Data Types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
Column Names. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126


The Role of ID’s in a Data Warehouses or Data Lake. . . . . . . . . . . . . . . . . . . . . . . . 127
CDC / History Tracking. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

Chapter 8 - Data Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132


Data Types and Schema. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
Data Types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Data Size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
Constraints. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Data Definitions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Modeling Data Logically. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Logical data models lead to physical relationships. . . . . . . . . . . . . . . . . . . . . . . . . 138
Grain of Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
Uniqueness of Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
Access Patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Talking to the Business. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
Normal Forms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
De-Duplication of Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Join Integrity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Keys - Primary and Foreign. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
The Idea Behind Keys. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
Relational Databases (SQL) vs Data Lake (File Based) Modeling. . . . . . . . . . . . . . . . . 143
The number of Fact tables and Dimensions and normalization. . . . . . . . . . . . . . . . . . 143
File size and table size matter in the new File-Based Data Lakes. . . . . . . . . . . . . . . . . 144
Partitions vs Indexes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Walking the data model line between old and new. . . . . . . . . . . . . . . . . . . . . . . . . 145

Chapter 9 - Data Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146


What is Data Quality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
Reasoning about data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Double meanings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
Data value quality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Measures of Data Quality. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Correct Header or Column Names. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Correct File Formatting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Correct data types. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Values ranges and values integrity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Data Quality Applied . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152

Chapter 10 - DevOps for Data Engineers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154


DevOps applied to Data Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Dockerfiles and Docker-compose. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155


Unit Testing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
CI/CD. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
Automation is the name of the game. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
CI for Data Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
Introduction
This book focuses on the crucial task of moving data, specifically building data pipelines, and how to
become an exceptional Data Engineer. In today’s age of Business Intelligence, Data Science, Machine
Learning, and the ever-growing need for companies to gather as much data as possible, designing
data pipelines has become an invaluable skill.
Despite occasional layoffs in the tech industry, Data Engineers remain highly sought after, making it
a recession-resistant career choice. Data engineering is a unique combination of technical and non-
technical skills, setting it apart from traditional software engineering disciplines. This book provides
a comprehensive overview of the fundamental topics and highlights the essential skills necessary for
a successful Data Engineer.

What is a Data Engineer?


The role of a Data Engineer is constantly evolving with the ever-changing technology landscape, but
certain aspects remain constant. Essentially, a Data Engineer’s primary function is to facilitate the
movement of data, enable businesses to consume that data, and ultimately provide tangible value
back to the business.
Data Engineers facilitate the movement of data and enable businesses to consume that data,
providing value back to the business.

• Facilitate data movement.


• Enable data consumption.
• Provide tangible value back to the business.

Despite the increasing demand for skilled Data Engineers, finding qualified individuals for the role
remains a challenge. Acquiring the necessary skills as an individual can also be daunting, given the
lag in training and education offerings that cater to real-world Data Engineering knowledge. It can
be overwhelming for beginners to navigate the vast array of tools, programming languages, and
skills required to excel in this field.
This book aims to bridge this gap by providing comprehensive coverage of the essential topics in
Data Engineering. As someone who has experienced the challenges of starting out in this field, I
understand the difficulties that come with knowing where to start and which topics to prioritize.

What To Expect
In this book, I want to give you the skills and knowledge, especially the underlying theory, to write
beautiful, fast, scalable data pipelines. It’s impossible to teach everything and cover every topic, but
I at least want you to know what you should focus on. Hopefully, you will discover many topics that
you can dive into at your leisure. But I also want to be clear about what this book is not.
This book isn’t about how to write code.
Data pipelines are so different and varied in their structure, based on the technology stacks being
used, but most of the concepts are the same. Some people wrongly assume that they should learn
how to be a great coder; sure, especially in the beginning, that is helpful. But, as you grow in your
career, you will quickly realize that it’s other skills that enable you to be a good Data Engineer.

• Good data engineering is more than just writing good code.


• Knowledge and concepts first, then apply those learnings.
• To write code better, write code last.

What I don’t want to teach you is how to write code. You will see me using Python in my examples,
and that is just for the ease of code readability. I expect you are a smart and savvy person; you are
reading this book, after all.
The theory and ideas behind many data engineering topics are more important than how well you
write code, which comes with time and experience. Let’s take a preview of what will be coming in
the next chapters ahead as you move forward.

Chapters
Here are the chapters and topics you can expect to encounter.

• The Theory Of Data Engineering and Pipelines


• Data Pipeline Basics
• Pipeline Architecture
• Storage - Files
• Compute and Resources
• SQL and Databases
• Data Warehousing and Data Lakes
• Data Modeling
• Data Quality
• DevOps

In each chapter, we will try to introduce the topic at a high level, and help you understand why it
is important. Many times we will then dive into more specifics and examples to help cement those
ideas in.
I encourage you to read the chapters in the order they are given as well; there is a certain rhythm to
life and Data Engineering that will be better served if done in order.

The Focus of This Book


Rather than delving into the minutiae of coding for Data Engineering, this book will primarily focus
on theory. However, we will use code snippets as examples to enhance the explanation of certain
topics. Developing programming skills is a gradual process, and if you wish to refine your skills in
a particular language, taking courses or classes is advisable.
Often, the more challenging skills to acquire are those that are less obvious, as they
come with experience. When starting out, it can be difficult to identify what you do not know.
Therefore, throughout the book, I will incorporate code snippets and examples to clarify key points
and concepts.

• The book is about the underlying concepts and theories.


• Try to learn lessons before you learn them the hard way.
• Data Engineering is a journey, to fail is to succeed.

I’m going to give you the headstart you need to help you surpass all your contemporaries and learn
the skills that are central to becoming a successful data engineer. The best part is, you can do all
this with Python, in which most of our examples will be written, but the choice of language doesn’t
matter as much as the skill sets and thought processes.
Never let someone tell you X language is better than Y.
Focus on learning and growth, the tools are secondary to the underlying skillset.
I’ve personally built a successful career as a Senior Data Engineer, having never taken a Computer
Science class in my life, and used Python for 90%+ of my professional life. This proves that, like
many things in life, focusing on the basics will get you much farther in the long run than focusing
on semantics or tying yourself to a particular tool or language.

Knowledge and Experience


Building data pipelines requires a unique set of knowledge that crosses many disciplines and isn’t
easy to come by without specific experience. What makes it even harder is that many of the skills,
like data modeling, for example, are somewhat esoteric, half art and half science. The good news is
that I can help speed you down the path to success by giving you the 20,000-foot view of the topics
and problems you will encounter in the real world.

• Data Engineering covers a wide variety of topics and technology.


• Data Engineering is both an art and a science.
• Data Engineering is about combining real-world experience with the data landscape around
you.

I want to share those experiences, tips, and tricks in this book to jump-start you into building reliable,
scalable data pipelines.

What are the topics we will cover?


In this book, we will cover a range of topics related to data engineering. We will start with the
theory of data pipelines, followed by the fundamental skills and thought processes that are essential
for every data pipeline, regardless of complexity.
Next, we will delve into architecture and storage options, including file types that play a significant
role in data engineering. We will also explore compute resources, such as RAM and CPU, and how
to work with them effectively in the Cloud.
No data engineering book would be complete without a brief overview of SQL and relational
databases. We will also cover data warehousing and data lakes, including the storage layer, which
can range from SQL Server to Parquet files.
Data modeling is another crucial topic in this book, as it is half art and half science. We will discuss
how to create a data pipeline that provides value by building an effective data model.
Data quality is a less popular but vital topic that affects the longevity and usability of the data output
by engineers. We will provide an overview of data quality tools and techniques.
Finally, we will examine the role of DevOps-CI/CD in data pipelines, an often-overlooked area that
is crucial for success.
By the end of this book, you will have the necessary knowledge and skills to build robust data
pipelines like a seasoned data engineer. We will share tips and tricks that you can use for any data
pipeline project, making you an expert in the field. So, let’s dive in!
Chapter 1 - The Theory.
Starting a book on data engineering and data pipelines with a chapter on theory may seem
unconventional, but it’s crucial not to skip this section. Understanding the purpose and goals of
data pipelines will influence all decisions made throughout the process.
Just like any adventure, planning and foresight are necessary when delving into the world of
data engineering. Building a strong foundation in theory will prevent headaches down the line,
as foundational principles impact code and technology decisions.

• Building a solid foundation reduces downstream headaches.


• The foundational principles directly impact code and technology decisions.

Theory can be dry, and I will do my best to be to the point. Having a high-level understanding of
what we are trying to accomplish with our data pipelines should make writing them a little bit more
manageable.

What Is a Data Pipeline?


I believe it’s important to stop and ask simple questions sometimes; it’s easy to miss the forest
for the trees. What is a data pipeline?
A data pipeline is a series of data processing steps that move data from one place to another. The steps
in a data pipeline are typically automated, and the pipeline is designed to efficiently and reliably
move data from a variety of sources to a variety of destinations.
Data pipelines are used in a wide variety of applications, including:

• Data warehousing and analytics


• Machine learning and artificial intelligence
• Business intelligence
• Data visualization

Data pipelines typically consist of the following components:

• A source: This is the data that is being processed by the pipeline. The source can be a database,
a file system, or an application.
• A processing step: This is the step that transforms the data from the source into a format
that can be used by the destination. Processing steps can include filtering, cleaning, and
transforming data.
• A destination: This is the final location where the data is stored. The destination can be a
database, a file system, or an application.

Data pipelines are typically automated using a variety of tools and technologies, including:

• Data integration tools


• Data transformation tools
• Data visualization tools

Data pipelines are an essential part of the modern data landscape. They allow businesses to efficiently
and reliably move data from a variety of sources to a variety of destinations, where it can be used
for a variety of purposes.
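
To make those components concrete, here is a minimal sketch of a source, a processing step, and a destination written in plain Python. The file names and the "amount" column are hypothetical placeholders, and a real pipeline would do far more, but the overall shape is the same.

import csv

def extract(source_path: str) -> list:
    # Source: read raw rows from a CSV file.
    with open(source_path) as f:
        return list(csv.DictReader(f))

def transform(rows: list) -> list:
    # Processing step: keep only rows that actually have an amount value.
    return [row for row in rows if row.get("amount")]

def load(rows: list, destination_path: str) -> None:
    # Destination: write the cleaned rows to their final home.
    if not rows:
        return
    with open(destination_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

load(transform(extract("raw_orders.csv")), "clean_orders.csv")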

You might be tempted to think that it’s simply the movement of data, but it’s more complicated than
that. I would rather say that a good data pipeline is about …
“facilitating the movement, storage, and access to data in a repeatable, resilient, and scalable
manner.”
These are the fundamentals you should think about when it comes to data engineering and data
pipelines.

• Movement
• Storage
• Access
• Repeatable
• Resilient
• Scalable

A little more than you thought? Maybe you think I’m nitpicking and throwing out jargon? We will
dig into these 6 fundamentals shortly.
I would argue the difference in understanding a data pipeline as “just the movement of data” vs
“facilitating the movement, storage, and access to data in a repeatable, resilient, and scalable manner”
is the difference between an amateur, broken, and unworthy pipeline and one built by a professional
that will be in place for years.
Data Pipelines built with Passion and Creativity


Being passionate about what you are working on vs doing something just for a paycheck makes a large
difference; we all know this. When you were in middle school and your mother told you to clean your room or not go
out with your friends, I’m sure you “cleaned” it. But did you really clean it? Your heart wasn’t in it.

• Data engineering and pipelines are complex.


• The business relies heavily on functioning pipelines.
• Heart and passion for data go a long way.
• Creativity produces better results than rigidity.

Data pipelines built with passion and creativity are the ones that are most likely to succeed. When
you build a data pipeline with passion, you are more likely to be invested in its success. You are
also more likely to be creative in your approach to solving problems. This can lead to innovative
solutions that can help your business grow.
There are a few things you can do to build data pipelines with passion and creativity:

• Start with a clear goal in mind. What do you want to achieve with your data pipeline? Once
you know your goal, you can start to think about how you can achieve it.
• Be creative in your approach to solving problems. Don’t be afraid to try new things. The best
way to find innovative solutions is to experiment.
• Be passionate about your work. If you are passionate about what you do, it will show in your
work. Your passion will motivate you to work hard and to come up with creative solutions.

Building data pipelines with passion and creativity is essential for the success of your business. When
you are passionate about your work, you are more likely to be successful. You are also more likely
to be creative in your approach to solving problems. This can lead to innovative solutions that can
help your business grow.
All this can lead to data engineering jobs being very stressful and time-consuming. If you have
a passion for working with data, this will help offset the burdens that can come with the job.
Passion for data engineering should lead to excellence in the code we write and help us design more
creative long-term solutions.
Creativity is the spice in life, both in coding and leisure. Data engineering is a hard profession to
master, not thinking outside the box and being rigid will curtail your career and results. Complex
problems require creative solutions, some of the best data engineers come from non-software
backgrounds.
Truly good data pipelines are built with heart and love. You have to see it as more than just another
task that is going to make your boss happy. If we see the movement of data in a system as a simple,
meaningless, and overhead task, the obvious result is going to be lackluster and the codebase will
fail at a critical moment.

• Passion for data leads to better data engineering solutions.


• It’s hard work, and requires dedication to learning inside and outside of work.
• Once trust in data is lost, it’s impossible to replace.

Worse, it will compromise trust from business units and lead to incorrect results. Let’s examine our
theory of what a data pipeline is and how it will affect the code we will write later.

Data Engineering Standards


• Data Engineering is not “beneath” Software Engineering.
• Pipelines should conform to the same or higher expectations as normal SE code.

There can be an attitude in tech that Data Engineering is somehow beneath Software Engineering
and doesn’t ascribe to the same standards. This is a crucial mistake. Data engineers should hold
themselves and the complex pipelines they design to the same high level of standards.
There are a few key data engineering standards that are important to follow in order to build reliable
and efficient data pipelines. These standards include:
• Data quality: Data quality is essential for any data pipeline. Data should be clean, accurate,
and consistent. There are a number of tools and techniques that can be used to improve data
quality, such as data cleansing, data validation, and data profiling.
• Data scalability: Data pipelines should be able to handle large volumes of data. This means that
they should be able to scale up or down to meet the needs of the business. There are a number
of ways to improve data scalability, such as using distributed computing technologies, such as
Hadoop and Spark.
• Data reliability: Data pipelines should be reliable. This means that they should be able to run
without errors or failures. There are a number of ways to improve data reliability, such as using
fault-tolerant technologies, such as replication and load balancing.
• Data efficiency: Data pipelines should be efficient. This means that they should be able to
process data quickly and use resources efficiently. There are a number of ways to improve data
efficiency, such as using optimized algorithms and data structures.

Following these standards can help to build reliable and efficient data pipelines that can meet the
needs of the business.
Let’s dive into more of what this means and looks like day-to-day.

Movement
The movement of data is basic to the data pipeline. We are by definition picking up data from
somewhere and dropping it off at “home”, with some number of stops in between. Therefore we
can’t brush past the topic of data movement. In Data Engineering some movements will be complex
and some will be simple.
For example, reading a CSV and doing something with every row is about as basic as it gets. A very small
but simple data pipeline.

import csv

def open_csv_file(file_location: str) -> None:
    # Open the CSV and act on every row -- here we simply print it.
    with open(file_location) as f:
        csv_reader = csv.reader(f)
        for row in csv_reader:
            print(row)

Usually, the life of a Data Engineer isn’t so easy though.


There are a few obvious topics that come to mind when thinking about moving data from point A
to B. Some of the most common you are probably familiar with or have heard of …

• Streaming data (real-time).
• Batch processing.
• Micro batch processing (near-real-time).
• ETL / ELT

When data engineers think about the movement of data, one of the first thoughts should be … “Is
this data pipeline streaming, batch, or something in between?”

This type of basic approach to data movement is critical because all downstream actions are most
likely very different based upon the answer of streaming vs batch.
Don’t be fooled, the basic choice at the foundation of building data pipelines will have a huge impact
on how you design the system. Your code will take shape around the technology choice you make
here. It’s something you have to think about: “How will my data move from point A to B?”

• Technology choices affect code and architecture.

If you choose a messaging service like Pub/Sub, Kafka, Pulsar, etc, that code is going to look
completely different and have a different setup requirement than if you decide to push CSV files
to Parquet files with Spark. Those options couldn’t be more different.
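
As a rough sketch of the batch side of that choice, here is what pushing CSV files to Parquet with Spark might look like (the bucket paths here are made up). A streaming pipeline built on Kafka or Pub/Sub would look nothing like this, which is exactly the point.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv_to_parquet").getOrCreate()

# Read a folder of raw CSV files, treating the first line as a header.
raw = spark.read.option("header", True).csv("s3://my-bucket/raw/")

# Write the same data back out as Parquet for downstream analytics.
raw.write.mode("overwrite").parquet("s3://my-bucket/processed/")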

Complexity in Data Pipelines

When considering what the movement of the data should look like, remember complexity is a killer.
Do you already have numerous custom complex transformations that need to take place? Will adding
a complex data movement or storage option make the codebase nearly unapproachable? What are
the speed requirements of the project? Is some nominal data loss acceptable?

• Data Engineers should reduce complexity.


• Don’t over-engineer solutions.
Data pipelines are critical for efficiently processing and analyzing large amounts of data in modern
businesses. However, as data pipelines grow more complex, they can become challenging to manage
and maintain, which can result in higher costs, longer processing times, and lower data quality.

Reducing complexity in data pipelines is essential to ensure they remain efficient and effective.
Here are a few strategies that can help:
• Simplify the data pipeline architecture: Data pipelines can be simplified by reducing the
number of components, eliminating unnecessary steps, and streamlining the data flow.
• Adopt standardization: Establishing standardization across the data pipeline can reduce com-
plexity and make the process more manageable. This can include standardizing data formats,
naming conventions, and data processing protocols.
• Automate repetitive tasks: Automating routine tasks, such as data cleansing and transforma-
tion, can reduce manual errors and simplify the data pipeline. This can also improve the
accuracy and consistency of data processing.
• Implement modular design: Building modular components in the data pipeline can make
it easier to manage and maintain. Each module can be developed, tested, and deployed
independently, making the overall process more manageable.
• Leverage cloud computing: Cloud computing can provide scalable, flexible, and cost-effective
infrastructure for data pipelines. This can reduce complexity by eliminating the need to manage
physical hardware and software.
• Monitor and optimize: Regular monitoring and optimization of the data pipeline can help
identify bottlenecks, errors, and inefficiencies. This can help prevent issues before they arise
and improve overall performance.

By adopting these strategies, businesses can reduce complexity in their data pipelines, which can
lead to better data quality, faster processing times, and lower costs.
If you’re dealing with financial data, sure, a messaging system where it becomes hard to “lose” data
might be a good choice. If you’re dealing with terabytes of satellite imagery, missing one image
won’t have a meaningful effect on the outcome.
Sounds like a lot of jargon just for deciding the data movement stack you will choose for your
pipelines?
The point I’m trying to make is that your code is probably a pyramid of complexity. The foundational
decisions you make upfront will direct the type and scope of the code that is needed to implement
the rest of the project. Don’t take it lightly.
Again, the movement of data by code is fundamentally what Data Engineering is all about. We know
we have to move and act upon the data; it’s in the details of how to implement that data movement
where the Data Engineer provides value.

Storage and File Types


Don’t forget our definition of data pipelines. “Facilitating the movement, storage, and access to data
in a repeatable, resilient, and scalable manner.”
Very closely related to data movement that we just covered is storage, or data at rest. Obviously,
with the rise of cloud storage options like s3, gs, azure blob, it’s been much easier to ignore choices
of file and data storage. But, just because it’s easy doesn’t mean that data pipelines won’t depend in
many ways on the type of storage chosen.

• Storage affects both time and money.


• Data rest (file) storage affects technology and code access patterns.

The difference between a CSV file and a Parquet file is enormous (we will explore details later), especially
as data grows.
Again, many of the file storage choices can be hard to back out of once baked into the software
solution. Understanding the data that is flowing and how it is used will always give clues to the
storage option needed.

In data engineering, storage and file types are critical components that play a vital role in the overall
data processing and analysis pipeline. Here’s a brief explanation of these concepts:
Storage:
Storage refers to the physical or virtual location where data is stored. Data storage can be classified
into two categories: primary storage and secondary storage. Primary storage, also known as volatile
storage, is temporary storage used by the computer’s memory to hold data that is currently being
processed. On the other hand, secondary storage, also known as non-volatile storage, is long-term
storage that is used to hold data that is not currently being used. Examples of secondary storage
include hard drives, solid-state drives, cloud storage, and network-attached storage.
File Types:
File types refer to the different formats in which data can be stored. Each file type has a specific
structure, encoding, and extension that determines how the data is stored, processed, and accessed.
Here are some common file types used in data engineering:
CSV (Comma Separated Values): This is a text file format used to store tabular data, such as
spreadsheets. Each row of data is represented as a separate line in the file, and the values in each
row are separated by commas.
JSON (JavaScript Object Notation): This is a lightweight data interchange format that is easy to
read and write. JSON files store data as key-value pairs, and the data is represented in a hierarchical
format.
Parquet: This is a columnar storage file format designed to improve the efficiency of data processing.
Parquet files store data in columns instead of rows, which can improve query performance and
reduce storage costs.
Avro: This is a data serialization system that is used to exchange data between different systems.
Avro files are compact and support schema evolution, which allows changes to the data structure
over time without breaking compatibility.
ORC (Optimized Row Columnar): This is another columnar storage file format that is designed to
optimize performance and storage efficiency. ORC files use compression and encoding techniques
to reduce file size and improve query performance.
Choosing the right storage and file types for a particular use case is critical for efficient data
processing, storage, and analysis. Understanding the strengths and weaknesses of each option can
help data engineers make informed decisions that lead to better data outcomes.
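
As a small illustration of how different these formats feel in practice, here is a sketch that writes the same tiny, made-up dataset as both CSV and Parquet using pandas (the Parquet call assumes pyarrow or fastparquet is installed).

import pandas as pd

df = pd.DataFrame({"order_id": [1, 2, 3], "amount": [10.5, 20.0, 7.25]})

# CSV: row-oriented, human-readable text, no schema or compression by default.
df.to_csv("orders.csv", index=False)

# Parquet: columnar, compressed, and the schema travels with the file.
df.to_parquet("orders.parquet", index=False)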

Storage and Files affect Technology and Code

I’ve seen tens of millions of individual JSON files stored in cloud buckets, and then a year later
analytics need to be done. Of course, at this point, even a massive AWS EMR cluster was having
trouble reading tens of millions of individual JSON files to retrieve needed business analytics.
Obviously, for that project, someone chose JSON because it was easy, and the consequences were
large and required a sizable project to regain insights into the data that could have been avoided by
proper storage upfront.
Storage can be seen as just an inconsequential decision when building data pipelines, but this is
rarely the case once the project is popular and starts to scale.
Storage and file types have a significant impact on technology and code in data engineering. Here
are some ways in which storage and files affect technology and code:
Data processing and analysis:
Different storage and file types require different processing and analysis techniques. For example,
columnar storage file formats like Parquet and ORC require specialized processing techniques to
efficiently access and analyze the data. In contrast, text-based file formats like CSV and JSON require
different processing techniques. As a result, technology and code used in data engineering must be
tailored to the specific storage and file types used in a particular use case.
Scalability:
The choice of storage and file types can impact the scalability of a data pipeline. For example, storing
large amounts of data in a traditional relational database may not be practical due to scalability
limitations. In contrast, using a distributed storage system like Hadoop Distributed File System
(HDFS) can provide greater scalability and performance. Similarly, using columnar storage file
formats like Parquet can significantly reduce storage costs and improve query performance for large
datasets.
Data integration:
Integrating data from multiple sources requires compatibility between the storage and file types
used in each source. For example, if one source uses CSV files and another uses JSON files, the
data integration process may require additional code to transform the data into a compatible format.
Choosing storage and file types that are compatible with the majority of data sources can simplify
the data integration process and reduce the amount of code required.
Code efficiency:
The choice of storage and file types can impact the efficiency of code used in data engineering.
For example, using a columnar storage file format like Parquet can reduce the amount of data that
needs to be read from disk, which can improve the performance of data processing and analysis
code. Similarly, using a distributed storage system like HDFS can allow code to be parallelized across
multiple nodes, improving overall code efficiency.
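
To make that code-efficiency point concrete, here is a sketch (reusing the hypothetical orders files from the earlier example) that reads only two columns. With Parquet the engine can skip the unwanted columns on disk, while the CSV reader still has to parse every full line before discarding the extra fields.

import pandas as pd

# Parquet: only the requested columns are read from disk.
orders = pd.read_parquet("orders.parquet", columns=["order_id", "amount"])

# CSV: every line is still parsed; usecols only drops fields after the fact.
orders_csv = pd.read_csv("orders.csv", usecols=["order_id", "amount"])
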
In summary, storage and file types have a significant impact on technology and code in data
engineering. Choosing the right storage and file types for a particular use case can have a profound
impact on the scalability, efficiency, and effectiveness of data processing and analysis pipelines.

Access
Continuing on with our definition of data pipeline, we come to data access.
What does access have to do with data pipelines? Everything. The data output from a pipeline or
intermediate steps should be easily and quickly accessible. Why is this so important?
In data engineering, data access patterns refer to the different ways in which data is accessed,
processed, and analyzed. Understanding data access patterns is crucial for designing efficient and
effective data pipelines. Here are some common data access patterns in data engineering:
Batch processing:
In this data access pattern, data is processed in large batches at regular intervals, such as daily
or weekly. Batch processing is commonly used for analyzing large datasets that are stored in data
warehouses or data lakes. Batch processing is useful for data that is not time-sensitive, and the
analysis does not need to be performed in real-time.
Real-time processing:
In this data access pattern, data is processed as it is generated or received, in real-time. Real-time
processing is commonly used for applications that require immediate feedback or response, such
as fraud detection or recommendation engines. Real-time processing requires low latency and high
throughput to ensure that data is processed quickly and accurately.
Stream processing:
In this data access pattern, data is processed in real-time as it is generated, but the processing is
performed in small batches, rather than continuously. Stream processing is commonly used for
applications that require continuous data processing, such as log analysis or sensor data processing.
Stream processing requires high throughput and low latency to ensure that data is processed quickly
and accurately.
Interactive processing:
In this data access pattern, data is processed interactively, allowing users to query and analyze data
in real-time. Interactive processing is commonly used for data visualization, dashboarding, and ad
hoc analysis. Interactive processing requires low latency and high interactivity to ensure that users
can query and analyze data quickly and easily.
Near-line processing:
In this data access pattern, data is processed on-demand, but the processing is not performed in
real-time. Near-line processing is commonly used for applications that require moderate latency and
throughput, such as data archiving or backup. Near-line processing requires moderate latency and
throughput to ensure that data is processed in a timely and efficient manner.
Each data access pattern has its strengths and weaknesses, and choosing the right pattern depends on
the specific use case and requirements. Understanding data access patterns is critical for designing
efficient and effective data pipelines that can meet the needs of the business.
Well, what happens when the inevitable bug appears or business requirements change and you need to
troubleshoot? What happens when the pipeline has scaled to hundreds of gigabytes or even terabytes of data?
You will care about data access then.
Simply put, how do people or machines consume the data in question? Data is no good if it cannot
be used.
I’ve seen complex projects completed by some very smart people, with code that looks like it should
be admired. And those codebases solved very hard problems. But they failed in the end with no
adoption even within the same engineering group.
Why? The data access layer was impossible to understand and interact with. It made the whole
project useless. Don’t let your project or pipeline fall into that hole.

• Data is of no use without reasonable accessibility and usability.

When you design a data pipeline never forget about how the data will be accessed and explored. This
comes down to how and where the data is stored at rest, and then presented for use by consuming
users or applications.
Moral of the story? Choosing storage and file types is one of the most important topics, with
far-reaching consequences when designing data pipelines.

Repeatable
Repeatability is arguably one of the most important pieces of any data pipeline. Every pipeline
codebase must meet these criteria, it has to be repeatable. What good is the code if the author is the
only one who understands, can troubleshoot, or even worse, run the pipeline? This is more common
than you think.

• Anyone must be able to run a data pipeline.


• Anyone must be able to troubleshoot a pipeline.
• Pipelines break, they will have to be re-run.
• Pipelines should produce the same results (idempotent).

Some pipeline breaks, and everyone is running around like a chicken with their head cut off because that
“one person” is on vacation today, and no one else knows what to do.
Repeatability is a crucial topic in data pipelines because it ensures that the data pipeline can be run
consistently and reliably over time. In data engineering, repeatability refers to the ability to recreate
the same output from a given input, even if the pipeline is run multiple times. Here are some reasons
why repeatability is an essential topic in data pipelines:
Quality assurance:
Data pipelines are often used to generate business-critical reports, insights, and analytics. Ensuring
that these outputs are accurate and reliable is essential. Repeatability allows data engineers to verify
that the output of the data pipeline is consistent across different runs, ensuring that the data is
accurate and reliable.
Debugging:
Inevitably, data pipelines will encounter errors or bugs that need to be fixed. When this happens, it
is essential to be able to rerun the pipeline with the same inputs to reproduce the error and debug the
issue. Repeatability ensures that the data pipeline can be rerun with the same inputs to reproduce
errors consistently.
Auditability:
In regulated industries, data pipelines must be auditable to ensure compliance with legal and
regulatory requirements. Repeatability enables data engineers to provide an audit trail that shows
how the data pipeline was run, what inputs were used, and what outputs were generated.
Collaboration:
Collaborating on data pipelines requires a high level of coordination and communication between
team members. Repeatability enables team members to share code, data, and outputs confidently,
knowing that they will get the same results when they run the pipeline.
Ensuring repeatability in data pipelines involves several best practices, including version control
for code and data, containerization, and automated testing. By following these best practices, data
engineers can build robust and reliable data pipelines that provide consistent and accurate results
over time.

People and Assumptions Affect Pipelines

Engineers will many times design into their code assumptions about the world that their data lives in.
This makes the code hard for anyone else to run. Anyone should be able to clone the code repository, look at the
README, and, with a few lines at the console, kick off the pipeline.
Anything more inherently means the pipeline is not repeatable. It shouldn’t take more than one
button click or command line argument issued to re-start or run any data pipeline.
There should not be a myriad of configurations, flags, and files that must be staged in certain
ambiguous places for the data pipeline to run.
A good rule of thumb is this… if you can’t schedule a CRON job to run a single Python file with a
main() function and a few arguments, which are noted in the README, then the code is probably
not repeatable.
If no one but the person who wrote the code can run it… that is a problem. This is another important
subject and we will explore it more later.
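
Here is a minimal sketch of what that "single Python file with a main() function and a few arguments" might look like; the argument names and the pipeline body are placeholders, not a prescription.

import argparse

def run_pipeline(input_path: str, run_date: str) -> None:
    # Placeholder for the real extract / transform / load steps.
    print(f"Running pipeline for {run_date} against {input_path}")

def main() -> None:
    parser = argparse.ArgumentParser(description="Example repeatable pipeline")
    parser.add_argument("--input-path", required=True)
    parser.add_argument("--run-date", required=True, help="YYYY-MM-DD")
    args = parser.parse_args()
    run_pipeline(args.input_path, args.run_date)

if __name__ == "__main__":
    main()

Anyone, or a CRON schedule, can then run something like python pipeline.py --input-path raw/ --run-date 2023-01-01 and get the same result every time.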

Resilient
This might seem obvious to some, but to others, it isn’t something they’ve had experience with.
What do I mean when I say a data pipeline must be resilient? I mean the codebase must be written
in such a way that it isn’t fragile.
There should not be many instances of “hardcoded” values, like dates or numbers in the code. This
inevitably means that an engineer will have to make a code change just to run a pipeline for some
historic data. Are there no try/except blocks? That probably means the author didn’t think about the
pipeline in a resilient way.

• When data pipelines break every day, it signifies fundamental problems.


• Resilient pipelines require attention to code and architecture.

Becoming ready for the unknown is important in life and code!


A good data pipeline should be specific enough to get the job done efficiently, but not so rigid
that minor changes in requirements lead to codebase re-writes and refactors at every turn.
Having a resilient pipeline means thinking ahead, without over-engineering a solution. This can end
up being half art and half science but it’s an easy trap to fall into.
Assumptions like “I will assume this file always has records in it and is never empty” always make
a pipeline less resilient. Think outside the box and avoid baking them into your code.
A data pipeline can be made more resilient by implementing various strategies to handle errors,
failures, and unexpected events that can occur during its execution. Here are some ways to make a
data pipeline more resilient:
Monitoring and alerting:
Implementing monitoring and alerting mechanisms to identify errors and failures in the pipeline as
soon as they occur is crucial. These mechanisms can notify the relevant stakeholders so that they
can take immediate action to resolve the issue.
Automated retries:
Incorporating automatic retry mechanisms into the pipeline can help handle transient errors, such
as network disruptions, and prevent pipeline failures. This can involve configuring retry logic for
individual pipeline components or for the pipeline as a whole.
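
A minimal sketch of that retry idea, using nothing but the standard library, might look like the following; the attempt count and sleep times are arbitrary placeholders.

import time

def with_retries(step, attempts: int = 3, backoff_seconds: int = 5):
    # Run a pipeline step, retrying transient failures with a growing pause.
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as error:
            if attempt == attempts:
                raise
            print(f"Attempt {attempt} failed ({error}), retrying...")
            time.sleep(backoff_seconds * attempt)
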
Backup and recovery:
Creating backups of critical data and having a recovery plan in place can help ensure that data is not
lost in the event of a pipeline failure. This can involve replicating data across multiple data stores,
implementing data recovery procedures, and conducting regular backups.
Disaster recovery planning:
Planning for disasters such as data center outages, power failures, and natural disasters is essential
for ensuring pipeline resilience. This can involve implementing redundant infrastructure, having
backup power sources, and having a disaster recovery plan in place.
Version control:
Using version control for code and data can help track changes and revert to previous versions in
the event of pipeline failures or data corruption.
Testing:
Conducting thorough testing of the pipeline, including unit testing, integration testing, and end-to-
end testing, can help identify potential failure points and ensure that the pipeline is resilient to errors
and failures.
In summary, making a data pipeline more resilient involves implementing a combination of
strategies to handle errors, failures, and unexpected events that can occur during its execution. By
adopting these strategies, data engineers can ensure that the pipeline remains reliable and performs
optimally under various conditions.

Scalable
Last but not least, and probably the most important, scalability is the name of the game. A data
pipeline that isn’t scalable isn’t a pipeline at all,
it’s just a one-time script that is doomed from the beginning.
One of the most common mistakes and time wastes I’ve seen is someone taking a few weeks to
write a codebase to solve a problem, testing and testing, and releasing the code, only to realize
in production, when scaled up, that it either crashes or is so slow as to be useless.

• Data pipelines should be built to handle more data … and more.


• Scalability issues leading to poor performance are common problems.

You can rarely write a pipeline to deal with one piece of data and have it scale up without significant
changes. Scalability has to be one of the main tenets driving the creation of the pipeline from the
beginning.
When you’re thinking about a piece of data working its way through the pipeline, ask yourself, what if
this record has 10 million friends? Scalability affects file storage options, the processing framework,
and everything in between.
Data pipelines can be made more scalable by implementing strategies that allow them to handle an
increasing amount of data and processing tasks without compromising performance or reliability.
Here are some ways to make data pipelines more scalable:
Distributed processing:
Distributing processing tasks across multiple nodes or machines can help increase the pipeline’s
capacity to handle larger volumes of data. This can involve using technologies such as Apache Spark,
Apache Flink, or Hadoop to distribute processing tasks across a cluster of nodes.
Horizontal scaling:
Scaling the pipeline horizontally by adding more processing nodes can help increase its capacity to
handle more data and processing tasks. This can involve using cloud-based services such as Amazon
Web Services (AWS) or Microsoft Azure to add more compute resources to the pipeline as needed.
Partitioning:
Partitioning data into smaller chunks can help improve processing efficiency and enable parallel
processing. This can involve using techniques such as range partitioning or hash partitioning to
divide data into smaller chunks that can be processed in parallel.
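
As one small, hedged example of partitioning, pandas (backed by pyarrow) can write one folder of Parquet files per date so that downstream jobs read only the partitions they need; the column names here are invented.

import pandas as pd

events = pd.DataFrame({
    "event_date": ["2023-01-01", "2023-01-01", "2023-01-02"],
    "value": [1, 2, 3],
})

# One sub-folder of Parquet files per event_date, readable in parallel.
events.to_parquet("events/", partition_cols=["event_date"])
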
Caching:
Caching frequently accessed data in memory can help reduce the amount of data that needs to be
processed, improve processing efficiency, and reduce the load on data stores. This can involve using
in-memory caching technologies such as Redis or Memcached.
Load balancing:
Implementing load balancing mechanisms can help evenly distribute processing tasks across
multiple processing nodes, preventing overload and improving performance. This can involve using
load balancers or distributed queuing systems such as Apache Kafka or RabbitMQ.
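With a queue-based setup the balancing often comes almost for free: every consumer that joins the same group shares the work. A small sketch, assuming the kafka-python package, a broker at localhost:9092, and a made-up topic name:

>> consume_events.py
from kafka import KafkaConsumer  # assumes the kafka-python package

# Every worker started with the same group_id shares the topic's partitions,
# so adding more workers spreads the load automatically.
consumer = KafkaConsumer(
    "widget-events",                     # hypothetical topic name
    bootstrap_servers="localhost:9092",  # assumed broker address
    group_id="widget-workers",
)

for message in consumer:
    print(message.value)  # a real pipeline would hand the payload to a transform step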
Auto-scaling:
Implementing auto-scaling mechanisms can help the pipeline automatically scale up or down based
on changes in processing demands. This can involve using cloud-based services that provide auto-
scaling capabilities such as AWS Auto Scaling or Azure Autoscale.
In summary, making data pipelines more scalable involves implementing strategies that enable them
to handle increasing amounts of data and processing tasks without compromising performance or
reliability. By adopting these strategies, data engineers can ensure that the pipeline can scale to meet
changing business needs and processing demands.

In Summary
“Facilitating the movement, storage, and access to data in a repeatable, resilient, and scalable
manner.”
It’s easy to get caught up in the mundane and day-to-day data engineering work of writing the next
ETL script. But, I suggest you take a different approach, looking at problems and resolving to make
good decisions, not to rush design, and to think through problems.

• Movement
• Storage
• Access
• Repeatable
• Resilient
• Scalable

You should always consider the above 6 points when approaching a new problem or pipeline.
Grasping the importance of these theories and then putting them into practice will put you ahead of most engineers.
Reminding yourself daily that your pipelines need to move data efficiently, and that the file types you choose do matter, will save you headaches down the road. Ensuring every pipeline is easily repeatable and resilient from the start will come in handy when something breaks at night or on the weekend.

Of course, when the business starts to scale and your data needs to keep up, you won’t be left holding
the bag! It’s very tempting to get caught up in this and that new technology, chasing the shiny new
toy. It’s easy to say “been there, done that,” when it comes to the next pipeline.
I encourage you to remember these basic ideas we discussed and use them to drive all the decisions
you make when tackling the next problem.
Chapter 2 - Data Pipeline Basics
Finally, it’s time to dive into the inner workings of data pipelines, what we’ve all been waiting for!
Let’s start by talking about the basics of building data pipelines. There are a few tips and tricks that
will help you generalize all the pipeline projects you work on.
I consider these things basic 101 level requirements. Chances are if you don’t follow these approaches
or some variation of them, you’re going to struggle to produce reliable and repeatable pipelines that
you can be proud of.
A data pipeline is a system or process that moves data from one or multiple sources to one or
multiple destinations in a structured and automated way. The primary purpose of a data pipeline is to
transform, process, and integrate data to provide useful insights, make informed business decisions,
and drive data-driven actions.
Data pipeline basics typically involve the following components:
Data Sources:
These are the systems or applications from where the data originates. The sources can be a database,
file storage, web service, or any other source that can provide the necessary data.
Data Ingestion:
This involves the process of extracting the data from the sources and moving it to a target system
or storage. The data ingestion process can involve different techniques such as batch processing,
streaming, or event-driven processing.
Data Transformation:
This involves the process of modifying, enriching, or cleansing the data to ensure that it meets
the desired quality standards and is in a format that is usable by downstream applications.
Transformation can be performed using various techniques such as filtering, aggregation, mapping,
or data validation.
Data Storage:
This involves storing the data in a storage system that can efficiently handle large volumes of data
and provide fast access to it. The storage system can be a traditional database, data warehouse, data
lake, or any other storage system that is appropriate for the data volume and use case.
Data Processing:
This involves performing complex computations on the data to extract useful insights and generate
reports or dashboards. The data processing can be performed using various technologies such as
machine learning, data mining, or data analytics.
Data Delivery:
This involves delivering the data to the intended end-users or applications in a timely and efficient
manner. The delivery can be performed using various techniques such as APIs, messaging queues,
or web services.
In summary, data pipeline basics involve a set of components that work together to extract,
transform, process, and deliver data in a structured and automated way. By implementing a data
pipeline, organizations can ensure that their data is accurate, accessible, and can provide valuable
insights to drive business decisions.

The Best Pipelines Start with a Solid Foundation


I have no intention of this book teaching you how to write code; I will leave that for others. What I do want to teach you is how to write the best data pipelines possible using best practices. I will use Python in my examples, but these practices can be applied to any language.
It’s more about the exercise and thought process than it is about your language of choice. Don’t let people tell you that X or Y language is better because it’s “faster” or some other reason. The more you write code and work on software projects, the more you will realize it’s about the implementation of the code, not the language chosen.
I want to review these data engineering coding best practices that you will run into out in the real
world. Practices that will make your life, and the data engineers’ lives that come after you more
bearable.

• Project Structure.
• Testing.
• Documentation.
• Containerization.
• Architecture First.

These are just some basics that will assist us going forward, as they are the core of what will drive
the design and layout of all your data pipeline projects. I know you might be getting tired of the
theory, but this is where the theory starts to meet the reality of writing code to move data.
This isn’t a book about how to write code, so write code as you like, but adopting a few well-accepted
styles can have a positive impact on your success at delivering highly scalable and efficient pipelines.

Project Structure
When starting our data pipelines in Python, or any language, we should follow a clear and concise design pattern that shows any reader the flow of the data through the code. Just because something is complex doesn’t mean it’s acceptable to have our program and data flow hidden in obscurity.
The project structure of a data pipeline project typically involves several components and directories
that are organized in a logical and modular way. Here are some of the common components that
are included in a data pipeline project structure:
Configuration:
This directory contains configuration files that define the settings and parameters used by the data
pipeline, such as database connection strings, API keys, and environment variables.
Data:
This directory contains the data used by the data pipeline, including input data, intermediate data,
and output data.
Documentation:
This directory contains any relevant documentation related to the data pipeline, such as user
manuals, data dictionaries, and technical specifications.
Logs:
This directory contains logs generated by the data pipeline, including error logs, debug logs, and
performance logs.
Scripts:
This directory contains the scripts used by the data pipeline, including data ingestion scripts, data
transformation scripts, and data processing scripts.
Source:
This directory contains the source code of the data pipeline, including any custom libraries, modules,
and dependencies.
Tests:
This directory contains the tests used to verify the functionality of the data pipeline, including unit
tests, integration tests, and end-to-end tests.
In addition to these components, the project structure may also include a Makefile or a build script
that automates the process of building, testing, and deploying the data pipeline. The project structure
may also include a README file that provides a brief overview of the project, including its purpose,
functionality, and usage.
In summary, the project structure of a data pipeline project is designed to organize the various
components of the project in a logical and modular way, making it easier to manage, maintain, and
scale the data pipeline over time. By following a standardized project structure, data engineers can
ensure that their data pipeline projects are organized, consistent, and easy to understand by other
team members or stakeholders.

Be short and to the point when writing code, the data should flow through the code in an obvious
and human-readable way. That’s why we are using Python!
We will dive into DevOps later but anytime you start a new codebase/pipeline, or work on an old
one, you should ensure a few basics. Always start by creating the following resources.

• README.md - a few well-written sentences can save someone hours.
• Dockerfile - they are easy to use and level the playing field.
• Tests folder - as a gentle reminder that tests are important.
• Requirements.txt - project dependencies should be captured somewhere.
• Folder structure to divide the codebase logically.

Let’s see what a very basic project structure might look like.

It’s a contrived and simple example in the above picture, but it’s obvious the main.py is where we
should look for the code, test probably has our tests, there is a Dockerfile. More than anything we
are looking for a project to be clear.
Here is an example of what you don’t want to see when looking at a new pipeline codebase.

run_database.py
general_functions.py
data_file.csv
read_methods.py
configurations.json

You don’t know where to start, what to look at, what is important or what isn’t. It’s just messy and not obvious. Be clean and organized in your code as well as in the project structure itself.
What would a better project structure look like applied above?

README.md
Dockerfile
requirements.txt
utilities/general_functions.py
configs/configurations.json
sample_data/data_file.csv
database/run_database.py
read_methods.py

Not perfect but I’m sure you get the idea.

Build On a Solid Foundation

You might not think this has that much to do with data engineering pipelines, but it does. A solid
foundation is a key to the future success of any project. Taking the guessing game out of your
pipelines is one of the best things you can do.
There is nothing worse than going to a repository in GitHub and finding random files and folders
littered everywhere with no apparent consistency. Remember what your mom always said … “clean
your room!”

• You can’t have good pipelines without a solid base.
• You have to do the small parts well, to make the whole better.

Before writing the so-called guts of our codebase, large or small, it’s important to get a few things
right. I do understand that there are many different styles of writing code, but there is no excuse for
sloppy and dirty code and organization.
Pipeline code that is thrown together haphazardly is probably code that was also written poorly, not
tested, and will break often. Taking ownership and pride in a codebase is important.
Let’s dive a little deeper into structuring the actual codebase of a pipeline.

Data Pipeline Code Structure


If you’ve never written a data pipeline before, or are just starting, where do you even start? There
are so many steps to writing a complete data pipeline end-to-end, how do you accomplish the task?
One small step at a time, starting with the basics.
It all starts with main()
Every pipeline should start with a main.py file and a main() function.
This might sound picky but it’s essential to be clear from the beginning, and that starts with a
main.py file with a main() function that anyone can go to for a start. There is nothing worse than
opening a file with 20 functions and having no idea where to look.

• Always include a main() entrypoint for your data pipelines.

Or even worse than that, just a bunch of files all in the same directory, each with a different name,
and you have no idea where the entry point of the codebase is!
You should always start with something along these lines.

>> main.py
if __name__ == '__main__':
    main()

I mean call it entry_point() if you have to. Just give the obvious indication of where in the world
to start.

Good Coding Practices make Good Data Pipelines

I can’t stress this enough and I want to make a point here. This is a book about coding practices,
but you can’t be a successful Data Engineer without embracing a mindset of clean and concise code
structures that are easy to read and use.
Your data pipeline is only going to be as good as the code you write. If it’s hard to read and follow,
then it will leave a bad taste and impression, regardless of how good a job you think you did.

The Code Should Flow with the Data

Simple, but important. We should continue this pattern of obvious code and data flow. Your pipeline
code should follow the data flow in a sense. The methods and functions should be named to indicate
what they are doing and should be called inside your main() method in an obvious pattern.
In a data pipeline, the code should flow with the data to ensure that data is processed in a logical
and efficient manner. This means that the code should be organized and structured in a way that
follows the flow of data through the pipeline.
Here are some ways to ensure that the code flows with the data in a data pipeline:
Start with input data:
The code should begin by retrieving the input data from its source, whether it is a database, a file,
or an API. This input data should be processed and transformed in a series of steps that correspond
to the processing stages of the pipeline.
Use data structures that fit the data:
The code should use appropriate data structures that fit the type and format of the data being
processed. For example, if the data is in a tabular format, a DataFrame or a relational database
may be used.
Avoid data duplication:
The code should avoid duplicating data unnecessarily. Instead, it should process data in a sequential
manner, passing data from one step of the pipeline to the next without duplicating or storing the
same data multiple times.
Keep track of data flow:
The code should keep track of the data flow at each stage of the pipeline. This may involve logging
data or using a data lineage tool to track the flow of data from input to output.
End with output data:
The code should end by writing the output data to its destination, whether it is a database, a file, or
an API. The output data should be in the desired format and contain all the necessary information
for downstream processing or analysis.
By ensuring that the code flows with the data in a data pipeline, data engineers can develop pipelines
that are efficient, maintainable, and scalable. This can lead to better performance, faster development
time, and improved data quality.
As the data flows, the code flows, and vice versa.

Simply reading a main() function should tell you what’s happening to the data and where it is coming from and going to.

Example
Let’s say we have a very simple pipeline to read CSV files, get some metrics from the files, and store
those results into a SQL database.

• Read CSV files.
• Get metrics from files.
• Store results in SQL.

Don’t Focus on the Details Right Away

It can be tempting to start thinking about the details or writing code right away. We should shy away
from this practice. Let’s sketch our code the way we would sketch a drawing. It helps to understand
the work that lies ahead of us and will help keep us on the straight and narrow as we develop the
pipeline.

It doesn’t matter very much if you write Object Oriented Programming (OOP) or Functional
programming with lots of little methods. If you are writing ETL or data pipelines, your code should
show the flow of the data when read.
In the above example, we talked about a simple program to open CSV files, get metrics, and put those metrics into the database. The code should be framed in such a way that this is obvious at a cursory glance.

>> main.py
def main():
    calculate_workload()
    download_csv_files_in_parallel()
    stream_files_to_memory()
    calculate_metrics()
    save_metrics_to_database()

Why write code like this? Because it’s obvious how the data flows through the program. It also
gives a high-level understanding of what the pipeline is going to look like before every bit of code
is written.
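One way to make that flow even more explicit is to pass the data from step to step, so the reader can see what each function consumes and produces. A variation on the sketch above, with the function bodies still omitted:

>> main.py
def main():
    # Each step hands its output to the next, so the data flow reads top to bottom.
    workload = calculate_workload()
    csv_files = download_csv_files_in_parallel(workload)
    rows = stream_files_to_memory(csv_files)
    metrics = calculate_metrics(rows)
    save_metrics_to_database(metrics)

if __name__ == '__main__':
    main()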

• Data flows through a data pipeline.
• Human readable data pipelines increase productivity and reduce bugs.
• Don’t work on small details, do the big picture first.

This is a great advantage because it will help us think about how scalable, efficient, and extensible this pipeline will be. I can’t emphasize this enough. If you just started by working on calculate_workload() and not thinking about what comes after or before, you’re on the road to a life of hardship.
If you get caught up in “how am I going to read this data” and write the code straight away, you’re missing details about what comes after, details that could change how you write the code you’re working on!

Code Readability and Organization


Again, every main() function should show the flow of the data through the pipeline in an obvious fashion. Also, if the different steps you will be designing throughout the pipeline require 5 or more functions, you should consider breaking those out into a separate Python class or file if you’re writing only functions.
This makes your code more reusable, maintainable, and easier to debug. It gives context and doesn’t
overwhelm someone else, or you when debugging or feature addition happens.

Code readability and organization are crucial aspects of developing a high-quality and maintainable
data pipeline. Here are some best practices for improving code readability and organization in a data
pipeline:
Use descriptive names:
Use descriptive variable names, function names, and comments to make the code easier to
understand. This will also make it easier to maintain the code over time.
Follow coding standards:
Follow standard coding conventions and guidelines to ensure that the code is consistent and easy
to read. This includes using appropriate indentation, using whitespace effectively, and following
naming conventions.
Use modular code:
Break the code down into small, reusable functions that perform a specific task. This will make the
code easier to read and debug, and it will also make it easier to reuse code in other parts of the
pipeline.
Use version control:
Use a version control system to manage changes to the code over time. This will make it easier to
track changes, collaborate with other developers, and roll back changes if necessary.
Document the code:
Use comments and documentation to explain how the code works, what it does, and how it should
be used. This will make it easier for other developers to understand the code and contribute to the
project.
Use testing frameworks:
Use testing frameworks to test the code and ensure that it works as expected. This will help catch errors and bugs early on in the development process, making it easier to fix issues before they become bigger problems.
By following these best practices, data engineers can improve the readability and organization of
their code, making it easier to maintain and update the data pipeline over time. This will help ensure
that the data pipeline remains efficient, reliable, and scalable, even as data volumes and processing
requirements change over time.
So perhaps in the above example, streaming files into memory could be a few methods/functions.

>> stream_csv_files.py
import csv
from typing import Callable, Iterator

def open_csv_file(file_location: str) -> Iterator[list]:
    # Yield rows one at a time so large files never have to fit in memory.
    with open(file_location) as f:
        csv_reader = csv.reader(f)
        for row in csv_reader:
            yield row

def stream_csv_rows(rows: Iterator[list], transform_etl: Callable) -> Iterator[list]:
    # Apply the transformation lazily to each row as it streams through.
    for row in rows:
        transformed_row = transform_etl(row)
        yield transformed_row

This is a simple but powerful idea. Don’t fall into the bad habit of writing data pipelines all in a single file. It isn’t helpful and will most certainly leave you and others confused as the code grows; group your code into logical units.

• Break up functions into the smallest units possible.
• Write data transformations that are re-usable.
• Avoid long functions and methods.

You should not write functions that are 50 lines long. This means you’re not breaking up your code into small enough logical units.
Functions and methods that are working on data and contain too many lines of business or
transformation logic are going to be prone to error and impossible to maintain or troubleshoot.

Tests.
We will talk more about testing in the DevOps chapter, but this topic is so important it’s worth a
cursory overview now.
You need to have tests for every data pipeline. Why? Because if you want to be a cut above the rest, you will follow best practices every step of the way, even when you don’t have to.

• Tests are basic to data pipelines.
• Tests protect pipelines from bugs and errors.
• Tests increase development productivity.

Think about it, if you change code how are you supposed to know it works and that you have not
broken something you don’t know about, especially if you are new to a codebase?
Tests are an essential aspect of developing a high-quality and reliable data pipeline. Here are some
reasons why tests are important in a data pipeline:
Ensure correctness:
Tests can ensure that the data pipeline is functioning correctly and producing the expected results.
This is especially important when dealing with large volumes of data, complex transformations, and
multiple data sources.
Catch errors early:
Tests can catch errors early on in the development process, making it easier to fix issues before they
become bigger problems. This can save time and resources by reducing the amount of time spent on
debugging and troubleshooting.
Facilitate collaboration:
Tests can help facilitate collaboration among developers by providing a common language and
framework for discussing the code. Tests can also help ensure that changes made by one developer
do not break the code or data pipeline for others.
Support scalability:
Tests can support scalability by helping to identify bottlenecks or performance issues before they
become major problems. This can help ensure that the data pipeline can handle increasing volumes
of data and processing requirements.
Improve confidence:
Tests can improve the confidence of data pipeline users and stakeholders by providing evidence that
the pipeline is reliable and producing accurate results. This can help build trust and credibility for
the data pipeline and its users.
In summary, tests are critical to developing a high-quality and reliable data pipeline. They can
help ensure correctness, catch errors early, facilitate collaboration, support scalability, and improve
confidence in the pipeline.

Unit Testing
Unit tests are the first line of defense. No one should have to run a pipeline to be able to know if it’s
working or not. Changing the business or transformation logic acting upon data should be tested
in such a way that anyone can attempt the change and run tests with a reasonable assumption that
unit tests will catch any problems.
Unit tests will not catch every bug, they never will, but they will catch some, and they will protect
you from those “silly” mistakes we are all prone to making.
In our above example, we would have at least …

tests()
test_download_csv_files_in_parallel()
test_stream_files_to_memory()
test_calculate_metrics()
test_save_metrics_to_database()

Every language has its own popular testing frameworks, for Python that would be pytest. Any
Google search or YouTube video can introduce you to the basics of testing for any language and
have you writing your own in an hour or less.
Many times it’s critical when testing data pipeline code to have example and sample files to process,
and the sooner you generate them and start writing tests the better.

It’s not a good idea to wait until you are done before thinking about tests.
Many times writing unit tests for the method or function you are writing right away will cause you
to refactor your code, to make it simpler and more testable … this is a good thing.

import csv

def open_csv_file(file_location: str) -> list:
    # For this simple example we just return the first row of the file.
    with open(file_location) as f:
        csv_reader = csv.reader(f)
        for row in csv_reader:
            return row

def test_open_csv_file():
    test_row = open_csv_file(file_location='sample_file.csv')
    # csv.reader yields strings, so the assertion compares string values.
    assert test_row == ['1', '2', '3']
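Rather than committing a sample_file.csv to the repo, you can also have the test build its own sample data. Here is a small sketch using pytest’s tmp_path fixture; it assumes the open_csv_file function above lives in a module you can import (the module name here is hypothetical):

>> test_stream_csv_files.py
from my_pipeline import open_csv_file  # hypothetical module holding the function above

def test_open_csv_file(tmp_path):
    # Build a throwaway sample file so the test never depends on files
    # checked into the repo.
    sample_file = tmp_path / "sample_file.csv"
    sample_file.write_text("1,2,3\n4,5,6\n")

    test_row = open_csv_file(file_location=str(sample_file))
    assert test_row == ['1', '2', '3']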

If you’re new to testing just take some time to read some articles and watch some videos for your
language of choice. Once you get into the habit of writing unit tests you won’t look back.

Documentation
Having a README.md file in every ETL project isn’t asking that much. No matter how well we think we
write code, no one, including yourself six months later will remember every nuance and reasoning
that caused code to be written in a certain way with certain assumptions.
Documentation is a crucial aspect of developing a data pipeline. Here are some reasons why
documentation is important in a data pipeline:
Facilitate understanding:
Documentation can help developers and stakeholders understand how the data pipeline works, what
it does, and how it should be used. This is especially important for complex data pipelines that
involve multiple data sources and complex transformations.
Support maintenance:
Documentation can help support the maintenance of the data pipeline by providing guidance on how
to modify, update, and troubleshoot the pipeline over time. This can help ensure that the pipeline
remains efficient, reliable, and scalable as data volumes and processing requirements change over
time.
Foster collaboration:
Documentation can help facilitate collaboration among developers and stakeholders by providing
a common language and framework for discussing the data pipeline. This can help ensure that
everyone is on the same page and working towards the same goals.

Improve efficiency:
Documentation can improve efficiency by reducing the amount of time and effort required to
understand, use, and modify the data pipeline. This can help ensure that developers and stakeholders
can focus on other important tasks, such as data analysis and modeling.
Mitigate risk:
Documentation can help mitigate risk by providing evidence that the data pipeline is reliable,
accurate, and compliant with relevant regulations and policies. This can help protect against data
breaches, regulatory fines, and other risks associated with data management.
Documentation is critical to developing a high-quality and reliable data pipeline. It can help facilitate
understanding, support maintenance, foster collaboration, improve efficiency, and mitigate risk.
By investing in documentation, data engineers can ensure that the data pipeline remains efficient,
reliable, and scalable over time.
The README.md is your chance to get these important ideas across.
If your company uses something like Confluence, even better! But everyone has an opportunity to add a README.md file to a new or existing pipeline project you’re working on.
A good place to start with README.md contents is here …

• Give a high-level description of what the code’s doing.
• Describe the input and output datasets, where they come from, where they go.
• List any reasons and business requirements that drove decisions when writing the code.
• Explain how to test and deploy the code.
• Talk about any configurations or other arguments used in the code.

Even a small amount of context given at the right time to some new engineer who’s about to debug
code they’ve never seen is going to be a lifesaver. It will make you a hero.
Let’s look at an example README.md for a Data Lake project. What we want to do is provide new
technical folks a good overview of what they are dealing with.

README example

>> README.md
Data Lake Repo (Databricks and Delta Lake)

Docker and Tests
First build the docker image: docker build --tag delta-warehouse .
You should only have to do this once, unless you change the Dockerfile.

Use the command docker-compose up test to run all unit-tests. The container will be
tied to your local volume from which you are running the command, and will pick up
your changes.

Storage
There are three s3 buckets located in us-east-1 used by the DeltaLake and Databricks
warehouse.

warehouse-production
warehouse-development
warehouse-integration-testing

The Development bucket should be used for exploratory and development work as needed.

The Integration bucket should not be used for development, but only end-to-end
automated integration tests. (sample data that will be static)

The Production bucket should be for production use only.

Table Maintenance
We run two different types of table maintenance. See data-warehouse/maintenance/.

Optimize (reduce number of small files)
Vacuum (delete unused data files)

DataBricks Scripts
The folder databricks-scripts holds the files that Airflow will submit as an s3 uri
in a DataBricks Job call.

This folder is auto pushed via CircleCI on push to the master branch to
s3://warehouse-production/databricks-scripts

Warehouse Layout
The warehouse is made up of raw files in an s3 bucket, which are moved into Databricks
DeltaLake tables via DataBricks, with ETL orchestrated by Airflow.

The data flows as follows....
s3 raw files
Staging Delta Tables
Fact/Dimension Delta Tables

I know documentation can seem like a boring topic, but it’s better to have something rather than
nothing. It’s better to read a few sentences before digging into a project, to get the general idea of
what’s happening, rather than having to guess.

Containerization
You’re just going to have to learn to use Docker, containers are the future and are the reality today.
This is just as true for a data engineer as it is for a software engineer. There is no easier way to
make a data pipeline repeatable and reliable than to have a Dockerfile to run the code on. It removes
all system ambiguity.
Containers are a great equalizer when it comes to systems, requirements, and dependencies. You can essentially freeze every detail about where and how your code runs, accessible to yourself or anyone coming on later to work on the project. I hope this importance doesn’t escape you.
Containerization is an important technique in data engineering pipelines that involves packaging
an application and its dependencies into a container, which can then be deployed across different
environments with consistent behavior. Here are some reasons why containerization is important
in data engineering pipelines:
Consistency:
Containerization ensures that the data pipeline is consistent across different environments, such as
development, testing, and production. This helps ensure that the pipeline performs the same way in
all environments and reduces the risk of errors caused by differences in environment configurations.
Portability:
Containers are portable, meaning they can be deployed across different operating systems and cloud
platforms. This enables data engineers to move the data pipeline between different environments
with ease, without worrying about compatibility issues.
Scalability:
Containers are lightweight and can be scaled up or down quickly and easily to handle changing vol-
umes of data. This helps ensure that the data pipeline can handle increased processing requirements
and data volumes without the need for significant changes to the infrastructure.
Efficiency:
Containers use resources more efficiently than traditional virtual machines, as they share resources
with the host operating system. This enables data engineers to run multiple containers on the same
hardware, which can lead to significant cost savings.

Security:
Containers provide a secure environment for running applications, as they are isolated from the
host operating system and other containers. This reduces the risk of security breaches caused by
vulnerabilities in the application or the operating system.
Containerization is an important technique in data engineering pipelines that provides consistency,
portability, scalability, efficiency, and security. By containerizing the data pipeline, data engineers
can ensure that it performs consistently across different environments, can be easily deployed and
scaled, and is secure and cost-efficient.
I know that some things can feel like a hill too big to climb, especially if you have never used Docker
before, containers can be a little bit of black magic. But, I promise it’s not that bad.
What does a Dockerfile do for a data engineering project or pipeline?

• Anyone can run the code.
• Anyone can test the code.
• Reduces the chances of code/packages breaking the pipeline.
• Integration is easier because today’s distributed systems are embracing containers.

Let’s look at a quick example of how Dockerfiles and containerization can remove ambiguity and
level the playing field in data engineering. When coming to a new code base for the first time it can
be helpful just to review the Dockerfile to see what tools and systems are installed.
Just doing that simple task will usually give you a good idea of what’s to come!

>> example Dockerfile
FROM ubuntu:18.04

RUN apt-get update && \
    apt-get install -y default-jdk scala wget vim software-properties-common \
    python3.8 python3-pip curl unzip libpq-dev build-essential libssl-dev \
    libffi-dev python3-dev && \
    apt-get clean

RUN wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz && \
    tar xvf spark-3.0.1-bin-hadoop3.2.tgz && \
    mv spark-3.0.1-bin-hadoop3.2/ /usr/local/spark && \

We can see from the above simple example of a Dockerfile we have an Ubuntu image that gets Java,
Scala, and Python installed. We can also see Spark being installed, this gives us a great idea of what
this project will be all about, Spark!

• It’s clear from the Docker container what tools and tech are being used.
• It’s all packaged and ready to go, with no manual installations.

Also, it levels the playing field in the sense that the next Developer who comes to work on this
project doesn’t have to spend a day fighting his machine getting Spark installed before he can start
work.
I’m not going to teach you how to become an expert with Dockerfiles, that isn’t the point of this
book. But I suggest you spend some time learning how to write tests and wrap your projects in a
Dockerfile. There are plenty of good resources to help you get started, just Google it!
We will dive more into some high-level Docker examples for data engineering in the DevOps section
later.

Architecture First
Surprise, surprise we are talking about architecture, are your eyes glazing over yet? I know you
might be rolling your eyes and saying “get on with it already.” Diving into writing code right away
without working through the architecture first is a big mistake.
We all know that code is written with certain assumptions and around certain tech stacks. Code that is going to run on a Lambda vs. code that will be running via a cron on an EC2 box is probably going to be different. So, we should think through the differences first, before writing code.

• Architecture before code always.
• Think before you act or write code.
• Ask questions early and often.

Architecture First is an approach to data engineering that prioritizes designing the overall architec-
ture of the data pipeline before implementing any code. Here are some reasons why Architecture
First is important in data engineering:
Flexibility:
An Architecture First approach enables data engineers to design a data pipeline that is flexible and
can adapt to changing business requirements. By considering the overall architecture of the pipeline
first, data engineers can ensure that it can support new data sources, transformations, and processing
requirements in the future.
Integration:
An Architecture First approach enables data engineers to design a data pipeline that integrates
seamlessly with other systems and applications. By considering the overall architecture of the
pipeline first, data engineers can ensure that it can be easily integrated with other systems and
applications, reducing the risk of data silos and enabling more comprehensive data analysis.
Reduced risk:
An Architecture First approach can help reduce the risk of errors and inefficiencies in the data
pipeline. By designing the architecture first, data engineers can identify potential issues and
challenges early in the development process, reducing the risk of costly errors and rework.
Architecture First is an important approach to data engineering that prioritizes designing the overall
architecture of the data pipeline before implementing any code. By considering the architecture first,
data engineers can ensure that the pipeline is scalable, flexible, efficient, and integrated with other
systems, while reducing the risk of errors and inefficiencies.

Pipeline Architecture Starts with Questions

This is more than how the code is laid out in the files as we talked about earlier. This is a thought process that begins before a single line of code is written, and that is usually where projects live or die.
I look at the topic of software and data pipeline architecture like that old saying your parents preached at you … “think before you act.”
We are comfortable with certain design patterns and ways of solving problems. We usually rely on our history and how we’ve done something similar in the past. This can be an ok approach, but probably not the best.
It’s best when starting a data pipeline or ETL project to take the 10,000-foot view. Ask yourself a few
questions.

• What am I trying to accomplish?
• Do I have a tool or tech stack already that can solve this?
• What would be the simplest approach?
• What would be the most complex approach?
• What are the trade-offs of the above?
• How does this plan fit into the bigger picture of my org?

These are just a few questions to get you started. The best thing you can do is try to poke holes in
your ideas and design.
Being able to “step back” from the problem and think high level about pipeline and data architecture
seems like a lost art. Too many people rush into decisions and architecture because it’s familiar and
safe. Don’t do that.
The decisions we make upfront have a big impact downstream.

Review
Let’s do a quick review of our data pipeline basics.

• Project Structure
• Testing
• Documentation
• Containerization
• Architecture First

These are some of the core tenets of developing top-tier data pipelines that some probably see as
“not the point” or as “mere peripheries” of the development work. I can assure you this is a bad
stance to take. Getting certain core principles and practices in place before the real work begins will
change the trajectory of the entire project.
Most people say “I will get to that later,” and “we can come back to get that.” This rarely happens.
Typically, when a project or codebase starts with a mess of files and no project structure, it sets a precedent that the code is no good either.

• Tech debt and bad design happen when you say “I will fix that later.”
• Being messy with project structure upfront will ensure things only get worse.
• When tests are not a first-class citizen of your data pipelines, this ensures failure.
• Not taking the time to containerize the project with Docker makes development difficult.
• Skipping architecture even on simple projects is a mistake.

When you first discover a data pipeline that doesn’t have a single test written for it you will be
scared to touch it, who knows if something will break?
Also, when you find no documentation or even README it’s easy to get overwhelmed, how do
you know where to even start? What is the entry point for the code, what is the background,
what technologies are used? The amount of questions that can be answered with some simple
documentation is amazing.
Never forget containerization with Dockerfiles and how it puts everyone on the same page when it
comes to development and requirements. Without containerization, you just rely on people getting
things installed on their machine, Apple, Linux, Windows .. who knows all the headaches that will
happen?
Of course, it all begins and ends with architecture first. The classic mistake most developers fall into
is just jumping in feet first writing code without looking back. Do yourself and others a favor and
save time and headaches by working through the ins and outs of the technical details before writing
the actual code. It will make your life easier in the long run.
Chapter 3 - Pipeline Architecture
Data pipeline and platform architecture is an important topic that many data engineers shy away
from. We all fall into the pit of going straight to writing code without thinking about the big picture.
This is what architecture is all about, the big picture.
It doesn’t matter what your title is, it doesn’t need to have an architect in front of it, it’s something
you should do and a skill you should work on every day. You might feel like your projects are too
small to worry about such lofty ideas, but you are wrong.
Data Engineers, especially those at the senior levels, depending on the company, are given great
leeway in deciding what tools are right for the job. Many times everyone will look to the Data Engineers for recommendations about what pipeline tools and technologies are available, and which are best suited to a specific need.
When approaching architecture in data engineering, data engineers should consider the following
key steps:
Define the problem:
The first step in approaching architecture is to clearly define the problem the data pipeline is intended
to solve. This involves understanding the business requirements and data sources, as well as any
technical constraints or limitations.
Identify the data flow:
Once the problem has been defined, the data engineer should identify the data flow through the
pipeline. This involves mapping out the data sources, transformations, and destinations, as well as
any required data storage or processing steps.
Choose the right technologies:
The next step is to choose the right technologies for the data pipeline. This involves evaluating
different technologies based on their scalability, performance, cost, and compatibility with existing
systems.
Design for scalability:
When designing the data pipeline, data engineers should prioritize scalability to ensure that the
pipeline can handle increasing volumes of data and processing requirements over time. This involves
considering factors such as data partitioning, load balancing, and parallel processing.
Ensure data quality:
Data quality is a critical aspect of data engineering architecture. Data engineers should design the
pipeline to ensure that data is validated, cleaned, and enriched to meet business requirements and
avoid downstream issues.
Incorporate security and compliance:
Security and compliance are also important considerations in data engineering architecture. Data
engineers should ensure that the pipeline is designed to meet regulatory requirements, such as GDPR
or HIPAA, and includes appropriate security measures, such as encryption and access controls.
Test and iterate:
Finally, data engineers should test and iterate on the data pipeline to ensure that it meets the business
requirements and performs as expected. This involves testing different scenarios, identifying issues,
and making improvements based on feedback and performance data.
Data engineers should approach architecture by defining the problem, identifying the data flow,
choosing the right technologies, designing for scalability and data quality, incorporating security
and compliance, and testing and iterating to ensure that the pipeline meets business requirements
and performs as expected.
Data pipeline architecture …

• Is about the big picture.
• Is about fitting puzzle pieces together.
• Is about reducing complexity.
• Is about finding the right tools for the job.
• Is about making tradeoffs.

To be able to answer these important questions, and allow the data pipelines to flow free and fast,
requires some basic architecture skills and thought processes.
Architecture isn’t some high and lofty idea that has no grounding in reality. The best architects use
their knowledge, experience, and research to understand the technical details of what has to happen.
They pick the best long-term and simplest solution to get the desired result.
Many times it’s about trying to discover problems and roadblocks before they happen.

All that might sound scary, but it’s thinking about your past experiences, and the experiences of
others and applying them to the future. It’s about trying to objectively poke holes in software and
solutions before they are chosen or written. Think about what could go wrong, where bottlenecks might exist, and about worst-case scenarios.

Architecture Applied to Data


What does data pipeline architecture come down to?

• What are the business requirements.
• What end result is needed.
• What are the Service Level Agreements (SLAs).
• Draw it before you write it.
• How much data, how often, and how quickly will it grow.
• What types of data format(s).
• Storage needs, compute needs.
• What pieces of technology fit the requirements best.
• What do I have today that will do the job, what are the tradeoffs.
• Code architecture, how should it be implemented.
• Batch vs Streaming.
• Putting the final puzzle pieces together.

Taking a step back and trying to think about all the possible scenarios and solutions is very helpful
in data engineering. It means trying not to make a decision and start designing a pipeline in your mind based around some piece of technology that you have a bias for.
Try to avoid swinging too far into over-engineering a solution. The majority of the time the simple
solution will out-perform and cause fewer headaches than the fancy new one.

• Architecture provides space to think and plan.
• Architecture provides space to fail early.
• Simple over complex.
• Think about connections between systems.

Many times simply drawing or sketching out the data pipeline that you want helps to think through
problems and help others see your vision.

Data engineers will often have to think about the coupling or connection between systems, or
pieces of the pipe. We might have one tool that orchestrates and manages pipeline dependencies
like Airflow. We might have another tool like Databricks Spark that runs the transformations of the
data. These tools must work together, and such concerns are to be considered during the architecture
planning phase of most data engineering projects.
Before we jump into the different steps and details of architecture, let’s take a look at an example
problem, to help us make sense of some common architecture thought processes.

Example Project
Let’s work through how we can think about pipeline architecture by using an example. Here is the
background.

The Problem Statement

You work for a medium-sized manufacturing company that makes all sorts of widgets. It’s a small
IT department of about 15 people, with only 3 data engineers. The company recently purchased
some new manufacturing equipment. These new widget machines are fancy and can connect to the
company network, sending out a signal/data each time a widget is produced in a TXT or flat-file
configuration.
Instead of manually counting inventory, the company would like the data engineering team to
capture this widget data being output by the new machines and feed it back into the inventory
system as well as provide dashboards of machine throughput to the business.

• Small manufacturing company.
• Limited engineering resources.
• Manufacturing machines push TXT files to a network location.
• IT Infrastructure is in AWS and local.

The IT department is still in the middle of a transition to the cloud; they have some infrastructure in AWS, and some local servers still running that host the manufacturing system.
As the Senior Data Engineer, you’ve been tasked with deciding how to implement this system. Let’s
walk through the architecture thought process of how we might approach the different ways to
implement a solution.

Step 1 - Examine Requirements

The first part of any architecture decisions on data pipelines should be examining the requirements
of the business or as much of them as you can get. Of course in the real world, it might take some
digging and you only find out half of what you need, but the answers are important nonetheless.
Understanding what the final expectations are is key to making technology and pipeline design
decisions. Let’s say in our case we find that the business expects a dashboard to be updated every
few hours with production numbers from the machines, but you also find out that they want the
inventory system to be updated every hour at least.

• Always understand final output and expectations.
• Understand current and future infrastructure.

Sketching out the current state and the high-level data flow needed is always a great place to start.
Here is a sketch of our current example.

This already starts to tell us a story, real-time information is not expected for this project. That
means when choosing the simplest technology layout to solve our problem, we can probably toss out
streaming or message queues. It’s always helpful to understand where we can get rid of complexity!
Let’s review what we discover about the above requirements. When working on architecture it
always helps to make a list that boils down to what we need.

• Not real-time, but hourly updates.
• Need an analytical dashboard.
• Ingest/capture widget production data points.
• Will not be running in the cloud.
• TXT files for source raw data sent to a network location.

We can see from the above list that our possible options for the pipeline are becoming more obvious,
we can rule in or out certain technologies. We don’t need streaming, we need batch. We need code
that ingests TXT files, which is fairly straightforward. We also know we need to build a warehouse
to house the accumulated data for dashboard analytics.
Now, these are things that data engineers love to work on!
Let’s continue to play along with our contrived example, and dig into each step we listed in the
introduction as some of the core tenets and questions we ask and work on during data pipeline
architecture.

How much data, how quickly will it grow/ types of data format(s).

For our example, we talked to the manufacturing folks in charge of setting up and installing the
machines. They have documentation from the vendor that says the new machines will write out
plain text tab-delimited files to a drive on any IP address via ssh.
A new file will be written every 10 minutes in the following format.

machine_id    part_number    datetime               quantity
AB34252AA     1545A03        2021-04-01:13:34:43    50
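Parsing files like this is straightforward in Python. Here is a quick sketch, with the field names taken from the sample above; the file path is made up for illustration:

>> parse_widget_file.py
import csv

def parse_widget_file(file_location: str) -> list:
    # Read the tab-delimited machine output into plain dictionaries.
    with open(file_location) as f:
        reader = csv.DictReader(f, delimiter="\t")
        rows = []
        for row in reader:
            row["quantity"] = int(row["quantity"])  # cast the count for later math
            rows.append(row)
    return rows

# Hypothetical usage, assuming the network share is mounted locally:
# records = parse_widget_file("/mnt/widget_machines/machine_AB34252AA.txt")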

We know we have 5 new machines, which will output files every 10 minutes, they are text files, so
the data will be relatively small and easy to deal with. It appears even if more machines are added,
the data will not grow overwhelmingly fast.
As a data engineer, this is not a ton of information, but it allows us to sit down at this point and start
making some high-level decisions.

• Where should the data be stored, cloud or local?
• What type of tech platform is good for this type of data?
• Calculate rough data sizes and the speed of processing needed.

Based on what we know we calculate that each file put out from the 5 machines will be 4KB in size,
every 10 minutes. We know the plant will be running 10 hours per day …

60 files x 5 machines x 4KB = 1,200KB per day

This is not much data … even if we are half wrong and our pipeline needs to handle double the data.
This is what we do in architecture planning, we break down what we know into bite-sized chunks,
we ask questions, and start making estimates and design decisions based on what we do know.
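That back-of-the-envelope math is easy to capture in a tiny script you can tweak as the assumptions change; the numbers below are simply the ones from this example:

>> estimate_volume.py
# Rough sizing for the widget pipeline, using the assumptions above.
machines = 5
files_per_hour = 6          # one file every 10 minutes
hours_per_day = 10
file_size_kb = 4

daily_kb = machines * files_per_hour * hours_per_day * file_size_kb
yearly_gb = daily_kb * 365 / (1024 * 1024)

print(f"~{daily_kb} KB per day, ~{yearly_gb:.2f} GB per year")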

What pieces of technology fit the needs best?

Let’s make some architectural decisions. When we put a few facts together some things become
clear. The data is needed on a local server to populate the inventory system, the data is not big nor
is it real-time.
While it might have been tempting when first hearing about this project to select new and shiny
technology that would have done the job, it probably would have proved to be complicated and
expensive, another over-engineered project.

Option 1

Cloud-based streaming/messaging system to ingest files, transform and write files to cloud-based storage. Also, write results back into locally hosted servers to populate the inventory system. Maybe Kafka/Pulsar with Apache Spark streaming to a Delta Table, and Tableau as a dashboard.

Option 2

Keep everything local, the files are written to a local server, process with locally hosted Python script
on the same server, backup files to the cloud.
Maybe we present our leader with the following updated proposal. This is all done before any code
is written or decisions are made. It allows no work to be wasted and problems to be identified early.

Example Continued.

We also have to think about tradeoffs. As data engineers who love the cloud and managed services, we want to gravitate towards the fun, cool stuff, regardless of whether the complexity matches the situation or not. But we have to be realistic when architecting data pipelines.
If we worked at a manufacturing plant that was double the size, and say had 100 machines running
all producing these files that might change the discussion.

• Context is important.
• Drawing out the current state and proposed future states is helpful.
• Calculating rough data size and usage helps us make decisions.

In this case, we could present to our boss some actual numbers and our recommendation. The cost of storing our data in the cloud is known from our rough size calculations, and the cost of running messaging services along with Spark or Beam would be fairly easy to price out. This would be compared to the cost of managing and hosting a local server with our custom code to process the file(s).

Review of Example Architecture Problem

This is of course a simple example, and we glossed over many details but I think you get the picture.
I also think it’s quite common for this type of exercise to never even take place. Usually, it’s just up to an engineer who may just start choosing technology without thinking through all the options available.
Taking a good step back and thinking through the data flow and requirements is a critical step to avoid ending up with an expensive, over-engineered solution, or on the opposite side, a simplistic solution that breaks later because it fails to scale.
I want to point out a few key areas we covered in this example that apply to many data pipeline
projects when it comes to architecture decisions.

• Calculating data size and velocity.
• Calculating compute/storage requirements based on data size.
• Understanding the end result.
• Complexity vs simplicity tradeoffs.
• Understanding cost.

Data Size and Velocity


Probably one of the easiest architecture steps when thinking about data pipelines is trying to
calculate rough data size and velocity requirements.
Not only is this exercise worthwhile because it forces a basic understanding of what the data looks like and
its potential issues, but understanding the velocity or incoming volume will also shed light on what are and
are not acceptable solutions.
Both the size of the data and the velocity or frequency of the data should impact the architecture.
We don’t want to over-engineer solutions, if we are talking about a few GB’s of data, you probably
don’t need Apache Spark for example.
Data Size and Velocity are two important factors that data engineers must consider when designing
and implementing data engineering pipelines. Here is an explanation of their importance:
Data Size:
Data size refers to the volume of data that a data engineering pipeline must handle. As the volume of
data increases, the complexity and resources required to process, store, and analyze it also increase.
Data engineers must design pipelines that can handle large data volumes while ensuring that they
are scalable, efficient, and cost-effective.
Data Velocity:
Data velocity refers to the rate at which data is generated and processed. In some cases, data must
be processed and analyzed in real-time or near real-time to enable timely decision-making. Data
engineers must design pipelines that can handle high data velocity while ensuring that they are
accurate, reliable, and scalable.
The importance of data size and velocity can be summarized as follows:
Performance:
The size and velocity of data can impact the performance of data pipelines. As data volumes increase
or processing requirements become more demanding, data pipelines may experience performance
issues such as slow processing times or data latency. Data engineers must design pipelines that can
handle the performance requirements of the data being processed.
Scalability:
Data size and velocity can also impact the scalability of data pipelines. As data volumes and
processing requirements increase, data pipelines must be able to scale up to meet the demand. Data
engineers must design pipelines that can scale horizontally or vertically to handle increased data
volumes and processing requirements.
Resource Management:
Managing the resources required to process and store large volumes of data can be challenging. Data
engineers must design pipelines that efficiently use resources such as processing power, memory, and
storage to ensure that the pipeline remains cost-effective.

Analytics:
Data size and velocity can impact the types of analytics that can be performed on the data. Real-time
data requires different analytics approaches than batch processing, and large data sets may require
different analytical tools or techniques than smaller ones. Data engineers must design pipelines that
support the required analytics based on the size and velocity of the data.
Data size and velocity are important factors in data engineering pipelines that impact performance,
scalability, resource management, and analytics. Data engineers must design pipelines that can
handle large data volumes and high data velocity while ensuring performance, scalability, and cost-
effectiveness.
Let’s dive a little deeper.

Data Size
Why should we understand data size? Because usually, it requires us to go through a few exercises
that help a person better understand what a pipeline needs to look like.

• What format is the data coming in, can we get a sample?


• Can we simply mock-up a sample file ourselves?
• What types of systems handle data files of this size?

Let’s consider another simple example. We find out we have a new data source; someone sends us a
sample CSV file that is comma-delimited and contains some 20 million records, about 9GB in size.
They tell us we will be getting two of these files per day.

• 20 million records per file.


• 9GB per file, twice per day.
• CSV file format.

If we are in a Python shop a few things might come to mind. We have a few options for processing
these files…

• CSV reader in Python


• Pandas
• Dask
• PySpark

Most data engineers will understand that processing files of that size without memory issues is
probably going to require Dask or PySpark, libraries that are good at lazy evaluation.
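As a quick illustration of why lazy or chunked processing matters, here is a minimal sketch (the file name and chunk size are hypothetical) that works through a large CSV in pieces with pandas rather than loading the whole 9GB file into memory at once.

>> chunked_read.py
import pandas as pd

# hypothetical 9GB CSV processed in 1-million-row chunks
total_rows = 0
for chunk in pd.read_csv('sample_9gb.csv', chunksize=1_000_000):
    # each chunk is a regular DataFrame, small enough to fit in memory
    total_rows += len(chunk)

print(f"processed {total_rows} rows without loading the full file")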
What else can we understand from data size? Well, the size of the file is also going to tell us
generally about what type of technology is best suited for this particular problem. Most of us

probably understand that CSV files that are 9GB in size are not suited well to any sort of streaming
or messaging system.
But let’s say we found our data files were JSON files just a few KBs in size, delivered at the rate of
300,000 per day; then we might start thinking about a streaming service that seems to fit that
data size pattern a little better.
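To make that concrete, a rough back-of-the-envelope calculation (assuming the 300,000 messages arrive evenly through the day, and an assumed average message size) looks like this.

>> velocity_estimate.py
messages_per_day = 300_000
avg_message_size_kb = 5          # assumed average size, adjust to your sample data

messages_per_second = messages_per_day / (24 * 60 * 60)
daily_volume_gb = messages_per_day * avg_message_size_kb / 1_000_000

print(f"{messages_per_second:.1f} messages per second on average")
print(f"roughly {daily_volume_gb:.1f} GB of raw data per day")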

• Think carefully about which technologies would be best suited for the size of data your pipeline
is handling.

Data Velocity
Just like data size, the amount of incoming data (velocity) is another critical aspect of data pipeline
architecture, as we discussed earlier. It makes a very sizeable difference if we are receiving something
in only a handful of “batches”, or if we will be receiving hundreds of thousands or even millions of
files.
There are technology stacks that are built to deal with file volumes on each end of this spectrum.

• At what rate and frequency is the data coming at us?


• Some architecture and tools are better suited to high volumes of data than others.

The nice thing about understanding data velocity during the architecture phase is that it informs
what some of the rest of the pipeline might have to look like and handle.
If we are looking at a high volume of incoming files, the growth of data may require some sort of
compaction or aggregation steps in the pipeline. If we expect to receive only a few files a day, we
will most likely not have to think about these compaction or aggregation steps very much.
So, calculating data size and velocity upfront, even if you end up being half wrong is incredibly
helpful. It’s almost a given that going through this exercise will bring out the obvious winners and
losers of which platforms were made to handle your data needs.

Calculating Compute Requirements


Another area that is often ignored during the architecture and layout of data pipelines is the amount of
compute that will be required. This can be one of the most important estimations because it’s usually
the number one driver of cost.
The more compute power that is needed the more expensive the pipeline will be to run. Not every
surprise is a good one!
Calculating compute requirements is an important step in designing and implementing a data
pipeline. Here are the steps you can take to calculate compute requirements:

Understand the workload:


The first step is to understand the workload that the data pipeline will be handling. This includes
the type of data being processed, the processing requirements, and the expected volume of data.
Identify the processing steps:
Once you understand the workload, you can identify the processing steps required to handle the
workload. This may include data ingestion, transformation, aggregation, and analysis.
Determine the processing time:
The next step is to estimate the processing time required for each step of the pipeline. This can be
done by analyzing historical data or by running test scenarios.
Calculate resource requirements:
With the processing time estimated, you can calculate the resource requirements for each processing
step. This includes CPU and memory requirements, as well as storage requirements.
Factor in scaling requirements:
In addition to the resource requirements for the current workload, you should also factor in the
scaling requirements. This includes the ability to scale up or down based on changes in the workload.
Choose the right infrastructure:
With the compute requirements calculated, you can choose the right infrastructure for the data
pipeline. This may include cloud-based infrastructure such as AWS or Azure, or on-premises
infrastructure.
Monitor and adjust:
Finally, it’s important to monitor the performance of the data pipeline and adjust the compute
requirements as needed. This may involve adding or removing resources, adjusting processing times,
or modifying the pipeline architecture.
Calculating compute requirements involves understanding the workload, identifying processing
steps, estimating processing time, calculating resource requirements, factoring in scaling require-
ments, choosing the right infrastructure, and monitoring and adjusting as needed. By following
these steps, you can design and implement a data pipeline that meets the compute requirements of
your workload.
There are a few ways to get a rough estimate of the compute requirements. Let’s talk about the most
straightforward approach to determining resource requirements with an example or two.

• With rough data sizes, break the work up into logical batches.
• Determine based on transformations and possible tooling, the required CPU and RAM
requirements.
• Do the simple math.

Resource Requirement = (processing unit size x (number of CPU + RAM)) x number of batches
This will roughly give you an idea of what will be required. Let’s walk through a simple example.

Example of Calculating Compute Requirements

We are architecting a new data pipeline and will be receiving two raw CSV files per day, each with
about 15 million records that will need a few different transformations. After the transformations,
we will be storing the data in compressed parquet files. Each CSV file uncompressed is about 10GBs.

• 2 CSV files per day.


• 15 million records per file.
• Semi-complex transformations.
• 10GB per file.

We have decided that using Apache Spark would be a good option for processing the CSV files,
applying the minor transformations, and writing the results to a parquet data lake in S3.
Right away we can understand that this process will be both CPU bound and RAM bound. But
knowing that we will only be processing 10GBs at a time, twice a day, getting the rough estimate of
resource requirements is very easy.

• We are bound by both CPU and RAM


• Max 10GB of processing per file.

We know that Spark is a lazy evaluation framework and that the transformations on our 15 million
records don’t require all the data to be held in memory.
Knowing these facts the compute resource calculations become straightforward.

Resource Requirement = processing unit size (1 file at a time) x (number of CPU (4) + RAM (8GB)) x number of batches (2) = 8 CPU and 16GB of RAM will be the total resource used during 1 day of use.

• We can use rough resource requirements to size machines and clusters.

This information along with a guesstimate of how long it might take, say 15 minutes to process such
a file, and armed with the cost per hour information from AWS/GCP/Azure we can quickly calculate
roughly what it will cost to run this pipeline for a day.
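As a rough sketch of that calculation in code (the hourly price below is a made-up placeholder, not a real AWS/GCP/Azure quote), the math might look something like this.

>> compute_cost_estimate.py
# assumptions: one 4 CPU / 8GB machine per batch, prices are hypothetical
batches_per_day = 2
minutes_per_batch = 15
price_per_machine_hour = 0.40    # placeholder hourly price, check your cloud provider

hours_per_day = (batches_per_day * minutes_per_batch) / 60
daily_cost = hours_per_day * price_per_machine_hour

print(f"{hours_per_day:.2f} machine hours per day, roughly ${daily_cost:.2f} per day")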

Calculating Storage Requirements


Calculating storage requirements and cost is the same process as compute resources. Knowing
roughly how big the data is, and how often it is received is usually not that hard.
Once you know this information, along with some compression numbers that are not that hard
to find, cloud storage costs can be worked out.

• Calculate storage requirements similar to compute resources.

In our above example, we know that two files of 10GB each will be ingested and turned into
compressed parquet files in S3. A quick Google search and some testing will show you that we
can probably conservatively get 70% storage savings moving from CSV to Parquet.

storage requirement = 2 files per day x 10GB per file x 365 days in a year x 0.30 (size of compressed parquet) = roughly 2.2TB of data per year.
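A minimal sketch of turning that yearly size estimate into a monthly storage bill, using a hypothetical object storage price per GB-month (look up your provider’s actual rate), might look like this.

>> storage_cost_estimate.py
files_per_day = 2
gb_per_file = 10
compression_ratio = 0.30         # parquet at roughly 30% of the original CSV size
price_per_gb_month = 0.023       # placeholder price per GB-month

yearly_gb = files_per_day * gb_per_file * 365 * compression_ratio
monthly_cost_after_one_year = yearly_gb * price_per_gb_month

print(f"~{yearly_gb / 1000:.1f} TB stored after one year")
print(f"~${monthly_cost_after_one_year:.2f} per month in storage at that point")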

Many times understanding basic compute and storage requirements during the architecture phase
of pipeline work might change some of the decisions we make.
In the example discussed above, after thinking about cost and complexity, we might decide that
processing those files on a Spark cluster is overkill. Spark is more expensive to run in the cloud, and
our 10GB per file could easily be processed with the Apache Airflow framework we are already
running, for example.
While all these calculations might be simple, going through the exercise of understanding roughly
the storage and compute requirements might change our architecture decisions about what tool(s)
to use and our approach to the problem.

Understanding the End Result


Another overlooked part of architecting data pipelines is the simple step of trying to think about
the result. Too often as developers and engineers, we can get caught up in the excitement of writing
code, new technology, or just how to solve the problem without stepping back first.

• Step back and understand where the data is going to be, and what it needs to look like.
• End results will often change the way we get those results.

Getting a clear picture of the output and usage of the data needed from the pipeline will inform or
change some of the choices we make upfront. Knowing if the data is needed for a dashboard or a
web application at the very least is going to change how we store the data, and how we store the
data might have an impact on what tool we choose to process and output that data.
It’s all connected and should not be overlooked before jumping into designing a data pipeline.

Complexity vs Simplicity Tradeoffs


Many times the culmination of our architecture work will come down to this idea, and it’s probably
the most important one.
Complexity is that silent killer that sneaks up on you.

In data engineering, complexity and simplicity are often in tension with each other. On one hand,
complex data pipelines can handle more sophisticated use cases and support more advanced data
processing capabilities. On the other hand, simple data pipelines are often easier to build, maintain,
and troubleshoot.
There are several factors that contribute to the complexity of a data pipeline:
Data sources:
The more data sources you have, the more complex your pipeline becomes.
Data transformations:
As you add more transformations to your pipeline, the logic becomes more complex.
Dependencies:
The more dependencies your pipeline has, the more complex it becomes.
Error handling:
The more error handling you have in your pipeline, the more complex it becomes.
Scale:
The larger the scale of your pipeline, the more complex it becomes.
While complexity can offer more advanced capabilities and better results, there are several down-
sides to overly complex data pipelines:
Difficulty in maintenance:
Complex pipelines are more difficult to maintain, troubleshoot and modify.
More prone to errors:
Complex pipelines have more potential points of failure and are more prone to errors.
Longer development times:
Complex pipelines often take longer to develop and require more specialized expertise.
Higher costs:
Complex pipelines often require more resources and infrastructure, leading to higher costs.
On the other hand, simple data pipelines have several advantages:
Easier to maintain:
Simple pipelines are easier to maintain, troubleshoot, and modify.
More reliable:
Simple pipelines have fewer potential points of failure, and thus are more reliable.
Shorter development times:
Simple pipelines often take less time to develop and can be implemented more quickly.
Lower costs:
Simple pipelines require fewer resources and infrastructure, leading to lower costs.
There is a tradeoff between complexity and simplicity in data pipelines. While complex pipelines
offer more advanced capabilities, they are often more difficult and expensive to maintain. Simple

pipelines, on the other hand, are easier to develop and maintain, and can be more reliable and
cost-effective. Data engineers should carefully consider the tradeoffs and choose the right balance
between complexity and simplicity based on the specific needs of their organization.
There are few things in the world of data engineering that will cause problems like over-complexity.
It takes many forms and has terrible side effects.

• Extended debugging and error research times.


• Extended development time to add new functionality.
• Extended period to onboard new developers.
• Developers are afraid or unwilling to touch code.
• High cost of running and maintenance.

I am not saying that every complex system should not be complex, some problems and data are
inherently hard to work on and require complex solutions.
On the reverse side, overly simplistic solutions have their own set of problems.

• Developer boredom and short tenure.


• Inability to scale.
• Cannot extend the functionality.
• Extended time to develop and add new functionality.
• Stagnation of technology stack.

Both extremes should be avoided and have many pitfalls. But good engineers know how to strike
that delicate balance. Choosing new tools and technology simply because they are new is a bad habit
and happens too often. It’s important to remember when architecting pipelines that we should let
the data and requirements drive the complexity, not the other way around.
Forcing simplicity or complexity onto a problem is the wrong approach.

• Take a pragmatic and balanced approach to problems.


• Too complex and the pipeline will be unmanageable.
• Too simple and the pipeline won’t scale.

Data engineers should try to be balanced when designing pipelines. They should take into
consideration all that we have talked about, and then step back and evaluate the options they have chosen.
Does the data and situation warrant the complexity, or on the other hand the simplicity?
Will future developers and data/business changes break the solution? There are always reasons to
incorporate simplicity and complexity depending on the situation. The job of the data engineer is
to balance all these requirements and find the middle ground that can be reliable, scalable, and
approachable.
Knowing that we will most likely have to come back to a data pipeline to make changes, add data,
or extend functionality should give us pause when architecting systems.

Understanding Cost
A topic that is important to many businesses is trying to get a handle on costs. I would argue that
this might even be one of the easiest steps to complete if all of the other steps above have already
been completed.
If you understand the size, volume, and types of data that will be coming, along with what
technology choices are likely to be the best fit, figuring out the cost is usually not that hard.

• Understand that data file/record size will affect cost.


• Understand that the volume of data will affect cost.
• Your chosen tech stack will affect cost.

Knowing the size and growth of storage requirements, along with looking at pricing documentation
for say S3 or Azure storage, makes this an easy calculation.
The same applies to compute cost, once we understand our data size and volume we can usually
figure out what size of machine(s) are needed to process that data. Again with today’s cloud
platforms, it’s very straightforward to find per hour pricing on computing and do some simple and
rough calculations.
Again, when picking cloud offerings for certain technologies that might be required, say like AWS
EMR, pricing is always available for these decisions and can be added to the cost.

Code Architecture
Another area I want to touch on briefly that is closely related to what we’ve just discussed is the
architecture of our code. This is another good topic to think about after we’ve completed the steps
above and we are ready to dive into writing pipeline code.
Once you’ve decided to move down a path towards a solution, and you’ve set up Dockerfiles, tests, and
some documentation, the only thing left is to write code.
Everyone has a different style of writing code and usually strong opinions about it. Some people
practice TDD (Test Driven Development) to the nth degree, others do not.
Some people write in an OOP fashion, others with a more functional approach. It doesn’t matter
what style you choose, and there are gobs of books written on software architecture. I’m not here to
write a book on that, but I do think making a few points will be helpful.

• Approach your data pipeline code cautiously and slowly.


• Plan at a high level before writing code at the low level.
• Keep functions and methods as small as possible.
• Break code into logical units.

I’m generally just advocating that you think carefully and take pride in the data pipeline code you
write. Try to stick to normal software engineering best practices while writing your pipeline.
Whether you’re using Scala, Python, or Java, just take the time to understand what best practices
are and follow them.

Batch vs Streaming Architecture


A major area of concern you will find in data engineering architecture is the batch vs streaming
approach. We will mostly gloss over this, not because it isn’t important, but because it’s usually a
very obvious decision. But, maybe not to everyone, so let’s review.

• Your pipeline will either be batch or streaming.


• This distinction should be made early in architecture.

There are only two approaches you can take with data pipelines, batch or streaming. This requires
two separate technology stacks and approaches to the pipelines and architecture.
Batch processing involves processing data in discrete chunks or batches, typically over a period
of time, such as hourly, daily, or weekly. Batch processing is often used for non-real-time, offline
processing of data, such as generating reports or data analysis.
Streaming processing, on the other hand, involves processing data in real-time, as it is generated or
received. Streaming processing is often used for real-time analysis, monitoring, and alerting.
There are several differences between batch and streaming processing:
Data latency:
Batch processing has higher latency as it processes data in discrete chunks, while streaming
processing has lower latency as it processes data in real-time.
Processing model:
Batch processing uses a pull-based model, where the data is pulled from a source and processed,
while streaming processing uses a push-based model, where data is pushed to the processing system.
Data processing:
Batch processing involves processing large volumes of data in a single batch, while streaming
processing involves processing small amounts of data in real-time.
Error handling:
Batch processing typically involves processing and handling errors after the batch is complete, while
streaming processing involves handling errors in real-time.
Infrastructure requirements:
Batch processing requires less infrastructure to manage and process large amounts of data, while
streaming processing requires a more robust and scalable infrastructure to handle real-time data
processing.

Both batch and streaming architectures have their own advantages and disadvantages, and the
choice between the two depends on the specific needs and use case of the organization. Batch
processing is better suited for non-real-time, offline data processing, while streaming processing
is better suited for real-time data processing and analysis.

Streaming
Streaming use cases are fairly obvious, and many times the only choice. If we are tasked with
ingesting high volumes of high velocity and relatively small data points, streaming is of course
probably your only and best option.
Tools like Kafka, Pulsar, Spark streaming, and the like, as well as the popularity of microservices
that produce and send messages, have made streaming data pipelines very common. Typically the
data isn’t very flat or relational, not like a CSV, but more ragged, and fits JSON data structures better.

• High-velocity data.
• Smaller data record sizes.
• Increases complexity of pipeline architecture.

It’s best to think about streaming data pipeline architecture like a garden hose of water. Lots of small
messages flying about. Typically in a use-case, this will become quite clear.

Batch
Usually batch is the default and most common type of data pipeline. When real-time isn’t
necessary, and data points are produced in batches, or together in a flattened form such as a file,
batch processing will suit you just fine.
This is typically the big difference between streaming and batch systems. Batch data pipelines will
be working on ingesting file(s), while streaming pipelines will be working on individual messages
or data points.
Most of what you will learn about data engineering applies to both batch and streaming systems.
While of course, the technology behind each is different, technologies usually aren’t the hard part
to learn, it’s more about the nuances of the particular data sets that provide most of the challenges.

Puzzle Pieces
Before I close the chapter on architecture, I want to talk a little more about the big picture. Data
engineering can be a difficult subject to master, mostly because of the wide range of technology
used.

Many times one must be versed in every cloud provider product, streaming, batch, data storage,
command line, servers, networks, database, programming languages, and the list goes on. One thing
will become more clear as you gain more experience, architecture is about the big picture, and putting
puzzle pieces together.
Because of the numerous sets of tooling, and the fact that one tool can never provide end-to-end
modern data pipelines, the ability to fit puzzle pieces together is key.
Understanding how storage solutions work together with data transformation and analytics tools,
and how those tools work with orchestration tools, these types of decisions are what architecture is
really about.

• The number of technology choices available can make architecture difficult.


• Try to break the problem down into its pieces.
• Always keep the big picture in mind, not every detail.

Many times tools are picked for data pipelines and the other puzzle pieces are ignored until the
implementation phase happens, at which time roadblocks and other unknowns arise, causing project
delays and problems. How do you avoid and overcome such obstacles?
Remember when building data pipelines, no detail is too small to plan for. Think about all aspects of
the data pipeline, all the pieces and parts, choose the solution for each puzzle piece, and then test or
explore each one. Many times simply reading over the technical documentation for each puzzle piece
will reveal pitfalls or pros and cons of such a solution.
This is really what data pipeline architecture boils down to. Being able to design a simple, elegant,
cost-effective way to stitch technology together to produce the desired result.

Summary
What makes an exceptional data engineer? Sure coding skills have a lot to do with it, having
experience and know-how with different technology stacks is key. But, some of the best engineers
I have met are the ones who take their time.
Don’t underestimate the value of getting into the habit of doing simple architecture planning on
even a small data pipeline project. Planning in all areas of our life always pays back big dividends,
the world of data engineering is no different.
Thinking through the pros and cons of approaching problems in different ways is so helpful for our
critical thinking skills, and we will always learn something on the journey of trying to understand
and plan for that problem at a deeper level.
Chapter 4 - Storage
Probably one of the most overlooked pieces of the data engineering puzzle is storage. In the world
of moving data around and transforming it, it’s hard to overstate the important role that file types
and storage play in our data pipelines.
If you stop and think about data engineering in general, what is it that you do as a data engineer,
you will conclude that it starts and stops with file storage and file types, especially in the age of big
data.
File and data storage are critical components of data engineering pipelines, as they enable the
efficient and effective storage, retrieval, and manipulation of large volumes of data. Here are some
key reasons why file and data storage are important in data engineering:
Data storage enables efficient data processing:
Without data storage, data processing would be slow and inefficient, as data would have to be
constantly reloaded and reprocessed. By storing data in a structured and organized manner, data
engineers can quickly access and process data, leading to faster and more efficient data pipelines.
Data storage enables data analysis:
Data storage allows for the storage and retrieval of large volumes of data, which is essential for
performing data analysis. By storing data in a way that enables easy querying and analysis, data
engineers can quickly generate insights and make data-driven decisions.
Data storage enables data sharing:
Storing data in a centralized location enables multiple teams and stakeholders to access and use the
same data, leading to greater collaboration and more effective decision-making.
Data storage enables data security:
Storing data in a secure manner is critical to protect sensitive information and ensure compliance
with data privacy regulations. By storing data in secure locations, such as encrypted databases or
cloud storage, data engineers can minimize the risk of data breaches and ensure the confidentiality
of sensitive information.
Pipelines typically start as some sort of file(s), and the result usually ends up as a file(s). While this
could be database files, in the new era of big data it’s becoming more common to store data outside
of traditional database systems.

• Data engineering many times starts and ends with files.


• There are some commonly used file types across all data engineering.
• Understanding the pros and cons of different file types is important.
• Storage plays a large role in big data.

Why care about storage?

Understanding what types of files and compression best fit certain types of data and use cases is
a very undervalued skill. Choosing Parquet vs Avro vs JSON vs CSV vs HDF5 vs RDBMS. These
decisions should be driven by the data needs and access patterns and should be explored before just
“picking” something because it’s easy.
With the shift towards a new type of data warehousing using data lakes and file storage instead
of the traditional relational database systems, a data engineer who knows how to best layout file
storage is priceless.
Many people gloss over the storage aspect of pipelines, not realizing how much there is to learn and
the productivity and speed gains that can be found in properly picking file storage and partitioning
strategies.

Storage concepts and fundamentals.

We will talk through the six basic storage fundamentals that every data engineer needs to take
seriously when working on pipelines. While these don’t cover every topic, this is a good introduction
to what is important for a data engineer to understand.

• Access patterns.
• SQL/NoSQL vs files.
• File types.
• Compression.
• Storage location.
• Partitions.

First, we will talk about data access patterns and how they affect our decision-making when thinking
about storage. What do I mean when I say access patterns?

Access Patterns
Data access patterns are about how files need to be read and searched by our data pipelines, and
those systems that ingest the results of our data pipelines. This includes a few topics you should
always keep in mind.

• How many system(s) will need access to the data storage layer?
• How often will the above system(s) be accessing the data storage?
• How much data will these system(s) be reading?
• How much logic will be applied by those systems to the data?
• How does the system technically access the data?

It’s best practice to understand how your data will be consumed, this has ramifications around
design and tooling.

Access patterns at a high level.

In general, and usually upfront, data engineers must decide if the storage layer is going to include a
relational database system of some kind.

• Data will need to be accessed via a database.


• Data will be accessed via file storage.

Usually, this is a straightforward decision based on the requirements of the pipeline. If the data
warehouse or other storage layer already exists in say MySQL, Postgres, or SQLServer then you
may have no choice in the matter. But this is becoming less and less common today with the growth
of data, much of it unstructured or semi-structured.
Access patterns can also include how much of the data will be accessed at one time. Will it just be
small pieces or large chunks of data that are needed by outside systems? Having a storage system
that supports fast single lookups vs serving requests for GB’s or TB’s of data will change how we
think and choose technology.

SQL/NoSQL Databases vs Files.


If you are working with data that is on the smaller side, say a few hundred GBS or less, and the
data is relational and transactional, a traditional database is probably the obvious and best choice.
Traditional databases have been used for decades and have many great features like reliability and
ease of use via SQL.
SQL, NoSQL, and files are different types of storage systems used in data engineering pipelines.
Here’s an overview of each:
SQL: SQL (Structured Query Language) is a type of relational database management system
(RDBMS) that stores data in tables with a predefined schema. SQL is ideal for storing structured

data with a consistent schema, such as financial data or customer information. SQL databases are
known for their ACID (Atomicity, Consistency, Isolation, Durability) compliance, which ensures
data integrity and consistency.
NoSQL: NoSQL (Not only SQL) is a type of non-relational database management system that stores
data in a more flexible and dynamic manner compared to SQL databases. NoSQL databases are
ideal for storing unstructured or semi-structured data such as social media data, sensor data, or log
files. Unlike SQL databases, NoSQL databases do not require a predefined schema and can be scaled
horizontally to handle large volumes of data.
Files: Files are a type of storage system that stores data in a raw, unstructured format. Files are
commonly used for storing large volumes of unstructured data, such as images, videos, or log files.
Files can be stored in a variety of formats, including text files, CSV files, JSON files, and more.
In summary, SQL, NoSQL, and files are different types of storage systems used in data engineering
pipelines. SQL databases are ideal for structured data with a consistent schema, NoSQL databases
are ideal for unstructured or semi-structured data, and files are ideal for storing large volumes of
unstructured data. The choice of storage system depends on the specific needs and requirements of
the organization.
How can I know if my data should be in a SQL database or in some other filesystem or store?
While there is no way to know without exact specifications, there are plenty of rules of thumb you
can follow to help make that decision.

• Is the data tabular and relational?


• Data volumes, both ingress and egress.
• Is the data of key-value or ragged structure?

Data storage considerations.

Is the data best suited to fit into multiple tables that have very straightforward relationships together,
and is the data very tabular in nature? More explicitly, can it easily be represented in a spreadsheet? If
the answer is yes, you might consider a SQL database.
Also, think about how often the data arrives and how often the data is needed. If you are planning
on ingesting multi-millions of records a day into the data store and serving up just as many requests
for that data … it’s possible a relational database will work, but it won’t be your average setup!

• Tabular and relational data should go into SQL databases.


• High ingress and egress volumes of data might not suit many SQL databases.

Is your data best modeled by a document?

Document-style data that is transactional in nature is also a great candidate for NoSQL
systems like MongoDB and DynamoDB. The use cases for these types of data are usually pretty
obvious. Many times it comes down to whether the data needs to be more on the analytical side or
the transactional side.
Data that leans more into the key-value pair model, with possible built-in hierarchical ragged
structures, is prime for NoSQL data stores.
I’ve seen 10GB of data stored in DynamoDB because someone thought it was cool. Don’t be this
data engineer; take the time to understand the data and what storage best fits the need.

Data Storage Example


You are given a new project: your company has a new website and you will be receiving orders from
that website in JSON format, and those orders need to be available in a customer portal within 1
minute of the order.
We are also told that the incoming volume will be about 500,000 orders per day.

>> customer.json
{
    "customer_information": {
        "customer_id": 543267,
        "customer_name": "Billbo Baggins",
        "customer_address": "100 BagEnd, The Shire, Middle Earth"
    },
    "order_date": "2021-04-01 23:45:03:01",
    "product_info": {
        "product_id": 567894,
        "quantity": 5
    }
}

• High incoming order volume.


• Near-realtime availability of data.
• Document type data structure.

How to make storage decisions.

Something like MongoDB or DynamoDB would be an obvious choice for such a use case. Being
able to store the incoming order information in its document structure, without transforming it, right
into a database that can be connected to the customer portal is straightforward.
Of course, we could make this work easily in a relational database like Postgres, but again, always try
to pick the simple solution. Evaluate which technology best fits the need in a manner that is simple
and supports a reasonable expansion of requirements and data. A small sketch of what that ingest
could look like follows the list below.

• We choose MongoDB because of the high volume of incoming data.


• We don’t want the overhead of flattening data structure for SQL.
• We want a simple storage solution that can easily handle document data.
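Here is a minimal sketch of inserting one of those order documents into MongoDB with pymongo; the connection string, database, and collection names are hypothetical.

>> orders_ingest.py
from pymongo import MongoClient

# hypothetical connection details
client = MongoClient("mongodb://localhost:27017")
orders = client["webstore"]["orders"]

order = {
    "customer_information": {
        "customer_id": 543267,
        "customer_name": "Billbo Baggins",
        "customer_address": "100 BagEnd, The Shire, Middle Earth",
    },
    "order_date": "2021-04-01 23:45:03:01",
    "product_info": {"product_id": 567894, "quantity": 5},
}

# the document goes in as-is, no flattening or schema work required
orders.insert_one(order)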

Any sort of requirement for supporting transactional systems, from web apps to manufacturing
systems where CRUD and ACID reign supreme, is a classic database use case. But, as time
goes on, traditional database storage has given way to feature-rich file storage solutions.
I’m now going to side-step relational database systems; we can talk about those more in the SQL
chapter. Today’s big data is dominated by different file types.

File Types
Now that the data engineering world turns on storage systems like S3, along with the amazing
growth of data, it’s become less and less common to store data in traditional databases. Files have
become the de facto new data lake. There has been a lot of feature development around file storage,
especially when it comes to big data.
It doesn’t matter much if files are being stored in the cloud or not, the type of file chosen as the
storage layer has a massive impact on data pipeline performance and usability. At a high level, what
should you worry about when choosing file storage?

• Data compression.
• Smart files (predicate push down support, selective reads).
• File type usability by ingestion systems and code.
• Schema and data type support.
• Row vs columnar storage.

A data engineer needs to understand high-level file storage options and what use case warrants the
use of each file storage option. Most of the decisions are straightforward when it comes to choosing
file storage options. If you understand a few basics about file types, typically a use case will make
obvious what the best choice is.

We are going to try and cover a few of the more popular file options in the big data world for storage,
we will look at the most common use cases for each, and try them each with Python. By the end
you should have a basic understanding of each file storage option and when it’s best to use them.
Before we dive into the specifics and examples for each file type, let’s talk about a common topic
called row vs columnar storage that you are likely to run into.

Row vs Columnar Storage.


Something you will start to run into as you dive deeper into data engineering and storage solutions
is row vs columnar storage systems. It can all seem a little nebulous, but the difference is
very simple to understand.
Row-based storage stores data in rows next to each other, like you would visualize data normally in
your mind. It makes for quick writes (inserts) and reads on those rows of data.
Columnar storage stores the data field or column-wise, that is data from different rows but the same
column/field is stored adjacent to each other. This makes queries and computations much faster, for
things like analytics and data warehousing applications.

This means at a high level this choice between columnar vs row storage is all about the access patterns
of the data: how is it going to be used?

• Row Based - quick insert and reads.


• Columnar - analytics, and computations.

Traditional data warehouse or data lake patterns designed for analytical computations would fall into
the columnar storage system. High-volume transactional data being streamed from a clickstream
would make more sense as row-based storage.
Let’s get back to some specific examples and walk through some of the more popular file storage
formats you will find in the data engineering world.

Common file types in data engineering.


Here are the files you will most often run across and use in your data engineering work day-to-day.

• parquet
• avro
• orc
• csv/flat-file
• json

Parquet.
Probably the most popular columnar storage file type is Parquet. Parquet was made popular by the
rise of Hadoop/Apache Spark and the new type of data warehouse and data lakes.
It’s nearly impossible today to be a data engineer without using parquet files. Many times you might
be using parquet files and not even know it. For example, the open-source data lake tool called Delta
Lake, which has been made popular by Databricks, uses parquet files underneath the hood to provide
many of its capabilities.
A few things that have helped make parquet files popular as a storage system.

• Schema/data type with the file.


• Great compression.
• Predicate pushdown (filter pushdown, read what you need).
• Selective reads (only read what you need).
• Support across platforms (Spark, Pandas, etc.).
• Ease of partitioning.

Parquet schema and data types.


You can store any kind of data you want in Parquet files. What makes them different from what a
lot of people are used to with txt or CSV files is that the schema information about the data is stored
with the file.
What do I mean by the schema? I mean the column names and data types, for example, account_number
as a column and the associated data type which might be a STRING or INT. In the world of
CSV and flat-files, everything is a STRING, until your code decides it’s something else.
Parquets can store a variety of data types.

• string
• int
• decimal/float
• boolean
• binary

Parquet compression.
Parquets also come with built-in compression that is no joke. This compression can turn into big
savings when used with big data and you start thinking about it over TB’s of data. It is not uncommon
to see 70-80% storage savings from a CSV compared to a parquet with the same data.
Parquet files come with three different compression options. Many times snappy will be the default.

• gzip
• snappy
• LZO
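With pandas, for example, switching between these codecs is just a keyword argument; a minimal sketch (file names here are placeholders) is shown below, and comparing the resulting file sizes on disk is an easy way to test the savings on your own data.

>> compression_options.py
import pandas as pd

df = pd.read_csv('sample.csv')

# same data written with two different compression codecs
df.to_parquet('sample.snappy.parquet', compression='snappy')
df.to_parquet('sample.gzip.parquet', compression='gzip')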

Parquet predicate pushdowns, partitions, and projection reads.


Yet another important feature of parquet files is the ability to only read the data you need.
Meaning when calling a read function on parquet files with most tools, you can say “I want only
these 2 columns” and that is all that will be read, not the other 100 that may be there. Again, that is
a feature that makes a big difference with big data.
Let’s just look at a few parquet file examples with Python. There are two main libraries in Python
you can use to interact with parquet files, pandas or pyarrow.
Otherwise many people experience parquet files when working with Spark, which supports parquet
files natively.
Most people are used to using CSV files, so let’s see how to convert a CSV file to parquet with pandas.

>> parquet.py
import pandas as pd

dataframe = pd.read_csv('sample.csv')
dataframe.to_parquet('sample.snappy.parquet', engine='auto', compression='snappy')

Projection reads.

Another popular Python package to use with parquets is called pyarrow.


We can also see how to only read columns we are interested in. This code also converts the result to
a pandas dataframe.

>> parquet.py
import pyarrow.parquet as pq

data = pq.read_pandas('sample.snappy.parquet', columns=['ID', 'Date', 'Description']).to_pandas()

• Reading select columns from data sets is a game-changer for big data.

You can imagine in large datasets how much performance can be gained from such a feature. Unlike
a CSV file, you don’t have to read everything, with parquet you can specify what columns you are
interested in working with.
An example with PySpark while reading a parquet dataset to sum a customer’s order amounts might
look like this.

data_set.select('customer_id', 'order_amount').groupBy('customer_id').sum('order_amount')

Parquet partitions.

Many times parquet files are partitioned into subfolders based on some column, usually a date. This
breaks the dataset up into smaller chunks.

pq.write_to_dataset(data, root_path='dataset', partition_cols=['Date'])

Since parquet files are mostly used for big datasets, it’s rare to have parquet datasets that are not
partitioned.
Partitioning is an important topic we will cover a little later. What partitioning does for parquets
and other file storage systems is ensure your dataset is broken up into physical partitions.
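Once a dataset is partitioned like this, readers can skip whole partitions. A minimal pyarrow sketch (the path and filter value are hypothetical) of reading just one partition might look like this.

>> read_partitioned.py
import pyarrow.parquet as pq

# only the matching 'Date' partition is read from disk, the rest are skipped
table = pq.read_table('dataset', filters=[('Date', '=', '2021-04-01')])
print(table.num_rows)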

Predicate pushdowns.

Predicate pushdown is another feature of parquet files that makes them fast and a great tool for data
engineers. Because of the metadata stored with parquet files, reading a dataset with filters that are
“pushed” down to the file can create huge performance gains over file formats like CSV that don’t
have this feature.
A pyarrow example might look something like this.

import pyarrow.dataset as ds

parquet_dataset.to_table(filter=ds.field('order_amount') >= 7)

This means you can quickly read data that meets filters.
These topics just scratched the surface of what parquet files can do, I encourage you to remember
the features and benefits of parquet files and think hard about them before writing your next dataset
as a CSV.

Avro.
The Avro file format has been around as long as Hadoop has been. It is often described as a “data
serialization” framework. It’s always touted as being able to handle flexible and wide-ranging
data structures, even hierarchical within records. It’s also built to be row-oriented (different from
Parquet).
I think of Avro as JSON on steroids. Here are some of Avro’s highlights.

• Handles schema changes.


• Row-oriented.
• Ragged schema structure support (similar to JSON).

Avro vs Parquet.

It’s interesting to note that the Avro format is used widely in the RPC (remote procedure call) space,
to communicate messages and data across networks. This is where it widely differs from parquet.
Where parquet will be found a lot in say large datasets used for data lakes and data warehousing,
Avro is going to be found in more highly transactional and messaging systems and architectures.
One of the first things you will notice when starting to read or work with Avro is that the schema is
an integral concept. You can’t write data to an Avro file without having or defining a schema first.

• Avro is very schema-bound.


• Avro is used for messaging systems.

Most commonly the schema is defined with JSON, which makes it very approachable, but very
different from most file formats you are probably used to.

Avro Examples.

A popular Python package for working with avro is fastavro.

pip3 install fastavro

The first thing you will have to do when starting to create an Avro dataset is define the schema.
In most cases, it will be easy to just write a Python dictionary and keep it in memory
or in a JSON file.
A couple of key points, below you will notice the parse_schema() method used to turn the json/dict
into the Avro schema. Also of note, you would want some sort of iterator that contains your
records/data.
Otherwise, fastavro in Python exposes a reader() and writer() that you will most likely be familiar
with. Here is my example of writing records to a file, then reading them back in.

import fastavro

# create the Avro schema from a Python dictionary
avro_schema = {"namespace": "middle_earth.avro",
               "type": "record",
               "name": "MiddleEarth",
               "fields": [
                   {"name": "character", "type": "string"},
                   {"name": "position", "type": "string"},
                   {"name": "dragontreasure", "type": "int"}
               ]
               }

avro_schema = fastavro.parse_schema(avro_schema)  # turn the dict/json into an Avro schema

# some sort of records stream/iterable
heros = [
    {'character': 'Gandalf', 'position': 'Wizard', 'dragontreasure': 50},
    {'character': 'Samwise', 'position': 'Hobbit', 'dragontreasure': 1},
    {'character': 'Gollumn', 'position': 'Sneaker', 'dragontreasure': 5},
    {'character': 'Mr.Oakenshield', 'position': 'Dwarf', 'dragontreasure': 1000},
]

avro_file = 'middle_earth.avro'

# write records to file
with open(avro_file, 'wb') as write_file:
    fastavro.writer(write_file, avro_schema, heros)

# read the file back
with open(avro_file, 'rb') as in_file:
    for record in fastavro.reader(in_file):
        print(record)

Another example would be converting a CSV file into avro format.



>> avro.py
import fastavro
import csv

bike_share_file = 'Divvy_Trips_2019_Q3.csv'
avro_file = 'bike_share.avro'

# create the Avro schema from a Python dictionary
avro_schema = {"namespace": "bike_share.avro",
               "type": "record",
               "name": "BikeShare",
               "fields": [
                   {"name": "trip_id", "type": "string"},
                   {"name": "bikeid", "type": "string"},
                   {"name": "from_station_id", "type": "string"},
                   {"name": "gender", "type": ["string", "null"]},
                   {"name": "birthyear", "type": ["string", "null"]}
               ]
               }

# turn the dict/json into an Avro schema
avro_schema = fastavro.parse_schema(avro_schema)


def stream_csv_records(file_location: str) -> iter:
    with open(file_location) as f:
        creader = csv.reader(f)
        next(creader)  # skip header
        for row in creader:
            yield row


def transformed_stream(records_stream: iter) -> iter:
    for record in records_stream:
        avro_record = {u"trip_id": str(record[0]),
                       u"bikeid": str(record[3]),
                       u"from_station_id": str(record[5]),
                       u"gender": str(record[10]),
                       u"birthyear": str(record[11])
                       }
        yield avro_record


records_stream = stream_csv_records(bike_share_file)
transformed_records = transformed_stream(records_stream)

# write records to file
with open(avro_file, 'wb') as write_file:
    fastavro.writer(write_file, avro_schema, transformed_records)

You’re probably less likely to run into Avro in the wild, but it could happen. It’s more built for serialized
messaging systems and complex schemas that maybe don’t work well with Parquet, for example.

Orc.
Another popular storage option in the big data and data engineering world is ORC, probably not as
popular as parquet, but popular enough to have gained a decent adoption in the community.

• ORC files are made up of stripes, which are groups of row data.
• It supports data types like datetime, decimal, and the complex types (struct, list, map, and union).
• The file has/can have indexes, which help in seeking rows and skipping row groups when a reader
comes in with a predicate, aka a push-down filter.
• The file footer contains meta-information about the ORC file as a whole.
• Compression can be Snappy, Zlib, or none.
• Supports ACID when used with Hive.

So what do you need to know about ORC?

Stripes and Indexes for ORC

Stripes are integral to ORC, they are 64MB in size by default, are “independent” from each other …
allowing distributed work to happen on a file. Columns are separate from each other in the stripe,
allowing only needed data to be read.
The indexes in an ORC file allow push-down read filters into a file, signaling which Stripes need to
be read. You will notice this is similar to parquet files.
It’s also worth noting that working with ORC is not as easy as working with parquet or other files.
You can’t just work with ORC files with a simple Python setup; it was designed for Hive, and you
will most likely not run into it unless you’re working on legacy systems.
You will have more luck working with ORC using Java or Scala, JVM-based languages. Otherwise,
you can use Spark to get data converted into ORC format.

df = spark.read.csv('bikes/*.csv', header='true')
df.write.format("orc").save("bikes/orcs/orcy.orc")

If you read about ORC files you will probably run across information indicating that it is faster than
Parquet. While this may be true, be warned, gaining a few seconds here and there is probably
less important than how well the file type is supported by different tech stacks and the availability
of a community to support it.
What I’m trying to say is that unless you have a good reason, choose parquet over ORC.

CSV / Flat-file.
These two file types are probably the most common file formats used in data engineering; even some
big data workloads, unwisely, use CSV and flat-files. These are file types that have been around a
long time. What are they?
CSVs and flat-files get nothing special: no compression, no built-in schema. They are easy and
simple to use, which is why they are so often found in data engineering pipelines. If your
data is small, it probably makes sense to use CSV and flat-files.

• CSV and flat-files are as simple as they come.


• They don’t hold schema or data type.
• Delimiters matter with these files.

CSV and flat-files are just text files with some sort of separator between values, typically a comma
“,”, although many times you will see pipes, like |, or tabs used as the delimiter.
It also isn’t uncommon for values to be double-quoted. Here are some topics to remember when
working with or learning CSV and flat files.

• No compression.
• Data values are separated by comma (,), but also pipe (|), and other values.
• No schema or data type support.
• Many times datapoints will be qualified, may be surrounded by quotes (“x”).
• Very easy to read and write in most languages.

An example of what a CSV file would look like is below.



>> example.csv
ride_id,rideable_type,started_at,ended_at,start_station_name,start_station_id,end_station_name,end_station_id,start_lat,start_lng,end_lat,end_lng,member_casual
A847FADBBC638E45,docked_bike,2020-04-26 17:45:14,2020-04-26 18:12:03,Eckhart Park,86,Lincoln Ave & Diversey Pkwy,152,41.8964,-87.661,41.9322,-87.6586,member
5405B80E996FF60D,docked_bike,2020-04-17 17:08:54,2020-04-17 17:17:03,Drake Ave & Fullerton Ave,503,Kosciuszko Park,499,41.9244,-87.7154,41.9306,-87.7238,member

Opening and working with CSV files couldn’t be easier in Python. Learning how to work with CSV
and flat-file records is key for any data engineer.

>> test.py
import csv

def open_csv_file(file_location: str) -> object:
    with open(file_location) as f:
        csv_reader = csv.reader(f)
        for row in csv_reader:
            print(row)

if __name__ == '__main__':
    open_csv_file(file_location='PortfoliobyBorrowerLocation-Table 1.csv')

Doesn’t get much easier than that, does it?

Pandas for CSV and flat-files.

Most data engineers stay away from Pandas when working with data because it’s not a big data tool
that is scalable. Pandas is well known for having memory issues when working with data. But, that
being said, CSV and flat-files are typically not big and can easily be manipulated with Pandas, and
it should be used when it makes sense to do so.

>> pandas.py
import pandas

def open_csv_file(file_location: str) -> object:
    dataframe = pandas.read_csv(file_location)
    for index, row in dataframe.iterrows():
        print(row['Location'], row['Balance (in billions)'], row['Borrowers (in thousands)'])

if __name__ == '__main__':
    open_csv_file(file_location='PortfoliobyBorrowerLocation-Table 1.csv')

What to remember about CSV and flat-files.

What should someone building pipelines remember about using CSV and flat files? They are
probably the easiest storage option, and they work great for data that can be represented as a table,
with rows and columns.
It’s important to remember that they provide no out-of-the-box compression options. If you have
large datasets you will have to zip or gzip the files yourself. They don’t hold schema information
either.
Also, you will struggle to support complex data types and structures in CSV and flat files. They are
meant to be a basic data file tool and should be kept that way.

JSON
Another popular format found all over data pipelines is JSON. Many times it’s used for configuration
rather than as the main storage type, but you will find s3 buckets full of JSON files in your engineering
travels.
JSON is best for ragged, hierarchical schemas. It acts just like a Python dictionary: it can contain
lists and nested objects, giving depth to its schema. Its popularity rose to meteoric highs with its use
in REST APIs as the data transmission format of choice.

>> sample.json
example_json = {
    "some_key": "some_value",
    "customer_id": 1234,
    "order_amount": 15.05,
    "products": [456, 987, 43]
}

How is JSON most likely to be used? This is an interesting question, and we have to go back to the
basics to understand when we should, or should not, use JSON.

Understanding JSON for data pipelines.

We should not choose JSON if we are storing large amounts of data in a storage bucket. Why?
Because JSON files don’t provide the partitioning, predicate pushdown, or compression of other file
formats.
You do not want to end up with millions of data points spread across millions of JSON files unless
they are later going to be ingested into another system for analytics.
JSON is a great tool for storing configurations for applications, as well as messages and individual
data objects being passed around inside a data application. If your data represents tabular formats,
don’t use JSON; if your data is more document-oriented, it could be a great choice.

Let’s look at a few quick examples of using JSON in Python, and how it might fit into a data pipeline
application.

>> file -> config.json
{
    "number_of_nodes": 10,
    "resources": {
        "cpu": 4,
        "ram": "16GB"
    }
}

>> json.py
import json

def read_configuration(config_uri: str = 'config.json') -> dict:
    """ read the configuration file, data input and output locations """
    with open(config_uri) as cf:
        config = json.load(cf)
    return config

JSON makes the perfect data structure for holding configurations and passing information around
inside code and your programs. It’s used a lot in APIs and the web world. It’s a great, versatile tool,
but as a data engineer, you have to remember its limited usability on big data platforms.
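As a quick illustration of the “passing information around” use case, here is a tiny sketch; the
message contents are made up.

>> message.py
import json

# Hypothetical order message being handed from one pipeline step to another.
order_message = {'customer_id': 1234, 'order_amount': 15.05, 'products': [456, 987, 43]}

payload = json.dumps(order_message)   # dict -> JSON string, ready for an API call or queue
restored = json.loads(payload)        # JSON string -> dict on the receiving side
print(restored['customer_id'])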

Compression.
The compression of files is an important topic when working with data. As the size of data grows,
and thus the cost, it’s important to remember size matters when storing files.
There are many types of file compression, but in data engineering, you will most likely see the
following types…

• gzip
• tar
• snappy
• zip

There are others, but these are the ones you will run across daily. Honestly, when storing files there
should rarely be a case where those files are not compressed in some manner. It’s easy to compress
files, it saves space and money, and there isn’t much of a penalty to uncompress those files on read.
Many tools like Spark will automatically uncompress files on read for you!
When you can save 80% of the storage space per file just by compressing it, it is a no-brainer.
Most compression can be done easily on the command line; if you have a Linux flavor like Ubuntu,
these packages are just an apt-get install away.

bash
# gzip a single file
gzip file1.txt
# gzip multiple files
gzip file1.txt file2.txt

# tar and gzip a directory of files
tar -zcvf example.tar.gz /my/source/data

Also, never forget that you can easily work with these compression types in your pipeline code. For
example, say we are downloading a gzip-compressed file from AWS s3 with Python and need to
decompress and read it.

>> read.py
import gzip
from io import BytesIO

obj = s3.get_object(Bucket='my-s3-bucket', Key='file-1.csv.gz')  # s3 is an already-created boto3 client
rows = gzip.GzipFile(fileobj=BytesIO(obj['Body'].read())).read().splitlines()

Or again, say we have some data in Python and we want to write it out as a gzip-compressed object.

>> gzip.py
import gzip
from io import BytesIO

my_data = "my, row, of, csv, data"
data_io = BytesIO()
with gzip.GzipFile(fileobj=data_io, mode="wb") as f:
    f.write(my_data.encode("utf-8") + b"\n")

The main takeaway for any data engineer on compression should be: use it. By nature we will be
storing and warehousing large amounts of data, and for cost reasons alone, not to mention the ease
of use of compression tools, it makes sense to find a suitable compression format for our files and use it.
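To make that concrete, here is a minimal sketch of compressing a local file in Python; the file name
is hypothetical, and the actual savings will depend on your data.

>> compress.py
import gzip
import shutil

# Compress a local CSV file on disk; text data often shrinks dramatically.
with open('trips.csv', 'rb') as source, gzip.open('trips.csv.gz', 'wb') as target:
    shutil.copyfileobj(source, target)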

Storage location.
Choosing the storage location for your data is a lot easier than it used to be. Actual HDFS and
Hadoop file systems are uncommon and legacy now. Cloud storage provided by AWS, GCP, and
Azure rules the day. These cloud storage locations are typically referred to as “buckets.”

This is great for a few reasons: high availability and access from anywhere are pretty much standard
operating procedure now. No more network mounts and drives to mess with. But, some complexity
is added with cloud storage.
You have to learn and become adept at using the command-line interfaces provided by the cloud
providers to interact with their storage systems. These command-line tools, like the AWS CLI and
gsutil, are how bulk files are moved around.
Whether it’s copying, moving, deleting, or syncing, you will have to learn how these operations
work.

• Learn the command-line CLI tools for cloud providers.


• Cloud storage locations are the future.
• It costs money to read and write from cloud storage.

Also, when coding to the storage system in data pipelines, you must learn the packages and libraries
provided by the cloud companies, things like the Python package boto3 used to interact with s3. If
you are using GCP, gsutil will most likely have a place in your pipelines.
These tools can add some complexity, but what they provide in return is usually worth the hassle. It’s
also worth thinking about cost when it comes to storage. Many of the popular cloud storage
options, like s3, only charge fractions of a cent per GB stored, but what they don’t mention is the
cost to read and write that data.
This is something data engineers should think about. Reading and writing massive amounts of
data is going to increase your bill; many times the devil is in the details.
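As a rough sketch of what coding against cloud storage looks like, here is a minimal boto3 example;
the bucket and key names are made up, and credentials are assumed to be configured the usual
AWS way. Remember, each of these calls is a billable request.

>> s3_example.py
import boto3

s3 = boto3.client('s3')

# local file -> s3
s3.upload_file('trips.csv.gz', 'my-s3-bucket', 'raw/trips.csv.gz')

# s3 -> local file
s3.download_file('my-s3-bucket', 'raw/trips.csv.gz', 'trips_copy.csv.gz')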

Partitions.
Partitioning strategies are the unsung heroes of data engineering and file storage; they aren’t talked
about enough, nor are they taught with any consistency.
What is partitioning? It’s how the data files you are working with are physically stored and organized
on a disk or in the cloud.
How do you know if the data you are working with is partitioned or not? Just answer this question:
are the files randomly dropped into any old folder … or is there structure to the madness?
One of the most common partitioning strategies in data engineering is date partitioning, where the
folder structure is broken down by year, month, and day. Here is an example.

year=2021/
    month=08/
        day=15/
            file_1.txt
            file_2.txt

As you can see, this makes it easy for people and for code to find data from just a single year, or even
a single month or a particular day. This is fundamentally what partitioning data is.

• Partitioning keeps programs from reading entire storage directories.


• Partitioning breaks the data up logically and physically.
• Partitioning increases performance and reduces cost (less read).

In another case, we might be working in an environment where we gather lots of information about
a few repeat customers or clients. We might run into a data partitioning strategy as follows.

client_number=123/
    invoices/
        year=2021/
            month=10/
                day=01/
client_number=456/
    invoices/
        year=2021/
            month=10/
                day=01/

Why do you need data partitions.

What I’m trying to convey here is that data partitioning is a powerful strategy and is going to be
closely related to how your data is either produced or consumed. And when you have terabytes or
petabytes of data, you can’t read every file when a pipeline needs some data; that would take too
long and make the system unusable.

This is the key to data partitioning: it allows for more fine-grained file system seeking and data
locating. The program can hone in on exactly where the needed data lives.
This topic is so important that even popular tools like Apache Spark have it built right into their
APIs.

dataframe.write.partitionBy('year', 'month').parquet('/mnt/my_data/')

The idea of grouping data and saving it in partitions is at the core of big data, and most tools now
and in the future that data engineers will be working with will have some form of this partitioning
idea.
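And the payoff comes at read time. A minimal sketch, assuming the same partitioned output path
as above and an existing Spark session: filtering on the partition columns lets Spark skip every
folder that doesn’t match, instead of scanning the whole dataset.

>> partition_read.py
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet('/mnt/my_data/')
# Filtering on partition columns prunes year/month folders instead of reading everything.
august_2021 = df.where((df.year == 2021) & (df.month == 8))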

How to think about data partitioning.


It doesn’t matter if you are engineering pipelines that work with data on disk, in the cloud, or some
NFS; the same thought process applies when thinking about data organization.
And this is really what lies at the bottom of a data partitioning strategy: trying to understand how
data is produced and consumed, and applying some common-sense organization around that
data to help both humans and software work with it.

• Data partitions are based on data access patterns.


• Common items in the classic WHERE clauses turn into partitions.

Just writing data and files to a single folder and letting that one folder collect thousands upon thousands
of files will give you or someone else a headache down the road. Think about how your ETL
would need to interact with that data later. What kind of predicates or filtering statements could be
applied? If you’re familiar with SQL, this is thinking about the WHERE clause.
Data partitioning can, of course, get very complicated, and there might be certain nuances to consider
when you are working with specific tools, but in the end, I want you to remember the big picture.
Never forget the importance of simply organizing your data and files into common-sense patterns,
patterns that will most likely present themselves based on the use case of your particular data.
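Even without a big data tool, the same idea applies when writing files from plain Python. A small
sketch, using the hypothetical client/invoice layout from earlier in this section:

>> partition_key.py
from datetime import date

def partitioned_key(client_id: int, d: date, file_name: str) -> str:
    # Build a date-partitioned key matching the layout shown earlier in this section.
    return (
        f'client_number={client_id}/invoices/'
        f'year={d.year}/month={d.month:02d}/day={d.day:02d}/{file_name}'
    )

print(partitioned_key(123, date(2021, 10, 1), 'invoices_001.csv'))
# client_number=123/invoices/year=2021/month=10/day=01/invoices_001.csv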

Storage and Files Summary


This chapter covered a wide range of important topics related to files and storage. It’s easy to see
storage as a boring and unimportant topic for data engineering, but the opposite is true.
As a reminder, we covered the following topics.

• Access patterns.
• SQL/NoSQL vs files.

• File types.
• Compression.
• Storage location.
• Partitions.

Data access patterns are at the core of data storage. How the data is used should drive our decisions
around how and where we choose to store our data and files.
The question of SQL/NoSQL databases versus file-based storage will often come up at the
beginning of most data projects. They require different approaches and tooling, leading to very
different data architectures.
File types are always a fun topic to cover. We talked about the following file formats.

• parquet
• avro
• orc
• csv/flat-file
• json

Knowing the difference and situations for all these file types is important and will make a difference
in your data pipelines.
Lastly, we covered compression and partitioning. Compression will save time and money and is too
often ignored. Data partitioning is at the heart of big data storage; if you are ever going to work with
data of size, it’s a topic you have to learn.
Chapter 5 - Compute and Resources
This chapter is all about managing resources. Believe it or not, this is a big part of what data engineers
do on a day-to-day basis and can affect every decision we make when writing code … and don’t
forget the bottom line.
At the start of this book, I mentioned our pipelines need to be scalable. Well, how do you think most
big data gets processed through ETL pipelines? This work happens on clusters and servers, often
referred to as compute in the distributed big data processing world.
That is the topic I want to bring to the forefront: compute, in the form of RAM and CPU. It’s quite
common for data engineers at the start of their careers to struggle with Out Of Memory (OOM)
issues, as well as performance problems that end up being related to resource usage.
CPU (Central Processing Unit) and memory are critical components in data engineering as they are
responsible for processing and storing data. Here are some key points to keep in mind regarding
CPU and memory in data engineering:
CPU:
The CPU is responsible for executing instructions and processing data. In data engineering, CPUs are
used to run data processing applications and algorithms. The speed and number of CPUs required
depends on the complexity and volume of the data being processed.
Memory:
Memory is used to store data temporarily while it is being processed. In data engineering, memory
is used to cache data, intermediate results, and program code. The amount of memory required
depends on the size of the data being processed and the complexity of the algorithms being used.
Memory hierarchy:
Memory is organized into a hierarchy, with the fastest and most expensive memory located closer to
the CPU. In data engineering, it is important to manage the memory hierarchy effectively to ensure
that data can be accessed and processed quickly.
Parallel processing:
Parallel processing is a technique used to improve the performance of data processing by dividing
the workload across multiple CPUs. This requires careful management of memory access and
synchronization to ensure that the data is processed correctly.
Optimization:
Optimizing the use of CPU and memory resources is critical in data engineering to ensure that the
data processing tasks are completed efficiently. This can involve techniques such as load balancing,
caching, and pipelining, which are used to minimize the amount of time that data spends waiting
for CPU or memory resources to become available.

The principles we will talk about apply even if you are working with what would be considered small data
running on a single machine. Managing the resources and compute available to you is the key to a
pipeline that is efficient and fast.

Overview
In the architecture chapter we covered some examples of calculating compute needs, but let’s get a
little more specific about what exactly a data engineer needs to take into consideration when writing
data pipelines.

• RAM
• CPU
• Storage
• Cluster/Node count (aka how many servers do I have to work with.)

Why should we care about such things? Honestly, because the above resource factors affect how
fast our code runs and how much data can be processed, and paying attention to them ensures we are
not wasting compute and leaving it unused on the table … and of course, there is a little thing called cost!

• Resources affect pipeline runtimes.


• Resources affect how much data we can process.
• Resources affect how much money we pay to run our pipelines.

Managing compute and resources.


Without getting lost in the weeds, because every situation is different, how can we generally
approach compute and resources in a semi-structured way?

• What resources are available?
• What does the tech stack require for resources at a minimum?
• How can I reduce resource consumption in my pipelines?
• How do I maximize my resource usage based on what’s available?

Handling CPU and memory in data pipelines involves optimizing the use of resources to ensure that
data processing tasks are completed efficiently. Here are some strategies that can be used:
Optimize algorithms:
One of the most effective ways to improve CPU and memory usage is to optimize the data processing
algorithms used in the pipeline. This can involve using more efficient algorithms, reducing the
amount of data processed, or breaking up large tasks into smaller, more manageable pieces.

Use parallel processing:


Parallel processing is a technique that involves dividing the workload across multiple CPUs or nodes.
This can significantly improve performance and reduce the amount of time that data spends waiting
for CPU or memory resources to become available.
Monitor resource usage:
Monitoring CPU and memory usage can help identify bottlenecks and other issues that can impact
performance. This information can be used to adjust resource allocation, optimize algorithms, or
restructure the pipeline to improve performance.
Use caching:
Caching involves storing frequently accessed data in memory to reduce the amount of time spent
accessing disk-based storage. This can significantly improve performance for data processing tasks
that involve accessing large amounts of data.
Use virtualization and containerization:
Virtualization and containerization can help simplify resource allocation and management by
allowing multiple applications to share the same physical resources. This can help improve resource
utilization and reduce costs.
Overall, effective handling of CPU and memory in data pipelines requires careful monitoring,
optimization of algorithms, use of parallel processing, and use of caching, virtualization, and
containerization technologies.
The first step any data engineer should take is to find out exactly what resources are available
on the platform where your ETL will be running. This will of course vary widely depending on
the technology stack, but in the end, you have to find the answer.
Are you using a Spark cluster? How many nodes are there, and what RAM and CPU are available
on each? Maybe it’s an Apache Airflow setup: what is the size of the workers, and how many
workers are available? It could be that you use on-prem machines or standalone EC2 instances. These questions
have answers, and you should find them for your use case; a quick sketch of checking a single machine follows the note below.

• Before writing pipelines, you should understand the resources that are available.
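If the pipeline runs on a plain Linux box or VM, even a few lines of Python can answer part of that
question. The sysconf names below exist on Linux but not on every operating system; on a managed
cluster you would check the scheduler or cloud console instead.

>> resources.py
import os

print(f'CPU cores available: {os.cpu_count()}')

# Total physical memory, using Linux-specific sysconf values.
if hasattr(os, 'sysconf') and 'SC_PHYS_PAGES' in os.sysconf_names:
    total_ram = os.sysconf('SC_PHYS_PAGES') * os.sysconf('SC_PAGE_SIZE')
    print(f'Total RAM: {total_ram / (1024 ** 3):.1f} GB')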

How do we handle our resources?

Once you have answered those questions you should do everything in your power when writing ETL
and code to use every last scrap of RAM and CPU, leaving nothing on the table. This will probably
take the form of two major topics regardless of the programming language of choice.

• Concurrency (how many things can you do at the same time).


• In-Memory (how much data can you fit into memory to work on).

Some tools like Apache Spark and other distributed big data platforms take care of certain pieces
of that puzzle (resource consumption, I mean), but there are many times when you will
just be writing custom code, and it will be up to you to make those decisions and write your code
accordingly.
How does this work out in real life? Let’s take a look at an example that has happened to me many
times in my career.

Example

In our example, you get assigned a new JIRA ticket. The gist of the ticket is that you will be receiving
several raw text files that are fixed-width delimited (say each column of data is separated by 50 white
spaces). Your task is to convert each fixed-width file into a comma-delimited CSV file for processing by some
downstream system.
Here are a few functions that you might write to accomplish this work.

• Open the text file.


• Read the text file.
• Iterate the rows in the text file.
• Convert each row to a comma-separated row.
• Write all the rows back to a file.

Easy enough, right? All goes well with your testing until you put it in production. You realize now
that you receive about 40 files a day, ranging from a few MB up to 2GB in size … and the folder
you are supposed to process has a few hundred historical files already in it.
You do the math and realize it’s going to take a month to process each file one by one. Also, the
more you think about it, you realize the machine that will be processing these files on AWS is a
c4.4xlarge, meaning it has 16 CPUs and 30GB of RAM.

It’s easy to see that processing files one at a time is not only going to take forever, it’s also not
efficient and leaves most of the compute resources on the machine wasted.

Data engineers should learn to think about processing more than one file, or record, at a time. Why
settle for processing one file at a time in our example when most likely we have the resources
available to work on many files at the same time?

For a data engineer this is something we want to avoid. No matter if you are using Python or Scala,
Spark or not, it’s clear that in the above situation we need to find a way to process more files concurrently,
using every last bit of CPU and RAM to chew through the files.

RAM/Memory
This topic is probably one that bites data engineers the most often, yet there isn’t much training or
material on how to manage memory consumption for everyday use in data engineering.
There are plenty of software engineering-related resources that can teach you about different OOP
methods and how certain data structures might be larger or smaller than others, but typically this
type of material is directed toward general software engineering principles and is not that well suited
to data engineering work.
Data engineers are usually thinking in terms of data pipelines.
If you think about what data engineers do, day-to-day, it’s transforming large amounts of data
via pipelines. So when a data engineer thinks about memory usage it typically comes down to the
following question.

• How much memory is available?


• How much memory needs to be left available for the operating system?
• At its peak, what will my program’s max memory usage be?
• Can I calculate memory consumption based on my data?

This can be a tricky tightrope to walk. OOM (Out Of Memory) errors are common in data
engineering. One of the easiest steps to take is just to calculate the size of the data file, or the piece
of data that you are working on, compare it to the size of the resource, and do the simple division to
see how many pieces of data you can fit into memory while still leaving a little room for overhead;
a quick sketch of that math follows the note below.

• Calculating the RAM usage of a data pipeline isn’t that hard, it just takes a little work.
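Here is a back-of-the-envelope sketch of that division; the helper function and the 25% headroom
figure are made up for illustration, not a rule.

>> memory_math.py
import os

def files_per_batch(file_paths: list, available_ram_bytes: int, headroom: float = 0.25) -> int:
    # Rough estimate of how many files fit in memory at once,
    # leaving some headroom for the OS and interpreter overhead.
    usable = available_ram_bytes * (1 - headroom)
    average_size = sum(os.path.getsize(p) for p in file_paths) / len(file_paths)
    return max(1, int(usable // average_size))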

On the other hand, we all know that memory is faster than disk, and working on data in memory is
much faster and is a great option to use when available. Let’s take Python for example and explore
features for working with in-memory data that every data engineer should be aware of.
Even if Python isn’t your first language, the ideas should generally transfer to other languages and
approaches. What I’m trying to accomplish here is to introduce you to the idea of thinking about
memory usage when writing your code. Most beginners don’t do this, and it never comes up until
something is breaking.

Introduction to StringIO and BytesIO in Python.


StringIO and BytesIO are perfect for making your Python faster. We are going to cover a few topics
specific to Python, but remember, it’s the concepts that are important.
We are going to cover File Objects, BytesIO, and StringIO, all great features in Python for maxing
out memory usage and supercharging data pipelines. These tools, and their concepts, are a great
example for data engineers to follow while building data pipelines.
Probably because of perceived complexities, IO streams are an underused feature that rarely comes
up in most code bases. We all know that memory is faster than disk-io. Yes, it requires us to think
about RAM and how much resource is available on our machine, but this helps ingrain good habits
early on in your career.

• Concepts are key, this is how can we use RAM for faster pipelines with a real-world example.

Why would you want to use from io import StringIO, BytesIO?

What data engineers work on.

Think about it. If you are the typical data engineer, business intelligence engineer, report developer,
or data analyst, what is one of the first topics you learn? Probably opening a file, maybe a CSV file,
a text file, a JSON file, or maybe it’s even a zip file full of other files.
It usually involves file-io. You may have never thought about it, but reading and
writing files to disk will be a bottleneck in programs, especially if you are reading and writing the
same file multiple times. Say you’re downloading a file over HTTP to disk, then unzipping it, then
reading the file in. That’s a lot of disk-io.

BytesIO and StringIO

This is where StringIO and BytesIO come in, they are in-memory data objects. One way to think
about these streams of data is that they act as a File Object. What does that mean? It means you can
treat and interact with these objects like you would any other file, they have the File API on top of
them. You can read them, write to them, etc. A file but not a file, get it? And, it’s all done in memory.

• IO streams are just that, streams of data.


• They have File Object (read, write) type features.
• They help us get used to thinking about RAM/Memory as related to our code.
• They are faster than disk operations.

The best part of IO streams is that they live in memory, and that means fast. Of course, you need the
resources to do this on your machine or server, but that usually isn’t a problem. One minor detail to
remember about a StringIO/BytesIO is that when created, it acts like an already opened file.
Let’s look at some examples of how this could work. Simplistic, but to the point.

from io import StringIO, BytesIO
import csv

in_memory_file = StringIO()
csv_writer = csv.writer(in_memory_file)
csv_writer.writerows([[1, 2, 3], [4, 5, 6]])
in_memory_file.seek(0)
for row in in_memory_file:
    print(row)

So what’s going on here?

• First, we create an in-memory open file object.


• Open our file object with our csv.writer.
• Write our rows as normal.
• Rewind file object to the beginning.
• Print the records of our file object.

A CSV writer takes a file object, and well, we have one of those, don’t we! Next, we call .writerows() on
our csv_writer object to write two separate rows. The next part may seem a little strange to you:
seek(0). Since we are dealing with a stream, a file-like object, and we’ve written two lines, technically
that opened file object is at the “end.” To read back out of that file object and print the lines we wrote,
we need to be at the beginning.
Let’s take this just slightly further to show how StringIO/BytesIO could be useful.

Example of BytesIO

Let’s say we have a boring job and our boss asks us to download information about Livestock and
Meat International Trade data from the government, and insert relevant information into a database.
The typical workflow would be to download the zip file, unpack it, read in the relevant CSV file from
disk, find the data, and off to the races.

• Download zip file.


• Unpack zip file.
• Read unpacked CSV file.
• Process data.

Well, we know better now, don’t we? Sounds like a few spots of file-io: writing the zip to disk,
unzipping the files to disk, then reading the CSV file from disk. But, there is a more excellent way.

import requests
from io import BytesIO
from zipfile import ZipFile

url = 'https://www.ers.usda.gov/webdocs/DataFiles/81475/LivestockMeatTrade.zip'

try:
    response = requests.get(url)
except requests.exceptions.RequestException:
    raise SystemExit('Problem downloading zip file.')

if response.status_code == 200:
    in_memory_zip = BytesIO(response.content)
    with ZipFile(in_memory_zip) as zippy:
        for item in zippy.infolist():
            if 'Exports' in item.filename:
                with zippy.open(item.filename) as export_file:
                    for row in export_file:
                        print(row.decode('utf-8'))

Easy! It’s fast because we are doing everything in memory, and the code is simple and straightforward.
Really, what I’m showing you here is that many of the packages and methods you use in Python can
take a file-like object (in this example, with ZipFile(in_memory_zip) as zippy); it doesn’t matter if
it’s an actual file sitting on your disk or a file-like object sitting in memory.

RAM/Memory Review
I know that might have seemed a little bit of a deep dive for a book on data engineering. What I
want to get you thinking about is the code that drives your pipelines, it could be PySpark or it could
be just plain Python or Scala. If you are working with data, that data is going to be in memory at
some point.
Data in-memory is going to be faster than data on disk, most of us have at least heard this in passing,
“in-memory is so much faster.” But what data engineers need to think about when writing pipeline
code is … how much data am I dealing with?
How much data is going to be in-memory in a worst-case scenario when doing this specific
transformation or calculation? OOM errors are one of the most common problems that data
engineers run into early on in their careers.

• Data pipelines that work in memory are much faster.


• Working with data in memory can cause OOM issues.

This is usually because people get in the habit of assuming something works, and the data grows,
more data comes in, or just code gets pushed to production, and everything breaks.
What I want you to take away as a data engineer is how to take advantage of RAM, because it is
fast, but also be aware of how easily you can cause OOM errors as things scale.
RAM is fast and useful, it’s also limited and easy to run out of. You don’t want to leave a bunch of
memory on the table if you have it as an available resource. You don’t want to overuse it and cause
memory leaks and other errors, use it wisely.

CPU/Cores
Now that we have talked a little bit about memory/RAM, and how we should pay attention and use
it well, let’s talk about CPU. The thought process here is the same as with our memory discussions.
In today’s world, it’s very unlikely that you will be working with resources, servers, or virtual servers,
that only have a single core. In most cases, you will probably have a few cores, so leaving those cores
unused is a waste that could significantly reduce code run times when used properly.
I will say this upfront, no matter what language you use, even Python, concurrency is a tricky topic
even for good Software Engineers. Many tears have been spilled on the internet arguing about this
topic. But, let’s just cover the basics and show a few examples in Python of how you as a data
engineer can easily use all CPUs available to supercharge your pipelines.
Again, this isn’t specifically about Python, this is meant to teach you that many of the machines
your data pipelines will run on have multiple cores and CPUs available, and you should use them
all. It will make pipelines run faster, reduce compute cost (because you’re doing more with what
you have), and generally make your life better.

• Use all the CPU/cores available.


• Increase performance with the increased use of CPUs
• Understand what type of code requires more CPU and what doesn’t.

Example CPU Project


Let’s say we have a group of CSV files. We need to read all the files, look through each row and find
out if that row contains a “member” record or not.
Of course, we could open each file one by one and do this work, but after all our discussions so far
we would realize this isn’t scalable or fast, and wastes whatever resources we have on our server or
box running the code.
This must be balanced by how large the files are and if we start reading more than one file at a time,
we should be aware of the available CPUs and Memory on our machine so as not to overload the
system.
ProcessPoolExecutors in Python are a great way to introduce concurrency, allowing us to work on
more than one file at a time.

import csv
from glob import glob
from datetime import datetime
from concurrent.futures import ProcessPoolExecutor

def main():
    files = gather_files()
    # map() hands one file to each worker process; exiting the with-block
    # waits for all of the submitted work to finish.
    with ProcessPoolExecutor(max_workers=3) as executor:
        executor.map(work_file, files)

def work_file(file_loc: str) -> None:
    rows = read_file(file_loc)
    for row in rows:
        filter_row(row)

def gather_files(loc: str = 'trips/*.csv') -> iter:
    files = glob(loc)
    for file in files:
        yield file

def read_file(file_location: str) -> iter:
    with open(file_location, 'r') as f:
        data = csv.reader(f)
        next(data)
        for row in data:
            yield row

def filter_row(row: object) -> None:
    if row[12] == 'member':
        print('member ride found')


if __name__ == '__main__':
    t1 = datetime.now()
    main()
    t2 = datetime.now()
    x = t2 - t1
    print(f'It took {x} to process files')

The point here isn’t to teach you how to use ProcessPoolExecutors, but to teach you that such
tools exist and are available for you to use where appropriate. The problem breaks down to this,
regardless of tools or language:
You have a list of work that needs to be done. You probably have a method or function that does
the actual work; now you need a tool to spread the list of work across several CPUs so you can do
more work at one time with your code.

• Break down your workflow into its basic parts.


• Start by understanding the “unit of work.”
• Try to understand how to “spread” the work across different CPUs.

In our case, that’s exactly what the ProcessPoolExecutor’s map function provided for us. We passed
it our function work_file and our list of work files, and the rest is done for us! All of a sudden we
have multiple files being worked on at once.
Take some time to study this example if you are not familiar with this type of data processing. It’s
about changing your mindset: breaking down a large task into its components and figuring out how
you can spread the work around to multiple cores.

Storage
Storage is probably one of the easiest resources to manage, and we won’t devote a lot of time to this.
With the advent of cloud storage services like Azure Blob, AWS s3, and Google Cloud Storage the
idea of running out of disk space has floated into the past.
It’s still an important topic to talk about and I’ve seen disk issues pop up many times in pipelines.
This is mostly the case simply because data engineers think they don’t have to pay attention to
storage anymore.
Most servers and resources come with either a set amount of storage attached, or possibly some
elastic storage service that is quite large, or pipelines will just read and write directly to the cloud.
Whatever the case, it’s easy to get caught with out-of-disk-space errors if you are not careful.
Writing intermediate results to disk without cleaning up after yourself, or not knowing upfront
what disk space is available, is a good way to get into trouble.

• Always know how much disk space is available on your resource.


• If you write results to disk, always clean up after yourself.
• Reading and writing directly to cloud storage is best but creates network bottlenecks.

Cluster/Nodes
Last but not least, I want to talk about a difficult topic … cluster size or node count. This can be
tricky because so much depends on the tools and technology stack being used. It could be Spark,
Kubernetes, Airflow … all these systems have multiple workers, each with their own resources.
A general rule of thumb before you start working on a data pipeline is just to understand the system
and resources you are working with … what do you have available? Even if only for the simple fact
that you could greatly speed up your data processing if you knew you had 10 workers available and not
just 1 or 2.
Any pipeline that isn’t just running on a single machine … which is becoming less and less common,
no matter the tech, is going to depend on the number and size of the workers available to process
data.

• Understand the system you are working on, what resources are available.

You would be surprised at how just knowing that you have x number of workers, each with x CPU
and x RAM, will allow you to reason about throughput and possible pipeline problems as the data grows.
Being able to calculate how much data you have, and what each worker could or should handle,
is something a data engineer should become familiar with. It will save you from
wondering why a pipeline runs for more than a day or gets OOM errors.
If you never attempt to calculate data size and what resources are available for processing that data,
then you are taking a shot in the dark; it may work out, or it may not!
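The math really is that simple. A back-of-the-envelope sketch with made-up numbers; plug in the
real figures for your own cluster.

>> cluster_math.py
total_data_gb = 500                      # data to process
workers = 10                             # nodes in the cluster
worker_ram_gb = 16
usable_ram_gb = worker_ram_gb * 0.75     # leave headroom per worker

per_worker_gb = total_data_gb / workers  # 50 GB of data lands on each worker
passes_needed = per_worker_gb / usable_ram_gb
print(per_worker_gb, round(passes_needed, 1))  # roughly how many in-memory passes each worker needs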
A cluster with nodes is a group of connected computers that work together to perform a common task
or set of tasks. Each computer in the cluster is referred to as a node, and all nodes work together to
perform computational tasks that would be difficult or impossible for a single computer to perform
on its own.
In a data engineering context, clusters with nodes are commonly used to process and analyze large
datasets. The cluster distributes data processing tasks across the nodes, allowing the workload to be
shared and completed more quickly than would be possible on a single machine.
Nodes in a cluster are typically connected to each other through a high-speed network, allowing
them to communicate quickly and efficiently. Each node in the cluster typically has its own CPU,
memory, and storage resources, and these resources can be combined to provide a more powerful
and scalable computing platform.
Cluster with nodes architecture is commonly used with distributed computing frameworks like
Apache Hadoop or Apache Spark, which are designed to enable distributed data processing and
analysis across a cluster of nodes. These frameworks allow data engineers to write code that can
be distributed across the cluster, and automatically manage the distribution of workloads across the
nodes.
Chapter 6 - Mastering SQL
Introduction To SQL
I don’t intend to go very deep into SQL or Relational Databases in this chapter. My goal is just to
give insights and high-level thoughts to those who are maybe just dipping their feet into the water.
Sure, I will cover the basics of SQL and what you should learn, if you are unfamiliar with some of
the topics, use that as a guide to show you what you should dig into later.
Feel free to skip this chapter if you have been a DBA or have been writing SQL for the last 15 years,
although sometimes a reminder about the basics is helpful!
I’ve met my fair share of snooty people who poo-poo SQL and databases as second class hand-me-
downs. I still remember talking to an academic computer science graduate who was explaining to
me how he refused to teach database classes, he was just too good for that. Whatever. We, data
engineers, know better!

• SQL is still a major player in data engineering.


• Most of the popular data engineering frameworks, like Spark, support SQL.

Accepting how 90% of companies actually operate as data-driven businesses just isn’t important
to some people. There is probably nothing more important in the tool belt of a data engineer than
being above average at SQL and databases.
Tuning queries, writing queries, indexing, designing data warehouses. I’m sure some Hadoop data
engineers skipped this step of the RDBMS world, but that is not the normal path of a data engineer.
Here are some of the reasons why SQL is important in data engineering:
Data Manipulation:
SQL provides a set of commands that can be used to manipulate data in a relational database,
including selecting, updating, inserting, and deleting data. This makes it easy to modify large
datasets, and to extract the data needed for analysis.
Data Analysis:
SQL provides a powerful set of tools for analyzing data in a relational database. These tools
include aggregation functions, grouping, sorting, filtering, and joining tables. With these tools,
data engineers can perform complex analysis on large datasets, and extract insights that can drive
business decisions.
Data Integration:
SQL can be used to integrate data from different sources into a single database. This makes it easier
to combine data from multiple sources, and to perform analysis on the combined dataset.

Data Security:
SQL provides a set of security features that can be used to protect sensitive data in a relational
database. This includes features like user authentication, role-based access control, and encryption.
Scalability:
SQL can be used to manage and manipulate large datasets, making it a powerful tool for scaling
data processing pipelines. With SQL, data engineers can easily manage and analyze data at scale,
without having to worry about the limitations of traditional data processing tools.
Overall, SQL is an essential tool for data engineers, providing a powerful set of tools for data
manipulation, analysis, integration, security, and scalability.

Does the type of database matter?


Someone might ask this question. Shouldn’t we worry about Postgres, MySQL, SqlServer, doesn’t
the database type matter? Not really. Being good at the fundamentals of SQL has little to do with a
brand name slapped on top.
Remember, we aren’t talking about being a DBA here, we are talking about being a data engineer.
We aren’t setting up automatic failover clusters. We are worried about understanding how to
model tables, design indexes, and tune queries.

• Fundamental SQL skills transcend database brand names.

Think of it as working on your car, there are differences and you might have to check the manual a
few times. But once you figure out how to change the oil, the brakes, air-filter, the light bulbs, you
can work on almost any car.
All the angry DBA’s right now can just send me emails that I will promptly delete. No, the type of
RDBMS doesn’t matter to learn SQL/database fundamentals.
Choosing the right type of database is an important decision that can have a significant impact on
the performance and scalability of your data engineering pipelines. Here are some factors to consider
when choosing the right type of database:
Data Structure:
Consider the structure of your data. If your data is highly structured, with a consistent schema, a
relational database may be the best choice. If your data is unstructured, with varying schema or data
types, a NoSQL database may be a better fit.
Scalability:
Consider the scalability requirements of your data engineering pipelines. If you need to scale your
pipelines to handle large volumes of data, a distributed database or a cloud-based database may be
a better choice.
Performance:
Consider the performance requirements of your pipelines. If you need to process data in real-time, a

streaming database may be a better choice. If you need to perform complex queries on large datasets,
a relational database may be a better fit.
Cost:
Consider the cost of different types of databases, including licensing fees, hardware costs, and
maintenance costs. Cloud-based databases may be a more cost-effective option for small to medium-
sized pipelines, while on-premises databases may be more cost-effective for larger pipelines.
Security:
Consider the security requirements of your pipelines. Some types of databases offer more robust
security features than others, including user authentication, encryption, and access control.
Overall, choosing the right type of database requires careful consideration of the structure, scalabil-
ity, performance, cost, and security requirements of your data engineering pipelines. By carefully
evaluating these factors, you can choose a database that meets your needs and helps you build
efficient and effective pipelines.

The fundamentals of SQL/Databases.


We are going to cover a few basic topics, give you a ten thousand foot overview. There are plenty of
books written on the inner workings of SQL and they have devoted chapters to understanding table
joins. I’m not going to do that.
I want to teach you the concepts that can be applied very easily to any relational database.

• OLTP vs. OLAP.


• Table design/layout.
• Understanding indexing basics.
• How to write fast/tune queries.
• Where to look for common problems.

OLTP vs. OLAP


I won’t spend much time here, but understanding OLTP vs OLAP is pretty important. There are
plenty of long-winded explanations about this, but it isn’t that hard to grasp this concept. If you’ve
designed a new set of tables in SQL for a project, it should fall into one of two categories.
OLTP (Online Transaction Processing) and OLAP (Online Analytical Processing) are two different
approaches to data management and processing.
OLTP is a system designed for transaction processing, where the focus is on executing transactions
such as data insertions, updates, and deletions. OLTP systems are typically used in operational
environments, such as e-commerce websites, banking systems, or airline reservation systems. The

main goal of OLTP systems is to ensure data consistency and accuracy, with a focus on real-time
processing and availability.
On the other hand, OLAP is a system designed for analytical processing, where the focus is on
querying and analyzing large volumes of data to gain insights and make informed decisions. OLAP
systems are typically used in decision-making environments, such as business intelligence and data
warehousing. The main goal of OLAP systems is to provide a multidimensional view of data, with
a focus on historical data analysis, complex queries, and advanced data visualization.
To summarize, OLTP is focused on transaction processing for real-time operations, while OLAP
is focused on analytical processing for decision-making purposes. Both OLTP and OLAP have
different architectures, data models, and processing requirements, and they are often used together
in enterprise systems to support various business needs.
OLTP = highly transactional in nature.
OLAP = analytical in nature, think data warehousing, aggregation.
Transactions vs analytics are the difference and should drive differences in how you think about
modeling and handling the design of SQL tables. Highly transactional systems are very concerned
about allowing for large volumes of inserts. Being able to support analytical queries in an OLTP
system just isn’t in the cards.
Below is an example OLAP database, star schema, some aggregate tables with many connected
dimensions.

On the other hand, OLTP tables are wide and few, made to ingest data quickly.

Many times OLTP tables will be wide with lots of columns, designed to handle high volumes of
inserts. Think maybe an online order coming into a system, or financial and banking transactions hitting
a table. OLAP, on the other hand, is designed for aggregating and summarizing data; think analytics
and data warehousing. Hopefully, that makes sense to you.
For most data engineering work, we find ourselves working in the OLAP world and design. We think
about structuring our data in a way that can answer questions with the aggregation of records.
For most data engineering work, we find ourselves working in the OLAP world and design. We think
about structuring our data in a way that can answer questions with the aggregation of records.

Table design/layout.
In databases, the design and layout of tables play a crucial role in organizing and storing data
efficiently. A well-designed table can improve query performance, reduce data redundancy, and
facilitate data analysis.
Here are some essential considerations for designing tables in databases:
Define Primary Key:
A primary key is a unique identifier for each row in a table. It helps to establish relationships between
tables and enforce data integrity. It is essential to choose a primary key that is simple, unique, and
does not change over time.
Normalize Data:
Normalization is the process of organizing data in a database to reduce redundancy and improve data
integrity. It involves breaking down tables into smaller, more manageable tables and establishing
relationships between them. This helps to avoid data duplication, improves data consistency, and
simplifies queries.
Use Appropriate Data Types:
Choosing the appropriate data types for table columns can help reduce storage requirements and
improve performance. For example, using integer data type for numeric values and varchar data
type for text strings can help to minimize storage space and improve query performance.
Indexing:
Indexes can be used to improve query performance by creating a sorted copy of a column. This
allows the database to quickly find the rows that match a particular value or range of values. It is
essential to choose the appropriate columns to index and to avoid over-indexing, which can slow
down updates and inserts.
Partitioning:
Partitioning is the process of dividing large tables into smaller, more manageable sections. This can
help to improve performance, manageability, and scalability. It is particularly useful for tables with
high write loads or that are too large to fit in memory.
Denormalization:
While normalization is useful for reducing redundancy, it can sometimes lead to complex queries
and slower performance. Denormalization involves introducing redundancy into a table to improve
performance. This should be used sparingly and only when it provides a significant performance
benefit.

Designing tables in databases requires careful consideration of various factors, including data types,
normalization, indexing, partitioning, and denormalization. A well-designed table can improve
performance, reduce redundancy, and facilitate data analysis.
I’m going to mostly talk about table design and layout that data engineers will run into. If you start
reading a lot about the database and table design you will probably run into people talking about
normal forms, “you should design this in the third normal form.”
“Normalizing” a table is really about data de-duplication, and the extent that you take that concept
into your table designs. Without worrying about exactly which normal form level you should be
designing in your SQL tables, I suggest a more pragmatic and common-sense approach.

• Normalizing tables aims to de-duplicate, ensuring integrity, and logical data management.
• Normalization can be taken too far.
• Be logical in breaking up your data into tables.

If you don’t normalize your SQL tables and end up with just two large tables that hold all the information
needed, there is a good chance you didn’t normalize your data model enough. If you end up with 15
tables for a relatively simple data set, you took normalization too far.
I suggest looking for the middle ground. The easiest approach is just to group the datasets into their
logical units. Think about a warehouse that holds information about customers and their orders.
What is a logical way to group and design the SQL tables in a relational database?

• Customer information.
• Order information.
• Product information.

It isn’t any harder than that. Complexity is always the enemy, especially if the situation doesn’t
warrant that complexity.
I do suggest you read up on this concept of database and table normalization if you plan on working
in and around classical relational databases and data warehouse environments. I believe table designs
are closely related to the OLTP and OLAP thoughts above. It isn’t that hard to avoid bad table design,
most of it is common sense. Here are some rules I’ve used in the past.

• Don’t let your tables get too wide.


• Table designs usually follow the logical business units closely.

• You have to know the queries first (try to understand data access patterns, even at a high level, before trying to design table layouts).
• Break the data up into logical units (customer data, orders, products).
• Don’t put everything in one table (this use case is rare).
• When you’re designing a table, think about how it relates to other tables (e.g., how will it join to other data).
• Simple is usually better than fancy (and most likely means faster queries in the end).
• Understand data types (if you haven’t thought through the data type of each value, then you aren’t ready to design the table yet).

Primary Keys in Table Design.


I want to pause and talk about primary keys. This is glossed over by a lot of people but is essential
to table design. One of the keys to understanding the correct design for a table is if you can easily
list what columns make a row of data unique or different from all the other rows.

• First, define what makes each row of data unique.


• Uniqueness can be a single column or multiple columns.

I’m going to repeat that because it is so key to table designs in relational databases. What piece of
data, or pieces of data combined, make a record unique in your data set?
This is the concept of a primary key.
The primary key is the first thing you should define when designing a table; if you can’t, then you
don’t understand the data well enough yet, or your table has a poor design.

Table Design in Real Life.


Let’s take a simple example. We want a set of tables in SQL that will help us hold information about
hobbits, their friends, and the quests they go on.

• Hobbits
• Quests
• Friends
• Adventures

Example
#example.sql
CREATE TABLE middle_earth.Hobbits
(hobbit_id INT NOT NULL,
 first_name VARCHAR(150),
 last_name VARCHAR(150),
 age INT DEFAULT 1,
 favorite_food VARCHAR(250)
);

CREATE TABLE middle_earth.Quests
(quest_id INT NOT NULL,
 quest_name VARCHAR(150),
 location VARCHAR(200),
 hobbit_id INT,
 treasure_value BIGINT
);

CREATE TABLE middle_earth.Hobbit_Companions
(companion_id INT NOT NULL,
 first_name VARCHAR(150),
 last_name VARCHAR(150),
 power_level INT DEFAULT 10
);

CREATE TABLE middle_earth.Adventures
(hobbit_id INT NOT NULL,
 companion_id INT NOT NULL,
 quest_id INT NOT NULL
);

Hobbits, so I made a hobbit table. Hobbits always go on quests, so that is an obvious one. And of
course, a hobbit will never leave his hobbit-hole without a companion, bingo.
Finally, they will set off on an adventure together, the last table. Just apply that simple logic to
database table design and layout. Think about them as pieces of a puzzle and consider how they
relate to each other.

• Break the idea down logically.


• Think about the relationship between the logical units.

So which hobbits have been going on adventures with whom? Can we answer this question for the
tables designed above?

>> example.sql
SELECT h.first_name, h.last_name, c.first_name, c.last_name, q.quest_name
FROM middle_earth.Hobbits h
INNER JOIN middle_earth.Adventures a ON h.hobbit_id = a.hobbit_id
INNER JOIN middle_earth.Hobbit_Companions c ON c.companion_id = a.companion_id
INNER JOIN middle_earth.Quests q ON q.quest_id = a.quest_id AND q.hobbit_id = a.hobbit_id;

You get the idea. Table design and layout are half art and half science.
Here are good rules of thumb for Data Engineers when it comes to database table design.

• Keep it simple.
• Break tables up into logical units.
• Always think about primary keys (what makes something unique).

Understanding Indexing Basics.


Yes, this is what can make or break your database design, OLAP or OLTP. The best tables in the world
can quickly become useless with no or incorrect indexing. And honestly, the basics don’t care
what kind of RDBMS you are running; SqlServer, Postgres, MySQL, it doesn’t matter.

• Every table needs a primary key (uniqueness).


• Every table probably needs secondary indexes to support queries and joins.
• Don’t go overboard, indexing too much can have a negative impact.

Every table should have indexes (I hope you are not surprised at this novel idea). Table join keys
should be the primary and most important index (how two tables link together and relate to each
other).

• Columns that show up in GROUP BY and WHERE clauses should be in indexes.

If you have to put 10 indexes on your table go back and read the section on table design and layout
because you failed. Think about what makes each table unique. This is a critical topic in database
design.
Indexing is a database feature that allows for faster data retrieval by creating a copy of a table’s
column(s) in a separate data structure. This data structure is then used to look up data based on the
values in the indexed column(s), instead of scanning the entire table.
Here are some basics of indexing:
Index Types:
There are different types of indexes that can be used based on the type of data and queries. Common
types include B-tree, hash, bitmap, and full-text indexes.
Index Columns:
An index is created on one or more columns of a table. It is essential to choose the columns to index
carefully, as indexing too many or the wrong columns can slow down the database’s performance.
Indexing Process:
Indexing involves creating a separate data structure that stores a sorted copy of the indexed
column(s) along with a reference to the corresponding row in the original table. This structure is
then used to look up data more efficiently.
Index Maintenance:
When data is inserted, updated, or deleted in a table, the index needs to be updated accordingly. This
can slow down write operations and increase the database’s storage requirements.
Indexing Strategies:
Indexing strategies involve choosing the appropriate columns to index based on the most frequently
used queries. Indexing can also be combined with other database features, such as partitioning and
clustering, to further improve performance.
Query Optimization:
Indexing can significantly improve query performance, but it is not a silver bullet. Properly
optimized queries, including the use of indexes, can improve query execution time, and reduce
resource consumption.
Indexing is an important feature in databases that can improve query performance and reduce the
time required for data retrieval. Properly designed indexes can make a significant difference in
database performance, but it is essential to choose the appropriate columns to index and maintain
the index regularly.

Thoughts on table indexing.


Honestly, taking the very simple steps outlined above will put you way ahead of most people. Always
define your primary keys, the uniqueness for each table. Understand the queries that will be used
on the tables, especially the WHERE and GROUP BY columns, and create indexes to support those.
Think about the simple hobbits example we reviewed. If you go back and look at the hobbit tables
we did in the table design/layout section above, it should be very obvious where to place indexes.
hobbit_id, quest_id, and companion_id are pretty obvious choices, right? These columns are how
some of the tables are related together and joined.
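To make that concrete, here is a hedged sketch of what those keys and indexes might look like on the hobbit tables (Postgres-style DDL; the index names are made up, and the exact syntax varies by RDBMS).

ALTER TABLE middle_earth.Hobbits ADD PRIMARY KEY (hobbit_id);
ALTER TABLE middle_earth.Quests ADD PRIMARY KEY (quest_id);
ALTER TABLE middle_earth.Hobbit_Companions ADD PRIMARY KEY (companion_id);

-- Secondary indexes on the join columns of the Adventures table.
CREATE INDEX ix_adventures_hobbit ON middle_earth.Adventures (hobbit_id);
CREATE INDEX ix_adventures_companion ON middle_earth.Adventures (companion_id);
CREATE INDEX ix_adventures_quest ON middle_earth.Adventures (quest_id);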
I think learning the very basic and obvious approach to database indexing will get you 80% of the
way there. Of course, the topic is complicated and the examples and use cases can take years of
learning to fine-tune. Lots to learn but stick with the basics and they will serve you well.

How to write fast/tune queries.


If you write SQL you will run into slow queries. There can be two causes of this!
• Poorly written queries.


• Underlying database problems (administrative).

Firstly, let’s talk about poorly written queries, again, regardless of the database system. A key concept
is that databases were designed to have you work on data in sets, groups of data, not row by row.

Think about your SELECT.


Never do SELECT * , the more data you are asking for the more time your query will take. Ask for
what you need, nothing else.
Below is what not to do.

1 >> example.sql
2 SELECT table_A.*, table_B.*
3 FROM table_A
4 INNER JOIN table_B ON table_A.key = table_B.key

What possible reason is there to SELECT every single column from one or more tables? This only
happens in very rare cases, but this type of query is very common. It wastes resources and can double
or triple the size of data sets that don't need to be that big.
The database must process your query as submitted; if you want it to run fast, then ask only for
what you need.
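For contrast, here is roughly what the better version looks like, naming only the handful of columns actually needed (the table and column names are just placeholders).

SELECT table_A.column_1, table_A.column_2, table_B.column_1
FROM table_A
INNER JOIN table_B ON table_A.key = table_B.key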

No sub-queries in SELECT statements.


If you’re going to do a subquery, don’t put it inside the SELECT statement, requiring it to be run for
every row.

1 >> example.sql
2 SELECT table_A.*, (SELECT b.column FROM table_B as b WHERE b.column = table_A.column)
3 FROM table_A

Make that WHERE clause as big as possible.


Always have a WHERE clause. Since when do you need to run a query without a WHERE clause?
Add more things to the WHERE clause. This is filtering your dataset to only what is needed, every
time!
Talk to the business or the user of the data you are getting: is there a date cutoff? Do they really need all of
history, or every product? The more filters you add, the faster the query will be.
1 >> example.sql
2 SELECT table_A.column_1, table_A.column_2, table_A.column_3
3 FROM table_A
4 WHERE table_A.column_1 > 100 AND table_A.column_2 > now() AND table_A.column_3 IN(1,\
5 2,3)

Views bad.
Stop using VIEWS, they suck and always will. These nasty little buggers have been a blight on
databases for many years, yet you still see them used from time to time, and they always drag down
the performance of queries.
Yes, they can be useful and used well, but it’s best just to stay away.

Save the details for last.


If you have complex data that requires peripheral information, don't pull that data in until
the end. Aggregate first, pull in the details later.
For example, if you have a large complex query, and you need someone’s first and last name … wait
till the end to pull it in.
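As a hedged sketch of that idea, using the hobbit tables from earlier: aggregate the big table first in a CTE, then join the name columns onto the small summarized result at the very end.

WITH treasure_totals AS (
    SELECT hobbit_id, SUM(treasure_value) AS total_treasure
    FROM middle_earth.Quests
    GROUP BY hobbit_id
)
SELECT h.first_name, h.last_name, t.total_treasure
FROM treasure_totals t
INNER JOIN middle_earth.Hobbits h ON h.hobbit_id = t.hobbit_id;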

User-defined functions work as well as views.


Stop using UDF’s, they suck as well. Well, I know sometimes they must be used, but many times
there is functionality within the stand functions of the database to get the answer you need. Without
running a UDF. UDF’s have to be run for each row, so they get slow real fast.

Break down queries into logical units.


Do you have 7+ table joins in your first query before you’ve even started to subquery? Think again.
Start with what you need, use CTE’s or temporary tables, build up to what you need.

1 >> example.sql
2 SELECT a.column,
3 FROM table_A as a
4 INNER JOIN table_B as b on a.key = b.key
5 LEFT OUTER JOIN table_C as c on c.key = a.key
6 RIGHT OUTER JOIN table_d as d on d.key = c.key
7 ...

It's best not to let things get so complicated; besides stroking your ego, you are most likely just
introducing bugs and confusion.
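One hedged way to break a monster query like that apart (the table and column names below are placeholders): stage an intermediate result in a temporary table or CTE, sanity-check it, then join the rest onto it.

CREATE TEMPORARY TABLE stage_a_b AS
SELECT a.key, a.column_1, b.column_2
FROM table_A AS a
INNER JOIN table_B AS b ON a.key = b.key;

SELECT s.column_1, s.column_2, c.column_3
FROM stage_a_b AS s
LEFT OUTER JOIN table_C AS c ON c.key = s.key;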
Data types, especially of join columns.


Understanding the data types in general, and the data types of the table join columns is very
important. Joining on keys that are integers is going to be much quicker than a string hash that
is 24 characters long.

Summary

I could go on, but you get the point again, I'm sure. Just like good table design, good query design
and tuning starts with the simple approach. Think about the very basic things you need to accomplish
first, do that, and then add the detail later. Try to use SQL that applies to the entire dataset, like
GROUP BY and WHERE, instead of using a UDF or a SELECT inside the SELECT statement.
Always try to pare the data down as much as possible before getting fancy.

Where to look for common problems.


I usually consider this the last resort. If everything else above is done properly you shouldn’t have
to dig farther down very often. But it will happen. And the problems are always side effects of what
was not done above.

• Check indexes first. Are you joining on some column with no index?
• Are the index stats being updated?
• Look for bad SQL. Do the joins and subqueries look reasonable?
• Inspect the complex parts of the query. If something is complex it’s probably like that for a
reason. Find out why.
• Learn how to read query execution plans and stats.

Some SQL query problems can only be solved by getting way more into the weeds than you want
to.
SQL has always been near and dear to my heart; being on data warehousing teams is how I cut
my teeth working in IT. After years of working around data warehousing and business intelligence,
watching software engineers struggle with database design and queries, even while being the best
programmers, it became obvious to me that the simple things are what matter.
There are always cases where even the DBA’s have problems, that’s just life. But, most of the time
sticking to the fundamentals will get you most of the way there.
SQL Fundamentals
What I want to do in this section is just list the fundamentals of writing SQL. I’m not planning on
teaching you how to learn each of the topics in depth. I just want to bring them up, give you a list
of SQL features that you should be comfortable with. If you are unfamiliar with any of these, use it
as an indication you need further study in this area.
Here are the SQL features I want to glance over at a high level.

• SELECT statements
• JOINS
• GROUP BY and WINDOW functions
• AGGREGATE functions
• WHERE clause
• Subqueries
• CTEs

SELECT statements
SELECT statements in SQL are probably one of the first topics you learn. What should a good data
engineer know about SELECT statements?

• Only ever SELECT the columns you need.


• Avoid putting complex logic in your SELECT statements unless necessary.
• It’s common to put CASE statements in the SELECT.

Not paying attention to SELECT statements is the culprit of many poor-performing queries. When
you SELECT * from three different tables in your query, it should be no surprise things are slow.
Also, putting complex logic in SELECT statements usually means that logic must run on each row
individually; in some cases, like CASE WHEN, this is needed, but in many cases it is not, and it is
expensive.

JOINS
JOIN statements are another topic most seasoned engineers are very familiar with, yet it is
surprising how many times the simple JOIN types can cause problems and bugs.
You should know the different JOIN types like the back of your hand, and when you should and
shouldn't use them.

• INNER (only matching rows join)


• LEFT or RIGHT OUTER (keep all the records from one side, plus the matching records from the other)
• FULL (give me everything on both sides)
• ANTI JOINS (give me the records from one side that have no match on the other)

What it boils down to is realizing when you need to use an INNER JOIN vs any other join. INNER
joins will most likely always drop records (because it's likely that not everything matches between the two
tables). This is a simple concept, yet it causes lots of problems in applications from Postgres to Spark
SQL.
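The anti join is the one engineers tend to be fuzziest on, so here is a hedged sketch using the hobbit tables from earlier; in most dialects it is written as a LEFT JOIN plus a NULL check, or with NOT EXISTS.

-- Hobbits that have never been on an adventure.
SELECT h.hobbit_id, h.first_name, h.last_name
FROM middle_earth.Hobbits h
LEFT OUTER JOIN middle_earth.Adventures a ON a.hobbit_id = h.hobbit_id
WHERE a.hobbit_id IS NULL;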

GROUP BY and WINDOW functions and AGGREGATE functions


I'm talking quickly about GROUP BY, WINDOW, and AGGREGATE functions together because they are all
used in conjunction with each other most of the time. You would only use a GROUP BY or a
WINDOW function when you are aggregating or "working on" some subset of data.
Understanding generally what aggregation functions are available to you in most systems is key to
SQL success; most data queries written by engineers will require some GROUP BY or WINDOW
function, usually in combination with some aggregate function.
A GROUP BY with aggregates "collapses" the data set into a smaller summary, while a WINDOW
function calculates similar aggregates over a partition of the data but keeps every row.
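Here is a hedged sketch of that difference, using the Quests table from earlier.

-- Collapses the data: one summary row per hobbit.
SELECT hobbit_id, SUM(treasure_value) AS total_treasure
FROM middle_earth.Quests
GROUP BY hobbit_id;

-- Keeps every row, adds the per-hobbit total as an extra column.
SELECT hobbit_id, quest_name, treasure_value,
       SUM(treasure_value) OVER (PARTITION BY hobbit_id) AS hobbit_total_treasure
FROM middle_earth.Quests;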

WHERE clause
WHERE clauses are arguably one of the most important features of SQL, in a relational database
or big data system like Spark. Why? Because any filtering or subsetting that can be done with a
WHERE clause, as early as possible in the process, is going to ensure the best performance and run
time of a query.
Figuring out WHERE clauses also typically gives great insight into how the data should be indexed
or partitioned.
Many times it is advisable to spend time understanding common WHERE clauses and filters while
a data model is being built before anything goes to production. This is how important the WHERE
clause is.

Subqueries vs CTEs
Subqueries and CTEs are very much the same thing, although each tool that provides them may
have its own nuances and performance issues.
CTEs

At a basic level, there isn't anything that special about a CTE, other than the possibility of recursion,
which only applies to about 0.01% of the people reading this.

• Defined using the WITH x AS ( … ) syntax.


• Are a temporary result set.
• Can be thought of as a temporary table.
• Can be referenced multiple times in a query.
• Encapsulate logic and code.

Subquery

The subquery, also known as the infamous nested query, usually gets a bad rap. Why? It gets overused, just like anything
else. Subqueries are a powerful way to slowly and methodically build up complex queries and logic.
There are two types of sub-queries.

• Correlated.
• Un-correlated.

What does that mean? It means that a subquery might or might not have some specific relation to
the query “above” it, or “over” it.
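A hedged sketch of the two flavors, using made-up orders and customers tables:

-- Un-correlated: the inner query runs on its own, with no reference to the outer query.
SELECT o.order_id
FROM orders o
WHERE o.customer_id IN (SELECT c.customer_id FROM customers c WHERE c.country = 'US');

-- Correlated: the inner query references the outer row (o.customer_id),
-- so it is logically evaluated for each row of the outer query.
SELECT o.order_id
FROM orders o
WHERE o.amount > (SELECT AVG(o2.amount)
                  FROM orders o2
                  WHERE o2.customer_id = o.customer_id);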

CTE vs SubQuery Summary


The hard part about SQL is that most people don't consider it a "programming language"; we can
argue about that one later. The unfortunate downside of SQL not being considered a programming
language is that it misses out on some of the best software engineering principles. It's probably why
language is that it misses out on some of the best software engineering principles. It’s probably why
the old Database Developers got such a bad rap back in the day. Piles of queries that look ugly with
no testing.
Newer technology like SparkSQL and dbt are changing that. But, using CTEs is probably one of the
best ways to apply the idea of encapsulating logic to the world of SQL.
Just like long, jumbled, drawn-out code written in Python with functions that are 100 lines long
… SQL queries that might look impressive but take scrolling and scrolling and scrolling with sub-
queries here and there and everywhere become ripe for bugs and headaches.
CTEs can help with this mess. You can name them what they are, logically group them, and
encapsulate specific code.
Python + SQL
I want to take a short little side excursion and talk about how to work with Python and Postgres, a
popular relational database. I’m not going to get too in-depth, but I want to cover the basics, so you
can start to get some idea about what is involved in connecting code to a database.

The Basics of a Python Connection to Postgresql

We are going to use a popular Python package called psycopg2 to do some simple examples.
The first step you will need to complete is getting the URI/URL for your Postgres connection to
the database and server. This gives a lot of people trouble in the beginning, the first few times you
have to put one together. Here are the options for parameters that are common to connect to most
databases.

• Host
• Database
• Port
• User
• Password

Probably some of the confusion comes in because depending on the setup, only some of the
parameters are needed, and of course, there are different ways to indicate the URI and or parameters.
You will probably see people connecting in two different ways.

• URI
• Parameters

Let’s give it a try.

>> example.py
import psycopg2

host = 'localhost'
port = 5432  # the default Postgres port
database = 'my_database'
user = 'postgres'
password = 'postgres'  # note: 'pass' is a reserved word in Python, so use 'password'

# connecting with params
try:
    conn = psycopg2.connect(host=host, database=database, user=user, password=password)
except psycopg2.Error as e:
    print(f'Had problem connecting with error {e}.')

# or this way with a URI (Postgres 9.2 and above)
uri = f'postgresql://{user}:{password}@{host}:{port}/{database}'
try:
    conn = psycopg2.connect(uri)
except psycopg2.Error as e:
    print(f'Had problem connecting with error {e}.')

Here is one note to make, never make a database call or connection without catching and raising
errors. You always need to know what went wrong, which will happen often when sending queries
to the database and pulling records back.
Also, never take a database connection for granted. Clean up after yourself, if you have a DBA,
they will thank you. Databases many times have connection limits, you don’t want to leave tons of
orphaned connections to a database that the server has to clean up. If you make a connection object,
in our case conn, call conn.close() when done.

Cursors
There is something else that can be confusing about Python psycopg2 connections: you will use the connection
to generate a database cursor, which in turn will execute SQL statements, but committing or rolling
back transactions is done with the connection object. Otherwise, look into "auto-commit" if you don't
like being in control of your own destiny.

Python Postgres Database Cursors

Let’s make a cursor and create a database table. Remember a connection gives you that pipe to your
database, a cursor is needed to execute a query against the database. The most important method of
a database cursor is the execute() method, or executemany().

import psycopg2

host = 'localhost'
database = 'my_database'
user = 'postgres'
password = 'postgres'

try:
    conn = psycopg2.connect(host=host, database=database, user=user, password=password)
except psycopg2.Error as e:
    print(f'Had problem connecting with error {e}.')

database_cursor = conn.cursor()
query = 'CREATE TABLE books (id INT NOT NULL, author VARCHAR(150), title VARCHAR(50));'
try:
    database_cursor.execute(query)
    conn.commit()
except psycopg2.Error as e:
    print(f'Problem executing query {query} with error {e}.')
    conn.rollback()

Notice how I wrapped the cursor execute() call in a try except block to catch errors. Also notice
how I used the database connection to commit() the transaction if no errors are caught, as well as
using the connection to rollback() the transaction if there is an error.

Executing a Python Postgres psycopg2 Query with Parameters.

This is probably the next area that gives people the most trouble. Once you’ve learned how to connect
to your database and execute a simple query, most people are probably going to want to parameterize
those queries at some point, passing in some values from the Python script. I’m not going to get into
security, but I’m sure you see the implications here, so do your own research.
It’s really not that hard, all you have to do in your query string is put the characters %s where you
would like the parameter to be pulled in. Then for psycopg2 you need to pass a Python tuple with
those parameter(s). It would look something like this.

1 query = 'INSERT INTO books (id, author, title) VALUES (%s, %s, %s);'
2 data = (1, "St. Augustine", "Confessions")
3 try:
4 database_cursor.execute(query, data)
5 except psycopg2.Error as e:
6 print(f'Had problem with query {query} with error {e}.')

Retrieving Database Results with Python Postgres psycopg2.

Once you have a few records in your database, the next thing to do would be to retrieve them. The
concept is very similar to the execution of inserting data. There are three methods you can call on a
cursor that is executing a sql SELECT statement to retrieve data.

• fetchone()
• fetchmany(number_of_records_to_fetch)
• fetchall()

You will most likely never need to call fetchone() directly in production; a cursor object is iterable
in Python, fetching rows as you loop over it, so you can just do the following.
1 query = 'SELECT * FROM books;'


2 try:
3 database_cursor.execute(query)
4 except psycopg2.Error as e:
5 print(f'Had problem with query {query} with error {e}.')
6
7 for record in database_cursor:
8 print(record)

Also, don’t forget the important step of cleaning up after yourself in the scripts!

1 database_cursor.close()
2 conn.close()

Python + Database Summary.

What I really wanted to show you in the above section about Python and databases was the concepts,
not so much the code. Most connections to databases, in pretty much every language, require the
same sort of parameters we talked about in the beginning.
Also, opening connections and cursors to the database, pushing and pulling records are all good
concepts to start to grasp as a data engineer. Remembering to clean up after yourself is an important
step as well.

SQL Summary
You will notice in this chapter I took more time with the concepts than with teaching you how to
JOIN or filter a dataset. The reason for this is that there is a myriad of other resources to teach you
how to group and aggregate SQL queries.
It's just a topic that has been ground into the ground. I would say from my experience it isn't picking
up the syntax that is particularly difficult for data engineers. It's always the simple topics and
concepts that end up causing problems.
Not understanding how to correctly design database tables, the basics around indexing, or just how
to write clean and simple queries. Mastering the high-level concepts is more important, becoming
an expert on writing window functions will just come with time and practice.
Chapter 7 - Data Warehousing / Data Lakes
Closely related to the last topic of SQL is Data Warehousing and Data Lakes. Many things have
changed since the days of Data Warehousing existing solely on relational databases. Now, many
data lakes and data warehouses exist in cloud storage like s3. Even with these large changes, at the
core, many of the same concepts still apply when designing classic RDBMS Data Warehousing or
Data Lakes in the cloud on s3. But let’s try to cover the basics upfront.

• Data Warehouse vs Data Lake vs Lake House


• Data Modeling in Data Warehouses, Data Lakes, and Lake Houses.
• Facts
• Dimensions
• Constraints and Schema
• Id’s and Keys
• CDC / History Tracking
• Key Takeaways

Data Warehouse vs Data Lake vs Lake House


You’ve probably noticed that two terms come up a lot, many times they are mixed, and it’s hard to
truly know the difference, Data Warehouses vs Data Lakes, or even the new Lake House.
I would say most of the time you can chalk up these discussions at a high level to marketing. But,
once you get past the marketing, there are usually some underlying assumptions based on if the
datastore is a Data Lake vs Data Warehouse.
Many times when someone speaks of a Data Warehouse vs a Data Lake, they are probably talking
about the data model of the data store. Data Warehouses are known to be strict Kimball or Inmon
style schemas, whereas many times Data Lakes start to blur this line and are less strict.

• The Data Warehouse / Lake / House is usually defined differently by different people.
• Kimball star schema design is still very popular and useful.

The actual definition of a Data Warehouse or a Data Lake depends on the context of the discussion.
Sometimes the more important question is: what is the technology underlying your Data Warehouse
or Data Lake?
For years a Data Warehouse was just another name for a relational database of some sort that had a
Kimball or Inmon style design applied to it. It isn't so clear cut anymore. But, there are some things
that both Data Lakes and Data Warehouses have in common.
Both places are the dumping ground for an organization’s data. It’s supposed to be the single source
of truth to which all data flows and is captured. So what’s the difference between a Data Warehouse
and a Data Lake from there?
I don’t know if there is a correct answer to this question, as many people define each differently, but
it’s safe to assume it has a lot to do with the data model applied. So, the data model (which is many
times driven by the underlying technology) is the main difference between a Data Warehouse and
a Data Lake.
You are probably aware that Data Warehouses and Data Lakes are there for OLAP, aka aggregation
purposes. They are mostly used to answer high-level questions across the business that often require
years of data. So, it comes down to how that data is stored in its "tables" and what the
relationship looks like between the different data sets.
Generally speaking, both Data Warehouses and Data Lakes seek to, at a minimum, de-duplicate data,
clean it, and store it logically, many times capturing historical versions of that data as well, called
Slowly Changing Dimensions (SCD), aka data that changes over time.
But, if we have to make more clear distinctions between these concepts, let’s give it a shot.
Data Warehouse, Data Lake, and Lake House are all different approaches to storing and managing
large volumes of data.
Data Warehouse:
A data warehouse is a central repository of integrated data from various sources in an organization.
It is designed to support business intelligence (BI) activities such as reporting, data analysis,
and decision-making. Data warehouses typically use a structured schema and relational database
technology to store data. Data is organized into tables with defined relationships, and the schema is
optimized for analytical queries.
Data Lake:
A data lake is a centralized repository of raw data from various sources in its native format. It
is designed to support big data processing and analytics workloads, including machine learning
and artificial intelligence (AI). Data lakes typically use a distributed file system, such as Hadoop or
Amazon S3, to store data. Data is not pre-processed, and the schema is not predefined, making it
easier to store unstructured and semi-structured data.
Lake House:
A lake house is a hybrid approach that combines the benefits of both data warehouses and data
lakes. It is designed to support real-time data processing and analytics workloads by providing a
unified platform for structured, semi-structured, and unstructured data. Lake houses typically use a
distributed file system to store raw data and a relational database for structured data. This approach
allows for faster data processing, real-time analytics, and more advanced AI and machine learning
capabilities.
Data warehouses are ideal for structured data, data lakes are ideal for unstructured and semi-
structured data, and lake houses provide a unified platform for both types of data. Each approach
has its benefits and drawbacks, and the choice of approach depends on the organization’s specific
data management needs and goals.
With all that being said, let’s dive into some topics you will most certainly run across as a Data
Engineer, regardless of if you end up working with Data Warehouses or Data Lakes.

Data Modeling in Data Warehouses, Data Lakes, and Lake Houses.
This is probably the biggest area of ambiguity today in the data engineering world, I see questions
and confusion all around this topic. The literature and documentation have not kept up very well
with the new age of Big Data and its new tech stacks like Delta Lake or Iceberg.
How do you data model a Data Lake or Data Warehouse in the cloud? What is the confusion and
what are people trying to articulate as the core issue?
The question centers around file-based data models (think Delta Lake and Iceberg) vs relational
database (Postgres, SQL Server) models.
Why is this so tricky?
Because in the new world of Lake Houses and Data Lakes, driven by tools like Delta Lake and Iceberg
… the interface is still SQL-based, giving the illusion of being no different from Postgres for example,
but underneath the hood, it is quite different.

Why can’t I just use the same old Kimball data model?

Some people might ask the question, “Well, if they both use the same ANSI SQL interface, why does
it matter, can’t I just model the same way I always have?”
Yes and No.
The problem is that between typical relational SQL databases and new Data Lakes there are two
major differences.

• The backend processing engines are completely different.


• The size of the data is usually completely different.

Data Modeling in Relational SQL Databases.


I won’t spend a lot of time here, this dead horse has been beaten enough over the last 20 years. But,
let’s review the basics so we can in turn see the major differences between this style of data modeling
and modeling in Data Lakes and Lake Houses.
Since time immemorial, the Kimball star-style schema for Data Warehouses has reigned supreme,
and it still does in many instances. It consists of the following typical points …

• Data is broken into many Facts and Dimensions.


• Data is very normalized and de-duplicated.
• Data usually fits on a single database instance (we will call this a few TBs of data or less).
• Most of the data is very tabular and uniform.
• Data transformations are easy-medium difficulty level.
• It's mainly for analytics; it doesn't support Machine Learning, etc., very well.
• Driven by primary and foreign keys.
• This type of data modeling is very well known and practiced and has been in use for years.

So yes, sure, you don’t have to change your data model, but things probably won’t work that well …
I mean you can drive your Ford Geo over the Rocky Mountains, but do you really want to do that?

Data Modeling for Lake Houses and Data Lakes.


This is where things start to get a little hairy and vague. We know we need good data models for
these new data stores, but we struggle to understand what model to use and most importantly …
why.
What are the challenges with Data Lakes and Lake Houses and how is it different from the past?

• Hundreds of TBs of data, or more.


• Data comes from a variety of sources, isn’t as uniform or tabular.
• Data is processed on clusters with tools like Spark, Presto, BigQuery, Athena, EMR, Redshift,
blah, blah, blah.
• Data transforms tend to be more complex, medium to high difficulty.
• Supports analytics, but also Machine Learning and other outputs.

Data modeling in data lakes refers to the process of organizing and structuring data in a way
that makes it easier to analyze and query. Unlike traditional data warehouses, data lakes do not
require a predefined schema or structure, which can make it challenging to organize and model
data effectively. However, there are several approaches that can be used to model data in data lakes:
Schema-on-Read:
Schema-on-read is an approach where the data is stored in its raw form, without a predefined schema.
The schema is applied when the data is read, based on the specific query or analysis being performed.
This approach provides greater flexibility and agility, as new data can be added and analyzed quickly,
without requiring changes to the schema.
Schema-on-Write:
Schema-on-write is an approach where the schema is defined upfront, and the data is structured
and transformed before it is stored in the data lake. This approach can provide better performance
and consistency, as the data is already structured and optimized for analysis.
Hybrid Approach:
A hybrid approach combines the best of both schema-on-read and schema-on-write. It involves
defining a set of core schemas or data models upfront, while also allowing for additional data to
be stored in a more flexible format. This approach can provide the benefits of both schema-on-read
and schema-on-write, while also addressing some of the limitations of each.
Practically, for Data Engineers, what does this mean?

Assertions about Data Models for Data Lakes and Lake Houses.

I’m going to make a few assertions about data models for Data Lakes and Lake Houses, then we will
dive into each one.

• The data model isn't as normalized; you will have fewer tables/data sinks.
• The data model is driven by partitions.
• Data models will have fewer tables and sinks in Data Lakes and Lake Houses.
• One technical issue that folks who work with Spark and other Big Data technologies have dealt
with for a long time is the "small file problem."

Many of the Big Data processing tools have hardships with data sets that are made up of many
small files. There are multiple reasons for this, but what it really boils down to is that all the technology
behind the clusters running our compute was designed for large data sets.
These tools scale well with big data … sure, they can work on data sets that are GB’s in size, but that
isn’t where they shine.
Think about it: if you have 50GB of data and you split that data into 10 different tables of 5GB
each, this means that all the transforms, tables, and queries are working on small datasets, most
likely split into small files (if you're working with data this size, it means you're probably ingesting
files with KBs of data, and that you don't know about partitioning or compaction, so all the files are
tiny).
You think Spark will shine, and that designing a Data Lake for your small data will solve all your
problems … but it won't.

Example

In the classic relational SQL database model, you focused on data deduplication, data normalization,
and star schema to the extreme. Typically in the Data Lake and Lake House world, we would not
model our file-based data sink in this manner.
We would instead focus on keeping data sources intact through our process … transforming them
from raw to target ready to consume. For example … in the classic Data Warehouse world, we would
most likely receive customer information with address information. We might go from a single data
file to three or four tables where we commonize address information.
customer_id | customer_name | address_1 | address_2 | city | state | zip | country
123DADF645 | Billo Baggins | 123 Bag End | Hole 4 | Hobbiton | The Shire | 456 | Middle Earth

CREATE TABLE common_customer (
    customer_id BIGINT,
    address_key STRING,
    city_key STRING,
    .... etc.
)

CREATE TABLE common_address (
    id_key BIGINT,
    address_1 STRING,
    address_2 STRING
)

CREATE TABLE common_city (
    id_key BIGINT,
    city STRING
)

CREATE TABLE common_state (
    id_key BIGINT,
    state STRING
)
.... etc, etc.

You will probably not find the above type of very minute and extreme normalization and de-
duplication in many Data Lakes. Never say never though.
Data Lakes and Houses will contain Accumulators and Descriptors, not necessarily Facts and
Dimensions.
The only reason I don’t like to call Fact tables Fact tables in Data Lakes, and Dimensions as
Dimensions, is because of the historical context around Fact and Dimensions and all that entails.
It gives a certain impression about the Data Model under discussion, and I like to make distinctions.

Accumulators

Accumulators in a Data Lake aggregate or accumulate the transactional records, pretty much like
a Fact table would have in the past. The main difference will be that the Accumulator table will
probably contain fewer “keys” that point to other tables, that would most likely have been stripped
out and put into a Dimension table in the classic SQL Kimball model.

1 >> example.sql
2 CREATE TABLE transaction_accumulator (
3 transaction_id BIGINT,
4 transaction_description STRING,
5 amount FLOAT,
6 product_id INT,
7 product_description STRING,
8 product_category INT,
9 customer_id INT,
10 customer_name STRING,
11 customer_address_1 STRING,
12 .. etc , etc.
13 )

The Accumulator above is similar to a Fact table but isn’t stripped of all its descriptors with that
data being normalized out into a myriad of other tables.

Descriptors

The Descriptors in the new Data Lake or Lake House are exactly like Dimensions, again they are
just less broken up and normalized, there are just fewer of them.
If the business requires, say, some distinct list of addresses, that would just be done as an additional sub-table,
probably run and filled much like a Data Mart or an analytic table further downstream. It wouldn't be
built into the Descriptor tables and given keys that are referenced in an Accumulator table.

Data Partitions drive the data model.


Data partitioning is an interesting topic, with lots of tentacles, I suggest reading this Microsoft doc
to get the background. This isn’t a topic often talked about in Data Engineering, because it isn’t sexy,
but this topic is core to driving data models for Data Lakes and Data Warehouses.
When you work with SQL relational databases, everything you do is driven off of primary and
foreign keys, and indexes. Many times these concepts only exist very “loosely” in a Data Lake or
Lake House, and are implemented in the code, not part of the product (Delta Lake for example). If
you want these concepts, you create them and enforce them yourself.

• Big Data can only be handled with Partitions.


• Big Data sources and sinks have to be broken up into pieces, or semi-logical business or query
units.
It really breaks down into the following problem …


How can I find what I need in this 200TB dataset? I can’t very well read every single file to get what
I need, it would take too long.
Enter partitions. And if partitions are this important and core to working on Big Data, then that
becomes the core of the data model used. The tables and sinks are designed around
these partitions, to make the data accessible and usable by the downstream compute.
Data partitioning refers to the process of dividing a large dataset into smaller, more manageable
segments or partitions. Each partition can then be processed or queried independently, which can
improve performance and scalability.
Data partitioning can affect a data model in several ways:
Partition Key:
The partition key is the column or columns used to divide the data into partitions. It is essential to
choose the partition key carefully, as it can impact query performance and scalability. The partition
key should be a column that is frequently used in queries and has high cardinality (many unique
values).
Query Optimization:
Data partitioning can improve query performance by limiting the amount of data that needs to be
scanned or processed. Queries can be optimized to leverage the partition key and process only the
relevant partitions, which can improve query response times.
Data Distribution:
Data partitioning can impact how data is distributed across nodes or clusters. Uneven data
distribution can result in hotspots or imbalanced workload distribution, which can negatively impact
performance. Data modeling needs to take into account how data is partitioned and distributed to
ensure even workload distribution.
Data Storage:
Data partitioning can impact how data is stored and accessed. Data may be stored in different
locations or formats depending on the partitioning strategy. Data modeling needs to take into
account how data is partitioned and stored to ensure that it can be easily queried and analyzed.
Data partitioning can have a significant impact on a data model, particularly in terms of query
optimization, data distribution, data storage, and partition key selection. Properly partitioning data
can improve performance and scalability, but it requires careful consideration and planning.
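Here is a hedged sketch of what a partition-driven table can look like in a Lake House, using Delta Lake-style Spark SQL (the table and column names are made up, and the exact syntax differs for Iceberg, Hive, BigQuery, and others).

CREATE TABLE lake.transactions (
    transaction_id BIGINT,
    customer_id BIGINT,
    amount DOUBLE,
    transaction_date DATE
)
USING DELTA
PARTITIONED BY (transaction_date);

-- Queries that filter on the partition column only have to touch the relevant files.
SELECT customer_id, SUM(amount) AS total_spend
FROM lake.transactions
WHERE transaction_date >= '2023-01-01'
GROUP BY customer_id;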

Facts and Dimensions.


Here are two concepts that should be a part of any Data Warehouse or Data Lake, fact and dimension
tables. They will end up being the beating heart of any analytical platform, regardless of the
semantics of the underlying data model.
Facts and dimensions are the two main components of a dimensional data model, which is commonly
used in data warehousing and business intelligence (BI) applications.
Facts:
Facts are the measurable, numeric data that represent the business events or transactions that an
organization wants to analyze. Examples of facts include sales revenue, customer orders, product
quantities, and website traffic. Facts are typically stored in fact tables, which contain the detailed
information about the events or transactions that occurred. Fact tables also include foreign keys that
reference the dimension tables.
Dimensions:
Dimensions are the characteristics or attributes that provide context and describe the facts. Examples
of dimensions include time, geography, product, customer, and salesperson. Dimension tables
contain the descriptive information about the dimensions, such as the date, country, product
name, customer name, and salesperson name. Dimension tables also contain primary keys that are
referenced by the fact tables.
Facts and dimensions are linked together through a relationship. The relationship between the fact
table and dimension table is established through foreign keys. The foreign key in the fact table refers
to the primary key in the dimension table, which allows the fact table to retrieve the descriptive
information from the dimension table.
The dimensional data model provides a way to organize and analyze large amounts of data in a
way that is intuitive and easy to understand. The fact tables represent the events or transactions
that occurred, while the dimension tables provide context and allow for analysis at different levels
of detail. The dimensional data model is particularly useful for BI and reporting applications, as it
enables users to easily slice and dice the data to gain insights into business performance.

Fact Tables.
Fact tables are the central datasets that grow the quickest and are used to aggregate data. Many times
they have some sort of date or time series associated with them and are transactional in most cases.
Many times the data in a Fact table could also be known as an “event,” a record that something
happened. It could be a sales order, an invoice, a widget being produced, a user clicking on a button.
These are data points that would naturally be summarized or aggregated.

transaction_id | transaction_date | amount | quantity
6678932        | 2021-11-01       | 55.55  | 3
9283234        | 2021-12-12       | 65.75  | 1

Dimension Tables.
Dimension tables are datasets that help describe and extend information in a Fact table. They are
ancillary, like the leaves on a branch. They help us understand and interpret certain values in a Fact
table and their primary goal is to be used as a “look-up” table.
customer_id | customer_name | customer_state | country
55353       | Bilbo Baggins | Iowa           | USA

What is a Data Warehouse or Data Lake all about? It's all about aggregating values. Think about
fact tables (that will be aggregated) as the hub of a wheel, with the dimensions as the spokes coming
off of it.
Let’s look at a straightforward example of a Fact table and Dimension table. Of course, we will use
hobbits as our example, we know that hobbits go on adventures and amass treasure for themselves.

CREATE TABLE middle_earth.Adventures (
    hobbit_id INT NOT NULL,
    quest_id INT NOT NULL,
    date_quest_completed DATE,
    amount_of_treasure BIGINT DEFAULT 0
);

A good fact table in a data warehouse should consist of some nondescript ids, maybe a date, and
the values you would want to aggregate on. Say like the amount of treasure a hobbit was able to
compile over different time intervals.
What about the hobbit dimension that would describe a hobbit for us?

CREATE TABLE middle_earth.Hobbit (
    hobbit_id INT NOT NULL,
    name STRING NOT NULL,
    age INT
);

The Hobbit table is a good example of a dimension, a table that has important information that better
describes and extends the data, but may not be critical for the calculation and aggregation of values.
How do we combine the two?

SELECT h.name,
       MONTH(a.date_quest_completed) AS mnth,
       YEAR(a.date_quest_completed) AS yr,
       SUM(a.amount_of_treasure) AS treasure
FROM middle_earth.Adventures a
INNER JOIN middle_earth.Hobbit h ON a.hobbit_id = h.hobbit_id
GROUP BY h.name, MONTH(a.date_quest_completed), YEAR(a.date_quest_completed);

So we can easily formulate how much treasure our hobbits are gathering over time. Remember
simple = better = faster, when it comes to database queries. And data warehouses and data lakes
tend to get large, so designing your Fact and Dimension tables correctly becomes important.
Constraints and Schema.


Another major role of the Data Warehouse or Data Lake is the ability to enforce rules upon data from
various sources. This is what sets a Data Warehouse apart, you can have data from many different
places all in a single spot, ready to act upon.
Constraints and schema play important roles in a data warehouse environment, as they help to
ensure data accuracy, consistency, and quality. Here are some specific ways in which constraints
and schema impact data warehousing:
Constraints:
Constraints are rules that enforce data integrity and consistency by restricting what can be entered
into a database. In a data warehouse, constraints can be used to ensure that data conforms to
predefined rules, such as data type, format, or range. Common types of constraints used in data
warehousing include primary keys, foreign keys, unique constraints, and check constraints. By
enforcing constraints, data quality can be improved, and data inconsistencies can be minimized.
Schema:
A schema is a blueprint or design that defines the structure of a database, including the tables,
columns, data types, relationships, and constraints. In a data warehouse, the schema is typically
designed using a dimensional model, which consists of fact tables and dimension tables. The schema
defines the structure of the data warehouse, which enables efficient querying and analysis of large
amounts of data. By adhering to a well-designed schema, data can be organized and stored in a way
that supports business intelligence and reporting needs.

Constraints.
One of the best parts of the Data Warehouse is the ability to enforce constraints upon our data. It
gives the ability to expect uniform values in the most critical data points. This concept is so key to
the Data Lake and Data Warehouse, that even file-based storage systems like Delta Lake support
constraints.
Now depending on the underlying technology, different constraints are supported in different
systems, but it’s good to have a general understanding of what’s available.
Here are some common constraints that can be useful in ensuring data integrity.

• NULL or NOT NULL


• CHECK (is a value in some range or meets certain criteria.)

Here is a simple example to make the idea of constraints more concrete.

1 ADD CONSTRAINT dateEmployee CHECK (hireDate > '1900-01-01')


These two basic constraints in themselves will go a long way toward ensuring the correctness of data
entering the system. As you can see from the above example, you can be very basic or fairly
complex in the constraints you add to your data points. Remember, each constraint has to run on
INSERT, so don’t go overboard either.
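As a slightly fuller hedged sketch of constraints declared right in the table definition (the table itself is made up, and exact constraint syntax varies between Postgres, SQL Server, Delta Lake, and friends):

CREATE TABLE middle_earth.Quest_Ledger (
    quest_id INT NOT NULL,
    hobbit_id INT NOT NULL,
    treasure_value BIGINT NOT NULL CHECK (treasure_value >= 0),
    quest_date DATE CHECK (quest_date > DATE '1900-01-01')
);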

Schemas.
Schemas are the other important topic when talking about Data Warehouses or Data Lakes. What
I’m mostly referring to when I talk about schemas in this context is the ability to control and enforce
column names and data types.
Part of the problem with ingesting data from many different sources is that the data is, well, different.
We are trying to take data points that are different and somehow commonize them into a single store
from which we can answer any question.

• Data types.
• Column names.

In data engineering, you will soon realize it’s always a few small but problematic issues that continue
to plague most data systems decade after decade.

Data Types.
A Data Warehouse or Data Lake allows you to define all the data types you expect for each data
point beforehand. Many times mixing of INTEGERS with STRINGS, say 1,2,3 with a,b,c will cause
major issues that need to be fixed downstream. How can you SUM a column if it includes both
INTEGERS and letters from the alphabet?

Column Names.
Another major area of concern in data engineering is the change of column names over time. It
happens and will always happen, for many reasons. It could be a bug or it could be the business
changing. Very few data sources stay static forever.

Example
1 CREATE TABLE ....


2 ...
3 order_amount DECIMAL,
4 order_type STRING,
5 ...

In the above example, we know that order_amount is going to be aggregated regularly, so we don’t
want STRINGS ending up in that column. As for order_type what if a new data set comes with a
column titled order_description? These are the types of problems that a Data Warehouse or Data
Lake aims to solve in the long run.
Nothing is more confusing than the “same” data source with different “versions” of data. This
introduces complex transformations and logic that has to “combine” data sets so a full view can
be taken of the single data source.
Data Warehouses and Data Lakes aim to solve this problem using data modeling, among other
techniques.

The Role of ID’s in a Data Warehouses or Data Lake.


Now that we know the basics of what a Data Warehouse or Data Lake is, how they are made up
of Facts and Dimensions, and that their primary purpose is the aggregation of data, we should touch on
another topic that is critical to the function of a Data Warehouse or Data Lake. That topic is ids.

• Uniqueness.
• Relationships.

It's critical to understand that the data repository that is the central hub of truth for an organization
needs to be just that, the source of truth. What do we mean by truth?

Uniqueness.
The data in the Lake or Warehouse is going to be de-duplicated and somewhat transformed. When
we talked about primary keys and uniqueness in our chapter on SQL, this same topic applies and is
even more important in the Warehouse or Lake world.
We can’t de-duplicate and have unique data records without having keys, primary or otherwise that
identify our data. How can we relate Fact tables to their dimensions? This will be done by ids either
in a SQL table or as hashes generated by code in a Data Lake.
It isn't important how our particular technology requires us to generate an id for each data row;
what's important is that we take that step in the first place. Why? The difference between data dumped into
cloud storage and a Warehouse or Lake is that in the latter we’ve taken the time to understand
and transform the data into a unique and usable dataset that can be related to other datasets easily.
Example

Let’s look at an example of some rows of data, and the challenge that uniqueness can pose.

1 # dataset 1
2 customer_id | order_amount | order_date | order_source
3 12234 10.00 2021-12-01 56
4 12234 10.00 2021-12-01 67

As you can see in the above example, if we had these records in an incoming data file, being ingested
into our data warehouse, we might not have thought about this problem beforehand. A customer
places 2 orders on the same day for the same amount.
This requires id’s to be generated for each row of data, in a Fact or Dimension table. This could be
primary and foreign keys in a relational database or hashes of unique column(s) produced by our
code and stored along with the data in our Lake in s3.
Being able to uniquely identify each Fact and join to Dimension on ideally a single column will
require an id to be generated during the creation and loading of our Data Warehouse or Data Lake.
In Spark SQL code this might be simply creating a hash of the columns that make a record unique.

sha2(concat_ws("|", `customer_id`, `order_amount`, `order_date`, `order_source`), 256) AS hash_primary_key

Relationships.
We already mentioned this in passing above, but relationships are the crux of a good Data Warehouse
or Data Lake. You are going to have a few main datasets or tables in your data store, how these data
sets and tables relate to each other is the key.
That is why the keys or hashes we use in our table structures to uniquely identify each row are so
important. We need to be able to JOIN our datasets together, without producing duplicates in our
Data Lakes.
Data is really about relationships. Sure, we might be able to answer a few simple questions using
a single table, but most business requirements require datasets to be joined to derive deep analysis
that provides the best insights and answers.
When building out a Data Warehouse or Data Lake, much time should be spent on defining the
relationships between our data tables and stores. We will dive into this more during the Data
Modeling chapter.
CDC / History Tracking.


Probably the part of creating a Data Warehouse or Data Lake that provides the greatest challenge
is the requirement for history tracking, also known as CDC, or change data capture.
It isn’t too bad once you’ve implemented it a few times, but in the beginning, it can be a little difficult
to grasp how you can provide such features. There are a few different approaches to capturing data
changes over time, I will pick one to go over to give you a good idea of what it can look like in the
real world.
There are two key parts to history tracking.

• Effective start and end dates/times.


• Active flag.

This might make more sense with an example. Say we have a Dimension table of Customers or
Accounts in our Data Lake, periodically someone moves and changes their postal code and address.
How do we keep track that this customer’s information changed?

customer_id | ... | postal_code | ... | start_date | end_date | active_flag
123         | ... | 50023       | ... | 2020-12-12 | 2021-5-4 | 0
123         | ... | 50035       | ... | 2021-5-5   | 9999-9-9 | 1

Of course, this is a contrived example, but it gives a good idea of how important a start and end
date/time, as well as an active flag, can be for tracking CDC changes over time in a Data Lake or
Warehouse.

Active flag.
The active flag is important because if you just want the “current” snapshot of what is more recent,
you can simply add a WHERE or FILTER to say active_flag = 1.

Effective date/times.
The effective start and end times of a record also provide a great history tracking of what “happened”
to a customer or any data record for that matter. It gives the ability to time travel as well, to see
“what this looked like on X date.”
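In query form (against a hypothetical customer_dim table shaped like the example above), those two ideas look something like this.

-- Current snapshot: only the active rows.
SELECT * FROM customer_dim WHERE active_flag = 1;

-- Time travel: what did customer 123 look like on a given date?
SELECT *
FROM customer_dim
WHERE customer_id = 123
  AND DATE '2021-03-15' BETWEEN start_date AND end_date;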
CDC implementation notes.


I'm not going to dive into all the technical details behind actually implementing CDC and
history tracking in code. I will just give you a few highlights and hints that will work most of the
time.
time.
Of course, this can be a complex set of transformations to be able to track history records, but here
is generally how you can accomplish such a task.

• Get records in the source that ARE NOT in the target.


– These become your INSERT new records (aka they have no previous)
• Get records in sources THAT ARE in the target.
– UPDATE those records in the TARGET to set “end” or “stop” datetime and turn OFF the
active flag.
– INSERT new records similar to step 1

Of course, there are many ways to do this; some people use MERGE statements, and technically you
can combine the two INSERT steps. But generally speaking, you simply need to break your source records
into two groups: those that exist in the target and those that do not. Then it's simply a matter of
ending the records (via UPDATE) in the target that already exist, and inserting the new records.
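Here is a hedged sketch of that two-step UPDATE-then-INSERT pattern, using the hypothetical customer_dim table from above plus a customer_stage table holding the incoming source records (Postgres-style UPDATE ... FROM syntax; other engines use MERGE or slightly different forms).

-- Step 1: close out active target rows whose source version has changed.
UPDATE customer_dim t
SET end_date = CURRENT_DATE,
    active_flag = 0
FROM customer_stage s
WHERE t.customer_id = s.customer_id
  AND t.active_flag = 1
  AND t.postal_code <> s.postal_code;

-- Step 2: insert new versions (and brand-new customers), which now have no active row.
INSERT INTO customer_dim (customer_id, postal_code, start_date, end_date, active_flag)
SELECT s.customer_id, s.postal_code, CURRENT_DATE, DATE '9999-12-31', 1
FROM customer_stage s
LEFT JOIN customer_dim t
       ON t.customer_id = s.customer_id
      AND t.active_flag = 1
WHERE t.customer_id IS NULL;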

Key Takeaways.
There are a few high-level key concepts that should be kept in mind while designing Data Lakes or Data
Warehouses. We glossed over a lot of topics, but I hope you have a better idea of what to expect
when working on a Data Lake or Warehouse.

Facts and Dimensions

It doesn't matter if you will be working on a Data Warehouse or a Data Lake; the nuances between them
are small, and it's the details that matter. You should identify where the transactional data will
accumulate with the most growth; these will be your fact tables. The dimensions will be more slowly
growing, with ancillary data that will help extend and describe your fact tables.
Don't go overboard with data normalization (more on that later); the fewer the tables the better, because they
will be large. You should expect to have a few fact tables in the middle, with dimension tables that
"surround" your facts, extending and describing them.

Constraints and Schema

Never forget that placing constraints and the correct data types in your schemas of a Data Warehouse
or Lake will save you time and trouble that will pay dividends in the end.
Validating inputs and commonizing the data to be the same is a key part of what a Data Lake provides
and the value it brings. Spend the extra time to put value constraints in at high-value targets, columns
that find themselves in WHERE clauses or filters.
Id’s and Keys

Having the correct id's or keys in our Warehouse is of utmost importance. Not having duplicate
records, and having a way to uniquely identify each record, solves many problems that can
arise during a Data Lake project.
Ids and keys are critical to being able to reliably join different datasets, like Facts and Dimensions,
together without producing duplicate and erroneous results.

Summary
This was a rather big chapter with a lot of topics. As Data Warehouses have transitioned over to
Data Lakes and Lake Houses, things have gotten confusing. It doesn’t help that the constant drip
of marketing material from vendors never stops either; many times it just confuses the issues at
hand.
It’s most important to recognize the obvious differences between the different systems, their
technology, and their data models.
Facts, dimensions, change data capture, constraints, keys, and data models. These are difficult topics
to master without a lot of experience, but you can set yourself up for success by getting a broad
understanding of these different skills.
Chapter 8 - Data Modeling
It might seem like this chapter rehashes a few topics we have already talked about during Data
Lake/Warehousing and SQL chapters, and this is true because at the core of data modeling there are
a few ideas and techniques that seem to apply to 90% of the data modeling that you will do as a Data
Engineer.
You can use the same skills to model small sets of data stored in a few Postgres tables and hundreds
of TB’s of data stored in Parquet format in s3. Get these few basic data modeling skills right and
they will help you in almost every data project you work on.
Don’t ever forget that data modeling is half art and half science; many topics related to data modeling
are heavily dependent on the use case in question. We will call this concept “access patterns”: how
the data is used by its consumers. A data model should reflect how the data is going to be used.
Is the data in question going to be aggregated over specific time series and products?
Then the data model should reflect that need and provide a schema for the data that can support it.
Here are the highlights before we dig in.

• Understand data types and schema.
• Group data logically.
• Think about the grain of the data (lowest level of detail).
• Think about the uniqueness of data.
• Think about access patterns (query).
• Think about normal forms.
• File-based modeling vs SQL.

Data Types and Schema.


When you have data that needs to be stored for use by downstream systems, before starting to design
any part of the data model, begin by trying to understand the schema and data types of the data you
will be working with.
This most often consists of understanding the schema, which is usually made up of data types and definitions.
Each piece of data should be assigned a data type, constraints around that data, and a definition,
including a name.

• Datatype.
• Constraints.
• Definition and name.

Data types are used to define the type of data that can be stored in a field or column of a database
table. They specify the range of values that a field can take, the storage requirements, and the
operations that can be performed on the data. Choosing the appropriate data types is an important
aspect of data modeling, as it can affect the performance, accuracy, and usability of the data.
Schema, on the other hand, refers to the structure or organization of the database tables, fields, and
relationships. It defines the rules and constraints that govern how the data is stored, validated, and
accessed. The schema determines how the data is organized into logical units, such as tables, views,
and indexes. It also provides a framework for managing the relationships between tables, such as
primary keys, foreign keys, and referential integrity constraints.
In data modeling, data types and schema are closely related because the choice of data types can
affect the design of the schema. For example, if a field requires a large amount of text data, such as a
product description or customer feedback, a text data type may be more appropriate than a numeric
data type. This choice of data type may then affect the design of the schema, such as the size of the
field, the indexing strategy, and the data validation rules.
In short, data types and schema are closely related components of data modeling: the
choice of data types can affect the design of the schema, while the schema provides a framework for
managing and organizing the data in a structured and meaningful way.

Data Types.
Data types are a way of categorizing the type of data that can be stored in a field or column of a
database table. They define the range of values that a field can take, the storage requirements, and
the operations that can be performed on the data.
Common data types in databases include:
Integer:
This data type is used to store whole numbers, such as 1, 2, 3, and so on. The size of the integer can
vary depending on the range of values that need to be stored.
Float:
This data type is used to store decimal numbers, such as 1.5, 2.3, and so on. The precision and size
of the float can vary depending on the level of accuracy required.
Date/Time:
This data type is used to store dates and times, such as 2022-03-28 or 15:30:00. The format and
precision of the date/time data type can vary depending on the requirements.
String/Text:
This data type is used to store text data, such as names, addresses, and descriptions. The size of the
string/text data type can vary depending on the length of the text to be stored.

Boolean:
This data type is used to store true/false values, such as yes/no or on/off.
There are also more specialized data types, such as binary, XML, and JSON, that are used for specific
types of data.
The choice of data type can affect the performance, accuracy, and usability of the data. For example,
using a numeric data type for a field that requires text data can lead to data truncation or errors.
Using a large data type for a field that requires small amounts of data can lead to wasted storage
space. Therefore, it is important to choose the appropriate data type for each field based on the
requirements of the data model.
Generally speaking, whatever data storage solution you are using, relational database or parquet,
basic data types don’t change very much. You should become familiar with a few of the most
common.

• STRING or CHAR
• INT or BIG INT
• DECIMAL or FLOAT
• ARRAY
• BOOLEAN or TRUE/FALSE

Example
Usually, it’s obvious based on the column name what the data type should be, but it’s easy to fall into
traps. We should think carefully about assigning data types; should an amount be an INT or a
DECIMAL?

1 account_id -> INT -> 1459345
2 account_name -> STRING -> "ABCDEFG"
3 purchase_amount -> DECIMAL -> 55.05
4 is_customer -> BOOLEAN -> True
5 product_ids -> ARRAY -> [1, 3, 6, 7]

Of course, these specific data types will change slightly if you’re using Postgres vs a parquet file,
but generally, you will be using some form of a string or char, int or decimal, true and false boolean
values, and sometimes arrays, lists, or other more complex types.

Data Size.
Remember to consider data size when defining which data type should hold which data points. It’s going
to be a delicate balance: you don’t want to run out of room with a STRING or CHAR of a certain
size, but you also don’t want to waste a lot of space by calling every column a STRING just to make
things easy.
All data is stored on disk eventually, it could be cloud storage in s3 or just a local drive on a server.
Either way, data can add up over time, choosing the correct data type for your data points will have
a large impact on storage size.
Data Engineers should spend time with their source data, examining values, looking for the min and
max of each column, looking at the lengths of strings, and using good old common sense.
It can be easy to fall into the trap of just glossing over data types, assigning certain values to STRINGs
just because you can, not understanding the DECIMAL precision involved in a numeric column, or not
knowing whether an INT needs to be a BIGINT.
Take your time to understand each data type in depth; it will help you start with a solid data model
at its foundation. A quick profiling pass over a sample of the data, like the sketch below, goes a long way.
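As a rough illustration, here is a small pandas profiling sketch; the sample file name is hypothetical:

import pandas as pd

# Profile a sample of the source data before locking in data types.
df = pd.read_csv("source_sample.csv")

# Numeric columns: min/max help decide between INT, BIGINT, and DECIMAL.
print(df.describe())

# String columns: the longest value helps size CHAR/VARCHAR columns.
for col in df.select_dtypes(include="object").columns:
    print(col, df[col].astype(str).str.len().max())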

Constraints.
After the data types have been identified, the next task is to understand the constraints that surround
each data point. This will be a familiar topic for those savvy SQL developers, but I want to extend
the idea a little further.
Constraints play an important role in ensuring data quality and consistency in a data warehouse. A
constraint is a rule that is applied to a database table to restrict the type or range of data that can be
stored in it. Constraints can be applied to individual columns or to the entire table.
Here are some common types of constraints used in a data warehouse:
Primary Key Constraint:
This constraint ensures that a unique value is stored in a specific column or combination of columns.
It is used to enforce data integrity by preventing duplicate values and making it easier to join tables.
Foreign Key Constraint:
This constraint ensures that the value in one column of a table is linked to the value in another
column of a different table. It is used to maintain referential integrity by preventing orphaned rows
and ensuring consistency in related data.
Not Null Constraint:
This constraint ensures that a column cannot have null values. It is used to enforce data integrity by
ensuring that required data is present.
Check Constraint:
This constraint ensures that the data stored in a column meets a specific condition or set of conditions.
It is used to enforce business rules and data quality standards.
Constraints are used to ensure that the data in a data warehouse is accurate, consistent, and
valid. They help to prevent data entry errors, enforce business rules, and maintain the integrity
of relationships between tables. By applying constraints to the data, data warehouse administrators
can ensure that the data is reliable and can be used for meaningful analysis and reporting.
Most people, especially those working in SQL systems, take a lackadaisical approach to defining
constraints, both physical and theoretical. Even the easy and obvious constraints around data
points are often ignored. But anyone with experience can tell you that having robust constraints
around your data is the first step in data quality.
What type of constraints should be thought about for each data point?

• NULLs or NOT NULL.
• CHECK (controls what values).
• Uniqueness (keys and ids).
• Foreign keys (SQL - key/id integrity).

Constraints are a great way in either Data Warehouses or Data Lakes to control the quality and
ensure the integrity of the data being stored. Dirty data is a classic problem, so catching problems
early can be accomplished easily with constraints. Think of constraints as your first line of defense.

Constraints for numeric columns.


For numeric columns, constraints can check to make sure a value is in some sane range. It’s a
great way to ensure mistakes in source systems don’t cascade farther into other parts of the system where they
are harder to extract and correct.

1 CONSTRAINT checkRange CHECK (product_id BETWEEN 1 AND 50)

Constraints for string columns.


In string columns, constraints can also help ensure the filters and WHERE clauses we write will work.
Misspellings and typos have a way of finding a home in every system, and constraints can catch those errors.
Take the simple example of a response column. Let’s say we define a constraint that says the value of
the response column must be one of the following items … ['approved', 'declined', 'unknown'].
Now when a new data source comes in with the value decline, our constraint can stop that
problem in its tracks.

1 CONSTRAINT repCheck CHECK (response IN ('approved', 'declined', 'unknown'))

What if we didn’t have the constraint in place? Now we have a column with both declined and
decline, some unknowing analyst using our data source with a query that says WHERE response =
'declined' and is now getting incorrect results.

Data Definitions.
This topic might seem a little basic, but it’s a very important one nonetheless. Having a definition
of the data points is a two-fold task.
Readable column names, that carry meaning, as well as descriptions of data points.

• Legible column names with meaning.
• Descriptions.

Legible Names.
Ensure you define column names that tell someone what the data point is without too much problem.
Simply reading a column name should tell you a lot of what you need to know. If the column is a
code, call it ..._code, if it’s a description, call it ..._description. Simply calling something code
instead of response_code or description instead of product_description just causes ambiguity.

Descriptions.
Descriptions should also be kept for each data point. This is usually easiest to do when creating the
schema of the data in the first place. The more time that goes by, the less likely someone is
to remember or know what a particular data set represents. Putting together documentation or
comments in a codebase will come in handy later.
This topic seems boring and pointless until later when you are asked by the business what a definition
is or where to find some obscure data point.

Modeling Data Logically.


Modeling data logically is probably one of the more obvious ways to enable an above-average data
storage solution. What does it mean to define a logical data model?
Logical data modeling is grouping sets of data that logically belong together.
This is probably the easiest way to start a data model, for SQL or otherwise, as a high-level first pass
attempt when starting a brand new model.
Here are a few obvious examples of how that might look. Let’s say we work at a widget
manufacturing company that makes and sells many widgets on its website. What kind
of logical data model would we start to put together?

• Customer information
• Product information
• Order information
• Bill of material information
• Marketing information
• Accounting and Pricing information

This is a simple example of the logical data model. In your mind, you can already see the tables
forming that would fall into these different groups.

Logical data models lead to physical relationships.


This is a great way to start because it gives a sense of the type of information that will be available,
and even the relationships (joins) between the different logical groups of data.
Once you have broken data sets into logical units, the JOINs between those data sets start to present
themselves to the designer. If you have a dataset of orders and a dataset of customers, you already
know there is going to be a relationship between these two tables, even if you don’t understand yet
every nuance of what that JOIN might look like.
Logical data models are a good starting point for talks and reviews with the business and stakeholders
about data needs, and for thinking about how the logical model could answer those questions. The
logical data model will probably make the most sense to a non-technical audience, and it is also a
great tool for documentation when introducing new engineers to a data set.

Grain of Data.
The idea of the “grain” of the data might be a new one to you, but it is an important thought process
while data modeling. This becomes especially important during a Data Warehouse or Data Lake
project, and even with some smaller relational database designs as well.
The grain of data is the lowest level of information available on a topic.
What does that mean in real life?
If you have a table that holds order information from customers, does the table hold a single record
describing the overall order, even if multiple items were ordered? Or does the data table hold a
record for each item that a customer orders, and to get the total order summary you have to roll up
those individual order lines?

1 # orders table
2 order_id | product_id | quantity
3 456      | 4AB-C      | 3
4 456      | DCA-01     | 1
Chapter 8 - Data Modeling 139

In the above example, we might assume that the table holds one record for each order, but that is
not the case, is it? The “grain” of this dataset is that each order can have multiple lines in the orders
table, one for each product ordered. This is an important distinction.
That might seem like a small difference, but it becomes very important when data modeling. If you
design a data model that relies on an Orders table having only a single line per order (a summary
of multiple items), what happens later when you realize you are missing important
information that can only be obtained from a “lower” grain of data, meaning you need a data record
for each item ordered?
Knowing the lowest level of information that is required by the business logic and requirements that
are driving the data model is important. Identifying the lowest level of grain required should be one
of the first tasks when data modeling because it directly affects the final schema of the data model.

Uniqueness of Data.
Another data modeling technique that should always be thought about before writing any code is
the uniqueness of your data, and more specifically, each table or data sink you are trying to design
and model should have something that uniquely identifies each record or row.

• Identify each record.
• Helps with JOIN relationships.
• Reduces the age-old duplicate row problem.
• Many times uniqueness is a combination of multiple columns.

This is important for several reasons. First, if you are unable to identify what makes a piece of data, or
row, unique, then you most likely don’t understand enough about the data. Second, the uniqueness
factor of the data will tie directly into the primary keys that will eventually be needed for
each record, or for joins later.
Duplicates in data are one of the oldest and most common problems that have plagued databases
and data warehouses since the beginning of time. What was the problem, how did the duplicates
get there?
I would argue it was a failure during the data modeling process to identify, for each logical group
of data, what made each record unique. It’s typically a combination of data points, sometimes even
combined with a date. Anyone working on a data model should ensure they understand what makes
the data unique, and what information might have to be added to make a data record unique, even
if that requires the addition of outside data like a file name or insert date.
When duplicate records start to appear in data models, every join to other data sources starts to
compound the problem. And once duplicate records are placed into the system, they are typically
painful to back out and remove. A quick check like the one below can catch this early.
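As a small illustration, here is a pandas sketch that validates an assumed composite key; the file and column names are hypothetical:

import pandas as pd

df = pd.read_csv("orders.csv")

# The combination of columns we believe makes each row unique.
key_columns = ["order_id", "product_id", "order_date"]

duplicates = df[df.duplicated(subset=key_columns, keep=False)]
if not duplicates.empty:
    print(f"{len(duplicates)} rows violate the assumed uniqueness of {key_columns}")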

Access Patterns.
Access patterns are one of the least used data modeling techniques, mainly because they can be
difficult to ascertain at the beginning of a project, when the data model is taking shape.

• Understand queries.
• How do downstream systems consume the data.
• Understand business needs and requirements for the data.

What is a data access pattern? It’s how the data is queried and used in the end. When you’re just
starting to work on a data model for a system, it can sometimes be hard to know exactly what
type of queries will show up and how the data will be used. But usually, you can get at least a
decent idea by talking to the downstream people and applications that will be using your data and
data model.

Example
It doesn’t matter if you’re modeling parquet files in Delta Lake or tables in Postgres; if you can see the
queries that will be run, or mock them up yourself, the access patterns become quite clear.

1 SELECT year, month, day, product, SUM(sales_amount) as sales
2 FROM company_orders
3 WHERE product IN (1, 2, 3) AND year >= YEAR(NOW())
4 GROUP BY year, month, day, product;

In the above example, we can see that our indexes or partitions will most likely break down along
date parts like year and month, as well as product.

Talking to the Business.


For example, if you are working on a data warehouse or data lake, and you have a dataset with
customer orders in it, you could talk to the product team, marketing team, and sales team, asking
them what kind of reports and information they expect and want to answer.
Most likely you will hear them talk about how much product is sold by date, or what the top-selling
products are, maybe which customers order a lot during which times of the year. If you keep having
those conversations some things will start to become clear. Date and Product are very important to
our end-users, and many of their questions revolve around those data points.

Business Requirements impact the Data Model.


How does a data model use this information? We know now that most of our data is going to be
aggregated by date and product and customer. So these data values become prime targets for indexes
or partitioning strategies depending on our technical stack.
Thinking about the columns and data points used in WHERE and GROUP BY clauses is key for data
modeling correctly.
If you don’t understand how the data is going to be used there is a good chance the data model
won’t support the queries and access patterns that try to use that data source. This will lead to many
problems, not least of all performance, and difficulty retrieving correct data with reasonable logic.

Normal Forms.
Many folks’ eyes glaze over when the topic of normal forms comes up. I am one of
those people, but it’s important to take away some of the basics of normal form design when data
modeling.

• De-duplication
• Join integrity

This topic was more important in the days of the classic relational data warehouse on SQL Server
or Oracle but has become less so today. You can loosely follow the levels of data normalization or
de-duplication and benefit from much of what they provide in a file-based Data Lake.
What are some of those old database normalization techniques that can be applied to even new Data
Lakes in the cloud? There are a few timeless, classic database normalization ideas that should be followed.

De-Duplication of Data.
Duplicate data creates many problems in a data sink, some of them not as obvious as you might think. I’m
not talking about the duplicate data records that we discussed earlier; this is a different problem. I’m
referring to data definitions that exist in more than one place.
Let’s take an example of a database that holds information about specific products. What happens
if we make the easy mistake of having product descriptions in more than one spot?
This would be easy to miss. We might have a product table and an order table that both hold the
product_id and its description. What’s the problem with this?
The product description changes; what do we do now? If we didn’t follow
proper data normalization, we could have product descriptions in several if not many tables, and now
we have inconsistent data. The process of updating product descriptions becomes a very painful and
expensive operation.
As you can see, one of the main tenets of normalization, the removal of redundant data, is integral
to data models of all kinds. A simple sketch of the idea follows.
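Here is a tiny, assumed example in pandas: the description lives only in the products table, and every other dataset references products by id, so an update happens in exactly one place.

import pandas as pd

# The product description lives in exactly one place.
products = pd.DataFrame({
    "product_id": [1, 2],
    "product_description": ["Small widget", "Large widget"],
})

# Orders reference products only by id, never by description.
orders = pd.DataFrame({
    "order_id": [100, 101, 102],
    "product_id": [1, 1, 2],
    "quantity": [3, 1, 5],
})

# The description is joined in at query time, so changing it in the
# products table changes it everywhere it is reported.
report = orders.merge(products, on="product_id", how="left")
print(report)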

Join Integrity.
Probably one of the most complex and yet most important parts of data modeling is understanding
how different data sets relate to each other, in other words the joins.
Many times there can be many joins between many tables, giving all sorts of results. Part of
normalizing your data is controlling how these joins occur and the type of results they will give
back.
Will joining two datasets cause one dataset to duplicate many rows? Will joining another table cause
records to drop? In real terms, can someone assume that every single record in your orders table,
that has a customer column, will have one and only one record in the joined customer table?
Think closely about your data model and how these joins will affect everyone and every query
downstream.

Keys - Primary and Foreign.


Before someone starts complaining that their new Data Lake doesn’t support foreign keys, let’s
remember we are trying to review proven techniques for data modeling that solve issues in the long
run. Hint: relational databases support the concept of Primary and Foreign keys, but many new Data
Lake technologies, like Delta Lake, do not.
Primary and foreign keys, regardless of whether they are technically supported or not, are incredibly
important topics that we should go through the exercise of designing, if for no other reason than to
understand the data more closely during the modeling process.

The Idea Behind Keys.


What we are talking about here at a high level is the relationship between all the entities in our data
model.
It doesn’t matter what the underlying technology is beneath the model; the fact is that the data is
only as useful as it can be reliably related to other entities.
Understanding how each table or sink of data can be related to others, and how that “join” would
function is critical to any data model. Relationships between entities are core to any data product
and should be looked at in-depth during the data modeling phase.

Generate your own keys if you must.


Just because you are using, for example, Delta Lake on the cloud, and there isn’t any official,
enforced implementation of a foreign key, doesn’t mean that you shouldn’t design a key
yourself that acts as a foreign key; in fact, this is a good idea.
Having a key in each entity that can be used to relate to other entities easily will force better practices,
long-term results, and data usability.
Being able to design keys of unique data that relate entities to each other requires a deeper
understanding of the data sources. The simple process of going through this exercise of designing keys
will make you the subject matter expert and result in a better data model.
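One common approach, sketched here in pandas with assumed column names, is to generate a deterministic surrogate key by hashing the columns that make a record unique:

import hashlib

import pandas as pd

def surrogate_key(df: pd.DataFrame, columns) -> pd.Series:
    # Concatenate the natural-key columns and hash them into a stable, deterministic key.
    concatenated = df[columns].astype(str).apply("|".join, axis=1)
    return concatenated.map(lambda value: hashlib.sha256(value.encode()).hexdigest())

# Hypothetical example: customers are unique by source system plus account number.
customers = pd.DataFrame({
    "source_system": ["crm", "crm", "web"],
    "account_number": ["A-1", "A-2", "A-1"],
})
customers["customer_key"] = surrogate_key(customers, ["source_system", "account_number"])
print(customers)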
If you are not familiar with data normalization, I recommend you take some time to do some reading
on the subject; the more you learn, the more you will be able to pick and choose topics to apply to
your next data model.

Relational Databases (SQL) vs Data Lake (File Based) Modeling.
Because of the new world we live in, I want to take time to review the differences between data
modeling for SQL and relational databases, and file-based models, like Delta Lake.
The differences between a classic relational database data model and the new Data Lake data model
are somewhat obscure, but important and obvious once you observe them.
The Differences

• SQL models have many dimensions, with data normalization and de-duplication taken to the extreme.
• File-based models have fewer dimensions, with normalization in moderation.
• In SQL database models, table size doesn’t matter much, small or big.
• In file-based data models, table size and file size matter in the extreme.
• SQL database models are centered around indexing and indexes.
• File-based data models are centered around partitions and partitioning.

The number of Fact tables and Dimensions, and normalization.
This just comes down to the technical implementation of, say, Postgres vs Apache Spark. The
technology is just different, and of course that is going to impact the data model greatly; there isn’t
much you can do about it.

Relational SQL databases are blazing fast, and joining many small tables is no big deal. That’s typically
why the classic Data Warehouse running on a SQL database had a look-up code and dimension for
everything. Data deduplication and normalization were taken to the extreme in these data models
… because they fit the technology.
The following queries are quite common in the classic data warehouse world.

1 SELECT fact.*, dim2.blah, dim3.blah, lookup1.blah, lookup2.blah
2 FROM fact
3 INNER JOIN dim1 ON fact.key = dim1.key
4 INNER JOIN dim2 ON fact.key2 = dim2.key
5 INNER JOIN dim3 ON fact.key3 = dim3.key
6 INNER JOIN lookup1 ON fact.key4 = lookup1.key
7 INNER JOIN lookup2 ON fact.key5 = lookup2.key
8 WHERE dim1.blah = x AND dim2.blah = y AND lookup2.key = z

File size and table size matter in the new File-Based Data Lakes.
Most data engineering practitioners who’ve worked around Big Data and those tools like Hive, Spark,
BigQuery, and the like will recognize that both small file sizes and small datasets can wreak havoc
on those tools.
If you have 5 small-to-medium size datasets that have to be broadcast and joined with every single
query in your Spark Delta or otherwise Data Lake … things probably are not going to work as you
think.
Technically most big data tools get slow when you start to include large numbers of small files or
datasets that have to make their way across a distributed cluster network and disk that is behind
those big data tools.
This wasn’t a big deal back in the SQL relational database days because very few systems required
sharding and the data was sitting on a drive or two on a single machine. This is not the case in
the big data Data Lake. Hundreds of terabytes of data can be scattered across many hundreds to
thousands of nodes depending on workload size.

Partitions vs Indexes.
Another major role in the data model differences between the new Data Lakes and the old Data
Warehouses is partitioning vs indexes. The file-based Data Lakes are very much designed around the
Partitions applied to the data, typically a minimum of some datetime series, and usually extending
to other data attributes, like a customer, for example.

Because of the size of the data in the Data Lake file systems, it’s an absolute necessity that the data
be “broken-up” aka partitioned according to one or more attributes in that data table.
Otherwise, data of that size cannot be queried and accessed in a reasonable timeframe. Think
of full scans running on every single file of a few-hundred-terabyte data source; of course, that is not
acceptable.
SQL relational databases on the other hand rely on primary and secondary keys to be able to seek,
join, and filter the datasets quickly. The data models of classic relational databases are typically
normalized, leading to many smaller tables with one or two large fact tables, typically designed
around the primary key, or uniqueness value of a data set.
A typical Spark SQL query on a Data Lake might look as follows.

1 SELECT fact.*
2 FROM fact.{s3://location}
3 INNER JOIN dim1.{s3://location} ON fact.key = dim1.key
4 WHERE fact.date BETWEEN x AND y AND fact.partition_key = x AND dim1.partition_key = y

Walking the data model line between old and new.


It’s a fine line to walk. There are many attributes of the old-school Data Warehouse, tried and true and tested
for decades, that should be brought into the new Data Lake. But the differences in the technology
stacks behind the scenes affect the data model.
We can all agree that modeling data as Facts and Dimensions is a tried and true way of organizing
data for a Data Lake or Data Warehouse, the thought process around understanding what is a Fact
table and what is a Dimension table enables high-value data models that can support a myriad of
analytic use cases.
With file-based data models though, we cannot support a large number of small lookup and
dimension tables; our big data tools will not respond well to such workloads. Whereas with a SQL
relational data model, it would be foolish not to normalize the data to the extreme.
In file-based data models, our data tables will be centered around and designed with partition keys
at the forefront. In file-based big data models, we don’t care about the size of large tables; the bigger
the better. The tooling built around Data Lakes was MADE to work at the terabyte level and above. In
our SQL relational models, we have to normalize the tables and break them up as much as possible
in an effort to keep the data size from exploding.
Chapter 9 - Data Quality
Data quality is a topic that is becoming more and more popular. In the past, it’s always been sort of
an afterthought, with solutions put in place after the fact that try to solve and identify problems with
data quality. But, with the explosion of the size and types of data stored in the cloud, data quality has
become a real and formidable problem for anyone to tackle. Expect to see more and more discussion
and products around this topic of data quality in the future.

• What is Data Quality
- Business Quality
- Data Value Quality
• Measures of Data Quality

This chapter will be shorter than others. We will cover just some high-level basics about what to
look for and what you can try to achieve when implementing data quality yourself. This is a new
area of data engineering that has come into focus, so there is still a lot to learn and very few reliable
tools available.
Most Data Quality tools today are made and implemented inside organizations big and small. So,
really what we need is an understanding of what Data Quality is, how we can measure most data
quality, and what some different implementations might look like.

What is Data Quality.


What is data quality? This is a difficult topic to put into precise words, mostly because every data
source is different and has different business logic and requirements that surround it.
Data quality refers to the accuracy, completeness, consistency, and timeliness of data. It is important
because the decisions and insights derived from data can only be as good as the quality of the
data itself. Poor data quality can lead to incorrect or incomplete analyses, unreliable reports, and
ultimately, poor decision-making.
Here are some reasons why data quality is important:
Trustworthiness:
Good data quality inspires trust in the data and the decisions made from it. It ensures that
stakeholders have confidence in the data and the insights derived from it.

Compliance:
Data quality is essential for regulatory compliance. Organizations must be able to demonstrate that
they have taken appropriate measures to ensure the accuracy and completeness of their data.
Efficiency:
Poor data quality can lead to wasted time and resources spent on cleaning and correcting data.
Ensuring good data quality can streamline processes and improve efficiency.
Cost Savings:
Good data quality can result in cost savings. It can help organizations identify areas for improvement,
reduce errors and redundancies, and avoid the costs of poor decision-making.
Competitive Advantage:
High-quality data can give organizations a competitive advantage by providing insights that others
may not have. Good data quality can help organizations make better decisions, develop better
products and services, and improve customer experiences.
Data quality is essential for making informed decisions and achieving business goals. It is important
for organizations to invest in data quality initiatives to ensure that their data is accurate, complete,
consistent, and timely.
But we can probably reason about some common Data Quality issues that come up over and over,
regardless of data source and type. Data Quality can probably be split into two main
headings: the “business” quality standards, and the “data value” qualities of each data source and
sink. Let’s dive into each of these a little more.

• Business logic translated into data quality.
• Data value quality.

Data Quality is also extremely important from an end-user and business point of view. It’s typically
poor data quality that users of our data notice first, and it immediately reduces their trust in our data
products. Once that trust is lost, it’s very hard to gain back.
Good Data Quality can ensure that trust and long-term use of the data and pipelines engineers work
so hard on every day.

Business Quality
When you think about data quality, you might rightly think about a column that should not be NULL,
and this is correct. But, such simple constraints can easily be built into most data stores. And yes,
these simple constraints are core to data quality but are usually not the problems that are difficult
to solve.
What is at the core of data quality?

• Human reasoning about data.
• Double meanings or standards.

Example

To make this more concrete, consider what the business might care about for this example. Let’s say
we have a system that produces scores for each of our clients on a daily basis.

1 >> example.csv
2 client_id | scores
3 1234 | 15
4 4567 | 25
5 7899 | 0

Now maybe the engineering team put in checks to make sure no client gets a NULL score, the business
told us everyone should always have a score every day. But, what happens when in a panic, some
Product Manager calls us up saying our most important client got a score of 0 today, and they had
20 yesterday.
Maybe the business logic requires not only that every client has a score every day, but that the score
also has to be in the range of 5-30. We can see from our example our data has suspect quality from
the business perspective, although any engineer reviewing the data might see no NULL values and
consider the data to be complete and accurate.
This difference is a very important concept to grasp.

Reasoning about data.


The business end of Data Quality is more about how we “reason” about the data as the humans who
consume it. The humans and the applications we build expect certain pieces of information to
be true about the data we use; built-in assumptions are all around us.
For example, we might have a data field called response_code, and we expect that field to contain
a STRING value, but not just any string: either APPROVED or DECLINED, not a blank string ‘’, or
even a misspelled word like APROVED.
This sort of Data Quality seems silly, but think of the downstream aspects. If we have a Data Lake
or Warehouse that end-users and scientists use to gather data and information, often using queries
that filter by a response_code of APPROVED, guess what: now all that downstream data is wrong,
because we have records that say APROVED that don’t meet that criteria when they should.

Double meanings.
It’s also imperative in our quest for data quality to identify and deal with all kinds of misunderstandings.
Misunderstandings can come from many sources; it could be bad field names, data points that
are too similar, or just ambiguity.

Many times businesses struggle with data quality simply because there are no concrete quality
definitions and rules surrounding the data. Just like in life, in data, many people may refer to the
same data but be talking about different things, or believe different things about that data.
This is a simple example, but it shows the importance of business data quality and how we reason
about the business values contained in each data point. It is an often overlooked part of data quality
that leads to poor user experience and throws a shadow on the reliability of our
data.

Data value quality.


On the other end of the spectrum from the business data quality surrounding how we reason about
quality is the physical data values and ranges contained in each data point.
These value data qualities could be simple ideas and constraints that many are familiar with like
NOT NULL or NULL, or even value CONTAINS x, y, or z. Typically you will find that a data set is
a mix of two general types of data.

• Can contain any value.
• Values are in a range.

If you have a description column, there is a good chance that the values will range all over the board,
a mix of numbers and characters of varying lengths. On the other hand, you might have an amount
column that holds dollar values. Some obvious constraints are that it should be a positive decimal
number holding up to 2 decimal places.
When you break data values down, many times there is some sort of expected range of values for
each column or data point. Maybe all the values should be one from a list of strings like YES and
NO. It could be that it’s a numerical value that should always hold an integer in the range of 1 to
1000. Whatever the case is, as your data grows in size and complexity it becomes a very serious and
complex business to know what is happening over time to all those data points.

Measures of Data Quality.


Let’s talk about a few basic data quality checks that should be performed on all data before it’s
ingested into any pipeline.

• Correct header or column names.
• Correct file formatting.
• Correct datatypes.
• Values ranges and values integrity.

All of the above measures are core to ensuring long-term data quality, and the best part is they are
all fairly easy to approach and check with code in an automated fashion. These factors should be
checked on each data set before it’s ingested, avoiding costly pipeline downtime and bad data being
propagated downstream causing trust and reliability issues.

Correct Header or Column Names.


Probably one of the most common causes of pipeline and data failures any data engineer will
experience is unknown and unanticipated changes in column names. It could be new columns,
columns going away, or just simple name changes.
The problem is that most pipelines and data transformations that are even moderately complex rely
on an expected set of column names from which the source data is pulled. The minute a column
name changes, a pipeline or some process shortly thereafter is typically going to break.
Luckily, this is probably one of the easiest data quality tests to implement in any system. Each time
a new file or data set hits our staging environment, a server, or a cloud bucket, it’s fairly simple to
quickly open that file and pull its first line, the header. If anything has changed at this point, which is
easy to check, we can automate alerts and quarantine or move the “bad” file or data into another area
out of harm’s way.
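A minimal sketch of such a header check in Python; the expected column names and file path are hypothetical:

import csv

EXPECTED_HEADER = ["client_id", "score", "created_at"]

def header_is_valid(path: str) -> bool:
    # Read only the first line of the file and compare it to the expected header.
    with open(path, newline="") as handle:
        header = next(csv.reader(handle))
    return header == EXPECTED_HEADER

if not header_is_valid("incoming/example.csv"):
    print("Header changed, quarantine the file before it reaches the pipeline.")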

Correct File Formatting.


Another easy topic that is often the root cause of many data pipeline breakages in data engineering
is simply the format of the file or data being ingested. This is probably the second most common
error seen in most data ingestion systems.
Typically this can simply take the form of the wrong file extension showing up. It could be that we normally
get .txt files and all of a sudden a .csv file appears, or the opposite. Whatever the case might be
for the expected file formats, it’s very easy to write automated checks on file extensions to determine
if we have received what our system is expecting.
Closely related to the file extension we are expecting is the actual data format, typically the
delimiters, if applicable, of the dataset in question. Even if a pipeline receives a .txt or .csv file
as an input as expected, what’s to say the data contained in the file is actually in the correct format?
Did the file delimiter change from a comma to a pipe, or vice versa? These are easy mistakes to make,
and therefore a very common problem data engineers will deal with. By reading a dataset
and grabbing the first line after the header, if it exists, we can easily and quickly check the format
without reading the entire file.
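Here is one possible sketch of that kind of check; the expected extension, delimiter, and file path are assumptions for illustration:

from pathlib import Path

EXPECTED_SUFFIX = ".csv"
EXPECTED_DELIMITER = ","

def file_format_is_valid(path: str) -> bool:
    file_path = Path(path)
    # Check the extension first.
    if file_path.suffix != EXPECTED_SUFFIX:
        return False
    # Peek at one data line (after the header) to confirm the delimiter.
    with file_path.open() as handle:
        next(handle)
        first_row = next(handle)
    return EXPECTED_DELIMITER in first_row

print(file_format_is_valid("incoming/example.csv"))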

Correct data types.


After you have worked on the quality of the data file itself, data types are the next area of concern.
Data types are important, but they often take a back seat in most data quality systems and discussions.

• Mixing data types in columns creates problems.
• Don’t make every column a STRING.
• Understand the difference between a FLOAT and INTEGER.

There are a few common mistakes, and data quality checks that can be put in place, that save a lot
of problems later on.
Mixing data types in a column is a very common problem in data engineering, and creates numerous
data quality issues later on.

1 name | amount
2 Bill | 5.55
3 Bob | $4.99

This is a data quality issue that will break anything trying to do math or aggregations on the amount
column, for example. A simple example, but extremely common; the lesson is to be firm about data
types, because many times applications and code rely upon assumptions about the data.
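A small pandas sketch of how a mixed-type column like the one above can be flagged before it breaks anything downstream:

import pandas as pd

df = pd.DataFrame({"name": ["Bill", "Bob"], "amount": ["5.55", "$4.99"]})

# Coerce to numeric; values that cannot be parsed (like "$4.99") become NaN,
# which flags the rows that would break downstream math or aggregations.
parsed = pd.to_numeric(df["amount"], errors="coerce")
print(df[parsed.isna()])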

Values ranges and values integrity.


Probably the second most common cause of data quality issues is the value ranges and expectations
around certain data points. This typically wanders into the topic of business-based data quality
checks.
It’s very uncommon for the business that either produces or consumes data not to care about the
acceptable range of values for any given data point. Of course the business cares, even if
they don’t bring it up, because as soon as a value is out of range or obviously wrong, all of a sudden
they will start caring.

1 name | amount | quantity
2 Bill | 5.55   | 55
3 Bob  | 4.99   | 0

For example, logic based on unspoken engineering and business principles tells us that
we should always have an integer value associated with an order.
But did anyone ever discuss whether 0 is an acceptable order quantity? What does it mean if someone has
a 0 value? Does that mean there is a system bug, or is it a valid order?
These types of data range and integrity of values seem to be easy on the surface, but after you review
each data column you might be dealing with, things could get complicated quickly.
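As a tiny illustration before we look at a dedicated library, a plain pandas range check on the example above might look like this; the acceptable ranges are assumptions the business would need to confirm:

import pandas as pd

orders = pd.DataFrame({
    "name": ["Bill", "Bob"],
    "amount": [5.55, 4.99],
    "quantity": [55, 0],
})

# Is a quantity of 0 a bug or a valid order? Flag it so someone has to decide.
suspect = orders[(orders["quantity"] < 1) | (orders["amount"] <= 0)]
print(suspect)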

Data Quality Applied


Great Expectations is a Python-based library for data quality validation, which provides a wide
range of data quality checks that can be performed on datasets. Here are some examples of data
quality checks that can be performed using Great Expectations:
Null values check:
This check verifies whether the dataset contains any null values or not. It can be performed using
the expect_column_values_to_not_be_null() function.
Data type check:
This check verifies whether the data type of a column is consistent with the expected data type. It
can be performed using the expect_column_values_to_be_of_type() function.
Range check:
This check verifies whether the values in a column fall within a specified range. It can be performed
using the expect_column_values_to_be_between() function.
Uniqueness check:
This check verifies whether the values in a column are unique or not. It can be performed using the
expect_column_values_to_be_unique() function.
Pattern check:
This check verifies whether the values in a column match a specified pattern or regular expression.
It can be performed using the expect_column_values_to_match_regex() function.
Completeness check:
This check verifies whether the dataset contains all expected columns and whether the columns
have non-null values. It can be performed using the expect_table_columns_to_match_ordered_list()
and expect_column_values_to_not_be_null() functions, respectively.
Great Expectations also provides support for custom data quality checks, allowing users to create
their own checks based on their specific requirements.
By performing these and other data quality checks using Great Expectations, users can ensure that
their datasets are of high quality and suitable for analysis and decision-making.
Here is what some of those checks look like in code:
Null values check: This check verifies whether a column contains any null or missing values. It can
be performed using the expect_column_values_to_not_be_null() function:

1 import great_expectations as ge
2 df = ge.read_csv('my_data.csv')
3 null_check = df.expect_column_values_to_not_be_null('column_name')
4 print(null_check)

Data type check: This check verifies whether the data type of a column matches an expected data
type. It can be performed using the expect_column_values_to_be_of_type() function:

1 dtype_check = df.expect_column_values_to_be_of_type('column_name', 'int64')
2 print(dtype_check)

Range check: This check verifies whether the values in a column fall within a specified range. It can
be performed using the expect_column_values_to_be_between() function:

1 range_check = df.expect_column_values_to_be_between('column_name', min_value=0, max_\
2 value=100)
3 print(range_check)

Uniqueness check: This check verifies whether the values in a column are unique. It can be performed
using the expect_column_values_to_be_unique() function:

1 unique_check = df.expect_column_values_to_be_unique('column_name')
2 print(unique_check)

Pattern check: This check verifies whether the values in a column match a specified pattern or regular
expression. It can be performed using the expect_column_values_to_match_regex() function:

1 pattern_check = df.expect_column_values_to_match_regex('column_name', '^\d{3}-\d{2}-\
2 \d{4}$')
3 print(pattern_check)

Completeness check: This check verifies whether a dataset contains all expected columns and
whether the columns have non-null values. It can be performed using the
expect_table_columns_to_match_ordered_list() and expect_column_values_to_not_be_null() functions, respectively:

1 columns_check = df.expect_table_columns_to_match_ordered_list(['column_name_1',\
2 'column_name_2', 'column_name_3'])
3 not_null_check = df.expect_column_values_to_not_be_null('column_name')
4 print(columns_check, not_null_check)

Great Expectations provides a wide range of data quality checks, and also supports custom data
quality checks, allowing users to create their own checks based on their specific requirements.
Chapter 10 - DevOps for Data Engineers
DevOps for Data Engineers covers a lot of topics and technologies. Depending on the size of the
company a Data Engineer works at, they may or may not get very involved in DevOps and
CI/CD. Many companies have a team of Platform Engineers and others whose job it is to make
development simple and easy for other engineers.
DevOps is a methodology or culture that aims to improve collaboration and communication
between software development and operations teams in order to increase efficiency and reduce
time-to-market for software products. The term “DevOps” is a combination of “development” and
“operations”.
Traditionally, development and operations teams have worked separately, often leading to conflicts
and delays in the software development life cycle. DevOps seeks to bridge this gap by creating a
collaborative and iterative process that involves all stakeholders in the software development process,
from planning and development to testing and deployment.
DevOps is based on a set of principles that include continuous integration, continuous delivery,
and continuous deployment. Continuous integration refers to the practice of regularly merging
code changes into a shared repository, while continuous delivery ensures that code changes are
automatically built, tested, and deployed to production. Continuous deployment takes this one step
further by automatically deploying changes to production without manual intervention.
DevOps also emphasizes automation and monitoring, using tools such as configuration management
and infrastructure as code to automate the deployment and management of software applications.
Monitoring tools are used to track performance and identify issues, allowing for rapid response and
continuous improvement.
Overall, DevOps is a culture of collaboration and continuous improvement that seeks to break down
silos and streamline the software development process. It helps organizations deliver software faster,
with higher quality, and with better alignment to business objectives.
But, many companies on the smaller side don’t have the resources to have a whole group of people
working in DevOps and CI/CD. What I want to do in this chapter is give a quick run-through of
some of the most important DevOps topics that will give you the biggest bang for your educational
bucks.

• Dockerfiles and docker-compose
• Unit and Integration tests
• CI/CD options
• Automation

DevOps applied to Data Engineering


DevOps can improve data engineering in a number of ways:
Collaboration:
DevOps emphasizes collaboration between development and operations teams, and this can also
extend to data engineering teams. By working together, data engineers can better understand
the needs of other teams and ensure that data pipelines are aligned with the overall goals of the
organization.
Continuous Integration and Delivery:
DevOps practices such as continuous integration and delivery can help data engineers to deliver
data pipelines more quickly and with higher quality. By automating testing and deployment, data
engineers can reduce errors and improve the reliability of data pipelines.
Infrastructure as Code:
Infrastructure as code (IaC) is a key DevOps practice that involves managing infrastructure using
code. This can be particularly useful for data engineering, where data pipelines can be managed as
code and automated testing and deployment can be implemented.
Monitoring and Alerting:
DevOps emphasizes monitoring and alerting, and this can also be applied to data engineering. By
monitoring data pipelines and setting up alerts for anomalies or errors, data engineers can quickly
identify and fix issues before they become critical.
Agile Development:
DevOps emphasizes agile development methodologies, which can help data engineering teams to be
more responsive to changing business needs. By working in sprints and delivering small, incremental
changes, data engineers can quickly respond to new requirements and ensure that data pipelines are
aligned with business objectives.
Overall, DevOps practices can help data engineering teams to deliver high-quality data pipelines
more quickly and efficiently, with better alignment to business goals. By collaborating with other
teams, automating testing and deployment, managing infrastructure as code, and implementing
monitoring and alerting, data engineers can improve the reliability, scalability, and agility of their
data pipelines.

Dockerfiles and Docker-compose.


The lack of a Dockerfile in any data pipeline and repo I explore tells me everything I need to know
about the quality and setup of the codebase. Most people in the data world live their lives without
it, thinking that containerization is for the software engineers of the world, but this is not the case.
If anything, the Data Engineering and Data Science worlds have more of a use case for Dockerfiles
than most.

Dockerfiles level the playing field by pre-packaging all the OS and system requirements and
dependencies into a single easy-to-use source.
Docker has become an important tool for DevOps teams for several reasons:
Portability:
Docker containers are self-contained and can be deployed on any machine, regardless of the
underlying operating system. This makes it easy to move applications between development, testing,
and production environments.
Consistency:
Docker ensures that the application and its dependencies are packaged together and run consistently
across different environments. This eliminates the “works on my machine” problem and ensures that
the application behaves the same way in all environments.
Scalability:
Docker makes it easy to scale applications horizontally by adding more containers. This allows
applications to handle increased traffic or load without requiring changes to the underlying
infrastructure.
Versioning:
Docker allows developers to version their applications and dependencies, making it easy to roll back
to a previous version if there are issues with the current version.
Collaboration:
Docker makes it easy to share and collaborate on applications and dependencies. Developers can
share Docker images through registries, allowing other team members to quickly deploy and test
the application.
Overall, Docker plays a crucial role in DevOps by enabling developers to package and deploy appli-
cations quickly and consistently, regardless of the underlying infrastructure. This helps to streamline
the development process, increase efficiency, and improve collaboration between development and
operations teams.

Why data needs Dockerfiles.


It’s pretty common today for most Data Engineering/Data Science/ML workloads to be Python-heavy.
What is the best and worst part about PyPI and Python packages? They are incredibly
finicky, break easily, cause requirement conflicts, and require a large amount of magic to not break
over time.
What else is common for data workloads and pipelines? Relational databases, and the connections
that go with them. Anything else? An amazing number of command-line tools.
Could there be more reasons? I’m glad you asked; yes, there are. Typical complex
data pipelines and codebases require environment variables, configurations, specific directories, and
a particular code layout.

Dockerfiles solve the complexity problem.


This is what a Dockerfile is for. Why not make life easier for yourself and others? With a simple
docker run or docker-compose up command, everything that is needed to run and test pipeline code
is at your fingertips. All the setup complexity is written once and hidden away, rarely to be messed
with again.
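As a rough sketch (the image and script names here are hypothetical), the whole setup can collapse into a couple of commands:

# build the image once, with OS packages, Python dependencies, and config baked in
docker build --tag my-pipeline .

# run the pipeline (or its tests) inside the container
docker run --rm my-pipeline python3 run_pipeline.py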

Reasons to use a Dockerfile for data pipeline(s).


• No surprise updates and breakages due to OS or package updates.
• Easier onboarding of new engineers into the codebase.
• Requirements, configuration, and env vars all become easier to manage.
• Everyone is on the same page, with no Windows vs Mac vs Linux gotchas.
• Easier to transition code into distributed environments (think Kubernetes).
• Better DevOps (code deployment) and unit/integration testing.
• Makes you better at the command line (which makes you better in general).

Getting started with Dockerfiles.


The first thing to do is install Docker Desktop; it is easy to use and easy
to install.
There are two (probably three) options to write/use Dockerfiles for data pipelines.
First, it’s good to understand DockerHub. It’s where pretty much every project under the sun, plus
some, stores official images for your use. Need to run Apache Spark? Why install it on your
machine when you can get an image with it already installed? Got a Python-based project? Why
not just use one of the many Python images available?
These pre-built images can be obtained with a simple …

docker pull python # or whatever else

The other option is to build your own Dockerfile, based on whatever OS you want, with whatever
packages and tools you need, even layered on top of someone else’s image.
Let’s take the example of someone who builds pipelines that run in AWS on Linux-based images.
You want a good development base that is as close as possible to, or exactly like, production, correct? So
you build a Dockerfile that has, say, Python and Spark, based on Linux, with the aws cli installed.

FROM ubuntu:18.04

RUN apt-get update && \
    apt-get install -y default-jdk scala wget vim software-properties-common python3.8 python3-pip curl unzip libpq-dev build-essential libssl-dev libffi-dev

RUN wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz && \
    tar xvf spark-3.0.1-bin-hadoop3.2.tgz && \
    mv spark-3.0.1-bin-hadoop3.2/ /spark && \
    ln -s /spark spark

RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && \
    unzip awscliv2.zip && \
    ./aws/install

WORKDIR /code
COPY . /code

RUN pip3 install -r requirements.txt

ENV MY_CODE=./code

It’s just an example, but you get the point: defining a complex set of tools that won’t easily be
broken, and that all developers and users of the pipeline can use, is a simple and powerful way to
make development, testing, and code usage easy for everyone.
Usually, a Dockerfile like this, stored with the code, can be built using a simple command.
Usually, a Dockerfile written like this stored with the code can be built using a simple command.

docker build --tag my-special-image .

Docker-compose.
Also, make sure to read up on docker-compose, a great way to automate running tests and bits of
code. Docker-compose is simply a way to have multiple services and images work and talk
together, a great tool for more complex projects; a sketch of what that can look like follows.
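As a rough, hypothetical sketch of docker-compose wiring two services together, here is a pipeline container alongside a Postgres instance it talks to (the service names, image names, and credentials are all made up):

version: "3.9"
services:
  warehouse:
    image: postgres:13
    environment:
      - POSTGRES_USER=pipeline
      - POSTGRES_PASSWORD=pipeline
      - POSTGRES_DB=analytics
  pipeline:
    image: my-pipeline
    depends_on:
      - warehouse
    environment:
      # the pipeline reaches Postgres by its service name on the compose network
      - DATABASE_URL=postgresql://pipeline:pipeline@warehouse:5432/analytics
    command: python3 run_pipeline.py

A single docker-compose up pipeline then brings up the database and the pipeline together.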

Dockerfile summary.
Dockerfiles are far from rocket science; they are probably one of the easiest things to learn, even as
a new developer. Like anything else, they can get complicated when running multiple services, but
basic usage of a Dockerfile will give you 80% of what you need upfront.

I also believe Dockerfiles in general force a more rigid development structure that is missing from a
lot of data engineering codebases. When you find Dockerfiles, you are more likely to find unit tests,
documentation, requirements files, and generally better design patterns.

Unit Testing.
There are few things in life worse than cracking open some serious pipeline code, realizing there
isn’t a single function written to encapsulate logic, and wondering if the change you are about to
make will bring down the whole pipeline.
Why unit testing?

• increase learning
• leads to better quality code
• reduce bugs
• reduce development time over the long run
• increase developer confidence
• can be automated into DevOps and CI/CD

Increase learning.
When you are new to a codebase you don’t know what you don’t know; you have no backstory,
and you are usually flying by the seat of your pants in the beginning. When there are no unit tests,
usually the only other way to test changes to a pipeline is to run it, which is sometimes easier said
than done in a development environment. The first line of defense should be unit tests across the entire
pipeline.
Also, simply reading the unit tests can be a great way to start learning a codebase, before
trying to read the whole thing.

Leads to better quality code.


When it comes to unit testing pipeline code, there is at least a baseline that must be followed: the
critical ETL transforms of a script should be encapsulated inside a method or function.
It doesn’t matter if you are more of an OOP or a functional programmer, or both; anything is better
than not breaking your code up into logical units that can be tested. This is how unit testing pushes
the quality level of code up a few notches. You can’t test code that isn’t testable, so you have to write
pipelines in such a way that they can be.
This is what I mean.

dataframe = dataframe.where(dataframe['that_column'] == 'this_value').groupBy('another_column').agg(F.sum('yet_another').alias('new_column')).select('another_column', 'new_column', when(dataframe['new_column'] > 1, 'yes').otherwise('no'))

Now imagine a pipeline with line after line of this type of code, without a single piece of logic
encapsulated in a method or function. This is way more common than people would think; I would
venture a guess that 80% of pipelines are written this way.
Instead, if we want testable code, we would do the following.

def filter_and_aggregate(input_df: DataFrame) -> DataFrame:
    output_df = input_df.where(input_df['that_column'] == 'this_value').groupBy('another_column').agg(F.sum('yet_another').alias('new_column')).select('another_column', 'new_column', F.when(F.col('new_column') > 1, 'yes').otherwise('no'))
    return output_df

Now this is code that can be called in a unit test and reused, making the codebase cleaner and
more modular.

Reduce bugs.
An important part of DevOps is automating processes, which reduces the number of bugs that make
it into production. Automated unit tests are probably the number one way we can use DevOps to
reduce the chances of bugs getting out.
Making sure we unit test our code, and making the running of those tests automated, reduces our
risk of errors and protects developers from human mistakes.
You might be new to testing in general, so I want to take this chance to do a deep-dive example of
unit testing an imaginary PySpark pipeline.

Deep Dive PySpark Unit Testing.


Say we have a new PySpark pipeline we’ve had to develop for work. We want to be good DevOps
and unit testing advocates, so here is what we do.

Setup directories and pytest.

Obviously, we need a test directory, a file to hold our unit tests, test_pipeline.py, and a file specific
to pytest called conftest.py.

>> mkdir tests
>> touch tests/test_pipeline.py
>> touch tests/conftest.py

Let’s talk about conftest.py for a minute.


Our conftest.py is going to hold a pytest fixture, which is a way to initialize each test function with
whatever we want. In our case, to test PySpark code, we are going to need a Spark Session.

# >> vim tests/conftest.py

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark_session():
    spark = SparkSession.builder.master("local[*]").appName("test").getOrCreate()
    return spark

Now we have a fixture that will provide a Spark Session for each of our unit tests, so we can test
our code on actual Spark dataframes.

Write your first PySpark unit test.


Let’s go back to that original function we wrote. The first step is to simply put the logic inside
a function; no rocket science here.

import pyspark.sql.functions as F
from pyspark.sql import DataFrame

def sample_transform(input_df: DataFrame) -> DataFrame:
    inter_df = input_df.where(input_df['that_column'] == F.lit('hobbit')).groupBy('another_column').agg(F.sum('yet_another').alias('new_column'))
    output_df = inter_df.select('another_column', 'new_column', F.when(F.col('new_column') > 10, 'yes').otherwise('no').alias('indicator')).where(F.col('indicator') == F.lit('yes'))
    return output_df

Now we actually have a function to unit test. So let’s write the test; we will crack open our test file…

# >> vim tests/test_pipeline.py

import pytest
from mycode import sample_transform

@pytest.mark.usefixtures("spark_session")
def test_sample_transform(spark_session):
    test_df = spark_session.createDataFrame(
        [
            ('hobbit', 'Samwise', 5),
            ('hobbit', 'Billbo', 50),
            ('hobbit', 'Billbo', 20),
            ('wizard', 'Gandalf', 1000)
        ],
        ['that_column', 'another_column', 'yet_another']
    )
    new_df = sample_transform(test_df)
    assert new_df.count() == 1
    assert new_df.toPandas().to_dict('list')['new_column'][0] == 70

If we run our function sample_transform against our sample dataframe, the following is the output.

+--------------+----------+---------+
|another_column|new_column|indicator|
+--------------+----------+---------+
|        Billbo|        70|      yes|
+--------------+----------+---------+

This is what we are trying to validate: that our filter and conditional logic are done correctly and
we get the expected outcome.

Keys to writing good unit tests.


As you can see, writing unit tests for pipeline code isn’t very hard at all. It doesn’t take much effort,
and it protects you and others by ensuring that changes, and the initial code being written, transform
after transform, are done correctly. It also pays to test more than the happy path, as in the sketch below.
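As a minimal sketch of an extra edge-case test, added to the same test_pipeline.py and assuming the same sample_transform function and spark_session fixture as above, we might check that non-matching input produces an empty result:

@pytest.mark.usefixtures("spark_session")
def test_sample_transform_no_matches(spark_session):
    # no 'hobbit' rows at all, so the filters should remove everything
    test_df = spark_session.createDataFrame(
        [
            ('wizard', 'Gandalf', 1000),
            ('elf', 'Legolas', 5)
        ],
        ['that_column', 'another_column', 'yet_another']
    )
    new_df = sample_transform(test_df)
    assert new_df.count() == 0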

Setting up Docker to run our PySpark unit tests.


The next piece of unit testing our code is having somewhere to test it that isn’t a production
environment, somewhere anyone can do it: Docker, of course. You will need a few items.

• Dockerfile
• Docker compose file

The Dockerfile doesn’t need to be rocket science: a little Ubuntu, Java, Python, Spark …
Using the file below, run docker build --tag spark-test .

FROM ubuntu:18.04

RUN apt-get update && \
    apt-get install -y default-jdk scala wget vim software-properties-common python3.8 python3-pip curl unzip libpq-dev build-essential libssl-dev libffi-dev python3-dev && \
    apt-get clean

RUN wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz && \
    tar xvf spark-3.0.1-bin-hadoop3.2.tgz && \
    mv spark-3.0.1-bin-hadoop3.2/ /usr/local/spark && \
    ln -s /usr/local/spark spark

WORKDIR /app
COPY . /app
RUN pip3 install cython==0.29.21 numpy==1.18.5 && pip3 install pytest pyspark pandas==1.0.5
ENV PYSPARK_PYTHON=python3

And our docker-compose file:

version: "3.9"
services:
  test:
    environment:
      - PYTHONPATH=./src
    image: "spark-test"
    volumes:
      - .:/app
    command: python3 -m pytest

Now if you build your Dockerfile in your project, you should be able to just run docker-compose up
test, and there you have it: your unit tests for PySpark running with one easy command.

test_1 | ============================= test session starts ==============================
test_1 | platform linux -- Python 3.6.9, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
test_1 | rootdir: /app
test_1 | collected 1 item
test_1 |
test_1 | test/test_pipeline.py .                                                   [100%]
test_1 |
test_1 | ============================== 1 passed in 5.08s ===============================

Unit testing your pipeline code really isn’t that hard, and it saves a lot of time later down the road.
Being able to make changes to code and have some idea of whether you are breaking anything, without
running the entire pipeline, is kind of nice. Data Engineers have gotten a bad rap over the years because
of an unwillingness to incorporate basic software engineering principles like unit tests.
Obviously, from the above, it doesn’t take much effort to add unit tests for your Spark code. Encapsulate
your logic, write a few Docker files, and you are off to the races.

CI/CD.
You may not like it, but CI/CD is a major part of DevOps and has become critical to good Data
Engineering pipelines. Let’s face it, we all know we do too many things manually, and it gets even
harder when we have multiple people working in the same repositories.
At some point you cannot rely on any one person, no matter how smart, to manually sync
code or build a new Docker image. This is where CI/CD comes into play.
There is a myriad of tools that will probably continue to grow and expand: Jenkins, GitLab/GitHub
CI/CD runners, and so on. Many tools have been built around taking actions when something
happens in a repository. This is something that all Data Engineers need to learn.

Automation is the name of the game.


What most CI/CD and DevOps boils down to is automation. It doesn’t matter what the problem is;
if you have developers spending a lot of time running the same commands over and over again, say
to build, tag, and deploy a Docker image, this is something DevOps can help with.
You need to be comfortable around the command line and bash, and with different OS
environments.
Being able to stitch together multiple commands and API calls is key to being a good Data Engineer.
A rough sketch of what that automation can look like follows.
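As one hedged example, the workflow below is a hypothetical sketch using GitHub Actions; the file path and step names are made up, your CI tool of choice may differ, and it assumes the spark-test image and compose file from the unit testing section above.

# .github/workflows/ci.yml -- hypothetical sketch
name: data-pipeline-ci

on: [push]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # build the same image developers use locally
      - name: Build test image
        run: docker build --tag spark-test .
      # run the PySpark unit tests inside the container and fail the job if they fail
      - name: Run unit tests
        run: docker-compose up --exit-code-from test test

The point is not the specific tool; it is that the build, tag, and test loop runs automatically on every push instead of relying on someone remembering to do it.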

CI for Data Engineering


Continuous integration (CI) is a DevOps practice that involves automating the build, test, and
deployment of software applications. In the context of data engineering, CI can help to ensure that
data pipelines are built and deployed in a consistent and reliable way. Here are some ways CI can
be applied to data engineering:
Automated Testing: CI involves automating testing as part of the build process. This can be applied
to data pipelines by creating automated tests to check the quality and accuracy of the data being
processed. This can include tests to check for missing or inconsistent data, as well as checks to ensure
that the pipeline is running within expected performance and latency requirements.
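As a small, hypothetical sketch of the kind of automated data check described above (the column names are made up, and this assumes the same pytest and spark_session setup used earlier in the chapter):

def test_no_missing_customer_ids(spark_session):
    # hypothetical quality check: the transformed output should never contain null customer_id values
    df = spark_session.createDataFrame(
        [(1, 'a'), (2, 'b'), (3, 'c')],
        ['customer_id', 'segment']
    )
    missing = df.where(df['customer_id'].isNull()).count()
    assert missing == 0

A check like this can run in the same CI job as the unit tests, so bad data logic is caught before it ships.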
Version Control: CI involves version control for source code and other artifacts. This can be applied
to data pipelines by using version control systems to manage the code and configuration files that
define the pipeline. This allows changes to be tracked, tested, and deployed in a consistent way.
Continuous Delivery: CI involves continuous delivery, which is the practice of automatically
deploying software changes to production environments. This can be applied to data pipelines
by automating the deployment of pipeline changes to production environments. This allows data
pipelines to be updated quickly and reliably, and ensures that the pipeline is always running the
latest version of the code and configuration files.
Infrastructure as Code: CI involves managing infrastructure as code, which means that infrastructure
is defined and managed using code. This can be applied to data engineering by using tools like
Terraform to manage the infrastructure required for data pipelines. Infrastructure as code allows data
engineers to easily create and manage environments, and ensures that environments are consistent
across all stages of the pipeline.
Overall, applying continuous integration practices to data engineering can help to ensure that
data pipelines are built and deployed in a consistent and reliable way, while also improving the
efficiency of the development process. By automating testing, version control, continuous delivery,
and infrastructure management, data engineers can ensure that their pipelines are always up-to-date
and operating at peak performance.
Conclusion
Data engineering is a very special field to learn, and a difficult one. Why? Because it involves a lot of
disciplines. Many of the skills are hard to learn, and being a good data engineer is more than being
able to just write code.
Learning the disciplines of data engineering is a lot about understanding theories and concepts.
The fundamentals are the key to writing good data pipelines.
“… facilitating the movement, storage, and access to data in a repeatable, resilient, and scalable
manner.”
From the moment we start working on the high-level architecture of data pipelines, to writing unit
tests or just laying out our project structure, we have to understand best practices, and that the small
pieces, done correctly and working in conjunction with each other, make great projects and products.
We talked about storage and file types, the fundamental building blocks for data engineers. From
parquet files to flat files, each has its place and time of use.
Compute and resource usage is another bedrock of cost-effective and scalable pipelines. The bigger
the data you start to work on, the more you realize that resource utilization is going to directly
impact the bottom line in your organization.
Don’t forget our discussions on SQL and Data Warehouses and Data Lakes. No matter how hard
you try as a data engineer, you won’t escape SQL and relational databases. Luckily, many of those
fundamental concepts apply well even with changing technology and cloud storage.
My favorite chapter to write was data modeling, a half-art, half-science topic that combines a wide
variety of subjects. Data modeling will always make or break the best-laid plans: done correctly, it
leads to incredible gains for the business; done poorly, it leads to heartache and frustration.
We ended with data quality and DevOps, two topics that are widely ignored by the data engineering
community but are worth their weight in gold.
In the end, I truly believe that being a good data engineer has very little to do with “how well you
write code.” This is what most people focus on, understandably, but to the exclusion of the fundamentals.
Being a good data engineer is about using your experience to make good decisions, and knowing that
sometimes writing no code is better than writing code. You can write beautiful code, but if you didn’t
bother to work on the architecture of your pipeline, or the data model, there is a good chance all
that nice code is going to be deleted or useless.
Being good at writing code will come with time; work on the fundamentals mentioned in this
book first, and the code will follow.
