LECTURE 01
Welcome to Data Engineering!
(INFO 258/DATA 101)
January 22, 2024
Data 101, Spring 2024 @ UC Berkeley
Aditya Parameswaran https://data101.org/sp24
[Enrollment & Logistics] Enrollment is Ongoing; Room can only take 50
● This class can accommodate only 75 students across INFO 258 (for grads) and DATA 101 (for
undergrads)
○ “Skeleton crew” staff of 2 (compared to 8 in Fa23!)
● INFO258: Originally 22 enrolled → Requested enrollment of 30
○ (Thank you if you replied to my email about prerequisites).
○ Remaining will be admitted if there’s room
● DATA101: Originally 25 enrolled → Requested enrollment of 45, so 20 more students will be
enrolled off the waitlist; remaining if there’s room
● Other enrollment concerns:
○ Limited capacity, so no CE students will be admitted
○ No auditing to keep staff workload low.
○ Room capacity is 50 (but lectures will be broadcast on Zoom and recorded), so if we go past
capacity, we will have to turn you away to avoid violating fire codes
All enrollment questions are handled by Data Science Undergraduate Studies ([email protected]) and I
School Enrollment Staff ([email protected]), not instructors. Please email them with any
questions, concerns, or requests for exceptions.
A note about INFO 258/DATA 101
You’ll hear the two used interchangeably - often I might end up using “DATA 101” since it’s a simpler
number
For all practical purposes these are the same class, modulo:
● The grad/undergrad distinction
● One extra project for the grad students
● Grading done separately for grads and undergrads
Intro - Aditya Parameswaran
● Ph.D. in Computer Science from Stanford University (2013).
● Postdoc at MIT (2014).
● Assistant/Associate Professor since 2014; at UC Berkeley since 2019.
● Building better (more scalable, usable, intelligent) data tools
○ Develop (and open-source) tools & write papers about them
○ Tools include: spreadsheet & visualization systems, data science
libraries, computational notebooks
○ Open-source tools downloaded millions of times
○ Even sometimes start companies!
● Second time teaching this class!
○ Joe H and I originally developed this class in 2021 (offered 2x)
○ It’s a very new course! Lots of experts at Cal!
● Random facts about me:
○ I have three creatures who ensure I don’t get quality sleep at night:
two foster cats, and my hyperactive toddler
Our wonderful Spring 2024 Course Staff
Natalie Chan Mackenzie Moffit
Data Engineering: What? Why?
Course Trajectory
Course Logistics
Data Engineering: What? Why?
Lecture 01, Data 101 Spring 2024
Data Science: The Conventional View
A data scientist operating alone, on one static dataset at a time, with a clean
“rectangular” shape and fitting in main memory, employing various statistical
and ML algorithms on predefined objectives.
● From Data 100
● Also the view reinforced by “popular”
Machine Learning, e.g., leaderboards and
Kaggle competitions
● A valuable component, but sadly,
missing the complete picture!
Data Science: The Conventional View Now with Data Engineering
Data Science: The Conventional View
A data scientist operating alone, on one static dataset at a time, with a clean
“rectangular” shape and fitting in main memory, employing various statistical
and ML algorithms on predefined objectives.
● From Data 100
● Also the view reinforced by “popular” Machine Learning, e.g., leaderboards and
Kaggle competitions
● A valuable component, but sadly, missing the complete picture!
Data Science today involves Data Engineering:
A set of activities that include collecting, collating, extracting, moving, transforming,
cleaning, integrating, organizing, representing, storing, and processing data.
● Happens on a large set of messy (often non-rectangular) dynamic and large datasets
● Happens across teams and across the organization
● The team generating the data may not be the same team(s) consuming it
● The objectives are often rather unclear and ill-defined
● A prerequisite (and typically, precursor) to real-world data science & ML
● A lot of data engineering needs to happen to support the conventional view!
Data systems are tools that support data engineering.
The Data Science Industry Now
…once these junior people get to the market, they come in with an unrealistic set of expectations
about what data science work will look like. Everyone thinks they’re going to be doing machine
learning, deep learning, …
This is not their fault; this is what data science curriculums [sic.] and the tech media
emphasize….
The reality is that “data science” has never been as much about machine learning as it
has about cleaning, shaping data, and moving it from place to place.
Vicky Boykis, 2019. [blog]
I personally like 2 more than 1!
[1/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
● Most of the time spent in real-world data science
projects goes to data engineering.
● Often underappreciated compared to other
activities, e.g., ML.
● Data engineering activities (e.g., cleaning,
moving, and processing data) occupy the
majority of time in data science.
[2/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.
“… 70% more open roles at companies in data
engineering as compared to data science. As we
train the next generation of data and ML
practitioners, let’s place more emphasis on
engineering skills.”
Mihail Eric, Jan 2021. [blog]
“Data engineer” has emerged as a new specialized
job category:
● Data scientist: Uses various techniques
in statistics & ML to process & analyze data.
● Data engineer: Develops a robust and
scalable set of data processing tools/platforms.
We’re not going to be too dogmatic about these distinctions, but it’s worth knowing
what industry envisions.
[2/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.
Even bolder claim: data science roles
may disappear!?!
“Many data science teams have not delivered results
that can be measured in ROI by executives.”
Forbes, Feb 2019. [blog]
Many teams have struggled because they can do “ML”
but can’t do data engineering to get to “ML”
“For complex data engineering tasks, you need five data engineers for every one data scientist.”
Essential idea: ML is the easy part (perhaps even more so, given LLMs!) → but can’t be done without data
engineering and data engineers
[3/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.
3. Data engineering is essential to ML/AI.
Even when doing ML, the vast majority of an
ML-powered system is not “ML code.”
In most cases, “ML code” corresponds to calls to
standard libraries, e.g., scikit-learn, pytorch,
tensorflow, etc.
The hard part is getting the data to
the format and quality that these ML
libraries expect!
Sculley et al., SE4ML 2014 [google research].
Data Engineering is Essential in ML/AI
Monica Rogati, 2017 [blog].
Stuff you need to do first! A lot of this is data engineering.
In fact, for any sort of data-driven decision-making
(ML/AI or not) you will need these skills.
Data Engineering is Essential in ML/AI
“More often than not, companies are not ready for AI. Maybe they
hired their first data scientist to less-than-stellar outcomes, or
maybe data literacy is not central to their culture. But the most
common scenario is that they have not yet built the
infrastructure to implement (and reap the benefits of) the most
basic data science algorithms and operations, much less ML.”
“However, under the strong influence of the current AI
hype, people try to plug in data that’s dirty & full of
gaps, that spans years while changing in format and
meaning, that’s not understood yet, that’s
structured in ways that don’t make sense, and expect those
tools to magically handle it.”
Monica Rogati, 2017 [blog].
New role: Machine Learning Engineer
Tomasz Dudek, 2018 [blog].
“ML Engineer”: a specialization
of data engineer focused on
operationalizing ML.
“A need for a person that would reunite
two warring parties. One being fluent
just enough in both fields [Data
Science and Software Engineering] to
get the product up and running.
Somebody taking data scientists’ code
and making it more effective and
scalable. ... Explaining the reasons
behind architectural ideas to the
devops team.”
[4/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.
3. Data engineering is essential to ML/AI.
4. Balance your data techniques with a systems perspective.
As a Data Science major, you are likely familiar with techniques:
statistics/ML concepts & algorithms…
…but you are likely less familiar with systems.
● In this class, you will learn systems and the infrastructure that enables these techniques.
● You’ll start thinking about efficiency, especially on large datasets.
● Various “plumbing analogies”:
data pipelines, data flows, …
Data engineering is as essential as plumbing!
● When it works well, you don’t realize it exists.
● When it doesn’t, you’ll really know.
All these Data Systems!!!
2023 MAD (ML/AI/Data) Landscape: blog, interactive
2023 MAD (ML/AI/Data) Landscape
Data systems is a difficult subject! There are many, many data
systems – too many for us to cover.
● In this class, we will try to cover the key categories and
underlying principles.
● This way, you can make informed decisions about when to use
what type of system.
The Bottom Line
Data engineering is an essential ingredient
of real-world data science projects.
A set of activities that include collecting,
collating, extracting, moving, transforming,
cleaning, integrating, organizing,
representing, storing, and processing data.
The backbone, plumbing, or infrastructure that supports data science.
Understanding these skills will help you:
● Apply skills from intro data science classes to messy, large real-world datasets;
● Get your datasets to the point where you can apply AI/ML;
● Explore new, sought-after, & specialized roles, e.g., data engineer/ML engineer;
● Make informed decisions within the vast and confusing landscape of data systems; and
● Start worrying about efficiency :-)
Course Trajectory
Roots and Foundations
Data Systems has a long history of academic and industrial interplay.
🎢 Academic jargon meets industry buzzwords!
🤝 Formal foundations meets best practices! Will sample from both!
[Figure: company/project logos, not recoverable; captions only]
● Stanford; Founded 2003, IPO, Acq. 2019 (Salesforce)
● MIT; Founded 2005, Acq. 2011 (HPE)
● Founded 2022 based on DuckDB from CWI, $50M of funding
● Founded 1996, one of the most popular open-source databases, with many
startups & established co. offerings
● Founded 2019, $60+M raised
● Founded 2013, Acq. 2022 (Alteryx)
● Founded 2013 based on Apache Spark; one of the hottest pre-IPO startups
● Founded 2021, Acq. 2023 (Snowflake)
Two general foundational approaches:
Code-centric
Main Storage API is files
● AWS S3, Azure File Storage, Google Cloud Storage, HDFS, …
Libraries in general-purpose programming languages, lots of separation
● Spark (Scala/Java) for batch processing
● Ad hoc code (Python/pandas) for exploration
● Metadata tracked in a separate store
Query-centric
Main Storage API is tables
● Snowflake, BigQuery, Redshift, Azure Synapse
● Teradata (founded 1979, still relevant!!)
One language/paradigm for (almost) everything
● Batch: SQL
● Interactive: SQL
● Metadata auto-tracked in database
● Other bytestream data stored in files (e.g. S3)
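As a rough illustration of the two approaches, the sketch below computes the same per-region total twice: once by reading a file and hand-writing the aggregation, and once by loading the rows into a table and describing the result in SQL. The `CSV_TEXT` data, the `sales` table, and the function names are hypothetical; an in-memory string stands in for a file in object storage, and SQLite (from Python's standard library) stands in for a warehouse such as Snowflake or BigQuery.

```python
import csv
import io
import sqlite3
from collections import defaultdict

# Hypothetical sales data; a stand-in for a file in object storage.
CSV_TEXT = """region,amount
west,10
east,5
west,7
east,1
"""


def totals_from_file(text):
    """Code-centric: parse the file and write the aggregation by hand."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(text)):
        totals[row["region"]] += int(row["amount"])
    return dict(totals)


def totals_from_table(text):
    """Query-centric: load rows into a table, then describe the result in SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
    rows = [
        (r["region"], int(r["amount"]))
        for r in csv.DictReader(io.StringIO(text))
    ]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    result = dict(
        conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    )
    conn.close()
    return result


file_totals = totals_from_file(CSV_TEXT)
table_totals = totals_from_table(CSV_TEXT)
```

Both paths produce the same totals. The difference is where the schema lives: in the code-centric version the column names and types exist only in the parsing code, while in the query-centric version the database records them in the table definition.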
Our approach: Query-centric but Open-Minded
Based on formal (theory) query languages: Relational Algebra and Relational Calculus
● Decades of research and development
● RA: procedural, RC: declarative (describe outputs, not algorithms)
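The procedural/declarative contrast can be sketched in a few lines. Below, `select` and `project` are hypothetical Python renderings of the relational algebra operators σ and π over a made-up `STUDENTS` relation, and the caller decides the order in which they run. The SQL query afterwards only states the desired output. SQL is used here as a stand-in for the declarative style of relational calculus (it is not the calculus itself), and SQLite stands in for a database engine.

```python
import sqlite3

# Hypothetical relation: a list of rows, each a dict of attribute -> value.
STUDENTS = [
    {"name": "Ana", "major": "DS", "units": 16},
    {"name": "Ben", "major": "CS", "units": 12},
    {"name": "Cy", "major": "DS", "units": 9},
]


def select(relation, predicate):
    """Selection (sigma): keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """Projection (pi): keep the listed attributes and drop duplicate rows."""
    seen, out = set(), []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attributes, key)))
    return out


# Procedural: we pick the operator order (select first, then project).
ra_result = project(select(STUDENTS, lambda r: r["major"] == "DS"), ["name"])

# Declarative: we describe the output; the engine decides the steps.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, major TEXT, units INTEGER)")
conn.executemany("INSERT INTO students VALUES (:name, :major, :units)", STUDENTS)
sql_result = [
    {"name": name}
    for (name,) in conn.execute(
        "SELECT DISTINCT name FROM students WHERE major = 'DS' ORDER BY name"
    )
]
conn.close()
```

Both versions return the names of the DS majors. Only the first one commits to a particular sequence of operations.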
Structured Query Language (SQL): A domain-specific language for data
● Same language for batch (e.g., transformation) and interactive (e.g., queries)
● Declarative complement to general-purpose languages (which are often imperative)
○ Abstraction: No “overfitting” of code to the task at hand
○ Huge plus for cloud environment: dynamically changing workloads, hardware, data
● Even code-centric libraries increasingly include SQL-like interfaces (e.g., SparkSQL)
○ The pendulum has swung back in favor of SQL from the NoSQL movement of the 2000s
● Decades of extensions, tools, and support
…but nothing’s perfect!
● In practice, for both exploration/data engineering,
you will need extra tools beyond SQL.
● But most recent open-source alternatives are similar.
We’ll teach you the concepts using PostgreSQL (a flavor of SQL). In
practice, you’ll be able to apply these concepts to new tools.
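One way to see the "no overfitting of code to the task at hand" point: with a declarative query, the physical environment can change while the query text stays the same. The sketch below uses SQLite (as a stand-in for PostgreSQL; the `events` table, the index name, and the `plan` helper are all hypothetical) to run the same query before and after adding an index, and uses `EXPLAIN QUERY PLAN` to show that the engine changes how it executes the query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, kind TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(i % 100, "click") for i in range(1000)],
)

# The query text never changes below.
QUERY = "SELECT COUNT(*) FROM events WHERE user_id = 42"


def plan(connection, sql):
    """Return SQLite's human-readable plan steps for a query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


count_before = conn.execute(QUERY).fetchone()[0]
plan_before = plan(conn, QUERY)   # a full scan of the table

# Change the physical design only; the query is untouched.
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")

count_after = conn.execute(QUERY).fetchone()[0]
plan_after = plan(conn, QUERY)    # now an index lookup
conn.close()
```

The answer is identical in both runs; only the plan differs. An imperative loop over the rows would need to be rewritten by hand to take advantage of the new index.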
Class Journey
Relational Model and Algebra
Advanced SQL queries (views, subqueries, window functions, …) [Project 1]
DML, DDL
Referential integrity, index selection, performance tuning [Project 2]
Data transformation and preparation [+ Project 5]
Data wrangling and cleaning [Project 3]
Non-relational data models (Tensors, Spreadsheets, etc.)
Semistructured data (and MongoDB) [Project 4]
ER and normalization, Spreadsheets, Transactions, BI and OLAP, parallel computing,
security and privacy, data pipelines, … [important data engineering topics]
(note: topics are grouped by theme and not to scale with + not in order of the class schedule)
Course Logistics
Syllabus Walkthrough
https://data101.org/sp24/syllabus
Beginning-of-Semester Logistics
Discussion Sections
Start this Thursday 1/25!
Not recorded, but there will be a live Zoom link, and
handouts/solutions will be posted.
Again, “in person” will be capped at room capacity
Office Hours
● This week: just my office hours (held this morning)
● TA office hours start next week.
We are in this class together!
Some of the content will be half-baked & experimental! Please bear with the hiccups.
● i.e., not taught in typical "database" classes.
○ Wherever possible we will emphasize underlying concepts…
…but some of what we say will also be practical advice.
● First time I’m teaching a lecture-based class since Spring 2021 (when we piloted Data 101!)
With all of that said:
● You will be evaluated generously.
● Our goal is for you to learn the material, not to stress you out.
● If you are feeling lost, please reach out.
It is much better to do so than to violate our trust.
● This is especially true given the experimental nature of the class,
our small staff size, and the state of the world.
Use the Extenuating Circumstances form!
We welcome feedback at any time about the course. Contact course staff
at [email protected] or stop by office hours.