Lec 01 - DATA 101 Sp24 - Welcome To Data Engineering!

This document provides an overview and introduction for a data engineering lecture. It discusses: - Enrollment details for the course, which can accommodate 75 students total across graduate and undergraduate sections. - An introduction to the instructor, Aditya Parameswaran, who has a PhD from Stanford and focuses his research on building scalable data tools. - A definition of data engineering as a set of activities including collecting, organizing, and processing data, which is often a prerequisite for data science work. - Reasons for learning data engineering, including that it occupies most time in data science projects, there are more data engineer jobs than data scientist jobs, and it is essential for enabling machine learning applications.

LECTURE 01

Welcome to Data Engineering!

(INFO 258/DATA 101)
January 22, 2024

Data 101, Spring 2024 @ UC Berkeley

Aditya Parameswaran https://data101.org/sp24

[Enrollment & Logistics] Enrollment is Ongoing; Room can only take 50
● This class can accommodate only 75 students across INFO 258 (for grads) and DATA 101 (for
undergrads)
○ “Skeleton crew” staff of 2 (compared to 8 in Fa23!)
● INFO 258: Originally 22 enrolled → Requested enrollment of 30
○ (Thank you if you replied to my email about prerequisites).
○ Remaining will be admitted if there’s room
● DATA 101: Originally 25 enrolled → Requested enrollment of 45, so 20 more students will be
enrolled off the waitlist; remaining if there’s room
● Other enrollment concerns:
○ Limited capacity, so no CE students will be admitted
○ No auditing to keep staff workload low.
○ Room capacity is 50 (but lectures will be broadcast on Zoom and recorded), so if we go past
capacity, we will turn you away to avoid violating fire codes

All enrollment questions handled by Data Science Undergraduate Studies [email protected] and I
School Enrollment Staff [email protected], not instructors. Please email them if you have any
questions or concerns, as well as for any exceptions
A note about INFO 258/DATA 101
You’ll hear the two used interchangeably - often I might end up using “DATA 101” since it’s a simpler
number

For all practical purposes these are the same class, modulo:
● The grad/undergrad distinction
● One extra project for the grad students
● Grading done separately for grads and undergrads

Intro - Aditya Parameswaran

● Ph.D. in Computer Science from Stanford University (2013).

● Postdoc at MIT (2014).
● Assistant/Associate Professor since 2014; at UC Berkeley since 2019.

● Building better (more scalable, usable, intelligent) data tools

○ Develop (and open-source) tools & write papers about them
○ Tools include: spreadsheet & visualization systems, data science
libraries, computational notebooks
○ Open-source tools downloaded millions of times
○ Even sometimes start companies!

● Second time teaching this class!

○ Joe H and I originally developed this class in 2021 (offered 2x)
○ It’s a very new course! Lots of experts at Cal!
● Random facts about me:
○ I have three creatures who ensure I don’t get quality sleep at night:
two foster cats, and my hyperactive toddler
Our wonderful Spring 2024 Course Staff

Natalie Chan, Mackenzie Moffit
Data Engineering: What? Why?
Course Trajectory
Course Logistics

Data Engineering:
What? Why?
Lecture 01, Data 101 Spring 2024

Data Science: The Conventional View
A data scientist operating alone, on one static dataset at a time, with a clean
“rectangular” shape and fitting in main memory, employing various statistical
and ML algorithms on predefined objectives.

● From Data 100

● Also the view reinforced by “popular”
Machine Learning, e.g., leaderboards and
Kaggle competitions
● A valuable component, but sadly,
missing the complete picture!

Data Science: The Conventional View → Now with Data Engineering
Data Science today involves Data Engineering:
A set of activities that include collecting, collating, extracting, moving, transforming, cleaning, integrating,
organizing, representing, storing, and processing data.

● Happens on a large set of messy (often non-rectangular), dynamic datasets
● Happens across teams and across the organization
● The team generating the data may not be the same team(s) consuming it
● The objectives are often rather unclear and ill-defined
● A prerequisite (and typically, precursor) to real-world data science & ML
● A lot of data engineering needs to happen to support the conventional view!

Data systems are tools that support data engineering.
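To make "messy (often non-rectangular)" concrete, here is a minimal sketch of one common data engineering step: flattening nested records into the rectangular rows that the conventional view takes for granted. The records, field names, and the `flatten_orders` helper are all hypothetical, made up for illustration and not part of the course materials.

```python
import json

# Hypothetical messy input: nested objects, a variable-length list per record,
# and a missing field ("name") in the last record.
raw = """
[
  {"user": {"id": 1, "name": "Ada"}, "orders": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]},
  {"user": {"id": 2, "name": "Lin"}, "orders": []},
  {"user": {"id": 3}, "orders": [{"sku": "A1", "qty": 5}]}
]
"""


def flatten_orders(records):
    """Turn nested user/order records into one flat row per order."""
    rows = []
    for record in records:
        user = record.get("user", {})
        for order in record.get("orders", []):
            rows.append({
                "user_id": user.get("id"),
                "user_name": user.get("name"),  # None when the field is missing
                "sku": order.get("sku"),
                "qty": order.get("qty"),
            })
    return rows


rows = flatten_orders(json.loads(raw))
for row in rows:
    print(row)
```

Note the choices hidden in even this tiny step: users with no orders disappear from the output, and a missing name becomes `None`. Whether that is acceptable depends on who consumes the data.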
The Data Science Industry Now
“…once these junior people get to the market, they come in with an unrealistic set of expectations
about what data science work will look like. Everyone thinks they’re going to be doing machine
learning, deep learning, …

This is not their fault; this is what data science curriculums [sic.] and the tech
media emphasize….

The reality is that “data science” has never been as much about machine learning as it
has about cleaning, shaping data, and moving it from place to place.”

Vicki Boykis, 2019. [blog]

I personally like 2 more than 1!
[1/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.

● Most of the time spent in real-world data science projects goes to data engineering.
● Often underappreciated compared to other activities, e.g., ML.
● Data engineering activities, e.g., cleaning, moving, and processing data, occupy the
majority of time in data science.
[2/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.

“… 70% more open roles at companies in data engineering as compared to data science. As we
train the next generation of data and ML practitioners, let’s place more emphasis on
engineering skills.”
Mihail Eric, Jan 2021. [blog]

“Data engineer” has emerged as a new specialized job category:
● Data scientist: Uses various techniques in statistics & ML to process & analyze data.
● Data engineer: Develops a robust and scalable set of data processing tools/platforms.

We’re not going to be too dogmatic about these distinctions, but it’s worth knowing
what industry envisions.
[2/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.

Even bolder claim: data science roles may disappear!?!
“Many data science teams have not delivered results
that can be measured in ROI by executives.”
Forbes, Feb 2019. [blog]
Many teams have struggled because they can do “ML”
but can’t do data engineering to get to “ML”
“For complex data engineering tasks, you need five data engineers for every one data scientist.”

Essential idea: ML is the easy part (perhaps even more so, given LLMs!) → but can’t be done without data
engineering and data engineers
[3/4] Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.
3. Data engineering is essential to ML/AI.

Even when doing ML, the vast majority of the code in ML-powered systems is not “ML code.”
In most cases, “ML code” corresponds to calls to
standard libraries, e.g., scikit-learn, PyTorch,
TensorFlow, etc.

The hard part is getting the data to the format and quality that these ML
libraries expect!
Sculley et al., SE4ML 2014 [google research].
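As a rough illustration of "getting the data to the format and quality that these ML libraries expect", the sketch below turns string-typed, gap-ridden records into a numeric feature matrix and label vector. The records, column names, and helper functions are hypothetical, and mean imputation is just one simple choice among many. It uses only the Python standard library.

```python
from statistics import mean

# Hypothetical raw records, e.g. as they might arrive from a CSV export:
# everything is a string, with inconsistent formats and missing values.
raw_rows = [
    {"age": "34", "income": "72,000", "clicked": "yes"},
    {"age": "",   "income": "58000",  "clicked": "no"},
    {"age": "41", "income": "n/a",    "clicked": "YES"},
]


def to_float(text):
    """Parse a number that may contain thousands separators; None if unusable."""
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None


def build_matrix(rows, feature_names):
    """Return numeric feature rows, filling gaps with the column mean.

    Assumes every column has at least one usable value; otherwise
    statistics.mean raises StatisticsError.
    """
    columns = {name: [to_float(r[name]) for r in rows] for name in feature_names}
    for name, values in columns.items():
        fill = mean(v for v in values if v is not None)
        columns[name] = [fill if v is None else v for v in values]
    return [[columns[name][i] for name in feature_names] for i in range(len(rows))]


X = build_matrix(raw_rows, ["age", "income"])
y = [1 if r["clicked"].strip().lower() == "yes" else 0 for r in raw_rows]
print(X)
print(y)
```

Only after a step like this can `X` and `y` be handed to a library call; the library call itself is typically a line or two.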
Data Engineering is Essential in ML/AI

Monica Rogati, 2017 [blog].

Stuff you need to do first! A lot of this is data engineering.
In fact, for any sort of data-driven decision-making (ML/AI or not) you will
need these skills.
Data Engineering is Essential in ML/AI

“More often than not, companies are not ready for AI. Maybe they hired their first data scientist to
less-than-stellar outcomes, or maybe data literacy is not central to their culture. But the most
common scenario is that they have not yet built the infrastructure to implement (and
reap the benefits of) the most basic data science algorithms and operations, much less ML.”

“However, under the strong influence of the current AI hype, people try to plug in
data that’s dirty & full of gaps, that spans years while changing in format and
meaning, that’s not understood yet, that’s structured in ways that don’t
make sense, and expect those tools to magically handle it.”

Monica Rogati, 2017 [blog].
New role: Machine Learning Engineer

Tomasz Dudek, 2018 [blog].

“ML Engineer”: a specialization of data engineer focused on
operationalizing ML.

“A need for a person that would reunite two warring parties. One being fluent
just enough in both fields [Data Science and Software Engineering] to
get the product up and running. Somebody taking data scientists’ code
and making it more effective and scalable. ... Explaining the reasons
behind architectural ideas to the devops team.”
Why Learn Data Engineering?
1. Data science projects largely focus on data engineering.
2. Data engineer roles >> data scientist roles.
3. Data engineering is essential to ML/AI.
4. Balance your data techniques with a systems perspective.

Techniques: As a Data Science major, you are likely familiar with techniques:
statistics/ML concepts & algorithms…
Systems: …but you are likely less familiar with systems.

● In this class, you will learn systems and the infrastructure that enables these techniques.
● You’ll start thinking about efficiency, especially on large datasets.
● Various “plumbing analogies”: data pipelines, data flows, …

Data engineering is as essential as plumbing!
● When it works well, you don’t realize it exists.
● When it doesn’t, you’ll really know.
All these Data Systems!!!

2023 MAD (ML/AI/Data) Landscape: blog, interactive

2023 MAD (ML/AI/Data) Landscape
Data systems is a difficult subject! There are many, many data
systems – too many for us to cover.

● In this class, we will try to cover the key categories and underlying principles.
● This way, you can make informed decisions about when to use
what type of system.

The Bottom Line
Data engineering is an essential ingredient
of real-world data science projects.

A set of activities that include collecting, collating, extracting, moving, transforming,
cleaning, integrating, organizing, representing, storing, and processing data.

The backbone, plumbing, or infrastructure that supports data science.

Understanding these skills will help you…:

● Apply skills from intro data science classes to messy, large real-world datasets;
● Get your datasets to the point where you can apply AI/ML;
● Explore new, sought-after, & specialized roles, e.g., data engineer/ML engineer;
● Make informed decisions within the vast and confusing landscape of data systems; and
● Start worrying about efficiency :-)
Data Engineering: What? Why?
Course Trajectory
Course Logistics

Course Trajectory
Lecture 01, Data 101 Spring 2024

Roots and Foundations
Data Systems has a long history of academic and industrial interplay.
🎢 Academic jargon meets industry buzzwords!
🤝 Formal foundations meets best practices! Will sample from both!

[Figure: logos of data systems companies and projects; names not recoverable here. Captions:]
● Stanford; Founded 2003, IPO, Acq. 2019 (Salesforce)
● MIT; Founded 2005, Acq. 2011 (HPE)
● Founded 2022 based on DuckDB from CWI, $50M of funding
● Founded 1996, one of the most popular open-source databases, with many startups & established co. offerings
● Founded 2013, Acq. 2022 (Alteryx)
● Founded 2013 based on Apache Spark; one of the hottest pre-IPO startups
● Founded 2019, $60+M raised
● Founded 2021, Acq. 2023 (Snowflake)
Roots and Foundations
Data Systems has a long history of academic and industrial interplay.
🎢 Academic jargon meets industry buzzwords!
🤝 Formal foundations meets best practices!

Two general foundational approaches:

Code-centric
Main Storage API is files
● AWS S3, Azure File Storage, Google Cloud Storage, HDFS, …
Libraries in general-purpose programming languages, lots of separation
● Spark (Scala/Java) for batch processing
● Ad hoc code (Python/pandas) for exploration
● Metadata tracked in a separate store

Query-centric
Main Storage API is tables
● Snowflake, BigQuery, Redshift, Azure Synapse
● Teradata (founded 1979, still relevant!!)
One language/paradigm for (almost) everything
● Batch: SQL
● Interactive: SQL
● Metadata auto-tracked in database
● Other bytestream data stored in files (e.g. S3)
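The contrast between the two approaches can be sketched on a toy scale. Below, the same per-city total is computed once "code-centric" (read a file, hand-write the aggregation loop) and once "query-centric" (load a table, state the result in SQL). The data, table name, and function names are hypothetical. An in-memory SQLite database from the Python standard library stands in for a real warehouse, and a string stands in for a file in object storage.

```python
import csv
import io
import sqlite3
from collections import defaultdict

# Hypothetical data; in a code-centric stack this would be a file in object storage.
CSV_TEXT = """city,amount
Berkeley,10
Oakland,5
Berkeley,7
"""


def totals_from_file(text):
    """Code-centric: parse the file and write the aggregation logic by hand."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(text)):
        totals[row["city"]] += int(row["amount"])
    return dict(totals)


def totals_from_table(text):
    """Query-centric: load rows into a table and describe the result in SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (city TEXT, amount INTEGER)")
    rows = [(r["city"], int(r["amount"])) for r in csv.DictReader(io.StringIO(text))]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    # The schema (metadata) lives in the database alongside the data.
    result = dict(conn.execute("SELECT city, SUM(amount) FROM sales GROUP BY city"))
    conn.close()
    return result


file_totals = totals_from_file(CSV_TEXT)
table_totals = totals_from_table(CSV_TEXT)
print(file_totals == table_totals)
```

Both paths give the same answer here. The difference shows up at scale: in the first, the loop is the plan, while in the second, the system is free to choose how to execute the query.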
Our approach: Query-centric but Open-Minded
Based on formal (theory) query languages: Relational Algebra and Relational Calculus
● Decades of research and development
● RA: procedural, RC: declarative (describe outputs, not algorithms)

Structured Query Language (SQL): A domain-specific language for data

● Same language for batch (e.g., transformation) and interactive (e.g., queries)
● Declarative complement to general-purpose languages (which are often imperative)
○ Abstraction: No “overfitting” of code to the task at hand
○ Huge plus for cloud environment: dynamically changing workloads, hardware, data
● Even code-centric libraries increasingly include SQL-like interfaces (e.g., SparkSQL)
○ The pendulum has swung back in favor of SQL from the NoSQL movement of the 00s
● Decades of extensions, tools, and support
…but nothing’s perfect!
● In practice, for both exploration/data engineering,
you will need extra tools beyond SQL.
● But most recent open-source alternatives are similar.

We’ll teach you the concepts using PostgreSQL (a flavor of SQL). In
practice, you’ll be able to apply these concepts to new tools.
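The procedural/declarative distinction can be sketched in a few lines. Below, relational algebra operators are written as plain Python functions and composed in an explicit order, while the SQL version only describes the desired output. The tables, data, and function names are hypothetical. SQLite (from the Python standard library) is used here purely so the sketch is self-contained; it is not the PostgreSQL setup the course uses.

```python
import sqlite3

# Hypothetical toy relations.
students = [{"sid": 1, "name": "Ada"}, {"sid": 2, "name": "Lin"}]
enrolled = [
    {"sid": 1, "course": "DATA 101"},
    {"sid": 2, "course": "INFO 258"},
    {"sid": 1, "course": "DATA 100"},
]


# Relational algebra operators as functions over lists of dict rows.
def select(rows, predicate):
    """Selection: keep rows satisfying the predicate."""
    return [r for r in rows if predicate(r)]


def project(rows, columns):
    """Projection with set semantics: keep columns, drop duplicate rows."""
    seen = []
    for r in rows:
        out = {c: r[c] for c in columns}
        if out not in seen:
            seen.append(out)
    return seen


def natural_join(left, right, key):
    """Join rows from both relations that agree on the shared key column."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


# Procedural: we fix the order of operations (join, then select, then project).
ra_result = project(
    select(natural_join(students, enrolled, "sid"), lambda r: r["course"] == "DATA 101"),
    ["name"],
)

# Declarative: we state what we want; the engine picks the plan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (sid INTEGER, name TEXT)")
conn.execute("CREATE TABLE enrolled (sid INTEGER, course TEXT)")
conn.executemany("INSERT INTO students VALUES (:sid, :name)", students)
conn.executemany("INSERT INTO enrolled VALUES (:sid, :course)", enrolled)
sql_result = conn.execute(
    "SELECT DISTINCT s.name FROM students s JOIN enrolled e ON s.sid = e.sid "
    "WHERE e.course = 'DATA 101'"
).fetchall()
conn.close()

print(ra_result)
print(sql_result)
```

In the algebra version, selecting before joining would do less work but requires rewriting the expression. In the SQL version, that reordering is the optimizer's job, which is one reason declarative queries adapt well to changing data and hardware.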
Class Journey
Relational Model and Algebra
Advanced SQL queries (views, subqueries, window functions, Project 1
…)
DML, DDL
Referential integrity, index selection, performance tuning Project 2
Data transformation and preparation + Project 5
Data wrangling and cleaning Project 3
Non-relational data models (Tensors, Spreadsheets, etc.)
Semistructured data (and MongoDB) Project 4

ER and normalization, Spreadsheets, Transactions, BI and
OLAP, parallel computing, security and privacy, data pipelines, … (important data
engineering topics)

(note: topics are grouped by theme, are not to scale, and are not in order of the class schedule)
Data Engineering: What? Why?
Course Trajectory
Course Logistics

Course Logistics
Lecture 01, Data 101 Spring 2024

Syllabus Walkthrough

https://data101.org/sp24/syllabus

Beginning-of-Semester Logistics
Discussion Sections
Start this Thursday 1/25!
Not recorded, but there will be a live Zoom link, and
handouts/solutions will be posted.
Again, “in person” will be capped at room capacity

Office Hours
● This week, just my office hours (held this AM)
● TA office hours start next week.
We are in this class together!
Some of the content will be half-baked & experimental! Please bear with the hiccups.
● i.e., not taught in typical "database" classes.
○ Wherever possible we will emphasize underlying concepts…
…but some of what we say will also be practical advice.
● First time I’m teaching a lecture-based class since Spring 2021 (when we piloted Data 101!)

With all of that said:

● You will be evaluated generously.
● Our goal is for you to learn the material, not to stress you out.
● If you are feeling lost, please reach out.
It is much better to do so than to violate our trust.
● This is especially true given the experimental nature of the class,
our small staff size, and the state of the world.

Use the Extenuating Circumstances form!

We welcome feedback at any time about the course. Contact course staff
at [email protected] or stop by office hours.
