FSDL Berkeley Lecture8 Data Management
Full Stack Deep Learning - UC Berkeley Spring 2021 - Sergey Karayev, Josh Tobin, Pieter Abbeel
https://veekaybee.github.io/2019/02/13/data-science-is-different/
Data Management - overview Full Stack Deep Learning - UC Berkeley Spring 2021 2
[Figure: data sources for training: images, text corpus, logs, DB records]
• Spend 10x as much time exploring the data as you would like to
• KISS
[Figure: overview diagram with labels: Sources, Processing, Exploration, Compute, Frameworks & Distributed Training, Experiment Management, Edge, Web]
[Figure: the overview diagram again, with an “All-in-one” category]
Sources
Usually: spend $$$ and time to label own data
https://cdn-sv1.deepsense.ai/wp-content/uploads/2017/04/sample_image_from_the_training_set.jpg
Data flywheel
Enables rapid improvement with user labels
Semi-supervised learning
Fig. 1. A great summary of how self-supervised learning tasks can be constructed (Image source: LeCun’s talk)
Fig. 4. Illustration of self-supervised learning by predicting the relative position of two random patches. (Image source: Doersch et al., 2015)
https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence
https://lilianweng.github.io/lil-log/2019/11/10/self-supervised-learning.html
Semi-supervised learning
• Open-source library
https://ai.facebook.com/blog/seer-the-start-of-a-more-powerful-flexible-and-accessible-era-for-computer-vision
Image data augmentation
https://towardsdatascience.com/1000x-faster-data-augmentation-b91bafee896c
Other data augmentation
• Tabular
• Delete some cells to simulate missing data
• Text
• No well-established techniques, but replace words with synonyms, change the order of things.
• Speech/video
• Change speed, insert pauses, etc.
https://github.com/makcedward/nlpaug
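The tabular idea above (delete some cells to simulate missing data) can be sketched in a few lines. This is a minimal illustration using only the standard library, not the nlpaug API; the function name `drop_cells`, the `drop_prob` parameter, and the sample rows are all hypothetical.

```python
import random


def drop_cells(rows, drop_prob=0.1, seed=None):
    """Return a copy of `rows` in which each cell is set to None with probability `drop_prob`."""
    rng = random.Random(seed)  # a seeded generator keeps the augmentation reproducible
    return [
        {key: (None if rng.random() < drop_prob else value) for key, value in row.items()}
        for row in rows
    ]


# Hypothetical example table: one dict per row.
rows = [
    {"age": 34, "city": "Berkeley", "clicks": 12},
    {"age": 27, "city": "Oakland", "clicks": 3},
]
augmented = drop_cells(rows, drop_prob=0.3, seed=0)
```

The original rows are left untouched, so the same table can be augmented repeatedly with different seeds.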
Synthetic data
Underrated idea that is often worth starting with
https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning/
This can get pretty deep!
https://microsoft.github.io/AirSim/ https://openai.com/blog/ingredients-for-robotics-research/
Questions?
- Filesystem
- Object Storage
- Database
Filesystem
• Foundational layer of storage.
• Can be as simple as a locally mounted disk containing all the files you need.
• Can be distributed (e.g. HDFS): stored and accessed over multiple machines
• Fastest option
Hard Drive Speeds
https://www.pcworld.com/article/2899351/everything-you-need-to-know-about-nvme.html
Database
• Persistent, fast, scalable storage and retrieval of structured data that will be accessed repeatedly.
• Mental model: everything is actually in RAM, but software ensures that everything is logged to disk and never lost.
• Postgres is the right choice most of the time. Supports unstructured JSON.
Data Warehouse
• Structured aggregation of data for analysis
https://addepto.com/implement-data-warehouse-business-intelligence/
SQL and DataFrames
• Most data solutions use SQL. Some, like Databricks, use DataFrames.
• ELT: dump everything in, then transform for specific needs later.
https://medium.com/data-ops/throw-your-data-in-a-lake-32cd21b6de02
Trend: Lake House
• If you need features that are not obtainable from the database (e.g., logs), set up a data lake and a process to aggregate the needed data.
There's a lot more to the story
https://a16z.com/2020/10/15/the-emerging-architectures-for-modern-data-infrastructure/
https://dataintensive.net
Questions?
Motivational Example
• We have to train a photo popularity predictor every night.
Task Dependencies
• Some tasks can't be
started until other tasks
are finished.
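Task dependencies like these form a directed acyclic graph, and a valid run order is a topological sort of that graph. The sketch below uses Python's standard `graphlib` module (Python 3.9+); the task names are hypothetical stand-ins for the nightly photo popularity job and do not come from the slides.

```python
from graphlib import TopologicalSorter

# Hypothetical nightly pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_photo_metadata": set(),
    "extract_user_features": set(),
    "run_image_classifiers": set(),
    "train_popularity_model": {
        "extract_photo_metadata",
        "extract_user_features",
        "run_image_classifiers",
    },
}

# static_order() yields every task only after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
```

The three extraction tasks have no dependencies on each other, so a scheduler is free to run them in parallel; only the training task has to wait.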
Desiderata
Hadoop/Spark
• Map/Reduce implementations
https://data-flair.training/blogs/spark-vs-hadoop-mapreduce/
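To make the Map/Reduce pattern concrete, here is a single-process word count in plain Python. It only illustrates the three phases (map, shuffle, reduce); it is not the Hadoop or Spark API, and the function names and sample documents are made up for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key, here by summing the counts."""
    return {key: sum(values) for key, values in groups.items()}


documents = ["the cat sat", "the dog sat down"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
```

In a real cluster the map calls run on different machines and the shuffle moves data over the network; the structure of the computation is the same.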
Airflow
https://www.slideshare.net/PyData/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow-67650418
Distributing work
• The workflow manager has a queue for the tasks, and manages workers that pull from it, restarting jobs if they fail.
http://site.clairvoyantsoft.com/making-apache-airflow-highly-available/
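The queue-and-workers idea can be sketched with the standard library alone. This is a toy illustration, not how Airflow is implemented: `run_workers`, `max_attempts`, and the sample tasks are hypothetical names, and "restarting" is modeled simply as putting a failed task back on the queue.

```python
import queue
import threading


def run_workers(tasks, num_workers=2, max_attempts=3):
    """Run callables from a shared queue; re-queue a task that raises, up to max_attempts."""
    task_queue = queue.Queue()
    for name, func in tasks.items():
        task_queue.put((name, func, 1))  # third field is the attempt number
    results, failures = {}, {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                name, func, attempt = task_queue.get_nowait()
            except queue.Empty:
                return  # nothing left to pull, so this worker stops
            try:
                value = func()
            except Exception as exc:
                if attempt < max_attempts:
                    task_queue.put((name, func, attempt + 1))  # restart the job
                else:
                    with lock:
                        failures[name] = repr(exc)
            else:
                with lock:
                    results[name] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, failures


# Hypothetical tasks: one fails on its first attempt, one always succeeds.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 2:
        raise RuntimeError("transient failure")
    return "ok"


results, failures = run_workers({"flaky": flaky, "steady": lambda: 42})
```

A production workflow manager adds persistence, scheduling, and dependency tracking on top of this loop, which is the main reason to use one instead of writing it yourself.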
TensorFlow Datasets + Apache Beam
For example, the 7TB Colossal Clean Corpus
https://www.tensorflow.org/datasets/beam_datasets
Prefect
Feature Store
https://eng.uber.com/michelangelo-machine-learning-platform/
https://www.tecton.ai
• Don't overengineer
https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
Questions?
Pandas
https://projectcodeed.blogspot.com/2019/08/setting-up-jupyter-notebooks-for-data.html
Data Labeling
1. User Interfaces
2. Sources of labor
3. Service companies
Standard set of features:
- bounding boxes, segmentations, keypoints, cuboids
- set of applicable classes
Training the annotators is crucial
Sources of Labor
• Hire own annotators, promote best ones to quality control
• Pros: secure, fast (once hired), less QC needed
• Cons: expensive, slow to scale, admin overhead
• ...or, crowdsource (Mechanical Turk)
• Pros: cheaper, more scalable
• Cons: not secure, significant QC effort required
• ...or, full-service data labeling companies
Service Companies
• Hold sales calls with several contenders; ask each for a work sample on the same data
FigureEight is the original AI data labeling company
Scale.ai is a dominant up-and-comer
And there are a ton of others
Software
Label Studio
• Open-source edition to run yourself
• Enterprise edition for managed hosting
• We use it in the lab!
Aquarium
https://www.aquariumlearning.com
• Snorkel
• Open-source project: snorkel.org
• Commercial platform: snorkel.ai
Questions?
Data Versioning
Level 0: unversioned
Level 0
Level 1
• This allows you to version deployed models, and to get back to past performance, but is super hacky.
Level 2
• Heavy files are stored in S3 with unique IDs. Training data is stored as JSON or similar, referring to these IDs and including relevant metadata (labels, user activity, etc.).
• JSON files can get big, but using git-lfs lets us store them just as easily as code.
• Can improve further with "lazydata": only syncing files that are needed.
• The git signature + of the raw data file defines the version of the dataset
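As a rough illustration of the Level 2 idea, the sketch below builds a small JSON manifest that refers to heavy files by ID and derives a version string from its contents. Hashing the manifest with SHA-256 is an assumption made here to stand in for the git-based signature mentioned above; the bucket name, IDs, labels, and the `dataset_version` helper are all hypothetical.

```python
import hashlib
import json


def dataset_version(manifest):
    """Return a content hash of the manifest; any change to IDs or metadata changes it."""
    # Canonical serialization so that dict key order does not affect the hash.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Hypothetical manifest: heavy files live in object storage, the JSON only refers to them.
manifest = [
    {"id": "img-0001", "uri": "s3://example-bucket/images/img-0001.jpg", "label": "cat"},
    {"id": "img-0002", "uri": "s3://example-bucket/images/img-0002.jpg", "label": "dog"},
]
version = dataset_version(manifest)

# Relabeling a single example yields a different dataset version.
relabeled = [dict(manifest[0], label="kitten"), manifest[1]]
new_version = dataset_version(relabeled)
```

Recording the version string next to each trained model makes it possible to tell exactly which labels and files a model saw.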
Level 3
• Avoid these until you can fully explain how they will improve your project.
Data Versioning Solutions
https://dagshub.com/blog/data-version-control-tools/
Dolt
A nice simple solution for versioning databases, that speaks SQL.
Questions?