FSDL Berkeley Lecture8 Data Management

The document provides an overview of data management in deep learning, emphasizing the importance of data sources, storage solutions, and processing techniques. Key points include the necessity of exploring data extensively, the benefits of data augmentation, and the use of various storage options like databases, data lakes, and object storage. It also discusses the role of frameworks and tools for managing data workflows and processing tasks efficiently.

Data Management

Full Stack Deep Learning - UC Berkeley Spring 2021 - Sergey Karayev, Josh Tobin, Pieter Abbeel
https://veekaybee.github.io/2019/02/13/data-science-is-different/

Data Sources and Training

• Data sources: images, text corpus, logs, DB records
• Training: local filesystem + GPU
• Different for every project / company!


Countless possibilities



Key Points

Let the data flow through you

• Spend 10x as much time exploring the data as you would like to

• Adding/augmenting data is the best way to improve performance

• KISS



“All-in-one”

Infrastructure and tooling landscape (figure), grouped by stage:

• Data: Sources, Data Lake / Warehouse, Processing, Exploration, Labeling, Versioning, Feature Store
• Training/Evaluation: Compute, Resource Management, Software Engineering, Frameworks & Distributed Training, Experiment Management, Hyperparameter Tuning
• Deployment: CI / Testing, Edge or Web, Monitoring
• “All-in-one” platforms span all three stages

Sources

• Most DL applications require lots of proprietary data

• Exceptions: RL, GANs, GPT-3

• Publicly available datasets = No competitive advantage

• But can serve as starting point

Usually: spend $$$ and time to label own data

https://cdn-sv1.deepsense.ai/wp-content/uploads/2017/04/sample_image_from_the_training_set.jpg
Data flywheel
Enables rapid improvement with user labels

Semi-supervised learning

Use parts of data to label other parts

Very important idea!

Fig. 1. A great summary of how self-supervised learning tasks can be constructed (Image source: LeCun’s talk)

Fig. 4. Illustration of self-supervised learning by predicting the relative position of two random patches. (Image source: Doersch et al., 2015)

https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence
https://lilianweng.github.io/lil-log/2019/11/10/self-supervised-learning.html

Semi-supervised learning

• SEER: trained on 1B random images

• Achieved SOTA accuracy on ImageNet top-1 prediction

• Open-source library

https://ai.facebook.com/blog/seer-the-start-of-a-more-powerful-flexible-and-accessible-era-for-computer-vision
Image data augmentation

• A must-do for training vision models

• Frameworks (e.g. torchvision) provide functions that do this

• Done on the CPU, in parallel with GPU training

https://towardsdatascience.com/1000x-faster-data-augmentation-b91bafee896c
Other data augmentation

• Tabular
• Delete some cells to simulate missing data
• Text
• No well-established techniques, but you can replace words with synonyms or change the order of things
• Speech/video
• Change speed, insert pauses, etc.
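
The tabular and text ideas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the API of nlpaug or any other library: the function names, the drop probability, and the hand-written synonym table are all assumptions made for the example.

```python
import random


def drop_cells(rows, drop_prob, rng):
    """Tabular augmentation: blank out cells at random to simulate missing data."""
    return [
        {key: (None if rng.random() < drop_prob else value) for key, value in row.items()}
        for row in rows
    ]


def replace_synonyms(sentence, synonyms, replace_prob, rng):
    """Text augmentation: swap words for an entry from a hypothetical synonym table."""
    words = []
    for word in sentence.split():
        options = synonyms.get(word.lower())
        if options and rng.random() < replace_prob:
            words.append(rng.choice(options))
        else:
            words.append(word)
    return " ".join(words)


rng = random.Random(0)  # seeded so the augmentation is repeatable

rows = [{"age": 31, "city": "Berkeley"}, {"age": 45, "city": "Oakland"}]
augmented_rows = drop_cells(rows, drop_prob=0.3, rng=rng)

synonyms = {"quick": ["fast", "speedy"], "photo": ["picture", "image"]}
augmented_text = replace_synonyms("a quick photo upload", synonyms, replace_prob=1.0, rng=rng)
```

In practice the augmentation would run on each batch at training time, so the model sees a different perturbed copy every epoch.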

https://github.com/makcedward/nlpaug
Synthetic data
Underrated idea that is often worth starting with

https://blogs.dropbox.com/tech/2017/04/creating-a-modern-ocr-pipeline-using-computer-vision-and-deep-learning/

This can get pretty deep!

Andrew Moffat - https://github.com/amoffat/metabrite-receipt-tests


Especially for driving and robotics

https://microsoft.github.io/AirSim/
https://openai.com/blog/ingredients-for-robotics-research/
Questions?



Data Storage
1. Building blocks

- Filesystem

- Object Storage

- Database

- Data Lake / Data Warehouse

2. What goes where

3. Where to learn more

Filesystem
• Foundational layer of storage.

• Fundamental unit is a "file", which can be text or binary, is not versioned, and is easily overwritten.

• Can be as simple as a locally mounted disk containing all the files you need.

• Can be networked (e.g. NFS): accessible over network by multiple machines.

• Can be distributed (e.g. HDFS): stored and accessed over multiple machines

• Fastest option

Hard Drive Speeds

https://www.pcworld.com/article/2899351/everything-you-need-to-know-about-nvme.html



Local Data Format
• Binary data: just files

• TFRecord batches files -- doesn't seem necessary with NVMe drives

• For large tabular / text data, have choices:

• HDF5 is powerful, but bloated and declining

• Parquet is widespread and recommended

• Feather is powered by Apache Arrow, up-and-coming

• Try to use native TensorFlow and PyTorch dataset classes


Object Storage
• An API over the filesystem. GET, PUT, DELETE files to a service, without worrying where they are actually stored.

• Fundamental unit is an "object". Usually binary: image, sound file, etc.

• Versioning and redundancy can be built into the service.

• Not as fast as local, but fast enough within the cloud.
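
To make the GET / PUT / DELETE model concrete, here is a toy in-memory stand-in for an object storage service. The class and its method names are hypothetical and do not correspond to any real client library; the sketch only shows how the caller deals in keys and bytes while versioning lives inside the service.

```python
class InMemoryObjectStore:
    """Toy stand-in for an object storage service (hypothetical, not a real client)."""

    def __init__(self):
        self._versions = {}  # key -> list of byte payloads, oldest first

    def put(self, key, data):
        """Store a new version of the object and return its 1-based version number."""
        self._versions.setdefault(key, []).append(bytes(data))
        return len(self._versions[key])

    def get(self, key, version=None):
        """Return the latest payload, or a specific earlier version if requested."""
        history = self._versions[key]
        return history[-1] if version is None else history[version - 1]

    def delete(self, key):
        """Remove the object and all of its versions."""
        self._versions.pop(key, None)


store = InMemoryObjectStore()
v1 = store.put("images/cat.jpg", b"first upload")
v2 = store.put("images/cat.jpg", b"second upload")

latest = store.get("images/cat.jpg")
original = store.get("images/cat.jpg", version=v1)
```

The caller never sees a path on disk, which is what lets a real service move, replicate, or version the bytes behind the same key.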

Database
• Persistent, fast, scalable storage and retrieval of structured data that will be accessed repeatedly.

• AKA Online Transaction Processing (OLTP)

• Mental model: everything is actually in RAM, but software ensures that everything is logged to disk and never lost.

• Not for binary data! Store references instead.

• Postgres is the right choice most of the time. Supports unstructured JSON.

• SQLite is perfectly good for small projects.

• "NoSQL" was a big craze in the 2010s. Mostly avoid.

• Redis is very useful when you need a simple key-value store.
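
As a small illustration of "store references instead", the sketch below uses Python's built-in sqlite3 module. The table layout, column names, and object keys are made up for the example: the row holds metadata plus a key pointing into object storage, never the image bytes themselves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, enough for a demo
conn.execute(
    """
    CREATE TABLE photos (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        label TEXT,
        object_key TEXT NOT NULL  -- reference into object storage, not the bytes
    )
    """
)
conn.executemany(
    "INSERT INTO photos (title, label, object_key) VALUES (?, ?, ?)",
    [
        ("Sunset", "landscape", "photos/0001.jpg"),
        ("Tabby", "cat", "photos/0002.jpg"),
        ("Beach", "landscape", "photos/0003.jpg"),
    ],
)
conn.commit()

# Query the metadata, then fetch only the referenced objects that are needed.
landscape_keys = [
    row[0]
    for row in conn.execute(
        "SELECT object_key FROM photos WHERE label = ? ORDER BY id", ("landscape",)
    )
]
```

The same schema would carry over to Postgres largely unchanged once the project outgrows SQLite.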

Data Warehouse
• Structured aggregation of data for analysis

• AKA Online Analytical Processing (OLAP)

• Another acronym: ETL (Extract, Transform, Load)

https://addepto.com/implement-data-warehouse-business-intelligence/
SQL and DataFrames
• Most data solutions use SQL. Some, like Databricks, use DataFrames.

• SQL is the standard interface for structured data.

• Pandas is the main DataFrame library in the Python ecosystem.

• Our advice: become fluent in both
https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html
Data Lake
• Unstructured aggregation of data from multiple sources, e.g. databases, logs, expensive data transformations.

• ELT (Extract, Load, Transform): dump everything in, then transform for specific needs later.

https://medium.com/data-ops/throw-your-data-in-a-lake-32cd21b6de02

Trend: Lake House



For now

• Binary data (images, sound files, compressed texts) is stored as objects.

• Metadata (labels, user activity) is stored in a database.

• If you need features that are not obtainable from the database (e.g. logs), set up a data lake and a process to aggregate the needed data.

• At training time, copy the data that is needed to a filesystem on a fast drive.

There's a lot more to the story

https://a16z.com/2020/10/15/the-emerging-architectures-for-modern-data-infrastructure/



Blueprint for AI and ML

https://a16z.com/2020/10/15/the-emerging-architectures-for-modern-data-infrastructure/



If you're truly interested

https://dataintensive.net

Questions?

Motivational Example
• We have to train a photo popularity predictor every night.

• For each photo, training data must include:

• Metadata such as posting time, title, location

• Some features of the user, such as how many times they logged in today

• Outputs of photo classifiers (content, style)

Task Dependencies
• Some tasks can't be started until other tasks are finished.

• Finishing a task should "kick off" the tasks that depend on it.

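
One way to picture task dependencies for the photo popularity example is a small dependency graph resolved with Python's standard-library graphlib module (Python 3.9+). The task names below are hypothetical labels for the three inputs listed earlier; a real workflow manager would run each "wave" of ready tasks in parallel.

```python
from graphlib import TopologicalSorter

# Map each task to the set of tasks it depends on (names are illustrative).
dependencies = {
    "photo_metadata": set(),
    "user_features": set(),
    "classifier_outputs": set(),
    "train_popularity_model": {"photo_metadata", "user_features", "classifier_outputs"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

execution_waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose dependencies are all finished
    execution_waves.append(ready)
    sorter.done(*ready)                 # finishing them unlocks the dependent tasks
```

Here the three feature-building tasks form the first wave and the training task runs only after all of them have finished.
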
Desiderata

• Re-computation should depend on content

• Dependencies are not files, but programs and databases

• Work needs to be spread over many machines

• Many dependency graphs are executing all at once
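
The first desideratum, re-computation that depends on content, can be sketched by keying cached results on a hash of the input bytes instead of on file names or timestamps. The cache class, the choice of SHA-256, and the example task are assumptions for illustration only.

```python
import hashlib


def content_key(payload: bytes) -> str:
    """Identify an input by what it contains, not by its name or timestamp."""
    return hashlib.sha256(payload).hexdigest()


class ContentCache:
    """Re-run a task only when the content of its input has changed."""

    def __init__(self):
        self._results = {}
        self.computations = 0  # counts how often a task really ran

    def run(self, task, payload):
        key = (task.__name__, content_key(payload))
        if key not in self._results:
            self._results[key] = task(payload)
            self.computations += 1
        return self._results[key]


def count_lines(payload: bytes) -> int:
    return len(payload.splitlines())


cache = ContentCache()
first = cache.run(count_lines, b"a\nb\n")
second = cache.run(count_lines, b"a\nb\n")    # same content: cached result reused
third = cache.run(count_lines, b"a\nb\nc\n")  # new content: task runs again
```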

Hadoop/Spark
• Map/Reduce implementations

• Running data processing operations and simple ML on commodity hardware, with tricks to speed things up

https://data-flair.training/blogs/spark-vs-hadoop-mapreduce/
Airflow

https://www.slideshare.net/PyData/how-i-learned-to-time-travel-or-data-pipelining-and-scheduling-with-airflow-67650418

Distributing work
• The workflow manager has a queue for the tasks, and manages workers that pull from it, restarting jobs if they fail.
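
The queue-and-workers idea can be sketched with the standard library alone. This is not how Airflow is implemented; it is a minimal single-machine illustration in which the function name, the retry limit, and the example tasks are all assumptions.

```python
import queue
import threading


def run_workers(tasks, num_workers=3, max_attempts=3):
    """Run callables from a shared queue, re-queueing a task when it raises."""
    pending = queue.Queue()
    for name, func in tasks.items():
        pending.put((name, func, 1))  # third field is the attempt number

    results, failures = {}, {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                name, func, attempt = pending.get_nowait()
            except queue.Empty:
                return  # nothing left for this worker
            try:
                value = func()
            except Exception as exc:
                if attempt < max_attempts:
                    pending.put((name, func, attempt + 1))  # restart the job
                else:
                    with lock:
                        failures[name] = repr(exc)
            else:
                with lock:
                    results[name] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, failures


attempts_seen = {"count": 0}


def flaky_task():
    """Fails on the first attempt, succeeds on the second."""
    attempts_seen["count"] += 1
    if attempts_seen["count"] < 2:
        raise RuntimeError("transient failure")
    return "recovered"


def always_fails():
    raise ValueError("bad input")


results, failures = run_workers(
    {"stable": lambda: 42, "flaky": flaky_task, "broken": always_fails}
)
```

A production workflow manager adds persistence, scheduling, and workers on separate machines, but the retry loop follows the same shape.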

http://site.clairvoyantsoft.com/making-apache-airflow-highly-available/

TensorFlow Datasets + Apache Beam

For example, the 7TB Colossal Clean Crawled Corpus (C4)

https://www.tensorflow.org/datasets/beam_datasets
Prefect



dbt



Dagster

Feature Store

https://eng.uber.com/michelangelo-machine-learning-platform/

https://www.tecton.ai



Try to keep things simple

• Don't overengineer

• For example, UNIX has powerful parallelism, streaming, and highly optimized tools

https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html
Questions?

Pandas

• The workhorse of Python data science

• Definitely do a few projects using it if you haven't used it before

https://projectcodeed.blogspot.com/2019/08/setting-up-jupyter-notebooks-for-data.html



Dask

Data Labeling

1. User Interfaces

2. Sources of labor

3. Service companies

Standard set of features:

- bounding boxes, segmentations, keypoints, cuboids

- set of applicable classes
Training the annotators is crucial

Quality assurance is key

Sources of Labor
• Hire own annotators, promote best ones to quality control
• Pros: secure, fast (once hired), less QC needed
• Cons: expensive, slow to scale, admin overhead
• ...or, crowdsource (Mechanical Turk)
• Pros: cheaper, more scalable
• Cons: not secure, significant QC effort required
• ...or, full-service data labeling companies
Service Companies

• Data labeling requires a separate software stack, temporary labor, and quality assurance. It makes sense to outsource.

• Dedicate several days to selecting the best one for you:

• Label gold standard data yourself

• Sales calls with several contenders, ask for work sample on same data

• Ensure agreement with your gold standard, and evaluate on value
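
Checking agreement with your gold standard can be as simple as the fraction of items where a vendor's work sample matches your own labels. The sketch below uses made-up item ids, labels, and vendor names; real evaluations would usually add per-class breakdowns and cost per label.

```python
def agreement_rate(gold, candidate):
    """Fraction of gold-standard items on which the candidate's label matches.

    Items the candidate did not label count as disagreements.
    """
    if not gold:
        raise ValueError("gold standard must contain at least one item")
    matches = sum(candidate.get(item) == label for item, label in gold.items())
    return matches / len(gold)


# Hypothetical gold standard labeled in-house, and two vendor work samples.
gold = {"img1": "cat", "img2": "dog", "img3": "cat", "img4": "bird"}
vendor_samples = {
    "vendor_a": {"img1": "cat", "img2": "dog", "img3": "cat", "img4": "cat"},
    "vendor_b": {"img1": "dog", "img2": "dog", "img3": "cat", "img4": "cat"},
}

scores = {name: agreement_rate(gold, labels) for name, labels in vendor_samples.items()}
best_vendor = max(scores, key=scores.get)
```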

FigureEight is the original AI data labeling company

Scale.ai is a dominant up-and-comer

And there are a ton of others

Software

• Full-service data labeling is always pricy

• But some companies offer their software without labor

Label Studio
• Open-source edition to run yourself
• Enterprise edition for managed hosting
• We use it in the lab!



Prodigy

Aquarium

https://www.aquariumlearning.com



Weak supervision

• Snorkel

• Open-source project: snorkel.org

• Commercial platform: snorkel.ai


Conclusions

• Outsource to a full-service company if you can afford it

• If not, then at least use existing software

• Hiring part-time makes more sense than trying to make crowdsourcing work
Questions?

Data Versioning

Level 0: unversioned

Level 1: versioned via snapshot at training time

Level 2: versioned as a mix of assets and code

Level 3: specialized data versioning solution

Level 0

• Data lives on the filesystem/S3 and in a database

• Problem: Deployments must be versioned. Deployed machine learning models are part code, part data. If data is not versioned, deployed models are not versioned.

• Problem you will face: inability to get back to a previous level of performance
Level 1

• Data is versioned by storing a snapshot of everything at training time

• This allows you to version deployed models, and to get back to past performance, but is super hacky.

• Would be far better to be able to version data just as easily as code.

Level 2

• Data is versioned as a mix of assets and code.

• Heavy files are stored in S3, with unique ids. Training data is stored as JSON or similar, referring to these ids and including relevant metadata (labels, user activity, etc).

• JSON files can get big, but using git-lfs lets us store them just as easily as code

• Can improve further with "lazydata": only syncing files that are needed.

• The git signature + of the raw data file defines the version of the dataset

• Often helpful to add timestamp
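
A rough sketch of the Level 2 idea: heavy files get content-derived ids, and a small JSON manifest refers to those ids alongside the metadata. Everything here is an assumption for illustration, including the helper names and the use of a SHA-256 digest of the manifest as a stand-in for the git signature that would identify the version in a real repository.

```python
import hashlib
import json


def object_id(data: bytes) -> str:
    """Unique id for a heavy file, derived from its bytes (it would live in S3)."""
    return hashlib.sha256(data).hexdigest()[:16]


def build_manifest(examples):
    """Build the JSON training manifest from (raw_bytes, metadata) pairs."""
    records = [{"id": object_id(data), **metadata} for data, metadata in examples]
    # Sort so that the manifest does not depend on the order of the inputs.
    records.sort(key=lambda record: json.dumps(record, sort_keys=True))
    return json.dumps(records, indent=2, sort_keys=True)


def manifest_version(manifest_text: str) -> str:
    """Stand-in for the git signature of the manifest file."""
    return hashlib.sha256(manifest_text.encode("utf-8")).hexdigest()[:12]


examples = [(b"image-bytes-1", {"label": "cat"}), (b"image-bytes-2", {"label": "dog"})]
relabeled = [(b"image-bytes-1", {"label": "cat"}), (b"image-bytes-2", {"label": "wolf"})]

manifest = build_manifest(examples)
version_a = manifest_version(manifest)
version_reordered = manifest_version(build_manifest(list(reversed(examples))))
version_relabeled = manifest_version(build_manifest(relabeled))
```

Changing a single label produces a new version, while the heavy files themselves are never copied into the repository.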

Level 3

• Specialized solutions for versioning data.

• Avoid these until you can fully explain how they will improve your project.

• Leading solutions are DVC, Pachyderm, Quilt.

Data Versioning Solutions

https://dagshub.com/blog/data-version-control-tools/



DVC
Dolt
A nice simple solution for versioning databases, that speaks SQL.
Questions?



Privacy
• Federated Learning: training a global model from data on local devices, without ever having access to the data

• Differential privacy: aggregating data such that individual points cannot be identified

• Another topic: Learning on encrypted data

• Let us know about the best resources!
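
As one concrete instance of differential privacy, the sketch below releases a noisy count using the Laplace mechanism. The slide does not name a specific mechanism, so this choice, the function names, and the example data are assumptions; the code relies only on the fact that a counting query changes by at most 1 when a single record is added or removed.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon, rng):
    """Release a count with noise calibrated to the privacy budget epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for value in values if predicate(value))
    # A counting query has sensitivity 1, so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1 / epsilon, rng)


ages = [23, 35, 41, 52, 67, 29]  # made-up records
noisy_over_40 = private_count(ages, lambda age: age > 40, epsilon=1.0, rng=random.Random(0))
```

A smaller epsilon means more noise and stronger privacy; the released number is close to the true count of 3 on average but does not reveal whether any one person is in the data.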


https://blog.ml.cmu.edu/2019/11/12/federated-learning-challenges-methods-and-future-directions/
https://blogs.nvidia.com/blog/2019/10/13/what-is-federated-learning/



Thank you!
