Building an ML Application and Transfer Learning
Applied Machine Learning
Derek Hoiem
[Title image: DALL·E]
Today’s lecture
• Review a few exam questions
• Example of building an ML application
• Transfer learning
Exam
• Well done!
False: It’s possible (and common) for a method to achieve low/zero training error, but still perform badly in
testing, especially if the training examples are few compared to the model size
(a): The parameters optimize the objective for the training data, so evaluation on the training data is a
strongly biased optimistic estimate of performance, and is not a good indicator of expected performance
for future examples
(c): The trees are independently trained
(a): All features are used to train each tree
(b): x = 3 gives y ≈ 3 for both regression and nearest neighbor
False: The weight update is not sampled randomly from a uniform distribution, but computed
from a random sample of data. Also, SGD does not proceed by checking whether an update
decreases the loss -- it just takes a step according to the loss gradient for that mini-batch.
False: Sigmoid activations are very non-linear. The problem is that the gradient
is always less than 1 and often very small, so with many layers, the gradient
becomes negligible.
We’ve covered a lot of ground in deep networks
• ReLU activations, residual connections, and improved
optimization techniques enabled training arbitrarily large
and deep models
• Transformers provide a general and scalable way to process
many kinds of data
• Training on large annotated datasets or even larger
unannotated datasets yields impressive models that are
useful for many applications
How do you make your own ML application?
Example: Safety inspector wants to know what fraction of
workers are wearing helmets, gloves, and boots on each job site
• PPE use is low (e.g., 60% use in a study in Egypt; frequent lack of use in US and other countries too)
• 1,008 fatal and 174,100 non-fatal injuries in US construction in 2020
• Consistently using PPE would significantly reduce injuries and deaths
Step 1: Propose a solution in more technical terms
Proposed solution: Process images from the job site to detect the
workers and count what fraction of detected workers are
wearing each item
[Example output for one detected worker: Left glove: No; Right glove: No; Hard hat: Yes; Vest: Yes; Boots: Yes]
Step 1: Propose a solution in more technical terms
Main ML problem: Given an image, detect each worker and
whether each detected worker is wearing: (a) glove on left hand;
(b) glove on right hand; (c) boots; (d) hard hat; (e) vest
Note: There are lots of other aspects to the problem that we won’t consider in this example
• How to get images onto a server where we can process them
• How to avoid duplicate counts when the same person is in more than one image on the same day
• How to summarize results and report them to the safety inspector
Step 2: Decide how to measure success
• What matters?
– We want the overall estimate of
fraction of workers wearing each
item to be accurate
– We want to report specific
instances of workers not wearing
an item, so that they can be
checked as problematic or not
Step 2: Decide how to measure success
• Key aspects of performance
– Human detection performance
• Do we care about “small” or heavily occluded
workers?
• What counts as correct? (maybe high overlap
in bounding boxes)
• Measure precision (fraction of detections that are correct) and recall (fraction of workers that are detected)
• Can measure Precision and Recall for each
level of confidence and generate a P-R curve
• Common overall performance measure is
average precision
• We may care about recall at a high precision
value because we don’t care about counting
the number of workers, just knowing how
likely a worker is to wear PPE
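To make these metrics concrete, here is a minimal sketch of computing a P-R curve and average precision from scored detections. It assumes the matching of detections to ground-truth workers (the `is_correct` flags, e.g., via an IoU threshold) has already been done; the function name is illustrative.

```python
import numpy as np

def pr_curve_and_ap(scores, is_correct, num_gt):
    """scores: confidence of each detection across the dataset.
    is_correct: 1 if the detection matches a ground-truth worker
    (e.g., bounding-box IoU above a threshold), else 0.
    num_gt: total number of annotated workers."""
    order = np.argsort(-np.asarray(scores))              # descending confidence
    correct = np.asarray(is_correct, dtype=float)[order]
    tp = np.cumsum(correct)                              # true positives so far
    fp = np.cumsum(1.0 - correct)                        # false positives so far
    precision = tp / (tp + fp)       # fraction of detections that are correct
    recall = tp / num_gt             # fraction of workers that are detected
    # Average precision: area under the P-R curve (rectangular approximation)
    ap = float(np.sum(precision * np.diff(np.concatenate([[0.0], recall]))))
    return precision, recall, ap
```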
Step 2: Decide how to measure success
• Key aspects of performance
– Human detection performance
– Apparel classification performance,
for correctly detected humans and
each item: EER
• TP rate: fraction of actual items that are detected
• FP rate: fraction of item detections that are false
• Summarize with the equal error rate (EER): performance at the confidence threshold where the FP rate equals (1 - TP rate)
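A small sketch of finding the equal-error operating point by sweeping the confidence threshold. Here the FP rate is taken in the ROC sense (fraction of true non-wearers scored above threshold), which is an assumption about how the metric would be instantiated; the function name is illustrative.

```python
import numpy as np

def equal_error_rate(pos_scores, neg_scores):
    """pos_scores: confidences for workers actually wearing the item;
    neg_scores: confidences for workers not wearing it.
    Returns the error rate at the threshold where FP rate ~= 1 - TP rate."""
    best_gap, eer = np.inf, None
    for t in np.sort(np.concatenate([pos_scores, neg_scores])):
        tpr = float(np.mean(pos_scores >= t))   # detection rate
        fpr = float(np.mean(neg_scores >= t))   # false positive rate
        gap = abs(fpr - (1.0 - tpr))
        if gap < best_gap:
            best_gap, eer = gap, (fpr + (1.0 - tpr)) / 2.0
    return eer
```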
Step 2: Decide how to measure success
• Key aspects of performance
– Human detection performance
– Apparel detection performance, for correctly detected humans
and each item
– Overall: Deviation of the estimated fraction of workers wearing equipment from the true fraction over a set of images
• Difference in fractions
• Bias: tends to overcount or undercount
• Variance: how much could the difference be expected to vary, given a
particular number of images
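One way to estimate the bias and variance terms is to bootstrap over images. A minimal sketch, where `per_image_counts` and the function name are illustrative assumptions:

```python
import numpy as np

def fraction_estimate_spread(per_image_counts, true_fraction, num_boot=1000, seed=0):
    """per_image_counts: array of (num_wearing, num_detected) per image.
    Resamples images with replacement to see how the estimated fraction
    behaves for a given number of images."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(per_image_counts, dtype=float)
    est = []
    for _ in range(num_boot):
        sample = counts[rng.integers(0, len(counts), len(counts))]
        est.append(sample[:, 0].sum() / sample[:, 1].sum())
    est = np.asarray(est)
    bias = est.mean() - true_fraction   # tendency to overcount or undercount
    variance = est.var()                # expected spread of the estimate
    return bias, variance
```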
Step 3: Collect and annotate validation/test images
1. Collect images
– Should be the same kind of images that will be processed in deployment
– Collect from a variety of sites and different dates. Try to get representative diversity
2. Annotate
– Draw boxes around each worker, even very small and hard to detect ones
– For each PPE item, label “present”, “absent”, or “not visible”
– How to get annotations
• In house:
– Use an open-source tool, such as VGG Image Annotator, or a commercial tool like Labelbox
– Develop a custom tool (e.g., to process 360° images or fully integrate into an existing application)
• Outsource:
– Amazon Mechanical Turk or other crowdsourcing tool
– Commercial service
• In this case, creating a small initial development validation set in-house and a larger set by outsourcing could make sense
3. Split into a validation set and test set
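For concreteness, one annotation for one image might look like the record below; the field names and file name are illustrative assumptions, not a standard schema.

```python
# One illustrative annotation record (schema and names are assumptions)
annotation = {
    "image": "site12_2024-03-05_cam3_0042.jpg",
    "workers": [
        {
            "box": [412, 188, 520, 455],   # [x1, y1, x2, y2] in pixels
            "left_glove": "absent",
            "right_glove": "absent",
            "hard_hat": "present",
            "vest": "present",
            "boots": "not visible",        # occluded, so not labeled present/absent
        },
    ],
}
```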
Step 4: Determine technical details of approach
• For this example, we’ll base the approach on Mask R-CNN
[Figure: Mask R-CNN detects objects and includes an additional branch to detect person keypoints]
Modifications
• Remove bounding box detections and masks for non-
person objects
• Add classification layers to the keypoint branch (see the sketch after this list) to classify whether each person is:
• Wearing left glove
• Wearing right glove
• Wearing hard hat
• Wearing boots
• Wearing safety vest
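A minimal sketch of the modification, using torchvision's Keypoint R-CNN (a Mask R-CNN-style detector with a keypoint branch) as the base. The `PPEHead` module, its input dimension, and how per-person features are pooled are illustrative assumptions, not a prescribed design.

```python
import torch
import torch.nn as nn
import torchvision

# Base model: person detection plus a keypoint branch
model = torchvision.models.detection.keypointrcnn_resnet50_fpn(weights="DEFAULT")

class PPEHead(nn.Module):
    """Hypothetical added head: one binary logit per PPE item for each
    detected person (left glove, right glove, hard hat, boots, vest)."""
    def __init__(self, in_dim=1024, num_items=5):
        super().__init__()
        self.fc = nn.Linear(in_dim, num_items)

    def forward(self, person_features):
        # person_features: (num_persons, in_dim) pooled ROI features
        return self.fc(person_features)  # logits; sigmoid gives per-item probabilities

ppe_head = PPEHead()
```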
Step 5: Collect training data
• Consider a combination of existing data (with applicable licenses) and new data
• Existing
– Papers With Code
– Search for existing papers/datasets
• Collect own data
– Similar to collecting test/validation data, but with less concern about being representative or reflecting actual use cases
– E.g., could ask job sites to send photos of workers wearing and not
wearing PPE (on purpose, briefly) while in natural poses
Step 6: Develop model
(from ChatGPT)
• Whenever possible, start
with a pretrained model
• Alternatively, you could
use unsupervised
pretraining to initialize
your model (e.g. Masked
Autoencoder)
https://huggingface.co/models
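For example, a pretrained Masked Autoencoder backbone can be pulled from the Hugging Face Hub in a few lines; the checkpoint name below is one published MAE model, used here only as an illustration.

```python
# Minimal sketch of loading a pretrained MAE backbone from the Hub
from transformers import AutoImageProcessor, ViTMAEModel

processor = AutoImageProcessor.from_pretrained("facebook/vit-mae-base")
backbone = ViTMAEModel.from_pretrained("facebook/vit-mae-base")
```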
Step 6a: Develop model: establish baselines
• Run the model as-is on your validation data and
measure human detection performance
• Train a linear probe for classifying PPE item
presence and measure all performance metrics
• Manually validate your evaluation code by
displaying images and detections and checking
against metrics
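A minimal linear-probe sketch: freeze the pretrained model and train only one linear layer on its features. Here `backbone`, `feature_dim`, `extract_features`, and `loader` are placeholder names for the pieces described above, not a fixed API.

```python
import torch
import torch.nn as nn

for p in backbone.parameters():
    p.requires_grad = False              # keep pretrained features fixed

probe = nn.Linear(feature_dim, 5)        # one logit per PPE item
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()         # multi-label: items are independent

for images, labels in loader:            # labels: (batch, 5) in {0, 1}
    with torch.no_grad():
        feats = extract_features(backbone, images)   # hypothetical helper
    loss = loss_fn(probe(feats), labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```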
Step 6b: Develop model: refine model
• Fine-tune the model on your
data
• Train using mix of existing and
application-specific data
– Apply only the losses that are applicable (e.g., detection or pose only for some datasets)
• Use tools like TensorBoard or
Weights and Biases to
monitor training and compare
results
– Always plot validation and
training loss, and measure
validation performance at
training milestones
https://huggingface.co/autotrain
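A sketch of the fine-tuning loop with a smaller learning rate for the pretrained backbone than for the new heads, logging to TensorBoard. `model.backbone`, `model.heads`, `loader`, and `compute_loss` stand in for the detector, data, and applicable losses; the attribute names are assumptions.

```python
import torch
from torch.utils.tensorboard import SummaryWriter

optimizer = torch.optim.AdamW([
    {"params": model.backbone.parameters(), "lr": 1e-5},   # gentle updates
    {"params": model.heads.parameters(),    "lr": 1e-4},   # new layers learn faster
])
writer = SummaryWriter("runs/ppe_finetune")

for step, batch in enumerate(loader):
    loss = compute_loss(model, batch)    # only losses applicable to this dataset
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    writer.add_scalar("train/loss", loss.item(), step)
    # At milestones, also log validation loss and metrics for comparison
```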
Step 7: Evaluate on test set
• Measure performance metrics and characterize when it works
and doesn’t
– As a function of occlusion, person size, camera viewpoint, etc.
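A sketch of slicing a metric by person size; the bucket edges follow the common COCO convention, and `ap_for` is a hypothetical helper that computes AP on a subset.

```python
import numpy as np

def ap_by_person_size(detections, ground_truth):
    """Report AP separately for small/medium/large workers (COCO-style areas)."""
    buckets = {"small": (0, 32**2), "medium": (32**2, 96**2), "large": (96**2, np.inf)}
    results = {}
    for name, (lo, hi) in buckets.items():
        gt = [g for g in ground_truth if lo <= g["box_area"] < hi]
        results[name] = ap_for(detections, gt)   # hypothetical AP helper
    return results
```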
Step 8: Integrate into application
• Beta test in complete workflows
• Write guides for when it works and doesn’t
• Improve efficiency, refine approach
Summary of how to build a new ML application
1. Identify problem and general approach to solution
– This also involves thinking ahead to metrics, available models, data, and more, to ensure viability
2. Specify success metrics
– Check with product managers and/or users to ensure these metrics reflect important performance
characteristics
– Often, the metrics can’t be optimized directly
3. Create evaluation sets
– Achieving targets for success metrics on these sets should indicate high likelihood of application success
4. Select model, objectives, and other design details
– Usually this involves finding an analogous approach that has been successful
5. Collect data for training
– Custom data and labeling is expensive and time-consuming, so exploit existing data sources where possible, as allowed by license terms
6. Develop model, starting with baselines and simple approaches
– Starting simple is critical so that it is easier to debug and validate changes
7. Evaluate on your test set
– It’s not just about the performance number, but about predictability and effectiveness within the
application
8. Integrate into the application
– This requires a lot of work and testing
2 minute break
Thank you to Yuxiong Wang for following slides on
domain adaptation and transfer learning!
Challenge for Machine Learning Models
• Development and real-world deployment may face different scenarios
• This mismatch limits model performance and reliability
[Diagram: Curated Dataset for Development → Trained ML Model → Real-world Setting → Questionable Performance]
Types of Shifts
• Mainly two types of shifts from one scenario to another:
– Task shift
– Domain shift
Task Shift: Changed Model Objectives
Source (old) task: classifying dogs and cats → Target (new) task: classifying squirrels and birds
Domain Shift: Changed Input Data Distributions
Source (old) domain: classifying dogs and cats in a studio → Target (new) domain: classifying dogs and cats on grass
Types of Shifts: Task or Domain?
• Task shift
– Objective of model is changed
– But data distributions are usually assumed similar or related
• Domain shift
– Input data come from changed distributions
– But model task usually remains the same
Overcoming Task/Domain Shift
[Diagram: without adaptation, Curated Dataset for Development → Trained ML Model → Real-world Setting → Questionable Performance; with adaptation, Adapted ML Model → Real-world Setting → Improved Performance]
Overcoming Task/Domain Shift
• Task shift (changed task objective) → task adaptation
– Transfer learning
– Meta-learning
• Domain shift (changed data distribution) → domain adaptation
– Instance translation
– Domain adversarial training
• Some adaptation ideas may be applicable to both (e.g., meta-learning)
Application: Autonomous Driving
• Adapt to different weather conditions, lighting conditions, or
driving environments
[Images: normal vs. foggy weather conditions, from Sakaridis et al. IJCV '18]
Application: Robotics
• Adapt from simulated environment to real-world robotic
systems, or adapt from one learned task to another
[Images from Google Research, 2020]
Application: Speech recognition
• Adapt to different accents, speaking styles, or environmental
conditions
• Example: A model trained on American English could be adapted to British English by fine-tuning on the new domain
Methods for Task Adaptation
• Transfer learning: Pre-training and fine-tuning
• Meta-learning: Model-Agnostic Meta-Learning (MAML) and
variants
Transfer Learning
• Goal: Reuse knowledge learned from one task (which usually has abundant supervisory information) on another related task
• Implementation is simple
– "Pre-train" model on source task
– Copy the learned weights to the new model
– "Fine-tune" new model on target task
Transfer Learning
[Diagram: Model 1: Task 1 Data → Backbone → Head → Task 1 Outputs. Model 1's backbone weights initialize Model 2: Task 2 Data → Backbone → New Head → Task 2 Outputs]
Transfer Learning
• Step 1: Pre-train Model 1 on Task 1
Transfer Learning
• Step 2: Initialize weights using learned Model 1
Transfer Learning
• Step 3: Fine-tune Model 2 on Task 2
– Backbone may use a smaller learning rate or even be "frozen"
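A minimal sketch of steps 2–3 in code; `model1`, `hidden_dim`, and `num_task2_classes` are illustrative names, and the backbone is assumed to output a flat feature vector.

```python
import copy
import torch
import torch.nn as nn

backbone = copy.deepcopy(model1.backbone)           # step 2: copy learned weights
new_head = nn.Linear(hidden_dim, num_task2_classes)
model2 = nn.Sequential(backbone, new_head)

# Step 3, option A: "freeze" the backbone and train only the new head
for p in backbone.parameters():
    p.requires_grad = False
optimizer = torch.optim.SGD(new_head.parameters(), lr=1e-2)

# Step 3, option B: fine-tune everything, backbone with a smaller learning rate
# optimizer = torch.optim.SGD([
#     {"params": backbone.parameters(), "lr": 1e-4},
#     {"params": new_head.parameters(), "lr": 1e-2},
# ])
```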
Model-Agnostic Meta-Learning (MAML)
• Proposed by Finn et al. ICML '17
• Goal: To learn a good parameter initialization that can be
quickly adapted to new tasks
• Model-agnostic: Can be applied to any differentiable model
– Flexible, can be used in a wide range of applications
– Including computer vision, natural language processing, and robotics
Model-Agnostic Meta-Learning (MAML)
• Assumption and setting
– Have a pool of various tasks
– Each task contains a set of training/validation samples
• An example of task pool
– Classify Dogs into Shepherd, Labrador, Golden, Husky ...
– Classify Cats into Siamese, Maine, Persian, Shorthair ...
– Classify Birds into Canary, Parrot, Dove, Sparrow ...
Model-Agnostic Meta-Learning (MAML)
• Meta-learning phase
– Use pool of tasks to obtain a good
parameter initialization
– Learn from the "experience of learning"
• Adaptation phase
– Use few samples and optimization steps to
adapt to new task
– New task can be outside the task pool used
in meta-learning
• Inner loop: find gradient step(s) that improve the parameters for each few-shot task
• Outer loop: update the shared parameters so that those update steps reduce the loss as much as possible for all tasks
MAML is “learning to learn” – it learns parameters that are close to good parameters for many classification tasks, so that new tasks can be learned from a few examples and optimization steps
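A minimal MAML sketch with one inner gradient step per task, written in a functional style. `tasks`, `loss_on`, and `init_params` are illustrative, and MAML typically averages the outer update over a batch of tasks rather than stepping after each one.

```python
import torch

alpha, beta = 0.01, 0.001                     # inner / outer learning rates
theta = [p.clone().requires_grad_(True) for p in init_params]
meta_opt = torch.optim.Adam(theta, lr=beta)

for task in tasks:                            # meta-learning phase
    support, query = task.sample()            # few-shot train / validation data
    inner_loss = loss_on(theta, support)
    grads = torch.autograd.grad(inner_loss, theta, create_graph=True)
    theta_prime = [p - alpha * g for p, g in zip(theta, grads)]  # adapted params
    outer_loss = loss_on(theta_prime, query)  # how well the adapted params do
    meta_opt.zero_grad()
    outer_loss.backward()                     # backprop through the inner step
    meta_opt.step()
```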
Methods for Domain Adaptation
• Instance translation
– Transform target-domain data to look like the source domain
• Domain adversarial training
– Align source-domain and target-domain feature spaces
Instance Translation
• Use generative models (e.g., CycleGAN by Zhu et al. ICCV '17) to create instances that look like the source domain but preserve the same target-domain content
• Then feed the source-like instances into the source-domain model
Instance Translation
[Figure: CycleGAN image-to-image translation examples, Zhu et al. ICCV '17]
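At test time, instance translation amounts to two lines; `G_t2s` (a trained target-to-source generator) and `source_model` are hypothetical names for the pieces above.

```python
import torch

with torch.no_grad():
    source_like = G_t2s(target_images)        # make target data look like source
    predictions = source_model(source_like)   # reuse the source-domain model as-is
```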
Domain Adversarial Training
• Proposed by Ganin et al. JMLR '16
• Goal: Learn a domain-invariant model
– The model produces features that do not change with domain shift
– Features reflect label-relevant content, not domain characteristics
Domain Adversarial Training
• Attach a domain classifier network and apply adversarial training
• Aim of domain classifier: To distinguish source vs. target domains
Domain Adversarial Training
• Aim of main network: 1) correctly predict the labels of source-domain data; 2) use features from which source and target domains cannot be distinguished
Domain Adversarial Training
• Adversarial training: the domain classifier (parameters θ_d) minimizes the domain-discrimination loss L_d, while the main network's feature extractor (parameters θ_f) maximizes L_d
– In practice, the adversarial training is implemented by reversing the gradient of L_d where the domain classifier connects to the feature extractor
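The gradient-reversal trick can be written as a tiny autograd function: identity on the forward pass, negated (optionally scaled) gradient on the backward pass. This sketch follows the standard formulation; `lam` is the scaling coefficient and the usage line is illustrative.

```python
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)                   # identity going forward

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # reversed, scaled gradient

# Usage: domain_logits = domain_classifier(GradReverse.apply(features, 1.0))
# Minimizing the domain loss on the classifier side then maximizes it
# for the feature extractor.
```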
Domain Adversarial Training
• A mainstream approach to domain adaptation
– Various follow-up methods study how to better learn domain-
invariant models or feature representations
• Other ideas (may be combined with domain adversarial
training)
– Instance translation
– Pseudo-labeling and self-training
– Domain randomization
Summary
• Task adaptation for changed task objective
– Transfer learning
– Meta-learning
• Domain adaptation for changed data distribution
– Instance translation
– Domain adversarial training
Coming up
• Thursday: Ethics and Impact of AI