An introduction to weakly supervised learning.
Best practices.
Kristina Khvatova
Software Developer
Softec S.p.A.
Master in Computer Science and Applied Mathematics
Saint-Petersburg State University
Master in Computer Science and Data Analysis
Milano-Bicocca University
[email protected]
https://www.linkedin.com/in/kristina-khvatova-a529b21
Overview
➢ Introduction to weak supervision
➢ Three types of weakly supervised learning:
● incomplete
● inexact
● inaccurate
➢ Snorkel
➢ Brexit tweet classification with weakly supervised learning
Problem Definition

Weak supervision
Weak supervision is the technique
of building models from newly
generated training data whose labels
are limited, imprecise, or noisy.
Types:
- incomplete
- inexact
- inaccurate

Incomplete weak supervision
● Active learning
● Semi-supervised learning

Incomplete weak supervision
Active learning
● High accuracy
● Low costs
Incomplete weak supervision
Active learning
(a) High project cost and high precision (90%).
(b) Lower project cost, but also lower precision (70%).
(c) The cost of querying labels is the same as in (b), but the precision is much higher (90%), the same as in (a).
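One common way to decide which labels to query in active learning is uncertainty sampling: ask the annotator only about the examples the current model is least sure of. The sketch below is a minimal illustration under that assumption, not the method used on the slide; `pool_probs`, `uncertainty`, and `select_queries` are hypothetical names, and the probabilities are made up.

```python
def uncertainty(prob_positive):
    """Uncertainty of a binary prediction: near 1.0 at p = 0.5, near 0.0 at p = 0 or 1."""
    return 1.0 - 2.0 * abs(prob_positive - 0.5)


def select_queries(pool_probs, budget):
    """Return the ids of the `budget` most uncertain unlabeled examples."""
    ranked = sorted(
        pool_probs,
        key=lambda ex_id: uncertainty(pool_probs[ex_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs P(label = 1) for five unlabeled examples.
pool_probs = {"a": 0.98, "b": 0.52, "c": 0.10, "d": 0.45, "e": 0.70}

# With a labeling budget of 2, only the two least certain examples are queried.
queries = select_queries(pool_probs, budget=2)
print(queries)
```

Spending the labeling budget on uncertain examples is what lets the query cost stay low while precision stays high.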
Incomplete weak supervision
Semi-supervised learning

Incomplete weak supervision
Semi-supervised learning
● Generative models
● Label propagation
● TSVM
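To make label propagation concrete, here is a minimal sketch of a simplified graph-based variant: a few labeled seed nodes keep their labels, and every unlabeled node repeatedly takes the mean score of its neighbours. The graph, the seed labels, and the function name `propagate_labels` are assumptions for illustration only.

```python
def propagate_labels(edges, seeds, n_iter=50):
    """Spread seed labels (+1 / -1) over an undirected graph.

    Each unlabeled node repeatedly takes the mean score of its neighbours,
    while seed nodes stay clamped to their known label.
    """
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    scores = {node: float(seeds.get(node, 0.0)) for node in neighbours}
    for _ in range(n_iter):
        updated = {}
        for node, nbrs in neighbours.items():
            if node in seeds:
                updated[node] = float(seeds[node])  # clamp labeled nodes
            else:
                updated[node] = sum(scores[n] for n in nbrs) / len(nbrs)
        scores = updated
    # Threshold the final scores into hard labels.
    return {node: (1 if s > 0 else -1) for node, s in scores.items()}


# A chain of six examples; only the two ends carry a label.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
seeds = {"a": 1, "f": -1}
labels = propagate_labels(edges, seeds)
print(labels)
```

Nodes closer to the positive seed end up positive and nodes closer to the negative seed end up negative, which is the intuition behind using unlabeled data in semi-supervised learning.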
Inexact weak supervision
Inexact weak supervision
“Is object localization for free? – Weakly-supervised learning with convolutional neural networks.” (CVPR 2015)
Inaccurate weak supervision
http://www.scholarpedia.org/article/Ensemble_learning
Snorkel: The System for Programmatically
Building and Managing Training Data
Snorkel is a system for programmatically building and managing training datasets to rapidly and flexibly fuel machine
learning models.
● Data Programming with DDLite: Putting Humans in a Different Part of the Loop (June 2016)
● Conversational agents at IBM: Bootstrapping Conversational Agents With Weak Supervision (AAAI 2019)
● Web content & event classification at Google: Snorkel DryBell: A Case Study in Deploying Weak Supervision at Industrial
Scale (SIGMOD Industry 2019), and Google AI blog post
● Business intelligence at Intel: Osprey: Non-Programmer Weak Supervision of Imbalanced Extraction Problems (SIGMOD
DEEM 2019)
● Anti-semitic tweet classification w/ Snorkel + transfer learning.
● Clinical text classification: A clinical text classification paradigm using weak supervision and deep representation (BMC
MIDM 2019)
● Social media text mining: Deep Text Mining of Instagram Data without Strong Supervision (ICWI 2018)
● Cardiac MRI classification with Stanford Medicine: Weakly supervised classification of rare aortic valve malformations
using unlabeled cardiac MRI sequences (bioRxiv 2018)
● Medical image triaging at Stanford Radiology: Cross-Modal Data Programming for Medical Images (NeurIPS ML4H 2017)
● GWAS KBC with Stanford Genomics: A Machine-Compiled Database of Genome-Wide Association Studies (NeurIPS
ML4H 2016)
Weak Supervision Formulation
A high-level schematic of the basic weak supervision “pipeline”: We start with one or more weak supervision sources:
crowdsourced data, heuristic rules, distant supervision, and/or weak classifiers provided by a subject matter expert. The
core technical challenge is to unify and model these sources. The resulting unified labels are then used to train the end model.
https://dawn.cs.stanford.edu/2017/07/16/weak-supervision/
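The "unify" step of the pipeline above can be sketched with the simplest possible combiner, a majority vote over the weak sources. This is only a baseline under stated assumptions: Snorkel's label model instead learns how accurate each source is, and the vote matrix and names below (`ABSTAIN`, `majority_vote`, `vote_matrix`) are hypothetical.

```python
from collections import Counter

ABSTAIN = None  # a source may decline to vote on an example


def majority_vote(votes):
    """Combine votes from several weak sources; abstain on no votes or on a tie."""
    counts = Counter(v for v in votes if v is not ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # the top two labels are tied
    return ranked[0][0]


# Rows are examples, columns are weak sources (e.g. a heuristic rule,
# a crowd worker, a weak classifier). Values are made up.
vote_matrix = [
    [1, 1, ABSTAIN],
    [0, ABSTAIN, 0],
    [1, 0, ABSTAIN],
    [ABSTAIN, ABSTAIN, ABSTAIN],
]

training_labels = [majority_vote(row) for row in vote_matrix]
print(training_labels)
```

Examples that end up with a label can then be passed to the end model as training data, while abstained examples are typically left out.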
Snorkel: data programming
“Prime Minister Lee
Hsien Loong and his wife
Ho Ching leave a polling
station after casting their
votes in Singapore”
(NYTimes.com)
Demo: Step-By-Step Guide for Building a
Brexit Tweet Classifier
https://github.com/HazyResearch/snorkel
https://github.com/HazyResearch/metal
Demo: Step-By-Step Guide for Building a
Brexit Tweet Classifier
➔ Collect unlabeled data: 3184 tweets that contain #Brexit
➔ Label 500 examples: 250 ‘leave’, 250 ‘stay’
➔ Create 5 labeling functions (LFs) and apply them to the remaining 2684 unlabeled tweets.
“Predicting Brexit: Classifying Agreement is Better than Sentiment and Pollsters”
Demo: Step-By-Step Guide for Building a
Brexit Tweet Classifier
Tweets and label functions
● Safer In #EU? No! No! No! Terrorists want the UK to STAY Remember 7/7 Paris #EUreferendum #VoteLeave
● #Liverpool have broke the #Spanish dominance in Europe... #English #football says Yes We Belong in #Europe! #Stay #strongerin
● @StrongerIn so if we stay in eu that means we get more zero hours contracts and employers can say 'we dont need to now, fuck off' #TakeControl #VoteLeave
“Predicting Brexit: Classifying Agreement is Better than Sentiment and Pollsters”
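A labeling function is just a small heuristic that votes on a tweet or abstains. The sketch below is written in plain Python rather than the Snorkel API, and the two hashtag rules are hypothetical examples inspired by the hashtags in the tweets above; they are not the five LFs used in the demo. The third tweet in the list is invented to show an abstain case.

```python
LEAVE, STAY, ABSTAIN = 1, 0, -1


def lf_hashtag_leave(tweet):
    """Vote LEAVE when a pro-leave hashtag appears, otherwise abstain."""
    text = tweet.lower()
    return LEAVE if "#voteleave" in text or "#takecontrol" in text else ABSTAIN


def lf_hashtag_stay(tweet):
    """Vote STAY when a pro-remain hashtag appears, otherwise abstain."""
    text = tweet.lower()
    return STAY if "#strongerin" in text or "#stay" in text else ABSTAIN


LABELING_FUNCTIONS = [lf_hashtag_leave, lf_hashtag_stay]


def apply_lfs(tweets, lfs=LABELING_FUNCTIONS):
    """Build the label matrix: one row per tweet, one column per LF."""
    return [[lf(tweet) for lf in lfs] for tweet in tweets]


tweets = [
    "Safer In #EU? No! No! No! Terrorists want the UK to STAY "
    "Remember 7/7 Paris #EUreferendum #VoteLeave",
    "#Liverpool have broke the #Spanish dominance in Europe... #English "
    "#football says Yes We Belong in #Europe! #Stay #strongerin",
    "Watching the #Brexit debate tonight",  # invented neutral tweet
]

label_matrix = apply_lfs(tweets)
# Coverage: share of tweets that received at least one non-abstain vote.
coverage = sum(any(v != ABSTAIN for v in row) for row in label_matrix) / len(tweets)
print(label_matrix, coverage)
```

Note that the first rule matches the hashtag `#VoteLeave` rather than the plain word "STAY", which is why hashtag-based rules are less easily fooled by tweets that mention the opposing side.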
Result: Brexit Tweet Classifier
● Tweet Classifier on 500 labeled examples: LR ACCURACY: 0.52
● Tweet Classifier with Snorkel: LR ACCURACY: 0.78
Summary
➢ Weak supervision
■ incomplete
■ inexact
■ inaccurate
➢ Snorkel and Snorkel MeTaL
➢ Demo application: Brexit Tweet Classifier