Script
Simon Clematide
[email protected]
Note: This script includes all slides presented by Simon Clematide. No content from the tutorial or
iPython notebooks was included. The script was automatically generated from the lecture slides and is
therefore not optimized for continuous text in terms of layout and wording.
University of Zurich
Institut für Computerlinguistik
Andreasstrasse 15
8050 Zürich
1
Contents
1 Lecture Information 3
1.1 Infos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Material . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 “Leistungsnachweis”/Academic Achievements . . . . . . . . . . . . . . . 4
1.2 Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1
4 Learning with sklearn 63
4.1 sklearn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.1.2 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.1.3 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 skorch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
6 Automatic Differentiation 86
6.1 Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.1.1 Numeric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.1.2 Symbolic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.1.3 Reverse-Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2.1 autograd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2.2 pytorch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2.3 tensorflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
2
8.4.2 Keras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
8.4.3 DyNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
8.5 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
3
12.4.2 n:1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
12.4.3 n:n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
12.4.4 n:m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
12.4.5 Tooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
12.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
12.6 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
4
17 Clustering and Topic Modeling 323
17.1 Intro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
17.2 Hard Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
17.2.1 Flat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
17.2.2 Hierarchical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
17.3 Soft Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
17.3.1 TM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
17.3.2 LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
17.3.3 NMF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
17.3.4 Top2Vec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
17.3.5 BERTopic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
17.3.6 ProdLDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
17.3.7 CTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
17.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
17.4.1 pyLDAvis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
17.4.2 Saliency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
17.4.3 Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
17.4.4 Exclusivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
17.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
5
20.4 Lora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
20.5 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
6
List of Figures
7
Chapter 1
Lecture Information
1.1 Infos
1.1.1 Material
General information and Requirements
• Lecture workload: 6 ECTS points, i.e. 150 hours of work expected in total
• Jupyter notebooks for tutorial and lecture will be published via Google Colab
• Text version of slides with concept back-of-the-book index available before the exam
8
1.1.2 “Leistungsnachweis”/Academic Achievements
1 Written Exam (75%) and 6 Assignments (25%)
Written onsite exam: 75% of the final grade
• In English
• Partly with sample solutions, but also with discussion in the tutorial
1.2 Content
Learning Objectives and Concept of the Lecture
Learning objectives
9
1.3 Software
Software/Hardware for Exercises
• Work with your own laptop/computer; small models can be calculated on CPU
• Google Colab▲ with CPU/GPU: “Colaboratory is a free Jupyter notebook environment that requires
no setup and runs entirely in the cloud.” We try to adapt the dataset sizes for the exercises so
that they can be run on Colab without hacks.
• AWS Studio Lab▲ , Azure▲ , Google Cloud▲ (as a new user you get a $300 credit), Saturn
Cloud▲
• Useful helpers for explaining code snippets, improving documentation, adding type hints, cleaning up code . . .
• Don’t annoy your peers by submitting raw AI-generated content in the exercises! We might deduct points for this behavior.
• For each exercise, there must be a short declaration of the involved use of generative AI
(required by the Faculty of Arts).
10
Chapter 2
Learning Objectives
2.1 Intro
2.1.1 CL
Computational Linguistics and Language Technology
Theory-oriented: competence/knowledge
Computational Linguistics (CL; de: Computerlinguistik) is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective.
Application-oriented: performance
Natural Language Processing (NLP; de: Sprachtechnologie) deals with the application-oriented development of language software.
Peer-2-Peer Concept ping-pong: What are the essential ideas of AI and NLP for you? (3
Minutes)
11
ML as an Interdisciplinary Field
[Figure: Machine learning at the intersection of computer science, artificial intelligence, information theory, neuroscience, statistics, optimization, and physics. Source: Heike Adel, Introduction, 22.02.2019]
[?]
Do you agree?
12
Motivation
[Figure: Flow charts contrasting rule-based systems, classic machine learning, and representation/deep learning: from hand-designed programs over hand-designed features to learned features and learned mappings from features to output]
Computer Science
13
Jacob Eisenstein: An Introduction To Natural Language Processing
Computer science
Natural language processing draws on several aspects of “core” computer science:
• Natural language can be modeled using formal language theory, building on similar theoretical tools that are used to analyze programming languages.
• Natural language data requires efficient algorithms, which can be analyzed in terms of time and space complexity.
• These algorithms must be implemented on diverse architectures, including distributed systems, GPUs, and mobile devices.
This course will draw on basic tools from complexity theory [. . . ]
Linguistics
[?]
14
Learning and knowledge
[?]
2.1.2 ML
Typical NLP Prediction Problems
Prediction Tasks in Ascending Order of Difficulty
• Prediction of general structures (sequences, relations, trees, graphs): e.g. parsing, translation, summarization
2.1.3 AI
AI: How to learn intelligent behavior?
“Intelligence is the computational part of the ability to achieve goals in the world.”
(John McCarthy, 2004) from [?]
• Can AI keep up? Microsoft’s translation from Chinese to English (“human parity”: people judge the machine translations and human translations as equally good)
15
AI for difficult games: Neural machine learning and searching
Syntactic Games
Complex dependencies and modification relations in real sentences
Which word group depends on which one?
[The board] approved [its acquisition] [by Royal Trustco Ltd.] [of Toronto] [for $27 a share] [at
its monthly meeting].
Crucial question: Can we learn syntax from data? Actually, a very old controversial issue . . .
2.1.4 Data
Chomsky: Data vs. Linguistic Theory
Chomsky’s furious argument against data-driven models
It is fair to assume that neither sentence (1) nor (2) [. . . ] has ever occurred in an English dis-
course. Hence, in any statistical model for grammaticalness, these sentences will be ruled out
on identical grounds as equally "remote" from English. Yet (1), though nonsensical, is gram-
matical, while (2) is not grammatical. — Chomsky, 1957
p(w1 . . . wn) = p(w1) ∏_{i=2}^{n} p(wi | wi−1),   where   P(w2 | w1) = count(w1, w2) / count(w1)
16
p(furiously|sleep) = p(sleep|furiously) = 0 if neither the bigram “sleep furiously” nor “furiously
sleep” occurs in the data.
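A minimal sketch (plain Python, toy corpus chosen for illustration): maximum-likelihood bigram probabilities, showing that any unseen bigram gets probability 0, which is exactly Chomsky's point about "equally remote" sentences.

from collections import Counter

tokens = "colorless green ideas sleep furiously".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p(w2, w1):
    """P(w2 | w1) = count(w1, w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(p("furiously", "sleep"))  # 1.0 -- the bigram "sleep furiously" was observed
print(p("sleep", "furiously"))  # 0.0 -- never observed, hence "ruled out"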
Learn latent (hidden) “part-of-speech classes” c ∈ C in a way that the probability of observed
texts gets high!
p(w1 . . . wn) = p(w1) ∏_{i=2}^{n} Σ_{c∈C} p(wi | c) p(c | wi−1)
• Why restrict the hidden parameters to 16? Let’s have millions or billions of them. . .
• Why restrict the latent variable connections to neighbor words? Let’s connect any word
with any other word (aka. self-attention) . . .
p(w1, . . . , wn) = ∏_{i=1}^{n} p(wi | w1, . . . , wi−1)
2.1.5 Annotation
Empirical Turn in (Computational) Linguistics
CL Can machines learn from annotated data to interpret new data accordingly?
17
[Figure: Syntactically annotated German sentence from a treebank: a constituency tree with node labels S, CS, CJ, CNP, NP and edge labels SB, HD, OA, SVP, CD, NK]
18
2.1.6 Tasks
Tasks = Language + Data + a Bit of Linguistics
[Figure: Tasks at the intersection of Language, Linguistics (annotation conventions), AI, Data, and the theory of numerical optimization]
Further tasks
• Answer the question!
Development
“Anytime a linguist leaves the group, the recognition rate goes up.” — Fred Jelinek (1988), pioneer in statistical speech recognition
In OLAT forum++▲
19
2.2 ML Cultures
ML Cultures in NLP Reduced to 3 Symbols
∀: Rule- and logic-based modeling
2.2.1 xor
XOR Task
∀: XOR Task with Nominal Categories
XOR
A simple data set
a b a XOR b
True True False
True False True
False True True
False False False
a XOR b = True if a ≠ b, False if a = b,   for all a, b ∈ {True, False}
Interpretability/Explainability
20
∂: XOR Task as a Neural Network I
A simple attempt
Computation graph and formula
Probabilistic interpretation of y
If y is the probability of x1 XOR x2 being True then 1-y is the probability of x1 XOR x2 being
False.
• sigmoid(x) = 1 / (1 + e^(−x)) as nonlinear activation
• Bad news! Cannot calculate XOR! There are no good values for w1 , w2 , b.
21
Good news! With an unlimited number of nodes in the intermediate (hidden) layer, any computable function can be approximated.
That doesn’t mean it can be learned effectively from data!
4. Optimization: Correct weights slightly according to their share of the error! (Backpropa-
gation)
• prediction: y = p(1 XOR 1 = True) = 58%: Randomly set weights make mistakes!
22
∂: Optimize Weights (Backpropagation)
Stochastic gradient methods use partial derivation (∂). Change edge weights slightly to make
mistakes smaller. New probability of y after a training step: 54%.
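A minimal sketch of this idea in PyTorch (assuming PyTorch is installed; the hidden-layer size, learning rate and number of steps are arbitrary illustration choices, not the values from the slides): a tiny network with one hidden layer learns XOR, which a single linear layer cannot represent.

import torch
import torch.nn as nn

X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])      # XOR labels

model = nn.Sequential(
    nn.Linear(2, 8),    # hidden layer gives the model enough capacity for XOR
    nn.Sigmoid(),
    nn.Linear(8, 1),
    nn.Sigmoid(),       # output is interpreted as p(x1 XOR x2 = True)
)
loss_fn = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

for step in range(5000):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)   # quantify the current mistakes
    loss.backward()               # backpropagation: each weight's share of the error
    optimizer.step()              # change the weights slightly

print(model(X).detach().round())  # should approach [0, 1, 1, 0]; re-run if training got stuck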
∂: Reflection
Interpretability
• Because it has learned from a random configuration iteratively to output the right one.
• “A goal achieving system is one that is more usefully understood in terms of outcomes
than in terms of mechanisms.” (Rich Sutton, 2016)
23
LSTMVis: A Tool for Visual Analysis of Hidden State Dynamics in Recurrent Neural Networks (Strobelt, Gehrmann, Pfister, and Rush; Harvard)
[Figure: First page of the LSTMVis paper (arXiv:1606.07461) showing a screenshot of the LSTMVis user interface]
Framework for user exploration and testing of NNs [?]
2.2.2 Lexical Semantics
Lexical Semantics
∀: Word Semantics as Taxonomies (WordNets)
Manually created structures:
All sharks are fish. All fish can swim.
N: Corpus-based Distributionalism [?]
• “The meaning of a word is its use in the language.” (“Die Bedeutung eines Wortes ist sein Gebrauch in der Sprache.”) (Ludwig Wittgenstein, 1953)
• “You shall know a word by the company it keeps!” (J. R. Firth, 1957)
• “Words that occur in the same contexts tend to have similar meanings” (Pantel, 2005)
Tiny corpus
Window-based cooccurrence matrix
Example corpus:
• I like deep learning.
• I like NLP.
• I enjoy flying.
Idea: Similar words have similar rows.
Word bigram statistics (counts):
counts     I  like enjoy deep learning NLP flying .
I          0   2    1    0     0        0    0     0
like       2   0    0    1     0        1    0     0
enjoy      1   0    0    0     0        0    1     0
deep       0   1    0    0     1        0    0     0
learning   0   0    0    1     0        0    0     1
NLP        0   1    0    0     0        0    0     1
flying     0   0    1    0     0        0    0     1
.          0   0    0    0     1        1    1     0
Source: Richard Socher
[?, 9]
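A minimal sketch (plain Python): build the window-based cooccurrence counts for the tiny example corpus above with a window of one neighbor on each side.

from collections import defaultdict

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
counts = defaultdict(lambda: defaultdict(int))

for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for j in (i - 1, i + 1):              # neighbors inside the window
            if 0 <= j < len(tokens):
                counts[w][tokens[j]] += 1

vocab = sorted({t for s in corpus for t in s.split()})
for w in vocab:
    print(w, [counts[w][v] for v in vocab])   # each row is the word's context vector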
“a distributional thesaurus is an automatically produced ‘thesaurus’ which finds words that tend to occur in similar contexts as the target word” (https://www.sketchengine.co.uk/thesaurus/)
25
∂: Words as Numeric Vectors
One-Hot encoding: Lots of zeros and a one
Each word is a position (dimension) in a vector. In vector space terms, this is a vector with one 1 and a lot of zeroes:
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Dimensionality: 20K (speech) – 50K (PTB) – 500K (big vocab) – 13M (Google 1T)
We call this a “one-hot” representation. Its problem:
motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] AND
hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0
Problems with One-Hot Encoding
26
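A minimal sketch with numpy (the tiny vocabulary is an invented illustration): one-hot vectors for two different words always have dot product 0, so this encoding cannot express that “motel” and “hotel” are similar.

import numpy as np

vocab = ["the", "a", "motel", "hotel", "cheap"]

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

print(one_hot("motel") @ one_hot("hotel"))  # 0.0 -- no notion of similarity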
Continuous, sub-symbolic word representations (word embeddings).
27
Prediction task of word2vec [?]
p(„iPhone6s“)=0.7
p(„MacBook“)=0.2
p(„iPad“)=0.1
∂: Most similar words to “motel” in vector space: Depending on training corpus and pre-
processing
28
Computing in Learned Vector Spaces
Source: http://jalammar.github.io/illustrated-word2vec/
29
Figure 2.1: Model of syntactic transfer translation
30
Parse and transfer tree from Langenscheidt T1 (1997)
Phrase-based translation
31
Encoder-decoder Models (Sutskever et al. 2014)
[Figure: An encoder LSTM reads the input sentence “kono eiga ga kirai </s>”] [?]
Recurrent neural network (RNN) that learns suitable representations!
2.2.4 Generation
Morphology
∀: Generate Word Forms
Finite automaton with actions on character level:
work+3rdSg --> works
[Figure: Finite-state transducer over character pairs such as w:w, o:o, r:r, k:k, a:a, l:l, t:t, +Base:0, +3rdSg:s, +Progr:i, +Past:e, 0:n, 0:g, 0:d. Source: Lauri Karttunen 2005]
• Copy (w:w)
• Substitute (+3rdSg:s)
• Insert (0:d)
32
• Delete (+Base:0) Note: 0 = empty string
Sigmorphon Shared Task 2018: Learn to Generate Word Forms for 100 Languages
base form x features f → Word form y
Challenge
• Idea: Build a neural stack machine that is trained with machine learning and searching (similar to AlphaGo).
• Idea: Embed letters and morphological features as continuous vectors (similar to trans-
lation).
[Figure: Neural stack machine with encoder states g0–g3 over the stack, buffer states h3–h8 over the remaining input “i e g e n EOS”, the morphological feature embedding f, and the predicted actions INS[BOS], COPY, COPY, . . .]
t   Action a_t   Output y   Stack   Buffer   f
0 INSERT ( BOS ) [] [INSERT(BOS)] [f,l,i,e,g,e,n,EOS] f
1 COPY [f] [COPY, INSERT(BOS)] [l,i,e,g,e,n,EOS] f
2 COPY [f, l] [COPY, COPY, . . . ] [i,e,g,e,n, EOS] f
3 DELETE [f, l] [DELETE,COPY, . . . ] [e,g,e,n,EOS] f
4 DELETE [f, l] [DELETE, DELETE, . . . ] [g,e,n,EOS] f
5 INSERT (o) [f, l, o] [INSERT(o), DELETE, . . . ] [g,e,n,EOS] f
6 COPY [f, l, o, g] [COPY, INSERT(o), . . . ] [e,n,EOS] f
7 DELETE [f, l, o, g] [DELETE, COPY, . . . ] [n,EOS] f
8 DELETE [f, l, o, g] [DELETE, DELETE, . . . ] [EOS] f
9 INSERT ( EOS ) [f, l, o, g] f
33
∂: “Label efficiency”: Generalizing
#x = How often is basic form x in training data? #f = How often is feature combination in
training data?
System x / #x f /#f y
Correct Arsphenamin N;NOM;PL Arsphenamine
100 0 3 Arsphenaminen
1000 0 61 Arsphenaminnen
10000 0 594 Arsphenamine
System x / #x f /#f y
Correct belaufen V;IND;PST;3;PL beliefen
100 0 1 belauften
1000 2 14 belauften
10000 3 157 beliefen
Positive: errors are often intuitive (over-)generalizations
And still SOTA-like on morphological tasks such as grapheme to phoneme conversion in 2021,
morphological segmentation and inflection in 2022 [?] (MA thesis outcome)
2.3 Conclusion
ML Cultures in NLP Reduced to 3 Symbols
34
Important Questions Regarding AI and NLP
• Can AI methods learn any task from scratch (raw data) without explicit intermediate lin-
guistic structures? End-to-end learning paradigm
• What is the efficiency of such learning methods? Data efficiency, label efficiency, parame-
ter efficiency, compute efficiency
• Which tasks on raw data are optimal for foundational models that quickly learn to solve
specialized downstream tasks (transfer learning)? Prompting (zero-shot, few-shot) vs.
fine-tuning . . .
Conclusion
• Machine learning techniques are better for complex, broad and “vague” problems than
manual modeling.
• Targeted training: Appropriate numerical representation allows numerical optimization
(learning from mistakes).
• End-to-end systems, in which input and output are fully connected and optimizable, avoid the problem of uncorrectable consequential errors.
• Representation learning: The optimization process learns appropriate numerical repre-
sentations for the input and “latent” levels!
• The internal numerical representations are difficult to interpret (black box problem).
Questions
• What characterized the development of ML in NLP over the last years?
• What are the typical steps in neural learning?
• What is meant by representation learning?
• What can we learn from Chomsky’s argument and counter-argument about learning and
modeling?
• What output will a shallow neural network produce for XOR-style classification with or
without hidden layers▲ ?
35
Chapter 3
Learning Objectives
3.1 Books
3.1.1 ML
[?]: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd Ed.)
36
[?]: Deep Learning
• Can be overwhelming. . .
• Good and very practical interactive introduction into neural modeling (pyTorch code
snippets)
37
• Specific to NLP, applied but still academic
• General ML in the beginning needed for neural modeling; specific neural modeling in
the end
38
39
• Specific to NLP, medium level
• One of the best conceptual books (without code)
40
Machine Learning: Preview (quiz slides; source: Heike Adel, Introduction, 22.02.2019)

This is . . . Regression
Supervised: Regression — data input x with labels y:
x    y
0.2  0.2
0.4  0.3
0.9  0.6
1.0  1.2
[?]

This is . . . Clustering
Unsupervised: Clustering — data input without labels:
x1   x2
0.2  0.2
0.4  0.3
0.9  0.6
1.0  1.2
[?]

This is . . . Classification
Supervised: Classification — data input x with class labels y:
x1   x2   y
0.2  0.2  0
0.4  0.3  0
0.9  0.6  1
1.0  1.2  1
[?]

Supervised/Unsupervised Learning
From [?]: “Machine Learning systems can be classified according to the amount and type of supervision they get during training. There are four major categories: supervised learning, unsupervised learning, semisupervised learning, and Reinforcement Learning.
Supervised learning
In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels (Figure 1-5).”
[Figure 1-5: A labeled training set for supervised learning (e.g., spam classification). Source: [?]]

NOTE (from [?]): “In Machine Learning an attribute is a data type (e.g., ‘Mileage’), while a feature has several meanings depending on the context, but generally means an attribute plus its value (e.g., ‘Mileage = 15,000’). Many people use the words attribute and feature interchangeably, though.”

Modern ML-based NLP Pipeline Lifecycle
[?]
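A minimal sketch of supervised classification with scikit-learn, reusing the tiny classification table above (the new test points are invented for illustration):

from sklearn.linear_model import LogisticRegression

X = [[0.2, 0.2], [0.4, 0.3], [0.9, 0.6], [1.0, 1.2]]
y = [0, 0, 1, 1]

clf = LogisticRegression().fit(X, y)             # learn parameters from labeled data
print(clf.predict([[0.3, 0.25], [0.95, 1.0]]))   # predict class labels for new points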
42
Read Chapter 2 of [?] for more details on each step!
5 Steps for Supervised Machine Learning
1. Feature engineering: extraction, encoding, transformation (e.g. standardization, e.g. to values between 0 and 1), selection
2. Performance metrics selection: internal (directly optimized for in training) vs. external evaluation measure (application dependent)
3. Selection of classifier and optimization algorithm: loss function, optimizer, regularization, hyper-parameter tuning
4. Evaluation of models: cross-validation if feasible
5. Revise any of the preceding steps!
sklearn supports these steps with well-designed abstract interfaces.

Preprocessing – getting data into shape
From [?] (Chapter 1): “Raw data rarely comes in the form and shape that is necessary for the optimal performance of a learning algorithm. Thus, the preprocessing of the data is one of the most crucial steps in any machine learning application. If we take the Iris flower dataset from the previous section as an example, we could think of the raw data as a series of flower images from which we want to extract meaningful features. Useful features could be the color, the hue, the intensity of the flowers, the height, and the flower lengths and widths. Many machine learning algorithms also require that the selected features are on the same scale for optimal performance, which is often achieved by transforming the features in the range [0, 1] or a standard normal distribution with zero mean and unit variance, as we will see in the later chapters. Some of the selected features may be highly correlated and therefore redundant to a certain degree. In those cases, dimensionality reduction techniques are useful for compressing the features onto a lower dimensional subspace. Reducing the dimensionality of our feature space has the advantage that less storage space is required, and the learning algorithm can run much faster.”
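A minimal sketch of the two feature scalings mentioned in the quote, with scikit-learn (the toy matrix is invented for illustration): rescaling each column to [0, 1] versus standardizing to zero mean and unit variance.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])

print(MinMaxScaler().fit_transform(X))    # each column rescaled to [0, 1]
print(StandardScaler().fit_transform(X))  # each column: zero mean, unit variance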
3.2.1 Splits
Proper Fitting: Training and Held-Out Sets
43
Source: [?, 174]
Validation/Development Set
Use the validation samples for tuning hyperparameters:
• of the machine learning algorithm
Test Set
Use test sample (ideally once, after final training) for measuring performance on unseen data. It measures the generalizability of your model on new data (if training/dev data is representative for it).

Capacity – Overfitting – Underfitting
[Figure: Error as a function of model capacity, with an underfitting and an overfitting regime; the training error curve decreases with capacity, while a second curve labeled “???” is left as a quiz. Source: Heike Adel, Machine Learning, 08.03.2019]
• Features: amount of input features
• Capacity ≈ amount of model parameters
• Training epochs: How many times did a (gradient-based) learning algorithm see the training set?
Questions
• What is ??? error?
• Where is the perfect spot for fitting?
Sparse noisy training data and a “known” target function
[Figure: Training points sampled from sin(2πx) with added Gaussian noise, plotted over x ∈ [0, 1]. Source: [?, 4]]

Underfitting and Overfitting on Numerical Data

3.2.2 Linear Algebra
Linear Algebra: Scalars, Vectors, Matrices, Tensors
Source: https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.1-Scalars-Vectors-Matrices-and-Tensors/
What are tensors? Attend tutorial session!

Example: Polynomial Curve Fitting
From [?, 4]: The data is generated by sampling sin(2πx) and adding a small level of Gaussian random noise to each target value t_n. The goal is to predict the value of t for some new value of x, i.e. to implicitly discover the underlying function sin(2πx) by generalizing from a finite, noisy data set. We fit the data using a polynomial function of the form

y(x, w) = w0 + w1 x + w2 x^2 + . . . + wM x^M = Σ_{j=0}^{M} wj x^j    (1.1)

where M is the order of the polynomial and x^j denotes x raised to the power of j. The polynomial coefficients w0, . . . , wM are collectively denoted by the vector w. Note that, although the polynomial function y(x, w) is a nonlinear function of x, it is a linear function of the coefficients w. Functions, such as the polynomial, which are linear in the unknown parameters are called linear models. The coefficients are determined by fitting the polynomial to the training data, i.e. by minimizing an error function such as the sum of squared errors between the predictions y(x_n, w) and the targets t_n. For model comparison the root-mean-square (RMS) error is used:

E_RMS = sqrt(2 E(w*) / N)    (1.3)

in which the division by N allows us to compare different sizes of data sets on an equal footing, and the square root ensures that E_RMS is measured on the same scale (and in the same units) as the target variable t. Small values of M give relatively large test set error, because the corresponding polynomials are rather inflexible and incapable of capturing the oscillations of sin(2πx); intermediate values (roughly 3 ≤ M ≤ 8) give small test set error and reasonable representations of the generating function.

[Figure 1.4: Plots of polynomials having various orders M, shown as red curves, fitted to the noisy data set. Source: [?, 4]]

What are Tensors? (broad sense)
A generalization: scalars, vectors, matrices, tensors (narrow sense)
• Note: Numerical structures with dimensions or ranks
• A rank-0 tensor: Any number (scalar) x ∈ R
• A rank-1 tensor: Any numerical vector x ∈ R^n
46
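A minimal numpy sketch of the curve-fitting example above (the noise level and the chosen orders M are illustration choices): fit polynomials of increasing order to noisy samples of sin(2πx) and compare their test RMS errors.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)   # noisy targets

x_test = np.linspace(0, 1, 100)
t_test = np.sin(2 * np.pi * x_test)

for M in (0, 1, 3, 9):
    w = np.polyfit(x, t, deg=M)                 # least-squares fit of order M
    pred = np.polyval(w, x_test)
    rms = np.sqrt(np.mean((pred - t_test) ** 2))
    print(f"M={M}: test RMS error = {rms:.3f}")  # intermediate M fits best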
Dot Product: Vectors and Column Vectors
Matrix Multiplication▲
Fill the result matrix with vector dot products. . .
47
[Figure: Row vector dot matrix — a computation with weight matrices W1 and W2, bias b, and outputs X, Y, Z]
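A minimal numpy sketch (toy numbers invented for illustration): a row vector times a matrix is a vector of dot products, one per matrix column — the basic operation x·W + b of a linear layer, written in the row-vector convention used below.

import numpy as np

x = np.array([1.0, 2.0, 3.0])          # row vector, shape (3,)
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])             # shape (3, 2)
b = np.array([0.5, -0.5])

print(x @ W + b)                       # [x·W[:,0]+b[0], x·W[:,1]+b[1]] = [4.5, 4.5]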
Goldberg’s “Idiosyncratic” Notational Conventions
• Superscripts index objects, they are not powers: W^2 and W^3 are different matrices, while (W^3)^2 is the matrix W^3 squared.
• b[i] denotes the i-th element of vector b, and W[i,j] the entry in row i and column j of matrix W (also written b_i and w_{i,j}).
• Dot product: w · v = Σ_i w_i v_i = Σ_i w[i] v[i]
• x_{1:n} denotes the sequence x_1, . . . , x_n; x_{n:1} is the reversed sequence, with x_{n:1}[i] = x_{n−i+1}.
• [v_1; v_2] denotes vector concatenation.
• Goldberg writes xW + b (row vector times matrix) instead of Wx + b.
Read from left to right, Goldberg’s notation resembles more the direction of processing from input to output.
NLP papers also often use widely different notation styles. . . You need to get used to it/them. . .
3.2.3 Preprocessing/Encoding
Levels of Measuring▲ (de: Skalenniveaus)
48
commons.wikimedia.org /w/index.php?curid=724035
Comparing levels; in red: Additional property. nominal: frequencies, ordinal: ordering, interval: distances, ratio:
meaningful zero
Source: [?, 9]
49
Introduction
[?]
h is your system’s prediction function, also called a hypothesis. When your system is given an instance’s feature vector x(i), it outputs a predicted value ŷ(i) = h(x(i)) for that instance (ŷ is pronounced “y-hat”).
For example, if your system predicts that the median housing price in the first district is
$158,400, then ŷ(1) = h(x(1)) = 158,400. The prediction error for this district is ŷ(1) – y(1)
= 2,000.
RMSE(X,h) is the cost function measured on the set of examples using your hypothesis h.
We use lowercase italic font for scalar values (such as m or y(i)) and function names (such as h),
lowercase bold font for vectors (such as x(i)), and uppercase bold font for matrices (such as X).
Encoding: Dealing with Words
Nominal Features
A small corpus:
Time flies like an arrow.
[?]
51
Sparse text representation: BOW
[?]
Bag-of-Words Representation
[Figure: The one-hot vectors (dimensionality ~10^5) of the tokens “The quick brown fox jumps over the dog .” are summed into a single sparse bag-of-words count vector]
[?]
Term Frequency
https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/
52
https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/
TF.IDF
https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/
TF.IDF: An Example
A = The car is driven on the road. TF.IDF
B = The truck is driven on the highway.
https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/
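A minimal sketch: compute TF-IDF vectors for the two example sentences A and B above with scikit-learn’s TfidfVectorizer (its default weighting differs slightly from the blog post’s formulas, so the exact numbers may not match).

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The car is driven on the road.",
        "The truck is driven on the highway."]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)        # sparse matrix: 2 documents x vocabulary
print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))               # words shared by A and B get lower weights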
53
3.3 Models
Model Classes (Hypothesis Classes)
• Supervised parametrized learning: Find the best parameters θ for function y = fθ (x) =
f (x; θ) that computes optimal outcome y for given evidence x.
• Hypothesis space: Classes of all possible models. What kind of hypothesis spaces do you
know? Linear models, decision trees, multilayer perceptrons . . .
http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
• "any two algorithms are equivalent when their performance is averaged across all possi-
ble problems."
• Machine Learning requires the use of different methods and their systematic evaluation
54
• Con: The modeling class introduces (possibly wrong) assumptions about the data
Non-Parametric
KNN classification with K = 3
[Figure 1.14 from [?, 16]: (a) Illustration of a K-nearest neighbors classifier in 2d for K = 3. The 3 nearest neighbors of test point x1 have labels 1, 1 and 0, so we predict p(y = 1|x1, D, K = 3) = 2/3; the 3 nearest neighbors of test point x2 have labels 0, 0, and 0, so we predict p(y = 1|x2, D, K = 3) = 0. (b) Illustration of the Voronoi tesselation induced by 1-NN. Based on Figure 4.13 of Duda et al. (2010).]
From [?]: “Such models often have better predictive accuracy than association rules, but they may be less interpretible. This is typical of the difference between data mining and machine learning: in data mining, there is more emphasis on interpretable models, whereas in machine learning, there is more emphasis on accurate models.”

Standard Procedure in Parametric Learning
• Features get parameterized by weights
• Objective function connects feature values, parameters and prediction
• Optimization algorithm maximizes objective function, that is, minimizes loss function
• Objective should include some form of regularization term for improved generalization
Supervised
Linear Regression, Logistic Regression, Support Vector Machines (SVMs), Decision Trees and Random Forests

Linear Regression
Input: vector x ∈ R^n
Output: scalar y ∈ R
Parameters of the LR model: Where are the parameters? Where is the bias? Can you complete the drawing to match the formula of linear regression?
56
A linear function with output dimension d_out = 1, weight vector w and bias b: f(x) = x · w + b [?]
For binary classification, map the score to {−1, +1} with the sign of f(x):
ŷ = f(x) = x · w + b = size · w1 + price · w2 + b,   with w = [w1, w2]
Decide the class by checking whether ŷ ≥ 0. (With k output classes, e.g. ŷ ∈ R^6 for six languages, w becomes a matrix; see below.)
• What is learned?
• What is predicted?
57
[Figure: Scatter plot of apartments by Price (x-axis, 1000–5000) and Size (y-axis, 0–2500), with a separating line x · w + b = 0 determined by w1, w2 and b]

Sign Function
sgn(x) = −1 if x < 0,  0 if x = 0,  1 if x > 0
What does the graph of this function look like?
Linear Separability
58
[Figure: Scatter plot of apartments by Price (x-axis) and Size (y-axis); ◦ blue: Dupont Circle, DC; × green: Fairfax, VA; a separating line with parameters w1, w2 and b]
Questions
Are the regions linearly separable by two features?
Are the regions linearly separable by a single feature? (The decision boundary is x · w + b = 0.)
Important
The typical high dimensionality of NLP features helps to generate linear separability. . .
3.3.2 Binary
Weakness of Linear Modeling
• Class probabilities!
59
[Figure: The sigmoid function σ(x) between x = −6 and 6, rising from 0.0 to 1.0]

Supervised: Logistic Regression — Who am I? What can I do?
P(y = 1|x, w, b) = 1 / (1 + exp(−h)),   with the linear score h = x · w + b
P(y = 0|x, w, b) = 1 − P(y = 1|x, w, b)
Log-linear Modeling (Source: Heike Adel, Machine Learning, 08.03.2019)

The sigmoid σ(x) = 1 / (1 + e^(−x)) maps any score to [0, 1]:
ŷ = σ(f(x)) = 1 / (1 + e^(−(x · w + b)))
60
Sigmoid
Instead of mapping the score to {−1, +1} with the sign, map it to [0, 1] with σ(x) = 1 / (1 + e^(−x)).
For more than two classes, use softmax:
softmax(x)[i] = e^(x[i]) / Σ_j e^(x[j])
ŷ = softmax(xW + b),   ŷ[i] = e^((xW+b)[i]) / Σ_j e^((xW+b)[j])
∂: SoftMax: From vectors to probabilities
SoftMax: The “work horse” of probabilistic classification
Every numeric vector with n numbers can be normalized and interpreted as a probability distribution over n classes!
Vector (logits) x_{1:n} = x_1, x_2, . . . , x_n  —SoftMax→  probability interpretation y_{1:n} = y_1, y_2, . . . , y_n
Example output: p(„iPhone6s“) = 0.7, p(„MacBook“) = 0.2, p(„iPad“) = 0.1
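A minimal numpy sketch of softmax (the logits are invented for illustration): any score vector is turned into a probability distribution over n classes.

import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return z / z.sum()

logits = np.array([2.0, 0.8, 0.1])
print(softmax(logits))          # roughly [0.69, 0.21, 0.10]; sums to 1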
3.4 Training
3.4.1 Loss
Training: The Function of Loss Functions
What is training (optimization) for? Searching for better parameters Θ (e.g. W, b).
Change the model parameters Θ such that the value of the loss function (or “cost function”) over the training set (x_{1:n}, y_{1:n}) gets smaller:
L(Θ) = (1/n) Σ_{i=1}^{n} L(f(x_i; Θ), y_i)
61
Training searches for the parameters Θ̂ that minimize this corpus-wide loss:
Θ̂ = argmin_Θ L(Θ) = argmin_Θ (1/n) Σ_{i=1}^{n} L(f(x_i; Θ), y_i)
Linear models allow for a guided search with guarantees on convergence (on the training data). Or even for an analytical solution . . .
The Function of Loss Functions
What are loss functions for? What are their properties?
• They quantify the quality of the current model by a single number. Computed per train-
ing item and summed up over whole training set.
• Minimal when all examples from the training set get predicted perfectly.
• What does perfect mean? For probabilistic classifiers 100% probability for the correct
solution is needed.
• What characterizes good loss functions? They have unique extreme values which have a
closed solution or can be approximated iteratively for any data set.
https://www.dataquest.io/blog/understanding-regression-error-metrics/
3.4.2 Optimization
Convex and Concave Functions
When is a function z = f (x, y) concave?
62
• How can I turn any concave function into a convex function?
• Their graph is on or above any straight line drawn between two points of the graph.
3.4.3 Gradients
Iterative Computation of ArgMax of a Concave Continuous Function: Gradient Ascent
Algorithm arg max_x f(x) for unary functions (1 parameter)
• Parameters:
– function f : R → R
63
– Step size a > 0 (if a is too large, algorithm diverges)
• f ′ is the derivative of f
• Gradient leads to maximum of a function. How can we find the minimum of a convex
function?
from sympy import diff
import random

def gradient_ascent(f, x, a=0.1, eps=1e-6, maxi=1000, init=None):
    """Climb the slope of a unary sympy expression f in the symbol x."""
    f_deriv = diff(f, x)
    argmax = random.random() - .5 if init is None else init
    converged = False
    iteration = 0
    while not converged and iteration < maxi:
        iteration += 1
        oldargmax = argmax
        slope = f_deriv.subs(x, argmax)   # derivative (slope) at the current point
        argmax += a * slope               # step uphill, scaled by the step size a
        if abs(oldargmax - argmax) < eps:
            converged = True
    return argmax
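A usage sketch for the function above (assuming sympy is installed): maximize the concave function f(x) = −(x − 2)^2, whose maximum is at x = 2.

from sympy import symbols
x = symbols("x")
print(gradient_ascent(-(x - 2)**2, x, a=0.1))   # converges towards 2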
Q: How can we turn this code into a gradient descent?
Notation
• ∇f : shorthand form to denote the gradient of f . The collection of all partial derivatives
of f .
64
Directed Search: Gradients as Compass

From [?] (on gradient-based optimization): “[. . .] such an approach would be horribly inefficient, because you would need to compute two forward passes (which are expensive) for every individual coefficient, of which there are many, usually thousands and sometimes up to millions. A better approach is to take advantage of the fact that all operations used in the network are differentiable, and compute the gradient of the loss with regard to the network’s coefficients. You can then move the coefficients in the opposite direction from the gradient, thus decreasing the loss.”

[Figure 2.10: Derivative of f in p — the local linear approximation of f with slope a. The slope a is called the derivative of f in p. If a is negative, a small positive change of x around p will result in a decrease of f(x); if it is positive, a small change in x will result in an increase of f(x). The magnitude of the derivative tells you how quickly this will happen. Source: [?]]
Try to explain this figure to your neighbour!

Why do we need a step size? (Learning Rate)

From [?] (Chapter 4, Numerical Computation): “Suppose we have a function y = f(x), where both x and y are real numbers. The derivative of this function is denoted as f′(x) or as dy/dx. The derivative f′(x) gives the slope of f(x) at the point x. In other words, it specifies how to scale a small change in the input in order to obtain the corresponding change in the output: f(x + ε) ≈ f(x) + ε f′(x). The derivative is therefore useful for minimizing a function because it tells us how to change x in order to make a small improvement in y. For example, we know that f(x − ε sign(f′(x))) is less than f(x) for small enough ε. We can thus reduce f(x) by moving x in small steps with the opposite sign of the derivative. This technique is called gradient descent (Cauchy, 1847).”
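A minimal sketch of gradient descent in plain Python (the example function and learning rate are invented for illustration): repeatedly apply the update x ← x − a · f′(x).

def f_prime(x):
    return 2 * (x - 3)        # derivative of f(x) = (x - 3)^2

x, a = 0.0, 0.1               # start point and learning rate (step size)
for _ in range(100):
    x -= a * f_prime(x)       # move against the sign of the derivative
print(x)                      # close to the minimum at x = 3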
• Solution II: general kernel function (can sometimes be computed efficiently with the kernel trick)3
Conclusion
• Modern ML uses objective functions that combine loss functions and regularization for
better generalization
Mandatory reading
• Chapter 2 of [?] on NLP Pipelines: Read it carefully if you never heard of things like
tokenization, lemmatization, evaluation (Precision, Recall, F-Measure)
Recommended reading
3
Video https://www.youtube.com/watch?v=OdlNM96sHio
66
• Chapter 1 and 2 of [?] on linear classification: Please answer all questions in the slides
connected to Goldberg.
• Chapter 1, 2, 4 of [?] if your ML and sklearn skills are rusty (the next tutorial will be on sklearn) and if you find dl2ai too difficult for the moment
Questions
• What is the difference between the two workflow schemas (the one from Vajjala 2020 vs
Raschka 2015)?
67
Chapter 4
Learning Objectives
• Confusion matrices
4.1 sklearn
ML Techniques Covered by sklearn
ML in sklearn
http://scikit-learn.org/stable/_static/ml_map.png
68
4.1.1 Data
Categorical Data: Pandas Dataframes
Mini example: T-Shirts with colors, size, price
From [?]: “When we are talking about categorical data, we have to further distinguish between nominal and ordinal features. Ordinal features can be understood as categorical values that can be sorted or ordered. For example, T-shirt size would be an ordinal feature, because we can define an order XL > L > M. In contrast, nominal features don’t imply any order and, to continue with the previous example, we could think of T-shirt color as a nominal feature since it typically doesn’t make sense to say that, for example, red is larger than blue. Before we explore different techniques to handle such categorical data, let’s create a new data frame to illustrate the problem:”
>>> import pandas as pd
>>> df = pd.DataFrame([
... ['green', 'M', 10.1, 'class1'],
... ['red', 'L', 13.5, 'class2'],
... ['blue', 'XL', 15.3, 'class1']])
>>> df.columns = ['color', 'size', 'price', 'classlabel']
>>> df
color size price classlabel
0 green M 10.1 class1
1 red L 13.5 class2
2 blue XL 15.3 class1
From [?]: “As we can see in the preceding output, the newly created DataFrame contains a nominal feature (color), an ordinal feature (size), and a numerical feature (price) column. The class labels (assuming that we created a dataset for a supervised learning task) are stored in the last column. The learning algorithms for classification that we discuss in this book do not use ordinal information in class labels.”

Which levels of measurement?

Do-it-yourself Mappings: injective f : M → N
Define your function for encoding and decoding.

Mapping ordinal features
From [?]: “To make sure that the learning algorithm interprets the ordinal features correctly, we need to convert the categorical string values into integers. Unfortunately, there is no convenient function that can automatically derive the correct order of the labels of our size feature. Thus, we have to define the mapping manually. In the following simple example, let’s assume that we know the difference between features, for example, XL = L + 1 = M + 2.”

Define mappings:
size_mapping = {'XL':3, 'L':2, 'M':1}
inv_size_mapping = {v:k for k,v in size_mapping.items()}

Vectorized mapping on a column:
df['size'] = df['size'].map(size_mapping)

We can reverse the key-value pairs in the mapping dictionary as follows to map the converted class labels back to the original string representation:
>>> inv_class_mapping = {v: k for k, v in class_mapping.items()}
>>> df['classlabel'] = df['classlabel'].map(inv_class_mapping)
>>> df
   color  size  price classlabel
0  green     1   10.1     class1
1    red     2   13.5     class2
2   blue     3   15.3     class1

Using the preprocessing Package of sklearn
sklearn supports the creation and application of mappings. Alternatively, there is a convenient LabelEncoder class directly implemented in scikit-learn to achieve the same.

The LabelEncoder class in action▲
>>> from sklearn.preprocessing import LabelEncoder
>>> class_le = LabelEncoder()
>>> y = class_le.fit_transform(df['classlabel'].values)
>>> y
array([0, 1, 0])

Methods of these preprocessing estimators
Note that the fit_transform method is just a shortcut for calling fit and transform separately, and we can use the inverse_transform method to transform the integer class labels back into their original string representation:
>>> class_le.inverse_transform(y)
array(['class1', 'class2', 'class1'], dtype=object)
We briefly introduced the concept of partitioning a dataset into separate datasets for
training and testing in Chapter 1, Giving Computers the Ability to Learn from Data, and
70
Chapter 3, A Tour of Machine Learning Classifiers Using Scikit-learn. Remember that the
test set can be understood as the ultimate test of our model before we let it loose on
the real world. In this section, we will prepare a new dataset, the Wine dataset. After
we have preprocessed the dataset, we will explore different techniques for feature
Dealing with Missing or Unknown/Unseen Values in Test/Application Data
Many possible strategies . . .
• Ignore/remove data record
• Ignore/remove feature values
• Provide explicit UNK values (unseen characters, bigrams, words); maybe already do that for rare features in the train/dev set to “accustom” the model to this
• Compute replacement values (imputing missing values): What values would be good for nominal scales?

From [?]: “Although scikit-learn was developed for working with NumPy arrays, it can sometimes be more convenient to preprocess data using pandas’ DataFrame. We can always access the underlying NumPy array of the DataFrame via the values attribute before we feed it into a scikit-learn estimator:
>>> df.values
array([[  1.,   2.,   3.,   4.],
       [  5.,   6.,  nan,   8.],
       [ 10.,  11.,  12.,  nan]])
This way, we can count the number of missing values per column; in the following subsections, we will take a look at different strategies for how to deal with this missing data. Although the removal of missing data seems to be a convenient approach, it also comes with certain disadvantages; for example, we may end up removing too many samples, which will make a reliable analysis impossible. Or, if we remove too many feature columns, we will run the risk of losing valuable information that our classifier needs to discriminate between classes.”

Imputing missing values
From [?]: “Often, the removal of samples or dropping of entire feature columns is simply not feasible, because we might lose too much valuable data. In this case, we can use different interpolation techniques to estimate the missing values from the other training samples in our dataset. One of the most common interpolation techniques is mean imputation, where we simply replace the missing value by the mean value of the entire feature column. A convenient way to achieve this is by using the Imputer class from scikit-learn.”

Bringing features onto the same scale
From [?]: “Feature scaling is a crucial step in our preprocessing pipeline that can easily be forgotten. Decision trees and random forests are one of the very few machine learning algorithms where we don’t need to worry about feature scaling. However, the majority of machine learning and optimization algorithms behave much better if features are on the same scale, as we saw in Chapter 2, Training Machine Learning Algorithms for Classification, when we implemented the gradient descent optimization algorithm. The importance of feature scaling can be illustrated by a simple example. Let’s assume that we have two features where one feature is measured on a scale from 1 to 10 and the second feature is measured on a scale from 1 to 100,000. When we think of the squared error function in Adaline, it is intuitive to say that the algorithm will mostly be busy optimizing the weights according to the larger errors in the second feature. Another example is the k-nearest neighbors (KNN) algorithm with a Euclidean distance measure.”
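A minimal sketch of mean imputation with scikit-learn. Note that the Imputer class quoted above has been replaced by SimpleImputer in current scikit-learn versions; the toy matrix mirrors the df.values example.

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[ 1.,  2.,  3.,  4.],
              [ 5.,  6., np.nan, 8.],
              [10., 11., 12., np.nan]])

imputer = SimpleImputer(strategy="mean")   # replace NaN by the column mean
print(imputer.fit_transform(X))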
• tutorial on text feature extraction▲ introducing sklearn vectorizers for textual data
• Chapter 8 of [?] on a sentiment analysis problem... if you need a more basic introduction
4.1.3 Learning
The Generic Estimator API of sklearn
72
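A minimal sketch of the generic estimator API pattern (the toy data is invented for illustration): estimators are configured in the constructor, then used via fit, transform and predict.

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_train, y_train = [[0., 0.], [1., 1.], [2., 2.], [3., 3.]], [0, 0, 1, 1]

scaler = StandardScaler().fit(X_train)        # transformer: fit, then transform
clf = LogisticRegression().fit(scaler.transform(X_train), y_train)  # predictor
print(clf.predict(scaler.transform([[2.5, 2.5]])))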
Hyperparameter Tuning
73
Pipeline Example with Text Vectorization▲
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier

# Hyperparameters
sdg_params = dict(alpha=1e-5, penalty="l2", loss="log_loss")
vectorizer_params = dict(ngram_range=(1, 2), min_df=5, max_df=0.8)
# Supervised Pipeline
pipeline = Pipeline(
    [
        ("vect", CountVectorizer(**vectorizer_params)),
        ("tfidf", TfidfTransformer()),
        ("clf", SGDClassifier(**sdg_params)),
    ]
)
Play with this example https://bit.ly/sklearn-text-classification
import numpy as np
from sklearn.model_selection import RandomizedSearchCV

parameter_grid = {
    "vect__max_df": (0.2, 0.4, 0.6, 0.8, 1.0),
    "vect__min_df": (1, 3, 5, 10),
    "vect__ngram_range": ((1, 1), (1, 2)),  # unigrams or bigrams
    "tfidf__norm": ("l1", "l2"),  # norm is a parameter of the tfidf step in this pipeline
    "clf__alpha": np.logspace(-6, 6, 13),
}
random_search = RandomizedSearchCV(
    estimator=pipeline,
    param_distributions=parameter_grid,
)
74
sklearn supports several gridsearch strategies
Gridsearch Visualization▲
https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix
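A minimal sketch of a confusion matrix with scikit-learn (the labels are invented for illustration): rows are the true classes, columns the predicted classes.

from sklearn.metrics import confusion_matrix

y_true = ["spam", "ham", "spam", "spam", "ham", "ham"]
y_pred = ["spam", "ham", "ham",  "spam", "ham", "spam"]

print(confusion_matrix(y_true, y_pred, labels=["ham", "spam"]))
# [[2 1]
#  [1 2]]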
4.2 skorch
skorch▲ : Provide Pytorch Functionality with sklearn Interfaces
75
• pytorch’s performance on GPUs
76
Chapter 5
Learning Objectives
• Understand the feature representation for linear binary and multiclass classification
• Understand the perceptron learning rule for binary and multiclass classification
• Understand the different learning strategies (minimize errors, maximize margin, maxi-
mize training data) and their connection with loss functions
5.1 Binary
Linear Models in Theory
• The linear model for binary classification:
ŷ = 1 if w · x + b > 0,   ŷ = −1 if w · x + b < 0
• Parameter update on the training set T: w ← w − η ∇f(w, b; T)
Features in Practice: Sentiment Classification
77
5.1.1 Example: sentiment classification

From [?]: Suppose we are doing binary sentiment classification on movie review text, and we would like to know whether to assign the sentiment class + or − to a review document doc. We’ll represent each input observation by the 6 features x1 . . . x6 of the input shown in the following table; Fig. 5.2 shows the features in a sample mini test document.

Var  Definition                                  Value in Fig. 5.2
x1   count(positive lexicon words ∈ doc)         3
x2   count(negative lexicon words ∈ doc)         2
x3   1 if “no” ∈ doc, 0 otherwise                1
x4   count(1st and 2nd person pronouns ∈ doc)    3
x5   1 if “!” ∈ doc, 0 otherwise                 0
x6   ln(word count of doc)                       ln(66) = 4.19

Figure 5.2: A sample mini test document showing the extracted features in the vector x:
“It’s hokey. There are virtually no surprises, and the writing is second-rate. So why was it so enjoyable? For one thing, the cast is great. Another nice touch is the music. I was overcome with the urge to get off the couch and start dancing. It sucked me in, and it’ll do the same to you.”
(x1 = 3, x2 = 2, x3 = 1, x4 = 3, x5 = 0, x6 = 4.19)

Let’s assume for the moment that we’ve already learned a real-valued weight for each of these 6 features, and that the 6 weights corresponding to the 6 features are [2.5, −5.0, −1.2, 0.5, 2.0, 0.7], with b = 0.1. Given these 6 features and the input review x, P(+|x) and P(−|x) can be computed:

p(+|x) = P(Y = 1|x) = σ(w · x + b)
       = σ([2.5, −5.0, −1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.19] + 0.1)
       = σ(0.833)
       = 0.70    (5.6)

(w · x = 7.5 − 10 − 1.2 + 1.5 + 0 + 2.93 = 0.73; adding b = 0.1 gives 0.83.)

Now we have an algorithm that, given an instance x, computes the probability P(y = 1|x). How do we make a decision? For a test instance x, we say yes if the probability P(y = 1|x) is more than .5, and no otherwise. We call .5 the decision boundary:

ŷ = 1 if P(y = 1|x) > 0.5, and 0 otherwise
30
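The computation in Eq. 5.6 can be reproduced with a few lines of numpy; the weights and feature values are exactly the ones from the example above:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([3.0, 2.0, 1.0, 3.0, 0.0, 4.19])   # features x1..x6 of the mini document
w = np.array([2.5, -5.0, -1.2, 0.5, 2.0, 0.7])  # assumed learned weights
b = 0.1

p_pos = sigmoid(w @ x + b)      # sigma(0.833) ~ 0.70
p_neg = 1.0 - p_pos             # ~ 0.30
print(round(p_pos, 2), round(p_neg, 2))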
5.2 Multiclass

Multiclass Classification

One-Versus-Rest/All (OVR/A) — OVA or AVA?

Given multiclass training data, the feature representation can be class-independent or joint over input and class:

    f(x) : X → R^m
    f(x, y) : X × Y → R^m

Feature Functions as Numeric Vectors: Block Feature Vectors

• x = General George Washington, y = Person → f(x, y) = [1 1 0 1 0 0 0 0]
• x = George Washington Bridge, y = Object → f(x, y) = [0 0 0 0 1 1 1 0]
• x = George Washington George, y = Object → f(x, y) = [0 0 0 0 1 1 0 0]

We can rearrange the long vectors into a matrix of the non-zero section of each class. See [?]

Multiclass Linear Classification
Decision Function of Multiclass Linear Classification

• Let w ∈ R^m be a weight vector
• If we assume that w is known, then we define our classifier as

    ŷ = argmax_y w · f(x, y)
      = argmax_y Σ_{j=0}^{m} w_j × f_j(x, y)
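A tiny numpy sketch of this decision rule, using the block feature vectors from the example above; the weight values are made up for illustration:

import numpy as np

w = np.array([1.0, 0.5, 0.2, 1.0, -0.3, 0.4, 0.1, 0.0])   # assumed weight vector

f = {  # joint feature vectors f(x, y) for x = "General George Washington"
    "Person": np.array([1, 1, 0, 1, 0, 0, 0, 0]),
    "Object": np.array([0, 0, 0, 0, 1, 1, 1, 0]),
}

scores = {y: w @ fy for y, fy in f.items()}
y_hat = max(scores, key=scores.get)
print(scores, y_hat)   # Person wins (score 2.5) over Object (score 0.2)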
Bias Terms

• Often, linear classifiers are presented with an explicit per-class bias:

    ŷ = argmax_y Σ_{j=0}^{m} w_j × f_j(x, y) + b_y
5.2.2 Supervised Learning with the Perceptron

General Supervised Learning

• Input: Training examples T = {(x^(i), y^(i))}_{i=1}^{N}
• Feature representation f : X × Y → R^m
• Output: A vector w that optimizes some important function of the training set:
  – minimize error (Perceptron, SVMs, Boosting)
  – maximize likelihood of data (Logistic Regression, Naive Bayes)
• NB: Same as binary case except for feature representation

1: w ← 0
2: for a fixed number of iterations do
3:   for all (x, y) ∈ T do
4:     ŷ = argmax_y w · f(x, y)
5:     if ŷ ≠ y then
6:       w = w + f(x, y) − f(x, ŷ)
7:     end if
8:   end for
9: end for

There is an error in the binary case. Can you spot and fix it? Use the algorithm formulation from the next slide!
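A minimal numpy sketch of the multiclass perceptron loop shown above. The names feature_fn (a joint feature function f(x, y)), classes and dim are placeholders that a caller would have to supply:

import numpy as np

def train_perceptron(T, feature_fn, classes, dim, epochs=10):
    """Multiclass perceptron; T is a list of (x, y) pairs."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y in T:
            # predict with the current weights
            y_hat = max(classes, key=lambda c: w @ feature_fn(x, c))
            if y_hat != y:
                # update: promote the gold class, demote the predicted class
                w = w + feature_fn(x, y) - feature_fn(x, y_hat)
    return w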
Binary Perceptron Algorithm [?]

    P(y|x) = e^{w·f(x,y)} / Z_x,   where   Z_x = Σ_{y′∈Y} e^{w·f(x,y′)}

    argmax_y P(y|x) = argmax_y e^{w·f(x,y)} / Z_x
                    = argmax_y e^{w·f(x,y)}
                    = argmax_y w · f(x, y)
Linear Classifiers

    P(y|x) = e^{w·f(x,y)} / Z_x

◮ Q: How do we learn the weights w?
◮ A: Set the weights to maximize the log-likelihood of the training data.

Features in Multinomial Logistic Regression [?]: Features in multinomial logistic regression act like features in binary logistic regression, with the difference mentioned above that we'll need separate weight vectors and biases for each of the K classes. Recall our binary exclamation point feature x5.

Figure 5.3 [?]: Binary versus multinomial logistic regression. Binary logistic regression uses a single weight vector w, and has a scalar output ŷ. In multinomial logistic regression we have K separate weight vectors corresponding to the K classes, all packed into a single weight matrix W, and a vector output ŷ. If you work out the matrix arithmetic, you can see that the estimated score of the first output class ŷ1 (before we take the softmax) will correctly turn out to be w1 · x + b1. Fig. 5.3 shows an intuition of the role of the weight vector versus weight matrix in the computation of the output class probabilities for binary versus multinomial logistic regression.

Important: We are back to binary decisions (+1, −1) for this section.

The linear model for binary classification:

    ŷ = +1 if w · x + b > 0
    ŷ = −1 if w · x + b < 0

Learning as optimization (loss + regularization): Minimize Error

    w ← w − η ∇f(w, b; T)
Minimize Error (Loss, 0-1)

    L(w, b; T) = 1 if y · ŷ ≤ 0
    L(w, b; T) = 0 otherwise

The perceptron (or 0-1 loss) does not care about the margin.
(Figure: 0-1 loss plotted against the margin y · ŷ; train vs. test points.) What could we have done better when drawing the decision boundary?

Minimize Error and Maximize Margin (Loss, Hinge)

    L(w, b; T) = max(0, 1 − y · ŷ)

Maximize Likelihood (Loss, Log)

    L(w, b; T) = (1 / log 2) · log(1 + exp(−y · ŷ))

Log loss improves beyond a margin of 1.
Minimizing log loss means maximizing likelihood.
Min Error ≠ Max Likelihood

Example I: Binary Loss Differences
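The differences between the three losses can be made concrete with a small numpy sketch; the margin values are chosen arbitrarily for illustration:

import numpy as np

margins = np.array([-2.0, -0.5, 0.0, 0.5, 1.0, 2.0])      # y * y_hat

zero_one = (margins <= 0).astype(float)                   # perceptron / 0-1 loss
hinge    = np.maximum(0.0, 1.0 - margins)                  # max-margin (hinge) loss
logloss  = np.log1p(np.exp(-margins)) / np.log(2)          # likelihood-based log loss

for m, z, h, l in zip(margins, zero_one, hinge, logloss):
    print(f"margin={m:+.1f}  0-1={z:.1f}  hinge={h:.2f}  log={l:.2f}")

Note how the 0-1 loss is already zero at a small positive margin, the hinge loss keeps penalizing margins below 1, and the log loss keeps improving (slowly) even beyond a margin of 1.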
Summary

• True multinomial multiclass classification uses feature functions that encode the features per class
• Logistic Regression in sklearn▲

Recommended reading

• Perceptron▲ on Wikipedia
• Confused about the connection between Perceptron and SGD with Perceptron Loss?¹
  Note from sklearn: Perceptron() is equivalent to SGDClassifier(loss="perceptron", eta0=1, learning_rate="constant", penalty=None).

¹ https://stats.stackexchange.com/questions/137834/clarification-about-perceptron-rule-vs-gradient-descent-vs-stochastic-gradient
90
The dot product of two vectors is the sum of their elementwise products:

    a · b = aᵀb = Σ_{i=1}^{n} (a_i · b_i)

Example:

    [1, 3, 2, 0.5] · [1, 2, 1, 4] = 1·1 + 3·2 + 2·1 + 0.5·4 = 11
2.3 Scalars
If a scalar is multiplied with a vector or a matrix, it is multiplied with every entry of the vector/matrix:

    3 · [1 4 7; 2 0 1] = [3 12 21; 6 0 3]

2.4 Multiplication
Cell i, j of the product of two matrices A and B is the dot product of row i of matrix A and column j of matrix B:

    [1 2 3; 4 5 6] · [7 8; 9 10; 11 12] = [58 64; 139 154]

As a result, we can only multiply two matrices A ∈ R^{m×n} and B ∈ R^{k×l} if n == k. The dimension of the product matrix is m × l.

Chapter 6

Automatic Differentiation

Learning Objectives

• Understand the computation graph
• Grasp how gradients can be efficiently computed on computation graphs
• Grasp reverse-mode auto-differentiation
• Know the typical ingredients and steps for computation-graph-based optimization
6.1 Gradients

Derivatives▲ [?]

3 Derivatives
The derivative tells us the slope of a function at any point. Examples:
• The slope of a constant value (e.g., 7) is always 0
• The slope of a line w · x is w. Example: f(x) = 3x ⇒ f'(x) = 3

3.1 Calculating Derivatives
There are rules for calculating derivatives, such as:
• Multiplication by constant: g(x) = c · f(x) ⇒ g'(x) = c · f'(x)
• Power rule: f(x) = x^n ⇒ f'(x) = n · x^{n−1}
• Sum rule: h(x) = f(x) + g(x) ⇒ h'(x) = f'(x) + g'(x)
• Product rule: h(x) = f(x) · g(x) ⇒ h'(x) = f(x) · g'(x) + f'(x) · g(x)
• Chain rule: h(x) = f(g(x)) ⇒ h'(x) = f'(g(x)) · g'(x)
• Quotient rule: h(x) = f(x)/g(x) ⇒ h'(x) = (f'(x) · g(x) − g'(x) · f(x)) / g(x)²

Some special functions and their derivatives:

Function name   Function   Derivative
Exponential     e^x        e^x
Logarithm       ln(x)      1/x
Sine            sin(x)     cos(x)
Cosine          cos(x)     −sin(x)

Note: To denote differentiation, we can either use ' as in f'(x) or d/dx as in df/dx.

Example:
• Calculate the derivative of h(x) = e^(x²)
⇒ We can apply the chain rule: h(x) = f(g(x)) with f(y) = e^y and y = g(x) = x².
Thus, h'(x) = f'(g(x)) · g'(x) = e^(x²) · 2x

3.1.1 A Note on the Chain Rule
The chain rule h(x) = f(g(x)) ⇒ h'(x) = f'(g(x)) · g'(x) can also be written as dh/dx = (dh/dg) · (dg/dx).

Example from before: h(x) = e^(x²) with g = x²

    dh/dx = (dh/dg) · (dg/dx) = (de^g/dg) · (dx²/dx) = e^g · 2x = e^(x²) · 2x

3.2 Partial Derivatives
If our function has more than one variable and these variables are independent of each other, e.g., f(x, y) = x² + y³, we can calculate the partial derivatives with respect to each of them, treating the other one as a constant:

    f(x, y) = x² + y³ ⇒ ∂f/∂x = 2x + 0 = 2x ;  ∂f/∂y = 0 + 3y² = 3y²

6.1.1 Numeric

Numeric Differentiation:

    z = f(x, y) = (x ∗ x) ∗ y + (y + 2)

How many times do we call function f for all partial derivatives? How many times would we call a function with 1000 parameters (which neural networks easily have)?
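A small sketch of numeric differentiation with central finite differences for the function above; it makes the cost explicit: two extra function calls per parameter, so roughly 2000 calls for 1000 parameters:

def f(x, y):
    return (x * x) * y + (y + 2)

def partial(func, args, i, eps=1e-6):
    """Central finite difference w.r.t. argument i (two calls of func)."""
    plus = list(args);  plus[i] += eps
    minus = list(args); minus[i] -= eps
    return (func(*plus) - func(*minus)) / (2 * eps)

print(partial(f, [3.0, 4.0], 0))   # df/dx = 2*x*y = 24
print(partial(f, [3.0, 4.0], 1))   # df/dy = x*x + 1 = 10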
6.1.2 Symbolic
Symbolic Differentiation
• Build the diff expression from the leaves to the top of the original expression
93
>>> from sympy import symbols, diff, lambdify
>>> x, y = symbols('x y')
>>> f = x*x*y + (y + 2)
>>> diff(f, y)
x**2 + 1
>>> df_dx = diff(f, x)
>>> df_dx
2*x*y
>>> df_dx.subs(x, 3)          # partially evaluate at x = 3
6*y
# Lambdify: turn the symbolic derivative into a callable Python function
>>> df_dx_fun = lambdify((x, y), df_dx)
>>> df_dx_fun(3, 4)
24
• Symbolic differentiation can produce huge graphs for nested functions (exponential growth)
• How do we get the partial derivative of the output z with respect to the input x? Meaning: What is the rate of change of z if x changes?
• Chain rule:

    ∂z/∂x = (∂z/∂s) · (∂s/∂x)

• In forward mode (starting from the input), we first compute ∂s/∂x, then ∂z/∂s.
• In reverse mode (starting from the output), we first compute ∂z/∂s, then ∂s/∂x.
• x = 3, z(x) = sin(x²)
• How does forward and reverse mode work out for this example?
94
Reverse Mode: Example and Principle

    ∂z/∂x = (∂z/∂s) · (∂s/∂x)

Full Chain . . .
If z is the output of a sequence of functions with intermediate outputs s1, s2, ..., sn, the chain rule applies as follows:

    ∂z/∂x = (∂s1/∂x) · (∂s2/∂s1) · (∂s3/∂s2) · ... · (∂sn/∂sn−1) · (∂z/∂sn)

Source: [?, 512]
What is constant? What are the variables w.r.t. differentiation of the top node?

• ∂f/∂x
• Start at the top node
• n7 is the output

Sum rule

    ∂(a + b)/∂a = ∂a/∂a + ∂b/∂a

    ∂n7/∂n5 = ∂(n5 + n6)/∂n5 = ∂n5/∂n5 + ∂n6/∂n5 = 1 + 0 = 1

Product rule

    ∂(uv)/∂u = u · ∂v/∂u + v · ∂u/∂u

    ∂n5/∂n4 = ∂(n4 · n2)/∂n4 = n4 · ∂n2/∂n4 + n2 · ∂n4/∂n4 = 0 + n2 = 4

Backward Pass: What is the rate of change w.r.t. x?

    ∂f/∂x = (∂f/∂n4) · (∂n4/∂n1),   with   ∂n4/∂n1 = ∂(n1 · n1)/∂n1 = n1 + n1 = 3 + 3

There are two paths from n7 to n1. How do they combine? They add up!

    z = f(x, y) = (x ∗ x) ∗ y + (y + 2),  for x = 3, y = 4:   ∂f/∂x = 24

When we change x by +ε, how does it change z? Applying the rate of change w.r.t. x: What does it tell us? How does this relate to learning rates?

Source: [?, 512]
Do you understand now? If not . . . watch this video:
https://www.youtube.com/watch?v=wG_nF1awSSY:
103
Do you understand now?
104
6.2 Implementations
6.2.1 autograd
Autograd▲
• Autograd can automatically differentiate native Python and Numpy code by operator overloading
• Automatically derives the derivative of the function w.r.t. the relevant parameters (minimal sketch below)
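A minimal sketch with the autograd package, reusing the z(x) = sin(x²) example from above; the function name f is just illustrative:

import autograd.numpy as np      # thinly wrapped numpy
from autograd import grad

def f(x):
    return np.sin(x ** 2)

df_dx = grad(f)                  # reverse-mode derivative w.r.t. the first argument
print(df_dx(3.0))                # cos(9) * 2 * 3 ~ -5.467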
[?] https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html
Tensorflow’s eager evaluation makes it dynamic as well. . . pytorch’s compile functionality▲ goes the other way
round (and speeds up processing)
6.2.2 pytorch
Reverse Autograd Diff in pytorch
105
Try on Colab▲ Forward computation happens automatically here
Try on Colab▲
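For readers without Colab access, a minimal pytorch sketch of reverse-mode autodiff on the running example z = (x ∗ x) ∗ y + (y + 2):

import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)

z = (x * x) * y + (y + 2)   # forward computation builds the graph automatically
z.backward()                # reverse-mode backward pass

print(z.item())             # 42.0
print(x.grad.item())        # dz/dx = 2*x*y = 24.0
print(y.grad.item())        # dz/dy = x*x + 1 = 10.0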
Inner Details of PyTorch Autograd
https://www.youtube.com/watch?v=MswxJw-8PvE
Jacobian Matrix

Mathematically, if you have a vector-valued function ⃗y = f(⃗x), then the gradient of ⃗y with respect to ⃗x is a Jacobian matrix:

    J = [ ∂y1/∂x1  ···  ∂y1/∂xn ]
        [   ...    ...    ...   ]
        [ ∂ym/∂x1  ···  ∂ym/∂xn ]
6.2.3 tensorflow
Reverse Autograd Diff in Tensorflow
Try on Colab▲
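A minimal TensorFlow 2 sketch of the same reverse-mode computation, using GradientTape (the example values mirror the pytorch snippet above):

import tensorflow as tf

x = tf.Variable(3.0)
y = tf.Variable(4.0)

with tf.GradientTape() as tape:
    z = (x * x) * y + (y + 2)        # forward pass is recorded on the tape

dz_dx, dz_dy = tape.gradient(z, [x, y])
print(dz_dx.numpy(), dz_dy.numpy())  # 24.0, 10.0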
• Input: Placeholders in static computation graphs; actual data structures in dynamic com-
putation graphs
107
• Trainer: Optimizers (SGD, AdaDelta, etc.) that use the results of backward pass to update
the parameters
• Make sure you understand the relevant parts of the mathematical background [?] (in
OLAT) and visit the links if a refresher is needed
• Wikipedia Page on Automatic Differentiation▲ and a nice blog post on different auto-
matic differentiation▲
108
Chapter 7
Learning Objectives
• Know the typical OOP programming-style with pytorch for data, networks and opti-
mization
• Know the typical setup, vectorization, mini-batch generator, learning with pytorch
7.1 Tensors
Linear Algebra: Scalars, Vectors, Matrices, Tensors
Tensor
Source: https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.1-Scalars-Vectors-Matrices-and-Tensors/
Tensors in Pytorch
Squeezing
• Work through d2ai introductory pytorch notebook▲ , or the more extended examples in
Chapter 1 of [?]
109
Tensor Operations with Broadcasting Support¹

Broadcasting: Tensor arguments can be automatically expanded to be of equal sizes (without making copies of the data).

Two tensors are "broadcastable" if the following holds (examples below):
• Each tensor has at least one dimension.
• When iterating over the dimension sizes, starting at the trailing dimension, the dimension sizes must either be equal, one of them is 1, or one of them does not exist.
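A few pytorch examples illustrating the broadcasting rules just listed; the shapes are chosen arbitrarily:

import torch

a = torch.ones(5, 3)        # shape (5, 3)
b = torch.arange(3.0)       # shape (3,)   -> expanded to (5, 3)
c = torch.ones(5, 1)        # shape (5, 1) -> expanded to (5, 3)

print((a + b).shape)        # torch.Size([5, 3])
print((a * c).shape)        # torch.Size([5, 3])

# Not broadcastable: trailing sizes 3 and 2 differ and neither is 1
# torch.ones(5, 3) + torch.ones(5, 2)   # -> RuntimeError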
Chapter 3. Foundational Components of Neural Networks [?]

This chapter sets the stage for later chapters by introducing the basic ideas involved in building neural networks, such as activation functions, loss functions, optimizers, and the supervised training setup. We begin by looking at the perceptron, a one-unit neural network, to tie together the various concepts. The perceptron itself is a building block in more complex neural networks. This is a common pattern that will repeat itself throughout the book—every architecture or network we discuss can be used either standalone or compositionally within other complex networks. This compositionality will become clear as we discuss computational graphs and the rest of this book.

7.2 ML

The Perceptron: The Simplest Neural Network

The simplest neural network unit is a perceptron. The perceptron was historically and very loosely modeled after the biological neuron. As with a biological neuron, there is input and output, and "signals" flow from the inputs to the outputs, as illustrated in Figure 3-1.

Figure 3-1. The computational graph for a perceptron with an input (x) and an output (y). The weights (w) and bias (b) constitute the parameters of the model.

¹ https://pytorch.org/docs/stable/notes/broadcasting.html
Note: Graphical representations of neural networks are computation graph visualizations. . .
class Perceptron(nn.Module):
Super
• What is super(...) for? See Python 2 compatibility▲ and this blog on the inner working
of super()▲ .
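The class definition above is only a stub in the slides; one plausible way it could continue is sketched below (layer names and sizes are assumptions, not necessarily identical to the book's code):

import torch
import torch.nn as nn

class Perceptron(nn.Module):
    """A perceptron: one linear layer followed by a sigmoid."""
    def __init__(self, input_dim):
        super().__init__()                   # initialize the parent nn.Module
        self.fc1 = nn.Linear(input_dim, 1)   # parameters w and b

    def forward(self, x_in):
        # x_in: tensor of shape (batch, input_dim); output: (batch,)
        return torch.sigmoid(self.fc1(x_in)).squeeze(-1)

model = Perceptron(input_dim=4)
print(model(torch.rand(2, 4)))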
111
Σ f p1
p2
w1, 1
b
2 Neuron Model and Network Architectures
Σ p3 n
f a
1 w1, R b
pR
a = f (wp + b)
1
Layer of Neurons
Get Your Dimensions Right! Where do batches of data go? a = f (Wp + b)
le-Input Neuron
Simple Neuron Input Layer of S Neurons Matrix Multi-
plication in
Inputs Multiple-Input Neuron Input Multiple-Input Neuron NNs
p a
Rx1 W Sx1
p1 p n a
p2
w1, 1 SxR
Rx1 W
Sx1 f 1x1
n
p3
Σ n f a 1
Sx1
b 1xR
1x1 f
w1, R b R S
pR 1 b
1x1
1 R a = f(Wp + b) 1
a = f (Wp + b)
a = f (Wp + b)
Three Layers of Neurons
Input Multiple-Input Neuron
Deep Network
Input First Layer Second Layer Third Layer
p a 2-16
Rx1 W 1 x1
p n a 1 a2 a3
Rx1
W
1
1 x1R
n 11x 1 S1 x 1 f
W2
n2 S2 x 1
W3
n3 S3 x 1
1
S xR
b S1 x 1
f 1 S2 x S1
S2 x 1
f2 S3 x S2
S3 x 1
f3
R 1 b11x 1 1 1 b2 1 b3
S1 x 1 S2 x 1 S3 x 1
R S1 S2 S3
a = f (Wp + b)
a1 = f 1 (W1p + b1) a2 = f 2 (W2a1 + b2) a3 = f 3 (W3a2 + b3)
a3 = f 3 (W3 f 2 (W2f 1 (W1p + b1) + b2) + b3)
• torch.nn.parameter.Parameter▲: Parameters are complex objects, containing values (numerical data), gradients, and additional information.
• Dealing with unseen/unknown words: Unking rare tokens in training data allows model
to learn their "behaviour"
112
• Vectorization of data
113
Chapter 8
Learning Objectives
8.1 Motivation
8.1.1 XOR
Closed Solution to Linear Regression

Common loss function for linear regression (Loss, MSE):

    MSE(X, θ) = (1/m) Σ_{i=1}^{m} (θᵀx^(i) − y^(i))²

Analytical closed solution: "Normal Equation"

    θ̂ = (XᵀX)⁻¹ Xᵀ y

Closed Solution to Ridge Regression

Regularized loss function for ridge regression:

    J(θ) = MSE(θ) + α · (1/2) Σ_{i=1}^{n} θ_i²

Closed solution: θ̂ = (XᵀX + αA)⁻¹ Xᵀ y, where
• A is the n × n identity matrix
How does linear regression with MSE loss fail?

    MSE(X, θ) = (1/m) Σ_{i=1}^{m} (θᵀx^(i) − y^(i))²

Closed solution: θ̂ = (XᵀX)⁻¹ Xᵀ y. See [?, 172]

import numpy as np
# constant bias feature first
X = np.array([[1,0,0],[1,0,1],[1,1,0],[1,1,1]])
y = np.array([[0,1,1,0]]).T
# Normal equation
theta_best = (np.linalg.inv(X.T @ X)) @ X.T @ y
theta_best
>>> array([[ 0.5], [ 0. ], [ 0. ]])

Figure 6.1 [?]: Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each point. (Left) A linear model applied directly to the original input cannot implement the XOR function. When x1 = 0, the model's output must increase as x2 increases. When x1 = 1, the model's output must decrease as x2 increases. A linear model must apply a fixed coefficient w2 to x2. The linear model therefore cannot use the value of x1 to change the coefficient on x2 and cannot solve this problem. (Right) In the transformed space represented by the features extracted by a neural network, a linear model can now solve the problem. In our example solution, the two points that must have output 1 have been collapsed into a single point in feature space. In other words, the nonlinear features have mapped both x = [1, 0]ᵀ and x = [0, 1]ᵀ to a single point in feature space, h = [1, 0]ᵀ. The linear model can now describe the function as increasing in h1 and decreasing in h2. In this example, the motivation for learning the feature space is only to make the model capacity greater so that it can fit the training set. In more realistic applications, learned representations can also help the model to generalize.
8.1.2 Depth

Can Deep Linear Models Help?

Going deep means composition of functions: f(g(h(x)))

Composition of linear transformations
If two functions f : R^m → R^k and g : R^n → R^m are linear transformations, then their composition (f ◦ g) : R^n → R^k is a linear transformation.

Source: [?]

Feature (transformation) engineering needed! Or can we learn the transformations end-to-end? Yes we can!

• Linear transformation of input x: z(x) = xW′ + b′
• Elementwise nonlinear activation: g(x) = max(0, x)
• Non-linear transformation of linear transformation: y = g(xW + b)
• The expression is differentiable (except for the non-smooth change at 0▲) and SGD is applicable for optimizing the parameters
• However, nonlinearity introduces non-convex loss functions. A problem? Not as long as it finds good solutions

From [?]: The (transformation) function mapped the data into a representation that is suitable for linear classification. Having it at our disposal, we can now easily train a linear classifier to solve the XOR problem. In general, one can successfully train a linear classifier over a dataset which is not linearly separable by defining a function that will map the data to a representation in which it is linearly separable, and then train a linear classifier on the resulting representation. In the XOR example the transformed data has the same dimensions as the original one, but often in order to make the data linearly separable one needs to map it to a space with a much higher dimension. This solution has one glaring problem, however: we need to manually define the function, a process which is dependent on the particular dataset, and requires a lot of human intuition.

3.3 Kernel Methods
Kernelized Support Vector Machines (SVMs) [Boser et al., 1992], and Kernel Methods in general [Shawe-Taylor and Cristianini, 2004], approach this problem by defining a set of generic mappings, each of them mapping the data into very high dimensional—and sometimes even infinite—spaces, and then performing linear classification in the transformed space. Working in very high dimensional spaces significantly increases the probability of finding a suitable linear separator. One example mapping is the polynomial mapping, φ(x) = (x)^d. For d = 2, we get φ(x1, x2) = (x1x1, x1x2, x2x1, x2x2). This gives us all combinations of the two variables, allowing to solve the XOR problem using a linear classifier, with a polynomial increase in the number
Figure 6.2 [?]: An example of a feedforward network, drawn in two different styles. Specifically, this is the feedforward network we use to solve the XOR example. It has a single hidden layer containing two units. (Left) In this style, we draw every unit as a node in the graph. This style is very explicit and unambiguous, but for networks larger than this example it can consume too much space. (Right) In this style, we draw a node in the graph for each entire vector representing a layer's activations. This style is much more compact. Sometimes we annotate the edges in this graph with the name of the parameters that describe the relationship between two layers. Here, we indicate that a matrix W describes the mapping from x to h, and a vector w describes the mapping from h to y. We typically omit the intercept parameters associated with each layer when labeling this kind of drawing.

Figure 6.3 [?]: The rectified linear activation function g(z) = max{0, z}. This activation function is the default activation function recommended for use with most feedforward neural networks. Applying this function to the output of a linear transformation yields a nonlinear transformation. However, the function remains very close to linear, in the sense that it is a piecewise linear function with two linear pieces. Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods. They also preserve many of the properties that make linear models generalize well. A common principle throughout computer science is that we can build complicated systems from minimal components. Much as a Turing machine's memory needs only to be able to store 0 or 1 states, we can build a universal function approximator from rectified linear functions.

Now, we describe an affine transformation from a vector x to a vector h, so an entire vector of bias parameters is needed. The activation function g is typically chosen to be a function that is applied element-wise, with h_i = g(xᵀW_{:,i} + c_i). In modern neural networks, the default recommendation is to use the rectified linear unit or ReLU (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a) defined by the activation function g(z) = max{0, z} depicted in Fig. 6.3.

8.1.4 Depth and Nonlinearity

Adding a Bit of Depth to XOR

An "engineered" exact solution [?]: We can now specify our complete network as

    f(x; W, c, w, b) = wᵀ max{0, Wᵀx + c} + b.       (6.3)

We can now give a solution to the XOR problem:

    W = [1 1; 1 1],                                  (6.4)
    c = [0, −1]ᵀ,                                    (6.5)
    w = [1, −2]ᵀ,  b = 0                             (6.6)

Can a solution be learned in practical terms▲?
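Before asking whether it can be learned, the engineered solution of Eqs. 6.3–6.6 can at least be verified numerically with a short numpy sketch:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # the four XOR inputs
W = np.array([[1, 1], [1, 1]])
c = np.array([0, -1])
w = np.array([1, -2])
b = 0

h = np.maximum(0, X @ W + c)    # hidden representation: ReLU(XW + c)
y = h @ w + b
print(y)                        # [0 1 1 0] -- XOR is represented exactly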
(Figure: the XOR feedforward network in the two drawing styles of Figure 6.2, with hidden units h1, h2 computed from inputs x1, x2 via W, and output y computed from h via w.)

Make sure you understand how these operations work.

Gradient Descent Pitfalls for Non-Convex Cost Functions
• Scalar inputs
• Weights for the inputs
• (Nonlinear) activation function

(Figure [?]: a single neuron with inputs x1 ... x4 and output y1.)

The neurons are connected to each other, forming a network: the output of a neuron may feed into the inputs of one or more neurons. Such networks were shown to be very capable computational devices. If the weights are set correctly, a neural network with enough neurons and a nonlinear activation function can approximate a very wide range of mathematical functions (we will be more precise about this later). A typical feed-forward neural network may be drawn as in Figure 4.2. Each circle is a neuron, with incoming arrows being the neuron's inputs and outgoing arrows being the neuron's outputs. Each arrow carries a weight, reflecting its importance (not shown). Neurons are arranged ...

From [?]: An MLP is often used for classification, with each output corresponding to a different binary class (e.g., spam/ham, urgent/not-urgent, and so on). When the classes are exclusive (e.g., classes 0 through 9 for digit image classification), the output layer is typically modified by replacing the individual activation functions by a shared softmax function (see Figure 10-9). The softmax function was introduced in Chapter 3. The output of each neuron corresponds to the estimated probability of the corresponding class. Note that the signal flows only in one direction (from the inputs to the outputs), so this architecture is an example of a feedforward neural network (FNN).

. . . to MLP/FFNN for Classification
Figure 10-9. A modern MLP (including ReLU and softmax) for classification
Multilayer Architecture: Ingredients for Deep Learning
• Sequence of layers
119
How Many Hidden Layers are Needed?
Universal Approximation Theorem
• Theoretically, one hidden layer with a squashing non-linear activation function would be
enough to approximate any real function as close as possible to any non-zero error (see
[?, 197ff])
• Does not guarantee the learnability of the function from training data!
• One hidden layer would sometimes need to be exponentially large (e.g. an order of 2n
for binary vectors {0, 1}n )
• Practical results indicate that "going deep" helps to generalize better
• Practical results indicate that "going broad" without "going deep" is not always helpful

From [?]: ... (functions built from rectifier nonlinearities or maxout units) can represent functions with a number of regions that is exponential in the depth of the network. Figure 6.5 illustrates how a network with absolute value rectification creates mirror images of the function computed on top of some hidden unit, with respect to the input of that hidden unit. Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value nonlinearity). By composing these folding operations, we obtain an exponentially large number of piecewise linear regions which can capture all kinds of regular (e.g., repeating) patterns.

Source: [?, 437]
What's going on here?

Repeated Transformations
Each hidden layer can "linearize" further on the transformation of the preceding layer.

Figure 6.5 [?]: An intuitive, geometric explanation of the exponential advantage of deeper rectifier networks, shown formally by Montufar et al. (2014). (Left) An absolute value rectification unit has the same output for every pair of mirror points in its input. The mirror axis of symmetry is given by the hyperplane defined by the weights and bias of the unit. A function computed on top of that unit (the green decision surface) will be a mirror image of a simpler pattern across that axis of symmetry. (Center) The function can be obtained by folding the space around the axis of symmetry. (Right) Another repeating pattern can ...
Going Broad and Deep on the MNIST Task

Figure 6.6 [?]: Empirical results showing that deeper networks generalize better when used to transcribe multi-digit numbers from photographs of addresses. Data from Goodfellow et al. (2014d). The test set accuracy consistently increases with increasing depth (in the plot, test accuracy in percent from about 92.0 to 96.5 over 3 to 11 layers). See Figure 6.7 for a control experiment demonstrating that other increases to the model size do not yield the same effect.

Effect of Number of Parameters

From [?]: Another key consideration of architecture design is exactly how to connect a pair of layers to each other. In the default neural network layer described by a linear transformation via a matrix W, every input unit is connected to every output unit. Many specialized networks in the chapters ahead have fewer connections, so that each unit in the input layer is connected to only a small subset of units in the output layer. These strategies for reducing the number of connections reduce the number of parameters and the amount of computation required to evaluate the network, but are often highly problem-dependent. For example, convolutional networks, described in chapter 9, use specialized patterns of sparse connections that are very effective for computer vision problems. In this chapter, it is difficult to give much more specific advice concerning the architecture of a generic neural network. Subsequent chapters develop the particular architectural strategies that have been found to work well for different application domains.

Going broad without going deep introduces overfitting.

Figure 6.7 [?]: Deeper models tend to perform better. This is not merely because the model is larger. This experiment from Goodfellow et al. (2014d) shows that increasing the number of parameters in layers of convolutional networks without increasing their depth is not nearly as effective at increasing test set performance. The legend indicates the depth of network used to make each curve (3 or 11 layers, convolutional or fully connected) and whether the curve represents variation in the size of the convolutional or the fully connected layers. We observe that shallow models in this context overfit at around 20 million parameters while deep ones can benefit from having over 60 million. This suggests that using a deep model expresses a useful preference over the space of functions the model can learn. Specifically, it expresses a belief that the function should consist of many simpler functions composed together. This could result either in learning a representation that is composed in turn of simpler representations (e.g., corners defined in terms of edges) or in learning a program with sequentially dependent steps (e.g., first locate a set of objects, then segment them from each other, then recognize them).

Fat vs. Deep
Architectures

Fat + Short vs. Thin + Tall: shallow vs. deep architectures (e.g., in ASR) [?]

Capacity of Networks [?]

Recap: Capacity - Overfitting - Underfitting

(Figure [?]: error as a function of capacity, showing the training error and test error curves and the underfitting and overfitting regimes.)

8.2 FFNN
8.2.1 Notation

Neural Network: Deep Fully-Connected Feedforward Network (FFNN) with many hidden layers and many neurons

    y = f(W^N ... f(W²(f(W¹x + b¹) + b²) ... + b^N)

f: non-linear activation function [?]

Neural Network Notation [?]
Neural Network
Notation
• Loss computation: Quantify the scalar loss (that is cost including regularization if present)
of ŷ wrt. the gold solution y.
• Backward Pass: Determine contribution of each parameter to the loss by applying the
chain rule (back-propagation).
• Training Step: Adapt the parameters according to the computed gradients and update
regime (learning rate and gradient descent)
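The steps listed above map directly onto a few lines of pytorch. A minimal sketch with made-up layer sizes and random data (model, loss function and optimizer are illustrative choices, not prescribed by the slides):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 10)            # mini-batch of 32 examples
y = torch.randint(0, 2, (32,))     # gold labels

y_hat = model(x)                   # forward pass
loss = loss_fn(y_hat, y)           # loss computation
optimizer.zero_grad()
loss.backward()                    # backward pass (back-propagation)
optimizer.step()                   # training step: update the parameters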
Initialization
• Different initializations (via random seeds) lead to different final parameters (models).
Why? Non-convexity of objective function! Saddle points, local minima. Different mod-
els are good for ensembling!
125
Table 11-1 [?]. From [?]: The initialization strategy for the ReLU activation function (and its variants, including the ELU activation described shortly) is sometimes called He initialization (after the last name of its author). By default, the fully_connected() function (introduced in Chapter 10) uses Xavier initialization (with a uniform distribution). You can change this to He initialization by using the variance_scaling_initializer() function like this:

he_init = tf.contrib.layers.variance_scaling_initializer()
hidden1 = fully_connected(X, n_hidden1, weights_initializer=he_init, scope="h1")

Footnotes from [?]: ³ This simplified strategy was actually already proposed much earlier—for example, in the 1998 book Neural Networks: Tricks of the Trade by Genevieve Orr and Klaus-Robert Müller (Springer). ⁴ Such as "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," K. He et al. (2015).

8.3.1 Activations

Some Nonlinear Activation Functions

(Figure: plots of sigmoid(x), tanh(x), hardtanh(x) and ReLU(x), and of their derivatives ∂f/∂x, over the range [−6, 6].)

Some activation functions are "more linear" than others! That can actually be better...
What are the graphs of their derivatives?
See the d2l section on activation functions▲

Activation Functions in Layers

For hidden layers:
• sigmoid, tanh, ReLU, ELU, Swish▲, etc.
• Different activation functions have different problems in learning (saturation, "dead" at zero). Good introduction▲
• Sigmoid
From [?]: ... ELU variants in their experiments: training time was reduced and the neural network performed better on the test set. It is represented in Figure 11-3, and Equation 11-2 gives its definition.

Equation 11-2. ELU activation function

    ELU_α(z) = α(exp(z) − 1)  if z < 0
    ELU_α(z) = z              if z ≥ 0

From [?]: ... examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine — its ability to produce sensible answers on new inputs that it has never seen during training. ... (A good representation must be insensitive to irrelevant) variations of the input, such as the illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other. A linear classifier, or any other 'shallow' classifier operating on ...

Figure 1 [?], Multilayer neural networks and backpropagation: (a) A multilayer neural network (shown by the connected dots) can distort the input space to make the classes of data (examples of which are on the red and blue lines) linearly separable. Note how a regular grid (shown on the left) in input space is also transformed (shown in the middle panel) by hidden units. This is an illustrative example with only two input units, two hidden units and one output unit, but the networks used for object recognition or natural language processing contain tens or hundreds of thousands of units. Reproduced with permission from C. Olah (http://colah.github.io/). (b) The chain rule of derivatives tells us how two small effects (that of a small change of x on y, and that of y on z) are composed. A small change Δx in x gets transformed first into a small change Δy in y by getting multiplied by ∂y/∂x (that is, the definition of partial derivative). Similarly, the change Δy creates a change Δz in z. Substituting one equation into the other gives the chain rule of derivatives — how Δx gets turned into Δz. (c, d) The equations used for the forward and backward pass of a network with input units i, hidden layers H1 and H2, and output units l; E is the loss, z the total input of a unit, y its output:

    y_l = f(z_l),   z_l = Σ_{k∈H2} w_kl · y_k
    y_k = f(z_k),   z_k = Σ_{j∈H1} w_jk · y_j
    y_j = f(z_j),   z_j = Σ_{i∈Input} w_ij · x_i

    ∂E/∂y_k = Σ_{l∈out} w_kl · ∂E/∂z_l,   ∂E/∂z_k = (∂E/∂y_k) · (∂y_k/∂z_k)
    ∂E/∂y_j = Σ_{k∈H2} w_jk · ∂E/∂z_k,   ∂E/∂z_j = (∂E/∂y_j) · (∂y_j/∂z_j)

Source: [?]

• What is not shown here?

Forward Computation in Matrix Vector Notation
Neural Network, layer-wise view [?]:

    a¹ = f(W¹x + b¹),   a² = f(W²a¹ + b²),   ...,   a^L = f(W^L a^{L−1} + b^L)

MLP in Goldberg's notation [?]:

    NN_MLP1(x) = g(xW¹ + b¹)W² + b²                                        (4.2)

    x ∈ R^{d_in},  W¹ ∈ R^{d_in×d_1},  b¹ ∈ R^{d_1},  W² ∈ R^{d_1×d_2},  b² ∈ R^{d_2}

W¹ and b¹ are a matrix and a bias term for the first linear transformation of the input, g is a nonlinear function that is applied element-wise (also called a nonlinearity or an activation function), and W² and b² are the matrix and bias term for a second linear transform. Breaking it down, xW¹ + b¹ is a linear transformation of the input x from d_in dimensions to d_1 dimensions. g is then applied to each of the d_1 dimensions, and the matrix W² together with b² are then used to transform the result into the d_2-dimensional output vector. The nonlinear activation function g has a crucial role in the network's ability to represent complex functions. Without the nonlinearity in g, the neural network can only represent linear transformations of the input.⁴ Taking the view in Chapter 3, the first layer transforms the data into a good representation, while the second layer applies a linear classifier to that representation.

• Be aware of Goldberg's non-standard notation with row vectors instead of the usual column vectors.
• Can you draw the computation graph?

Figure 4.2 [?]: Feed-forward neural network with two hidden layers. The bottom-most layer x is the input to the network. The top-most layer has no outgoing arrows and is the output of the network. The other layers are considered "hidden." The sigmoid shape inside the neurons in the middle layers represents a nonlinear function (i.e., the logistic function 1/(1 + e^{−x})) that is applied to the neuron's value before passing it to the output. In the figure, each neuron is connected to all the neurons in the next layer—this is called a fully connected layer or an affine layer.³

From [?]: While the brain metaphor is sexy and intriguing, it is also distracting and cumbersome to manipulate mathematically. We therefore switch back to using more concise mathematical notation. As will soon become apparent, a feed-forward network as the one in Figure 4.2 is simply a stack of linear models separated by nonlinear functions. The values of each row of neurons in the network can be thought of as a vector. In Figure 4.2 the input layer is a 4-dimensional vector (x), and the layer above it is a 6-dimensional vector (h1). The fully connected layer can be thought of as a linear transformation from 4 dimensions to 6 dimensions. A fully connected layer implements a vector-matrix multiplication, h = xW, where the weight of the connection from the i-th neuron in the input row to the j-th neuron in the output row is W[i,j]. The values of h are then transformed by a nonlinear function g that is applied to each value before being passed on as input to the next layer. The whole computation from input to output can be written as: (g(xW¹))W², where W¹ are the weights of the first layer and W² are the weights of the second one. Taking this view, the single neuron in Figure 4.1 is equivalent to a logistic (log-linear) binary classifier σ(xw) without a bias term.

We can add additional linear-transformations and nonlinearities, resulting in an MLP with two hidden layers (the network in Figure 4.2 is of this form):

    NN_MLP2(x) = (g²(g¹(xW¹ + b¹)W² + b²))W³                                (4.3)

It is perhaps clearer to write deeper networks like this using intermediary variables:

    NN_MLP2(x) = y
    h¹ = g¹(xW¹ + b¹)
    h² = g²(h¹W² + b²)                                                      (4.4)
    y  = h²W³

Footnotes from [?]: ³ The network in Figure 4.2 does not include bias terms. A bias term can be added to a layer by adding to it an additional neuron that does not have any incoming connections, whose value is always 1. ⁴ To see why, consider that a sequence of linear transformations is still a linear transformation.
• Each leaf node has to be input/parameter data. Each internal node an operation!
129
Neural Network

Forward Pass of an MLP

(Figure [?]: the input vector x is transformed by layer 1, layer 2, ..., layer L into activation vectors a¹, a², ..., a^L = y.)

    y = f(W^L ... f(W²(f(W¹x + b¹) + b²) ... + b^L)

8.3.3 Backward

Backward Pass: A Closer Look

    ∂E/∂y_k = Σ_{l∈out} w_kl · ∂E/∂z_l,   ∂E/∂z_k = (∂E/∂y_k) · (∂y_k/∂z_k)
    ∂E/∂y_j = Σ_{k∈H2} w_jk · ∂E/∂z_k,   ∂E/∂z_j = (∂E/∂y_j) · (∂y_j/∂z_j)

Source: [?]

From [?]: At each layer, we first compute the total input z to each unit, which is a weighted sum of the outputs of the units in the layer below. Then a non-linear function f(·) is applied to z to get the output of the unit. For simplicity, we have omitted bias terms. At each hidden layer we compute the error derivative with respect to the output of each unit, which is a weighted sum of the error derivatives with respect to the total inputs to the units in the layer above. We then convert the error derivative with respect to the output into the error derivative with respect to the total input.

• What does t stand for?
• What does E stand for?
Backpropagation as Reverse-Mode Autodiff
“The backpropagation algorithm is a fancy name for methodically computing the derivatives
of a complex expression using the chain-rule, while caching intermediary results. More gener-
ally, the backpropagation algorithm is a special case of the reverse-mode automatic differenti-
ation algorithm.” [?]
• See 3Blue1Brown video series▲ [?] for a good well-animated formal explanation of back-
propagation in FFNNs
131
How many rows and columns in the original grayscale matrix?
132
Weighting the Input
133
Predicting the digit of a grayscale image with a FFNN.
How many hidden layers? How many nodes in the hidden layer? How many connection weights?
Gradient Landscape
134
Backprop Motivation: What are gradients telling us? The relative influence of certain
weights with respect to the cost function!
The gradient of the cost function C specifies for any model parameter its change of rate with re-
spect to the C.
Nudging the connection weight with the yellow gradient has 32 times more influence on the
cost function than changing the pink arrow
Backprop Motivation: Larger Deviations from Ground Truth Should Change More into the
Desired Direction
135
What are the arrows indicating?
Predicted values that are closer to the expected value should change less!
136
Backprop: Minimal Example
137
The influence factors.
Backprop: Propagating back recursively over more layers with more than one neuron
138
Study the written text https://www.3blue1brown.com/lessons/backpropagation-calculus
8.3.4 Problems
Learning Problems for Deep Networks
What to do?
• Forgetting the original input signal in deeper layers: Highway networks [?] and Residual
Networks▲
• Deep FFNNs quickly have too much capacity for NLP problem generalization. CNNs
and RNNs can cure this
8.4 Tooling
8.4.1 PyTorch
FFNN for Markov Language Modeling
https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#an-example-n-gram-language-mode
Study the code together with your neighbor! Ask questions in the forum for things that you
don’t understand!
Can you draw the computation graph of single forward step?
139
Introduction to Keras

8.4.2 Keras

MLPs in Keras

Keras Sequential Layer API
[?]
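A minimal sketch of the Sequential API for an MLP classifier; the layer sizes and the MNIST-like 784-dimensional input are illustrative assumptions, and the (commented) fit call presumes X_train/y_train exist:

from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=5, batch_size=32)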
8.4.3 DyNet
DyNet: An Early Dynamic NLP-oriented Framework
If you struggle with what goes behind the scene in pytorch, DyNet is more transparent!
Installation and Use
• DyNet runs well on CPUs (and GPUs); based on Eigen library (as Tensorflow)
140
Algorithm 5.5 [?, 58]: Neural network training with computation graph abstraction (using minibatches of size 1).

A More Real-World Classification Computation Graph

(Figure [?, 52]: a computation graph for an MLP over the concatenated embeddings of "the black dog": three embedding lookups in E are concatenated to x (1 × 150), multiplied with W¹ (150 × 20), added to b¹, passed through tanh, multiplied with W² (20 × 17), added to b², followed by softmax, pick(·, 5), log and neg to produce the loss. (a) Graph with unbound input, (b) graph with concrete input, (c) graph with concrete input and expected output (5) and loss computation. E is the embedding matrix.)

From [?]: ... independently of the other connected components; parameters can be shared between different computation graphs (recall that each graph corresponds to a specific training example). The second block turns the model parameters into the graph-node (Expression) types. The third block retrieves the Expressions for the embeddings of the input words. Finally, the fourth block is where the graph is created. Note how transparent the graph creation is—there is an almost one-to-one correspondence between creating the graph and describing it mathematically. The last block shows a forward and backward pass. The equivalent code in the TensorFlow package is:⁵

import tensorflow as tf

W1 = tf.get_variable("W1", [20, 150])
b1 = tf.get_variable("b1", [20])
W2 = tf.get_variable("W2", [17, 20])
b2 = tf.get_variable("b2", [17])
lookup = tf.get_variable("W", [100, 50])

def get_index(x):
    pass  # Logic omitted

p1 = tf.placeholder(tf.int32, [])
p2 = tf.placeholder(tf.int32, [])
p3 = tf.placeholder(tf.int32, [])
target = tf.placeholder(tf.int32, [])

v_w1 = tf.nn.embedding_lookup(lookup, p1)
v_w2 = tf.nn.embedding_lookup(lookup, p2)
v_w3 = tf.nn.embedding_lookup(lookup, p3)

x = tf.concat([v_w1, v_w2, v_w3], 0)
output = tf.nn.softmax(
    tf.einsum("ij,j->i", W2, tf.tanh(
        tf.einsum("ij,j->i", W1, x) + b1)) + b2)
loss = -tf.log(output[target])
trainer = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# Graph definition done, compile it and feed concrete data.
# Only one data-point is shown, in practice we will use
# a data-feeding loop.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    feed_dict = {
        p1: get_index("the"),
        p2: get_index("black"),
        p3: get_index("dog"),
        ...

⁵ TensorFlow code provided by Tim Rocktäschel. Thanks Tim!

Backpropagation on the graph: the derivative of the output node N with respect to a node i sums over all nodes j that take i as input:

    ∂N/∂i = Σ_{j∈ω(i)} (∂N/∂j) · (∂f_j/∂i)

Derivation of Element Selection (pick)

"The pick node implements an indexing operation, receiving a vector and an index (in this case, 5) and returning the corresponding entry in the vector." [?, 53]

    pick(x, 5) = x[5],   ∂pick(x, 5)/∂x = g  with  g[5] = 1, g[i≠5] = 0

    max(0, x):  derivative 1 if x > 0, 0 otherwise

What is the derivative of max(0, x)? [?, 52]
• FFNNs (sometimes called MLP) with non-linear activation functions and at least 1 hid-
den layer can approximate any computable function
• Chapters 4,5,6 in [?]; especially 5.1 introduces Computation Graphs and the Backpropa-
gation Algorithm
143
Chapter 9
Learning Objectives
9.1 Distributionalism
What is bardiwac?
[?]
• “You shall know a word by the company it keeps!” (J. R. Firth (1957))
144
• “words which are similar in meaning occur in similar contexts (Rubenstein & Goode-
nough, 1965)”
• “words with similar meanings will occur with similar neighbors if enough text material
is available” (Schütze & Pedersen, 1995)
• “words that occur in the same contexts tend to have similar meanings” (Pantel, 2005)
What is Meaning?
But what is meaning? Meaning
What is bardiwac?
Distributional semantics
What is Meaning?
[?]
145
Window based cooccurence matrix
• Example corpus:
• I like deep learning.
• I like NLP.
• I enjoy flying.
counts I like enjoy deep learning NLP flying .
I 0 2 1 0 0 0 0 0
like 2 0 0 1 0 1 0 0
enjoy 1 0 0 0 0 0 1 0
deep 0 1 0 0 1 0 0 0
learning 0 0 0 1 0 0 0 1
NLP 0 1 0 0 0 0 0 1
flying 0 0 1 0 0 0 0 1
. 0 0 0 0 1 1 1 0
Source: [?, 9]
“an
automatically produced thesaurus which identifies words that occur in similar contexts as the
target word”▲
Types of contexts for adjectives are for instance adjective and its modified noun (stupid error)
or coordinated adjectives (he is stupid and mean).
9.1.1 Variations
Many Ways of Distributional Modeling
Distribution-
alism
146
Design choices, from preprocessing (tokenization, annotation, tagging, parsing, feature selection, clustering texts by date/author/discourse context, ...) to the following dimensions [?]:

Matrix type               Weighting                Dimensionality reduction   Vector comparison
word × document           probabilities            LSA                        Euclidean
word × word               length normalization     PLSA                       Cosine
word × search proximity   TF-IDF                   LDA                        Dice
adj. × modified noun      PMI                      PCA                        Jaccard
word × dependency rel.    Positive PMI             IS                         KL
verb × arguments          PPMI with discounting    DCA                        KL with skew
...                       ...                      ...                        ...

(Nearly the full cross-product to explore; only a handful of the combinations are ruled out mathematically, and the literature contains relatively little guidance.)

See [?] Chapter 6 for definitions of PPMI etc.
Idea: co-occurrence counts

Distributionalism: Reducing Co-occurrence Counts
Corpus sentences → co-occurrence count vectors → dimensionality reduction → small vectors [?]

Latent Semantic Analysis (LSA)
X is the document-term co-occurrence matrix (d × w); it contains the weights (relative counts in the simplest case) of all considered words w for all documents d.

    X ≈ X̃ = U Σ Vᵀ

• LSA term vectors: the hope is that terms having common meaning are mapped to the same direction.
• LSA document vectors: the hope is that documents discussing similar topics have similar representations.
• The dimension (rank k) of the diagonal matrix Σ sets the compression factor. [?]
• Bad news: computational costs grow quadratically in d × w. New words/documents are hard to integrate.
9.2 Embeddings

Atomic Word Representation: One-Hot Encoding

    [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]

Dimensionality: 20K (speech) – 50K (PTB) – 500K (big vocab) – 13M (Google 1T)

We call this a "one-hot" representation. Its problem:

    motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]  AND
    hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0]  =  0

Source: [?, 35]
148
Neural word embeddings as a distributed representation

Similar idea: combine vector space semantics with the prediction of probabilistic models (Bengio et al. 2003, Collobert & Weston 2008, Turian et al. 2010). In all of these approaches, including deep learning models, a word is represented as a dense vector:

    linguistics = [0.286, 0.792, −0.177, −0.107, 0.109, −0.542, 0.349, 0.271]

[?, 38]
Word Embeddings
Continuous, numeric, dense word representations learned from raw text.
Similar vectors mean similar words (cosine vector distance).
Embeddings are a perfect input for numeric ML methods.
• Odd-One-Out: Identify the word that does not belong into a set of words. . .
149
Linguistic Regularities in Word Vector Space
Semantic Connections in Vector Space Substructures▲
Implicitly, linear semantic AND syntactic connections are learned and modeled in the vector
space! Astonishing!
https://carl-allen.github.io/nlp/2019/07/01/explaining-analogies-explained.html
Allen & Hospedales, ICML 2019, Best Paper Honourable Mention (official blog: see link above)

Analogy Computation in Vector Space
150
. . . trained using W2V, one could take the vector of the word king, subtract the word man, add the word woman and get that the closest vector to the result (when excluding the words king, man, and woman) belongs to the word queen. That is, in vector space w_king − w_man + w_woman ≈ w_queen. Similar results are obtained for various other semantic relations, for example w_France − w_Paris + w_London ≈ w_England, and the same holds for many other cities and countries.
This has given rise to the analogy solving task, in which different word embeddings are evaluated on their ability to answer analogy questions of the form man:woman → king:? by solving:

analogy(m:w → k:?) = argmax_{v ∈ V \ {m,w,k}} cos(v, k − m + w)        (11.4)

Levy and Goldberg [2014] observe that for normalized vectors, solving the maximization in Equation (11.4) is equivalent to solving Equation (11.5), that is, searching for a word that is similar to king, similar to woman, and dissimilar to man:

analogy(m:w → k:?) = argmax_{v ∈ V \ {m,w,k}} cos(v, k) − cos(v, m) + cos(v, w)        (11.5)

Levy and Goldberg refer to this method as 3CosAdd. The move from arithmetics between words in vector space to arithmetics between word similarities helps to explain to some extent the ability of the word embeddings to "solve" analogies, as well as suggest which kinds of analogies can be recovered by this method. It also highlights a possible deficiency of the 3CosAdd analogy recovery method: because of the additive nature of the objective, one term in the summation may dominate the expression, effectively ignoring the others. As suggested by Levy and Goldberg, this can be alleviated by changing to a multiplicative objective (3CosMul):

analogy(m:w → k:?) = argmax_{v ∈ V \ {m,w,k}} [ cos(v, k) · cos(v, w) ] / ( cos(v, m) + ε )        (11.6)

Source: [?]

The Cosine Similarity
Dot product geometrically: a · b = ‖a‖ ‖b‖ cos θ = ‖b‖ (‖a‖ cos θ). Explanation: ‖a‖ cos θ projects a onto b.
http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/
151
Images adapted from this blog▲. The red vector is the result of resolving MAN + ? = WOMAN, that is ? = WOMAN − MAN. What is it in numbers?
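A hedged sketch with gensim (again assuming the pretrained "glove-wiki-gigaword-50" vectors): most_similar implements the additive 3CosAdd objective, most_similar_cosmul the multiplicative 3CosMul variant.

import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")
# man:woman -> king:?  ==  argmax cos(v, king - man + woman)
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# multiplicative objective (3CosMul)
print(wv.most_similar_cosmul(positive=["king", "woman"], negative=["man"], topn=3))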
Analogies in 2D-Space
[Figure slides: analogy offsets illustrated step by step in a 2D projection of the vector space]
152
153
154
In-Class Task: Analogies
Go to https://cutt.ly/ml4nlp1-hs22-we
• PCA (Principal Component Analysis): Linear-algebra method that projects onto the directions of maximal variance in the data. Stable (deterministic) visualization! (See the PCA sketch below.)
1
https://towardsdatascience.com/visualizing-word-embedding-with-pca-and-t-sne-961a692509f5
155
Linguistic Regularities - Results
Which one is PCA? Which one is t-SNE?
[?]
The word2vec performance revolution in 2013 . . .
156
Early Approach Using Classical n-Gram Language Modeling
Feedforward Neural Net Language Model
Source: http://www.micc.unifi.it/downloads/readingroom/TextRepresentationNeuralNetwork.pdf See simplified PyTorch implementation▲
9.2.1 word2vec
Continuous Bag-Of-Word Language Modeling
A neural network learns to estimate the probability of a word given some context represented as a vector: word2vec [?]
www.youtube.com/watch?v=aZarigloqXc
157
Continuous Bag-of-Words Architecture
[Figure [?]: CBOW — the context words w(t−2), w(t−1), w(t+1), w(t+2) are summed to predict the current word w(t).]
Predicts the current word given the context
• Given the context, predict the current word!
• Efficient computation with shallow neural nets (1 hidden layer)
• Sum dense input word representations (and divide them for proper averaging)
• Shallow NNs can work with larger training sets than deep NNs
• There is no data like more data. . .
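A minimal CBOW sketch in PyTorch; the vocabulary size, embedding dimension, and toy batch are invented for illustration:

import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, emb_dim):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # input word vectors
        self.out = nn.Linear(emb_dim, vocab_size)      # scores for every word in V

    def forward(self, context_ids):                    # (batch, 2*window)
        averaged = self.emb(context_ids).mean(dim=1)   # sum & divide = average
        return self.out(averaged)                      # (batch, vocab_size) logits

model = CBOW(vocab_size=5000, emb_dim=100)
context = torch.randint(0, 5000, (8, 4))               # 8 windows, 4 context words each
target = torch.randint(0, 5000, (8,))                  # the center words to predict
loss = nn.CrossEntropyLoss()(model(context), target)   # softmax over the vocabulary
loss.backward()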
Idea: Skip-Gram
158
Skip-Gram Architecture
[Figure [?]: Skip-gram — the current word w(t) is used to predict the surrounding words w(t−2), w(t−1), w(t+1), w(t+2).]
Predicts the surrounding words given the current word
• Given the current word, predict the most probable context by predicting each context word separately!
• Negative sampling (i.e., also learn the improbability of selected improbable words) improves quality and efficiency
word2vec Architecture
word2vec: SKIP-GRAM Windows and Probabilities
Example windows and the process for computing P(w_{t+j} | w_t)
[?] http://web.stanford.edu/class/cs224n/syllabus.html
159
word2vec SKIP-GRAM Objective Function
For each position t = 1, ..., T, predict context words within a window of fixed size m, given center word w_t:

Likelihood = L(θ) = ∏_{t=1}^{T} ∏_{−m ≤ j ≤ m, j ≠ 0} P(w_{t+j} | w_t; θ)

θ is all variables to be optimized. [?]

word2vec: Conditional Word Probabilities (prediction function)

P(o | c) = exp(u_o^T v_c) / Σ_{w ∈ V} exp(u_w^T v_c)

[?]
Who does the workload? Softmax
Mikolov et al, 2013, https://arxiv.org/pdf/1310.4546.pdf
160
This is softmax!
word2vec: Softmax
Softmax function ℝ^n → ℝ^n:

softmax(x)_i = exp(x_i) / Σ_{j=1}^{n} exp(x_j) = p_i
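A small check of the formula in NumPy (the score values are arbitrary):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())        # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))             # approx. [0.66 0.24 0.10], sums to 1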
Example token indices in the vocabulary V for "I saw a cat .": 39, 1592, 10, 2548, 5. [?]
[Figure [?]: for the center word "into" in ". . . problems turning into banking crises . . .", the softmax over V yields the probabilities of the context words (problems, turning, banking, crises).]
161
[?]
Where is θ?
word2vec: What are the parameters?
[?]
Where is θ?
word2vec: Where are the parameters?
[?]
162
https://ronxin.github.io/wevi/
The normalization term Σ_{w∈V} exp(u_w^T v_c) of the softmax is expensive. Possible solutions:
• Hierarchical softmax
• Negative sampling: replace the full-vocabulary sum Σ_{w∈V} exp(u_w^T v_c) by a sum over w ∈ {o} ∪ S_k, i.e., the observed context word o plus a small set S_k of sampled negative words
Hierarchical Softmax
163
http://building-babylon.net/2017/08/01/hierarchical-softmax/
PMI: Point-wise Mutual Information (Blog Post with code▲). When is PMI positive? PMI is positive if the two words tend to co-occur, 0 if they occur together exactly as often as one would expect by chance, and less than 0 if they are in complementary distribution. [?]
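A small PPMI sketch in NumPy over an invented word-context count matrix, following the standard definition PMI(w, c) = log [ P(w, c) / (P(w) P(c)) ]:

import numpy as np

counts = np.array([[10., 0., 2.],     # rows: words, columns: contexts (invented counts)
                   [ 1., 8., 1.],
                   [ 0., 1., 6.]])
total = counts.sum()
p_wc = counts / total
p_w = p_wc.sum(axis=1, keepdims=True)
p_c = p_wc.sum(axis=0, keepdims=True)

with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0)             # Positive PMI: clip negative (and -inf) values to 0
print(np.round(ppmi, 2))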
164
Word2Vec: (Near) Equivalence to Matrix Factorization

PMI(w, c) = log [ N(w, c) · |V| / ( N(w) · N(c) ) ]

The w × c matrix M^PMI = X ≈ X̃ = V_d Σ_d U_d^T; the factorization yields the word vectors and the context vectors.

[?] Levy et al, TACL 2015 http://www.aclweb.org/anthology/Q15-1016
Famous paper reinterpreting the results of the word2vec algorithm in a declarative fashion [?]
9.2.2 GloVe
GloVe2 : Global Vectors for Word Representations
Window-based co-occurrence matrix
• GloVe combines count-based and prediction methods
• Training is performed on aggregated global word-word co-occurrence statistics from a corpus.
• Example corpus:
  I like deep learning.
  I like NLP.
  I enjoy flying.

counts     I  like  enjoy  deep  learning  NLP  flying  .
I          0   2     1      0     0         0    0      0
like       2   0     0      1     0         1    0      0
enjoy      1   0     0      0     0         0    1      0
deep       0   1     0      0     1         0    0      0
learning   0   0     0      1     0         0    0      1
NLP        0   1     0      0     0         0    0      1
flying     0   0     1      0     0         0    0      1
.          0   0     0      0     1         1    1      0

Source: Richard Socher
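A small sketch that builds this symmetric co-occurrence matrix from the example corpus; a window size of 1 reproduces the counts in the table above:

from collections import defaultdict

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
window = 1
counts = defaultdict(int)

for sentence in corpus:
    tokens = sentence.split()
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                counts[(w, tokens[j])] += 1

print(counts[("I", "like")])       # 2
print(counts[("like", "deep")])    # 1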
GloVe vs word2vec
165
[?]
GloVe: Properties
"The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence."
Nice Properties
• Combines count-based and direct prediction methods
• Trains quickly
GloVe: Technicalities
• The final word representation combines the two learned embedding matrices: X_final = U + V
166
GloVe: Punishing Rare Events and Reducing the Effect of Frequent Events
• A weighting function in the objective prevents frequent co-occurrences from being overweighted.
9.2.3 Properties
Limits/Properties of Embeddings
• Type of similarity varies and is not trivial to control. Which is more similar? dog, cat,
tiger.
• Black-sheep effect: Trivial standard features are verbalized less often than salient ones:
black sheep vs. white sheep.
• Corpus bias: Stereotypes of text corpora are reflected in semantic space. Not always
desirable.
167
Word Embeddings and Ambiguity
• Next question: can we divide apart the different meanings in the vectors?
[Figure: the neighbors of an ambiguous word cluster into different sense groups (Sense 1, Sense 2, Sense 3)]
168
Variants
But how can I directly calculate the meaning of a word in context as a vector? Contextualized
embeddings . . .
• The choice of type and size of context influences the semantic space generated.
• A small window size makes embeddings “more syntactic” and more sensitive to local
co-occurrences.
• A syntactic context makes the words more functionally/semantically similar and homo-
geneous with respect to parts of speech.
However: "Australian scientist discovers star with telescope" (e.g., the prep_with context depends on the parse).

The embedding matrix as a look-up table
L = [ . . . ] is an n × |V| matrix with one column per vocabulary word (the, cat, mat, . . . )
• These are the word features we want to learn
• Also called a look-up table
• Conceptually you get a word's vector by left-multiplying a one-hot vector e by L: x = L e
Embeddings in PyTorch▲
170
Using/loading pretrained embeddings (PyTorch):

import torch
import torch.nn as nn

weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
embedding = nn.Embedding.from_pretrained(weight)
# Get embeddings for index 1 via index lookup
embedding(torch.LongTensor([1]))
>>> tensor([[ 4.0000, 5.1000, 6.3000]])
The equivalent in Keras (assuming num_tokens, embedding_dim, and a pretrained embedding_matrix have been prepared elsewhere):

from tensorflow import keras
from tensorflow.keras.layers import Embedding

embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=\
        keras.initializers.Constant(embedding_matrix),
    trainable=False,
)
9.2.5 fastText
171
“Bag of Tricks for Efficient Text Classification” [?]
EmbeddingBag▲ uses efficient mean aggregation by default. Offsets are for the 1D representation.
5
https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
173
Model AG Sogou DBP Yelp P. Yelp F. Yah. A. Amz. F. Amz. P.
BoW (Zhang et al., 2015) 88.8 92.9 96.6 92.2 58.0 68.9 54.6 90.4
ngrams (Zhang et al., 2015) 92.0 97.1 98.6 95.6 56.3 68.5 54.3 92.0
ngrams TFIDF (Zhang et al., 2015) 92.4 97.2 98.7 95.4 54.8 68.5 52.4 91.5
char-CNN (Zhang and LeCun, 2015) 87.2 95.1 98.3 94.7 62.0 71.2 59.5 94.5
char-CRNN (Xiao and Cho, 2016) 91.4 95.2 98.6 94.5 61.8 71.7 59.2 94.1
VDCNN (Conneau et al., 2016) 91.3 96.8 98.7 95.7 64.7 73.4 63.0 95.7
fastText, h = 10 91.5 93.9 98.1 93.8 60.4 72.0 55.8 91.2
fastText, h = 10, bigram 92.5 96.8 98.6 95.7 63.9 72.3 60.2 94.6
Table 1: Test accuracy [%] on sentiment datasets. FastText has been run with the same parameters for all the datasets. It has
10 hidden units and we evaluate it with and without bigrams. For char-CNN, we show the best reported numbers without data
augmentation.
. . . to Tang et al. (2015) following their evaluation protocol. We report their main baselines as well as their two approaches based on recurrent networks (Conv-GRNN and LSTM-GRNN).
174
Results. We present the results in Figure 1. We use 10 hidden units and run fastText for 5 epochs with a learning rate selected on a validation set from {0.05, 0.1, 0.25, 0.5}. On this task, adding bigram information improves the performance.

Model        Yelp'13  Yelp'14  Yelp'15  IMDB
SVM+TF       59.8     61.8     62.4     40.5
CNN          59.7     61.0     61.5     37.5
Conv-GRNN    63.7     65.5     66.0     42.5
LSTM-GRNN    65.1     67.1     67.6     45.3
fastText     64.2     66.2     66.6     45.2

Table 3: Comparison with Tang et al. (2015). The hyper-parameters are chosen on the validation set. We report the test accuracy.
Tutorial Using Pre-trained Embedding Initialization with EmbeddingBag for Classification
See Microsoft Learn Tutorial▲
Summary
• Symbolic representations of words lead to high-dimensional and sparse vectors that are
not suitable for expressing similarities.
• Word Embeddings in combination with neural approaches have revolutionized NLP re-
search and text analysis in recent years.
• word2vec representation learning was the first success story of transfer learning!
Further Study
• Mandatory reading: Chapter 6.2 onwards from [?] if you have never heard of word em-
beddings
6
https://carl-allen.github.io/nlp/2019/07/01/explaining-analogies-explained.html
175
Chapter 10
10.1 Options
10.1.1 A
Assignment 4: Option A: Paper Dissection
Identify an interesting and high-quality (short) NLP paper
• If paper is long and covers many Machine Learning approaches, focus on the best or
clearest setup
• If you don’t understand some concepts, search introductory resources (WP pages, quora,
book chapters, chatgpt, blogs, videos) that help.
• But do not spend too much time on researching things that are totally unclear. Try to formulate/pinpoint what you do not understand and what is unclear.
• Abstract
• Conclusion
• Look at examples/figures/tables
1
https://francescolelli.info/thesis/read-scientific-papers-quickly-and-effectively/
176
• Introduction
• Methods
• Results
• Discussion
2. Which ML methods are used? What is the main innovation of the paper?
Some rules
• What does one need to know for understanding the paper? List the resources that were
helpful for you.
10.1.2 B
Option B: Short Student Talk
• Or: create a short screencast (e.g. with Screencastify▲ ) for "future" students (no perfectionism asked for!); e.g. a walkthrough of a code example
Topics
177
10.2 Organization
Organization and Deadlines
For this exercise, you can team up in pairs or work alone. Yes, no teams of 3 students allowed.
Communicate your topics and suggestions via Feedback-Forum in OLAT
• For talks: Reply ASAP in forum thread “Student Talks” in OLAT and email me at the
same time.
178
Chapter 11
Learning Goals
• Know how to implement CNNs with high-level interfaces in pytorch, tensorflow and
keras
• Know about the classical MNIST task and its solution with CNNs in tensorflow
11.1 Motivation
11.1.1 Local Features
Sequence Classification Tasks
Classical NLP Sequence Labeling Task(s)
Evidence x → Class y:
Word Lemma POS Tag NER Tag
Anton Anton NE B-PER
Schürmanns Schürmann NE I-PER
Reise Reise NN O
über über APPR O
das d ART O
Sustenjoch Sustenjoch NE B-GEO
im im APPRART O
Jahre Jahr NN O
1881 @card@ CARD O
What would a simple neural approach look like?
Sliding window approach: Local features for local predictions! Chapter 8.2.1 in [?]
179
[Figure 1 from [?] ("Natural Language Processing (Almost) from Scratch"): the window approach network — lookup tables LT_{W^1} ... LT_{W^K}, concatenation, Linear layer M¹ (n¹_hu units), HardTanh, Linear layer M² (n²_hu = #tags).]
Simple FFNN Architecture
• Input: words and externally available features. Which ones might be useful?
• Vectorization: Embeddings of words and features as input
• Simplest architecture
• Role of first linear layer? Linear modeling
• Role of HardTanh? Adding non-linearity
• Role of last linear layer? Dimension reduction of the output to the number of tags

From [?]: . . . complex features (e.g., extracted from a parse tree) which can impact the computational cost which might be important for large-scale applications or applications requiring real-time response. Instead, we advocate a radically different approach: we will try to pre-process our features as little as possible and then use a multilayer neural network (NN) architecture, trained in an end-to-end fashion. The architecture takes the input sentence and learns several layers of feature extraction that process the inputs. The features computed by the deep layers of the network are automatically trained by backpropagation to be relevant to the task. We describe in this section a general multilayer architecture suitable for all our NLP tasks, which is generalizable to other NLP tasks as well. Our architecture is summarized in Figure 1 and Figure 2. The first layer extracts features for each word. The second layer extracts features from a window of words or from the whole sentence, treating it as a sequence with local and global structure (i.e., it is not treated like a bag of words). The following layers are standard NN layers.
We consider a neural network f_θ(·), with parameters θ. Any feed-forward neural network with L layers can be seen as a composition of functions f_θ^l(·), corresponding to each layer l.

Representation of Features: Sparse vs Dense
F1: Current word = dog, F2: Preceding word = the, F3: Preceding PoS = DET
(a) one-hot encoding of features and their combinations:
    w=dog, pw=the, pt=DET, pt=NOUN, w=dog&pw=the, w=dog&pt=DET, w=chair&pt=DET
(b) dense embeddings of the features (the network learns how to combine the evidence):
    x = (0.26, 0.25, -0.39, -0.07, 0.13, -0.17) (-0.43, -0.37, -0.12, 0.13, -0.11, 0.34) (-0.04, 0.50, 0.04, 0.44)
[?, 91]
Why can a simple CBOW representation fed into an FFNN work to some degree?
As an approximation, lexical clues are informative regardless of their position!
Why only as a rough approximation?
181
"Montias pumps a lot of energy into his nuanced narrative, and surrounds himself with a cast of quirky—but not stereotyped—street characters" vs. "Montias surrounds himself with a cast of quirky—but not stereotyped—street characters, thereby pumping a lot of energy into his nuanced narrative."
BTW: “A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors”
[?]
Interesting paper that tackles (among other tasks) the problem of deriving good embeddings
for unseen or rare n-grams. nonce2vec problem [?].
• N-grams for capturing local information exponentially grow the input size
• FFNNs are not prepared for global ordering invariance (shifted inputs with similar func-
tion)
11.2 CNN
182
CNN Architecture: Convolution-and-Pooling
Design goals of CNNs
11.2.1 Convolution
Intuition: Visual Filters Based on Matrix Convolution
Convolution kernels are like special “glasses” to see different aspects of rich information sources
Blurring
Edge Detection
183
Classical LeNet Architecture for Object Recognition
CNN Architectures
Idea: From pixels to edges to shapes to classification
Sandwich architecture with convolution and pooling layers

From [?]: Now you know all the building blocks to create a convolutional neural network. Let's see how to assemble them. Typical CNN architectures stack a few convolutional layers (each one generally followed by a ReLU layer), then a pooling layer, then another few convolutional layers (+ReLU), then another pooling layer, and so on. The image gets smaller and smaller as it progresses through the network, but it also typically gets deeper and deeper (i.e., with more feature maps) thanks to the convolutional layers (see Figure 13-9). At the top of the stack, a regular feedforward neural network is added, composed of a few fully connected layers (+ReLUs), and the final layer outputs the prediction (e.g., a softmax layer that outputs estimated class probabilities).
Figure 13-9. Typical CNN architecture

CNNs dominate computer vision since 2012. Live browser-based demo▲
Over the years, variants of this fundamental architecture have been developed, leading to amazing advances in the field. A good measure of this progress is the error rate in competitions such as the ILSVRC ImageNet challenge. In this competition the top-5 error rate for image classification fell from over 26% to barely over 3% in just five years. The top-five error rate is the number of test images for which the system's top 5 predictions did not include the correct answer. The images are large (256 pixels high) and there are 1,000 classes, some of which are really subtle (try distinguishing . . . ).
184
https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
Which problems are solved here?
Well-known object recognition model: https://www.v7labs.com/blog/yolo-object-detection
https://cs.stanford.edu/people/karpathy/neuraltalk2/demo.html
What problem should have been solved here?
185
CNNs for NLP
186
CNNs for ASR: Wav2vec 1.0▲ (2019)
• Directly vectorizes 30ms wave form data by CNNs using a self-supervision approach
187
• Input encoder: 512-dimensional output vector
• Context encoder: takes input encoder outputs covering 210 ms and outputs a 512-dimensional context representation
• Self-supervision goal: learn to recognize each of the next 12 input encoder vectors (each against 10 randomly selected distractor vectors) from the context representation alone
188
CNNs for Text-To-Speech: WaveNet▲ [?]
• Idea: A generative dilated CNN directly produces the sound wave signal (therefore it
can also learn to generate music).
189
• Basic convolution is just a weighted sum
190
CNNs: Convolutional Layer
Intuition:
• Learn local features which are important for classification
• Position-independent extraction of features
[?]
Note: The filter matrix uses a kernel coordinate system with 0 at the central position (equidistant from the borders).
What would a normal matrix notation for the filter look like?
Convolutional Filters Without Bias

Input (5×5):                  Filter (3×3):        Output (3×3):
 5  3 -1  2  4                 1  2  1               5  -2   3
-2 -1  0  1  0          *      0  0  0        =     -6  -6   2
 0  2  1  1  1                -1 -2 -1               1  -5 -12
 0 -1  4 -1 -2
 1  0  3  4  5

Example (top-left output cell): 5·1 + 3·2 + (−1)·1 + (−2)·0 + (−1)·0 + 0·0 + 0·(−1) + 2·(−2) + 1·(−1) = 5

y[k, l] = Σ_{j=−1}^{1} Σ_{i=−1}^{1} f[i, j] · x[k − i, l − j]
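A quick check of this example with PyTorch. Note that conv2d in deep learning libraries computes the cross-correlation, i.e., it slides the filter without flipping it, which matches the arithmetic above:

import torch
import torch.nn.functional as F

x = torch.tensor([[ 5.,  3., -1.,  2.,  4.],
                  [-2., -1.,  0.,  1.,  0.],
                  [ 0.,  2.,  1.,  1.,  1.],
                  [ 0., -1.,  4., -1., -2.],
                  [ 1.,  0.,  3.,  4.,  5.]])
f = torch.tensor([[ 1.,  2.,  1.],
                  [ 0.,  0.,  0.],
                  [-1., -2., -1.]])
# conv2d expects (batch, channels, H, W) for the input and (out_ch, in_ch, kH, kW) for the kernel
y = F.conv2d(x.view(1, 1, 5, 5), f.view(1, 1, 3, 3))
print(y.view(3, 3))
# tensor([[  5.,  -2.,   3.],
#         [ -6.,  -6.,   2.],
#         [  1.,  -5., -12.]])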
191
11.2.2 1D
One 1D Convolution Over Text: Sliding Window
• Sliding window of size k: how many windows are there in a sequence of length n? (n − k + 1 for a narrow convolution)
• Dimensionality of x_i? x_i ∈ R^{k·d_emb} (the concatenation of k word vectors)
• Weight filter u for dot-products: z_i = x_i · u

Many 1D Convolutions Over Text
Figure 13.1 shows narrow and wide convolutions in the two notations.
Example: 1D Convolution
[Figure 13.1: (a) narrow and (b) wide 1D convolution over the sentence "the actual service was not very good" (with *PAD* tokens).]
Which dimensions, and how many filters?
193
In NLP, CNNs have become popular for sentence modeling (Kalchbrenner et al., 2014; Kim, 2014; etc.)
If the input is a sentence, where can we get the "image" for the CNN?
• Represent each word p_i as a vector (e.g., with word embeddings)
• How many vectors do we have? For a sentence of length n: n vectors
• A sentence can then be represented as a matrix
[Figure: the sentence "I like this movie very much !" stacked as a matrix of word vectors]
Note: The sequence of words still runs in just one direction! In "real images" the neighborhood of pixels is meaningful in all directions, but not in texts!

From [?]: There are n − k + 1 positions in which to start the sequence, and we get n − k + 1 vectors p_{1:n−k+1}. This is called a narrow convolution. An alternative is to pad the sequence with k − 1 padding-words to each side, resulting in n + k − 1 vectors p_{1:n+k−1}. This is called a wide convolution [Kalchbrenner et al., 2014]. We use m to denote the number of resulting vectors.
Horizontal vs Vertical Stacking of Windows
Horizontal Stacking
• Input of n words with d dimensions (channels): R^{1×n·d}
• Convolution matrix U with windows of size k and ℓ filters: R^{k·d×ℓ}
• A window segment R^{1×k·d} is multiplied by R^{k·d×ℓ}, resulting in R^{1×ℓ}
• See NumPy's hstack▲
Vertical Stacking
• Input of n words with d dimensions (channels): R^{n×d}
• ℓ different convolution kernels/filters of size R^{k×d}
• These kernels slide over the matrix rows while performing matrix convolution.
• Matrix convolution of a window with a kernel is the sum of the elements of the Hadamard product of two same-shape matrices.
• See NumPy's vstack▲

From [?]: In our description of sentence convolutions, each word is associated with a d-dimensional vector, and the vectors are concatenated into a long sentence vector. The convolution network with a window of size k is then based on a k·d × ℓ matrix, which is applied to segments of the sentence vector that correspond to k-word windows; each such multiplication results in ℓ values, each of which can be thought of as the result of a dot product between a filter and a sentence segment. An alternative formulation that is often used in the literature is one in which the n word vectors are stacked on top of each other, resulting in an n × d sentence matrix. The convolution operation is then performed by sliding ℓ different k × d matrices (called "kernels" or "filters") over the sentence matrix and performing matrix convolution between each kernel and the corresponding sentence-matrix segment. The matrix convolution operation between two matrices is defined as performing element-wise multiplication of the two matrices and summing the results. Each kernel convolution operation produces a single value, for a total of ℓ values. One can convince oneself that the two approaches are indeed equivalent, by observing that each kernel corresponds to one column of the k·d × ℓ matrix, and the matrix convolution of a kernel with a sentence segment corresponds to the dot product of that column with the row-concatenated segment.
1D Convolution in the Vertical Stacking
[Figure 13.1(b): the padded sentence "*PAD* the actual service was not very good *PAD*"; each row of the figure holds one 2-word window (*PAD* the / the actual / actual service / service was / was not / not very / very good / good *PAD*).]
194
Padding
195
Advantages of Wide Convolution [?]
• All weights in a filter reach every element in the sequence, . . .
• But hierarchical convolution with smaller windows is preferable and covers the same span.
• Robustly produces well-formed convolved vectors, even if the input is smaller than the window.

Zero Padding
• Padding with which values? Typically a zero vector for CNNs.
Problem: Filters are not well defined for data near the borders
⇒ Zero padding

0 1 0 0          0 0 0 0 0 0
2 3 3 3    ⇒     0 0 1 0 0 0
4 0 1 0          0 2 3 3 3 0
                 0 4 0 1 0 0
                 0 0 0 0 0 0
[?]
Note: Zero padding size is another hyperparameter
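A quick check of this padding in PyTorch; F.pad takes the amount of padding as (left, right, top, bottom) for a 2D tensor:

import torch
import torch.nn.functional as F

x = torch.tensor([[0., 1., 0., 0.],
                  [2., 3., 3., 3.],
                  [4., 0., 1., 0.]])
padded = F.pad(x, (1, 1, 1, 1), mode="constant", value=0.0)  # one zero border all around
print(padded)   # 5 x 6 matrix as shown above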
• Idea: keep the relevant, that is, informative part with respect to the task
CNN in Action
[Figure [?]: an input matrix passes through a convolutional layer and then a max pooling layer; read from bottom to top.]
197
Exercise
Compute the output of the convolutional layer and of the max pooling layer for the following input and filter:

Input:             Filter:
0 1 0 -2 1         -1  1
2 3 3  2 3          2 -2
4 0 1  0 2
[?]

CNN in Action (continued)
[Figure [?]: the same network with a fully-connected layer and a non-linearity on top of the max pooling layer.]
Note: no non-linearity after convolution
198
Pooling Windows

2×2 max pooling:
 5   2  -1  -2          11   3
11   5  -2   3    ⇒
-4  -6  -6   2           1   2
 0   1  -5 -12

Typical in NLP: max pooling over time
⇒ Only store the maximum value per filter for the whole sentence:
 5   2  -1  -2           5
11   5  -2   3    ⇒     11
-4  -6  -6   2           2
 0   1  -5 -12           1
[?]

CNN for NLP
n = number of words, d = dimension/channels of the words, k = size of the kernel
Convolution and Max-Pooling on Text:
c[j] = max_{1 ≤ i ≤ m} p_i[j]   for each filter j
199
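A hedged sketch of 1D convolution plus max pooling over time in PyTorch; the batch of n = 7 word embeddings with d = 50 channels and ℓ = 100 filters of width k = 3 is invented for illustration:

import torch
import torch.nn as nn

batch, d, n = 4, 50, 7                        # 4 sentences, 50-dim embeddings, 7 words
x = torch.randn(batch, d, n)                  # Conv1d expects (batch, channels, length)

conv = nn.Conv1d(in_channels=d, out_channels=100, kernel_size=3)   # 100 filters, k = 3
p = conv(x)                                   # (4, 100, 5): m = n - k + 1 = 5 windows
c = p.max(dim=2).values                       # max over time: one value per filter
print(c.shape)                                # torch.Size([4, 100])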
. . . particular sort of predictors, and the max operation will pick the most important predictor of each type.
Figure 13.2 provides an illustration of the convolution and pooling process with a max-pooling operation.
[Figure 13.2: convolution (matrix W) over "the quick brown fox jumped over the lazy dog", followed by max pooling.]
[Figure 2 from [?]: the sentence approach network — padding, lookup tables LT_{W^1} ... LT_{W^K} over features w^1 ... w^K, convolution M¹, max over time, Linear M², HardTanh, Linear M³ (n³_hu = #tags).]

. . . each dimension, resulting in a final 3-dimensional pooled vector.

Average Pooling
The second most common pooling type is average-pooling — taking the average value of each index instead of the max:

c = (1/m) Σ_{i=1}^{m} p_i        (13.5)
Break-through paper of applying CNNs to basic NLP tasks (POS tagging, chunking, NER,
SRL)
[Figure: CNN sentence-classification example over "wait for the video and do n't rent it".]
Be aware that each convolution filter has a bias parameter by default!
Dynamic Pooling. Rather than performing a single pooling operation over the entire sequence, we may want to retain some positional information based on our domain understanding of the prediction problem at hand. To this end, we can split the vectors p_i into r distinct groups, apply the pooling separately on each group, and then concatenate the r resulting ℓ-dimensional vectors c_1, ..., c_r. The division of the p_i's into groups is performed based on domain knowledge. For example, we may conjecture that words appearing early in the sentence are more indicative than words appearing late. We can then split the sequence into r equally sized regions, applying a separate max-pooling to each region. For example, Johnson and Zhang [2015] found that when classifying documents into topics, it is useful to have 20 average-pooling regions, clearly separating the initial sentences (where the topic is usually introduced) from later ones, while for a sentiment classification task a single max-pooling operation over the entire sentence was optimal (suggesting that one or two very strong signals are enough to determine the sentiment, regardless of the position in the sentence).
202
Similarly, in a relation extraction kind of task we may be given two words and asked to determine the relation between them. We could argue that the words before the first word, the words after the second word, and the words between them provide three different kinds of information.
⁵In this chapter, we use k to denote the window-size of the convolution. The k in k-max pooling is a different, and unrelated, value. We use the letter k for consistency with the literature.
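A minimal sketch of dynamic (regional) pooling: split the sequence of convolved vectors into r equally sized regions and max-pool each separately (r and the tensor sizes are invented):

import torch

m, l, r = 12, 100, 3                     # m windows, l filters, r regions
p = torch.randn(m, l)                    # convolved vectors p_1 ... p_m

regions = p.chunk(r, dim=0)              # split into r groups along the sequence
c = torch.cat([reg.max(dim=0).values for reg in regions])   # r * l dimensional vector
print(c.shape)                           # torch.Size([300])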
Example: CNN for Relation Classification
[Figure [?]: the sentence is split into left, middle, and right contexts around the two candidate entities; each context is convolved and k-max pooled, the pooling results are flattened and concatenated with entity flag indicators, fed through a fully-connected MLP, and a softmax outputs P(r|c).]
203
[Figure [?]: bar heights show how often the 3-gram around each word was selected by k-max pooling (top 1 / top 3 / top 5).]
• The height of a bar is the frequency with which the 3-gram around the corresponding word was selected by k-max pooling: "newest" stands for "its newest subsidiary"
• Fixed block-size pooling is not always optimal for NLP
• Consider the sequence of words by traversing a parse tree, not by reading order
204
[Figure 3 of Kalchbrenner et al. (2014): a Dynamic Convolutional Neural Network (DCNN) over an input sentence — projected sentence matrix, wide convolution (m=3), dynamic k-max pooling (k = f(s) = 5), wide convolution (m=2), folding, k-max pooling (k=3), fully connected layer. The width of a feature map at an intermediate layer varies depending on the length of the input sentence.]
[Table 1 of the paper: accuracy of sentiment prediction on the movie reviews dataset, e.g., NBoW 42.4 / 80.5 vs. DCNN 48.5 / 86.8. The first four results are reported from Socher et al. (2013b). The baselines NB and BiNB are Naive Bayes classifiers with, respectively, unigram features and unigram and bigram features. SVM is a support vector machine with unigram and bigram features. RecNTN is a recursive neural network with a tensor-based feature function, which relies on external structural features given by a parse tree and performs best.]
205
11.2.4 Hierarchical
Hierarchical Convolution
13.3 HIERARCHICAL CONVOLUTIONS
The 1D convolution approach described so far can be thought of as an ngram detector. A convolution layer with a window of size k is learning to identify indicative k-grams in the input.
The approach can be extended into a hierarchy of convolutional layers, in which a sequence of convolution layers are applied one after the other. Let CONV^k_{U,b}(w_{1:n}) be the result of applying a convolution with window size k and parameters U, b to each k-size window in the sequence w_{1:n}:

  p_i = g(⊕(w_{i:i+k−1}) · U + b)
  m = n − k + 1 (narrow convolution)  or  m = n + k − 1 (wide convolution)        (13.6)

We can now have a succession of r convolutional layers that feed into each other as follows:

  p^1_{1:m1} = CONV^{k1}_{U^1,b^1}(w_{1:n})
  p^2_{1:m2} = CONV^{k2}_{U^2,b^2}(p^1_{1:m1})
  . . .
  p^r_{1:mr} = CONV^{kr}_{U^r,b^r}(p^{r−1}_{1:m_{r−1}})        (13.7)

The resulting vectors p^r_{1:mr} capture increasingly larger effective windows ("receptive fields") of the sentence. For r layers with a window of size k, each vector p^r_i will be sensitive to a window of r(k−1)+1 words.⁶ Moreover, the vector p^r_i can be sensitive to gappy-ngrams of k+r−1 words, potentially capturing patterns such as "not . . . good" or "obvious . . . predictable . . . plot", where ". . ." stands for a short sequence of words, as well as more specialized patterns where the gaps can be further specialized (i.e., "a sequence of words that do not contain not" or "a sequence of words that are adverb-like").⁷ Figure 13.3 shows a two-layer hierarchical convolution with k = 2.

Simple Hierarchical CNN With Stride 1
[Figure 13.3: two-layer hierarchical convolution with k = 2 over the word vectors of "the actual service was not very good".]

⁶To see why, consider that the first convolution layer transforms each sequence of k neighboring word-vectors into vectors representing k-grams. Then, the second convolution layer will combine each k consecutive k-gram-vectors into vectors that capture a window of k + (k−1) words, and so on, until the r-th convolution will capture k + (r−1)(k−1) = r(k−1) + 1 words.
⁷To see why, consider a sequence of two convolution layers each with a window of size 2 over the sequence "funny and appealing". The first convolution layer will encode "funny and" and "and appealing" as vectors, and may choose to retain the equivalent of "funny . . ." and ". . . appealing" in the resulting vectors. The second convolution layer can then combine these into "funny and appealing", "funny . . ." or ". . . appealing".
Strides, Dilation and Pooling. So far, the convolution operation is applied to each k-word window in the sequence, i.e., windows starting at indices 1, 2, 3, . . . . This is said to have a stride of size 1. Larger strides are also possible: with a stride of size 2, the convolution operation will be applied to windows starting at indices 1, 3, 5, . . . . More generally, we define CONV_{k,s} as:

  p_{1:m} = CONV^{k,s}_{U,b}(w_{1:n})
  p_i = g(⊕(w_{1+(i−1)s : (i−1)s+k}) · U + b)        (13.8)

where s is the stride size. The result will be a shorter output sequence from the convolutional layer.
What is the stride parameter?

Effect of Stride
In a dilated convolution architecture [Strubell et al., 2017, Yu and Koltun, 2016] the hierarchy of convolution layers each has a stride size of k−1 (i.e., CONV_{k,k−1}). This allows an exponential growth in the effective window size as a function of the number of layers. Figure 13.4 shows convolution layers with different stride lengths. Figure 13.5 shows a dilated convolution architecture.
An alternative to the dilation approach is to keep the stride-size fixed at 1, but shorten the sequence length between each layer by applying local pooling, i.e., consecutive k-grams of vectors can be converted into a single vector using max pooling or average pooling. Even if we pool just every two neighboring vectors, each convolutional-and-pooling layer in the hierarchy will halve the length of the sequence. Similar to the dilation approach, we again gain an exponential decrease in sequence length as a function of the number of layers.

[Figure 13.4: Strides. (a–c) Convolution layer with k = 3 and stride sizes 1, 2, 3.]

Dilation: Holes in Convolution
Parameter Tying and Skip-connections. Another variation that can be applied to the hierarchical convolution architecture is performing parameter-tying, using the same set of parameters U, b in all the convolution layers. This results in more parameter sharing, as well as allowing to use an unbounded number of convolution layers (as all the convolution layers share the same parameters, the number of convolution layers need not be set in advance), which in turn allows to reduce arbitrary length sequences into a single vector by using a sequence of narrow convolutions, each resulting in a shorter sequence of vectors.
When using deep architectures, skip-connections are sometimes useful: these work by feeding into the i-th layer not only the vectors resulting from the (i−1)-th layer, but also vectors from previous layers which are combined with the vectors of the (i−1)-th layer using either concatenation, averaging, or summation.
207

[Figure 4-12 from [?]: A convolution with kernel_size=2 applied to an input matrix with the hyperparameter dilation=2. The increase in dilation from its default value means the elements of the kernel matrix are spread further apart as they multiply the input matrix. Increasing dilation further would accentuate this spread.]
Dilated Convolution Networks
[Figure 13.5: three-layer dilated hierarchical convolution with k = 3.] (Not really illustrating the point; a better illustration comes from WaveNet▲.)

Further Reading. The use of hierarchical and dilated convolution and pooling architectures is very common in the computer-vision community, where various deep architectures — comprising of arrangements of many convolutions and pooling layers with different strides — have been proposed, resulting in very strong image classification and object recognition results [He et al., 2016, Krizhevsky et al., 2012, Simonyan and Zisserman, 2015]. The use of such deep architectures for NLP is still more preliminary. Zhang et al. [2015] provide initial experiments with text classification with hierarchical convolutions over characters, and Conneau et al. [2016] provide further results, this time with very deep convolutional networks. The work of Strubell et al. [2017] provides a good overview of hierarchical and dilated architectures for a sequence labeling task. Kalchbrenner et al. [2016] use dilated convolutions as encoders in an encoder-decoder architecture (Section 17.2) for machine translation. The hierarchy of convolutions with local pooling approach is used by Xiao and Cho [2016], who apply it to a sequence of characters in a document-classification task, and then feed the resulting vectors into a recurrent neural network. We return to this example in Section 16.2.2, after discussing recurrent-neural-networks.

Implementing CNNs in PyTorch
From [?]: In this section, we work through an end-to-end example that will utilize the concepts introduced in the previous section. Generally, the goal of neural network design is to find a configuration of hyperparameters that will accomplish a task. We again consider the now-familiar surname classification task introduced in "Example: Surname Classification with an MLP", but we will use CNNs instead of an MLP. We still need to apply a final Linear layer that will learn to create a prediction vector from a feature vector created by a series of convolution layers. This implies that the goal is to determine a configuration of convolution layers that results in the desired feature vector. All CNN applications are like this: there is an initial set of convolutional layers that extract a feature map that becomes input in some upstream processing. In classification, the upstream processing is almost always the application of a Linear (or fc) layer.
The implementation walk-through in this section iterates over the design decisions to construct a feature vector. We begin by constructing an artificial data tensor mirroring the actual data in shape. The size of the data tensor is going to be three-dimensional — this is the size of the minibatch of vectorized text data. If you use a one-hot vector for each character in a sequence of characters, a sequence of one-hot vectors is a matrix, and a minibatch of one-hot matrices is a three-dimensional tensor. Using the terminology of convolutions, the size of each one-hot vector (usually the size of the vocabulary) is the number of "input channels" and the length of the character sequence is the "width."
As illustrated in Example 4-14, the first step to constructing a feature vector is applying an instance of PyTorch's Conv1d class to the three-dimensional data tensor. By checking the size of the output, you can get a sense of how much the tensor has been reduced. We refer you to Figure 4-9 for a visual explanation of why the output tensor is shrinking.

Example 4-14. Artificial data and using a Conv1d class
208
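A minimal sketch along the lines of Example 4-14; the mini-batch size, "vocabulary" size, and sequence width are invented:

import torch
import torch.nn as nn

batch_size, vocab_size, seq_width = 2, 10, 7
data = torch.randn(batch_size, vocab_size, seq_width)   # (batch, input channels, width)

conv1 = nn.Conv1d(in_channels=vocab_size, out_channels=16, kernel_size=3)
out = conv1(data)
print(out.shape)   # torch.Size([2, 16, 5]) -- the width shrinks from 7 to 7 - 3 + 1 = 5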
• Exponential growth of the receptive field as a function of the depth of the CNN.
⁷A fully connected layer with 150 × 100 neurons, each connected to all 150 × 100 × 3 inputs, would have 150² × 100² × 3 = 675 million parameters!
209
Channels
From [?]: Moreover, input images are also composed of multiple sublayers: one per color channel. There are typically three: red, green, and blue (RGB). Grayscale images have just one channel, but some images may have much more — for example, satellite images that capture extra light frequencies (such as infrared).
Figure 13-6. Convolution layers with multiple feature maps, and images with three channels. Source: [?]
Specifically, a neuron located in row i, column j of the feature map k in a given convolutional layer l is connected to the outputs of the neurons in the previous layer l – 1, located in rows i × s_h to i × s_h + f_h – 1 and columns j × s_w to j × s_w + f_w – 1, across all feature maps (in layer l – 1). Note that all neurons located in the same row i and column j but in different feature maps are connected to the outputs of the exact same neurons in the previous layer.
Equation 13-1 summarizes the preceding explanations in one big mathematical equation: it shows how to compute the output of a given neuron in a convolutional layer. [?]
The convolved channels are added. Another option to combine channels: 1×1 convolution.
Channels vs Filters
210
Figure 4-7. A convolution operation is shown with two input matrices (two input channels). The corresponding kernel also has two layers; it multiplies each layer separately and then sums the results. Configuration: input_channels=2, output_channels=1, kernel_size=2, stride=1, padding=0, and dilation=1.
Figure 4-8. A convolution operation with one input matrix (one input channel) and two convolutional kernels (two output channels). The kernels apply individually to the input matrix and are stacked in the output tensor. Configuration: input_channels=1, output_channels=2, kernel_size=2, stride=1, padding=0, and dilation=1.
[?]
It's difficult to immediately know how many output channels are appropriate for the problem at hand. To simplify this difficulty, let's say that the bounds are 1 and 1,024 — we can have a convolutional layer with a single channel, up to a maximum of 1,024 channels. Now that we have bounds, the next thing to consider is how many input channels there are. A common design pattern is not to shrink the number of channels by more than a factor of two from one convolutional layer to the next. This is not a hard-and-fast rule, but it should give you some sense of what an appropriate number of out_channels would look like.

KERNEL SIZE
The width of the kernel matrix is called the kernel size (kernel_size in PyTorch).

Full Formula for 2D Convolution with Bias
Equation 13-1. Computing the output of a neuron in a convolutional layer

z_{i,j,k} = b_k + Σ_{u=1}^{f_h} Σ_{v=1}^{f_w} Σ_{k'=1}^{f_{n'}} x_{i',j',k'} · w_{u,v,k',k}    with  i' = i · s_h + u − 1,  j' = j · s_w + v − 1

It is a bit ugly due to all the different indices, but all it does is calculate the weighted sum of all the inputs, plus the bias term.
• z_{i,j,k} is the output of the neuron located in row i, column j in feature map k of the convolutional layer (layer l).
• As explained earlier, s_h and s_w are the vertical and horizontal strides, f_h and f_w are the height and width of the receptive field, and f_{n'} is the number of feature maps in the previous layer (layer l – 1).
• x_{i',j',k'} is the output of the neuron located in layer l – 1, row i', column j', feature map k' (or channel k' if the previous layer is the input layer).
• b_k is the bias term for feature map k (in layer l). You can think of it as a knob that tweaks the overall brightness of the feature map k.
• w_{u,v,k',k} is the connection weight between any neuron in feature map k of the layer l and its input located at row u, column v (relative to the neuron's receptive field), and feature map k'.
[?]
TensorFlow Implementation
From [?]: In TensorFlow, each input image is typically represented as a 3D tensor of shape [height, width, channels]. A mini-batch is represented as a 4D tensor of shape [mini-batch size, height, width, channels]. The weights of a convolutional layer are represented as a 4D tensor of shape [f_h, f_w, f_{n'}, f_n]. The bias terms of a convolutional layer are simply represented as a 1D tensor of shape [f_n].
Let's look at a simple example. The following code loads two sample images, using Scikit-Learn's load_sample_images() (which loads two color images, one of a Chinese temple, and the other of a flower). Then it creates two 7 × 7 filters (one with a vertical white line in the middle, and the other with a horizontal white line), and applies them to both images using a convolutional layer built using TensorFlow's . . .
211

11.3 Training
Training of CNN Layers
Training can be done with gradient descent and backpropagation on the computation graph – just as with FFNNs.
Additionally needed: gradients for the convolution and pooling layers. [?, 54]

Recap: Backpropagation — Forward and Backward Pass

∂C/∂w_{ij}^l = (∂C/∂z_i^l) · (∂z_i^l/∂w_{ij}^l)
212
https://grzegorzgwardys.wordpress.com/2016/04/22/8/
https://grzegorzgwardys.wordpress.com/2016/04/22/8/
Gradients of CNN
213
CNNs
δ1l+1 δ2l+1
4 0 1 0
[?]
Heike Adel CNNs 05.04.2019 29 / 67
11.3.1 Deep
Issues with Deep CNNs
Deep CNN
• Skip connections can bypass certain layers: Bypassed information from lower levels is
integrated again on high levels.
• More parameter saving: Share parameters between layers (similar idea as in recurrent
neural networks)
• Apply dropout: deactivate a random selection of neurons in a training step for regular-
ization
11.3.2 Normalization
Variants of Normalization Illustrated on Image Data
214
Group Normalization [?] (Yuxin Wu and Kaiming He, Facebook AI Research)
[Figure 2 of the paper: Normalization methods. Each subplot shows a feature map tensor, with N as the batch axis, C as the channel axis, and (H, W) as the spatial axes. The pixels in blue are normalized by the same mean and variance, computed by aggregating the values of these pixels.]

A family of feature normalization methods, including BN, LN, IN, and GN, perform the following computation:

x̂_i = (1/σ_i) (x_i − μ_i)        (1)

Here x is the feature computed by a layer, and i is an index. In the case of 2D images, i = (i_N, i_C, i_H, i_W) is a 4D vector indexing the features in (N, C, H, W) order, where N is the batch axis, C is the channel axis, and H and W are the spatial height and width axes. μ and σ in (1) are the mean and standard deviation (std) computed by:

μ_i = (1/m) Σ_{k ∈ S_i} x_k,        σ_i = sqrt( (1/m) Σ_{k ∈ S_i} (x_k − μ_i)² + ε )        (2)

with ε as a small constant. S_i is the set of pixels over which the mean and std are computed, and m is the size of this set. The methods mainly differ in how the set S_i is defined:
• Batch Norm: S_i = {k | k_C = i_C}, i.e., pixels sharing the same channel index are normalized together; BN computes μ and σ along the (N, H, W) axes per channel.
• Layer Norm: S_i = {k | k_N = i_N}, i.e., LN computes μ and σ along the (C, H, W) axes for each sample.
• Instance Norm: S_i = {k | k_N = i_N, k_C = i_C}, i.e., IN computes μ and σ along the (H, W) axes for each sample and each channel.
• Group Norm divides the channels into groups (e.g., 32 channels per group) and computes μ and σ within each group for each sample; its computation is independent of batch sizes.

Batch Size and Batch Normalization
Why are small batch sizes bad for Batch Normalization? BN normalizes the features by the mean and variance computed within a (mini-)batch. A small batch leads to inaccurate estimation of the batch statistics, and BN's error increases rapidly when the batch size becomes smaller.
[Figure 1 of the paper: ImageNet classification error vs. batch size for a ResNet-50 model — Batch Norm degrades sharply for small batches (e.g., 2 images per worker), while Group Norm stays stable across batch sizes.]
See the d2l.ai chapter on Batch Normalization▲ for details
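A small PyTorch sketch contrasting the two (tensor sizes invented): BatchNorm2d normalizes over (N, H, W) per channel, GroupNorm over channel groups within each sample, so GroupNorm gives the same result regardless of the batch.

import torch
import torch.nn as nn

x = torch.randn(8, 32, 16, 16)                      # (N, C, H, W)

bn = nn.BatchNorm2d(num_features=32)                # statistics over (N, H, W) per channel
gn = nn.GroupNorm(num_groups=8, num_channels=32)    # statistics per sample over channel groups

print(bn(x).shape, gn(x).shape)                     # both keep the shape (8, 32, 16, 16)
print(gn(x[:1]).allclose(gn(x)[:1]))                # True: GroupNorm is independent of the batch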
11.4 MNIST
MNIST in Tensorflow
Solving the Hello World problem (MNIST) with CNNs!
Solving the handwritten character recognition problem with dense and convolutional NNs
• Martin Görner's famous presentation¹
• Presentation slides
• Links to videos
• Source code as a workbook tutorial
• (Re-)Introduces many concepts of ML in a tensorflow setting (including dropout)
¹https://github.com/GoogleCloudPlatform/tensorflow-without-a-phd
216
Modern Approach: Pooling/Down-Sampling via Stride
217
Source▲
11.5 Conclusions
Summary
• Convolution allows detecting local information using a small number of shared parameters contained in the convolutional kernels
• CNNs easily allow for efficient parallelized execution on GPUs and/or multiple GPUs
218
Chapter 12
Learning Goals
• Know why Recurrent Neural Networks can deal with sequences with sequence-internal dependencies: very well with input dependencies and to a good degree with output dependencies
• Know about the theoretical abilities of RNNs and the practical problems in training (van-
ishing and exploding gradients)
• Know most important facts about unfolding recurrent relations and backpropagation
through time
12.1 Motivation
Recap: Convolutional Networks (CNNs)
219
Limitations of Non-Recurrent Neural Networks
• Dealing with variable-length input or output needs special treatment (shortening, padding, masking)!