
Fall 2023: Machine Learning for NLP 1

Simon Clematide
[email protected]

Note: This script includes all slides presented by Simon Clematide. No content from the tutorial or
iPython notebooks was included. The script was automatically generated from the lecture slides and is
therefore not optimized for continuous text in terms of layout and wording.

Version from December 18, 2023


PDF script: https://files.ifi.uzh.ch/cl/siclemat/lehre/hs22/ml4nlp1/script/script.pdf
OLAT: https://lms.uzh.ch/auth/RepositoryEntry/17073865781

University of Zurich
Institut für Computerlinguistik
Andreasstrasse 15
8050 Zürich

Contents

1 Lecture Information 3
1.1 Infos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Material . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 “Leistungsnachweis”/Academic Achievements . . . . . . . . . . . . . . . 4
1.2 Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Perspectives on AI and NLP 6


2.1 Intro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 CL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 ML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.3 AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.4 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.5 Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.6 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 ML Cultures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.1 xor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.2 Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2.3 MT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.4 Morphology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3 Machine Learning and Linear Classification 31


3.1 Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.1 ML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.2 ML for NLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.2.1 Splits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.2.2 Linear Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2.3 Preprocessing/Encoding . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.3 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.3.1 Linear . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.3.2 Binary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.4 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4.1 Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4.2 Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.4.3 Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

4 Learning with sklearn 63
4.1 sklearn . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.1.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.1.2 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.1.3 Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.2 skorch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.3 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5 Generalized Linear Classification 72


5.1 Binary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 Multiclass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.1 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2.2 Perceptron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2.3 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.2.4 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3 Losses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.4 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

6 Automatic Differentiation 86
6.1 Gradients . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.1.1 Numeric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.1.2 Symbolic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.1.3 Reverse-Mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2 Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2.1 autograd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2.2 pytorch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2.3 tensorflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

7 Learning with PyTorch 104


7.1 Tensors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
7.2 ML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
7.2.1 Data Loader . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.3 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

8 Feed-forward Neural Networks 109


8.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.1.1 XOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
8.1.2 Depth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
8.1.3 Nonlinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
8.1.4 Depth and Nonlinearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
8.2 FFNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8.2.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
8.3.1 Activations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
8.3.2 Forward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
8.3.3 Backward . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
8.3.4 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
8.4 Tooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
8.4.1 PyTorch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134

8.4.2 Keras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
8.4.3 DyNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
8.5 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

9 Static Word Embeddings 139


9.1 Distributionalism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
9.1.1 Variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
9.2 Embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
9.2.1 word2vec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
9.2.2 GloVe . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
9.2.3 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
9.2.4 Tooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
9.2.5 fastText . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166

10 Assignment 4: Two Exclusive Options: A XOR B 171


10.1 Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
10.1.1 A . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
10.1.2 B . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
10.2 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

11 Convolutional Neural Networks (CNN) 174


11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
11.1.1 Local Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
11.2 CNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
11.2.1 Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
11.2.2 1D . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
11.2.3 Pooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
11.2.4 Hierarchical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
11.2.5 Stride . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
11.2.6 Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
11.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
11.3.1 Deep . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
11.3.2 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
11.4 MNIST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
11.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
11.6 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213

12 Recurrent Neural Networks 214


12.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
12.1.1 LM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
12.1.2 Recurrence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 217
12.2 RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
12.2.1 Intro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
12.2.2 States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
12.2.3 Deep RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
12.2.4 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
12.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
12.3.1 Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
12.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
12.4.1 1:n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235

12.4.2 n:1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
12.4.3 n:n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
12.4.4 n:m . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
12.4.5 Tooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
12.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
12.6 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248

13 Gated RNNs (LSTM, GRU) and Applications 249


13.1 RNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
13.1.1 Gates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249
13.2 LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
13.2.1 Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
13.2.2 Peepholes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
13.2.3 GRU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
13.2.4 Dropout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
13.3 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
13.3.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
13.3.2 Structured . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
13.3.3 LM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
13.3.4 seq2seq . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
13.4 Tooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
13.5 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268

14 Seq2Seq with Attention 269


14.1 seq2seq with Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
14.2 CTC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275

15 Transformer Architecture 280


15.1 Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
15.1.1 Subwords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
15.1.2 Self-Attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
15.1.3 Block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
15.1.4 Position . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
15.1.5 Vis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
15.2 BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
15.2.1 Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 298
15.2.2 Fine-Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
15.2.3 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
15.2.4 Tooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 305

16 Contextualized Word and Sentence Embeddings 307


16.0.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 307
16.1 Flair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 308
16.2 ELMo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
16.2.1 NER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
16.3 SBERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 315
16.3.1 Bi-Encoder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319
16.3.2 Triplet Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
16.3.3 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321

17 Clustering and Topic Modeling 323
17.1 Intro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
17.2 Hard Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
17.2.1 Flat . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
17.2.2 Hierarchical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
17.3 Soft Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
17.3.1 TM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
17.3.2 LDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
17.3.3 NMF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
17.3.4 Top2Vec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
17.3.5 BERTopic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
17.3.6 ProdLDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
17.3.7 CTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
17.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
17.4.1 pyLDAvis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
17.4.2 Saliency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358
17.4.3 Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
17.4.4 Exclusivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 362
17.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363

18 GPT: 4 Lessons from Generative Pre-Training & AI Marketing 365


18.1 Intro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
18.1.1 Generative LM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
18.1.2 Subwords . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
18.1.3 Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
18.1.4 Big . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
18.2 GPT 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
18.2.1 Pretraining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
18.2.2 Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
18.2.3 Finetuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
18.2.4 Ablations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 378
18.3 GPT 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
18.3.1 Ethical . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
18.4 GPT 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
18.5 GPT 3.? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389
18.6 GPT 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
18.6.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
18.7 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405

19 Prompt Engineering 406


19.1 Prompts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
19.2 Engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 407
19.3 Tooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 410
19.4 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 416

20 Parameter-Efficent Fine-Tuning: Adapters & Co. 417


20.1 Intro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 417
20.2 Prompts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 420
20.3 Adapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
20.3.1 Tooling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425

20.4 Lora . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
20.5 Further Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429

21 Multi-task Learning and Related Ideas 430


21.1 Intro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 430
21.2 Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
21.3 Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
21.4 Multitasking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
21.5 Multilingual . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
21.5.1 Whisper . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441

22 Wrap Up and Outlook 446


22.1 Wrap Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
22.2 Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
22.3 Finale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448

List of Figures

2.1 Model of syntactic transfer translation . . . . . . . . . . . . . . . . . . . . . . . . . 25

17.1 Zutaten für probabilistisches Topic Modeling . . . . . . . . . . . . . . . . . . . . . 336

Chapter 1

Lecture Information

1.1 Infos
1.1.1 Material
General information and Requirements

• Lecture workload: 6 ECTS points, i.e. 150 hours of work in total expected

• Requirement I: Basic knowledge in statistics and probability theory

• Requirement II: Programming skills in Python

• Master students with a focus on Natural Language Processing

• We are quite a diverse crowd from Computational Linguistics and Informatics.

Booking deadlines of CL modules (Faculty of Arts and Social Sciences)


• Book/Cancel until Tuesday 10.10.2023 23:59:59 (different from the Faculty of Economics)

Olat Course and Teaching Materials


Campus course https://lms.uzh.ch/auth/RepositoryEntry/17430413424/
“23HS 521-505a Machine Learning for Natural Language Processing 1”

• Literature: mandatory reading electronically available in OLAT “Material/literature” or shared via links. If not freely available, the material may only be used in this lecture and may not be passed on.

• Jupyter notebooks for tutorial and lecture will be published via Google Colab

• Slides: PDF slides published after lecture.

• Text version of slides with concept back-of-the-book index available before the exam

Please give me feedback on errors/inconsistencies/problems on my slides (e.g. commented PDF).

1.1.2 “Leistungsnachweis”/Academic Achievements
1 Written Exam (75%) and 6 Assignments (25%)
Written onsite exam: 75% of the final grade

• Monday, 8.1.2024 14-15:10 (70 minutes)

• Theory test with focus on essential concepts; 1 A4 cheatsheet allowed

• In English

• Content: script, mandatory readings, tutorial material, exercises

6 Assignments (AS): 25% of the final grade

• 5 practical exercises (starting from week 3) in teams of up to 3 partners

• Partly with sample solutions, but also with discussion in the tutorial

• Grading via peer assessment on OLAT (supervised by TAs)

• EITHER a 10-minute short live presentation/screencast OR 1 research paper dissection; can be submitted in January 2024

• Each AS gives 0, 0.25, 0.5, 0.75, or 1 point

• Grading: Number of points = grade

1.2 Content
Learning Objectives and Concept of the Lecture
Learning objectives

• know about relevant machine learning techniques in NLP

• understand concepts for (semi/un-)supervised learning and linguistic structure prediction

• gain practical experience in applying ML to NLP problems

Concept for lecture


Classical lecture with additional and mandatory reading before/after the lecture. Ask questions and let us discuss live or in the forum!

Concept for tutorial


Monday 16:15–17:30 by Michail Andrianos and Patrick Haller: explanation and deepening of the lecture’s content, introduction and discussion of exercises, technical assistance for hands-on while you solve the exercises.

1.3 Software
Software/Hardware for Exercises

• Exercises must be programmed in Python; we will use PyTorch, Huggingface, TensorFlow/Keras

• Anaconda environments▲ (Win, Mac, Linux)

• Work with your own laptop/computer; small models can be calculated on CPU

• Google Colab▲ with CPU/GPU: “Colaboratory is a free Jupyter notebook environment that requires
no setup and runs entirely in the cloud.” We try to adapt the dataset sizes for the exercises so
that they can be run on Colab without hacks.

• AWS Studio Lab▲ , Azure▲ , Google Cloud▲ (as a new user you get a $300 credit), Saturn
Cloud▲

• If you know any other good option, let us know:-)

ChatGPT, CoPilot, Anaconda Assistant, etc.

• Useful helpers for explaining code snippets, improving documentation, adding type hints, cleaning . . .

• Don’t get addicted! Always think yourself!

• Don’t annoy your peers by submitting raw AI generated content in the exercises! We
might deduct points for this behavior.

• For each exercise, there must be a short declaration of the involved use of generative AI
(required by the Faculty of Arts).

Chapter 2

Perspectives on AI and NLP

Learning Objectives

• Reflect the development of NLP and AI methods

• Understand the role of tasks in NLP

• Reflect important concepts of AI, optimization and learning

2.1 Intro
2.1.1 CL
Computational Linguistics and Language Technology
Theory-oriented: competence/knowledge
Computational Linguistics (CL; de Computerlinguistik) “is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective, as well as the study of appropriate computational approaches to linguistic questions.” (WP)

Application-oriented: performance
Natural Language Processing (NLP, de Sprachtechnologie) deals with the application-oriented development of language software.

Scientific Games: NLP modeling as shared-task competitions

The model with the best performance wins. See nlpprogress.com▲ or Hugging Face datasets tasks▲

Peer-2-Peer Concept ping-pong: What are the essential ideas of AI and NLP for you? (3
Minutes)

• What is the most important keyword/concept X in AI and NLP?

• Person A whose first name comes first in the alphabet starts.

• Person A: Explain to Person B why X is so important in 1 minute.

• Person B: Explain to Person A why your X is so important.

ML as an Interdisciplinary Field

[Figure (Heike Adel, Introduction, 22.02.2019): machine learning as an interdisciplinary field at the intersection of computer science, artificial intelligence, information theory, neuroscience, statistics, optimization, and physics]
[?]

What’s your background?

A take on the connection of AI, ML, NLP and Deep Learning. . .

[?]
Do you agree?

CL and the Revolutions in Machine Learning/AI

From rule-based to transfer learning. . .

• from knowledge-based, symbolic, rule-based modeling

• to data-based, numerical modeling

• to transfer-based generative AI.

Bengio: DL: Theoretical Motivations

[Figure (Heike Adel, Introduction, 22.02.2019, after Bengio): comparison of rule-based systems (hand-designed program), classic ML (hand-designed features), representation learning (learned features), and deep learning (learned simple and increasingly complex features), each mapping from input to output]
[?] ML is the new green:-)

Generative AI: Prompting strong transferable language models

Computer Science

Jacob Eisenstein: An Introduction To Natural Language Processing

Computer science
Natural language processing draws on several aspects of “core” computer science:

• Natural language can be modeled using formal language theory, building on similar theoretical tools that are used to analyze programming languages.

• Natural language data requires efficient algorithms, which can be analyzed in terms of time and space complexity.

• These algorithms must be implemented on diverse architectures, including distributed systems, GPUs, and mobile devices.

This course will draw on basic tools from complexity theory (Michael Sipser (2012). Introduction to the Theory of Computation. Cengage Learning.) and will highlight connections to other areas of computer science.
[?]

Linguistics
Where is Linguistics?
The goal of linguistics is to understand how language works — possibly using computational techniques. For example:

• What are the major language families and how are they related to each other?

• What are the principles that determine whether a sentence is grammatical? Can we identify shared principles that explain grammaticality across many different kinds of languages?

• How and why do languages change?

• How do people learn their first language? What, if anything, is different when they learn their second language?

Natural language processing leverages insights from linguistics to build language technology.
[?]

The Big Question

Learning and knowledge

Given the dominance of machine learning, what role is left for linguistic theory? Some possibilities:

• The NLP stack: A series of systems transforms text from raw strings into progressively higher level linguistic representations.

• Preprocessing: The base representation for machine learning is a set of linguistically meaningful features.

• Model design: The architecture of the learning algorithm is designed to reflect linguistic principles.

• Nothing: Language is just another kind of data, and language processing is just another learning problem.

[?]

2.1.2 ML
Typical NLP Prediction Problems
Prediction Tasks in Ascending Order of Difficulty

• Classification (binary or n-ary) of single events: e.g. text categorization

• Classification (binary or n-ary) of sequences of events, sequence labeling: e.g. part-of-speech tagging problems, named entity tagging problems

• Prediction of general structures (sequences, relations, trees, graphs): e.g. parsing, translation, summarization

What type of prediction problem is ChatGPT?


Predict a sequence of tokens!

2.1.3 AI
AI: How to learn intelligent behavior?

“Intelligence is the computational part of the ability to achieve goals in the world.”
(John McCarthy, 2004) from [?]

Popular NLP operationalizations

• Can AI win? IBM Watson 2011 in Jeopardy!

• Can AI keep up? Microsoft’s translation from Chinese to English (Human Parity: People
find them equally good)

• Can AI fool human intelligence? Turing’s conversation test▲

AI for difficult games: Neural machine learning and searching

• Go: Strategy game with huge search space

• complex control problem

• Google’s AlphaGo 2016: Superhuman performance

• Supervised Learning + Reinforcement Learning

• controlled by neural networks

Syntactic Games
Complex dependencies and modification relations in real sentences
Which word group depends on which one?

[The board] approved [its acquisition] [by Royal Trustco Ltd.] [of Toronto] [for $27 a share] [at
its monthly meeting].
Crucial question: Can we learn syntax from data? Actually, a very old controversial issue . . .

2.1.4 Data
Chomsky: Data vs. Linguistic Theory
Chomsky’s furious argument against data-driven models

1. Colorless green ideas sleep furiously.

2. Furiously sleep ideas green colorless.

It is fair to assume that neither sentence (1) nor (2) [. . . ] has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally "remote" from English. Yet (1), though nonsensical, is grammatical, while (2) is not grammatical. — Chomsky, 1957

A (too) simple causal data-driven Markov word-bigram model

p(w_1 . . . w_n) = p(w_1) · ∏_{i=2}^{n} p(w_i | w_{i−1})          p(w_2 | w_1) = count(w_1, w_2) / count(w_1)

p(furiously | sleep) = p(sleep | furiously) = 0 if neither the bigram “sleep furiously” nor “furiously sleep” occurs in the data.
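
The maximum-likelihood estimation behind this bigram model fits in a few lines of Python. A minimal sketch only; the toy corpus and helper names are illustrative and not taken from the lecture materials:

from collections import Counter

# Tiny toy corpus; the empirical rebuttal below uses a large newspaper corpus instead.
tokens = "colorless green ideas sleep furiously".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def p(w2, w1):
    # Maximum-likelihood estimate p(w2 | w1) = count(w1, w2) / count(w1)
    return bigram_counts[(w1, w2)] / unigram_counts[w1] if unigram_counts[w1] else 0.0

print(p("furiously", "sleep"))   # 1.0: the bigram "sleep furiously" was observed
print(p("sleep", "furiously"))   # 0.0: "furiously sleep" never occurs -> ruled out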

Learn latent (hidden) “part-of-speech classes” c ∈ C in a way that the probability of observed texts gets high!

p(w_1 . . . w_n) = p(w_1) · ∏_{i=2}^{n} p(w_i | w_{i−1})

p(w_i | w_{i−1}) = ∑_{c ∈ C} p(w_i | c) · p(c | w_{i−1})

p(Colorless green ideas sleep furiously.) / p(Furiously sleep ideas green colorless.) ≈ 200'000

An empirical rebuttal [?] of Chomsky

The ungrammatical sentence is about 200'000 times less probable than the semantically problematic one – definitely no “identical grounds to be ruled out”. (newspaper corpus with |C| = 16)

Generative Pre-Training (GPT): Deep Transformer Architectures as Massive “Latent” Parametrizations for Causal Language Models

• Why restrict the hidden parameters to 16? Let’s have millions or billions of them. . .

• Why restrict the latent variable connections to neighbor words? Let’s connect any word with any other word (aka. self-attention) . . .

p(w_1, . . . , w_n) = ∏_{i=1}^{n} p(w_i | w_1, . . . , w_{i−1})

p(w_i | w_1, . . . , w_{i−1}) = transformer(w_1, . . . , w_{i−1})
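
A minimal sketch of this causal factorization; lm_stub is a hypothetical uniform stand-in where a trained transformer would supply the next-token distribution, and all names and values are illustrative only:

import math

def lm_stub(prefix, vocab):
    # Stand-in for p(w_i | w_1, ..., w_{i-1}); a trained transformer would go here.
    return {w: 1.0 / len(vocab) for w in vocab}

def sentence_logprob(words, vocab):
    # log p(w_1, ..., w_n) = sum_i log p(w_i | w_1, ..., w_{i-1})
    return sum(math.log(lm_stub(words[:i], vocab)[w]) for i, w in enumerate(words))

vocab = ["colorless", "green", "ideas", "sleep", "furiously", "."]
print(sentence_logprob(["colorless", "green", "ideas", "sleep", "furiously", "."], vocab))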


Big question: Is raw text enough?

2.1.5 Annotation
Empirical Turn in (Computational) Linguistics

• Corpora = Digital collections of texts

• Annotated corpora (datasets) = text data interpreted by humans

CL Can machines learn from annotated data to interpret new data accordingly?

Tasks Morphological analysis, classification of words, syntactic analysis

[Figure: syntactically and morphologically annotated German sentence “Sie gehen gewagte Verbindungen und Risiken ein, versuch…”: constituent nodes (S, CS, CNP, NP) with grammatical functions (SB, HD, OA, SVP, CJ, CD, NK) over morphologically analyzed POS tags (PPER, VVFIN, ADJA, NN, KON, PTKVZ, $,)]

Complex Semantic Annotations


“Officials have warned opposition activists not to hold demonstrations”

Semantic Task
Calculate the intended meaning of a sentence!

[Figure: discourse representation structure of the sentence]

2.1.6 Tasks
Tasks = Language + Data + a Bit of Linguistics

[Figure: tasks arise where language, annotation conventions, data, a bit of linguistics, AI, and the theory of numerical optimization meet]

Solve tasks = achieve goals


“Intelligence is the computational part of the ability to achieve goals in the world.”

Further tasks
• Answer the question!

• Translate the sentence!

• Classify the texts! (Fake News vs Facts)

• Transcribe the recording as text!

Development
“Anytime a linguist leaves the group, the recognition rate goes up” (1988), Fred Jelinek, pioneer in Statistical Speech Recognition

Your Task: Modern NLP Tasks and their SOTA

In OLAT forum++▲

2.2 ML Cultures
ML Cultures in NLP Reduced to 3 Symbols
∀ Rule and logic-based modeling

N Data-oriented statistical modeling

∂ Probabilistic, data-driven, neural modeling

Illustrated with 3 example tasks in the following

2.2.1 xor

∀: XOR Task with Nominal Categories
A simple data set

a b a XOR b
True True False
True False True
False True True
False False False

a XOR b = True if a ≠ b; False if a = b        (for all a, b ∈ {True, False})

Interpretability/Explainability

• Question: Why does a XOR b result in True?

• The declarative definition answers the question trivially.

N: XOR Task as a Numeric Calculation

a XOR b = (a − b)(b − a) ∀a, b ∈ {0, 1}

Interpretability

• Numeric encoding of the truth values

• A numeric score is calculated, which in turn can be interpreted as a truth value.

• Many other arithmetizations are conceivable, e.g. a XOR b = (a − b)²
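
A minimal check of these arithmetizations over all four input combinations, interpreting any nonzero score as True (a hypothetical snippet, not part of the original slides):

for a in (0, 1):
    for b in (0, 1):
        nominal = a != b               # declarative definition
        score_1 = (a - b) * (b - a)    # nonzero (here -1) exactly when a != b
        score_2 = (a - b) ** 2         # 1 exactly when a != b, else 0
        print(a, b, nominal, score_1 != 0, bool(score_2))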

∂: XOR Task as a Neural Network I
A simple attempt
Computation graph and formula

Probabilistic interpretation of y
If y is the probability of x1 XOR x2 being True then 1-y is the probability of x1 XOR x2 being
False.

• Weighted sum of input and bias

• sigmoid(x) = 1 / (1 + e^(−x)) as nonlinear activation

• Squeezes any number to an interval of 0 to 1:

• Bad news! Cannot calculate XOR! There are no good values for w1 , w2 , b.

∂: XOR Task as a Neural Network II


With an intermediate layer
Latent nodes needed!

Good news! With an unlimited number of nodes in the intermediate layer, any computable function can be approximated.
That doesn’t mean it can be learned effectively from data!

Ingredients and recipe for learning the XOR function

1. Random: Initialize all violet edge weights randomly with numbers!

2. Predictions: Calculate the (incorrect) output for an input!

3. Error quantification: Quantify the error! (Loss Function)

4. Optimization: Correct weights slightly according to their share of the error! (Backpropagation)

5. Iteration: Start again at (2)

Principle: learn from mistakes thanks to numerical optimization!
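
The five-step recipe can be sketched in PyTorch. This is only an illustrative minimal example: layer sizes, learning rate and iteration count are assumptions, and training may occasionally need a re-run with a different random initialization.

import torch
import torch.nn as nn

# XOR data: inputs and targets
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

# (1) Random initialization: a small network with one intermediate (hidden) layer
model = nn.Sequential(nn.Linear(2, 4), nn.Sigmoid(), nn.Linear(4, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()                       # binary log loss
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

for step in range(5000):                     # (5) iterate
    y_hat = model(X)                         # (2) predictions
    loss = loss_fn(y_hat, y)                 # (3) error quantification
    optimizer.zero_grad()
    loss.backward()                          # (4) backpropagation
    optimizer.step()                         #     small weight correction

print(model(X).detach().round())             # ideally [[0.], [1.], [1.], [0.]]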

∂: Quantify errors (Loss Function)

• prediction: y = p(1 XOR 1 = True) = 58%: Randomly set weights make mistakes!

• Continuous Loss Function quantifies errors (binary Logloss)

∂: Optimize Weights (Backpropagation)
Stochastic gradient methods use partial derivation (∂). Change edge weights slightly to make
mistakes smaller. New probability of y after a training step: 54%.

∂: Reflection
Interpretability

• Why does a neural XOR output the correct result?

• Because it has learned from a random configuration iteratively to output the right one.

• “A goal achieving system is one that is more usefully understood in terms of outcomes
than in terms of mechanisms.” (Rich Sutton, 2016)

Visualization of Activations: Light into the Black Box

[Figure: LSTMVis — A Tool for Visual Analysis of Hidden State Dynamics in Recurrent Neural Networks (Hendrik Strobelt, Sebastian Gehrmann, Hanspeter Pfister, and Alexander M. Rush, Harvard School of Engineering and Applied Sciences; arXiv:1606.07461): screenshot of the LSTMVis user interface]

Framework for user exploration and testing of NNs [?]

2.2.2 Lexical Semantics

∀: Word Semantics as Taxonomies (WordNets)
Manually created structures:

All sharks are fish. All fish can swim.

N: Corpus-based Distributionalism [?]

An old pragmatic, empirical idea.

• “Die Bedeutung eines Wortes ist sein Gebrauch in der Sprache.” (“The meaning of a word is its use in the language.”) (Ludwig Wittgenstein (1953))

• “You shall know a word by the company it keeps!” (J. R. Firth (1957))

• “Words that occur in the same contexts tend to have similar meanings” (Pantel 2005)

Tiny corpus

• I like deep learning.
• I like NLP.
• I enjoy flying.

Idea
Similar words have similar lines.

Window-based cooccurrence matrix (word bigram statistics):

counts    I  like  enjoy  deep  learning  NLP  flying  .
I         0   2     1      0      0        0     0     0
like      2   0     0      1      0        1     0     0
enjoy     1   0     0      0      0        0     1     0
deep      0   1     0      0      1        0     0     0
learning  0   0     0      1      0        0     0     1
NLP       0   1     0      0      0        0     0     1
flying    0   0     1      0      0        0     0     1
.         0   0     0      0      1        1     1     0

[?, 9]
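
A minimal sketch that rebuilds such a window-based cooccurrence matrix (window size 1, i.e. direct neighbours); the row/column order may differ from the table above:

import numpy as np

corpus = ["I like deep learning .", "I like NLP .", "I enjoy flying ."]
tokens = [sent.split() for sent in corpus]
vocab = sorted({w for sent in tokens for w in sent})
index = {w: i for i, w in enumerate(vocab)}

# Symmetric cooccurrence counts over a window of one word to the left and right
M = np.zeros((len(vocab), len(vocab)), dtype=int)
for sent in tokens:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                M[index[w], index[sent[j]]] += 1

print(vocab)
print(M)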

N: Distributional Empirical Thesauri


Similarity values calculated from huge corpora [?]

“a distributional thesaurus is an automatically produced ‘thesaurus’ which finds words that tend to occur in similar contexts as the target word” (https://www.sketchengine.co.uk/thesaurus/)

∂: Words as Numeric Vectors
One-Hot encoding: Lots of zeros and a one
Each word is a position (dimension) in a vector.

“In vector space terms, this is a vector with one 1 and a lot of zeroes
[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Dimensionality: 20K (speech) – 50K (PTB) – 500K (big vocab) – 13M (Google 1T)
We call this a ‘one-hot’ representation. Its problem:
motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] AND
hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0”
Problems with One-Hot Encoding

• similarity (vector product) of two words is always 0!

• vector length: Large corpora have hundreds of thousands of different words!

Solution: Word Embeddings: 50 to 500 real numbers


Dense, dimension-reduced continuous representation: W_i ∈ R^d
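
A small illustrative comparison of the two representations; the dense vectors below are random stand-ins for learned embeddings, so only the mechanics (dot product vs. cosine similarity) are meaningful:

import numpy as np

vocab = ["hotel", "motel", "banana"]

# One-hot: every word is its own dimension, so any two different words
# have dot product 0 -- "hotel" and "motel" look maximally dissimilar.
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
print(one_hot["hotel"] @ one_hot["motel"])        # 0.0

# Dense embeddings (random stand-ins for learned vectors, d = 4):
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=4) for w in vocab}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(emb["hotel"], emb["motel"]))         # nonzero in general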

∂: Learned 300-dimensional Word Representations▲

Continuous, sub-symbolic word representations (word embeddings).

∂: Words as vectors of 64 real numbers


Which line represents which word? (“war”, “peace”, “tomato”)

Continuous, sub-symbolic word representations (word embeddings).

∂: Word Embeddings as Hidden Layer


Distributional autoencoder: Your contexts determine who you are!

dimension reduction: “Don’t count, predict!” [?]


Learn to compress large one-hot vectors to low-dimensional vectors as Hidden Layer.

Prediction task of word2vec [?]

Embeddings as neural weights


Network has as many outputs as words in vocabulary.

∂: SoftMax: From vectors to probabilities


SoftMax: The “work horse” of probabilistic classification
Every numeric vector with n numbers can be normalized and interpreted as a probability
distribution over n classes!
Vector (logits) SoftMax Probability Interpretation

p(„iPhone6s“)=0.7

p(„MacBook“)=0.2

p(„iPad“)=0.1
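
A minimal softmax implementation; the logit values below are made up and only roughly reproduce the 0.7 / 0.2 / 0.1 example above:

import numpy as np

def softmax(logits):
    z = logits - np.max(logits)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.7, 0.0])       # arbitrary scores for 3 classes
probs = softmax(logits)
print(probs, probs.sum())                # probabilities that sum to 1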

∂: Most similar words to “motel” in vector space: Depending on training corpus and preprocessing
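
A hypothetical sketch of training word vectors and querying nearest neighbours with gensim (assuming gensim ≥ 4); on a toy corpus the neighbours are not meaningful, whereas with a large corpus “motel” would end up close to “hotel”:

from gensim.models import Word2Vec

sentences = [["I", "like", "deep", "learning"],
             ["I", "like", "NLP"],
             ["I", "enjoy", "flying"]]

# Tiny skip-gram model over the toy corpus
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)
print(model.wv.most_similar("like", topn=3))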

Computing in Learned Vector Spaces

Source: http://jalammar.github.io/illustrated-word2vec/

Data-Driven Embeddings in NLP


Robust, versatile representations: One of the main factors of progress in NLP!

2.2.3 Machine Translation


∀: Rule-based machine translation
Explicit construction of linguistic representations!
Errors propagate!

No automatic learning from mistakes!

Figure 2.1: Model of syntactic transfer translation

Parse and transfer tree from Langenscheidt T1 (1997)

N: Statistical Machine Translation


Required material: Parallel corpora = many translated sentences

Learn word alignments from parallel texts.

Phrase-based translation

Statistics about aligned word N-grams and monolingual N-gram sequences

∂: Neural translation à la seq2seq

Encoder-decoder Models (Sutskever et al. 2014)

[Figure (G. Neubig▲): an encoder LSTM reads the source sentence “kono eiga ga kirai </s>”; a decoder LSTM generates the target sentence “I hate this movie </s>” word by word via argmax]
A generic model to map any sequence of words to another sequence of words.
Encoder: Input
Reads Word Embeddings of the source sentence one after the other!

Autoregressive decoder: output


Takes encoded input and previously output words as input and outputs next word!
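
A minimal sketch of the greedy autoregressive decoding loop; step_fn is a stand-in for one decoder step (e.g. an LSTM cell plus output layer), and the canned transition table only serves to make the example runnable:

def greedy_decode(state, step_fn, bos="<s>", eos="</s>", max_len=10):
    # Greedy autoregressive decoding: the previously output word is fed back in.
    word, output = bos, []
    for _ in range(max_len):
        state, scores = step_fn(state, word)     # one decoder step
        word = max(scores, key=scores.get)       # argmax over the vocabulary
        if word == eos:
            break
        output.append(word)
    return output

# Toy stand-in for a trained decoder step: walks through a canned sentence.
canned = {"<s>": "I", "I": "hate", "hate": "this", "this": "movie", "movie": "</s>"}
step = lambda state, w: (state, {canned[w]: 1.0})
print(greedy_decode(None, step))                 # ['I', 'hate', 'this', 'movie']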

LSTM [?]
Recurrent neural network (RNN) that learns suitable representations!
2.2.4 Generation
∀: Generate Word Forms
Finite automaton with actions on character level
work+3rdSg --> works

[Figure (Lauri Karttunen 2005): finite-state transducer over character pairs such as w:w, o:o, r:r, k:k, +Base:0, +3rdSg:s, +Progr:i, 0:n, 0:g, +Past:e, 0:d]

• Copy (w:w)

• Substitute (+3rdSg:s)

• Insert (0:d)

• Delete (+Base:0) Note: 0 = empty string

N: Count-based approaches equip non-deterministic finite-state automata with transition weights. This results in weighted string pairs.

Sigmorphon Shared Task 2018: Learn to Generate Word Forms for 100 Languages
base form x features f → Word form y

• German: fliegen V;IND;PST;1;SG → flog

• Maltese: ried V;FIN;PST;PRF;3;SG;FEM → radet

Challenge

• Can a system also learn from only 100 examples?

• How do you control the generation?

• Idea: Build a neuronal stack machine that is trained with machine learning and searching
(similar to AlphaGo).

• Idea: Embed letters and morphological features as continuous vectors (similar to translation).

A quick look into our CL work [?]

“fliegen V;IND;PST;1;SG” → “flog”

[Figure: neural transducer with encoder, decoder, stack, and buffer processing the input character by character]

t   Action a_t    Output y       Stack                        Buffer
0   INSERT(BOS)   []             [INSERT(BOS)]                [f,l,i,e,g,e,n,EOS]
1   COPY          [f]            [COPY, INSERT(BOS)]          [l,i,e,g,e,n,EOS]
2   COPY          [f, l]         [COPY, COPY, . . . ]         [i,e,g,e,n,EOS]
3   DELETE        [f, l]         [DELETE, COPY, . . . ]       [e,g,e,n,EOS]
4   DELETE        [f, l]         [DELETE, DELETE, . . . ]     [g,e,n,EOS]
5   INSERT(o)     [f, l, o]      [INSERT(o), DELETE, . . . ]  [g,e,n,EOS]
6   COPY          [f, l, o, g]   [COPY, INSERT(o), . . . ]    [e,n,EOS]
7   DELETE        [f, l, o, g]   [DELETE, COPY, . . . ]       [n,EOS]
8   DELETE        [f, l, o, g]   [DELETE, DELETE, . . . ]     [EOS]
9   INSERT(EOS)   [f, l, o, g]
∂: “Label efficiency”: Generalizing
#x = How often is basic form x in training data? #f = How often is feature combination in training data?
System x / #x f /#f y
Correct Arsphenamin N;NOM;PL Arsphenamine
100 0 3 Arsphenaminen
1000 0 61 Arsphenaminnen
10000 0 594 Arsphenamine

System x / #x f /#f y
Correct belaufen V;IND;PST;3;PL beliefen
100 0 1 belauften
1000 2 14 belauften
10000 3 157 beliefen
Positive: errors are often intuitive (over-)generalizations

∂: Our shared task results for 102 languages [?]

• 100 training examples: on average 57%

• 1000 training examples: on average 87%

• 10000 training examples: on average 96%

Overall clearly the best system in 2018. . .

And still SOTA-like on morphological tasks such as grapheme to phoneme conversion in 2021,
morphological segmentation and inflection in 2022 [?] (MA thesis outcome)

2.3 Conclusion
ML Cultures in NLP Reduced to 3 Symbols

∀ Rule and logic-based modeling

N Data-oriented statistical modeling

∂ Probabilistic, data-driven, neural modeling with gradient methods

“The Bitter Lesson” of AI from Rich Sutton 2019


AI that tries to model in a “human-centric” way ultimately works less well.

∂ for linguistics [?]


“Recurrent Neural Networks in Linguistic Theory: Revisiting Pinker and Prince (1988) and the
Past Tense Debate”: “We suggest that the empirical performance of modern networks warrants
a reexamination of their utility in linguistic and cognitive modeling.”
Fueled the debate again. . .

Important Questions Regarding AI and NLP
• Can AI methods learn any task from scratch (raw data) without explicit intermediate linguistic structures? End-to-end learning paradigm
• What is the efficiency of such learning methods? Data efficiency, label efficiency, parameter efficiency, compute efficiency
• Which tasks on raw data are optimal for foundational models that quickly learn to solve
specialized downstream tasks (transfer learning)? Prompting (zero-shot, few-shot) vs.
fine-tuning . . .

Conclusion
• Machine learning techniques are better for complex, broad and “vague” problems than
manual modeling.
• Targeted training: Appropriate numerical representation allows numerical optimization
(learning from mistakes).
• End-to-end systems in which input and output are fully connected and optimizable,
avoid the problem of uncorrectable consequential errors.
• Representation learning: The optimization process learns appropriate numerical representations for the input and “latent” levels!
• The internal numerical representations are difficult to interpret (black box problem).

2.4 Further Study


• Mandatory reading for this lecture: Chapter 1 “NLP: A Primer” from [?] (available via OLAT)
• Preparatory reading for next lecture: Chapter 1-2 from [?]
• More on NLP and linguistics from a classical book: Chapter 1 "Introduction" and 3 "Linguistic Essentials" from [?]: https://github.com/shivamms/books/blob/master/nlp/snlp.pdf
• More on mathematical backgrounds for NLP from a classical book: Chapter 2 "Mathematical Foundations" from [?]: https://github.com/shivamms/books/blob/master/nlp/snlp.pdf
• See OLAT program for direct links and/or materials/literature folder for PDFs.

Questions
• What characterized the development of ML in NLP over the last years?
• What are the typical steps in neural learning?
• What is meant by representation learning?
• What can we learn from Chomsky’s argument and counter-argument about learning and
modeling?
• What output will a shallow neural network produce for XOR-style classification with or
without hidden layers▲ ?

Chapter 3

Machine Learning and Linear Classification

Learning Objectives

• Know the basics of machine learning

• Understand the connection of linear regression and (log-)linear classification

• Know the basic procedure of training: loss function and optimization

• Understand the basics of stochastic gradient

3.1 Books
3.1.1 ML
[?]: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd Ed.)

• General and practical introduction into ML using sklearn (= Scikit-Learn)

• Well-written and strong focus on concepts

• A lot of hands-on and jupyter/python code1

• Uses TensorFlow▲ and Keras▲ high-level abstraction API. . .

• First edition 2017 is on OLAT.


1 https://github.com/ageron/handson-ml2

[?]: Deep Learning

• Standard textbook on neural modeling covering a broad range of topics

• Can be overwhelming. . .

• Free online version https://www.deeplearningbook.org

• Nice wrap up of introductory chapters in hadrienj’s blog2

[?]: Dive into Deep Learning

• Good and very practical interactive introduction into neural modeling (pyTorch code
snippets)

• We will use it in class and tutorials

• Comprehensive online version https://d2l.ai/

[?]: Neural Network Methods for Natural Language Processing


2 https://hadrienj.github.io/deep-learning-book-series-home/

• Specific to NLP, applied but still academic

• Sometimes pretty fast-paced, sometimes with detailed examples

• General ML in the beginning needed for neural modeling; specific neural modeling in
the end

3.1.2 ML for NLP


[?]: Natural language processing with PyTorch

• Specific to NLP, introductory level

• Bound to the dynamic NN framework PyTorch▲

• General ML in the beginning needed for neural modeling; specific neural modeling in
the end

• Specific to NLP, medium level
• One of the best conceptual books (without code)

[?]: Practical Natural Language Processing

• Practically oriented modern NLP primer


• Focus on conversational agents in some chapters
• Good to learn about the practicalities with a very decent amount of theory

[?]: Natural Language Processing: A Machine Learning Perspective

• High-quality academic textbook with a lot of formulas (but no code base)


• Strong on theory of ML and NLP

40
3.2 Basics

Machine Learning: Preview: Dataset in Vector Space
Typical ML vector data: the input is a set of numerical feature vectors, e.g. two features per instance:

x1   x2
0.2  0.2
0.4  0.3
0.9  0.6
1.0  1.2

[?]

Machine Learning: Preview: Supervised: Regression
Data input x with (numeric) labels y:

x    y
0.2  0.2
0.4  0.3
0.9  0.6
1.0  1.2

This is . . . Regression [?]

Machine Learning: Preview: Unsupervised: Clustering
Data input without labels:

x1   x2
0.2  0.2
0.4  0.3
0.9  0.6
1.0  1.2

This is . . . Clustering [?]
Machine Learning: Preview: Supervised: Classification
Data input x with labels (classes) y:

x1   x2   y
0.2  0.2  0
0.4  0.3  0
0.9  0.6  1
1.0  1.2  1

This is . . . Classification [?]

Supervised/Unsupervised Learning
"Machine Learning systems can be classified according to the amount and type of supervision they get during training. There are four major categories: supervised learning, unsupervised learning, semisupervised learning, and Reinforcement Learning." [?]

Supervised Machine Learning: Classification
"In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels. [. . . ] A typical supervised learning task is classification. The spam filter is a good example of this: it is trained with many example emails along with their class (spam or ham), and it must learn how to classify new emails."
[Figure 1-5: a labeled training set for supervised learning (e.g., spam classification)] Source: [?]

What steps does a predictive NLP modeling (supervised learning) task involve?
Peer-to-peer explanation (3 minutes): Person A (the older one) explains to person B (the younger one)

Modern ML-based NLP Pipeline Lifecycle
[?]
Read Chapter 2 of [?] for more details on each step!
Typical Workflow of Predictive Modeling

[Workflow figure] Source: [?, 11]

5 Steps for Supervised Machine Learning
1. Feature engineering: extraction, encoding, transformation (e.g. standardization (e.g. to values between 0 and 1)), selection
2. Performance metrics selection: internal (directly optimized for in training) vs. external evaluation measure (application dependent)
3. Selection of classifier and optimization algorithm: Loss Function, Optimizer, Regularization, hyper-parameter tuning
4. Evaluation of models: cross-validation if feasible
5. Revise any of the preceding steps!

sklearn: the framework supports these steps with well-designed abstract interfaces.
3.2.1 Splits
Proper Fitting: Training and Held-Out Sets

Source: [?, 174]

Data Splits in Supervised ML


Training Set
Use the training samples for optimizing the parameters of the model. Training Set

Validation/Development Set
Use the validation samples for tuning hyperparameters:
• of the machine learning algorithm

• control generalization behavior (over-/underfitting, early stopping)

Test Set
Use test sample (ideally once, after final training) for measuring performance on unseen data.
Measures the generalizability of your model on new data (if training/dev data is representative for it).
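How such splits are typically created in code is not shown in the script; a minimal sketch with sklearn's train_test_split (the dummy data, variable names and 80/10/10 ratios are illustrative assumptions):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 2)                     # 100 instances, 2 features (dummy data)
y = (X[:, 0] + X[:, 1] > 1).astype(int)        # dummy binary labels

# 80% training, 10% validation/development, 10% test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)
# Optimize parameters on the training set, tune hyperparameters on the validation set,
# and evaluate once on the test set.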

Generalization: Fitting on ??? Data

[Figure: error as a function of features/capacity/training epochs; the training error keeps decreasing, while the ??? error first decreases (underfitting) and then increases again (overfitting)] [?]

Factors for fitting
• Features: amount of input features
• Capacity ≈ amount of model parameters
• Training epochs: How many times did a (gradient-based) learning algorithm saw the
training set

Questions
• What is ??? error?
• Where is the perfect spot for fitting?

In what ways is this picture idealizing? More on learning curve diagnostics▲ . . .


Fitting Illustrated on Artificial Data

[Figure 1.2 from [?, 4]: a training data set of N = 10 points (blue circles), each an observation of the input variable x with the corresponding target variable t; the green curve shows the function sin(2πx) used to generate the data. The goal is to predict t for a new value of x without knowledge of the green curve.]

Sparse noisy training data and a "known" target function

Underfitting and Overfitting on Numerical Data

3.2.2 Linear Algebra
Linear Algebra: Scalars, Vectors, Matrices, Tensors
Source: https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.1-Scalars-Vectors-Matrices-and-Tensors/
1.1. What
Example: Polynomial Curve Fitting
are tensors? Attend tutorial session!

[Figure 1.4 from [?, 4]: plots of polynomials of various orders M (here M = 0, 1, 3, 9), shown as red curves, fitted to the data set of Figure 1.2.]

The model is a polynomial of order M,

y(x, w) = w_0 + w_1 x + w_2 x^2 + . . . + w_M x^M = Σ_{j=0}^{M} w_j x^j ,

which is nonlinear in x but linear in the coefficients w; the coefficients are determined by minimizing the sum-of-squares error E(w) = 1/2 Σ_{n=1}^{N} (y(x_n, w) − t_n)^2 on the training data.

What are Tensors? (broad sense)
A generalization over scalars, vectors, matrices, tensors (narrow sense)

• Note: Numerical structures with n dimensions or ranks
• A rank-0 tensor: Any number (scalar) x ∈ R
• A rank-1 tensor: Any numerical vector x ∈ R^n
• A rank-2 tensor: Any numerical matrix X ∈ R^{m×n}
• A rank-3 tensor: A 3-dimensional numerical structure A ∈ R^{m×n×o}
• A rank-n tensor: An n-dimensional numerical structure . . .
• What is the shape of a tensor? Number of indexes
• What are tensors of type Nn good for? Indexing
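A small illustrative sketch of ranks and shapes, assuming numpy arrays as stand-ins for tensors:

import numpy as np

scalar = np.array(3.14)              # rank-0 tensor, shape ()
vector = np.array([1.0, 2.0, 3.0])   # rank-1 tensor, shape (3,)
matrix = np.ones((2, 3))             # rank-2 tensor, shape (2, 3)
cube = np.zeros((2, 3, 4))           # rank-3 tensor, shape (2, 3, 4)

for t in (scalar, vector, matrix, cube):
    print(t.ndim, t.shape)           # rank = number of indexes, shape = size along each index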

Dot Product: Vectors and Column Vectors

Matrix Multiplication▲
Fill the result matrix with vector dot products. . .

Matrix Multiplication and Sizes/Shapes▲
• row vector dot column vector
• column vector dot row vector
• row vector dot matrix
• matrix dot column vector
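The shape rules for these products can be checked directly in numpy; a small sketch (the concrete sizes are arbitrary assumptions):

import numpy as np

x = np.ones((1, 4))      # row vector, shape (1, 4)
w = np.ones((4, 1))      # column vector, shape (4, 1)
W = np.ones((4, 6))      # matrix, shape (4, 6)
M = np.ones((3, 4))      # matrix, shape (3, 4)

print((x @ w).shape)     # row vector dot column vector  -> (1, 1)
print((w @ x).shape)     # column vector dot row vector  -> (4, 4), the outer product
print((x @ W).shape)     # row vector dot matrix         -> (1, 6)
print((M @ w).shape)     # matrix dot column vector      -> (3, 1)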
Goldberg's "Idiosyncratic" Notational Conventions
• Bold upper case letters (W) denote matrices, bold lower case letters (b) denote vectors; b_[i] and W_[i,j] denote indexing.
• Dot product: w · v = Σ_i w_i v_i = Σ_i w_[i] v_[i]
• x_{1:n} denotes the sequence x_1, . . . , x_n; x_{n:1} the reversed sequence with x_{n:1}[i] = x_{n−i+1}; [v_1; v_2] denotes vector concatenation.
• Vectors are row vectors, so a linear layer is written xW + b rather than Wx + b.

Read from left to right, Goldberg's notation resembles more the direction of processing from input to output.

NLP papers also often use widely different notation styles. . . You need to get used to it/them. . .

3.2.3 Preprocessing/Encoding
Levels of Measuring▲ (de: Skalenniveaus)
commons.wikimedia.org /w/index.php?curid=724035
Comparing levels; in red: Additional property. nominal: frequencies, ordinal: ordering, interval: distances, ratio:
meaningful zero

Try to give an example for each level from NLP. . .

Encoding: Turning Reality into Numerical Values


A process of abstraction and measuring. . .

Source: [?, 9]

Encoding: Turning Reality into Numerical Values


Classical feature engineering for speech (and music): Mel-frequency cepstral coefficients▲

Example: Features for Speech
• Task: Speech recognition
• Features: MFCC features [Davis and Mermelstein 1980]
• Each frame of 10ms is represented by a 13-dimensional vector
• The vector is derived from the spectrum (frequency domain) of the speech signal [?]

More information about this transformation [?]. Good technical blog with implementation▲: http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/

Encoding of Data into Fixed-Width Numerical Representations
[?]

Standard ML Notation for Supervised Datasets


Some authors deviate slightly from standard as used in [?].
• m: Number of instances in data set
• x(i) : Vector of all feature values (excluding the label of the ith instance)
• y^(i): Numeric output/target value of the ith instance
• X: matrix containing all the feature values (excluding labels) of all instances in the dataset; there is one row per instance, and the ith row is equal to the transpose of x^(i), noted (x^(i))^T
• h: the system's prediction function, also called a hypothesis; given an instance's feature vector x^(i), it outputs a predicted value ŷ^(i) = h(x^(i)) ("y-hat")
• RMSE(X, h): the cost function measured on the set of examples using the hypothesis h

We use lowercase italic font for scalar values (such as m or y(i)) and function names (such as h),
lowercase bold font for vectors (such as x(i)), and uppercase bold font for matrices (such as X).
Encoding: Dealing with Words
A small corpus
Time flies like an arrow.

Fruit flies like a banana.


How can we encode the words as numerical vectors?

Encoding: Turning Words into Numerical Values


One-Hot-Encoding of vocabulary of small corpus

[?]

Applicable to all nominal/categorical data items.

A Heatmap for One-hot-Encoded Texts▲


import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['Time flies flies like an arrow.',
'Fruit flies like a banana.']
one_hot_vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b',
binary=True)
one_hot = one_hot_vectorizer.fit_transform(corpus).toarray()
sns.heatmap(one_hot, annot=True, cmap="Reds", cbar=False,
xticklabels=one_hot_vectorizer.get_feature_names(),
yticklabels=[s[0:5]+"..." for s in corpus])
[?]
Note binary=True caps counts at 1.

A Heatmap for One-hot-Encoded Texts: Set-of-Words

51
Sparse text representation: BOW
[?]

Bag-of-Words Representation

[Figure: each token of "The quick brown fox jumps over the dog ." is mapped to a one-hot vector of dimension ~10^5; summing these vectors yields a single bag-of-words count vector for the sentence] [?]

Term Frequency

https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/

Inverse Document Frequency

52
https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/

TF.IDF

https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/

TF.IDF: An Example
A = The car is driven on the road.
B = The truck is driven on the highway.

https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/
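The same two sentences can also be vectorized with sklearn; a minimal sketch (the exact weights differ from the blog post, since TfidfVectorizer uses a smoothed IDF and L2 normalization by default):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The car is driven on the road.",
        "The truck is driven on the highway."]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)       # sparse matrix of shape (2, vocabulary size)
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))              # terms occurring in both documents get lower weights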

53
3.3 Models
Model Classes (Hypothesis Classes)

• Supervised parametrized learning: Find the best parameters θ for function y = fθ (x) =
f (x; θ) that computes optimal outcome y for given evidence x.

• Hypothesis space: Classes of all possible models. What kind of hypothesis spaces do you
know? Linear models, decision trees, multilayer perceptrons . . .

• The decision on the hypothesis space has consequences. . . Which ones?

Decision Boundaries of Different ML Models

http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

Learning Theory: No Free Lunch Theorem▲ [?]

• "any two algorithms are equivalent when their performance is averaged across all possi-
ble problems."

• Each dataset is different.

• Different classifiers have different assumptions. With respect to what?

• Different classifiers can draw different decision boundaries.

• Machine Learning requires the use of different methods and their systematic evaluation

• However, SOTA methods of an application area show “best meal” practice

Parametric vs Non-parametric Modeling


Parametric [?, 16]

• Fixed set of parameters

• Pro: Simple to use (parameters are given by modeling class)

54
• Con: Modeling class introduces (eventually wrong) assumption about the data

Non-Parametric

• Increasing parameters as training size increases

• Pro: Only data count, flexible modeling

• Con: Suboptimal estimations for high dimensional data

• Con: High calculation effort for large datasets

Example for Non-Parametric Modeling: KNN


K-nearest Neighbors (KNN) Classification
• Idea: Select class c for input x with the most frequent k similar training examples (= nearest neighbors).
• So-called memory-based learning (also instance-based learning)

[Figure: KNN classification with K = 3] [?, 16]
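A minimal sklearn sketch of the idea, reusing the toy data points from the preview tables above (labels and the query point are made up):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.2, 0.2], [0.4, 0.3], [0.9, 0.6], [1.0, 1.2]])
y = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)                      # "training" only stores the instances
print(knn.predict([[0.8, 0.8]]))   # majority class among the 3 nearest neighbors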

Standard Procedure in Parametric Learning
• Features get parameterized by weights
• Objective function connects feature values, parameters and prediction
• Optimization algorithm maximizes the objective function, that is, minimizes the loss function
• The objective should include some form of regularization term for improved generalization
3.3.1 Linear
Supervised Machine Learning: Regression

[Figure 1-6: a regression task, predicting a target numeric value from input features] Source: [?]

Linear Regression
• Regression problem: find a function that fits the given data points as well as possible
• Linear regression (LR): the function is linear, i.e., y = w^T x + b
• Input: vector x ∈ R^n, output: scalar y ∈ R
• Parameters of the LR model: weight vector w ∈ R^n and bias b ∈ R
[?]

Great tutorial on Linear Regression▲ if you are new to that concept

Linear Regression as a Neural Network▲
Where are the parameters? Where is the bias? Can you complete the drawing to match the formula of linear regression?
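A small numpy sketch of fitting w and b on toy data via the closed-form least-squares solution (the data-generating weights are arbitrary assumptions; the same parameters can also be found iteratively with gradient descent, cf. Section 3.4):

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 2))                                   # 100 instances, 2 features
y = 2 * X[:, 0] + 3 * X[:, 1] + 1 + 0.01 * rng.standard_normal(100)

X1 = np.hstack([X, np.ones((100, 1))])                     # constant column so that b is learned too
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(w)                                                   # approximately [2, 3, 1] = (w1, w2, b)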

Simple Linear n-ary Classification Example


Language Identification for 6 Languages
English, French, German, Italian, Spanish, Other

• Each document is represented by a vector x of character-level features (in Goldberg's example, letter-bigram counts, 784 dimensions).
• For a single language L ∈ {En, Fr, De, It, Es, Other}, a binary linear scorer is ŷ = f(x) = x · w_L + b_L with w_L ∈ R^784 and b_L ∈ R.
• Stacking the six scorers gives ŷ = f(x) = x · W + b with W ∈ R^{784×6} and b ∈ R^6; the predicted language is prediction = argmax_i ŷ_[i]. [?]

How does this work?

From Regression to Classification
For a binary decision with two features x_[0], x_[1]:

ŷ = sign(f(x)) = sign(x · w + b) = sign(x_[0] · w_1 + x_[1] · w_2 + b), with w = [w_1, w_2]

and the decision rule ŷ ≥ 0 vs. ŷ < 0.

• What is learned?
• What is predicted?
• Which values do we expect for f(x)?

"Geometrically, the points x · w + b = 0 define a hyperplane (which in two dimensions corresponds to a line) that separates the space into two regions."

[Scatter plot from [?]: apartment size vs. price, with a separating line x · w + b = 0 parametrized by w_1, w_2 and b]

Sign Function

sgn(x) = −1 if x < 0, 0 if x = 0, +1 if x > 0

How does the graph of this function look like?

Linear Separability

[The same scatter plot: size vs. price; ◦ blue: Dupont Circle, DC; × green: Fairfax, VA]

Questions
• Are the regions linearly separable by two features?
• Are the regions linearly separable by a single feature?

Important
The typical high dimensionality of NLP features helps to generate linear separability. . .
3.3.2 Binary
Weakness of Linear Modeling

• What’s missing in linear modeling?

• Class probabilities!

• Which values are possible for y = f(x) = x · w + b?

• Which values are possible for y = f(x) = sign(x · w + b)?

• How can we get to probabilities?

• What is the value range of probabilities?

• What else has to be fulfilled?
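A tiny numpy sketch previewing the answer developed on the next slides (the score values are made up): the sigmoid squashes a single linear score into [0, 1], and softmax turns a whole score vector into a probability distribution.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())          # subtract the maximum for numerical stability
    return e / e.sum()

score = 2.0                          # some value of x · w + b
print(sigmoid(score))                # about 0.88, interpretable as P(y=1|x)

scores = np.array([1.0, 2.0, 0.1])   # one linear score per class
print(softmax(scores), softmax(scores).sum())   # non-negative values that sum to 1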

59
[Plot of the sigmoid function σ(x) for x ∈ [−6, 6]]

Logistic Regression
Who am I? What can I do?

Logistic Regression: Probabilistic Binary Classification
Idea: Apply the logistic function to the result of linear regression:

h = x · w + b
o = 1 / (1 + exp(−h))

P(y = 1 | x, w, b) = 1 / (1 + exp(−h))
P(y = 0 | x, w, b) = 1 − P(y = 1 | x, w, b)

Log-linear Modeling
• The linear score f(x) = x · w + b can take any value in [−∞, +∞]; sign(f(x)) maps it to {−1, +1}.
• To obtain values in [0, 1], apply the sigmoid function σ(x) = 1/(1 + e^−x):

  ŷ = σ(f(x)) = 1 / (1 + e^−(x·w+b))

• The result can be interpreted as a probability: σ(f(x)) = P(ŷ = 1 | x), and P(ŷ = 0 | x) = 1 − P(ŷ = 1 | x) = 1 − σ(f(x)).

How can we generalize to n classes?

Outlook: N-ary Log-linear Modeling

Sigmoid: σ(x) = 1/(1 + e^−x). What is sigmoid doing to numbers?

Softmax: softmax(x)_[i] = e^{x_[i]} / Σ_j e^{x_[j]}. What is softmax doing to vectors of numbers?

• Instead of sigmoid, use softmax!
• How could the name "log-linear" be motivated?
• Softmax regression is Multinomial Logistic Regression:

  ŷ = softmax(xW + b)
  ŷ_[i] = e^{(xW+b)_[i]} / Σ_j e^{(xW+b)_[j]}

SoftMax: From vectors to probabilities
SoftMax: The "work horse" of probabilistic classification
Every numeric vector with n numbers can be normalized and interpreted as a probability distribution over n classes!

[Figure: a vector of logits x_{1:n} is mapped by softmax to a probability interpretation y_{1:n}, e.g. p("iPhone6s") = 0.7, p("MacBook") = 0.2, p("iPad") = 0.1]
3.4 Training
3.4.1 Loss
Training: The Function of Loss Functions

What is training (optimization) for? Searching for better parameters.
Given a training set (x_{1:n}, y_{1:n}) and a parametrized model f(x; Θ), change the model parameters Θ such that the value of the loss function (or "cost function") over the training set gets smaller:

L(Θ) = (1/n) Σ_{i=1}^{n} L(f(x_i; Θ), y_i)

Training searches for the parameters that minimize this loss:

Θ̂ = argmin_Θ L(Θ) = argmin_Θ (1/n) Σ_{i=1}^{n} L(f(x_i; Θ), y_i)

Linear models allow for a guided search with guarantees on convergence (on the training data). Or even for an analytical solution . . .
The Function of Loss Functions
What are loss functions for? What are their properties?

• They quantify the quality of the current model by a single number. Computed per train-
ing item and summed up over whole training set.

• Minimal when all examples from the training set get predicted perfectly.

• What does perfect mean? For probabilistic classifiers 100% probability for the correct
solution is needed.

• What characterizes good loss functions? They have unique extreme values which have a
closed solution or can be approximated iteratively for any data set.

Typical loss function for regression: Mean Squared Error


MSE(X, θ) = (1/m) Σ_{i=1}^{m} (θ^T x^{(i)} − y^{(i)})^2

Squared Error: Visualized

https://www.dataquest.io/blog/understanding-regression-error-metrics/
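A one-line numpy check of the formula on made-up numbers:

import numpy as np

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 4.0])
print(np.mean((y_pred - y_true) ** 2))   # (0.25 + 0 + 4) / 3 ≈ 1.42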

3.4.2 Optimization
Convex and Concave Functions
When is a function z = f (x, y) concave?

• How can I turn any concave function into a convex function?

What characterizes convex functions y = f (x) ?


Convexity

• Their graph lies on or below any straight line drawn between two points of the graph.

• Is a straight line convex?

• Convex functions have a unique extreme value.

3.4.3 Gradients
Iterative Computation of ArgMax of a Concave Continuous Function: Gradient Ascent
Algorithm argmax_x f(x) for unary functions (1 parameter)

• Parameters:

  – function f : R → R

  – step size a > 0 (if a is too large, the algorithm diverges)

• Update rule for argument x :


1: repeat
2: x ← x + a ∗ f ′ (x)
3: until “x has converged”

• f ′ is the derivative of f

• What is crucially needed? Automatic differentiation! [?, D]

• Gradient leads to maximum of a function. How can we find the minimum of a convex
function?

Gradient Ascent▲ for f : R → R in Python with Sympy▲


import random
from sympy import diff, symbols

x = symbols('x')

def baby_gradient_ascent(f, a, init=None, eps=0.001, maxi=100):
    """ Return arg max of function f: float -> float

    f is a sympy expression with x as symbolic parameter. a is the step size.
    init is set to a random number [-0.5,0.5) if not specified.
    eps is the convergence criterion. maxi is the maximum number of
    iterations."""

    f_deriv = diff(f, x)                      # symbolic derivative f'
    argmax = random.random()-.5 if init is None else init
    converged = False
    iteration = 0
    while not converged and iteration < maxi:
        iteration += 1
        oldargmax = argmax
        slope = f_deriv.subs(x, argmax)       # evaluate f' at the current argument
        argmax += a*slope                     # gradient ascent update step
        if abs(oldargmax - argmax) < eps:
            converged = True
    return argmax
Q: How can we turn this code into a gradient descent?

Notation

• f ′ (x): The derivative of f if x has only one dimension.


• ∂f/∂w1: The partial derivative of f with respect to w1.

• ∇f : shorthand form to denote the gradient of f . The collection of all partial derivatives
of f .

• ∇w f : the gradient of f with respect to some (vector) w

Please study [?] Chapter 5 for a well-motivated and well-worked example.
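To connect the notation with the loss section above, here is a small sketch (not from [?]) of batch gradient descent for linear regression with the MSE loss, whose gradient is ∇_θ MSE = (2/m) X^T(Xθ − y); data and hyperparameters are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([rng.random((100, 1)), np.ones((100, 1))])    # one feature plus a bias column
y = 4 * X[:, 0] + 3 + 0.1 * rng.standard_normal(100)

theta = np.zeros(2)                                         # parameters (w, b)
eta = 0.1                                                   # step size / learning rate
for _ in range(2000):
    gradient = 2 / len(y) * X.T @ (X @ theta - y)           # gradient of the MSE loss
    theta -= eta * gradient                                 # move against the gradient
print(theta)                                                # approximately [4, 3]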

64
Directed Search: Gradients as Compass

[Figure 4.1 from [?, 83]: an illustration of how the derivative of a function can be used to follow the function downhill to a minimum; for small ε, f(x − ε sign(f′(x))) is smaller than f(x). This technique is called gradient descent.]

Try to explain this figure to your neighbour!

Why do we need a step size? (Learning Rate)

[Figure from [?]: the derivative of f in a point p as the slope a of the local linear approximation of f]

• Assumption: f is a smooth, continuous, differentiable function (no sudden changes of direction); for such functions the derivative exists and gives the slope of the local linear approximation.
• A small change in x leads to a small change in y.
• Locally, the slope a = f′(x) predicts the rate of change: f(x + ε_x) ≈ y + ε_x f′(x)
XOR: Famous Nonlinear Problem Demo on Colab▲

XOR Problem: Not linearly separable▲

• Solution I: problem-specific feature transformation (non-linear transformation)

• Solution II: general kernel function (can sometimes be done efficiently with the kernel trick)3

• Solution III: Multi-Layer Neural Networks▲ (see the sketch below)
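A minimal sklearn sketch contrasting a linear classifier with Solution III on XOR (the hyperparameters are arbitrary choices, not a recommendation):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])                                  # XOR labels

linear = LogisticRegression().fit(X, y)
print(linear.score(X, y))                                   # at most 0.75: no line separates XOR

mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                    solver="lbfgs", max_iter=5000, random_state=1).fit(X, y)
print(mlp.score(X, y))                                      # usually 1.0 with one hidden layer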

Conclusion

• Linear regression is a simple, yet powerful parametric modeling approach

• Adding a threshold function such as sign results in linear classification

• Adding a sigmoid function results in probabilistic binary classification

• Adding a softmax function results in probabilistic multinomial classification (MaxEnt modeling)

• MaxEnt was very popular in NLP before DL

• Modern ML uses objective functions that combine loss functions and regularization for
better generalization

• sklearn’s SGDClassifier▲ is a powerful linear modeling tool

3.5 Further Study

Mandatory reading

• Chapter 1-3 from [?] (use the online version https://d2l.ai/▲ )

• Chapter 2 of [?] on NLP Pipelines: Read it carefully if you never heard of things like
tokenization, lemmatization, evaluation (Precision, Recall, F-Measure)

Recommended reading
3
Video https://www.youtube.com/watch?v=OdlNM96sHio

66
• Chapter 1 and 2 of [?] on linear classification: Please answer all questions in the slides
connected to Goldberg.

• Chapter 1, 2, 4 of [?] if your ML and sklearn skills are rusty (the next tutorial will be on sklearn) and if you find d2l.ai too difficult for the moment

Questions

• Which steps does a modern ML-based NLP pipeline include?

• What is the difference between the two workflow schemas (the one from Vajjala 2020 vs
Raschka 2015)?

• Why are linear models interesting in NLP?

• Why are linear models “not enough” for modeling in NLP?

• What are the ingredients of ML with linear models?

67
Chapter 4

Learning with sklearn

Learning Objectives

• Data handling with pandas

• Preprocessing and data transformation with sklearn

• Fitting and predicting with sklearn API

• Combining data transformation with pipelines

• Confusion matrices

4.1 sklearn
ML Techniques Covered by sklearn
ML in sklearn

http://scikit-learn.org/stable/_static/ml_map.png

68
4.1.1 Data
Categorical Data: Pandas Dataframes

When we are talking about categorical data, we have to further distinguish between nominal and ordinal features. Ordinal features can be understood as categorical values that can be sorted or ordered. For example, T-shirt size would be an ordinal feature, because we can define an order XL > L > M. In contrast, nominal features don't imply any order; to continue with the previous example, we could think of T-shirt color as a nominal feature, since it typically doesn't make sense to say that, for example, red is larger than blue.

Before we explore different techniques to handle such categorical data, let's create a new data frame to illustrate the problem:
Mini example: T-Shirts with colors, size, price
>>> import pandas as pd
>>> df = pd.DataFrame([
... ['green', 'M', 10.1, 'class1'],
... ['red', 'L', 13.5, 'class2'],
... ['blue', 'XL', 15.3, 'class1']])
>>> df.columns = ['color', 'size', 'price', 'classlabel']
>>> df
color size price classlabel
0 green M 10.1 class1
1 red L 13.5 class2
2 blue XL 15.3 class1

AsWhich
we can
Building
see
levels
Good
inofthe
Training
preceding output, the newly created DataFrame contains a
measurement?
Sets – Data Preprocessing
nominal feature (color), an ordinal feature (size), and a numerical feature (price)
column.
We The class
can reverse labels
the (assuming
key-value that we created a dataset for a supervised
Do-it-yourself Mappings: f pairs
: M →inNthe mapping dictionary as follows to map the
learning
convertedtask) are stored
classinjective in
labels back the last
to the column. The learning algorithms for classification
Define your function fororiginal
encodingstring representation:
and decoding of.
that we discuss in this book do not use ordinal information in class labels.
>>> inv_class_mapping
Define mappings = {v: k for k, v in class_mapping.items()}
>>> df['classlabel']
size_mapping = {'XL':3, = df['classlabel'].map(inv_class_mapping)
'L':2, 'M':1}
Mapping ordinal features
inv_size_mapping
>>> df = {v:k for k,v in size_mapping.items() }
color size price classlabel
ToVectorized
make suremapping
that theonlearning
columnalgorithm interprets the ordinal features correctly, we
0 green 1 a10.1 class1
need to convert
df['size'] the categorical string values into integers. Unfortunately, there is no
1 red = df['size'].map(size_mapping)
2 13.5 class2
convenient function that can automatically derive the correct order of the labels of
our 2sizeblue 3
feature. Thus, 15.3
we have to class1
define the mapping manually. In the following
Using the preprocessing Package of sklearn
simple example,
Alternatively, let'sisassume
there that weLabelEncoder
know the difference betweenimplemented
features, for in
sklearn supports the acreation
convenient class directly
and application of mappings. sklearn
example, XLto= achieve
scikit-learn 2 . same:
L + 1 = M +the Estimator
The LabelEncoder class in action ▲
Methods
>>> size_mapping = {
>>> from sklearn.preprocessing import LabelEncoder
... 'XL': 3,
>>> class_le = LabelEncoder()
... 'L': 2,
>>> y = class_le.fit_transform(df['classlabel'].values)
>>> y [ 104 ]
array([0, 1, 0])

Note that the fit_transform method is just a shortcut for calling fit and
Methods separately,
transform and we canestimators
of these preprocessing use the inverse_transform method to
transform the integer class labels back into their original string representation:
>>> class_le.inverse_transform(y)
array(['class1', 'class2', 'class1'], dtype=object)

Performing one-hot encoding on nominal


features
In the previous section, we used a simple dictionary-mapping approach to convert
the ordinal size feature into integers. Since scikit-learn's estimators treat class labels
without any order, we used the convenient LabelEncoder class to encode the string
labels into integers. It may appear that we could use a similar approach to transform
the nominal color column of our dataset, as follows:
69

>>> X = df[['color', 'size', 'price']].values


>>> color_le = LabelEncoder()
>>> X[:, 0] = color_le.fit_transform(X[:, 0])
>>> X
Building Good Training Sets – Data Preprocessing
similar to the size_mapping dictionary that we used previously.
We can reverse the key-value pairs in the mapping dictionary as follows to map the
converted class labels back to the original string representation:
Encoding class labels
>>> inv_class_mapping = {v: k for k, v in class_mapping.items()}
>>> df['classlabel'] = df['classlabel'].map(inv_class_mapping)
Many>>> machine
df learning libraries require that class labels are encoded as integer
Proper
values. Data Transformation:
Although
color Fit on for
mostclasslabel
size price estimators Train, Apply on Test
classification in scikit-learn convert class
0 green 1 10.1 class1
labels
1
tored
integers
2
internally,
13.5
it is considered good practice to provide class labels as
class2
integer
2 arrays to
blue 3 avoid
15.3 technical
class1 glitches. To encode the class labels, we can use an
approach
Alternatively,similar
there is ato the mapping
convenient of ordinal features discussed previously. We need
LabelEncoder class directly implemented in
to remember
scikit-learn thatthe
to achieve class labels are not ordinal, and it doesn't matter which integer
same:
number
>>> fromwesklearn.preprocessing
assign to a particular string-label.
import LabelEncoderThus, we can simply enumerate the
class
>>>labels starting
class_le at 0:
= LabelEncoder()
>>> y = class_le.fit_transform(df['classlabel'].values)
>>>
>>>y import numpy as np
array([0, 1, 0])
>>> class_mapping = {label:idx for idx,label in
Note...
that the fit_transform methodenumerate(np.unique(df['classlabel']))}
is just a shortcut for calling fit and
transform separately, and we can use the inverse_transform method to
>>> class_mapping
transform the integer class labels back into their original string representation:
{'class1': 0, 'class2': 1} Source: [?, 103]
>>> class_le.inverse_transform(y)
array(['class1', 'class2', 'class1'], dtype=object)
Next we can use the
Transformation mapping
of Nominal dictionary to transform the class labels into integers:
Features
Encoding T-Shirt colors: What’s the problem?
Performing
Which color has
one-hot
>>> df['classlabel']
which code?
encoding on nominal
= df['classlabel'].map(class_mapping)
>>> df
features
color size price classlabel
In the previous section, we used a simple dictionary-mapping approach to convert
0 green 1 10.1 0
the ordinal size feature into integers. Since scikit-learn's estimators treat class labels
1 any red
without 2 the13.5
order, we used convenient LabelEncoder 1 class to encode the string
labels2into integers.
blue It may
3 appear that
15.3 we could use a similar
0 approach to transform
the nominal color column of our dataset, as follows:
>>> X = df[['color', 'size', 'price']].values
>>> color_le = LabelEncoder()
>>> X[:, 0] = color_le.fit_transform(X[:, 0])
>>> X
array([[1, 1, 10.1],
[2, 2, 13.5],
[0, 3, 15.3]], dtype=object)
Building Good Training Sets – Data Preprocessing
[ 105 ] ~
AnTypical
even more
machineconvenient way tomake
learning algorithms create
usethose dummy
of order betweenfeatures
numericalvia one-hot
values! encoding
Beware of
arbitrary ordering relations! [ 106 ]
is to use the get_dummies method implemented in pandas. Applied on a DataFrame,
the get_dummies method will only convert string columns and leave all other
One-Hot-Encoding▲ : A Dimension for Each Value
columns unchanged: One-Hot-
Dummy Features with Pandas: Easy and Explicit Encoding
>>> pd.get_dummies(df[['price', 'color', 'size']])
price size color_blue color_green color_red
0 10.1 1 0 1 0
1 13.5 2 0 0 1
2 15.3 3 1 0 0

How relevant are categorical feature values for language technology?

Partitioning a dataset in training and test


Too many dummy values are a waste of resources!
• sklearn’s OneHotEncoder supports sparse matrices (default: sparse_output=True)
sets• But one-hot-encoding only selected categorical columns needs the ColumnTransformer ! ▲
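A minimal sketch of one-hot-encoding only the nominal column with a ColumnTransformer (assuming the T-shirt DataFrame df from above; the chosen transformers are just one reasonable configuration):

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["color"]),   # nominal -> dummy dimensions
        ("scale", StandardScaler(), ["price"]),                          # numerical -> standardized
    ],
    remainder="passthrough",   # keep remaining columns (e.g. the mapped ordinal 'size') as they are
)
X = preprocess.fit_transform(df[["color", "size", "price"]])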

Dealing with Missing or Unknown/Unseen Values in Test/Application Data
Many possible strategies . . .
• Ignore/remove the data record
• Ignore/remove the feature values
• Provide explicit UNK values (unseen characters, bigrams, words); maybe already do that for rare features in the train/dev set to "accustom" the model to this
• Compute replacement values (imputing missing values): What values would be good for nominal values?

Eliminating samples or features with missing values (pandas):
>>> df.dropna()                # drop rows with missing values
>>> df.dropna(axis=1)          # drop columns with at least one NaN
>>> df.dropna(how='all')       # only drop rows where all columns are NaN
>>> df.dropna(thresh=4)        # drop rows that have fewer than 4 non-NaN values
>>> df.dropna(subset=['C'])    # only drop rows where NaN appears in column 'C'

Imputing missing values
One of the most common interpolation techniques is mean imputation, where the missing value is replaced by the mean value of the entire feature column:
>>> from sklearn.preprocessing import Imputer
>>> imr = Imputer(missing_values='NaN', strategy='mean', axis=0)
>>> imr = imr.fit(df)
>>> imputed_data = imr.transform(df.values)
Other options for the strategy parameter are median or most_frequent; the latter is useful for imputing categorical feature values.

Understanding the scikit-learn estimator API
The Imputer class belongs to the so-called transformer classes in scikit-learn that are used for data transformation. The two essential methods of these estimators are fit and transform: fit learns the parameters from the training data, and transform uses those parameters to transform the training data as well as new (test) data.

4.1.2 Normalization
Scaling of Numerical Data
Bringing features onto the same scale: many machine learning and optimization algorithms behave much better if features are on the same scale (decision trees and random forests are among the few algorithms where feature scaling is not needed).

Important for stochastic gradient and neural networks!

Min-Max Scaling (normalization) rescales each feature column to the range [0, 1]:

x_norm^(i) = (x^(i) − x_min) / (x_max − x_min)

where x_min is the smallest and x_max the largest value in a feature column.

MinMaxScaler
>>> from sklearn.preprocessing import MinMaxScaler
>>> mms = MinMaxScaler()
>>> X_train_norm = mms.fit_transform(X_train)
>>> X_test_norm = mms.transform(X_test)

Where do I get the minimal and maximal values from?
Nice: Relative frequencies or TF.IDF values are already reasonably bounded.

Standardization centers each feature column at mean 0 with standard deviation 1, which makes it easier to learn the weights and keeps the algorithm less sensitive to outliers:

x_std^(i) = (x^(i) − μ_x) / σ_x

where μ_x is the sample mean of a particular feature column and σ_x the corresponding standard deviation.

Normalization vs Standardization
The following table illustrates the difference between the two commonly used feature scaling techniques on a simple sample dataset consisting of the numbers 0 to 5:

input   standardized   normalized
0.0     -1.336306      0.0
1.0     -0.801784      0.2
2.0     -0.267261      0.4
3.0      0.267261      0.6
4.0      0.801784      0.8
5.0      1.336306      1.0

What's the impact of extreme values?

StandardScaler
>>> from sklearn.preprocessing import StandardScaler
>>> stdsc = StandardScaler()
>>> X_train_std = stdsc.fit_transform(X_train)
>>> X_test_std = stdsc.transform(X_test)

Proper data transformation
Learn on the training set, apply on held-out sets!
Feature Engineering with sklearn: Self-contained Introduction to Bag-of-Words, TF.IDF etc.

• tutorial on text feature extraction▲ introducing sklearn vectorizers for textual data

• Features selection with sklearn

• Sklearn Tutorial: Working with textual Data▲

• Chapter 8 of [?] on a sentiment analysis problem... if you need a more basic introduction

4.1.3 Learning
The Generic Estimator API of sklearn
sklearn
Estimator API

Source: [?, 103]

72
Hyperparameter Tuning

• Systematic grid search: Test all possible hyperparameters. Problems?

• Greedy Search: Chasing the most promising settings. Problem?

• sklearn supports the systematic evaluation of settings with its pipelining model

• More ambitious and general framework: Optuna▲ is a general hyperparameter optimization framework

• Optuna example▲

• Optuna example with sklearn▲ from the comprehensive "Scikit, No Tears" sklearn intro

Pipelining Model with sklearn

[Figure from [?]: the sklearn Pipeline concept. The intermediate steps constitute transformers (fit and transform are called on the training data and the result is passed on to the next step); the last step is an estimator, for which only fit is needed. There is no limit to the number of intermediate steps.]

An example on text processing▲

• sequential application of fit/transform estimators

• the last estimator only needs fit

A Pipeline Dealing with Different Feature Types▲
Pipeline Example with Text Vectorization▲
# Hyperparameters
sdg_params = dict(alpha=1e-5, penalty="l2", loss="log_loss")
vectorizer_params = dict(ngram_range=(1, 2), min_df=5, max_df=0.8)

# Supervised Pipeline
pipeline = Pipeline(
[
("vect", CountVectorizer(**vectorizer_params)),
("tfidf", TfidfTransformer()),
("clf", SGDClassifier(**sdg_params)),
]
)
Play with this example https://bit.ly/sklearn-text-classification

Grid Search▲ of Hyperparameters


Grid Search

pipeline = Pipeline(
    [("vect", TfidfVectorizer()),
     ("clf", ComplementNB())]
)

parameter_grid = {
"vect__max_df": (0.2, 0.4, 0.6, 0.8, 1.0),
"vect__min_df": (1, 3, 5, 10),
"vect__ngram_range": ((1, 1), (1, 2)), # unigrams or bigrams
"vect__norm": ("l1", "l2"),
"clf__alpha": np.logspace(-6, 6, 13),
}

random_search = RandomizedSearchCV(
estimator=pipeline,
param_distributions=parameter_grid
)

74
sklearn supports several gridsearch strategies

Gridsearch Visualization▲

Evaluation: Confusion Matrix▲


Confusion
Matrix

https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix
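A minimal sketch of computing and displaying a confusion matrix with sklearn (toy labels for
illustration):

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_true = ["spam", "ham", "spam", "spam", "ham"]
y_pred = ["spam", "spam", "spam", "ham", "ham"]

cm = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])
print(cm)   # rows: true labels, columns: predicted labels
ConfusionMatrixDisplay(cm, display_labels=["ham", "spam"]).plot()   # requires matplotlib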

4.2 skorch
skorch▲ : Provide Pytorch Functionality with sklearn Interfaces

• sklearn’s clean minimalistic ML interface

• pytorch’s neural modeling abilities

75
• pytorch’s performance on GPUs

• Learning by doing in the exercise 1 skeleton

Alternatives with simpler abstractions: PyTorch Lightning▲
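A minimal sketch of the skorch idea (the module, dimensions and data are illustrative
assumptions, not part of the exercise skeleton):

import torch.nn as nn
from skorch import NeuralNetClassifier

class MLP(nn.Module):
    def __init__(self, input_dim=20, hidden_dim=32, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, x):
        return self.net(x)          # raw logits

net = NeuralNetClassifier(MLP, criterion=nn.CrossEntropyLoss, max_epochs=10, lr=0.01)
net.fit(X_train, y_train)           # sklearn-style fit; X as float32, y as integer labels
y_pred = net.predict(X_test)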

4.3 Further Study


• Introduction to Pandas in 10 minutes▲

• Tutorial for sklearn from sklearn documentation▲

• Comprehensive introduction “Scikit, No Tears”▲ into sklearn covering all aspects

• Chapter 2 of [?] illustrates the use of pipelines and evaluations

• Extensive text classification example▲ using sklearn with feature analysis

76
Chapter 5

Generalized Linear Classification

Learning Objectives

• Understand the feature representation for linear binary and multiclass classification

• Understand the perceptron learning rule for binary and multiclass classification

• Understand the different learning strategies (minimize errors, maximize margin, maximize the
  likelihood of the training data) and their connection with loss functions

5.1 Binary
Linear Models in Theory

Linear Modeling

I The linear model for binary classification:

    ŷ = +1 if w · x + b > 0,   −1 if w · x + b < 0

I Learning as optimization (loss + regularization):

    ŵ, b̂ = argmin_{w,b} L(w, b; T) + λR(w, b)

I Gradient descent for convex optimization:

    w ← w − η∇f(w, b; T)

Linear Features in Practice: Sentiment Classification
77
Now we have an algorithm that, given an instance x, computes the probability P(y = 1|x). How do
we make a decision? For a test instance x, we say yes if the probability P(y = 1|x) is more than
.5, and no otherwise. We call .5 the decision boundary:

    ŷ = 1 if P(y = 1|x) > 0.5,  0 otherwise

5.1.1 Example: sentiment classification

Let's have an example. Suppose we are doing binary sentiment classification on movie review
text, and we would like to know whether to assign the sentiment class + or − to a review
document doc. We'll represent each input observation by the 6 features x1 . . . x6 of the input
shown in the following table; Fig. 5.2 shows the features in a sample mini test document.

Var   Definition                                Value in Fig. 5.2
x1    count(positive lexicon words ∈ doc)       3
x2    count(negative lexicon words ∈ doc)       2
x3    1 if “no” ∈ doc, 0 otherwise              1
x4    count(1st and 2nd pronouns ∈ doc)         3
x5    1 if “!” ∈ doc, 0 otherwise               0
x6    ln(word count of doc)                     ln(66) = 4.19

Figure 5.2 A sample mini test document showing the extracted features in the vector x:

    “It's hokey. There are virtually no surprises, and the writing is second-rate. So why was it
    so enjoyable? For one thing, the cast is great. Another nice touch is the music. I was
    overcome with the urge to get off the couch and start dancing. It sucked me in, and it'll do
    the same to you.”
    (x1 = 3, x2 = 2, x3 = 1, x4 = 3, x5 = 0, x6 = 4.19)

Let's assume for the moment that we've already learned a real-valued weight for each of these
features, and that the 6 weights corresponding to the 6 features are [2.5, −5.0, −1.2, 0.5, 2.0,
0.7], while b = 0.1. (We'll discuss in the next section how the weights are learned.) The weight
w1, for example, indicates how important a feature the number of positive lexicon words (great,
nice, enjoyable, etc.) is to a positive sentiment decision, while w2 tells us the importance of
negative lexicon words. Note that w1 = 2.5 is positive, while w2 = −5.0, meaning that negative
words are negatively associated with a positive sentiment decision, and are about twice as
important as positive words.

Given these 6 features and the input review x, P(+|x) and P(−|x) can be computed:

    p(+|x) = P(y = 1|x) = σ(w · x + b)
           = σ([2.5, −5.0, −1.2, 0.5, 2.0, 0.7] · [3, 2, 1, 3, 0, 4.19] + 0.1)
           = σ(0.833)
           = 0.70

    p(−|x) = P(y = 0|x) = 1 − σ(w · x + b) = 0.30

Other Classification Tasks and Features

Logistic regression is commonly applied to all sorts of NLP tasks, and any property of the input
can be a feature. Consider the task of period disambiguation: deciding if a period is the end of
a sentence or part of a word, by classifying each period into one of two classes, EOS
(end-of-sentence) and not-EOS. We might use features like x1 below expressing that the current
word is lower case and the class is EOS (perhaps with a positive weight), or that the current
word is in our abbreviations dictionary (“Prof.”) and the class is EOS (perhaps with a negative
weight). A feature can also express a quite complex combination of properties. For example, a
period following an upper case word is likely to be an EOS, but if the word itself is St. and
the previous word is capitalized, then the period is likely part of a shortening of the word
street.

    x1 = 1 if Case(wi) = Lower, 0 otherwise
    x2 = 1 if wi ∈ AcronymDict, 0 otherwise
    x3 = 1 if wi = St. and Case(wi−1) = Cap, 0 otherwise

Designing features: Features are generally designed by examining the training set with an eye to
linguistic intuitions and the linguistic literature on the domain. A careful error analysis on
the training set or devset of an early version of a system often provides insights into
features. For some tasks it is especially helpful to build complex features that are
combinations of more primitive features. We saw such a feature for period disambiguation above,
where a period on the word St. was less likely to be the end of the sentence if the previous
word was capitalized. For logistic regression and naive Bayes these feature combinations or
feature interactions have to be designed by hand.
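The worked sentiment example above can be checked numerically; a quick numpy sketch with the
weights and feature values from the example:

import numpy as np

w = np.array([2.5, -5.0, -1.2, 0.5, 2.0, 0.7])
x = np.array([3, 2, 1, 3, 0, 4.19])
b = 0.1

z = w @ x + b                     # 0.833
p_pos = 1 / (1 + np.exp(-z))      # sigma(0.833) ~ 0.70
print(round(z, 3), round(p_pos, 2))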
Linear Models in Practice

I Binary classification is sometimes useful


I Spam filtering, spell checking, . . .
I But most NLP problems involve more than two classes
I Text categorization: news, business, culture, sports, . . .
I Word sense disambiguation: one class per sense
I And many involve structured prediction
I Part-of-speech tagging: sequence-to-sequence
I Dependency parsing: sequence-to-tree

5.2 Multiclass

Multiclass Classification

I Can we do multiclass classification with binary classifiers?

I Yes, but we need more than one classifier


I One-Versus-All (OVA): one classifier for every class yj
I All-Versus-All (AVA): one classifier for every pair yj , yk

One-Versus-Rest/All (OVR/A)
OVR/A
One-Versus-All
I Given multiclass training data:

    T = {(x(i), y(i))}, i = 1 . . . N,   y(i) ∈ {y1, . . . , yn}

I Create a training set for each class yj:

    Tj = {(x(i), z(i))}, i = 1 . . . N,   z(i) = 1 if y(i) = yj, −1 otherwise

I Train one classifier (weight vector) wyj for each class yj

I Decision rule:

    f(x) = argmax_y  wy · x


All-Versus-All (AVA)/One-Versus-One (OVO)

AVA/OVO

I Given multiclass training data:

    T = {(x(i), y(i))}, i = 1 . . . N,   y(i) ∈ {y1, . . . , yn}

I Create a training set for each pair yj, yk:

    Tjk = {(x(i), z(i))}, i = 1 . . . Njk,   z(i) = 1 if y(i) = yj, −1 if y(i) = yk

I Train one classifier (weight vector) wjk for each pair yj, yk

I Score for yj combines all classifiers involving yj

OVA or AVA?

OVA vs. AVA

I Both methods come with guarantees

I Both methods can work well in practice

I OVA is more efficient both at training and classification time

I Given n classes:
  I OVA only needs to train and run n classifiers
  I AVA requires n(n−1)/2 classifiers
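Both strategies are available as sklearn meta-estimators; a minimal sketch on a 10-class toy
dataset:

from sklearn.datasets import load_digits
from sklearn.linear_model import Perceptron
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = load_digits(return_X_y=True)                 # 10 classes

ova = OneVsRestClassifier(Perceptron()).fit(X, y)   # trains n = 10 binary classifiers
ovo = OneVsOneClassifier(Perceptron()).fit(X, y)    # trains n(n-1)/2 = 45 binary classifiers

print(len(ova.estimators_), len(ovo.estimators_))   # 10 45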


5.2.1 Features
Generalized Linear Models

Multiclass Feature Functions
I In binary classification, we use feature vectors over inputs:

f(x) : X → Rm

I For multiple classes, we need to represent input-output pairs:

f(x, y) : X × Y → Rm

I This can be generalized to structured outputs (more later)

True multinomial classification typically produces better calibrated class probabilities.



Examples
Feature Functions: A Single Feature for Multiclass Learning

I x is a document and y is a label

    fj(x, y) = 1 if x contains the word “interest” and y = “financial”, 0 otherwise

    fj(x, y) = % of words in x with punctuation and y = “scientific”

I x is a word and y is a part-of-speech tag

    fj(x, y) = 1 if x = “bank” and y = Verb, 0 otherwise

Examples

Feature Functions as Numeric Vectors

I x is a name, y is a label classifying the name

    f0(x, y) = 1 if x contains “George”     and y = “Person”, 0 otherwise
    f1(x, y) = 1 if x contains “Washington” and y = “Person”, 0 otherwise
    f2(x, y) = 1 if x contains “Bridge”     and y = “Person”, 0 otherwise
    f3(x, y) = 1 if x contains “General”    and y = “Person”, 0 otherwise
    f4(x, y) = 1 if x contains “George”     and y = “Object”, 0 otherwise
    f5(x, y) = 1 if x contains “Washington” and y = “Object”, 0 otherwise
    f6(x, y) = 1 if x contains “Bridge”     and y = “Object”, 0 otherwise
    f7(x, y) = 1 if x contains “General”    and y = “Object”, 0 otherwise

I x=General George Washington, y=Person → f(x, y) = [1 1 0 1 0 0 0 0]
I x=George Washington Bridge, y=Object → f(x, y) = [0 0 0 0 1 1 1 0]
I x=George Washington George, y=Object → f(x, y) = [0 0 0 0 1 1 0 0]

81
Block Feature Vectors
Feature Functions as Numeric Vectors: Blocks
Feature
Function
Blocks
I x=General George Washington, y=Person → f(x, y) = [1 1 0 1 0 0 0 0]
I x=George Washington Bridge, y=Object → f(x, y) = [0 0 0 0 1 1 1 0]
I x=George Washington George, y=Object → f(x, y) = [0 0 0 0 1 1 0 0]

I One equal-size block of the feature vector for each label


I Input features duplicated in each block
I Non-zero values allowed only in one block

We can rearrange the long vectors into a matrix of the non-zero section of each class. See [?]
Chapter 5.

Decision Function of Multiclass Linear Classification

Decision Function

I Let w ∈ Rm be a weight vector
I If we assume that w is known, then we define our classifier as

    ŷ = argmax_y w · f(x, y) = argmax_y Σ_{j=0..m} wj × fj(x, y)

Multiclass Linear Classification

Multiclass Linear Classifier

Defines regions of space:

I i.e., + are all points (x, y) where + = argmax_y w · f(x, y)
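A small Python sketch of such block feature functions and the argmax decision rule (a toy
example following the name/label slides above; the weight vector would normally be learned):

import numpy as np

LABELS = ["Person", "Object"]
WORDS = ["George", "Washington", "Bridge", "General"]

def f(x, y):
    # one equal-size block of word-indicator features per label
    vec = np.zeros(len(WORDS) * len(LABELS))
    block = LABELS.index(y) * len(WORDS)
    for j, word in enumerate(WORDS):
        if word in x.split():
            vec[block + j] = 1.0
    return vec

def predict(w, x):
    return max(LABELS, key=lambda y: w @ f(x, y))   # argmax_y w · f(x, y)

print(f("General George Washington", "Person"))     # [1. 1. 0. 1. 0. 0. 0. 0.]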

Bias Terms as Bias Features

82
Bias Terms
I Often linear classifiers are presented as

    ŷ = argmax_y Σ_{j=0..m} wj × fj(x, y) + by

I Where b is a bias or offset term

I But this can be folded into f

    x=General George Washington, y=Person → f(x, y) = [1 1 0 1 1 0 0 0 0 0]
    x=General George Washington, y=Object → f(x, y) = [0 0 0 0 0 1 1 0 1 1]

    f4(x, y) = 1 if y = “Person”, 0 otherwise
    f9(x, y) = 1 if y = “Object”, 0 otherwise

I w4 and w9 are now the bias terms for the labels

5.2.2 Supervised
Perceptron Learning
General Supervised Learning

I Input: Training examples T = {(x(i), y(i))}, i = 1 . . . N
I Feature representation f : X × Y → Rm
I Output: A vector w that optimizes some important function
of the training set:
I minimize error (Perceptron, SVMs, Boosting)
I maximize likelihood of data (Logistic Regression, Naive Bayes)
I NB: Same as binary case except for feature representation

Perceptron Learning Algorithm


Perceptron Learning Algorithm: Binary and Multiclass


1: w←0
2: for a fixed number of iterations do
3: for all (x, y) ∈ T do
4: ŷ = argmaxy w · f(x, y)
5: if ŷ ≠ y
6: w = w + f(x, y) − f(x, ŷ)
7: end if
8: end for
9: end for

There is an error in the binary case. Can you spot and fix it? Use the algorithm formulation
from the next slide!
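A compact Python sketch of the multiclass perceptron updates from the pseudocode above, reusing
the toy block feature function f and the LABELS/WORDS lists from the earlier sketch:

import numpy as np

def train_perceptron(T, f, labels, n_feats, epochs=10):
    w = np.zeros(n_feats)
    for _ in range(epochs):
        for x, y in T:
            y_hat = max(labels, key=lambda c: w @ f(x, c))   # argmax_y w · f(x, y)
            if y_hat != y:
                w = w + f(x, y) - f(x, y_hat)                # perceptron update
    return w

T = [("General George Washington", "Person"), ("George Washington Bridge", "Object")]
w = train_perceptron(T, f, LABELS, n_feats=len(WORDS) * len(LABELS))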

83
Binary Perceptron Algorithm [?]

5.2.3 Logistic Regression


Multinomial Logistic Regression / Maximum Entropy

Log-Linear Classifier

Define a conditional probability:

    P(y|x) = e^(w·f(x,y)) / Zx,   where  Zx = Σ_{y′∈Y} e^(w·f(x,y′))

Note: still a linear classifier

    arg max_y P(y|x) = arg max_y e^(w·f(x,y)) / Zx
                     = arg max_y e^(w·f(x,y))
                     = arg max_y w · f(x, y)
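A small numpy sketch of this conditional probability, reusing the toy joint feature function f
and LABELS from the earlier sketches (w is an arbitrary weight vector of matching length):

import numpy as np

def softmax_scores(w, x, labels, f):
    scores = np.array([w @ f(x, y) for y in labels])   # w · f(x, y) for every label
    exps = np.exp(scores - scores.max())               # subtract max for numerical stability
    return dict(zip(labels, exps / exps.sum()))        # P(y|x) = exp(score) / Z_x

probs = softmax_scores(np.zeros(8), "George Washington Bridge", LABELS, f)
print(probs)   # uniform distribution for the all-zero weight vector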
Logistic Regression / Maximum Entropy

    P(y|x) = e^(w·f(x,y)) / Zx

◮ Q: How do we learn the weights w?

◮ A: Set the weights to maximize the log-likelihood of the training data:

    w = arg max_w Σ_t log P(yt | xt)

◮ In a nutshell, we set the weights w so that we assign as much probability to the correct
  label y for each x in the training set
5.2.4 Comparison
Comparison binary weights vs multiclass weights from [?]

From [?]: “If you work out the matrix arithmetic, you can see that the estimated score of the
first output class ŷ1 (before we take the softmax) will correctly turn out to be w1 · x + b1.
Fig. 5.3 shows an intuition of the role of the weight vector versus weight matrix in the
computation of the output class probabilities for binary versus multinomial logistic
regression.” Features in multinomial logistic regression act like features in binary logistic
regression, with the difference that we need separate weight vectors and biases for each of the
K classes.

Figure 5.3 of [?]: Binary versus multinomial logistic regression. Binary logistic regression
uses a single weight vector w and has a scalar output ŷ. In multinomial logistic regression we
have K separate weight vectors corresponding to the K classes, all packed into a single weight
matrix W, and a vector output ŷ.

5.3 Losses
Linear Models in Theory

Important: We are back to binary decisions (+1, −1) for this section

I The linear model for binary classification:

    ŷ = +1 if w · x + b > 0,   −1 if w · x + b < 0

I Learning as optimization (loss + regularization):

    ŵ, b̂ = argmin_{w,b} L(w, b; T) + λR(w, b)

I Gradient descent for convex optimization:

    w ← w − η∇f(w, b; T)


Minimize Error

(Plot: 0–1 loss as a function of the margin y · ŷ)

    L(w, b; T) = 1 if y · ŷ ≤ 0,  0 otherwise

The perceptron (implicitly) minimizes 0-1 loss

Not a smooth loss!

Minimize Error

(Figure: decision boundary shown on Train vs. Test data)

The perceptron (or 0-1 loss) does not care about margin

What could we have done better when drawing the decision boundary?

Minimize Error and Maximize Margin

Loss, Hinge

86
Maximize Margin
(Plot: hinge loss as a function of the margin y · ŷ)

    L(w, b; T) = max(0, 1 − y · ŷ)

Hinge loss goes to 0 with a margin of (at least) 1

Typical of (min-error) max-margin methods: SVM, MIRA, . . .

Supervised
Linear SVM: Minimize Error and Maximize Margin
Linear SVM

There are many possible decision hyperplanes


Goal: search for the best one (with the best generalization)
Idea: the best decision hyperplane maximizes the margin between
the two classes
(Figure: two linear decision hyperplanes in the x1–x2 plane; the right-hand one maximizes the
margin between the two classes. Source: Heike Adel, Machine Learning, 08.03.2019)

Maximize Likelihood
Loss, Log

87
Maximize Likelihood
(Plot: log loss as a function of the margin y · ŷ)

    L(w, b; T) = (1 / log 2) · log(1 + exp(−y · ŷ))

Log loss improves beyond a margin of 1

Minimizing log loss means maximizing likelihood

Min Error ≠ Max Likelihood

Example I: Binary Loss Differences
Example I: Binary Loss Differences

I Consider a training set T with 100 instances

  I 99 negative instances: ⟨⟨2, 1⟩, −1⟩
  I 1 positive instance: ⟨⟨2, 3⟩, 1⟩

I Consider the weight vector w = ⟨−1, 1⟩

  I ⟨−1, 1⟩ · ⟨2, 1⟩ = −1
  I ⟨−1, 1⟩ · ⟨2, 3⟩ = 1

I Loss functions:

  I 0/1 loss = 0
  I Hinge loss = 0
  I Log loss = 0.452 × 100 = 45.2

Min Error ≠ Max Likelihood

Example II: Binary Loss Differences

I Consider a training set T with 100 instances

  I 99 negative instances: ⟨⟨2, 1⟩, −1⟩
  I 1 positive instance: ⟨⟨2, 3⟩, 1⟩

I Consider the weight vector w = ⟨−2, 1⟩

  I ⟨−2, 1⟩ · ⟨2, 1⟩ = −3
  I ⟨−2, 1⟩ · ⟨2, 3⟩ = −1

I Loss functions:

  I 0/1 loss = 1
  I Hinge loss = 2
  I Log loss = 0.07 × 99 + 1.895 × 1 = 8.82
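These numbers can be reproduced with a few lines of numpy (the log loss is scaled by 1/log 2 as
on the slides):

import numpy as np

X = np.array([[2, 1]] * 99 + [[2, 3]])
y = np.array([-1] * 99 + [1])

def losses(w):
    margins = y * (X @ w)
    zero_one = np.sum(margins <= 0)
    hinge = np.sum(np.maximum(0, 1 - margins))
    log = np.sum(np.log(1 + np.exp(-margins)) / np.log(2))
    return zero_one, hinge, log

print(losses(np.array([-1, 1])))   # (0, 0.0, ~45.2)
print(losses(np.array([-2, 1])))   # (1, 2.0, ~8.8)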


88
More Convex Loss Functions for Linear Modeling
Loss Functions

Source: Sklearn Loss Functions▲

sklearn Solvers for Logistic Regression

Properties of sklearn solvers▲

Summary

• Multiclass classification can be done by combining binary classification

• True multinomial multiclass classification uses feature functions that encode the features
  per class

• Different loss functions implement different learning strategies

5.4 Further Study


• Perceptron in sklearn▲

89
• Logistic Regression in sklearn▲

Recommended reading

• Perceptron▲ on Wikipedia

• Confused about the connection between Perceptron and SGD with Perceptron Loss?1
Note from sklearn: Perceptron() is equivalent to SGDClassifier(loss="perceptron",
eta0=1, learning_rate="constant", penalty=None).

• Recommended blog on log-loss ▲

1
https://stats.stackexchange.com/questions/137834/clarification-about-perceptron-rule-vs-gradient-descent-vs-stochastic-gradient

90
Chapter 6

Automatic Differentiation

Learning Objectives

• Understand the computation graph

• Grasp how gradients can be efficiently computed on computation graphs

• Grasp reverse-mode auto-differentiation

• Know the typical ingredients and steps for computation-graph-based optimization

Mathematical background [?]: The dot product of two vectors is the sum of their elementwise
products:

    a · b = aᵀb = Σ_{i=1..n} (ai · bi)

Example:

    [1, 3, 2, 0.5] · [1, 2, 1, 4] = 1 · 1 + 3 · 2 + 2 · 1 + 0.5 · 4 = 11

2.3 Scalars
If a scalar is multiplied with a vector or a matrix, it is multiplied to every entry of the
vector/matrix:

    3 · [[1, 4, 7], [2, 0, 1]] = [[3, 12, 21], [6, 0, 3]]

2.4 Multiplication
Cell i, j of the product of two matrices A and B is the dot product of row i of matrix A and
column j of matrix B:

    [[1, 2, 3], [4, 5, 6]] · [[7, 8], [9, 10], [11, 12]] = [[58, 64], [139, 154]]

As a result, we can only multiply two matrices A ∈ R^(m×n) and B ∈ R^(k×l) if n == k. The
dimension of the product matrix is m × l.
6.1 Gradients
Derivatives▲: Symbolic Differentiation [?]

3 Derivatives
The derivative tells us the slope of a function at any point. Examples:

• The slope of a constant value (e.g., 7) is always 0
• The slope of a line w · x is w. Example: f(x) = 3x ⇒ f'(x) = 3

3.1 Calculating Derivatives
There are rules for calculating derivatives, such as:

• Multiplication by constant: g(x) = c · f(x) ⇒ g'(x) = c · f'(x)
• Power rule: f(x) = xⁿ ⇒ f'(x) = n · xⁿ⁻¹
• Sum rule: h(x) = f(x) + g(x) ⇒ h'(x) = f'(x) + g'(x)
• Product rule: h(x) = f(x) · g(x) ⇒ h'(x) = f(x) · g'(x) + f'(x) · g(x)
• Chain rule: h(x) = f(g(x)) ⇒ h'(x) = f'(g(x)) · g'(x)
• Quotient rule: h(x) = f(x)/g(x) ⇒ h'(x) = (f'(x) · g(x) − g'(x) · f(x)) / g(x)²

Some special functions and their derivatives:

Function name    Function    Derivative
Exponential      e^x         e^x
Logarithm        ln(x)       1/x
Sine             sin(x)      cos(x)
Cosine           cos(x)      −sin(x)

Note: To denote differentiation, we can either use ' as in f'(x) or d/dx as in df/dx.

Example:

• Calculate the derivative of h(x) = e^(x²)

⇒ We can apply the chain rule: h(x) = f(g(x)) with f(y) = e^y and y = g(x) = x².
Thus, h'(x) = f'(g(x)) · g'(x) = e^(x²) · 2x

3.1.1 A Note on the Chain Rule

Chain Rule, Function Composition

The chain rule h(x) = f(g(x)) ⇒ h'(x) = f'(g(x)) · g'(x) can also be written as
dh/dx = dh/dg · dg/dx.

Example from before: h(x) = e^(x²) with g = x²

    dh/dx = dh/dg · dg/dx = (de^g/dg) · (dx²/dx) = e^g · 2x = e^(x²) · 2x

3.2 Partial Derivatives
If our function has more than one variable and these variables are independent of each other,
i.e., f(x, y) = x² + y³, we can calculate the partial derivatives to each of them, treating the
other one as a constant:

    f(x, y) = x² + y³  ⇒  ∂f/∂x = 2x + 0 = 2x ;  ∂f/∂y = 0 + 3y² = 3y²

6.1.1 Numeric
Numeric Differentiation: z = f(x, y) = (x ∗ x) ∗ y + (y + 2)

Differentiation, Numeric
How many times do we call function f for all partial derivatives? How many times would we
call a function with 1000 parameters (which neural networks easily have)?
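A minimal sketch of numeric differentiation with finite differences for this function (each
partial derivative costs at least one extra evaluation of f, so n parameters mean on the order
of n additional function calls):

def f(x, y):
    return (x * x) * y + (y + 2)

def numeric_grad(f, x, y, eps=1e-6):
    df_dx = (f(x + eps, y) - f(x, y)) / eps
    df_dy = (f(x, y + eps) - f(x, y)) / eps
    return df_dx, df_dy

print(numeric_grad(f, 3.0, 4.0))   # approximately (24.0, 10.0)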

6.1.2 Symbolic
Symbolic Differentiation

• Build the diff expression from the leaves to the top of the original expression

• Sympy does symbolic differentiation▲

• Result can be lambdafied into (efficient) python/numpy operations

Partial Derivation with Sympy: z = f(x, y) = (x ∗ x) ∗ y + (y + 2)

>>> from sympy import *
>>> x, y, z = symbols('x y z')
>>> f = (x*x)*y + (y+2)
>>> diff(f, x)
2*x*y
>>> diff(f, x, y)
2*x
>>> diff(f, y)
x**2 + 1

# Lambdify
>>> df_dx = diff(f, x)
>>> df_dx_fun = lambdify(x, df_dx)
>>> df_dx_fun(x=3)
6*y

Problems of Symbolic Differentiation

• Symbolic differentiation can produce huge graphs for nested functions (exponential growth)

• Restricted to “Closed-Form Expressions”▲ : Does not support arbitrary algorithmic com-


putation steps

What is the derivative of the following code?

def my_func(a, b):
    z = 0
    for i in range(100):
        z = a * np.cos(z + i) + z * np.sin(b - i)
    return z

Ari Seff’s video: https://youtu.be/wG_nF1awSSY?t=259

6.1.3 Reverse-Mode Autodiff


Chain Rule at Work

• Given two functions u(x) and v(x):

• sequentially applied (function composition): z = v(u(x))

• written with intermediate results: z = v(s) and s = u(x).

• How do we get the partial derivative of the output z with respect to the input x? Meaning:
  What is the rate of change of z if x changes?

• Chain rule:

    ∂z/∂x = ∂z/∂s · ∂s/∂x

• In forward mode (starting from the input), we first compute ∂s/∂x, then ∂z/∂s

• In reverse mode (starting from the output), we first compute ∂z/∂s, then ∂s/∂x

• x = 3, z(x) = sin(x²)

• How does forward and reverse mode work out for this example?

94
Reverse Mode: Example and Principle

    ∂z/∂x = ∂z/∂s · ∂s/∂x,   with x = 3, z(s) = sin(s), s = u(x) = x²

• First: ∂z/∂s = ∂sin(s)/∂s = cos(s) = cos(3²) ≈ −0.91

• Second: ∂s/∂x = ∂x²/∂x = 2x = 6

• ∂z/∂x = ∂z/∂s · ∂s/∂x ≈ −0.91 · 6 = −5.46

Full Chain . . .
If z is the output of a sequence of functions with intermediate outputs s1, s2, ..., sn, the
chain rule applies as follows:

    ∂z/∂x = ∂s1/∂x · ∂s2/∂s1 · ∂s3/∂s2 · · · ∂sn/∂sn−1 · ∂z/∂sn

Forward Mode: Example and Principle

    ∂z/∂x = ∂z/∂s · ∂s/∂x,   with x = 3, s = u(x) = x², z(s) = sin(s)

• First: ∂s/∂x = ∂x²/∂x = 2x = 6

• Second: ∂z/∂s = ∂sin(s)/∂s = cos(s) = cos(3²) ≈ −0.91

• ∂z/∂x = ∂z/∂s · ∂s/∂x ≈ −0.91 · 6 = −5.46
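Reverse-mode autodiff for this example is a one-liner in PyTorch (a minimal sketch):

import torch

x = torch.tensor(3.0, requires_grad=True)
z = torch.sin(x ** 2)     # forward pass builds the graph: s = x**2, z = sin(s)
z.backward()              # reverse pass: dz/dx = cos(x**2) * 2x
print(x.grad)             # tensor(-5.4668)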

Computation Graph: z = f (x, y) = (x ∗ x) ∗ y + (y + 2)

• Directed Acyclic Computation Graph (DAG)▲ of function term

• Arrows show direction of backward pass

• Topological sorting▲ of ni for forward pass

• Idea in ML: loss term is top node

• Leaf nodes are constants or variables

95
Source: [?, 512]

Remember. . . and Notice Different Coloring Schema

96
What is constant? What are the variables w.r.t. differentiation of the top node?

Forward Pass: 42 = f (3, 4) = (3 ∗ 3) ∗ 4 + (4 + 2)

• Input variables are instantiated to numbers!

• Top node n7 is output

Source: [?, 512]

Backward Pass: What is the rate of change w.r.t x?

    ∂f/∂x

97
• Start at top node

• n7 is output

• Follow down all possible paths to x

• Apply chain rule!

Source: [?, 512]

Backward Pass: What is the rate of change w.r.t x?


If n7 changes by 1, f trivially changes by 1 (f = n7 and ∂f /∂f = 1 ).

Source: [?, 512]

Backward Pass: What is the rate of change w.r.t x?


How much does f change if n5 varies?

Chain rule:

    ∂f/∂n5 = ∂f/∂n7 · ∂n7/∂n5

Sum rule:

    ∂(a + b)/∂a = ∂a/∂a + ∂b/∂a

    ∂n7/∂n5 = ∂(n5 + n6)/∂n5 = ∂n5/∂n5 + ∂n6/∂n5 = 1 + 0 = 1

Source: [?, 512] Note, no influence of x on n6

Backward Pass: What is the rate of change w.r.t x?


How much does f change if n4 varies?

Chain rule:

    ∂f/∂n4 = ∂f/∂n5 · ∂n5/∂n4

Product rule:

    ∂(uv)/∂u = u · ∂v/∂u + v · ∂u/∂u

    ∂n5/∂n4 = ∂(n4 · n2)/∂n4 = n4 · ∂n2/∂n4 + n2 · ∂n4/∂n4 = 0 + n2 = 4

Source: [?, 512]

99
Backward Pass: What is the rate of change w.r.t x?
    ∂f/∂x = ∂f/∂n4 · ∂n4/∂n1,   with  ∂n4/∂n1 = ∂(n1 · n1)/∂n1 = n1 + n1 = 3 + 3

There are two paths from n7 to n1.
How do they combine?
They add up!

Source: [?, 512]

Rate of Change for x in Action

    z = f(x, y) = (x ∗ x) ∗ y + (y + 2)  for x = 3, y = 4;   ∂f/∂x = 24

When we change x by +e, how does it change z?
Applying the rate of change w.r.t. x:

change e   f(x + e, y)   change of z   change predicted by derivative (24 · e)   difference
0.001      42.024004     0.024004      0.024                                     0.000004
0.01       42.2404       0.2404        0.24                                      0.0004
0.1        44.44         2.44          2.4                                       0.04
1          70            28            24                                        4
2          106           64            48                                        16
3          150           108           72                                        36

What does it tell us? How does this relate to learning rates?

Backward Pass: What is the rate of change w.r.t y?


Which rule?

100
Source: [?, 512]

Backward Pass: What is the rate of change w.r.t y?


Try to apply the correct rules by yourself!

Source: [?, 512]

Backward Pass: What is the rate of change w.r.t y?


Efficiency: Compute derivative per node only once for both x and y! (n5 )

101
Source: [?, 512]

Backward Pass: What is the rate of change w.r.t y?


Again two paths from n7 to the y input.
No need to compute all paths!

Source: [?, 512]

Do you understand now? If not . . . watch this video:


https://youtu.be/R_m4kanPy6Q?t=268:

102
Do you understand now? If not . . . watch this video:
https://www.youtube.com/watch?v=wG_nF1awSSY:

Do you understand now?

103
Do you understand now?

104
6.2 Implementations
6.2.1 autograd
Autograd▲

• Autograd can automatically differentiate native Python and Numpy code by overloading

• An early practical implementation of “tape-based” auto-differentiation for Python.

• Tracks all (supported) operations of “normal function definition”.

• Works within “normal” algorithmic control structures

• Automatically derives the derivation of the function w.r.t the relevant parameters

• Gradient-based optimization gets trivial from a user perspective

See Logistic Regression example for XOR on Colab▲ .
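A minimal sketch of the autograd package in action (for the toy function used throughout this
chapter):

import autograd.numpy as np
from autograd import grad

def f(x, y):
    return (x * x) * y + (y + 2)

df_dx = grad(f, 0)   # derivative w.r.t. the first argument
df_dy = grad(f, 1)   # derivative w.r.t. the second argument
print(df_dx(3.0, 4.0), df_dy(3.0, 4.0))   # 24.0 10.0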

Dynamic vs. Static Computational Graphs


Computation
Graph, Static

[?] https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html
Tensorflow’s eager evaluation makes it dynamic as well. . . pytorch’s compile functionality▲ goes the other way
round (and speeds up processing)

6.2.2 pytorch
Reverse Autograd Diff in pytorch

105
Try on Colab▲ Forward computation happens automatically here

Autograd in PyTorch with Matrices [?, Chapter 1]

Try on Colab▲
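Since the referenced notebook is not reproduced in this script, here is a minimal sketch of
autograd with matrix-valued parameters in PyTorch:

import torch

W = torch.randn(3, 2, requires_grad=True)
x = torch.randn(2)
loss = (W @ x).pow(2).sum()   # scalar "loss" built from a matrix-vector product
loss.backward()               # fills W.grad with d(loss)/dW
print(W.grad.shape)           # torch.Size([3, 2])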
Inner Details of PyTorch Autograd
https://www.youtube.com/watch?v=MswxJw-8PvE

Jacobian Matrix
Jacobian
Matrix

106
Mathematically, if you have a vector-valued function ⃗y = f(⃗x), then the gradient of ⃗y with
respect to ⃗x is a Jacobian matrix:

    J = ( ∂y1/∂x1  · · ·  ∂y1/∂xn )
        (    ⋮       ⋱       ⋮    )
        ( ∂ym/∂x1  · · ·  ∂ym/∂xn )

Pytorch’s Autograd▲ : Efficient Engine for Computing Vector-Jacobian Product

What is g(⃗y ) typically in neural network ML?

6.2.3 tensorflow
Reverse Autograd Diff in Tensorflow

Try on Colab▲
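A minimal sketch of reverse-mode autodiff in TensorFlow with GradientTape (same toy function as
before):

import tensorflow as tf

x = tf.Variable(3.0)
y = tf.Variable(4.0)
with tf.GradientTape() as tape:
    z = (x * x) * y + (y + 2)
dz_dx, dz_dy = tape.gradient(z, [x, y])
print(dz_dx.numpy(), dz_dy.numpy())   # 24.0 10.0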

Typical Ingredients of Computation Graph ML

• Graph nodes: expressions that support forward and backward computations

• Model parameters: Numerical data structure representing Θ

• Input: Placeholders in static computation graphs; actual data structures in dynamic com-
putation graphs

107
• Trainer: Optimizers (SGD, AdaDelta, etc.) that use the results of backward pass to update
the parameters

6.3 Further Study


• Mandatory reading: Appendix D on Automatic Differentiation [?]; watch this great video
(15’)▲ for more details on the topic

• Interesting survey for the development of autodifferentiation: [?]

• Make sure you understand the relevant parts of the mathematical background [?] (in
OLAT) and visit the links if a refresher is needed

• Famous blog post on computation graphs and backpropagation
  https://colah.github.io/posts/2015-08-Backprop/

• Wikipedia Page on Automatic Differentiation▲ and a nice blog post on different auto-
matic differentiation▲

• Introduction to autodiff in Tensorflow▲

• Video on inner details of PyTorch Autograd▲

• “Instructional implementation of autodiff”
  https://github.com/ageron/handson-ml2/blob/master/extra_autodiff.ipynb

108
Chapter 7

Learning with PyTorch

Learning Objectives

• Know the tensors in pytorch

• Understand the subclassing modeling approach of pytorch

• Know the typical OOP programming-style with pytorch for data, networks and opti-
mization

• Know the typical setup, vectorization, mini-batch generator, learning with pytorch

7.1 Tensors
Linear Algebra: Scalars, Vectors, Matrices, Tensors
Tensor

Source: https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.1-Scalars-Vectors-Matrices-and-Tensors/

What are tensors? Tensors in Pytorch Tutorial▲

Tensors in Pytorch
Squeezing

• Capitalization matters: torch.tensor and torch.Tensor are different

• Components with dimension 1 can be squeezed to scalars

• Scalars can be “unsqueezed” (boxed) into vectors of size 1. Try on Colab▲ !

• Work through d2ai introductory pytorch notebook▲ , or the more extended examples in
Chapter 1 of [?]

109
Tensor Operations with Broadcasting Support1
Tensor arguments can be automatically expanded to be of equal sizes (without making copies Broadcasting
of the data).
Two tensors are “broadcastable” if the following holds:
• Each tensor has at least one dimension.
• When iterating over the dimension sizes, starting at the trailing dimension, the dimen-
sion sizes must either be equal, one of them is 1, or one of them does not exist.

Calculating the resulting size of two “broadcastable” tensors x, y


• If the number of dimensions of x and y are not equal, prepend 1 to the dimensions of the
tensor with fewer dimensions to make them equal length.
• For each dimension, the resulting size is the max of the sizes of x and y along that dimen-
sion.

Broadcasting (as done in NumPy▲ )


Broadcasting

1
https://pytorch.org/docs/stable/notes/broadcasting.html

7.2 ML

Perceptron

From [?], Chapter 3 (Foundational Components of Neural Networks): This chapter sets the stage
for later chapters by introducing the basic ideas involved in building neural networks, such as
activation functions, loss functions, optimizers, and the supervised training setup. We begin by
looking at the perceptron, a one-unit neural network, to tie together the various concepts. The
perceptron itself is a building block in more complex neural networks. This is a common pattern
that will repeat itself throughout the book—every architecture or network we discuss can be used
either standalone or compositionally within other complex networks. This compositionality will
become clear as we discuss computational graphs and the rest of this book.

The Perceptron: The Simplest Neural Network

The simplest neural network unit is a perceptron. The perceptron was historically and very
loosely modeled after the biological neuron. As with a biological neuron, there is input and
output, and “signals” flow from the inputs to the outputs, as illustrated in Figure 3-1.

Figure 3-1. The computational graph for a perceptron with an input (x) and an output (y). The
weights (w) and bias (b) constitute the parameters of the model.

Note: Graphical representations of neural networks are computation graph visualizations. . .

Perceptron in PyTorch: Subclassing nn.Module


import torch
import torch.nn as nn

class Perceptron(nn.Module):

def __init__(self, input_dim):


super(Perceptron, self).__init__()
self.fc1 = nn.Linear(input_dim, 1)

def forward(self, x_in):


return torch.sigmoid(self.fc1(x_in)).squeeze()

Super

• What is super(...) for? See Python 2 compatibility▲ and this blog on the inner working
of super()▲ .

• What is squeeze() doing?

• Important to know: Input argument of forward is of shape (batch, input_dim). It


returns a tensor of shape (batch, 1)

Optimizers, Parameters and Loss Functions


import torch.optim as optim
perceptron = Perceptron(input_dim=input_dim)
optimizer = optim.Adam(params=perceptron.parameters(), lr=0.01)
bce_loss = nn.BCELoss()

• nn.Module.parameters() exposes all trainable parameters.

• The optimizer object uses them as instance variables.

• The loss object will know how to compute the loss . . .

Basic Mini-Batch Learning Loop


batch_size = 1000
n_epochs = 12
for epoch in range(n_epochs):
for batch in batch_generator(batch_size):
optimizer.zero_grad()
y_pred = perceptron(batch["x_data"])
loss = bce_loss(y_pred, batch["y_target"])
loss.backward()
optimizer.step()

• What is going on here?

111
Get Your Dimensions Right! Where do batches of data go?

Matrix Multiplication in NNs

(Figures from [?]: a simple neuron, a multiple-input neuron, a layer of S neurons, and a
three-layer network)

• Simple neuron: a = f(wp + b)

• Multiple-input neuron: a = f(Wp + b), with input p (R × 1), weights W (1 × R), bias b (1 × 1)

• Layer of S neurons: a = f(Wp + b), with W (S × R), b (S × 1), a (S × 1)

• Three layers of neurons (deep network):

    a1 = f1(W1 p + b1),   a2 = f2(W2 a1 + b2),   a3 = f3(W3 a2 + b3)

    a3 = f3(W3 f2(W2 f1(W1 p + b1) + b2) + b3)

Source: [?]


PyTorch: Network Modules and Parameters

Network modules can be layers, networks (= connected layers), or connected groups of networks!

• torch.nn.Module▲: Basic class for complex network implementation with a lot of
  behind-the-scenes magic (initialization, backprop for gradient computation). Needed from the
  user perspective: a forward method (computation graph) and optional parameters!

• torch.nn.parameter.Parameter▲: Parameters are complex objects, containing values (numerical
  data), gradients, and additional information.

• Lazy Initialization▲: Typically data-driven initialization of the network (dimensions/values),
  no input dimensions specified in code!
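A minimal sketch of lazy, data-driven initialization and of inspecting the registered parameters
(the dimensions are arbitrary):

import torch
import torch.nn as nn

net = nn.Sequential(nn.LazyLinear(16), nn.ReLU(), nn.LazyLinear(1))   # no input dims given
x = torch.randn(8, 5)            # a mini-batch of 8 examples with 5 features
y = net(x)                       # the first forward pass materializes the weight shapes
for name, p in net.named_parameters():
    print(name, tuple(p.shape))  # e.g. 0.weight (16, 5), 0.bias (16,), 2.weight (1, 16), ...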
7.2.1 Data Loader
Typical Text Handling
Subclassing

• Vocabulary: Mapping of tokens to vocabulary index numbers and vice versa

• Dealing with unseen/unknown words: Unking rare tokens in training data allows model
to learn their "behaviour"

112
• Vectorization of data

• Your dataset representation: Subclassing torch.utils.data.Dataset: Need to have:


indexing (data[i]) and len()

• See data tutorial▲

• DataLoader▲ : Using torch.utils.data.DataLoader gives you shuffled mini-batching


"for free"!

7.3 Further Study


• More gentle introductory to Pytorch modeling: Chapter 3 [?] (in OLAT)

• More introductory/tutorial material on pytorch.org▲

113
Chapter 8

Feed-forward Neural Networks

Learning Objectives

• Understand the need for non-linearity and hidden layers

• Grasp the Universal Approximation Hypothesis

• Understand Multilayer Perceptrons (MLP) feedforward networks

• Know several activation functions and their properties

• Know some of the learning problems that can arise

8.1 Motivation
8.1.1 XOR
Closed Solution to Linear Regression
Closed Solution

Common loss function for linear regression

    MSE(X, θ) = (1/m) Σ_{i=1..m} (θᵀ x(i) − y(i))²

Loss, MSE

Analytical closed solution: “Normal Equation”

    θ̂ = (Xᵀ X)⁻¹ Xᵀ y

Pro Simple direct computation

Contra Computational complexity grows polynomially, O(n³), with the number n of features

Contra Training data has to fit in core memory

114
Closed Solution to Ridge Regression
Common loss function for ridge regression

    J(θ) = MSE(θ) + α · (1/2) Σ_{i=1..n} θi²

Analytical closed solution to ridge regression

    θ̂ = (Xᵀ X + αA)⁻¹ Xᵀ y

• A is n × n identity matrix

• Top left cell for bias is set to 0.

• Where does the regularization enter the closed formula?
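A minimal numpy sketch of this closed-form ridge solution (X is assumed to carry a constant bias
feature in its first column, as in the XOR example below):

import numpy as np

def ridge_closed_form(X, y, alpha=1.0):
    n = X.shape[1]
    A = np.eye(n)
    A[0, 0] = 0.0                                   # do not regularize the bias term
    return np.linalg.inv(X.T @ X + alpha * A) @ X.T @ y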

XOR as Linear Model

XOR Problem

Figure 6.1 of [?]: Solving the XOR problem by learning a representation. The bold numbers
printed on the plot indicate the value that the learned function must output at each point.
(Left) A linear model applied directly to the original input cannot implement the XOR function.
When x1 = 0, the model's output must increase as x2 increases. When x1 = 1, the model's output
must decrease as x2 increases. A linear model must apply a fixed coefficient w2 to x2. The
linear model therefore cannot use the value of x1 to change the coefficient on x2 and cannot
solve this problem. (Right) In the transformed space represented by the features extracted by a
neural network, a linear model can now solve the problem. In our example solution, the two
points that must have output 1 have been collapsed into a single point in feature space. In
other words, the nonlinear features have mapped both x = [1, 0]ᵀ and x = [0, 1]ᵀ to a single
point in feature space, h = [1, 0]ᵀ. The linear model can now describe the function as
increasing in h1 and decreasing in h2. In this example, the motivation for learning the feature
space is only to make the model capacity greater so that it can fit the training set. In more
realistic applications, learned representations can also help the model to generalize.

How does linear regression with MSE loss fail?

    MSE(X, θ) = (1/m) Σ_{i=1..m} (θᵀ x(i) − y(i))²

Closed solution: θ̂ = (Xᵀ X)⁻¹ Xᵀ y
See [?, 172]

import numpy as np
# constant bias feature first
X = np.array([[1,0,0],[1,0,1],[1,1,0],[1,1,1]])
y = np.array([[0,1,1,0]]).T
# Normal equation
theta_best = (np.linalg.inv(X.T @ X)) @ X.T @ y
theta_best
>>> array([[ 0.5], [ 0. ], [ 0. ]])
8.1.2 Depth
Can Deep Linear Models Help?
Going deep means composition of functions: f (g(h(x)))
Composition of linear transformations
If two functions f : Rm → Rk and g : Rn → Rm are linear transformations, then their composition
(f ◦ g) : Rn → Rk is a linear transformation.

Linear transformation and matrix-vector multiplication


If f is a function from Rn → Rk which is linear, then there exists a matrix A such that f (x) =
A·x

Composition and linearity of matrix multiplication


If two matrices A and B compute linear transformations, then A · B computes their composition
and is also a linear transformation.
It is clear that no straight line can separate the two classes.
8.1.3 Nonlinearity
3.2 NONLINEAR INPUT TRANSFORMATIONS
Nonlinear Input Transformations
However, if we transform the points by feeding each of them through the nonlinear function
φ(x1, x2) = [x1 · x2, x1 + x2], the XOR problem becomes linearly separable.

0 1 2
Source: [?]
Featuree
(transformation) engineering
function  mapped needed!
the data into Or can we
a representation thatlearn the transformations
is suitable end-to-
for linear classification.
end? Yeswe
Having can!disposal, we can now easily train a linear classifier to solve the XOR problem.
at our

Adding a Bit of Nonlinearity

    ŷ = f(x) = φ(x)W + b

In general,
• Linear one can successfully
transformation train
of input x: a linear
z(x) = xWclassifier

+ b′ over a dataset which is not linearly
separable by defining a function that will map the data to a representation in which it is linearly
• Elementwise
separable, and thennonlinear activation:
train a linear classifierg(x) = max(0,
on the x) representation. In the XOR example
resulting
the transformed data has the same dimensions as the original one, but often in order to make the
• Non-linear transformation of linear transformation: y = g(xW + b)
data linearly separable one needs to map it to a space with a much higher dimension.
is solution
• Expression has one glaring(except
is differentiable problem,
for however: we need
non-smooth change at 0▲ ) anddefine
to manually SGD the function
is applicable
for optimizing
 , a process which is the parameters
dependent on the particular dataset, and requires a lot of human intuition.
• However, nonlinearity introduces non-convex loss functions. A problem? Not as long as
3.3 KERNEL
it finds METHODS
good solutions
Kernelized Support Vectors Machines (SVMs) [Boser and et al., 1992], and Kernel Methods in
general [Shawe-Taylor and Cristianini, 2004], approach this problem by defining a set of generic
mappings, each of them mapping the data into very high dimensional—and sometimes even
infinite—spaces, and then performing linear classification in the transformed space. Working
in very high dimensional spaces significantly increases the probability of finding a suitable linear
separator.
One example mapping is the polynomial mapping, φ(x) = (x)^d. For d = 2, we get
φ(x1, x2) = (x1·x1, x1·x2, x2·x1, x2·x2). This gives us all combinations of the two variables, allow-
ing to solve the XOR problem using a linear classifier, with a polynomial increase in the number
y y

h1 h2 h

The Rectified Linear Activation


W Function
x1 x2 x

g (z ) = max{0, z}
Figure 6.2: An example of a feedforward network, drawn in two different styles. Specifically,
this is the feedforward network we use to solve the XOR example. It has a single hidden
layer containing two units. (Left) In this style, we draw every unit as a node in the
graph. This style is very explicit and unambiguous but for networks larger than this
example it can consume too much space. (Right) In this style, we draw a node in the
graph for each entire vector representing a layer’s activations. This style is much more
compact. Sometimes we annotate the edges in this graph with the name of the parameters
that describe the relationship between two layers. Here, we indicate that a matrix W
describes the0mapping from x to h, and a vector w describes the mapping from h to y.
We typically omit the intercept parameters associated with each layer when labeling this
kind of drawing. 0
z
affine transformation from an input vector to an output scalar. Now, we describe
an affine transformation from a vector x to a vector h, so an entire vector of bias
parameters is needed. The activation function g is typically chosen to be a function
Figure 6.3: The
8.1.4 Depth andthat rectified linear activation function.
Nonlinearity This activation function is the default
is applied element-wise, with h i = g(x W :,i + ci ). In modern neural networks,
activation function recommended for use with most feedforward
the default recommendation is to use the rectified linear unit or ReLU neural networks. Applying
(Jarrett
Adding a Bit of Depth
et al. , to
this function to the output of a linear transformation yields a nonlinear transformation.
2009XOR
; Nair and Hinton , 2010 ; Glorot et al. , 2011a ) defined by the activation
function g(z ) = max{0, z } depicted in Fig. 6.3.
However, the function remains very close to linear, in the sense that is a piecewise linear
An “engineered“ exact solution
We can now specify our complete network as
function with two linear pieces. Because rectified linear units are nearly linear, they
preserve many of the properties
f (x; Wthat ) = w linear
, c, w, bmake max{0, W 
models easy
x + c} + b. to optimize
(6.3) with gradient-

based methods. We They also preserve many of the properties that make linear models
can now
generalize well. A common principle throughout   computer science is that we can build
1 1
W = , (6.4)
complicated systems from minimal components. 1 1 Much as a Turing machine’s memory
needs only to be able to store 0 or 1 states, we0 can 
build a universal function approximator
c= , (6.5)
from rectified linear functions. −1
 
1
w= , b=0 (6.6)
−2
173
CHAPTER 6. DEEP FEEDFORWARD NETWORKS
Can a solution be learned in practical terms▲ ?

y y

h1 h2 h
174
W

x1 x2 x

Figure
Make sure6.2:
youAn example ofhow
understand a feedforward network,
this operations workdrawn in two different styles. Specifically,
this is the feedforward network we use to solve the XOR example. It has a single hidden
layer containing two units. (Left) In this style, we draw every unit as a node in the
Gradient
graph.Descent Pitfalls
This style for explicit
is very Non-Convex Cost Functions
and unambiguous but for networks larger than this
Gradient
example it can consume too much space. (Right) In this style, we draw a node in the
Descent,
graph for each entire vector representing a layer’s activations. This style is much more Pitfalls
compact. Sometimes we annotate the edges in this graph with the name of the parameters
that describe the relationship between two layers. Here, we indicate that a matrix W
describes the mapping from x to h, and a vector w describes the mapping from h to y.
We typically omit the intercept parameters associated with each layer when labeling this
kind of drawing.

affine transformation from an input vector to an output scalar. Now, we describe


117
an affine transformation from a vector x to a vector h, so an entire vector of bias
parameters is needed. The activation function g is typically chosen to be a function
that is applied element-wise, with h i = g(x W :,i + ci ). In modern neural networks,
the default recommendation is to use the rectified linear unit or ReLU (Jarrett
et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a) defined by the activation
[?, 113]

Unfortunately, non-linear models do not have convex loss functions

From a Single Neuron . . .


Computations of a Neuron

• Scalar inputs
41
• Weights for inputs CHAPTER 4
• (Nonlinear) activation function

Feed-forward Neural Networks


4.1 A BRAIN-INSPIRED METAPHOR
As the name suggests, neural networks were inspired by the brain’s computation mechanism,
which consists of computation units called neurons. While the connections between artificial
neural networks and the brain are in fact rather slim, we repeat the metaphor here for complete-
ness. In the metaphor, a neuron is a computational unit that has scalar inputs and outputs. Each
input has an associated weight. e neuron multiplies each input by its weight, and then sums¹
them, applies a nonlinear function to the result, and passes it to its output. Figure 4.1 shows such
a neuron.

Output y1

Neuron ∫

Input x1 x2 x3 x4

Figure 4.1: A single neuron with four inputs.

e neurons are connected to each other, forming a network: the output of a neuron may
feed into the inputs of one or more neurons. Such networks were shown to be very capable com-
putational devices. If the weights are set correctly,
118a neural network with enough neurons and a
nonlinear activation function can approximate a very wide range of mathematical functions (we
will be more precise about this later).
A typical feed-forward neural network may be drawn as in Figure 4.2. Each circle is a
neuron, with incoming arrows being the neuron’s inputs and outgoing arrows being the neuron’s
outputs. Each arrow carries a weight, reflecting its importance (not shown). Neurons are arranged
An MLP is often used for classification, with each output corresponding to a different
binary class (e.g., spam/ham, urgent/not-urgent, and so on). When the classes are
exclusive (e.g., classes 0 through 9 for digit image classification), the output layer is
typically modified by replacing the individual activation functions by a shared soft‐
max function (see Figure 10-9). The softmax function was introduced in Chapter 3.
The output of each neuron corresponds to the estimated probability of the corre‐
sponding class. Note that the signal flows only in one direction (from the inputs to
. . . to MLP/FFNN for Classification
the outputs), so this architecture is an example of a feedforward neural network
(FNN). MLP

Figure 10-9. A modern MLP (including ReLU and softmax) for classification
Multilayer Architecture: Ingredients for Deep Learning

From Biological to Artificial Neurons | 263

Source: [?, 180]

Number of hidden layers?


Deep NNs
More than 1 hidden layer!

Fully Connected Architecture

• Sequence of layers

• Each node of layer n feeds into each node of layer n + 1


~
• Bias nodes have no input

119
How Many Hidden Layers are Needed?
Universal Approximation Theorem

• Theoretically, one hidden layer with a squashing non-linear activation function would be
enough to approximate any real function as close as possible to any non-zero error (see
[?, 197ff])

• Does not guarantee the learnability of the function from training data!

• One hidden layer would sometimes need to be exponentially large (e.g. an order of 2n
for binary vectors {0, 1}n )

• Deep Learning favors multiple layers of hidden units

• DL believes that our objective functions is better expressed by a composition of several


simpler functions (realized by the hidden units)

CHAPTER •6.Practical
DEEPresults indicate that “going
FEEDFORWARD deep” helps to generalize better
NETWORKS
• Practical results indicate that “going broad” without “going deep” is not always helpful

(2014) showed thatEffect


Transformation functions representable
of Hidden Units with a deep rectifier net can require
an exponential number of hidden units with a shallow (one hidden layer) network. Transforma-
tion,
More precisely, they showed that piecewise linear networks (which can be obtained Non-Linear

from rectifier nonlinearities or maxout units) can represent functions with a number
of regions that is exponential in the depth of the network. Figure 6.5 illustrates how
a network with absolute value rectification creates mirror images of the function
computed on top of some hidden unit, with respect to the input of that hidden
unit. Each hidden unit specifies where to fold the input space in order to create
mirror responses (on both sides of the absolute value nonlinearity). By composing
Source: [?, 437]
these folding operations,
What’s going on here? we obtain an exponentially large number of piecewise
linear regions which can capture all kinds of regular (e.g., repeating) patterns.
Repeated Transformations

Each hidden layer can “linearize” further on the transformation of the preceding layer.
Figure 6.5: An intuitive, geometric explanation of the exponential advantage of deeper
rectifier networks formally by Montufar et al. (2014). (Left)An absolute value rectification
unit has the same output for every pair of mirror points in its input. The mirror axis
120
of symmetry is given by the hyperplane defined by the weights and bias of the unit. A
function computed on top of that unit (the green decision surface) will be a mirror image
of a simpler pattern across that axis of symmetry. (Center)The function can be obtained
by folding the space around the axis of symmetry. (Right)Another repeating pattern can
CHAPTER 6. DEEP FEEDFORWARD NETWORKS

Going Deep on the MNIST Character Recognition Task

96.5
96.0
Test accuracy (percent) 95.5
95.0
94.5
94.0
93.56. DEEP FEEDFORWARD NETWORKS
CHAPTER
93.0
92.5
92.0
3 4 5 6 7 8 9 10 11

Figure 6.6: Empirical results showing that deeper networks generalize better when used
Going Broad and Deepmulti-digit
to transcribe on the MNIST
numbersTask
from photographs of addresses. Data from Goodfellow
et al. (2014d). The test set accuracy consistently increases with increasing depth. See
figure 6.7 for a control experiment
Effect of demonstrating that other increases to the model size
Number of Parameters
do not
97 yield the same effect.

96 3, convolutional
Test accuracy (%)

Another
95 key consideration of architecture design3,isfully connected
exactly how to connect a
pair of layers to each other. In the default neural network 11, layer
convolutional
described by a linear
transformation
94 via a matrix W , every input unit is connected to every output
unit.93Many specialized networks in the chapters ahead have fewer connections, so
that each unit in the input layer is connected to only a small subset of units in
the 92
output layer. These strategies for reducing the number of connections reduce
the 91
number of parameters and the amount of computation required to evaluate
the network,
0.0 but are
0.2 often highly
0.4 problem-dependent.
0.6 0For
.8 example, 1.0convolutional
8
networks, described in chapter parameters patterns of sparse×10
9, useofspecialized
Number connections
that are very effective for computer vision problems. In this chapter, it is difficult
towithout
give
Figure
Going broad 6.7:much
Deepermore
going specific
models
deep tend toadvice
performconcerning
introduces better. Thisthe
overfitting. architecture
is not of athegeneric
merely because model isneural
network.
larger. Subsequent
This experiment chapters
from develop
Goodfellow et al. the particular
(2014d architectural
) shows that strategies
increasing the number that
ofhave
parameters in layers of convolutional networks without
been found to work well for different application domains. increasing their depth is not
Fat vs. Deep
nearly as effective at increasing test set performance. The legend indicates the depth of
network used to make each curve and whether the curve represents variation in the size of
the convolutional or the fully connected layers. We observe that shallow models in this
context overfit at around 20 million parameters while deep ones can benefit from having
over 60 million. This suggests that using a deep model expresses a useful preference over
the space of functions the model can learn. Specifically, it expresses a belief that the
function should consist of many simpler functions composed together. This could result
either in learning a representation that is composed in turn of simpler representations (e.g.,
corners defined in terms of edges) or in learning 202a program with sequentially dependent
steps (e.g., first locate a set of objects, then segment them from each other, then recognize
them).

202

121
Architectures

Fat + Short vs. Thin + Tall?

...

... ...

...
...

shallow ...

Architectures
deep
Fat + Short vs. Thin + Tall:
[?] ASR
Heike Adel Neural Networks II 29.03.2019 45 / 61

Fat vs. Deep: Automatic Speech Recognition (ASR)

Seide et al. 2011: Word error rate:


L×N WER %
1 × 2k 24.2
L×N WER %
2 × 2k 20.4
1 × 3772 22.5
3 × 2k 18.4
1 × 4634 22.6
4 × 2k 17.8
1 × 16k 22.1
5 × 2k 17.2
7 × 2k 17.1Architectures

Capacity [?]

Capacity of Networks
Capacity
Heike Adel Neural Networks II 29.03.2019 48 / 61

The number of parameters indicates the capacity of the model


The more parameters, the more powerful
⇒ The more complex tasks can be solved
But: You need to have enough training data to train a powerful
model!
Otherwise, you overfit on the training data set
⇒ The model memorizes the training data set but cannot generalize

122
Heike Adel Neural Networks II 29.03.2019 49 / 61
Recap: Capacity - Overfitting - Underfitting

Error

Underfitting Overfitting
Test error

Training error

Capacity
[?]

Use early stopping and regularization to fight overfitting!

8.2 FFNN
Heike Adel Neural Networks II 29.03.2019 52 / 61
Neural Network
8.2.1 Notation
Many Hidden
Deep Fully-Connected Layers with
Feedforward Many
Network Neurons
(FFNN)

input layer 1 layer 2 ... layer N

vector ... vector


x ... y
... ... ...

y = f (W N ...f (W 2 (f (W 1x + b 1 ) + b 2 )... + b N )
Neural Network
f: non-linear activation function
Notation [?]
Heike Adel Neural Networks I 22.03.2019 15 / 46

Layer Superscript and Subscript Notation

layer l-1 layer l


l-1
a1 a1l
l-1 Output of neuron i
a2 a2l of layer l: ail
a3l-1 ... ail is a scalar
... al is a vector
l
a
aNl-1 M

123

Heike Adel Neural Networks I 22.03.2019 16 / 46


[?]
Neural Network

Ascending from input to output


Notation
Weight Matrix Notation for Weights

Weights: wijl : from neuron j


layer l-1 layer l of layer l − 1 to neuron i of
a l-1 wl12 layer l
1 a1l
a2l-1 Weights between all neurons
a2l of layer l − 1 and layer l:
a3l-1 ... matrix:
... wlM2  l l l

w11 w12 ... w1N
aMl  wl w22l ... w2N l 
aNl-1 Wl = 
 ...
21 
... ... ... 
l
wM1 l
wM2 ... wMNl

[?] N is input dimension,


Neural Network M is output dimension.
What is the meaning of a row and a column? Each column is the output vector of neuron in the preceding layer.
Notation
Heike Adel
Each row is the input to a neuron in the following layer
Neural Networks I 22.03.2019 17 / 46

Notation for the Bias Weights

layer l-1 layer l Biases: bias for neuron i of


a l-1 layer l: bil
1 a1l
a2l-1 b1l Bias for all neurons of layer
a2l l: vector:
a3l-1  l
... b1
...  b2l 
bl =  ... 

aNl-1 bMl aMl
l
1 bM

[?]

Notation for the Input of a Single Neuron

Heike Adel Neural Networks I 22.03.2019 18 / 46

124
Neural Network

Notation

zil : input of activation


function for neuron i
layer l-1 layer l at layer l
l-1
a zil =
1 a1l
l l al−1 + w l al−1 + ... + b l
a2l-1 z i
wi1 1 i2 2 i
a2l
a3l-1 N
X
...
... = wijl ajl−1 + bil
j=1
l-1 aMl
a N
1 z l = W l al−1 + b l
al = f (z l ) with f being an
activation function
[?]

Heike Adel Neural Networks I 22.03.2019 19 / 46


8.3 Training
Training Feedforward Networks: Overview
FFNN
Training
• Initial model: Start from randomly initialized parameters Θ

• Forward Pass: For each input x, compute the output ŷ

• Loss computation: Quantify the scalar loss (that is cost including regularization if present)
of ŷ wrt. the gold solution y.

• Backward Pass: Determine contribution of each parameter to the loss by applying the
chain rule (back-propagation).

• Training Step: Adapt the parameters according to the computed gradients and update
regime (learning rate and gradient descent)

Initialization

• Clever random initialization of parameters quality and speed of learning! Why do we


avoid initialization with a constant value?▲ Breaking the symmetry!

• Different initializations (via random seeds) lead to different final parameters (models).
Why? Non-convexity of objective function! Saddle points, local minima. Different mod-
els are good for ensembling!

• Best initialization strategy can depend on activation function

125
Table 11-1. The initialization strategy for the ReLU activation function (and its var‐
iants, including the ELU activation described shortly) is sometimes called He initiali‐
zation (after the last name of its author).

Table 11-1. Initialization parameters for each type of activation function


Activation function Uniform distribution [–r, r] Normal distribution x < 0
Logistic 6 2
r= σ=
ninputs + noutputs ninputs + noutputs
(
Hyperbolic tangent 6 2
r = 4 .x/ D σ = 40 x < 0
.0; x/ D
ninputs + noutputs ninputs + noutputs
x :
ReLU (and its variants) 6 2
r= 2 σ= 2
ninputs + noutputs ninputs + noutputs

Source: [?, ]
By default, the fully_connected() function (introduced in Chapter 10) uses Xavier
initialization (with a uniform distribution). You can change this to He initialization
8.3.1 Activations
by using the variance_scaling_initializer() function like this:
Some Nonlinear Activation Functions
sigmoid(x) tanh(x) hardtanh(x) ReLU(x)
1.0 he_init = tf.contrib.layers.variance_scaling_initializer()
1.0 1.0 1.0
0.5 hidden1 = fully_connected(X,
0.5 n_hidden1,
0.5 weights_initializer=he_init,
0.5 scope="h1")
0.0 0.0 0.0 0.0
-0.5 -0.5 -0.5 -0.5
-1.0 -1.0 -1.0 -1.0
-6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6
Some !f
activation functions are “more !f
linear” than others! Can !f
actually be better... !f
!x !x !x !x
What
1.0 are the graphs of their
1.0derivatives? 1.0 1.0
See ▲
0.5 d2l section on activation
0.5 functions 0.5 0.5
3 This simplified strategy was actually already proposed much earlier—for example, in the 1998 book Neural
0.0 0.0 0.0 0.0
Networks: Tricks of the Trade by Genevieve Orr and Klaus-Robert Müller (Springer).
Activation
-0.5 Functions in Layers
-0.5 -0.5 -0.5
4 Such as “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” K.
-1.0
For-6hidden layers -1.0 -1.0 -1.0
-4 -2He 0et al.
2 (2015).
4 6 -6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6 -6 -4 -2 0 2 4 6
• sigmoid, tanh, ReLU, ELU, Swish▲ , etc.

278 | activation
• Different Chapter 11:functions
Training Deep Neural
have Nets problems in learning (saturation, “dead” at
different
zero) Good introduction ▲

For output layers

• Sigmoid

• Softmax (mostly in log space)


O y/
L.y; yO
y
• Depending onL.
the ▲
O classification
y; y/ task and numerical stability requirements. See d2l sectiony
O
y
ELU Activation Function: A Nice Combination

126
from the training set until the average of the objective function stops variations of the inp
variants in their experiments: training time was reduced and the neural network per‐
decreasing. It is called
formed better on stochastic
the test set.because each small
It is represented set of11-3,
in Figure examples illumination
and Equation 11-2 of an ob
gives a noisy estimate
shows of the average gradient over all examples. This while being very sens
its definition.
simple procedure usually finds a good set of weights surprisingly the difference betw
quickly whenEquation
compared 11-2. ELU
withactivation
far more function
elaborate optimization tech- dog called a Samoye
18
niques . After training,
ELUα z =
the zperformance
α exp − 1 if z < 0 of the system is measured different poses and
on a different set of examplesz called i f z a≥ test
0 set. This serves to test the from each other, wh
generalization ability of the machine — its ability to produce sensible same position and on
answers on new inputs that it has never seen during training. other. A linear classi

Figure 11-3. ELU activation function


Source: [?]

New trending activation functions appear regularly. . .


6 “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs),” D. Clevert, T. Unterthiner,
8.3.2 ForwardS. Hochreiter (2015). Hidden
(2 sigmoid)
Forward Pass in Simple Math
280 | Chapter 11: Training Deep Neural Nets

c d Com
answ
yl = f (zl )
Output units l
zl = wkl yk
wkl k H2

E E
= wkl
yk = f (zk ) yk zl
Hidden units H2 k I out
zk = w jk y j
wjk E E yk
j H1 =
zk yk zk

Hidden units H1 j y j = f (zj )

zj = wij xi
wij
i Input

Input units i

Source: [?]
Figure 1 | Multilayer neural networks and backpropagation. a, A multi- which one can backpro
layer neural
• What network
is not shown (shown
here? by the connected dots) can distort the input the total input z to eac
space to make the classes of data (examples of which are on the red and the units in the layer b
blue lines)
Forward linearly separable.
Computation in Matrix Note
Vectorhow a regular grid (shown on the left)
Notation z to get the output of th
in input space is also transformed (shown in the middle panel) by hidden The non-linear functio
units. This is an illustrative example with only two input units, two hidden linear unit (ReLU) f(z)
units and one output unit, but the networks used for object recognition well as the more conve
or natural language processing contain tens or hundreds of thousands of f(z) = (exp(z) − exp(−z
units. Reproduced with permission from C. Olah (http://colah.github.io/). f(z) = 1/(1 + exp(−z)). d
b, The chain rule of derivatives tells us how
127two small effects (that of a small pass. At each hidden la
change of x on y, and that of y on z) are composed. A small change Δx in the output of each unit
x gets transformed first into a small change Δy in y by getting multiplied with respect to the tota
by ∂y/∂x (that is, the definition of partial derivative). Similarly, the change convert the error deriv
Δy creates a change Δz in z. Substituting one equation into the other derivative with respec
gives the chain rule of derivatives — how Δx gets turned into Δz through At the output layer, the
Neural Network

Forward Pass ofMATHEMATICAL


4.2. IN an MLP NOTATION 43
ATHEMATICAL NOTATION
on, we will abandon the brain metaphor and describe
input layernetworks
1 exclusively
layer 2 ... in layer L
matrix operations.
ural network is called a perceptron. It is simply a linear model:
NNPerceptron .x/ D xW C b (4.1)
vector ... vector
din din dx
... y
x2R ; W 2R out
2 Rdout ;
; b... ... ...
weight matrix and b is a bias term.³ In order to go beyond linear functions, we
linear hidden layer (the network in Figure 4.2 has two such layers), resulting in
Perceptron with one hidden-layer (MLP1). Axfeed-forward a1 neural network
a2 with aL
r has the form:

1 1 1
NNMLP1 .x/ D ag.xW1
= f (W /Wb 12)C b2a2 = f (W 2 a1 + b 2 )
C bx + a(4.2)
L
= f (W L aL−1 + b L )
2 Rdin ; W 1 2 Rdin d1 ; b1 2 Rd1 ; W 2 2 Rd1 d2 ; b2 2 R[?]
d2
:
and b1 are a matrix and a bias termHeike forAdel Neural Networks I
the first linear transformation of the input, 22.03.2019 21 / 46

applied 42
function that isForward 4. FEED-FORWARD
element-wise
Computation (also calledNEURAL
in Goldberg’s NETWORKS
a nonlinearity
Notation or an activation
W and b are the matrix and bias term for a second linear transform. has no outgoing arrows, and is the output of the
2 2 the input to the network. e top-most layer
it down, xW 1 C b1• isBe network.
aware
a linear of e other layers
Goldberg’s
transformation areinput
considered
ofnon-standard
the x from “hidden.”
notation erow
with
din dimensions sigmoid shapeinstead
vectors inside the
of neurons in the
the usual
column x
ns. g is then applied to each ofvectors.
middle layers
the d1 represent
dimensions, a nonlinear
and the function
matrix(i.e.,
W 2the logistic function 1=.1 C e /) that is applied
together
to the neuron’s value before passing it to the output. In the figure, each neuron is connected to all
b2 are then used to• transform
Can ofyou the result
draw a into the d2graph?
computation dimensional output vector.
the neurons in the next layer—this is called a fully connected layer or an affine layer.
ctivation function g has a crucial role in the network’s ability to represent complex
out the nonlinearity in g , the neural network can only Outputrepresent
layer linear transfor-
y1 y2 y3
input.⁴ Taking the view in Chapter 3, the first layer transforms the data into a
tion, while the second layer applies a linear classifier to that representation.
dd additional linear-transformations and nonlinearities, resulting in an MLP with
rs (the network in Figure 4.2 is of this form): ∫ ∫ ∫ ∫ ∫
Hidden layer
2 1
NNMLP2 .x/ D .g .g .xW C b /W C b //W 3 :
1 1 2 2
(4.3)
arer to write deeper networks like this using intermediary variables:
NNMLP2 .x/ Dy Hidden layer ∫ ∫ ∫ ∫ ∫ ∫

h1 Dg 1 .xW 1 C b1 /
(4.4)
h2 Dg 2 .h1 W 2 C b2 /
x1 x2 x3 x4
y Dh2 W 3 : Input layer

ure 4.2 does not include bias terms. A bias term can be added to a layer by adding to it an additional neuron
any incoming connections, whose Figure 4.2: Feed-forward neural network with two hidden layers.
Computation Graph 1.
value is always
er that a sequence of linear transformations is still a linear transformation.
While the brain metaphor is sexy and intriguing, it is also distracting and cumbersome
to manipulate mathematically. We therefore switch back to using more concise mathematical
notation. As will soon become apparent, a feed-forward network as the one in Figure 4.2 is simply
a stack of linear models separated by nonlinear functions.
e values of each row of neurons in the network can be thought of as a vector. In Figure 4.2
the input layer is a 4-dimensional vector (x ), and the layer above it is a 6-dimensional vector (h1 ).
e fully connected layer can be thought of as a linear transformation from 4 dimensions to 6
dimensions. A fully connected layer implements a vector-matrix multiplication, h D xW where
the weight of the connection from the i128th neuron in the input row to the j th neuron in the output
row is W Œi;j  .² e values of h are then transformed by a nonlinear function g that is applied to
each value before being passed on as input to the next layer. e whole computation from input
to output can be written as: .g.xW 1 //W 2 where W 1 are the weights of the first layer and W 2
are the weights of the second one. Taking this view, the single neuron in Figure 4.1 is equivalent
to a logistic (log-linear) binary classifier .xw/ without a bias term .
two hidden-layers (the network in Figure 4.2 is of this form):
NNMLP2 .x/ D .g 2 .g 1 .xW 1 C b1 /W 2 C b2 //W 3 : (4.3)
It is perhaps clearer to write deeper networks like this using intermediary variables:
NNMLP2 .x/ Dy
h1 Dg 1 .xW 1 C b1 /
(4.4)
2 2 1 2 2
h Dg .h W C b /
y Dh2 W 3 :
³e network in Figure 4.2 does not include bias terms. A bias term can be added to a layer by adding to it an additional neuron
that does not have any incoming connections, whose value is always 1.
⁴To see why, consider that a sequence of linear transformations is still a linear transformation.

A Computation Graph for d2l’s regularized MLP▲

• What is not optimal in my opinion? Extra nodes for variable introduction

• Each leaf node has to be input/parameter data. Each internal node an operation!

• Better: decorate nodes with symbolic variable names.

Forward Computation: Arbitrary Depth

129
Neural Network
mall set of examples illumination of an object, or variations in the pitch or accent of speech,
r all examples. This whilePass
Forward beingofvery
ansensitive
MLP to particular minute variations (for example,
eights surprisingly the difference between a white wolf and a breed of wolf-like white
optimization tech- dog called a Samoyed). At the pixel level, images of two Samoyeds in
system is measured different poses and in different environments may be very different
input layer 1 layer 2 layer L
his serves to test the from each other, whereas two images of...a Samoyed and a wolf in the
to produce sensible same position and on similar backgrounds may be very similar to each
ng training. other. A linear classifier, or any other ‘shallow’ classifier operating on
vector ... vector
x b ... y
... ... ...

z z
Δz = y ΔyL
x a1 a2 a
z
y Δy = xy Δx
y
y = f (W L ...f (W 2 (f (W 1x + b 1 ) y+ b 2 )... + zb L )y
Δz = y x Δ x
x
[?] z z y
Heike Adel Neural Networks I
x x = y x 22.03.2019 22 / 46
Hidden 8.3.3 Backward
(2 sigmoid)
Backward Pass: A Closer Look

d Compare outputs with correct


answer to get error derivatives
yl = f (zl ) E
= yl tl
yl
zl = wkl yk l
E E yl
k H2 =
wkl zl yl zl

E E
= wkl
yk = f (zk ) yk zl
I out
zk = w jk y j k
E E yk wjk
j H1 =
zk yk zk E E
= w jk
y j = f (zj ) yj zk
j k H2
E E yj
zj = wij xi wij =
zj y j zj
i Input

Source: [?]
gation. a, A multi- which one can backpropagate gradients. At each layer, we first compute
an distort the input the layer,
At each hidden total input
computez tothe
each unit,
error which is
derivative a weighted
wrt. sumwhich
to the output of theisoutputs of sum
a weighted
re on the red and
of the error the units in the layer below. Then a non-linear function f(.) is applied to
derivatives wrt. to the total inputs to the units in the layer above.
(shown on the left) z to get the output of the unit. For simplicity, we have omitted bias terms.
• What stands t for?
le panel) by hidden The non-linear functions used in neural networks include the rectified
ut units, two hidden linear E
• What stands unit (ReLU) f(z) = max(0,z), commonly used in recent years, as
for?
bject recognition well as the more conventional sigmoids, such as the hyberbolic tangent,
ds of thousands of f(z) = (exp(z) − exp(−z))/(exp(z) + exp(−z)) and logistic function logistic,
p://colah.github.io/). f(z) = 1/(1 + exp(−z)). d, The equations
130 used for computing the backward
effects (that of a small pass. At each hidden layer we compute the error derivative with respect to
mall change Δx in the output of each unit, which is a weighted sum of the error derivatives
getting multiplied with respect to the total inputs to the units in the layer above. We then
imilarly, the change convert the error derivative with respect to the output into the error
Backpropagation as Reverse-Mode Autodiff
“The backpropagation algorithm is a fancy name for methodically computing the derivatives
of a complex expression using the chain-rule, while caching intermediary results. More gener-
ally, the backpropagation algorithm is a special case of the reverse-mode automatic differenti-
ation algorithm.” [?]

• Needs a bit of self study to comprehend.

• See slides on “Autodiff” for principles of reverse-mode automatic differentiation

• See Chapter 6 of [?] for a formal exposition of backpropagation algorithm.

• See 3Blue1Brown video series▲ [?] for a good well-animated formal explanation of back-
propagation in FFNNs

From Grayscale Image to a Vector

131
How many rows and columns in the original grayscale matrix?

132
Weighting the Input

• The input vector a is weighted elementwise.

• Think of it as a activation of each pixel.

• Blue = Positive Weight, Red = Negative Weight

FFNN for MNIST Handwritten Digit Classification

133
Predicting the digit of a grayscale image with a FFNN.
How man Hidden Layers? How many nodes in the hidden layer? How many connection
weights?

Gradient Landscape

Here: A loss that punishes every deviation from the truth

134
Backprop Motivation: What are gradients telling us? The relative influence of certain
weights with respect to the cost function!
The gradient of the cost function C specifies for any model parameter its change of rate with re-

spect to the C.
Nudging the connection weight with the yellow gradient has 32 times more influence on the
cost function than changing the pink arrow

Backprop Motivation: Larger Deviations from Ground Truth Should Change More into the
Desired Direction

135
What are the arrows indicating?
Predicted values that are closer to the expected value should change less!

Hebbian Learning: “neurons that fire together wire together”

Increasing the bias to output 2 helps.


Increasing the weights of positive (=active) activations helps.

136
Backprop: Minimal Example

A minimal network structure

137
The influence factors.

Backprop: Propagating back recursively over more layers with more than one neuron

138
Study the written text https://www.3blue1brown.com/lessons/backpropagation-calculus

8.3.4 Problems
Learning Problems for Deep Networks
What to do?

• Overfitting: Reduce capacity, L2 regularization or dropout

• Underfitting: Enlarge capacity, reuse existing models or train layerwise

• Vanishing gradients: non-saturating activation functions, special memory cells,

• Exploding gradients▲ : non-saturating activation functions, gradient clipping, batch nor-


malization

• Forgetting the original input signal in deeper layers: Highway networks [?] and Residual
Networks▲

• Deep FFNNs quickly have too much capacity for NLP problem generalization. CNNs
and RNNs can cure this

8.4 Tooling
8.4.1 PyTorch
FFNN for Markov Language Modeling
https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#an-example-n-gram-language-mode
Study the code together with your neighbor! Ask questions in the forum for things that you
don’t understand!
Can you draw the computation graph of single forward step?

139
Introduction to Keras

8.4.2 MLPs
Keras in Keras
Keras Sequential Layer API

Example: model with 2 hidden layers and softmax output layer

from keras . models import Sequential


from keras . layers import Dense
# define model
model = Sequential ()
# first hidden layer
model . add ( Dense ( units =64 , activation = ’ relu ’ , input_dim =100))
# second hidden layer
model . add ( Dense ( units =64 , activation = ’ relu ’ ))
# softmax output layer
model . add ( Dense ( units =10 , activation = ’ softmax ’ ))

[?]

See https://github.com/ageron/handson-ml2/blob/master/10_neural_nets_with_keras.ipynb for


keras-based code.
Heike Adel CNNs 05.04.2019 46 / 67

8.4.3 DyNet
DyNet: An Early Dynamic NLP-oriented Framework
If you struggle with what goes behind the scene in pytorch, DyNet is more transparent!
Installation and Use

• DyNet runs well on CPUs (and GPUs); based on Eigen library (as Tensorflow)

• Simple installation on CPU: pip install dynet

Documentation and Introductions

• Official tutorials: http://dynet.readthedocs.io/en/latest/tutorial.html#python-tutorial

• EMNLP tutorial slides (DyNet Version 1 API): https://github.com/clab/dynet_tutorial_examples

• Y. Goldbergs Introduction to DyNet and NNs: https://www.youtube.com/watch?v=8eYZz6kiuUA

Training with Computation Graph without Mini-Batches


Computation
Graph

140
58 5. NEURAL NETWORK TRAINING

Algorithm 5.5 Neural network training with computation graph abstraction (using minibatches
of size 1).

1: Define network parameters. 5.1. THE COMPUTATION GRAPH ABSTRACTION 55


5.1.32: forSOFTWARE
iteration = 1 to T do 5.1. THE COMPUTATION GRAPH ABSTRACTION 55
3:
Several SOFTWARE Training example
software packages implement the do
for x i ; y i in dataset computation-graph model, including eano,¹
5.1.3
4: loss_node build_computation_graph(xi , yi , parameters)
[Bergstrasoftware
Several et al., 2010], TensorFlow²
packages implement [Abaditheet computation-graph
al., 2015], Chainer,³ model,
and DyNet⁴ [Neubig
including et al.,
eano,¹
5: loss_node.forward()
2017].
[Bergstra
6:
Alletthese
al., packages
2010],
gradients
support all [Abadi
TensorFlow² the essential
loss_node().backward() et al., components (nodeand
2015], Chainer,³ types) for defining
DyNet⁴ [Neubiga wide
et al.,
range
2017]. of neural
7: All these
network
parameters architectures, covering the
update_parameters(parameters,
packages support structures
all the essential components described
gradients) in this book and more.
(node types) for defining a wide
Graphofcreation
range neural is made almost
network transparent
architectures, by use
covering theofstructures
operator overloading.
described in e
this framework de-
book and more.
8: return parameters.
fines a type for representing graph nodes (commonly called expressions), methods for constructing
Graph creation is made almost transparent by use of operator overloading. e framework de-
nodes for inputs and parameters, and a set of functions and mathematical operations that take
fines a type for representing graph nodes (commonly called expressions), methods for constructing
expressions as input and result in more complex expressions. For example, the python code for
nodes for inputs
Chapters 14–18. and
Forparameters, and
networks with a set
fixed of functions
structures, such asand mathematical
an MLPs, it may beoperations that take
more efficient
creating the computation graph from Figure 5.1c using the DyNet framework is:
expressions
to create
DyNet as
oneinput and result
for in
base computation
Framework: Setup moreand
agraph
Neural complex
POS expressions.
varyTagger
only the inputs For example,outputs
and expected the python code for
between
examples.
creating thedynet
import computation
as dy graph from Figure 5.1c using the DyNet framework is:
# model i n i t i a l i z a t i o n .
model
import=dynet dy . Model as dy ()
5.1.5
mW1
# model NETWORK
= model i a l i z a t COMPOSITION
i n i t. add_parameters ion . ((20 ,150) )
As long as the network’s output is2 0a) vector (1  k matrix), it is trivial to compose networks by
mb1
model = =modeldy . . add_parameters
Model ( ) (
mW2 = model . add_parameters
mW1 add_parameters((((1270 ,,2105)0)) )
making the output
mb2 = model . add_parameters
of one network the input of another, creating arbitrary networks. e compu-
mb1 add_parameters((1270))
tation
lookup
mW2 =graph= model
model abstractions makes this
. add_lookup_parameters
. add_parameters ( ( 1 7ability
, 2 0 ) ) explicit:
( ( 1 0 0 , a50)node
) in the computation graph can itself
bet r aa icomputation
mb2 n e model
= r = dy. add_parameters
. SimpleSGDTrainer
graph with a designated ( 1 7()model )
output node. One can then design arbitrarily deep and
lookup =
complex model . add_lookup_parameters
networks, and be able to easily evaluate ( ( 1and
0 0 , train
50) )them thanks to automatic forward and
tdreaf i nget_index (x) :
e r = dy . SimpleSGDTrainer ( model )
gradient p a s computation.
s # Logic omitted is makes . it easy to define and train elaborate recurrent and recursive
networks,
Maps words as
d e f get_index ( x ) : discussed
to numeric in Chapters
IDs . 14–16 and 18, as well as networks for structured outputs and
p a s s # Logic
multi-objective training, omitted
as we .discuss in Chapters 19 and 20.
#Maps Thewords f o l l o wto i n gnumeric
b u i l d s IDs
and. e x e c u t e s the computation graph ,
# and updates
Creating the Lossmodel parameters .
5.2
# The
# Only fPRACTICALITIES
oone
l l o wdata
i n g point
b u i l d s i sand shown e x e,c uitne sp rthe
a c t icomputation
c e the f o l l o w ing ,
graph
# should run
# and updates model parameters .i n a data - f e e d i n g loop .
Once
# Only theone gradient
data computation
point i s shown is taken , i ncare p rof,
a c tthe
i c e network
the f o lisl otrained
w i n g using SGD or another
# Building
# should run optimization the computation
i n a data - falgorithm. graph
e e d i n g loop :
gradient-based e. function being optimized is not convex, and for a
dy . renew_cg ( ) # c r e a t e a new graph .
long
# Wrap timethe training
model of parameters
neural networks was considered
as graph - nodes . a “black art” which can only be done by
# Building the computation graph :
selected
dy . renew_cg ( ) # c r e a t e a new graph . the optimization process, and care has to be taken
W1 = dy few.
. Indeed,
parameter many
(mW1) parameters affect
#b1tune
to = dy
Wrap . parameter
these
the model (mb1)
parameters. While thisas
parameters book graphis not intended
- nodes . as a comprehensive guide to successfully
W2 =
W1 = dyneural dy . parameter
. parameter (mW2)
(mW1)
training
b2 = = dy dy .. parameter networks, we do list here a few of the prominent issues. For further discussion on
b1 parameter (mb2) (mb1)
optimization
# Generate
W2 = dy . parameter
techniques
the embeddings
(mW2)
and algorithms l a y e r . for neural networks, refer to Bengio et al. [2016, Chapter
vthe
8].
b2 For = dy =
some dy . lookup
theoretical
. parameter [ get_index
(mb2)discussion (and ” theanalysis,
”) ] refer to Glorot and Bengio [2010]. For various
vblack
# Generate = dy . lookup [ get_index
therecommendations,
embeddings l a y esee ( ” black
r . Bottou”) ]
practical
vdog =
tips and [2012], LeCun et al. [1998a].
vthe = dy dy .. lookup
lookup [[ get_index
get_index((”dog” ” the ”)) ]]
vblack = dy . lookup [ get_index ( ” black ” ) ]
# Connect the l e a f nodes i n t o a complete graph .
vdog = dy . lookup [ get_index ( ”dog” ) ]
x = dy . concatenate ( [ vthe , vblack , vdog ] )
output = dy . softmax (W2*( dy . tanh (W1*x+b1 ) )+b2 )
# Connect the l e a f nodes i n t o a complete graph .
l o s s = -dy . l o g ( dy . pick ( output , 5) )
x = dy . concatenate ( [ vthe , vblack , vdog ] )
output = dy . softmax (W2*( dy . tanh (W1*x+b1 ) )+b2 )
l o s s = -dy . l o g ( dy . pick ( output , 5) )
¹http://deeplearning.net/software/theano/
²https://www.tensorflow.org/
³http://chainer.org
¹http://deeplearning.net/software/theano/
⁴https://github.com/clab/dynet
²https://www.tensorflow.org/
141
³http://chainer.org
⁴https://github.com/clab/dynet
+
to the case where the computation graph
*
mponent is an independent function that
e other connected components).
56 5.Optimizing
NEURAL NETWORK TRAINING
the Parameters
a b los s_valu e =2 1×1
l o s s . forward
neg
()
l o s s . backward ( ) # the g r a d i e n t i s computed
#
1 × 1and s t o r e d i n the corresponding
#logparameters .
strict ourselves to the case where the computation graph
t r a i n e r . update
1×1( ) # update the parameters according to the g r a d i e n t s .
(c) pick
h connected component is an independent function that
DyNet Code is typically very close to mathematical formulation! You don’t need to think about
Mosttheofmini-batch
the code involves various initializations: the first block defines model parameters that are
dimension. 1 × 17

endentlybeofshared
the between
other
softmax connected components). 5
different computation graphs (recall that each graph corresponds to a specific
A More
training Real-World
example).
1 × 17 e Classification Computation
second block turns the model Graph
parameters into the graph-node (Expression)
1×1
types. e third block retrieves the Expressions for the embeddings
ADD of the input words. Finally,
N neg 1 ! 1
the fourth block is where the graph is created. Note how transparent the graph creation is—
1 × 17 1×1
there is an almost
MUL
a one-to-one correspondence between creating the graph and describing
@N it
log d.i /
mathematically. e last block shows a forward and backward pass. e equivalent code @i in the
1 × 1 d.i / i
TensorFlow 1package
× 20 20 is:⁵
× 17 1 × 17
tanh 2 2
W b (c) d.1/; : :pick
: ; d.N /
import t e n s o r f l o w as t f
1 × 17
1 × 20 5
x . g et _ va ri a b l e ( ”W1” , [ 2 0softmax
W1 = t f ADD , 150])
b1 = t f . g e t _va ri a b l e ( ”b1” , [ 2 0 ] )
W2d.N
= t/f . g et @N
1 _ va ri a b l e ( ”W2” , [ 1 7 , 2 0 ] ) F D1
b2 = t f .1 g× 20
e t _va ri a b l e ( ”b2” , [ 1 7 ]1)× 17 @N
MUL
lookup = t f . g e t_ v a r i a b l e ( ”W” , ADD
P
[100 , 50])
@fj @N X @N @j
d.i / j 2!.i / d.j / " F D
d e f get_index
1 × 150 ( 150
x ) ×: 20 1 × 20 @i @i @j @i
j 2!.i/
p a s sconcat
# Logic W 1 omitted
b1 1 × 17
L 1 × 50
p1lookup
1 × 50
= t f . p llookup
1 × 50
a c e h o l d elookup
r ( t f . int32 , [ ] )
MUL
p2 = t f . p l a c e h o l d e r ( t f . int32 , [ ] )
20 × 17
p3 = t f . p l a c e h o l d e r ( t f . int32 , [1 ]×)20
1 × 17 20 × 17 1 × 17
× 50
t a 2r g e t = t f . p l a c e h@f o ljd e r|V|( ×t 50
f . int32 , [ ] ) !1 2
W 2
b “the” “black” “dog” E tanh 2
W fj .! .jb // i 2 ! !1 .j /
@i
v_w1 = t f . nn . embedding_lookup f
[?, 52]
( jlookup , p1 )
Computation graph with expected output (5) andv.a 1 /;computation.
loss : : : ; v.am / E is the : : : ; am D
a1 ;embedding
!1
v_w2 = t f . nn . embedding_lookup ( lookup , p2 )
!matrix.
.j / 1 × 20
th concrete input.
v_w3(c)= Graph with concrete ( lookup
t f . nn . embedding_lookup , p3 )
ADD
x = t f . concat ( [ v_w1,Selection
v_w2, @fi
Derivation of Element v.i/ v_w3(pick)] , 0)
output = t f . nn . softmax (
“The pick node @x
!1 implements an indexing 1 × 20 operation, receiving a vector and an index (in this Pick Function
cal expression, it can
t f x
. 2 ! .i
bereturning
einsum /
representedcorresponding
( ” i j , j -> i ”as a
, W2, t f . tanh (
L case, 5) and t f . einsum ( the MUL
” i j , j -> i ” , W1, xentry in )the
) + b1 ) +vector.”
b2 ) [?, 53]
he computation graph for an MLP with
l o s s = - t f . l o g ( output [ t a r g e t ] )
@fi ( l o s s )
0
t r a i n e oval
. In our notation, r = tnodesf . t r a i nrepresent
. GradientDescentOptimizer ( 0 . 1 ) . minimize
1 × 150
150 × 20 1 × 20 150 × 20 1 × 20 @x
W 1 1 C concat
#bGraph d e f i n i t i o n done , compile i t and Wf e1e d c o n cbr1 e t e data .
# Only one data - point i s pick.x;
shown , 5/i n p r a c t i c e we w i l l use
× 50 1 × 50# a data - f e e d i n g loop . 1 × 50 1 × 50 1 × 50
i
okup lookupwith t f . S e s s i o n ( ) as s e lookup
ss : lookup lookup
s e s s . run ( t f . g l o b a l _ v a r i a b l e s _ i n i t i a l i z e r ( ) )
pi ck.x; 5/= {
feed_dict g x g Œ5" D 1 g Œi¤5" D 0
|V| × 50 p1 : get_index ( ” the.0; ” )x/, |V| ×150 x > 0 0
ack” “dog” E p2 : get_index ( ” black “the”” ) , “black” “dog” E
p3 : get_index ( ”dog” ) , [?, 52]

What is code
⁵TensorFlow max(0,x)?
provided by Tim Rocktäschel. anks Tim!
t. (b) Graph with concrete input. (c) Graph with concrete
142
e.

y a mathematical expression, it can be represented as a


Conclusion

• FFNNs (sometimes called MLP) with non-linear activation functions and at least 1 hid-
den layer can approximate any computable function

• Deep FFNNs typically learn better than fat FFNNs

• Simple ReLU-style (Swish) non-linear activation perform well in deep FFNNs

• Backpropagation applies reverse-mode auto differentiation through FFNNs for optimiza-


tion of the parameters

8.5 Further Study


• Mandatory reading: Chapter 5 “Multilayer Perceptron”▲

• Chapters 4,5,6 in [?]; especially 5.1 introduces Computation Graphs and the Backpropa-
gation Algorithm

• 3Blue1Brown video on Backpropagation in NNs https://youtu.be/tIeHLnjs5U8

• More readings: Chapter 10 and 11 in [?]

• Standard literature: Chapter 6 on Feedforward Networks in [?]

143
Chapter 9

Static Word Embeddings

Learning Objectives

• Understand the basic ideas of distributionalism and its various instantiations

• Understand the reason behind and advantages of dense representations

• Grasp the basic ideas behind embeddings in word2vec and gloVe

• Understand different representation learning techniques (self-learning approaches) for


static word embeddings

9.1 Distributionalism

But what is meaning?


What is Meaning?

What is bardiwac?

[?]

Short Version of Long History of Distributionalism


Distribution-
alism
• “Die Bedeutung eines Wortes ist sein Gebrauch in der Sprache.” (Ludwig Wittgenstein)

• “Distributional Structure” [?]

• “You shall know a word by the company it keeps!” (J. R. Firth (1957))

144
• “words which are similar in meaning occur in similar contexts (Rubenstein & Goode-
nough, 1965)”

• “words with similar meanings will occur with similar neighbors if enough text material
is available” (Schütze & Pedersen, 1995)

• “words that occur in the same contexts tend to have similar meanings” (Pantel, 2005)

See [?] for details.

What is Meaning?
But what is meaning? Meaning

What is bardiwac?

He handed her a glass of bardiwac.


Beef dishes are made to complement the bardiwac.
Nigel staggered to his feet, face flushed from too much
bardiwac. Bardiwac is a ...
Malbec, one of the lesser-known bardiwac grapes,
responds well to Australia’s sunshine.
I dined off bread and cheese and this excellent bardiwac.
The drinks were delicious: blood-red bardiwac as well as
light, sweet Rhenish.
[?]

Distributional semantics
What is Meaning?

A bottle of _________ is on the table. (1)


Everybody likes _________. (2)
What other words fit into these contexts?
Don’t have _________ before you drive. (3)
We make _________ out of corn. (4)
(1) (2) (3) (4) …
bardiwac 1 1 1 1
loud 0 0 0 0
motor oil 1 0 0 1
tortillas 0 1 0 1
wine 1 1 1 0
choices 0 1 0 0

[?]

Window-Based Cooccurrence Matrix

145
Window based cooccurence matrix
• Example corpus:
• I like deep learning.
• I like NLP.
• I enjoy flying.
counts I like enjoy deep learning NLP flying .
I 0 2 1 0 0 0 0 0
like 2 0 0 1 0 1 0 0
enjoy 1 0 0 0 0 0 1 0
deep 0 1 0 0 1 0 0 0
learning 0 0 0 1 0 0 0 1
NLP 0 1 0 0 0 0 0 1
flying 0 0 1 0 0 0 0 1
9 . 0 0 0 Richard
0 Socher
1 1 1 3/31/160

Source: [?, 9]

Generating Distributional Thesauri (unsupervised)▲


Created from huge linguistically analyzed web corpora [?]

“an
automatically produced thesaurus which identifies words that occur in similar contexts as the
target word”▲
Types of contexts for adjectives are for instance adjective and its modified noun (stupid error)
or coordinated adjectives (he is stupid and mean).

9.1.1 Variations
Many Ways of Distributional Modeling
Distribution-
alism

146
7 (Semi-)supervision 17

8 Tools 19

Great power, a great many design choices:

tokenization
annotation
tagging
parsing
feature selection
..
. cluster texts by date/author/discourse context/. . .
# .
Dimensionality Vector
Matrix type Weighting reduction comparison
word ⇥ document probabilities LSA Euclidean
word ⇥ word length normalization PLSA Cosine
word ⇥ search proximity ⇥ TF-IDF ⇥ LDA ⇥ Dice
adj. ⇥ modified noun PMI PCA Jaccard
word ⇥ dependency rel. Positive PMI IS KL
verb ⇥ arguments PPMI with discounting DCA KL with skew
.. .. .. ..
. . . .

Source: [?]
(Nearly the full cross-product to explore; only a handful of the combinations are ruled out mathe-
See [?] Chapter 6 for definitions of PPMI etc.
matically, and the literature contains relatively little guidance.)
Idea: co-occurrence counts
Distributionalism: Reducing Co-occurrence Counts
Corpus sentences Co-occurrence counts vector

small vector

Dimensionality
reduction

[?]

LSA▲ : Classical Dimension Reduction Technique (SVD)

147
Latent semantic analysis (LSA)
𝑋 - document-term co-occurrence matrix LSA term vectors
Hope: term having common
𝑋 ≈ 𝑋෠ = 𝑈 Σ 𝑉 𝑇 meaning are mapped to the
same direction

d d w
≈ × ×
LSA document vectors
Hope: documents
w discussing similar topics U Σ 𝑉𝑇
have similar representations
[?]
X = d × w contains the weights (relative counts in the simplest case) of all considered words w for all documents
d.
The dimension (rank k) of the diagonal matrix Σ sets the compression factor.
~
Bad news: Computational costs grow quadratic for d × w. New words/documents are hared to integrate.

9.2 Embeddings
Atomic Word Representation: One-Hot-Encoding

The standard word representation


The%vast%majority%of%ruleJbased%and%staGsGcal%NLP%work%regards%
words%as%atomic%symbols:%hotel, conference, walk
In%vector%space%terms,%this%is%a%vector%with%one%1%and%a%lot%of%zeroes%

[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
Dimensionality:%20K%(speech)%–%50K%(PTB)%–%500K%(big%vocab)%–%13M%(Google%1T)%

We%call%this%a%“oneJhot”%representaGon.%Its%problem:%
motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] AND
hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0%
35% Source: [?, 35]

Continuous Vector Representation Learning


“Don’t count, predict!” [?]

148
Neural word embeddings
as a distributed representation
Similar%idea% %
%
Combine%vector%space%
semanGcs%with%the%predicGon%of% $ 0.286%
probabilisGc%models%(Bengio%et% 0.792%
$ −0.177%
al.%2003,%Collobert%&%Weston%
−0.107%
2008,%Turian%et%al.%2010)% linguis<cs$$=% 0.109%
In%all%of%these%approaches,% −0.542%
including%deep%learning%models,% 0.349%
a%word%is%represented%as%a% 0.271%
dense%vector%
[?, 38]
38%

Crucial question: How can we effectively learn good representation?

Words as Dense Vectors of 64 Real Numbers


Which line represents which word? (“war”, “peace”, “tomato”)

Word Embeddings
Continuous, numeric, dense word representations learned from raw text.
Similar vectors mean similar words (cosine vector distance).
Embeddings are a perfect input for numeric ML methods.

How to evaluate embeddings


How To Evaluate Embeddings?

Intrinsic: evaluation on a specific/intermediate subtask


word analogies: “a is to b as c is to ___?”
word similarity: correlation of the rankings

Extrinsic: evaluation on a real task
take some task (MT, NER, coreference resolution, …) or several tasks
train with different pretrained word embeddings
if the task quality is better -> win!
[?]

• Odd-One-Out: Identify the word that does not belong into a set of words. . .

• Human judgements are needed (as always in semantic tasks)

149
Linguistic Regularities in Word Vector Space
Semantic Connections in Vector Space Substructures▲
Implicitly, linear semantic AND syntactic connections are learned and modeled in the vector
space! Astonishing!

Visualization of Regularities in Word Vector Space


[1em]

The word vector space implicitly encodes many regularities


among words

11 / 31

Analogies Explained: Toward [?]

Understanding Word Embed


Analogy Computation in Vector Space
27 / 34

https://carl-allen.github.io/nlp/2019/07/01/explaining-analogies-explained.html
Allen & Hospedales, ICML 2019, Best Paper Honourable Men
Analogy Computation in VectorOfficial
Space blog: https://carl-allen.github.io/nlp/2019/07/01/exp

150
trained using W2V, one could take the vector of the word king, subtract the word man, add
the word woman and get that the closest vector to the result (when excluding the words king, man,
and woman) belongs to the word queen. at is, in vector space wking wman C wwoman  wqueen .
Similar results are obtained for various other semantic relations, for example wFrance wParis C
wLondon  wEngland , and the same holds for many other cities and countries.
is has given rise to the analogy solving task in which different word embeddings are eval-
uated on their ability to answer analogy questions of the form man:woman ! king:? by solving:
analogy.m W w ! k W‹/ D argmax cos.v; k m C w/: (11.4)
v2V nfm;w;kg

Levy and Goldberg [2014] observe that for normalized vectors, solving the maximization
in Equation (11.4) is equivalent to solving Equation (11.5), that is, searching for a word that is
similar to king, similar to man, and dissimilar to woman:
analogy.m W w ! k W‹/ D argmax cos.v; k/ cos.v; m/ C cos.v; w/: (11.5)
v2V nfm;w;kg

Levy and Goldberg refer to this method as 3CA. e move from arithmetics between
Source: [?]
words in vector space to arithmetics between word similarities helps to explain to some extent the
ability of the word embeddings to “solve” analogies, as well as suggest which kinds of analogies
The
can Cosine Similarity
be recovered by this method. It also highlights a possible deficiency of the 3CA analogy
recovery method: because of the additive nature of the objective, one term in the summation may Cosine
Dot Product Geometrically Similarity
· ⃗b = ∥⃗a∥∥the
⃗adominate ⃗b∥cos θ = ∥⃗b∥∥⃗
expression, effectively ignoring the∥⃗
a∥cos θ Explanation: others.
a∥cos θAs suggested
projects ⃗b. and Goldberg, this
by Levy
⃗a onto
can be alleviated by changing to a multiplicative objective (3CM):
cos.v; k/ cos.v; w/
analogy.m W w ! k W‹/ D argmax : (11.6)
v2V nfm;w;kg cos.v; m/ C 

Cosine Similarity (uses dot product)


cos θ = ⃗a·b⃗

∥⃗a∥∥b∥
Pn
i=1 ai × bi
= pPn
i=1 (ai ) × i=1 (bi )
pPn
2 2

http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/

Quality of Embeddings: Analogies in 2D-Space


Question: “man” is to “woman” as “king” is to ?

151
Images adapted from this blog▲ The red vector is the re-
sult of resolving MAN + ? = WOMAN, that is ? = WOMAN - MAN. What is it in numbers?

Analogies in 2D-Space

152
Analogies in 2D-Space

Analogies in 2D-Space

153
Analogies in 2D-Space

Analogies in 2D-Space

154
In-Class Task: Analogies
Go to https://cutt.ly/ml4nlp1-hs22-we

Dimension Reduction for Visualization▲ : PCA, t-SNE, UMAP...1

• Important: The operations do not take place in 2D vector space!

• Different methods of dimension reduction for visualization of the vectors in 2D or 3D.

• PCA (Principal Component Analysis): Linear algebra method which maximizes the vari-
ance of the data. Stable visualization!

• t-SNE (t-Distributed Stochastic Neighbor Embedding): Non-linear gradient method that


optimizes the similarity of vectors in high and low dimensional space. Non-deterministic
visualization

• UMAP: yet another non-linear method.... and others more...

Examples for PCA and t-SNE Visualizations▲


Left column: Wikipedia, Right column: English Gigaword

1
https://towardsdatascience.com/visualizing-word-embedding-with-pca-and-t-sne-961a692509f5

155
Linguistic Regularities - Results
Which one is PCA? Which one is t-SNE?

Training and Performance (Semantic/Syntactic Tests)


Model Vector Training Training Accuracy
Dimensionality Words Time [%]
Collobert NNLM 50 660M 2 months 11
Turian NNLM 200 37M few weeks 2
Mnih NNLM 100 37M 7 days 9
Mikolov RNNLM 640 320M weeks 25
Huang NNLM 50 990M weeks 13
Our NNLM 100 6B 2.5 days 51
Skip-gram (hier.s.) 1000 6B hours 66
CBOW (negative) 300 1.5B minutes 72

[?]
The word2vec performance revolution in 2013 . . .
14 / 31

Basic Ideas behind Embeddings


Embeddings

• Simple idea: similar words have similar vectors!

• Vector space models define distance [?]

• Dimension reduction (typical vector length for words d: 64-300)

Main task in creating continuous representations


Learn a numeric vector Wi ∈ Rd (=embedding), adequately representing the ith word of your
vocabular from distributional evidence!

Learning methods: Interestingly from Deep to Shallow Learning


Originally complex network architectures with slow training. Meanwhile, simpler and better
models: word2vec, GloVe, fasttext.
And more recently: Back to deep models for contextualized word embeddings (BERT)

156
Early Approach Using Classical n-Gram Language Modeling
Feedforward Neural Net Language Model Feedforward
Neural Net
LM

U, V and W are weight matrices


whose values are to be learned
by the network.

When training is complete, U will


be used to “translate“ any word
into the respective vector in the
continuous space

● Four-gram neural net language model architecture (Bengio 2001)


● The training is done using stochastic gradient descent and backpropagation
● The word vectors are in matrix U

5 / 34
Source: http://www.micc.unifi.it/downloads/readingroup/TextRepresentationNeuralNetwork.pdf See simplified
PyTorch implementation▲

9.2.1 word2vec
Continuous Bag-Of-Word Language Modeling
Neural Network learns to estimate the probability a word using some context represented
as a vector: word2vec [?]

www.youtube.com/watch?v=aZarigloqXc

More theoretical insights on the exact distributional model in [?]

Idea: Continuous Bag-of-Words (CBOW)


CBOW

157
ontinuous Bag-of-words Architecture

Input projection output

w(t-2)

SUM
w(t-1)

w(t)

w(t+1)

w(t+2)

Source: [?]
Predicts the current
• Givenword given predict
the context, the context
the current word!
• Efficient computation with shallow neural nets (1 hidden9 /layer)
31
• Sum dense input word representations (and divide them for proper averaging)
• Shallow NNs can work with larger training sets than deep NNs
• There is no data like more data. . .

wevi▲ : Visualization of a CBOW Training Step


Context words: drink, juice
Training center word: apple

Source: wevi: word embedding visual inspector▲

Input/Output has dimensionality of one-hot vector: drink=[0,1,0,0,0,0,0,0] Embeddings are


learned edge weights to “hidden” layer: drink=[0.1,-0.3,-0.4]

Idea: Skip-Gram
Skip-Gram

158
p-gram Architecture

Input projection output

w(t-2)

w(t-1)

w(t)

w(t+1)

w(t+2)

Source: [?]
Predicts the surrounding words given the current word
• Given the current word, predict the most probable context by predicting each context word separately!
8 / 31
• Negative sampling (i.e., also learn the improbability of selected improbable words) improves quality and
efficiency
Word2Vec
word2vec Architecture

a large corpus of text


Every word in a fixed vocabulary is
represented by a vector
Go through each position t in the
text, which has a center word c and
context (“outside”) words o
Use the similarity of the word
vectors for c and o to calculate the
probability of o given c (or vice
versa)
Keep adjusting the word vectors
to maximize this probability
Mikolov et al, 2013, https://arxiv.org/pdf/1310.4546.pdf
[?]

Word2Vec
word2vec: SKIP-GRAM Windows and Probabilities
Examples windows and and process for computing 𝑃(𝑤𝑡+𝑗 |𝑤𝑗 )

[?] http://web.stanford.edu/class/cs224n/syllabus.html

159
Word2Vec: objective function
word2vec SKIP-GRAM Objective Function
For each position 𝑡 = 1, ... , 𝑇, predict context words within a window of fixed
size m, given center word 𝑤t.

Likelihood =

𝜃 𝑖𝑠 𝑎𝑙𝑙 𝑣𝑎𝑟𝑖𝑎𝑏𝑙𝑒𝑠
𝑡𝑜 𝑏𝑒 𝑜𝑝𝑡𝑖𝑚𝑖𝑧𝑒𝑑

[?]

Word2Vec: objective function


word2vec: Loss Function
The objective function (or loss, or cost function) 𝐽(𝜃) is the (average) negative
log likelihood

Minimizing objective function Maximizing predictive accuracy


[?]

Word2Vec: objective function


word2vec: Probabilities

We want to minimize objective function

Question: How to calculate 𝑷 𝒘𝒕+𝒋 𝒘𝒋 , 𝜽)?

Answer: We will use two vectors per word w vw is a center word


uw is a context word
Then for a center word c and a context word o:

exp(𝑢𝑜𝑇 𝑣𝑐 )
𝑃 𝑜𝑐 = 𝑇 𝑣 )
σ𝑤∈𝑉 exp(𝑢𝑤 𝑐

Word2Vec:
[?]
prediction function
word2vec: Conditional Word Probabilities

P(o | c) = exp(u_o^T v_c) / ∑_{w∈V} exp(u_w^T v_c)

• The dot product measures the similarity of o and c: a larger dot product means a larger probability.
• After taking the exponent, normalize over the entire vocabulary.

Who does the workload? Softmax.

[?] Mikolov et al., 2013, https://arxiv.org/pdf/1310.4546.pdf
This is softmax!

word2vec: Softmax
The softmax function ℝ^n → ℝ^n:

softmax(x)_i = exp(x_i) / ∑_{j=1}^{n} exp(x_j) = p_i

• maps arbitrary values x_i to a probability distribution p_i
• “max” because it amplifies the probability of the largest x_i
• “soft” because it still assigns some probability to smaller x_i

[?] often used in Deep Learning!
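A minimal NumPy sketch of the softmax function (the input values are arbitrary examples); subtracting the maximum before exponentiating is a common trick for numerical stability and does not change the result:

import numpy as np

def softmax(x):
    # shift by the maximum for numerical stability (does not change the output)
    z = np.exp(x - np.max(x))
    return z / z.sum()

scores = np.array([2.0, 1.0, 0.1, -1.5])
p = softmax(scores)
print(p, p.sum())   # a probability distribution: non-negative values that sum to 1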


word2vec: Center and Context Representation

Every word has two vectors: one that is used when the word acts as a center word, and one that is used when it acts as a context word. The token index in the vocabulary V selects the corresponding row in the center-word matrix and in the context-word matrix.

[Figure from [?]: for the tokens of “I saw a cat .” (vocabulary indices 39, 1592, 10, 2548, 5) and for the words “into problems turning banking crises”, the same index picks one row from the center-word matrix and one row from the context-word matrix]

word2vec: What and Where are the Parameters?

• θ consists of the d-dimensional vectors for all V words
• every word has two vectors!
• we optimize these parameters

[?]

The Bigger Picture

https://ronxin.github.io/wevi/

word2vec: Training Efficiency Problem and Some Solutions

P(o | c) = exp(u_o^T v_c) / ∑_{w∈V} exp(u_w^T v_c)

The denominator is a huge sum: the time for calculating gradients is proportional to |V|.

Possible solutions:
• Hierarchical softmax
• Negative sampling: replace the sum ∑_{w∈V} exp(u_w^T v_c) by a sum over a small subset, ∑_{w∈{o}∪S_k} exp(u_w^T v_c), where S_k is a negative sample with |S_k| = k

Mikolov et al., 2013, https://arxiv.org/pdf/1310.4546.pdf
[?] Negative sampling: Temporarily reduce the vocabulary size by randomly selecting words that do not appear in the context.
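A minimal NumPy sketch of the negative-sampling idea (vectors and sampled indices are made up): instead of a softmax over the full vocabulary, the center vector only has to score the observed context word against k randomly drawn “negative” words, using the logistic loss of the Mikolov et al. paper:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
d, V, k = 50, 10000, 5
U = rng.normal(0, 0.1, (V, d))        # context ("output") vectors u_w
v_c = rng.normal(0, 0.1, d)           # center vector v_c

o = 42                                # observed context word
neg = rng.integers(0, V, size=k)      # k sampled negative words (unigram^0.75 distribution in practice)

# loss = -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c); only k+1 rows of U are touched
loss = -np.log(sigmoid(U[o] @ v_c)) - np.log(sigmoid(-(U[neg] @ v_c))).sum()
print(loss)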

Hierarchical Softmax

http://building-babylon.net/2017/08/01/hierarchical-softmax/

The vocabulary is arranged in a binary tree, and the probability of a word is the product of the branching decisions along its path, e.g.

P(time | C) = P_{n0}(right | C) · P_{n1}(left | C) · P_{n2}(right | C)

The probability of a child node is always smaller than that of its mother node.

word2vec: Connection to LSA-Style Matrix Factorization

PMI(w, c) = log( N(w, c) × |V| / (N(w) · N(c)) )

PMI = X ≈ X̂ = V_d Σ_d U_d^T

[Figure from [?]: the word-context PMI matrix X is approximated by the low-rank product V_d Σ_d U_d^T; the rows of V_d Σ_d serve as word vectors, the rows of U_d as context vectors]

[?] Levy et al., TACL 2015, http://www.aclweb.org/anthology/Q15-1016
PMI: Point-wise Mutual Information (Blog Post with code▲). When is PMI positive? PMI is positive if the two words tend to co-occur, 0 if they occur together as often as one would expect by chance, and less than 0 if they are in complementary distribution. [?]
Famous paper reinterpreting the results of the word2vec algorithm in a declarative fashion [?]
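A minimal NumPy sketch of this view (toy counts only): build a PMI matrix from co-occurrence counts, factorize it with a truncated SVD, and use the left factor as word vectors. This is not the exact Levy et al. setup (they use positive PMI and further hyperparameters), just the core idea.

import numpy as np

counts = np.array([[0, 2, 1],      # toy word-context co-occurrence counts N(w, c)
                   [2, 0, 1],
                   [1, 1, 0]], dtype=float)

total = counts.sum()
p_wc = counts / total
p_w = p_wc.sum(axis=1, keepdims=True)
p_c = p_wc.sum(axis=0, keepdims=True)

with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))
pmi[np.isneginf(pmi)] = 0.0            # common simplification: zero out unseen pairs

U, S, Vt = np.linalg.svd(pmi)
d = 2
word_vectors = U[:, :d] * S[:d]        # low-rank word representations
context_vectors = Vt[:d].T
print(word_vectors)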

9.2.2 GloVe
GloVe2: Global Vectors for Word Representations

• GloVe combines count-based and prediction methods
• Training is performed on aggregated global word-word co-occurrence statistics from a corpus

Window-based co-occurrence matrix

• Example corpus:
  I like deep learning.
  I like NLP.
  I enjoy flying.

counts    I  like enjoy deep learning NLP flying .
I         0  2    1     0    0        0   0      0
like      2  0    0     1    0        1   0      0
enjoy     1  0    0     0    0        0   1      0
deep      0  1    0     0    1        0   0      0
learning  0  0    0     1    0        0   0      1
NLP       0  1    0     0    0        0   0      1
flying    0  0    1     0    0        0   0      1
.         0  0    0     0    1        1   1      0
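A minimal sketch of how such a window-based co-occurrence matrix can be computed (window size ±1, counts do not cross sentence boundaries; variable names are made up):

from collections import defaultdict

sentences = [["I", "like", "deep", "learning", "."],
             ["I", "like", "NLP", "."],
             ["I", "enjoy", "flying", "."]]
window = 1

counts = defaultdict(int)
for sent in sentences:
    for i, w in enumerate(sent):
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                counts[(w, sent[j])] += 1

print(counts[("I", "like")])   # 2, as in the table above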

GloVe vs word2vec

• GloVe learns from the global co-occurrence matrix directly

• word2vec uses sliding windows running over all training samples

GloVe Co-occurrence Measure P_ij

X = co-occurrence count matrix; X_ij = number of times word j occurs in the context of word i; X_i = ∑_k X_ik = number of times any word appears in the context of word i; P_ij = P(j|i) = X_ij / X_i = probability that word j appears in the context of word i.

2 [?] http://www-nlp.stanford.edu/projects/glove/
[?]

GloVe: Properties
“The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words’ probability of co-occurrence.”

Nice Properties

• Trains quickly

• Good for small and big corpora

• Good for word vectors with few or many dimensions

GloVe: Technicalities

X_final = U + V   (the final embeddings are the sum of the word and context embedding matrices)

[Figure from [?]: the GloVe objective relates the dot product of two word vectors to log P_ij, the probability that word j appears in the context of word i]

Pennington et al., EMNLP 2014, https://www.aclweb.org/anthology/D14-1162
GloVe: Punishing Rare Events and Reducing the Effect of Frequent Events

A weighting of the co-occurrence counts discards rare, noisy co-occurrences and prevents frequent co-occurrences from being overweighted.

[?] Pennington et al., EMNLP 2014, https://www.aclweb.org/anthology/D14-1162

GloVe: Connection to LSA

X_final = U + V

The idea is close to factorizing the log of the co-occurrence matrix (closely related to LSA).
The embedding size is analogous to the diagonal matrix dimension in SVD.
Pennington et al., EMNLP 2014, https://www.aclweb.org/anthology/D14-1162
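As a concrete illustration (not part of the slides), the GloVe paper implements this weighting with a clipped power function f and uses it in a weighted least-squares loss; a minimal NumPy sketch of the weighting with the paper's default values x_max = 100 and alpha = 0.75:

import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    # small counts get a small weight, counts above x_max are capped at 1
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

print(glove_weight(np.array([0.0, 1.0, 10.0, 100.0, 5000.0])))
# the weighted least-squares term for one pair (i, j) would then be
# glove_weight(X[i, j]) * (w_i @ w_context_j + b_i + b_j - np.log(X[i, j])) ** 2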

9.2.3 Properties
Limits/Properties of Embeddings

• Type of similarity varies and is not trivial to control. Which is more similar? dog, cat,
tiger.

• Antonyms: often appear as semantically close.

• Black-sheep effect: Trivial standard features are verbalized less often than salient ones:
black sheep vs. white sheep.

• Corpus bias: Stereotypes of text corpora are reflected in semantic space. Not always
desirable.

• Context independence: static embeddings are context-independent. What happens to ambiguous words?

Word Embeddings and Ambiguity

• Question: can word embeddings distinguish different meanings?

• Simple answer: No.

• More precise answer: a word embedding of an ambiguous word is a mixture of meanings


according to their occurrence in the corpus over which the vectors are learned.

• Next question: can we divide apart the different meanings in the vectors?

Subtracting Word Meanings3

Which word is closest in the vector space if we subtract the meaning of “banking” from the meaning of English “bank”?

Different Vectors for Different Word Senses

[Figure: Nearest words to multi-sense representations for “Apple”, one cluster per sense [?]]

3 http://vectors.nlpl.eu/explore/embeddings/en/calculator/
Variants

• of the classical word2vec method (=Skip-Gram).

• can compute a fixed number of distinct vectors (=MSSG).

• or compute a variable number of vectors adapted to the ambiguity (=NP-MSSG).

But how can I directly calculate the meaning of a word in context as a vector? Contextualized
embeddings . . .

Influence of Context on Embeddings


How big should the context be? What type of neighborhood should be chosen? Proximity at the word level? Proximity in syntactic distance?

• The choice of type and size of context influences the semantic space generated.

• A large window size makes embeddings “more thematic”.

• A small window size makes embeddings “more syntactic” and more sensitive to local
co-occurrences.

• A syntactic context makes the words more functionally/semantically similar and homo-
geneous with respect to parts of speech.

Dependency Syntactic Contexts

Example sentence: “Australian scientist discovers star with telescope”

WORD        CONTEXTS
australian  scientist/amod−1
scientist   australian/amod, discovers/nsubj−1
discovers   scientist/nsubj, star/dobj, telescope/prep_with
star        discovers/dobj−1
telescope   discovers/prep_with−1

Figure 1 in [?]: Dependency-based context extraction example. Top: preposition relations are collapsed into single arcs, making telescope a direct modifier of discovers. Bottom: the contexts extracted for each word in the sentence.

• The dependency parse of the sentence is simplified: relations that include a preposition are “collapsed” by directly connecting the head and the object of the preposition and subsuming the preposition into the dependency label (prep_with).
• The contexts of a word are its directly dependent words (→ WORD/DEP) . . .
• . . . and its head word (WORD/DEP−1), where the inverse relation is marked.
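A minimal sketch of how such dependency contexts could be extracted with spaCy (assuming the en_core_web_sm model is installed; the relation names spaCy produces may differ slightly from the figure, and prepositions are not collapsed here):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Australian scientist discovers star with telescope")

for token in doc:
    contexts = [f"{child.text}/{child.dep_}" for child in token.children]
    if token.head is not token:                      # not the root of the parse
        contexts.append(f"{token.head.text}/{token.dep_}-1")
    print(token.text, contexts)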
Example: Influence of Context Width and Type
Most similar words of a “target word” with different contexts trained on Wikipedia.
Which column corresponds to the dependency context (DEP)? Which one corresponds to a CBOW model with a window size of +/- 2 (CBOW2) or +/- 5 (CBOW5) context words?

target word      A                  B                   C
                 hallows            sunnydale           collinwood
hogwarts         half-blood         garderobe           calarts
                 malfoy             blandings           greendale
                 snape              collinwood          millfield
                 nondeterministic   non-deterministic   pauling
                 non-deterministic  finite-state        hotelling
turing           computability      nondeterministic    heting
                 deterministic      buchi               lessing
                 finite-state       primality           hamming
                 gainesville        fla                 texas
                 fla                alabama             louisiana
florida          jacksonville       gainesville         georgia
                 tampa              tallahassee         california
                 lauderdale         texas               carolina
                 aspect-oriented    aspect-oriented     event-driven
                 smalltalk          event-driven        domain-specific
object-oriented  event-driven       objective-c         rule-based
                 prolog             dataflow            data-driven
                 domain-specific    4gl                 human-centered
                 singing            singing             singing
                 dance              dance               rapping
dancing          dances             dances              breakdancing
                 dancers            breakdancing        miming
                 tap-dancing        clowning            busking

[?] A=? B=? C=?

9.2.4 Tooling
Word Embedding Matrix
• Initialize all word vectors randomly to form a word embedding matrix L ∈ R^(n × |V|) (one n-dimensional column per vocabulary word: the, cat, mat, . . .)
• These are the word features we want to learn
• Also called a look-up table
• Conceptually you get a word’s vector by left-multiplying a one-hot vector e by L: x = Le
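A minimal NumPy sketch of this equivalence (toy numbers): multiplying the embedding matrix by a one-hot vector selects exactly one column, which is why frameworks implement it as a cheap index lookup instead of a matrix multiplication.

import numpy as np

L = np.array([[0.1, 0.4, -0.2],       # embedding matrix, n=2 dimensions, |V|=3 words
              [0.3, -0.5, 0.7]])      # columns: "the", "cat", "mat"

e = np.array([0.0, 1.0, 0.0])         # one-hot vector for "cat"
x = L @ e                             # x = Le
print(x, np.allclose(x, L[:, 1]))     # identical to simply looking up column 1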

Embeddings in PyTorch▲

Using/loading pretrained embeddings
import torch
import torch.nn as nn

weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
embedding = nn.Embedding.from_pretrained(weight)
# Get embeddings for index 1 via index lookup
embedding(torch.LongTensor([1]))
>>> tensor([[ 4.0000, 5.1000, 6.3000]])

Frozen Embedding Layers in Keras4


Using pretrained embeddings as a layer in Keras

• GloVe embedding can be used as a “normal” layer

• “Frozen” means no weight changes during backpropagation

• Example code for document classification with convolution on frozen embeddings

from tensorflow import keras
from tensorflow.keras.layers import Embedding

# num_tokens, embedding_dim and embedding_matrix come from the loaded GloVe vectors
embedding_layer = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=\
        keras.initializers.Constant(embedding_matrix),
    trainable=False,
)

9.2.5 fastText

Facebook’s fastText▲ : Embeddings for Classification


4
https://github.com/keras-team/keras-io/blob/master/examples/nlp/pretrained_word_embeddings.py

“Bag of Tricks for Efficient Text Classification” [?]

• Learns/uses task-specific low-dimensional continuous representations for words (“Word Embeddings”)

• Allows word N-grams as features, e.g. bigrams help in sentiment analysis!

• Emits k-best labels including their probabilities for a text

• Deals efficiently with large label sets (hierarchical softmax)

• Applies the Hashing Trick for high-dimensional feature spaces [?]

Spacy’s tok2vec▲ uses a more memory-friendly approach.
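As a side note (not from the slides), the hashing trick mentioned above can be sketched in a few lines: instead of keeping a dictionary of all word n-grams, each n-gram is hashed into one of a fixed number of buckets, and the bucket index selects the embedding row. The bucket count and the hash function below are illustrative choices.

import zlib

NUM_BUCKETS = 2_000_000   # fixed embedding table size, independent of the number of distinct n-grams

def ngram_bucket(ngram: str) -> int:
    # deterministic hash of the n-gram string into [0, NUM_BUCKETS)
    return zlib.crc32(ngram.encode("utf-8")) % NUM_BUCKETS

tokens = ["this", "flick", "is", "fun"]
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
feature_ids = [ngram_bucket(ng) for ng in tokens + bigrams]
print(feature_ids)   # indices into a single shared embedding matrix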

Idea Behind fastText: CBOW with Labels Instead of Words

[Figure 1 from [?]: Model architecture of fastText for a sentence with N n-gram features x_1, . . . , x_N. The features are embedded and averaged to form the hidden variable; a softmax over the output layer computes the probability distribution over the predefined classes.]

The architecture is similar to the CBOW model of Mikolov et al. (2013), where the middle word is replaced by a label. For a set of N documents, training minimizes the negative log-likelihood over the classes:

−(1/N) ∑_{n=1}^{N} y_n log(f(B A x_n))

where x_n is the normalized bag of features of the n-th document, y_n the label, A and B the weight matrices, and f the softmax function.

• Lookup matrix A that embeds features (word n-grams)
• Matrix B that connects the hidden layer with the output layer
• Softmax f computing the probability distribution for the output labels
• SGD optimization of matrices A and B
• For supervised classification, matrix A produces task-specific embeddings
• Details in [?]

FastText-Style Embedding-based CBOW Classifier in Pytorch
Linear classifier on top of task-specific dense representations5

EmbeddingBag▲ uses efficient mean aggregation per default. Offsets are for 1D representation.
5 https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
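A minimal PyTorch sketch of such a classifier (hyperparameters and names are illustrative, following the idea of the linked tutorial): an EmbeddingBag layer averages the embeddings of all token/n-gram ids of a text, and a single linear layer maps the resulting dense representation to the class scores.

import torch
import torch.nn as nn

class FastTextStyleClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_classes):
        super().__init__()
        # mode="mean": average the embeddings of all features of a text (CBOW-style)
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, mode="mean")
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids, offsets):
        hidden = self.embedding(token_ids, offsets)   # one dense vector per text
        return self.fc(hidden)                        # class scores (softmax is part of the loss)

model = FastTextStyleClassifier(vocab_size=50_000, embed_dim=64, num_classes=4)
# two texts packed into one 1D tensor; offsets mark where each text starts
token_ids = torch.tensor([3, 15, 7, 42, 8, 3], dtype=torch.long)
offsets = torch.tensor([0, 4], dtype=torch.long)
print(model(token_ids, offsets).shape)   # torch.Size([2, 4])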
Model AG Sogou DBP Yelp P. Yelp F. Yah. A. Amz. F. Amz. P.
BoW (Zhang et al., 2015) 88.8 92.9 96.6 92.2 58.0 68.9 54.6 90.4
ngrams (Zhang et al., 2015) 92.0 97.1 98.6 95.6 56.3 68.5 54.3 92.0
ngrams TFIDF (Zhang et al., 2015) 92.4 97.2 98.7 95.4 54.8 68.5 52.4 91.5
char-CNN (Zhang and LeCun, 2015) 87.2 95.1 98.3 94.7 62.0 71.2 59.5 94.5
char-CRNN (Xiao and Cho, 2016) 91.4 95.2 98.6 94.5 61.8 71.7 59.2 94.1
VDCNN (Conneau et al., 2016) 91.3 96.8 98.7 95.7 64.7 73.4 63.0 95.7
fastText, h = 10 91.5 93.9 98.1 93.8 60.4 72.0 55.8 91.2
fastText, h = 10, bigram 92.5 96.8 98.6 95.7 63.9 72.3 60.2 94.6
Table 1: Test accuracy [%] on sentiment datasets. FastText has been run with the same parameters for all the datasets. It has
10 hidden units and we evaluate it with and without bigrams. For char-CNN, we show the best reported numbers without data
augmentation.

Zhang and LeCun (2015) Conneau et al. (2016) fastText


small char-CNN big char-CNN depth=9 depth=17 depth=29 h = 10, bigram
AG 1h 3h 24m 37m 51m 1s
Sogou - - 25m 41m 56m 7s
DBpedia 2h 5h 27m 44m 1h 2s
Yelp P. - - 28m 43m 1h09 3s
Yelp F. - - 29m 45m 1h12 4s
Yah. A. 8h 1d 1h 1h33 2h 5s
Amz. F. 2d 5d 2h45 4h20 7h 9s
Amz. P. 2d 5d 2h45 4h25 7h 10s
Table 2: Training time for a single epoch on sentiment analysis datasets compared to char-CNN and VDCNN.

Model       Yelp’13  Yelp’14  Yelp’15  IMDB
SVM+TF      59.8     61.8     62.4     40.5
CNN         59.7     61.0     61.5     37.5
Conv-GRNN   63.7     65.5     66.0     42.5
LSTM-GRNN   65.1     67.1     67.6     45.3
fastText    64.2     66.2     66.6     45.2
Table 3: Comparison with Tang et al. (2015). The hyper-parameters are chosen on the validation set. We report the test accuracy.
Tutorial Using Pre-trained Embedding Initialization with EmbeddingBag for Classifica-
tion
See Microsoft Learn Tutorial▲

Summary

• Symbolic representations of words lead to high-dimensional and sparse vectors that are not suitable for expressing similarities.

• Continuous, low-dimensional dense representations result in more suitable semantic spaces for many NLP tasks.

• Word Embeddings in combination with neural approaches have revolutionized NLP re-
search and text analysis in recent years.

• word2vec representation learning was the first success story of transfer learning!

Further Study

• Mandatory reading: Chapters 15.1 to 15.7 from dl2ai▲

• Mandatory reading: Chapter 6.2 onwards from [?] if you have never heard of word em-
beddings

• Chapter 10 of [?] and blog post to embeddings and analogies 6

• Demo for typical computations in embedding spaces: http://vectors.nlpl.eu/explore/embeddings/en/calculator/

• GloVe reimplementation in Python http://www.foldl.me/2014/glove-python/

6
https://carl-allen.github.io/nlp/2019/07/01/explaining-analogies-explained.html

Chapter 10

Assignment 4: Two Exclusive Options:


A XOR B

10.1 Options
10.1.1 A
Assignment 4: Option A: Paper Dissection
Identify an interesting and high-quality (short) NLP paper

• Interesting: landmark paper from lecture/reading or ACL▲

• Interesting: paper from nlpprogress.com or on HF Leaderboards▲

• If paper is long and covers many Machine Learning approaches, focus on the best or
clearest setup

Understand the paper

• Read the paper “quickly and efficiently”

• Go along the IMRaD schema (next slide)

• If you don’t understand some concepts, search introductory resources (WP pages, quora,
book chapters, chatgpt, blogs, videos) that help.

• But do not waste too much time researching things that are totally unclear. Try to formulate/pinpoint what you don’t understand and what is unclear.

IMRaD: Introduction, Methods, Results and Discussion1


Efficient reading order may not be linear order

• Abstract

• Conclusion

• Look at examples/figures/tables
1
https://francescolelli.info/thesis/read-scientific-papers-quickly-and-effectively/

• Introduction

• Methods

• Results

• Discussion

Writing Your Paper Dissection: Max. 2 Pages


Follow these questions in order!

1. What is it about? What problem does it try to solve? Why is it interesting?

2. Which ML methods are used? What is the main innovation of the paper?

3. What are the takeaways?

4. What are possible problems of the approach? Think critically!

Some rules

• What does one need to know for understanding the paper? List the resources that were
helpful for you.

• You can also copy/paste the most important figure/table

• You can add a mind map if you like

• Do not just use ChatGPT output!

10.1.2 B
Option B: Short Student Talk

• 8 minutes + 2 minutes questions

• In 3 slots in class, 3 slots in tutorial in November/December sessions

• Or: create a short screencast (e.g. with Screencastify▲) for “future” students (no perfectionism asked for!); e.g. a walkthrough of a code example

Topics

• A (short) paper on a technical or social/ethical aspect of ML in NLP

• A technical topic: GPU/TPUs; hierarchical softmax; feature hashing; different optimizers (Adam); walkthrough of the code of a paper (Papers with Code▲)

10.2 Organization
Organization and Deadlines
For this exercise, you can team up in pairs or work alone. Yes, no teams of 3 students allowed.
Communicate your topics and suggestions via Feedback-Forum in OLAT

• For talks: Reply ASAP in forum thread “Student Talks” in OLAT and email me at the
same time.

• Paper dissections: Friday 19.1.2024 23:59: Hand-in your PDF in OLAT

• Screencasts: Friday 19.1.2024 23:59: Hand-in Link to screencast in OLAT

Chapter 11

Convolutional Neural Networks (CNN)

Learning Goals

• Understand the key concepts behind Convolutional Neural Networks (CNNs/ConvNets)

• Understand the different functions of layers in CNNs: Convolution and Pooling

• Know how to implement CNNs with high-level interfaces in pytorch, tensorflow and
keras

• Know about the classical MNIST task and its solution with CNNs in tensorflow

11.1 Motivation
11.1.1 Local Features
Sequence Classification Tasks
Classical NLP Sequence Labeling Task(s)
x = Evidence (the word), y = Class (lemma, POS tag, NER tag)

Word          Lemma        POS Tag   NER Tag
Anton         Anton        NE        B-PER
Schürmanns    Schürmann    NE        I-PER
Reise         Reise        NN        O
über          über         APPR      O
das           d            ART       O
Sustenjoch    Sustenjoch   NE        B-GEO
im            im           APPRART   O
Jahre         Jahr         NN        O
1881          @card@       CARD      O

What would a simple neural approach look like?
Sliding window approach: Local features for local predictions! Chapter 8.2.1 in [?]

Sliding Windows for Local Prediction Problems

[Figure 1 from [?]: Window approach network: the input window (the word of interest plus its neighbors, e.g. “cat sat on the mat”) with K features per word is mapped through a lookup table of word/feature embeddings; the vectors are concatenated and passed through Linear (M1), HardTanh and Linear (M2) layers; the final layer has #tags output units.]

Source: [?]
Simple FFNN Architecture

• Input: words and externally available features. Which ones might be useful?
• Vectorization: Embeddings of words and features as input
• Simplest architecture
• Role of first linear layer? Linear modeling
• Role of HardTanh? Adding non-linearity
• Role of last linear layer? Dimension reduction of output to number of tags

Representation of Features: Sparse vs Dense
[Figure from [?, 91]: feature representation for F1: current word = dog, F2: preceding word = the, F3: preceding PoS = DET.
(a) one-hot encoding of the features and their combinations (e.g. w=dog, pw=the, pt=DET, w=dog&pw=the, w=dog&pt=DET) yields one long sparse vector x.
(b) dense embeddings of the features (word embeddings and PoS embeddings) are concatenated; the network learns how to combine the evidence.]
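A minimal PyTorch sketch of such a window-based tagger with dense feature embeddings (all sizes and names are illustrative, not the exact setup of the paper): word and PoS embeddings of a window are concatenated and passed through Linear, HardTanh and Linear layers.

import torch
import torch.nn as nn

class WindowTagger(nn.Module):
    def __init__(self, vocab_size, pos_size, word_dim=50, pos_dim=10,
                 window=3, hidden=100, num_tags=9):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.pos_emb = nn.Embedding(pos_size, pos_dim)
        in_dim = window * (word_dim + pos_dim)    # concatenated window representation
        self.net = nn.Sequential(nn.Linear(in_dim, hidden),
                                 nn.Hardtanh(),
                                 nn.Linear(hidden, num_tags))

    def forward(self, word_ids, pos_ids):              # shapes: (batch, window)
        feats = torch.cat([self.word_emb(word_ids), self.pos_emb(pos_ids)], dim=-1)
        return self.net(feats.flatten(start_dim=1))    # (batch, num_tags)

tagger = WindowTagger(vocab_size=10_000, pos_size=20)
scores = tagger(torch.randint(0, 10_000, (2, 3)), torch.randint(0, 20, (2, 3)))
print(scores.shape)    # torch.Size([2, 9])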

Spotting Relevant Information in Large Data


Which words are relevant for predicting sentence sentiments? Negative, neutral, positive
“Still, this flick is fun and host to some truly excellent sequences.”

What is the name for spotting relevant information in data?


Feature extraction

Why can a simple CBOW representation fed into an FFNN work to some degree?
As an approximation, lexical clues are informative regardless of their position!
Why only as a rough approximation?

Global and Local Ordering


Why is local ordering relevant?

• Negation: “it was not good, it was actually quite bad”

• Constructions/Multiword Expressions: “avoids the obvious”

Semantic compositionality of words matters. . . N-grams are more informative

What does the following sentence pair illustrate?

“Montias pumps a lot of energy into his nuanced narrative, and surrounds himself with a cast of quirky—but not stereotyped—street characters” vs “Montias surrounds himself with a cast of quirky—but not stereotyped—street characters, thereby pumping a lot of energy into his nuanced narrative.”

Why is global ordering not so relevant?


Variability of expression
Analogy to images?

A Naive Approach: Embedding of N-Grams


What are the problems of embedding word n-grams?

• Exponential growth of types of n-grams

• Huge embedding matrices if done naively

• Sparsity of n-grams even with large text corpora

• Opaqueness of n-gram component words, independence of n-grams: if “very good” was


seen in training, but “quite good” not, the model is not able to deduce anything from the
shared component words.

• How can we deal with these problems?

BTW: “A La Carte Embedding: Cheap but Effective Induction of Semantic Feature Vectors”
[?]
Interesting paper that tackles (among other tasks) the problem of deriving good embeddings
for unseen or rare n-grams. nonce2vec problem [?].

Summary: Problems for FFNNs

• N-grams for capturing local information exponentially grow the input size

• Text windows and their features are matrices

• Flattening/Concatenation of feature matrices creates large vectors (high-capacity networks)

• FFNNs are not prepared for global ordering invariance (shifted inputs with similar function)

11.2 CNN

CNN Architecture: Convolution-and-Pooling
Design goals of CNNs

• identify indicative local evidence in a large structure (convolution)

• combine it into fixed sized tensor representation (pooling) that

• optimally supports the prediction task at hand.

Representation learning function

• Automatic learning of feature extraction (vs. manual feature engineering)

• Extract meaningful substructures and integrate them for global tasks

Inspiration for NLP from neural computer vision


“Natural Language Processing (Almost) from Scratch” [?] popularized CNNs by showing their
effectiveness on several tasks!

11.2.1 Convolution
Intuition: Visual Filters Based on Matrix Convolution
Convolution kernels are like special “glasses” to see different aspects of rich information sources

[Figure: the same image filtered with a blurring kernel and with an edge-detection kernel]
Classical LeNet Architecture for Object Recognition

Idea: From pixels to edges to shapes to classification

[Figure 13-9: Typical CNN architecture: convolutional layers (+ReLU) and pooling layers in alternation, followed by fully connected layers and a softmax output]

Typical CNN architectures stack a few convolutional layers (each one generally followed by a ReLU layer), then a pooling layer, then another few convolutional layers (+ReLU), then another pooling layer, and so on. The image gets smaller and smaller as it progresses through the network, but it also typically gets deeper and deeper (i.e., with more feature maps) thanks to the convolutional layers. At the top of the stack, a regular feedforward neural network is added, composed of a few fully connected layers (+ReLUs), and the final layer outputs the prediction (e.g., a softmax layer that outputs estimated class probabilities).

A common mistake is to use convolution kernels that are too large. You can often get the same effect as a 9 × 9 kernel by stacking two 3 × 3 kernels on top of each other, for a lot less compute.

Sandwich architecture with convolution and pooling layers

CNNs dominate computer vision since 2012. Live browser-based demo▲

https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
Which problems are solved here?
Well-known object recognition model: https://www.v7labs.com/blog/yolo-object-detection

Applications of CNN: From Images to Text

https://cs.stanford.edu/people/karpathy/neuraltalk2/demo.html
What problem should have been solved here?

Applications of CNN and Transformers: Image Generation

https://openai.com/blog/dall-e/
Another AI image creation site▲ to play with (incl. NFTs▲): Drawing by texting...

CNNs for Automatic Speech Recognition (ASR)

[Figure: ASR pipeline: speech input, signal processing, MFCC features, acoustic model, pronunciation dictionary, language model, text output (“hello”)]

CNNs can be used in speech recognition, e.g., for acoustic modeling

• Input: time sequence of features (MFCC) for the speech signal
• Task of acoustic modeling: Determine which phone has been uttered
• Input: features representing the speech signal; output: phone

[?]
The phones▲ are then matched against the phonetic representation of a pronunciation dictionary, resulting in textual word candidates. A language model then finds the most probable sequence of words.

CNNs for Automatic Speech Recognition

CNNs for acoustic modeling have been shown to improve ASR performance.
Sainath et al. (IBM), 2013: WER results (in %)

Model   news 50h   news 400h   Switchboard 300h
HMM     18.1       13.8        25.2
DNN     15.8       13.3        23.5
CNN     15.0       12.0        21.9

[?] Now it belongs to the state-of-the-art techniques
• WER: Word Error Rate: Percentage of wrong words
• HMM (Hidden Markov Models): Traditional ML discrete sequence modeling
• DNN (Dense Neural Networks): Alternative term for fully connected feed-forward Neural Network (FFNN)

CNNs for ASR: Wav2vec 1.0▲ (2019)

• Evaluated on 81 hours of read WSJ articles

• Trained on 1000 hours (LibriSpeech▲ dataset) (DeepSpeech 12,000 hours)

• Directly vectorizes 30ms wave form data by CNNs using a self-supervision approach

CNNs for ASR: Wav2Vec Local and Context Encoder

• Input encoder: 512-dimensional output vector

• Context encoder: takes input encoder output from 210ms and outputs 512-dimensional
context representation

• Self-Supervision goal: Learn to recognize the next 12 input encoder vectors (each out of
10 randomly selected vectors) from the context representation input only

CNNs for ASR: Wav2Vec Architecture

CNNs for Text-To-Speech: WaveNet▲ [?]

• Idea: A generative dilated CNN directly produces the sound wave signal (therefore it
can also learn to generate music).

• Evaluation measure: Human ratings on a 1-5 scale

• Parametric system is RNN-based and concatenative is HMM-based Google TTS from 2015

Narrow Matrix Convolution: Simple Graphical Example

• Sliding window (aka convolutional kernel, filter, feature detector) with the same weights applied all over
• Basic convolution is just a weighted sum
• Sum of element-wise multiplication
• Where are the weights?
• Stride = sliding step size: Which stride size?

2D Convolution with Small or Large Filters

Convolutional Layer (Cross-Correlation Operation)
Convolutional Layer

• A convolutional layer learns filters (each is an m × n matrix of weights)
• These filter matrices are shifted through the input matrix
• The shift is called “stride”, e.g., = 1
• At each position, convolution is applied:

y[k, l] = ∑_{j=−n/2}^{n/2} ∑_{i=−m/2}^{m/2} f[i, j] · x[k − i, l − j]

Intuition:
• Learn local features which are important for classification
• Position-independent extraction of features
[?]

~ The filter matrix uses a kernel coordinate system with 0 at the central position (equidistant from the borders).
How would a normal matrix notation for the filter look like?
Convolutional Filters Without Bias

Input x (5 × 5):          Filter f (3 × 3):      Result y (3 × 3):
 5  3 -1  2  4             1  2  1                 5  -2   3
-2 -1  0  1  0             0  0  0                -6  -6   2
 0  2  1  1  1            -1 -2 -1                 1  -5 -12
 0 -1  4 -1 -2
 1  0  3  4  5

y[0,0] = 5·1 + 3·2 + (−1)·1 + (−2)·0 + (−1)·0 + 0·0 + 0·(−1) + 2·(−2) + 1·(−1) = 5

y[k, l] = ∑_{j=−1}^{1} ∑_{i=−1}^{1} f[i, j] · x[k − i, l − j]

with x: input image, f: convolutional filter, y: result

[?]
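A minimal NumPy sketch (not from the slides) that reproduces the example above by sliding the 3 × 3 filter over the 5 × 5 input and summing the element-wise products at each position:

import numpy as np

x = np.array([[ 5,  3, -1,  2,  4],
              [-2, -1,  0,  1,  0],
              [ 0,  2,  1,  1,  1],
              [ 0, -1,  4, -1, -2],
              [ 1,  0,  3,  4,  5]])
f = np.array([[ 1,  2,  1],
              [ 0,  0,  0],
              [-1, -2, -1]])

out = np.zeros((x.shape[0] - f.shape[0] + 1, x.shape[1] - f.shape[1] + 1), dtype=int)
for k in range(out.shape[0]):
    for l in range(out.shape[1]):
        out[k, l] = np.sum(x[k:k + 3, l:l + 3] * f)   # element-wise product, then sum

print(out)
# [[  5  -2   3]
#  [ -6  -6   2]
#  [  1  -5 -12]]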

Stride Size: Summarizing Information

11.2.2 1D
One 1D Convolution Over Text: Sliding Window

• Sequence of n words w1:n = (w1 , . . . , wn )

• Lookup embedding function wi = E(wi). Signature of E? E : N → R^demb

• Sliding window of size k: How many in a sequence of length n?

• Vector concatenation xi of the ith window of size k: xi = ⊕(wi , . . . , wi+k−1 )

• Dimensionality of xi ? xi ∈ R^(k·demb)

• Weight filter u for dot-products zi = xi · u
• (Optional) Non-linear function g on weighted sum: pi = g(zi )

• Dimensionality of p for k=3 for example below?

• What is the stride?

[Figure 13.3 from [?]: two-layer hierarchical convolution with k = 2 over the sentence “the actual service was not very good”]

Many 1D Convolutions Over Text

• A single filter u ∈ Rk∗demb does not have enough modeling capacity!
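As a concrete illustration (a sketch, not from the lecture), PyTorch's nn.Conv1d applies l such filters in parallel; note that the embedding dimension is treated as the channel dimension, as mentioned later in this chapter:

import torch
import torch.nn as nn

batch, d_emb, n_words = 1, 50, 7            # "the actual service was not very good"
emb = torch.randn(batch, d_emb, n_words)    # Conv1d expects (batch, channels, length)

conv = nn.Conv1d(in_channels=d_emb, out_channels=100, kernel_size=3)  # l=100 filters, k=3
p = torch.tanh(conv(emb))                   # (1, 100, 5): one value per filter and window position
pooled, _ = p.max(dim=2)                    # max-over-time pooling -> (1, 100)
print(p.shape, pooled.shape)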


Strides, Dilation and Pooling So far, the convolution operation is applied to each k -word win-
dow in the sequence, i.e., windows starting at indices 1; 2; 3; : : :. is is said to have a stride of
size 1. Larger strides are also possible, i.e., with a stride of size 2 the convolution operation will
be applied to windows starting at indices 1; 3; 5;192: : :. More generally, we define CONVk;s as:
p1Wm DCONVk;s .w1Wn /
U ;b
(13.8)
pi Dg.˚.w1C.i 1/sW.sCk/i /  U C b/;
get n k C 1 vectors p1Wn kC1 . is is called a narrow convolution. An alte
sentence with k 1 padding-words to each side, resulting in n C k C 1 vecto
called a wide convolution [Kalchbrenner et al., 2014]. We use m to denote the
vectors.
• Let’s take l filters u1 , . . . , ul and also give them a bias b:
An Alternative Formulation of Convolutions In our description of convolu
pi = g(xi · U + b)
of n items w1Wn each item is associated with a d -dimensional vector, and th
• What nated
do we haveintohere? l linear
a large 1 models
 d  nwith a non-linearity
sentence applied
vector. e. .convolution
. network with
• What and output values
is the`dimensionality of U,isbthen
and pibased
? on a k  d  ` matrix. is matrix is applie
• What 1 is   n sentence
thedsignature of g? Seematrix that
Universal correspond
functions of NumPyto k -word
▲ for windows.
vectorizing wrappers Each such m
of functions that applyEach
in ` values. element-wise!
of these k values can be thought of as the result of a do
k  d  1 vector (a row in the matrix) and a sentence segment.
Applying Non-linearity to Convolution Result
Another (equivalent) formulation that is often used in the literature i
vectors are stacked on top of each other, resulting in an n  d sentence mat
operation is then performed by sliding ` different k  d matrices (called “
over the sentence matrix, and performing a matrix convolution between each
responding sentence-matrix segment. e matrix convolution operation bet
defined as performing element-wise multiplication of the two matrices, and
Each of the ` sentence-kernel convolution operations produces a single v
values. It is easy to convince oneself that the two approaches are indeed equ
that each kernel corresponds to a row in the k  d  ` matrix, and the convo

corresponds to a dot-product with a matrix row.
https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/

Figure 13.1 show narrow and wide convolutions in the two notations.
Example: 1D Convolution
e
ic

as
rv

od
al

w
se

y
tu

go
er
no
e
ac

al

ic

tv

*PAD*
ry
tu

rv

as
e

no
th

ve
ac

se

the
actual
service
was
not
very
the actual service was not very good good
*PAD*
(a) and how many filters?
Which dimensions (b)

1D vs 2D: Text as an Embedding Image...


Figure 13.1: e inputs and outputs of a narrow and a wide convolution in the
and the vector-stacking notations. (a) A narrow convolution with a window
dimensional output (` D 3), in the vector-concatenation notation. (b) A wide co
dow of size k D 2, a 3-dimensional output (` D 3), in the vector-stacking notat

193
In NLP, CNN has become popular for sentence modeling
Kalchbrenner et al., 2014
Kim, 2014
etc...
RS: CONVOLUTIONAL NEURAL NETWORKS
If the input is a sentence, where can we get the “image” for the CNN?
ons How many)vectors do word
p ieach
Represent we have? For a(e.g.,
as a vector sentence of length
with word n
embeddings)
ere are n k C ) 1 positions
A sentence in
canwhich
then beto start theassequence,
represented a matrix and we
C1 . is is called a narrow convolution. An alternative is to pad the
-words to each side, resulting in n C k C 1 vectors p1WnCkC1 . is is
I
chbrenner et al., 2014]. We use m to denote
like the number of resulting

sentence
this
movie
very
much
!
of Convolutions In our description of sentence convolutions
representation over a sequence

associated
~ with a d -dimensional vector, and the vector are concate-
The sequence of words is still in just 1 direction! In “real images” neighborhood of pixels makes
Heike Adel CNNs 05.04.2019 34 / 67
sentence vector.
sense e convolution
in all directions, network
but not in texts! with a window of size k
ased on a k  d  ` matrix. is matrix is applied to segments of the
Horizontal vs Vertical Stacking of Windows
at correspondHorizontal
to k -word windows. Each such multiplication results
Stacking Stacking,
Horizontal
values can be •thought of as the result of a dot product between a
Input of n words with d dimension (channels): R1×n∗d
matrix) and a •sentence segment.
Convolution matrix U with windows of size k and l filters: Rk∗d×l
ormulation that is often used in1×k∗d the literature is one in which the n
• Window segment R is multiplied by Rk∗d×l , resulting in R1×l
each other, resulting in an n  d▲ sentence matrix. e convolution
• See NumPy’s hstack
by sliding ` different k  d matrices (called “kernels” or “filters”)
a matrix
d performingVertical Stackingconvolution between each kernel and the cor-
segment. e matrix convolution operation between two matrices is
• Input of n words with d dimensions (channels): Rn×d
nt-wise multiplication of the two matrices, and summing the results.
l different convolution kernels/filters size Rk×d
nel convolution• operations produces a single value, for a total of `
oneself that the• two
Theseapproaches
kernels slide over
are matrix
indeed rows while performing
equivalent, matrix convolution.
by observing
to a row in the• k ` matrix,is and
 d convolution
Matrix theofconvolution
the sum the elements ofwith a kernelproduct of two same-
the Hadamard
shape matrices.
t with a matrix row.
• See NumPy’s vstack▲
ow and wide convolutions in the two notations.
1D Convolution in the Vertical Stacking
od
y
t

go
er
no

tv

*PAD*
ry
as

*PAD* the
no

ve
w

the
actual the actual
service actual service
was service was
was not
not
not very
very very good
not very good good good *PAD*
*PAD*
(b)
194

outputs of a narrow and a wide convolution in the vector-concatenation


ions. (a) A narrow convolution with a window of size k D 2 and 3-
n the vector-concatenation notation. (b) A wide convolution with a win-
Including padding
What do the colors in the right matrix encode?
In keras/pyTorch frameworks, the embedding dimensions are treated as channel dimensions!

Narrow and Wide Convolutions: Padding


Is there something between narrow and wide?

Resulting dimensions in 1D: m = n − k + 1

Resulting dimensions in 1D: m = n + k − 1


~
( Typo in [?])

Animations of different convolution types▲ described in [?]

Padding

195
Advantages of Wide Convolution [?]
Convolution,
Wide
• All weights in a filter reach every element in the sequence, . . .

• specifically the ones at the edges.

• Especially, if the window size is large (8 to 10).

• But hierarchical convolution with smaller windows is preferable and covers the same
span. CNNs

• Robustly produces wellformed convoluted vectors, even if input is smaller than window.
Zero Padding
• Padding with which values? Typically a zero vector for CNNs.

Zero Padding
Problem: Filters are not well defined for data near the borders
) Zero padding
0 1
0 1 0 0 0 0 0 0
0 1 0 0 B 0C
B0 0 1 0 0 C
@2 3 3 3A ) B0 2 3 3 3 0C
B C
4 0 1 0 @0 4 0 1 0 0A
0 0 0 0 0 0
[?]
Note: Zero padding size is another hyperparameter

Heike Adel CNNs 196 05.04.2019 23 / 67


11.2.3 Pooling
Max-Pooling: Non-linear Down-sampling

• 2 x 2 kernel with stride 2

• Just keep the maximum value within one kernel window

• Idea: keep the relevant, that is, informative part with respect to the task

• Backpropagation tunes representations for profiling the relevant information

• Lossy reduction (but also generalization and avoiding overfitting)
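A quick PyTorch illustration of the two flavors (a sketch with made-up numbers): max pooling in a 2 × 2 window with stride 2 as used for images, versus max pooling over time as typically used in NLP, which keeps one value per filter for the whole sequence.

import torch
import torch.nn.functional as F

img = torch.tensor([[[[ 2.,  4., 3., 5.],
                      [-1.,  1., 0., 9.]]]])          # (batch, channel, 2, 4)
print(F.max_pool2d(img, kernel_size=2, stride=2))     # [[[[4., 9.]]]]

feats = torch.tensor([[[ 5.,  2., -1., -2.],
                       [11.,  5., -2.,  3.]]])        # (batch, filters, positions)
print(feats.max(dim=2).values)                        # max over time: [[ 5., 11.]]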


CNNs

N in Action

Convolution and Pooling in Action

4 1 6 9

max pooling
layer

2 4 3 5 5 6

-1 1 0 9 0 2

convolutional filters / kernels:


layer
0 1 0 0 1 -1 -1 1

2 3 3 3 0 1 2 0

4 0 1 0
Heike Adel [?] CNNs 05.04.2019 17 / 67
Read from bottom to top

Exercise: Convolution and Pooling in Action

197
CNNs

Exercise

? ?
? ?
max pooling Compute the output
layer
of the convolutional
? ? ? ?
and max pooling
? ? ? ? layer

Filter:
convolutional -1 1
layer
0 1 0 -2 1
2 -2
2 3 3 2 3

4 0 CNNs
1 0 2

[?]
N in Action Heike Adel CNNs 05.04.2019 25 / 67

Simple CNN Architecture in Action

fully-
connected
layer

non-linearity
4 1 6 9

max pooling
layer

2 4 3 5 5 6

-1 1 0 9 0 2

convolutional filters / kernels:


layer
0 1 0 0 1 -1 -1 1

2 3 3 3 0 1 2 0

4 0 1 0
Heike Adel [?] CNNs 05.04.2019 19 / 67
Note: no non-linearity after convolution

Pooling Windows for Images and Texts

198
CNNs

Pooling Windows

Typical in vision: max pooling in window

5 2 -1 -2
11 5 -2 3 11 3
-4 -6 -6 2 1 2
0 1 -5 -12
Typical in NLP: max pooling over time
) Only store the maximum value for the whole sentence

5 2 -1 -2 5
11 5 -2 3 11
-4CNNs
-6 for NLP
-6 2 2
0 1 -5 -12 1
CNN for NLP
[?] Heike Adel CNNs 05.04.2019 15 / 67

1D Convolution and Pooling for Feature Extraction

n = number of words
d = Dimension/channel of words
k = size of kernel

n k

[?] with some renaming

Convolution
Heike Adel
and Max-Pooling onCNNs
Text 05.04.2019 35 / 67
Average
c[j] = max pi[j] for all filter j Pooling
1<i≤m

199
particular sort of predictors, and max operation will pick on the most important predictor of each
type.
Figure 13.2 provides an illustration of the convolution and pooling process with a max-
pooling operation.

6×3
W max
the quick brown fox jumped over the lazy dog

the quick brown MUL+tanh


quick brown fox MUL+tanh
brown fox jumped MUL+tanh
fox jumped over MUL+tanh
jumped over the MUL+tanh
over the lazy MUL+tanh
the lazy dog MUL+tanh

convolution pooling

How would Average Pooling enter the picture?


Figure 13.2: 1D convolution+pooling over the sentence “the quick brown fox jumped over the lazy
C OLLOBERT, W ESTON , B OTTOU , K ARLEN , K AVUKCUOGLU AND K UKSA
dog.” is
Classic is a Classification
CNN narrow convolution (no padding is added to the sentence) with a window size of 3.
Architecture
C OLLOBERT, W ESTON , B OTTOU , K ARLEN , K AVUKCUOGLU AND K UKSA
Each wordLanguage
“Natural is translated to a 2-dim
Processing embedding
(Almost) From vector (not shown).
Scratch” (2011) e embedding vectors are then
concatenated, resulting in 6-dim window representations. Each of the seven windows is transfered
Input Sentence
through a 6  3 filter (linear transformation followed by element-wise tanh), resulting in seven 3-
dimensional filtered representations. en, aThe
TextSentence
Input cat sat on the mat
max-pooling operation
1
is applied, taking the max over
Feature 1 w11 cat
w21 .sat wN
. . on the mat
Padding

Padding
Text The
each dimension, resulting in a... final 3-dimensional
1
pooled
1
vector. 1
Feature 1 w1 w2 . . . wN
Padding

.. K
Padding
Feature
. K w1K w2K . . . wN
Feature K 1 2 N wK wK . . . wK
Average Pooling e second most common
Lookup Table pooling type being average-pooling —taking the
average value of each index
LTinstead
Lookup
W of
Table the1max:
LT...W 1 d

m
X
LTW .. K
.
1 d

cD pi : (13.5)
LTW K m
iD1
Convolution
Convolution M1 × ·
M1 × ·

n1
hu

n1
hu

Max Over Time


Max Over Time
max(·)
n1
max(·) hu

n1
Linear hu

Linear
M2 × ·
n2
M2 × · hu

HardTanh n2
hu

HardTanh

Linear
Linear
M3 × ·
n3
M3 × · hu = #tags

n3
hu = #tags

Figure 2: Sentence approach network.


Figure 2: Sentence approach network.
200
In the following, we will describe each layer we use in our networks shown in Figure 1 and Figure 2.
We adopt
In the few notations.
following, Given aeach
we will describe matrix A we use
layer denote [A]networks
in our i, j the coefficient
shown inat row i1and
Figure and column
Figure 2.j
We adopt few notations. Given admatrix A we denote [A] the coefficient at row i and column j
in the matrix. We also denote ⟨A⟩i the vector obtained by
win
i, j concatenating the dwin column vectors
around th
the i column dwin
A∈ d ×d
R obtained
in the matrix. We alsovector
denoteof⟨A⟩
matrix
i the 1
vector 2 : by concatenating the dwin column vectors
around the ith column vector of matrix A ∈ Rd1 ×d2 :
! "T # $
d
Source: [?]

Break-through paper of applying CNNs to basic NLP tasks (POS tagging, chunking, NER,
SRL)

1 by 1 Convolution [?] [?]


“[?] propose the need for an MLP convolutional layer and the need for cross-channel pooling
to promote learning across channels.

See Blog post with more info on 1x1 convolution ▲

Combining Different Window Sizes for Text Classification

wait
for
the
video
and
do
n't
rent
it

n x k representation of Convolutional layer with Max-over-time Fully connected layer


sentence with static and multiple filter widths and pooling with dropout and
non-static channels feature maps softmax output

Figure 1: Model architecture with two channels for an example sentence.


~
[?] Such figures are notoriously imprecise! Note that channel in this figure is used in 2D convolution sense, not
necessary)1D! isTherepresented
static channel as
contains static embeddings that are that is keptduring
not modified statictask-specific
throughout training
training and one that
(non-static

is fine-tuned via backpropagation (section 3.2).2


channel’s embeddings are modified).

x1:n = x1 x2 . . . xn , (1) In the multichannel architecture, illustrated in fig-


Sentiment Classification in Keras: A CNN Approach
▲ ure 1, each filter is applied to both channels and
where What is the doesconcatenation
the following operator.
code with In gen- implement? Review and discuss the code to-
Conv1D
the results are added to calculate ci in equation
eral, let gether!
xi:i+j refer to the concatenation of words
(2). The model is otherwise equivalent to the sin-
https://github.com/joosephook/keras/blob/master/examples/imdb_cnn.py
xi , xi+1 , . . . , xi+j . A convolution operation in-
gle channel architecture.
volves aKeras
filterCNN w 2Sentiment
Rhk , which is applied to a
Max-Pooling
window of h words to produce a new feature. For 2.1 Regularization
example, a•feature i is generated from a window
Simplec3-gram CNN with 250 filters
of words xi:i+h 1 by For regularization we employ dropout on the
201 penultimate layer with a constraint on l2 -norms of
ci = f (w · xi:i+h 1 + b). (2) the weight vectors (Hinton et al., 2012). Dropout
prevents co-adaptation of hidden units by ran-
Here b 2 R is a bias term and f is a non-linear domly dropping out—i.e., setting to zero—a pro-
function such as the hyperbolic tangent. This filter portion p of the hidden units during foward-
• Max-Pooling Layer

Be aware that each convolution filter has a bias parameter per default!
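For orientation, a minimal sketch of the core layer stack such a script typically uses (parameter values follow the bullets above; this is not the exact code from the linked example):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Embedding(input_dim=5000, output_dim=50),      # learn task-specific embeddings
    layers.Dropout(0.2),
    layers.Conv1D(250, 3, activation="relu"),             # 250 filters over 3-gram windows
    layers.GlobalMaxPooling1D(),                          # max-over-time pooling
    layers.Dense(250, activation="relu"),
    layers.Dense(1, activation="sigmoid"),                # binary sentiment output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])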

Sentiment Classification in Keras: The FastText Approach


https://github.com/joosephook/keras/blob/master/examples/imdb_fasttext.py fastText

Keras FastText Sentiment


Average
Pooling
• n-grams of words

• 1D Average Pooling on Embeddings

• No convolution done 13.1. BASIC CONVOLUTION + POOLING 157


One view of average-pooling is that of taking a continuous bag-of-words (CBOW) of the k-gram
representations resulting from the convolutions rather than from the sentence words.
K-Max Pooling
K-max Pooling Another variation, introduced by Kalchbrenner et al. [2014] is k-max pooling
operation, in which the top k values in each dimension are retained instead of only the best one,
while preserving the order in which they appeared in the text.⁵ For example, consider the following
matrix:
2 3
1 2 3
6 9 6 57
6 7
6 2 3 17 :
6 7
4 7 8 15
3 4 1
 
A 1-max pooling over the column vectors will result in 9 8 5 , while a 2-max pool-
 
9 6 3
ing will result in the following matrix: whose rows will then be concatenated to
7 8 5
 
9 6 3 7 8 5.
Helpfule k-max
for? pooling
Repeated operation
feature makes it possible to pool the k most active indicators that
activations
may be a number of positions apart; it preserves the order of the features, but is insensitive to
their specific
CNNs positions.
for Relation It can also discern
Classification: [?] more finely the number of times the feature is highly
activated [Kalchbrenner et al., 2014].

Dynamic Pooling Rather than performing a single pooling operation over the entire sequence,
we may want to retain some positional information based on our domain understanding of the
prediction problem at hand. To this end, we can split the vectors pi into r distinct groups, apply
the pooling separately on each group, and then concatenate the r resulting `-dimensional vectors
c1 ; : : : ; cr . e division of the pi s into groups is performed based on domain knowledge. For
example, we may conjecture that words appearing early in the sentence are more indicative than
words appearing late. We can then split the sequence into r equally sized regions, applying a
separate max-pooling to each region. For example, Johnson and Zhang [2015] found that when
classifying documents into topics, it is useful to have 20 average-pooling regions, clearly separating
the initial sentences (where the topic is usually introduced) from later ones, while for a sentiment
classification task a single max-pooling operation over the entire sentence was optimal (suggesting
that one or two very strong signals are enough to determine the sentiment, regardless of the
position in the sentence).
202task we may be given two words and asked to
Similarly, in a relation extraction kind of
determine the relation between them. We could argue that the words before the first word, the
words after the second word, and the words between them provide three different kinds of infor-

⁵In this chapter, we use k to denote the window-size of the convolution. e k in k -max pooling is a different, and unrelated,
value. We use the letter k for consistency with the literature.
Example: CNN for Relation Classification

[Figure: the contextCNN architecture of [?] (Heike Adel). The input sentence, e.g. "In 1614,
Pocahontas married John Rolfe after being baptized ..." for the per:spouse relation, is split into a
left, middle and right context around the two candidate entities. Each context is convolved separately
and reduced by k-max pooling; the flattened pooling results are concatenated with an entity-flag vector
v and fed through a fully-connected MLP with a softmax output that yields P(r|c), the probability of
relation r given context c. The input representation per word consists of a word vector plus a
case-indicator feature.]

Which n-grams are pooled? Analysis for the org:parents slot relation [?]:

• Look at the filters with the highest contribution to the correct classification

• Which n-grams are pooled from these filters (here: the right context, with the top 1, top 3 and
top 5 values of the k-max pooling)?

• The height of a bar is the frequency with which the 3-gram around the corresponding word was
selected by k-max pooling: "newest" stands for "its newest subsidiary"

Dynamic Pooling and Variations

Dynamic Pooling

• Fixed block-size pooling is not always optimal for NLP

• Consider the document position (initial topic sentences)

• Consider relative positions to words of interest (relation extraction), as in the contextCNN
[Adel, Roth & Schütze]

• Consider the sequence of words by traversing a parse tree, not by reading order

Dynamic CNNs by [?]


[Figure: A DCNN for the seven word input sentence "The cat sat on the red mat". From bottom to top:
projected sentence matrix (s = 7), wide convolution (m = 3), dynamic k-max pooling (k = f(s) = 5),
folding, wide convolution (m = 2), k-max pooling (k = 3), fully connected layer. Word embeddings have
size d = 4; the network has two convolutional layers with two feature maps each; the widths of the
filters at the two layers are respectively 3 and 2; the (dynamic) k-max pooling layers have k values
of 5 and 3.]

Results on Movie Review Sentiment

Classifier    Fine-grained (%)    Binary (%)
NB            41.0                81.8
BiNB          41.9                83.1
SVM           40.7                79.4
RecNTN        45.7                85.4
Max-TDNN      37.4                77.1
NBoW          42.4                80.5
DCNN          48.5                86.8

Table: Accuracy of sentiment prediction on the movie reviews dataset. The first four results are
reported from Socher et al. (2013b). The baselines NB and BiNB are Naive Bayes classifiers with,
respectively, unigram features and unigram and bigram features. SVM is a support vector machine with
unigram and bigram features. RecNTN is a recursive neural network with a tensor-based feature function,
which relies on external structural features given by a parse tree and performs best among the baselines.

11.2.4 Hierarchical Convolution
The 1D convolution approach described so far can be thought of as an ngram detector. A convolution
layer with a window of size k is learning to identify indicative k-grams in the input.
The approach can be extended into a hierarchy of convolutional layers, in which a sequence of
convolution layers are applied one after the other. Let CONV^k_Θ(w_{1:n}) be the result of applying
a convolution with window size k and parameters Θ to each k-size window in the sequence w_{1:n}:

    p_{1:m} = CONV^k_{U,b}(w_{1:n})
    p_i = g(⊕(w_{i:i+k−1}) · U + b)                                        (13.6)
    m = n − k + 1  (narrow convolution)   or   m = n + k − 1  (wide convolution)

We can now have a succession of r convolutional layers that feed into each other as follows:

    p¹_{1:m1} = CONV^{k1}_{U1,b1}(w_{1:n})
    p²_{1:m2} = CONV^{k2}_{U2,b2}(p¹_{1:m1})                               (13.7)
    ...
    pʳ_{1:mr} = CONV^{kr}_{Ur,br}(pʳ⁻¹_{1:m_{r−1}})

The resulting vectors pʳ_{1:mr} capture increasingly larger effective windows ("receptive fields") of
the sentence. For r layers with a window of size k, each vector pᵢʳ will be sensitive to a window of
r(k − 1) + 1 words.⁶ Moreover, the vector pᵢʳ can be sensitive to gappy-ngrams of k + r − 1 words,
potentially capturing patterns such as "not ___ good" or "obvious ___ predictable ___ plot", where
___ stands for a short sequence of words, as well as more specialized patterns where the gaps can be
further specialized (i.e., "a sequence of words that do not contain not" or "a sequence of words that
are adverb-like").⁷ Figure 13.3 shows a two-layer hierarchical convolution with k = 2.

⁶To see why, consider that the first convolution layer transforms each sequence of k neighboring word-vectors into vectors
representing k-grams. Then, the second convolution layer will combine each k consecutive k-gram-vectors into vectors that
capture a window of k + (k − 1) words, and so on, until the r-th convolution will capture k + (r − 1)(k − 1) = r(k − 1) + 1 words.

⁷To see why, consider a sequence of two convolution layers, each with a window of size 2, over the sequence "funny and
appealing". The first convolution layer will encode "funny and" and "and appealing" as vectors, and may choose to retain the
equivalent of "funny ___" and "___ appealing" in the resulting vectors. The second convolution layer can then combine these
into "funny ___ appealing", "funny ___" or "___ appealing".

Simple Hierarchical CNN With Stride 1 (word vector)

Figure 13.3: Two-layer hierarchical convolution with k = 2, over the example sentence "the actual
service was not very good".
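A two-layer hierarchical convolution as in Figure 13.3 can be sketched with stacked Conv1d layers. This is a toy illustration with assumed dimensions, not code from the sources:

    import torch
    import torch.nn as nn

    emb_dim, hidden = 100, 64
    seq = torch.randn(1, emb_dim, 7)                    # one sentence of 7 word vectors, channels first

    conv1 = nn.Conv1d(emb_dim, hidden, kernel_size=2)   # narrow convolution over word bigrams
    conv2 = nn.Conv1d(hidden, hidden, kernel_size=2)    # convolution over adjacent bigram vectors

    h1 = torch.tanh(conv1(seq))                         # shape (1, hidden, 6)
    h2 = torch.tanh(conv2(h1))                          # shape (1, hidden, 5), effective window of 3 words
    print(h1.shape, h2.shape)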


11.2.5 Stride

CNNs with Stride > 1

What is the stride parameter?

Strides, Dilation and Pooling So far, the convolution operation is applied to each k-word window
in the sequence, i.e., windows starting at indices 1, 2, 3, .... This is said to have a stride of
size 1. Larger strides are also possible; i.e., with a stride of size 2 the convolution operation will
be applied to windows starting at indices 1, 3, 5, .... More generally, we define CONV_{k,s} as:

    p_{1:m} = CONV^{k,s}_{U,b}(w_{1:n})
    p_i = g(⊕(w_{(i−1)s+1 : (i−1)s+k}) · U + b)                            (13.8)

where s is the stride size. The result will be a shorter output sequence from the convolutional
layer.

Effect of Stride

In a dilated convolution architecture [Strubell et al., 2017, Yu and Koltun, 2016] the hierarchy
of convolution layers each has a stride size of k − 1 (i.e., CONV_{k,k−1}). This allows an
exponential growth in the effective window size as a function of the number of layers. Figure 13.4
shows convolution layers with different stride lengths. Figure 13.5 shows a dilated convolution
architecture.

[Figure 13.4: Strides. (a–c) Convolution layer with k = 3 and stride sizes 1, 2, 3.]

An alternative to the dilation approach is to keep the stride-size fixed at 1, but shorten the
sequence length between each layer by applying local pooling, i.e., consecutive k-grams of vectors
can be converted into a single vector using max pooling or averaged pooling. Even if we pool
just every two neighboring vectors, each convolutional-and-pooling layer in the hierarchy will
halve the length of the sequence. Similar to the dilation approach, we again gain an exponential
decrease in sequence length as a function of the number of layers.
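The effect of the stride and dilation hyperparameters on the output length can be checked directly in code. A small PyTorch sketch (my own illustration, with arbitrary dimensions):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 50, 20)   # (batch, channels, sequence length 20)

    for stride, dilation in [(1, 1), (2, 1), (3, 1), (1, 2)]:
        conv = nn.Conv1d(50, 8, kernel_size=3, stride=stride, dilation=dilation)
        print(f"stride={stride} dilation={dilation} -> output length {conv(x).shape[-1]}")

    # A larger stride shortens the output (roughly length/stride); a larger dilation widens the
    # receptive field by putting "holes" between the kernel taps without adding parameters.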

Parameter Tying and Skip-connections Another variation that can be applied to the hierarchical
convolution architecture is performing parameter-tying, using the same set of parameters U, b
in all the convolution layers. This results in more parameter sharing, as well as allowing to use an
unbounded number of convolution layers (as all the convolution layers share the same parameters,
the number of convolution layers need not be set in advance), which in turn allows to reduce
arbitrary length sequences into a single vector by using a sequence of narrow convolutions, each
resulting in a shorter sequence of vectors.
When using deep architectures, skip-connections are sometimes useful: these work by feeding
into the i-th layer not only the vectors resulting from the (i−1)-th layer, but also vectors from
previous layers, which are combined with the vectors of the (i−1)-th layer using either concatenation,
averaging, or summation.

Further Reading The use of hierarchical and dilated convolution and pooling architectures is
very common in the computer-vision community, where various deep architectures—comprising
arrangements of many convolution and pooling layers with different strides—have been proposed,
resulting in very strong image classification and object recognition results [He et al., 2016,
Krizhevsky et al., 2012, Simonyan and Zisserman, 2015]. The use of such deep architectures for
NLP is still more preliminary. Zhang et al. [2015] provide initial experiments with text classification
with hierarchical convolutions over characters, and Conneau et al. [2016] provide further results,
this time with very deep convolutional networks. The work of Strubell et al. [2017] provides a good
overview of hierarchical and dilated architectures for a sequence labeling task. Kalchbrenner et al.
[2016] use dilated convolutions as encoders in an encoder-decoder architecture (Section 17.2) for
machine translation. The hierarchy of convolutions with local pooling approach is used by Xiao and
Cho [2016], who apply it to a sequence of characters in a document-classification task, and then feed
the resulting vectors into a recurrent neural network. We return to this example in Section 16.2.2,
after discussing recurrent neural networks.

[Figure 13.5: Three-layer dilated hierarchical convolution with k = 3. Not really illustrating the
point; a better illustration can be found in WaveNet ▲]

Dilation: Holes in Convolution

[Figure 4-12: A convolution with kernel_size=2 applied to an input matrix with the hyperparameter
dilation=2. The increase in dilation from its default value means the elements of the kernel matrix
are spread further apart as they multiply the input matrix. Increasing dilation further would
accentuate this spread. [?]]

Dilated Convolution Networks

• Exponential growth of the receptive field as a function of the depth of the CNN

• [?] and [?] show its benefits on several NLP tasks

Implementing CNNs in PyTorch

In this section, we work through an end-to-end example that will utilize the concepts introduced in the
previous section. Generally, the goal of neural network design is to find a configuration of
hyperparameters that will accomplish a task. We again consider the now-familiar surname
classification task introduced in "Example: Surname Classification with an MLP", but we will use
CNNs instead of an MLP. We still need to apply a final Linear layer that will learn to create a
prediction vector from a feature vector created by a series of convolution layers. This implies that the
goal is to determine a configuration of convolution layers that results in the desired feature vector.
All CNN applications are like this: there is an initial set of convolutional layers that extract a
feature map that becomes input in some upstream processing. In classification, the upstream processing
is almost always the application of a Linear (or fc) layer.

The implementation walk-through in this section iterates over the design decisions to construct a
feature vector. We begin by constructing an artificial data tensor mirroring the actual data in shape.
The size of the data tensor is going to be three-dimensional—this is the size of the minibatch of
vectorized text data. If you use a one-hot vector for each character in a sequence of characters, a
sequence of one-hot vectors is a matrix, and a minibatch of one-hot matrices is a three-dimensional
tensor. Using the terminology of convolutions, the size of each one-hot vector (usually the size of the
vocabulary) is the number of "input channels" and the length of the character sequence is the "width."

As illustrated in Example 4-14, the first step to constructing a feature vector is applying an instance
of PyTorch's Conv1d class to the three-dimensional data tensor. By checking the size of the output, you
can get a sense of how much the tensor has been reduced. We refer you to Figure 4-9 for a visual
explanation of why the output tensor is shrinking.

Example 4-14. Artificial data and using a Conv1d class [?]

Strides and Padding


Padding
[Figure 13-7: Padding options — input width: 13, filter width: 6, stride: 5. Source: [?]]

SAME strategy: padding as necessary to keep the dimensions; number of output neurons m = ceil(input/stride)

Unfortunately, convolutional layers have quite a few hyperparameters: you must
choose the number of filters, their height and width, the strides, and the padding
type. As always, you can use cross-validation to find the right hyperparameter values,
but this is very time-consuming. We will discuss common CNN architectures later, to
give you some idea of what hyperparameter values work best in practice.
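The two padding strategies can be checked directly with Keras layers. The numbers below reuse the setting of Figure 13-7 (input width 13, filter width 6, stride 5); the code itself is my own illustrative sketch:

    import tensorflow as tf

    x = tf.random.normal([1, 13, 3])   # one sequence of length 13 with 3 channels

    for padding in ["valid", "same"]:
        conv = tf.keras.layers.Conv1D(filters=1, kernel_size=6, strides=5, padding=padding)
        print(padding, conv(x).shape)
    # "valid": output length 2 (only full windows), "same": output length ceil(13/5) = 3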
11.2.6 Channels

Memory Requirements

Another problem with CNNs is that the convolutional layers require a huge amount
of RAM, especially during training, because the reverse pass of backpropagation
requires all the intermediate values computed during the forward pass.
For example, consider a convolutional layer with 5 × 5 filters, outputting 200 feature
maps of size 150 × 100, with stride 1 and SAME padding. If the input is a 150 × 100
RGB image (three channels), then the number of parameters is (5 × 5 × 3 + 1) × 200
= 15,200 (the +1 corresponds to the bias terms), which is fairly small compared to a
fully connected layer.⁷ However, each of the 200 feature maps contains 150 × 100 neurons,
and each of these neurons needs to compute a weighted sum of its 5 × 5 × 3 = 75 inputs:
that's a total of 225 million float multiplications. Not as bad as a fully connected layer.

⁷ A fully connected layer with 150 × 100 neurons, each connected to all 150 × 100 × 3 inputs, would
have 150² × 100² × 3 = 675 million parameters!

Input Channels in Text Processing

• Channels in image processing are RGB color intensities (for RGB images. . . )

• Channels in text processing can be different layers of a text: words, POS tags, lemmas

• Depending on the type of convolution (1D vs 2D), channels can be the dimensions of the embeddings
or something else!

• Multi-Channel input (or output) ▲
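One way to realize the "different layers of a text as channels" idea is to embed words and POS tags separately and concatenate the embeddings along the channel axis before the 1D convolution. This is an illustrative sketch with made-up vocabulary sizes, not code from the sources:

    import torch
    import torch.nn as nn

    word_emb = nn.Embedding(10000, 64)   # word channels
    pos_emb = nn.Embedding(20, 16)       # POS-tag channels

    words = torch.randint(0, 10000, (2, 30))   # (batch, seq_len)
    tags = torch.randint(0, 20, (2, 30))

    x = torch.cat([word_emb(words), pos_emb(tags)], dim=-1)   # (2, 30, 80)
    x = x.transpose(1, 2)                                     # Conv1d expects (batch, channels, seq_len)
    conv = nn.Conv1d(in_channels=80, out_channels=100, kernel_size=3)
    print(conv(x).shape)                                      # torch.Size([2, 100, 28])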
Moreover, input images are also composed of multiple sublayers: one per color channel.
There are typically three: red, green, and blue (RGB). Grayscale images have just
one channel, but some images may have much more—for example, satellite images
that capture extra light frequencies (such as infrared).

[Figure 13-6: Convolution layers with multiple feature maps, and images with three channels. Source: [?]]

Channels

Specifically, a neuron located in row i, column j of the feature map k in a given convolutional
layer l is connected to the outputs of the neurons in the previous layer l − 1,
located in rows i × s_h to i × s_h + f_h − 1 and columns j × s_w to j × s_w + f_w − 1, across all
feature maps (in layer l − 1). Note that all neurons located in the same row i and column j
but in different feature maps are connected to the outputs of the exact same
neurons in the previous layer.
Equation 13-1 summarizes the preceding explanations in one big mathematical equation:
it shows how to compute the output of a given neuron in a convolutional layer.

[?]

The convolved channels are added up. Another option to combine channels: a 1×1 convolution.

Channels vs Filters

[Figure 4-7: A convolution operation is shown with two input matrices (two input channels). The
corresponding kernel also has two layers; it multiplies each layer separately and then sums the results.
Configuration: input_channels=2, output_channels=1, kernel_size=2, stride=1, padding=0, and dilation=1.]

[Figure 4-8: A convolution operation with one input matrix (one input channel) and two convolutional
kernels (two output channels). The kernels apply individually to the input matrix and are stacked in the
output tensor. Configuration: input_channels=1, output_channels=2, kernel_size=2, stride=1, padding=0,
and dilation=1.]

[?]

It's difficult to immediately know how many output channels are appropriate for the problem at hand.
To simplify this difficulty, let's say that the bounds are 1 and 1,024—we can have a convolutional
layer with a single channel, up to a maximum of 1,024 channels. Now that we have bounds, the next
thing to consider is how many input channels there are. A common design pattern is not to shrink the
number of channels by more than a factor of two from one convolutional layer to the next. This is not
a hard-and-fast rule, but it should give you some sense of what an appropriate number of
out_channels would look like.

KERNEL SIZE
The width of the kernel matrix is called the kernel size (kernel_size in PyTorch).

Full Formula for 2D Convolution with Bias

Equation 13-1. Computing the output of a neuron in a convolutional layer

    z_{i,j,k} = b_k + Σ_{u=0..f_h−1} Σ_{v=0..f_w−1} Σ_{k′=0..f_{n′}−1} x_{i′,j′,k′} · w_{u,v,k′,k}
    with i′ = i × s_h + u and j′ = j × s_w + v

It is a bit ugly due to all the different indices, but all it does is calculate the weighted
sum of all the inputs, plus the bias term.

• z_{i,j,k} is the output of the neuron located in row i, column j in feature map k of the
convolutional layer (layer l).

• As explained earlier, s_h and s_w are the vertical and horizontal strides, f_h and f_w are
the height and width of the receptive field, and f_{n′} is the number of feature maps
in the previous layer (layer l − 1).

• x_{i′,j′,k′} is the output of the neuron located in layer l − 1, row i′, column j′, feature
map k′ (or channel k′ if the previous layer is the input layer).

• b_k is the bias term for feature map k (in layer l). You can think of it as a knob that
tweaks the overall brightness of the feature map k.

• w_{u,v,k′,k} is the connection weight between any neuron in feature map k of the layer
l and its input located at row u, column v (relative to the neuron's receptive field),
and feature map k′.

[?]
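The input/output channel configurations of Figures 4-7 and 4-8 can be reproduced with a few lines of PyTorch. This is a sketch with random data, only meant to show the resulting shapes:

    import torch
    import torch.nn as nn

    # Two input channels, one output channel (cf. Figure 4-7)
    x2 = torch.randn(1, 2, 4, 4)
    conv_a = nn.Conv2d(in_channels=2, out_channels=1, kernel_size=2, stride=1, padding=0, dilation=1)
    print(conv_a(x2).shape)   # torch.Size([1, 1, 3, 3])

    # One input channel, two output channels, i.e. two kernels (cf. Figure 4-8)
    x1 = torch.randn(1, 1, 4, 4)
    conv_b = nn.Conv2d(in_channels=1, out_channels=2, kernel_size=2, stride=1, padding=0, dilation=1)
    print(conv_b(x1).shape)   # torch.Size([1, 2, 3, 3])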

TensorFlow Implementation

In TensorFlow, each input image is typically represented as a 3D tensor of shape
[height, width, channels]. A mini-batch is represented as a 4D tensor of shape
[mini-batch size, height, width, channels]. The weights of a convolutional
layer are represented as a 4D tensor of shape [fh, fw, fn′, fn]. The bias terms of a
convolutional layer are simply represented as a 1D tensor of shape [fn].
Let's look at a simple example. The following code loads two sample images, using
Scikit-Learn's load_sample_images() (which loads two color images, one of a Chinese
temple, and the other of a flower). Then it creates two 7 × 7 filters (one with a
vertical white line in the middle, and the other with a horizontal white line), and
applies them to both images using a convolutional layer built using TensorFlow's …

11.3 Training

Training of CNN Layers

Training can be done with gradient descent and backpropagation on the computation graph –
just as with FFNNs.
Additionally needed:

• Gradient for the convolutional layer

• Gradient for the max pooling layer

Make sure you understand Goldberg's Chapter 5

[?, 54]
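Training a CNN therefore looks exactly like training a feed-forward network. A minimal PyTorch training step might look as follows; the model and data are illustrative placeholders, not taken from the lecture materials:

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv1d(50, 100, kernel_size=3), nn.ReLU(),
        nn.AdaptiveMaxPool1d(1), nn.Flatten(),
        nn.Linear(100, 2),
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    x = torch.randn(16, 50, 40)        # a mini-batch of 16 "sentences" given as embeddings
    y = torch.randint(0, 2, (16,))     # binary labels

    optimizer.zero_grad()
    loss = loss_fn(model(x), y)        # forward pass builds the computation graph
    loss.backward()                    # backpropagation through conv and pooling layers
    optimizer.step()                   # gradient descent update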
Recap: Backpropagation
Mathematical Recap: Forward and Backward Pass

    ∂C/∂w_ij^l = (∂z_i^l/∂w_ij^l) · (∂C/∂z_i^l)

[Figure: a fully connected layer l receiving the activations a^(l−1) of layer l−1 through the weights
w_ij^l; the input x_j^l equals a_j^(l−1) for l > 1 and the raw input x_j for l = 1.]

Forward pass:
    z^l = W^l a^(l−1) + b^l
    a^l = σ(z^l)

Backward pass:
    δ^L = σ′(z^L) ⊙ ∇_a C(y)
    δ^l = σ′(z^l) ⊙ (W^(l+1))ᵀ δ^(l+1)

[?]


From FFNN to CNNs

https://grzegorzgwardys.wordpress.com/2016/04/22/8/

Gradients of Convolutional Layers

Convolutional Neural Networks backpropagation: from intuition to derivation▲
https://grzegorzgwardys.wordpress.com/2016/04/22/8/

Gradients of CNN

Gradient Computation for CNN

• Gradient of the convolutional layer: sum over all gradients of the same shared parameter

• Gradient of the max pooling layer: it depends on the index of the largest value, so the indices of
the largest values have to be stored during the forward pass; the incoming gradient δ^(l+1) is routed
to the position of the maximum, and all other positions receive a zero gradient

[?]
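The fact that the max-pooling backward pass needs the argmax positions is visible in PyTorch, where the pooling operation can return exactly those indices. An illustrative sketch (the input values mirror the small example matrix above):

    import torch
    import torch.nn.functional as F

    x = torch.tensor([[[0., 1., 0., 0., 2., 3., 3., 3., 4., 0., 1., 0.]]])
    values, indices = F.max_pool1d(x, kernel_size=4, return_indices=True)
    print(values)    # tensor([[[1., 3., 4.]]])
    print(indices)   # positions of the maxima; the incoming gradient flows only to these positions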

11.3.1 Deep CNNs

Issues with Deep CNNs

• Training problems (convergence) due to vanishing gradients in deep CNNs

• Skip connections can bypass certain layers: Bypassed information from lower levels is
integrated again on high levels.

• More parameter saving: Share parameters between layers (similar idea as in recurrent
neural networks)

• Apply dropout: deactivate a random selection of neurons in a training step for regular-
ization

• Normalization of activations in hidden layers (a natural extension of the normalization of input
features): Layer Normalization, Batch Normalization, Group Normalization, etc. (see the sketch below)
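The following minimal PyTorch sketch (not from the lecture materials) shows how the normalization variants differ in the axes over which the statistics are computed; the tensor shape is an arbitrary example:

    import torch
    import torch.nn as nn

    x = torch.randn(8, 32, 50)          # (batch N, channels C, sequence length L)

    batch_norm = nn.BatchNorm1d(32)     # stats over (N, L) per channel: depends on the batch
    layer_norm = nn.LayerNorm([32, 50]) # stats over (C, L) per sample: independent of the batch
    group_norm = nn.GroupNorm(num_groups=4, num_channels=32)  # stats per sample over channel groups

    for layer in (batch_norm, layer_norm, group_norm):
        print(type(layer).__name__, layer(x).shape)   # shapes stay (8, 32, 50)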

11.3.2 Normalization

Variants of Normalization Illustrated on Image Data

A family of feature normalization methods, including Batch Norm (BN), Layer Norm (LN), Instance Norm
(IN), and Group Norm (GN), perform the following computation:

    x̂_i = (1/σ_i)(x_i − μ_i)                                              (1)

Here x is the feature computed by a layer, and i is an index. In the case of 2D images,
i = (i_N, i_C, i_H, i_W) is a 4D vector indexing the features in (N, C, H, W) order, where N is the
batch axis, C is the channel axis, and H and W are the spatial height and width axes.
μ and σ in (1) are the mean and standard deviation (std) computed by:

    μ_i = (1/m) Σ_{k∈S_i} x_k,     σ_i = sqrt((1/m) Σ_{k∈S_i} (x_k − μ_i)² + ε)     (2)

with ε as a small constant. S_i is the set of pixels in which the mean and std are computed, and m is
the size of this set. Many types of feature normalization methods mainly differ in how the set S_i is
defined (Figure 2), discussed as follows.

In Batch Norm [26], the set S_i is defined as:

    S_i = {k | k_C = i_C}                                                  (3)

where i_C (and k_C) denotes the sub-index of i (and k) along the C axis. This means that the pixels
sharing the same channel index are normalized together, i.e., for each channel, BN computes μ and σ
along the (N, H, W) axes. In Layer Norm [3], the set is:

    S_i = {k | k_N = i_N}                                                  (4)

meaning that LN computes μ and σ along the (C, H, W) axes for each sample. In Instance Norm [61], the
set is:

    S_i = {k | k_N = i_N, k_C = i_C}                                       (5)

meaning that IN computes μ and σ along the (H, W) axes for each sample and each channel. The relations
among BN, LN, and IN are in Figure 2. (Group Norm, in turn, divides the channel axis into G groups and
computes μ and σ per sample within each group of channels.)

[Figure 2: Normalization methods (Batch Norm, Layer Norm, Instance Norm, Group Norm). Each subplot shows
a feature map tensor, with N as the batch axis, C as the channel axis, and (H, W) as the spatial axes.
The pixels in blue are normalized by the same mean and variance, computed by aggregating the values of
these pixels.]

Group Normalization: Yuxin Wu and Kaiming He, Facebook AI Research [?]

Batch Size and Batch Normalization

Why are small batch sizes bad for Batch Normalization?

[Figure 1 of [?]: ImageNet classification error vs. batch size (images per worker) for Batch Norm and
Group Norm. The model is a ResNet-50 trained on the ImageNet training set using 8 workers (GPUs) and
evaluated on the validation set. BN's error increases rapidly as the batch size shrinks, while GN stays
stable across batch sizes.]

The batch statistics (mean and std) are estimated per mini-batch; a small batch leads to an inaccurate
estimation of these statistics, and reducing BN's batch size increases the model error dramatically.
At application time, the normalization parameters estimated from the training data are used.

See the d2l.ai chapter on Batch Normalization for details▲

11.4 MNIST

MNIST in Tensorflow

Solving the Hello World problem (MNIST) with CNNs!

Solving the handwritten character recognition problem with dense and convolutional NNs, following the
famous Martin Görner presentation¹:

• Presentation slides

• Links to videos
• Source code as a workbook tutorial
• (Re-)Introduces many concepts of ML in tensorflow setting (including dropout)

1
https://github.com/GoogleCloudPlatform/tensorflow-without-a-phd

Modern Approach: Pooling/Down-Sampling via Stride

CNN in Raw Tensorflow I

CNN in Raw Tensorflow II

CNN Model in Keras Tensorflow

Source▲
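A compact Keras model in the spirit of these slides, using strided convolutions instead of pooling layers for down-sampling, could look like the sketch below. The filter counts and dense size are illustrative choices, not the exact model from the linked source:

    import tensorflow as tf
    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Input(shape=(28, 28, 1)),
        layers.Conv2D(12, 5, padding="same", activation="relu"),
        layers.Conv2D(24, 4, strides=2, padding="same", activation="relu"),  # stride 2 replaces pooling
        layers.Conv2D(48, 4, strides=2, padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(200, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])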

11.5 Conclusions
Summary

• Convolution allows detecting information using small sets of shared parameters contained in the
convolutional kernels

• Local patterns detectable by a convolutional kernel can be detected anywhere in a larger space

• Feature extraction can be done hierarchically

• The shared parameters of convolutional kernels keep the number of parameters manageable

• CNNs easily allow for efficient parallelized execution on GPUs and/or multi-GPUs

11.6 Further Study


• Mandatory: Chapters 7 and 8▲ of [?]. Relevant concepts are mentioned here!

• Recommended reading: Chapter 13 of [?] and Chapter 4 of [?]

Chapter 12

Recurrent Neural Networks

Learning Goals

• Know why Recurrent Neural Networks can deal with sequences that have sequence-internal
dependencies: very well with input dependencies and, to a good degree, with output
dependencies

• Understand different modeling approaches for sequence data

• Know about the theoretical abilities of RNNs and the practical problems in training (van-
ishing and exploding gradients)

• Know most important facts about unfolding recurrent relations and backpropagation
through time

12.1 Motivation
Recap: Convolutional Networks (CNNs)

• Excellent for grid-structured input (1D, 2D, 3D, or more!)

• Sensitive to local ordering

• Applicable to large inputs in a “fully-connected” style

• Only feasible because of shared parameters of convolution weight matrices!

Fixed-Size Input to Fixed-Size Output


Motivation for RNNs
Limitations of Non-Recurrent Neural Networks

• fixed-sized input vector

• fixed-sized output vector (not necessarily the same size as input)

• fixed-sized number of computing steps

• ideal for classification or regression of independent and identically distributed items
(i.i.d. random variables▲)

• Dealing with variable-length input or output needs special treatment (shortening, padding,
masking)!

• For instance, by iteratively applying sliding windows.

Recap: Typical Prediction Problems


Types of Prediction

Categorical Prediction Problems