Natural Language
Processing
Lecture 13:
Machine Learning: Linear and Log-Linear Models
12/4/2019
COMS W4705
Yassine Benajiba
Intro
Machine Learning and NLP
• We have encountered many different situations where we
had to make a prediction:
• Text classification, language modeling, POS tagging,
constituency/dependency parsing, …
• These are all classification problems of some form.
• Today: Some machine learning background. Linear/log-
linear models. Basic neural networks.
Generative Algorithms
• Assume the observed data is being “generated” by a
“hidden” class label.
• Build a different model for each class.
• To predict a new example, check it under each of the
models and see which one matches best.
• Model P(x | y) and P(y). Then use Bayes' rule:
P(y | x) = P(x | y) P(y) / P(x) ∝ P(x | y) P(y)
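A minimal sketch of this prediction rule, assuming per-class models that expose a log_likelihood(x) method and a dictionary of class priors (both names are illustrative, not from the slides):

import math

def predict_generative(x, class_models, class_priors):
    """Pick the class whose model explains x best, via Bayes' rule:
    argmax_y P(x | y) P(y) = argmax_y log P(x | y) + log P(y)."""
    best_label, best_score = None, float("-inf")
    for label, model in class_models.items():
        # model.log_likelihood(x) is an assumed interface returning log P(x | y=label)
        score = model.log_likelihood(x) + math.log(class_priors[label])
        if score > best_score:
            best_label, best_score = label, score
    return best_label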
Discriminative Algorithms
• Model the conditional distribution of the label given the data, P(y | x), directly.
• Learn decision boundaries that separate instances of the
different classes.
• To predict a new example, check on which side of the
decision boundary it falls.
Machine Learning Definition
• “Creating systems that improve from experience.”
• “A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E.”
(Tom Mitchell, Machine Learning, 1997)
Inductive Learning (a.k.a. Science)
• Goal: given a set of input/output pairs (training data), find
the function f(x) that maps inputs to outputs.
• Problem: We did not see all possible inputs!
• Learn an approximate function h(x) from the training data
and hope that this function generalizes well to unseen inputs.
• Ockham’s razor: Choose the simplest hypothesis that is
consistent with the training data.
Classification and Regression
• Recall: In supervised learning, training data consisting of training
examples
(x1, y1), …, (xn, yn), where xj is an input example (a d-dimensional vector of
attribute values) and yj is the label.
• Two types of supervised learning problems:
• In classification: yj is drawn from a finite, discrete set.
Typically yj ∈ {-1, +1}, i.e. predict a label from a set of labels.
Learn a classifier function h: ℝd → {-1, +1}.
• In regression: xj ∈ ℝd, yj ∈ ℝ, i.e. predict a numeric value.
Learn a regressor function h: ℝd → ℝ.
Linear Classification and
Regression
[Figure: left, regression, a function h(x) fit to numeric outputs; right, classification, a decision boundary separating two classes in the (x1, x2) plane.]
Linear Classification
Training ML models
Training Data → ML algorithm → function h(x) = y
• How can we be confident about the learned function?
• Can compute the empirical error/risk on the training set:
R̂(h) = (1/n) Σi=1…n L(h(xi), yi)
• Typical loss functions:
• Least squares loss (L2): L(h(x), y) = (h(x) − y)²
• Classification error (0/1 loss): L(h(x), y) = 1 if h(x) ≠ y, else 0
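A small sketch of these quantities in code (the function names are illustrative):

import numpy as np

def squared_loss(prediction, target):
    return (prediction - target) ** 2            # least-squares (L2) loss

def zero_one_loss(prediction, target):
    return 0.0 if prediction == target else 1.0  # classification error

def empirical_risk(h, X, y, loss):
    """Average loss of hypothesis h over the training examples."""
    return float(np.mean([loss(h(x), yi) for x, yi in zip(X, y)]))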
Training ML models
Training Data → ML algorithm → function h(x) = y
• Empirical error/risk: R̂(h) = (1/n) Σi L(h(xi), yi)
• Training aims to minimize R̂(h), the training error.
• We hope that this also minimizes R(h), the test error.
Overfitting
• Problem: Minimizing empirical risk can lead to overfitting.
• This happens when a model works well on the training
data, but it does not generalize to testing data.
• Data sets can be noisy. Overfitting can model the noise
in the data.
Preventing Overfitting
• Solutions: Simpler models.
• Reduce the number of features (feature selection).
• Model selection.
• Regularization.
• Cross validation.
• However: Adding wrong assumptions (bias) to the training
algorithm can lead to underfitting!
Goodness of Fit
Linear Model
[Figure: a single linear unit. The inputs xi1, …, xin and a constant bias input 1 (xi0) are multiplied by weights w0, w1, …, wn, summed (Σ), and passed through a threshold function (the activation function) to produce the output.]
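A direct sketch of this unit in code, with w[0] playing the role of the bias weight (a sketch, not the lecture's exact notation):

import numpy as np

def linear_unit(x, w):
    """Output of a linear unit: threshold(w0 * 1 + w1*x1 + ... + wn*xn)."""
    activation = w[0] + np.dot(w[1:], x)   # weighted sum; w[0] is the bias weight
    return 1 if activation >= 0 else -1    # threshold activation function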
Linear Models
• We have chosen a function class (linear separators).
• Specified by parameter w.
• Need to estimate w on the basis of the training set.
• What loss should we use? One option: minimize the
classification error on the training set, i.e. the number of training examples for which the predicted label h(xi) differs from yi.
Perceptron Learning
• Problem: Threshold function is not differentiable, so we
cannot find a closed-form solution or apply gradient descent.
• Instead use iterative perceptron learning algorithm:
• Start with arbitrary hyperplane.
• Adjust it using the training data.
• Update rule (for a misclassified example): wj ← wj + α (yi − h(xi)) xij
• The Perceptron Convergence Theorem states that, if the training
data is linearly separable, this algorithm converges in a finite number
of iterations.
Perceptron Learning
Algorithm
Input: Training examples (x1, y1), …, (xn, yn)
Output: A perceptron defined by (w0, w1, …, wd)

Initialize wj ← 0, for j = 0…d
while not converged:              # "converged" means the weights do not change
                                  # during one entire pass through the training data
    shuffle the training examples
    for each training example (xi, yi):
        output ← h(xi)
        if output ≠ yi:           # output and target do not match
            for each weight wj:
                wj ← wj + α (yi − output) xij
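A runnable NumPy version of this algorithm (the function names and the fixed learning rate are illustrative):

import numpy as np

def perceptron_train(X, y, alpha=1.0, max_epochs=100):
    """Train a perceptron on examples X (n x d) with labels y in {-1, +1}.
    Returns a weight vector of length d + 1, where index 0 is the bias weight."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])   # prepend a constant 1 so w[0] is the bias
    w = np.zeros(d + 1)
    rng = np.random.default_rng(0)
    for _ in range(max_epochs):
        changed = False
        for i in rng.permutation(n):                 # shuffle each pass
            output = 1 if Xb[i] @ w >= 0 else -1
            if output != y[i]:                       # misclassified: update the weights
                w += alpha * (y[i] - output) * Xb[i]
                changed = True
        if not changed:                              # a full pass with no updates: converged
            break
    return w

def perceptron_predict(w, X):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.where(Xb @ w >= 0, 1, -1)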
Perceptron
• Simple learning algorithm. Guaranteed to converge after a
finite number of steps.
• But only if the data is linearly separable.
[Figure: training data in the (x1, x2) plane that no single line can separate; the perceptron cannot learn this.]
Feature Functions
• In NLP we often need to make multi-class decisions.
Linear models provide only binary decisions.
• Use a feature function φ(x, y), where x is an input object
and y is a possible output.
• The values of φ(x, y) are d-dimensional vectors.
Log-Linear Model
(a.k.a. "Maximum Entropy Models")
• Define the conditional probability
P(y | x) = exp(w · φ(x, y)) / Σy' exp(w · φ(x, y'))
• exp(z) = e^z is positive for any z.
• But how should we estimate w?
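A small sketch of the conditional probability defined above, assuming the feature function phi(x, y) returns a NumPy vector and labels is the set of possible outputs (names are illustrative):

import numpy as np

def log_linear_probs(w, phi, x, labels):
    """Return P(y | x) for every possible label y under a log-linear model."""
    scores = np.array([w @ phi(x, y) for y in labels])  # w . phi(x, y) for each candidate y
    scores -= scores.max()                              # shift scores for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()                # normalize (softmax)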
Log-Likelihood
• Define the log-likelihood of some model w on the training
data (x1, y1), …, (xn, yn) as
LL(w) = Σi=1…n log P(yi | xi; w)
• We want to compute the maximum-likelihood estimate w* = argmaxw LL(w).
• Unfortunately, there is no general analytical solution. Can
use gradient-based optimization.
Simple Gradient Ascent
Initialize w ← any setting in the parameter (weight) space
for a set number of iterations T:
    for each wi in w:
        update wi to w'i = wi + α ∂LL(w)/∂wi
• Follow the gradients (partial derivatives) to find a parameter setting
that maximizes LL(w)
• α > 0 is the learning rate or step size.
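A generic sketch of this loop, assuming a function grad_LL(w) that returns the vector of partial derivatives (both the name and the fixed iteration count are illustrative):

import numpy as np

def gradient_ascent(grad_LL, d, alpha=0.1, T=100):
    """Maximize LL(w) by repeatedly stepping in the direction of its gradient."""
    w = np.zeros(d)                    # any initial setting in the parameter space
    for _ in range(T):
        w = w + alpha * grad_LL(w)     # w'_i = w_i + alpha * dLL/dw_i, for all i at once
    return w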
Partial Derivative of the Log
Likelihood
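For the log-linear model defined above, the partial derivative of LL(w) with respect to each weight wj takes the standard "observed minus expected feature counts" form:

\frac{\partial LL(w)}{\partial w_j}
  \;=\; \sum_{i=1}^{n} \phi_j(x_i, y_i)
  \;-\; \sum_{i=1}^{n} \sum_{y'} P(y' \mid x_i; w)\,\phi_j(x_i, y')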
Regularization
• Problem: Parameter estimation can overfit the training
data.
• Can include a regularization term. For example, the L2
regularizer: (λ/2) ‖w‖² = (λ/2) Σj wj²
• λ > 0 controls the strength of the regularization.
• Since we are maximizing LL(w) − (λ/2) ‖w‖²,
there is now a trade-off between fit and model 'complexity'.
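With this regularizer the gradient only changes by an extra term that shrinks each weight toward zero (a standard form, consistent with the gradient above):

\frac{\partial}{\partial w_j}\!\left( LL(w) - \frac{\lambda}{2}\,\lVert w \rVert^2 \right)
  \;=\; \frac{\partial LL(w)}{\partial w_j} \;-\; \lambda\, w_j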
POS Tagging with
Log-Linear Models
• Previously we used a generative model (HMM) for POS
tagging.
• Now we want to use a discriminative model for
P(t1, …, tm | w1, …, wm).
• The next tag is conditioned on the previous tag sequence and all
observed words.
Maximum Entropy Markov
Models (MEMM)
• Make an independence assumption (similar to the HMM):
P(t1, …, tm | w1, …, wm) = ∏i=1…m P(ti | ti-1, w1, …, wm)
• The probability of each tag only depends on the previous tag (and the observed words).
MEMMs
• Model each term using a log-linear model:
P(ti | ti-1, w1, …, wm) = exp(w · φ(w1,…,wm, i, ti-1, ti)) / Σt' exp(w · φ(w1,…,wm, i, ti-1, t'))
• φ is a feature function defined over:
• the observed words w1,...,wm
• the position of the current word
• the previous tag ti-1
• the suggested tag for the current word ti
• t' is a variable ranging over all possible tags.
MEMMs
• Training: same as any log-linear model.
• Decoding: Need to find
arg maxt1,…,tm P(t1, …, tm | w1, …, wm)
• Can use Viterbi algorithm!
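A sketch of Viterbi decoding for this model, assuming a function local_prob(words, i, prev_tag, tag) that returns P(ti = tag | ti-1 = prev_tag, w1, …, wm) (the names, and the start-tag convention, are illustrative):

import math

def viterbi_decode(words, tags, local_prob, start_tag="<s>"):
    """Find the most probable tag sequence under the MEMM."""
    m = len(words)
    # pi[i][t]: best log-probability of any tag sequence ending in tag t at position i
    pi = [{t: float("-inf") for t in tags} for _ in range(m)]
    back = [{t: None for t in tags} for _ in range(m)]

    for t in tags:                                  # base case: first word
        pi[0][t] = math.log(local_prob(words, 0, start_tag, t))

    for i in range(1, m):                           # recursive case
        for t in tags:
            for prev in tags:
                score = pi[i - 1][prev] + math.log(local_prob(words, i, prev, t))
                if score > pi[i][t]:
                    pi[i][t] = score
                    back[i][t] = prev

    best = max(tags, key=lambda t: pi[m - 1][t])    # best final tag
    sequence = [best]
    for i in range(m - 1, 0, -1):                   # follow back-pointers
        sequence.append(back[i][sequence[-1]])
    return list(reversed(sequence))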
Feature Function
(Ratnaparkhi, 1996)
• φ(w1, …, wm, i, ti-1, ti) is a feature vector of length d.
• (wi,ti), (wi-1,ti), (wi-2,ti), (wi+1,ti), (wi+2,ti)
• (ti-1,ti)
• (wi contains numbers, ti),
(wi contains uppercase characters, ti)
(wi contains a hyphen, ti)
• (prefix1 of wi,ti), (prefix2 of wi,ti), (prefix3 of wi,ti), (prefix4 of wi,ti)
(suffix1 of wi,ti), (suffix2 of wi,ti), (suffix3 of wi,ti), (suffix4 of wi,ti)
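A sketch of how these templates might be extracted in code (the string encoding of the features is illustrative):

def extract_features(words, i, prev_tag, tag):
    """Binary features for tagging words[i] as `tag`, following the templates above."""
    w = words[i]
    feats = [f"word={w}+tag={tag}", f"prevtag={prev_tag}+tag={tag}"]
    # Neighboring words (where they exist).
    if i > 0:
        feats.append(f"word-1={words[i-1]}+tag={tag}")
    if i > 1:
        feats.append(f"word-2={words[i-2]}+tag={tag}")
    if i < len(words) - 1:
        feats.append(f"word+1={words[i+1]}+tag={tag}")
    if i < len(words) - 2:
        feats.append(f"word+2={words[i+2]}+tag={tag}")
    # Spelling features.
    if any(c.isdigit() for c in w):
        feats.append(f"has-digit+tag={tag}")
    if any(c.isupper() for c in w):
        feats.append(f"has-upper+tag={tag}")
    if "-" in w:
        feats.append(f"has-hyphen+tag={tag}")
    # Prefixes and suffixes up to length 4.
    for k in range(1, 5):
        feats.append(f"prefix{k}={w[:k]}+tag={tag}")
        feats.append(f"suffix{k}={w[-k:]}+tag={tag}")
    return feats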
Feature Example
The stories about well-heeled communities and developers ...
DT NNS IN ??
• (well-heeled,JJ), (about,JJ), (stories,JJ), (communities, JJ), (and,JJ)
• (IN,JJ)
• (wi contains a hyphen, JJ)
• (w,JJ), (we,JJ), (wel,JJ), (well, JJ)
(d,JJ), (ed,JJ), (led,JJ), (eled, JJ)