Natural Language
Processing
Lecture 13:
Machine Learning: Linear and Log-Linear Models
12/4/2019
COMS W4705
Yassine Benajiba
Intro
Machine Learning and NLP
• We have encountered many different situations where we
had to make a prediction:
• Text classification, language modeling, POS tagging,
constituency/dependency parsing, …
• These are all classification problems of some form.
• Today: Some machine learning background. Linear/log-
linear models. Basic neural networks.
Generative Algorithms
• Assume the observed data is being “generated” by a
“hidden” class label.
• Build a different model for each class.
• To predict a new example, check it under each of the
models and see which one matches best.
• Model P(x | y) and P(y). Then use Bayes' rule:
P(y | x) = P(x | y) P(y) / P(x) ∝ P(x | y) P(y)
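A minimal sketch of this prediction rule, assuming per-class models that expose a log_likelihood(x) method and a dictionary of class priors (both names are illustrative, not from the slides):

import math

def predict_generative(x, class_models, class_priors):
    """Pick the class whose model explains x best, via Bayes' rule:
    argmax_y P(x | y) P(y) = argmax_y log P(x | y) + log P(y)."""
    best_label, best_score = None, float("-inf")
    for label, model in class_models.items():
        # model.log_likelihood(x) is an assumed interface returning log P(x | y=label)
        score = model.log_likelihood(x) + math.log(class_priors[label])
        if score > best_score:
            best_label, best_score = label, score
    return best_label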
Discriminative Algorithms
• Model the conditional distribution of the label given the data, P(y | x), directly.
• Learn decision boundaries that separate instances of the
different classes.
• To predict a new example, check on which side of the
decision boundary it falls.
Machine Learning Definition
• “Creating systems that improve from experience.”
• “A computer program is said to learn from experience E
with respect to some class of tasks T and performance
measure P, if its performance at tasks in T, as measured
by P, improves with experience E.”
(Tom Mitchell, Machine Learning, 1997)
Inductive Learning (a.k.a. Science)
• Goal: given a set of input/output pairs (training data), find
the function f(x) that maps inputs to outputs.
• Problem: We did not see all possible inputs!
• Learn an approximate function h(x) from the training data
and hope that this function generalizes well to unseen inputs.
• Ockham’s razor: Choose the simplest hypothesis that is
consistent with the training data.
Classification and Regression
• Recall: In supervised learning, training data consisting of training
examples
(x1, y1), …, (xn, yn), where xj is an input example (a d-dimensional vector of
attribute values) and yj is the label.
• Two types of supervised learning problems:
• In classification: yj is drawn from a finite, discrete set.
Typically yj ∈ {-1, +1}, i.e. predict a label from a set of labels.
Learn a classifier function h: ℝd → {-1, +1}.
• In regression: xj ∈ ℝd, yj ∈ ℝ, i.e. predict a numeric value.
Learn a regressor function h: ℝd → ℝ.
Linear Classification and
Regression
[Figure: left, regression, a function h(x) fit to numeric outputs; right, classification, a decision boundary separating two classes in the (x1, x2) plane.]
Linear Classification
Training ML models
Training Data → ML algorithm → function h(x) = y
• How can we be confident about the learned function?
• Can compute the empirical error/risk on the training set:
R̂(h) = (1/n) Σi=1…n L(h(xi), yi)
• Typical loss functions:
• Least squares loss (L2): L(h(x), y) = (h(x) − y)²
• Classification error (0/1 loss): L(h(x), y) = 1 if h(x) ≠ y, else 0
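A small sketch of these quantities in code (the function names are illustrative):

import numpy as np

def squared_loss(prediction, target):
    return (prediction - target) ** 2            # least-squares (L2) loss

def zero_one_loss(prediction, target):
    return 0.0 if prediction == target else 1.0  # classification error

def empirical_risk(h, X, y, loss):
    """Average loss of hypothesis h over the training examples."""
    return float(np.mean([loss(h(x), yi) for x, yi in zip(X, y)]))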
Training ML models
Training Data → ML algorithm → function h(x) = y
• Empirical error/risk: R̂(h) = (1/n) Σi L(h(xi), yi)
• Training aims to minimize R̂(h), the training error.
• We hope that this also minimizes R(h), the test error.
Overfitting
• Problem: Minimizing empirical risk can lead to overfitting.
• This happens when a model works well on the training
data, but it does not generalize to testing data.
• Data sets can be noisy. Overfitting can model the noise
in the data.
Preventing Overfitting
• Solutions: Simpler models.
• Reduce the number of features (feature selection).
• Model selection.
• Regularization.
• Cross validation.
• However: Adding wrong assumptions (bias) to the training
algorithm can lead to underfitting!
Goodness of Fit
Linear Model
[Figure: a single linear unit. The inputs xi1, …, xin and a constant bias input 1 (xi0) are multiplied by weights w0, w1, …, wn, summed (Σ), and passed through a threshold function (the activation function) to produce the output.]
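A direct sketch of this unit in code, with w[0] playing the role of the bias weight (a sketch, not the lecture's exact notation):

import numpy as np

def linear_unit(x, w):
    """Output of a linear unit: threshold(w0 * 1 + w1*x1 + ... + wn*xn)."""
    activation = w[0] + np.dot(w[1:], x)   # weighted sum; w[0] is the bias weight
    return 1 if activation >= 0 else -1    # threshold activation function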
Linear Models
• We have chosen a function class (linear separators).
• Specified by parameter w.
• Need to estimate w on the basis of the training set.
• What loss should we use? One option: minimize the
classification error on the training set, i.e. the number of training examples for which the predicted label h(xi) differs from yi.
Perceptron Learning
• Problem: Threshold function is not differentiable, so we
cannot find a closed-form solution or apply gradient descent.
• Instead use iterative perceptron learning algorithm:
• Start with arbitrary hyperplane.
• Adjust it using the training data.
• Update rule (for a misclassified example): wj ← wj + α (yi − h(xi)) xij
• The Perceptron Convergence Theorem states that, if the training
data is linearly separable, this algorithm converges in a finite number
of iterations.
Perceptron Learning
Algorithm
Input: Training examples (x1, y1), …, (xn, yn)
Output: A perceptron defined by (w0, w1, …, wd)

Initialize wj ← 0, for j = 0…d
while not converged:              # "converged" means the weights do not change
                                  # during one entire pass through the training data
    shuffle the training examples
    for each training example (xi, yi):
        output ← h(xi)
        if output ≠ yi:           # output and target do not match
            for each weight wj:
                wj ← wj + α (yi − output) xij
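A runnable NumPy version of this algorithm (the function names and the fixed learning rate are illustrative):

import numpy as np

def perceptron_train(X, y, alpha=1.0, max_epochs=100):
    """Train a perceptron on examples X (n x d) with labels y in {-1, +1}.
    Returns a weight vector of length d + 1, where index 0 is the bias weight."""
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])   # prepend a constant 1 so w[0] is the bias
    w = np.zeros(d + 1)
    rng = np.random.default_rng(0)
    for _ in range(max_epochs):
        changed = False
        for i in rng.permutation(n):                 # shuffle each pass
            output = 1 if Xb[i] @ w >= 0 else -1
            if output != y[i]:                       # misclassified: update the weights
                w += alpha * (y[i] - output) * Xb[i]
                changed = True
        if not changed:                              # a full pass with no updates: converged
            break
    return w

def perceptron_predict(w, X):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.where(Xb @ w >= 0, 1, -1)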
Perceptron
• Simple learning algorithm. Guaranteed to converge after a
finite number of steps.
• But only if the data is linearly separable.
[Figure: training data in the (x1, x2) plane that no single line can separate; the perceptron cannot learn this.]
Feature Functions
• In NLP we often need to make multi-class decisions.
Linear models provide only binary decisions.
• Use a feature function φ(x, y), where x is an input object
and y is a possible output.
• The values of φ(x, y) are d-dimensional vectors.
Log-Linear Model
(a.k.a. "Maximum Entropy Models")
• Define the conditional probability
P(y | x) = exp(w · φ(x, y)) / Σy' exp(w · φ(x, y'))
• exp(z) = e^z is positive for any z.
• But how should we estimate w?
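A small sketch of the conditional probability defined above, assuming the feature function phi(x, y) returns a NumPy vector and labels is the set of possible outputs (names are illustrative):

import numpy as np

def log_linear_probs(w, phi, x, labels):
    """Return P(y | x) for every possible label y under a log-linear model."""
    scores = np.array([w @ phi(x, y) for y in labels])  # w . phi(x, y) for each candidate y
    scores -= scores.max()                              # shift scores for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()                # normalize (softmax)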
Log-Likelihood
• Define the log-likelihood of some model w on the training
data (x1, y1), …, (xn, yn) as
LL(w) = Σi=1…n log P(yi | xi; w)
• We want to compute the maximum-likelihood estimate w* = argmaxw LL(w).
• Unfortunately, there is no general analytical solution. Can
use gradient-based optimization.
Simple Gradient Ascent
Initialize w ← any setting in the parameter (weight) space
for a set number of iterations T:
    for each wi in w:
        update wi to w'i = wi + α ∂LL(w)/∂wi
• Follow the gradients (partial derivatives) to find a parameter setting
that maximizes LL(w)
• α > 0 is the learning rate or step size.
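A generic sketch of this loop, assuming a function grad_LL(w) that returns the vector of partial derivatives (both the name and the fixed iteration count are illustrative):

import numpy as np

def gradient_ascent(grad_LL, d, alpha=0.1, T=100):
    """Maximize LL(w) by repeatedly stepping in the direction of its gradient."""
    w = np.zeros(d)                    # any initial setting in the parameter space
    for _ in range(T):
        w = w + alpha * grad_LL(w)     # w'_i = w_i + alpha * dLL/dw_i, for all i at once
    return w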
Partial Derivative of the Log
Likelihood
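For the log-linear model defined above, the partial derivative of LL(w) with respect to each weight wj takes the standard "observed minus expected feature counts" form:

\frac{\partial LL(w)}{\partial w_j}
  \;=\; \sum_{i=1}^{n} \phi_j(x_i, y_i)
  \;-\; \sum_{i=1}^{n} \sum_{y'} P(y' \mid x_i; w)\,\phi_j(x_i, y')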
Regularization
• Problem: Parameter estimation can overfit the training
data.
• Can include a regularization term. For example, the L2
regularizer: (λ/2) ‖w‖² = (λ/2) Σj wj²
• λ > 0 controls the strength of the regularization.
• Since we are maximizing LL(w) − (λ/2) ‖w‖²,
there is now a trade-off between fit and model 'complexity'.
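With this regularizer the gradient only changes by an extra term that shrinks each weight toward zero (a standard form, consistent with the gradient above):

\frac{\partial}{\partial w_j}\!\left( LL(w) - \frac{\lambda}{2}\,\lVert w \rVert^2 \right)
  \;=\; \frac{\partial LL(w)}{\partial w_j} \;-\; \lambda\, w_j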
POS Tagging with
Log-Linear Models
• Previously we used a generative model (HMM) for POS
tagging.
• Now we want to use a discriminative model for
P(t1, …, tm | w1, …, wm).
• The next tag is conditioned on the previous tag sequence and all
observed words.
Maximum Entropy Markov
Models (MEMM)
• Make an independence assumption (similar to the HMM):
P(t1, …, tm | w1, …, wm) = ∏i=1…m P(ti | ti-1, w1, …, wm)
• The probability of each tag only depends on the previous tag (and the observed words).
MEMMs
• Model each term using a log-linear model:
P(ti | ti-1, w1, …, wm) = exp(w · φ(w1,…,wm, i, ti-1, ti)) / Σt' exp(w · φ(w1,…,wm, i, ti-1, t'))
• φ is a feature function defined over:
• the observed words w1,...,wm
• the position of the current word
• the previous tag ti-1
• the suggested tag for the current word ti
• t' is a variable ranging over all possible tags.
MEMMs
• Training: same as any log-linear model.
• Decoding: Need to find
arg maxt1,…,tm P(t1, …, tm | w1, …, wm)
• Can use Viterbi algorithm!
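A sketch of Viterbi decoding for this model, assuming a function local_prob(words, i, prev_tag, tag) that returns P(ti = tag | ti-1 = prev_tag, w1, …, wm) (the names, and the start-tag convention, are illustrative):

import math

def viterbi_decode(words, tags, local_prob, start_tag="<s>"):
    """Find the most probable tag sequence under the MEMM."""
    m = len(words)
    # pi[i][t]: best log-probability of any tag sequence ending in tag t at position i
    pi = [{t: float("-inf") for t in tags} for _ in range(m)]
    back = [{t: None for t in tags} for _ in range(m)]

    for t in tags:                                  # base case: first word
        pi[0][t] = math.log(local_prob(words, 0, start_tag, t))

    for i in range(1, m):                           # recursive case
        for t in tags:
            for prev in tags:
                score = pi[i - 1][prev] + math.log(local_prob(words, i, prev, t))
                if score > pi[i][t]:
                    pi[i][t] = score
                    back[i][t] = prev

    best = max(tags, key=lambda t: pi[m - 1][t])    # best final tag
    sequence = [best]
    for i in range(m - 1, 0, -1):                   # follow back-pointers
        sequence.append(back[i][sequence[-1]])
    return list(reversed(sequence))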
Feature Function
(Ratnaparkhi, 1996)
• φ(w1, …, wm, i, ti-1, ti) is a feature vector of length d.
• (wi,ti), (wi-1,ti), (wi-2,ti), (wi+1,ti), (wi+2,ti)
• (ti-1,ti)
• (wi contains numbers, ti),
(wi contains uppercase characters, ti)
(wi contains a hyphen, ti)
• (prefix1 of wi,ti), (prefix2 of wi,ti), (prefix3 of wi,ti), (prefix4 of wi,ti)
(suffix1 of wi,ti), (suffix2 of wi,ti), (suffix3 of wi,ti), (suffix4 of wi,ti)
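A sketch of how these templates might be extracted in code (the string encoding of the features is illustrative):

def extract_features(words, i, prev_tag, tag):
    """Binary features for tagging words[i] as `tag`, following the templates above."""
    w = words[i]
    feats = [f"word={w}+tag={tag}", f"prevtag={prev_tag}+tag={tag}"]
    # Neighboring words (where they exist).
    if i > 0:
        feats.append(f"word-1={words[i-1]}+tag={tag}")
    if i > 1:
        feats.append(f"word-2={words[i-2]}+tag={tag}")
    if i < len(words) - 1:
        feats.append(f"word+1={words[i+1]}+tag={tag}")
    if i < len(words) - 2:
        feats.append(f"word+2={words[i+2]}+tag={tag}")
    # Spelling features.
    if any(c.isdigit() for c in w):
        feats.append(f"has-digit+tag={tag}")
    if any(c.isupper() for c in w):
        feats.append(f"has-upper+tag={tag}")
    if "-" in w:
        feats.append(f"has-hyphen+tag={tag}")
    # Prefixes and suffixes up to length 4.
    for k in range(1, 5):
        feats.append(f"prefix{k}={w[:k]}+tag={tag}")
        feats.append(f"suffix{k}={w[-k:]}+tag={tag}")
    return feats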
Feature Example
The stories about well-heeled communities and developers ...
DT NNS IN ??
• (well-heeled,JJ), (about,JJ), (stories,JJ), (communities, JJ), (and,JJ)
• (IN,JJ)
• (wi contains a hyphen, JJ)
• (w,JJ), (we,JJ), (wel,JJ), (well, JJ)
(d,JJ), (ed,JJ), (led,JJ), (eled, JJ)