Machine Learning Basics
Rachele Sprugnoli
[email protected]
Artificial Intelligence
Artificial Intelligence
● Artificial Intelligence attempts to build intelligent entities
○ Involves many disciplines, e.g. mathematics, computer science,
electronics, logic, philosophy, ethics, linguistics…
● What is intelligence?
○ Intelligence has been defined in many ways: the capacity for
abstraction, logic, understanding, self-awareness, learning,
emotional knowledge, reasoning, planning, creativity, critical
thinking, and problem-solving. It can be further described as the
ability to perceive or infer information; and to retain it as
knowledge to be applied to adaptive behaviors within an
environment or context. (Wikipedia)
Artificial Intelligence
● For AI we have two dimensions of intelligence:
○ thinking vs. acting (and, across them, human-like vs. rational)
● Four main views of AI in the literature:
○ Thinking Humanly: "The exciting new effort to make computers think … machines with minds" (Haugeland, 1985)
○ Thinking Rationally: "The study of mental faculties through the use of
computational models." (Charniak and McDermott, 1985)
○ Acting Humanly: "The study of how to make computers do things at
which, at the moment, people are better." (Rich and Knight, 1991) →
Turing test
○ Acting Rationally: "AI … is concerned with intelligent behavior in artifacts." (Nilsson, 1998)
Artificial Intelligence
[Diagram: the topics covered TODAY vs. those left for the NEXT LECTURE]
Evolution of NLP
From RULE-BASED SYSTEMS to MACHINE-LEARNING SYSTEMS
Source: https://blog.dataiku.com/nlp-metamorphosis
Rule-based approach
● The system performs a linguistic task using rules defined and formalized "by hand" by a linguist
○ Pros: based on linguistic evidence, accurate for limited domains
○ Cons: difficult to extend or adapt to new domains, slow development, language-dependent, not able to deal with non-literal meaning
Rule-based approach
Example: Part-of-Speech tagging (i.e. assign a grammatical category)
1) assign to each word all possible PoS tags using a dictionary
   «pity the dead» → pity: NOUN, VERB | the: DET | dead: NOUN, ADJ, ADV
2) apply rules to remove ambiguous labels
   - «choose NOUN if preceded by DET»
   «pity the dead» → pity: NOUN, VERB | the: DET | dead: NOUN
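As a rough sketch, the two-step approach above (dictionary lookup plus one hand-written disambiguation rule) could look like this; the mini-lexicon and the rule are illustrative, not a real tagger:

```python
# Toy rule-based PoS tagger: dictionary lookup + one hand-written disambiguation rule.
LEXICON = {                      # hypothetical mini-dictionary: word -> possible tags
    "pity": {"NOUN", "VERB"},
    "the":  {"DET"},
    "dead": {"NOUN", "ADJ", "ADV"},
}

def tag(sentence):
    tokens = sentence.split()
    # Step 1: assign to each word all possible PoS tags from the dictionary
    candidates = [set(LEXICON.get(tok, {"NOUN"})) for tok in tokens]
    # Step 2: apply rules to remove ambiguous labels
    for i in range(1, len(tokens)):
        if "DET" in candidates[i - 1] and "NOUN" in candidates[i]:
            candidates[i] = {"NOUN"}     # «choose NOUN if preceded by DET»
    return list(zip(tokens, candidates))

print(tag("pity the dead"))   # 'dead' is disambiguated to NOUN
```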
Rule-based approach
● Example: Named entity recognition (i.e. identify and classify proper
names)
Mikheev, A., Moens, M., & Grover, C. (1999). Named entity recognition without gazetteers. In Ninth Conference of the
European Chapter of the Association for Computational Linguistics (pp. 1-8). https://aclanthology.org/E99-1001.pdf
ML: definition
● Learning is any process by which a system improves performance
from experience. - Herbert Simon
● A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with
experience E. - Tom Mitchell
ML: definition
● 3 components:
1. task to be addressed by the system (e.g. assign a grammatical
category to each token of a sentence)
2. training experience to train the learning system (e.g. labelled
tokens)
3. performance measure to evaluate the learned system (e.g.
number of misclassified tokens)
● Example:
T: assign a grammatical category to each token of a sentence
E: a dataset with labelled tokens
P: percentage of tokens correctly classified
ML: main types
1. UNSUPERVISED: do not need annotated data for training the model
a. SELF-SUPERVISED: labels are generated automatically by using
patterns in the unlabeled data
2. SUPERVISED: use annotated data for training the model
3. SEMI-SUPERVISED: combine information from both annotated and
non-annotated data for training the model
4. REINFORCEMENT LEARNING: the system tries different actions, getting rewards
for good choices or penalties for bad ones
Supervised Approach
[Diagram: the MATTER cycle applied to a supervised learning workflow]
From Pustejovsky and Stubbs (2012) "Natural Language Annotation for Machine Learning". O'Reilly Media.
Semi-supervised Approach
[Diagram: the MATTER cycle with an additional step (3B) feeding UNLABELED DATA into training]
From Pustejovsky and Stubbs (2012) "Natural Language Annotation for Machine Learning". O'Reilly Media.
MATTER cycle
● The MATTER cycle:
1. Model: theoretical description of a linguistic phenomenon
2. Annotate: data annotation following a model-based annotation
scheme defined in step 1
3. Train: training of an ML algorithm on the annotated corpus
4. Test: test the trained system on a new sample of data
5. Evaluate: system performance evaluation
6. Revise: revision of the model, annotation, algorithm
How to use annotated data
● Annotated data is divided into 3 disjoint subsets:
○ training set (typically 80% of the total)
○ development or validation set (10%) → to reduce OVERFITTING
○ evaluation or test set (10%)
● After the model has been trained
on the training set, it is tested
and tuned on the dev set
● Once satisfactory results have
been achieved on the dev set,
the model is evaluated on the
test set
Source: https://www.v7labs.com/blog/train-validation-test-set
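A minimal sketch of the 80/10/10 split with scikit-learn; the texts and labels below are placeholders, not the course data:

```python
from sklearn.model_selection import train_test_split

texts  = [f"example sentence {i}" for i in range(100)]   # placeholder corpus
labels = [i % 2 for i in range(100)]                      # placeholder gold labels

# First split off the 10% test set, then carve a 10% dev set out of the remaining 90%.
X_tmp, X_test, y_tmp, y_test = train_test_split(texts, labels, test_size=0.10, random_state=42)
X_train, X_dev, y_train, y_dev = train_test_split(X_tmp, y_tmp, test_size=1/9, random_state=42)

print(len(X_train), len(X_dev), len(X_test))   # 80 10 10
```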
Cross-validation
● Dividing the available data into 3 subsets reduces the amount of data
available for training
● SOLUTION → cross-validation:
○ Split training+dev into k partitions (called folds, usually 5 or 10)
○ Choose k–1 folds as training set: the remaining fold is used for
evaluation
○ Repeat with the rest of the folds: each time use the remaining fold for
evaluation → in the end, you should have assessed the model on
every fold
○ To get the final score, average the results obtained on each test fold
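As a rough sketch, 5-fold cross-validation with scikit-learn might look like this (toy data and a simple bag-of-words classifier, purely illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

texts  = ["good movie", "bad movie", "great film", "awful film"] * 10   # toy data
labels = [1, 0, 1, 0] * 10

model  = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(model, texts, labels, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# One accuracy value per fold; report the mean and the standard deviation.
print(scores, scores.mean(), scores.std())
```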
Cross-validation
● 5-fold cross-validation: [diagram of the five train/test splits; the fold scores are averaged and reported with their standard deviation]
● 10-fold cross-validation: [the same scheme with ten folds]
Features
● Features: important properties of the text
that serve as input to machine learning
algorithms
- Lexical: words
- Syntactic: PoS, relations
- Semantic: embeddings
- Statistical: word frequency, TF-IDF
- Structural: capitalization, punctuation
Source: https://www.3rdisearch.com/blogs/content-classification-the-backbone-of-text-mining
Feature extraction
● Feature extraction: identifying and extracting relevant features from
raw data
● Feature engineering: process of transforming texts into structured
features that machine learning models can understand and use for
predictions
● Example: features for a Named Entity Recognition model:
○ Word itself
○ Capital letters
○ Part of speech
○ Context Words
N.B. Deep learning models automatically learn features from raw text!
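A sketch of hand-crafted features for a single token, along the lines of the list above (the example sentence and tags are made up):

```python
# Hand-crafted features for the token at position i (illustrative, not a full NER system).
def token_features(tokens, pos_tags, i):
    return {
        "word": tokens[i].lower(),                                    # the word itself
        "is_capitalized": tokens[i][0].isupper(),                     # capital letters
        "pos": pos_tags[i],                                           # part of speech
        "prev_word": tokens[i - 1].lower() if i > 0 else "<S>",       # left context word
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</S>",  # right context word
    }

tokens   = ["Maria", "lives", "in", "Parma"]
pos_tags = ["PROPN", "VERB", "ADP", "PROPN"]
print(token_features(tokens, pos_tags, 0))
```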
Supervised approach: example
● CLASSIFICATION: given a set of predefined classes, determine which class a certain linguistic element belongs to, e.g. grammatical category (PoS tagging)
[Diagram: training items labelled A or B; the trained model assigns A or B to unseen test items]
→ Will the algorithm be able to classify an element as belonging to class c?
Supervised approach: example
● REGRESSION: predict a numerical value (a continuous score, like 1-10) from text data instead of predicting a label
CLASSIFICATION vs. REGRESSION:
- Sentiment analysis: predict a label (POSITIVE, NEGATIVE, NEUTRAL) based on a product review vs. predict a sentiment score (e.g., from 1 to 10) based on a product review
- Readability: predict how easy or hard a text is to read by assigning a label (EASY, INTERMEDIATE, ADVANCED) vs. by assigning a score from 0 to 100, where higher means easier
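A sketch of the same sentiment task treated first as classification and then as regression (toy reviews; the scikit-learn models are chosen only for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.pipeline import make_pipeline

reviews = ["loved it", "terrible product", "really great", "worst purchase ever"]  # toy reviews
labels  = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]   # classification: predict a label
scores  = [9.0, 2.0, 8.5, 1.0]                               # regression: predict a 1-10 score

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(reviews, labels)
regressor  = make_pipeline(TfidfVectorizer(), Ridge()).fit(reviews, scores)

print(classifier.predict(["really great product"]))   # e.g. ['POSITIVE']
print(regressor.predict(["really great product"]))    # e.g. a score near the top of the scale
```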
Unsupervised approach: example
● CLUSTERING: grouping of the input based on some relationship of similarity between the data (no labels are present in the training data)
[Diagram: the same input items clustered by colour and, alternatively, by shape]
Unsupervised approach: example
● CLUSTERING: grouping of the input based on some relationship of
similarity between the data
○ K-MEANS: technique that divides data into K clusters by
minimizing the distance between points and their cluster center
(centroid)
1) Choose number of clusters (K)
2) Randomly select K distinct points as centroids
3) Assign each data point (word, sentence, or document) to the nearest
centroid
4) Recalculate the centroids
5) Repeat steps 3 and 4 until clusters are stable
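A minimal k-means sketch with scikit-learn on TF-IDF vectors (toy documents; steps 2-5 above run inside fit()):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cats purr softly", "kittens and cats sleep", "dogs bark loudly",
        "loud dogs barking", "the stock market rises", "shares fall on the market"]  # toy documents

X = TfidfVectorizer().fit_transform(docs)                         # texts -> vectors
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)   # step 1: K = 3; steps 2-5 inside fit()
print(kmeans.labels_)                                             # cluster id assigned to each document
```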
Unsupervised approach: example
● CLUSTERING with K-MEANS
[Figure: the k-means iterations step by step]
Source: https://www.blopig.com/blog/2020/07/k-means-clustering-made-simple/
Unsupervised approach: example
● CLUSTERING with K-MEANS: how to find the optimal number of clusters?
→ ELBOW METHOD: plot, for increasing values of K, the sum of the squared distances between each point and its cluster centroid; the "elbow" of the curve suggests the optimal K
Source: https://github.com/RihabFekii/clustering-methods
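A sketch of the elbow method: run k-means for several values of K and inspect the inertia (the sum of squared distances described above); toy documents again:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cats purr softly", "kittens and cats sleep", "dogs bark loudly",
        "loud dogs barking", "the stock market rises", "shares fall on the market"]
X = TfidfVectorizer().fit_transform(docs)

# inertia_ is the sum of squared distances between each point and its cluster centroid;
# plot it against K and look for the "elbow" of the curve.
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 3))
```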
Unsupervised approach: example
● SELF-SUPERVISION
○ GUESS-THE-NEXT-WORD: predict a word removed from the text
(masked word) → fill-in-the-gap task in which a model uses the
words surrounding a masked token to try to predict what it should
be
○ Demo: https://tweetnlp.org/demo/ (word prediction)
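A sketch of the fill-in-the-gap task with the Hugging Face transformers library; bert-base-uncased is just one possible masked language model and is not necessarily the model behind the demo above:

```python
from transformers import pipeline

# Masked word prediction: the model uses the surrounding words to fill the [MASK] slot.
unmasker = pipeline("fill-mask", model="bert-base-uncased")   # downloads the model on first run
for prediction in unmasker("Machine learning improves with more [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```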
Evaluation metrics for classification
● Confusion matrix: table used to compare the predictions of the system with the expected classifications annotated by experts (one along each axis)
- True Positive = TP, True Negative = TN → correct predictions
- False Positive = FP, False Negative = FN → wrong predictions
Evaluation metrics for classification
● The evaluation of the prediction of the system (output) is based on
manually annotated data: gold standard
● The simplest metric: ACCURACY
ACCURACY = (TP + TN) / N
Example:
- 150 tokens annotated in the test set
- 120 tokens correctly predicted
- accuracy = 120/150 = 0.8 (80%)
- it can be calculated on a general level or by class/tag
Evaluation metrics for classification
● PRECISION (P): it measures the ratio between the elements correctly
predicted by the system and the total of predicted elements
○ How many predicted items were actually correct?
- # correct predictions / # predictions given
PRECISION = TP / (TP + FP)
Evaluation metrics for classification
● RECALL (R): it measures the ratio between the elements correctly
predicted by the system and the total of the correct elements
○ How many correct items were predicted?
- # correct predictions / # possible correct elements
RECALL = TP / (TP + FN)
Evaluation metrics for classification
● Sometimes there is a gap between precision and recall: as precision
increases, recall often drops (and vice versa)
● F-MEASURE: harmonic mean between precision and recall
- 2*precision*recall / (precision + recall)
● Alternative metric: the parameterized F-measure, which lets you choose whether to give more importance to P or R: when β = 1 we speak of F1
F_β = (1 + β²)*P*R / (β²*P + R)
β = 1: P and R have the same weight
β > 1: R is more important
β < 1: P is more important
β = 0: only P is taken into consideration
Evaluation metrics for classification: example
                        ACTUAL (gold standard)
                        Positive    Negative
PREDICTED    Positive   70 (TP)     15 (FP)
(test set)   Negative   30 (FN)     45 (TN)
- Precision: 70 / (70+15) = 70 / 85 = 0.82
- Recall: 70 / (70+30) = 70 / 100 = 0.70
- F-measure: 2*0.82*0.70 / (0.82+0.70) = 0.76
- Accuracy: (70+45)/(70+15+30+45) = 0.72
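The same numbers checked in a few lines of Python:

```python
tp, fp, fn, tn = 70, 15, 30, 45

precision = tp / (tp + fp)                                  # 0.82
recall    = tp / (tp + fn)                                  # 0.70
f_measure = 2 * precision * recall / (precision + recall)   # 0.76
accuracy  = (tp + tn) / (tp + fp + fn + tn)                 # 0.72

print(round(precision, 2), round(recall, 2), round(f_measure, 2), round(accuracy, 2))
```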
Try it yourself!
● Clustering:
https://colab.research.google.com/drive/1V0oeXnBlr2frOzLGZDZG7iCBS-eOLy4Z?usp=sharing
Try it yourself!
● Evaluation metrics for classification:
https://colab.research.google.com/drive/1OQ21sLMNQvaBEldqtmb5bwgWz83mvv9Z?usp=sharing
● A file named “valutazione.tsv” is needed:
○ two tab-separated columns
○ first column: header P with predicted labels
○ second column: header GS with gold standard labels
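For reference, the notebook roughly amounts to something like the following sketch (the actual notebook may differ; it assumes valutazione.tsv is in the working directory):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report

df = pd.read_csv("valutazione.tsv", sep="\t")   # columns: P (predicted), GS (gold standard)

# Per-class precision, recall, F1 and support, plus macro and weighted averages
print(classification_report(df["GS"], df["P"]))
print("Accuracy:", accuracy_score(df["GS"], df["P"]))
```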
Try it yourself!
● Output:
- Precision: how many of the positive predictions made are correct
- Recall: how many of the positive cases the classifier correctly predicted,
over all the positive cases in the data
- F1-score: harmonic mean of precision and recall, with both given the same weight
- Support: number of actual occurrences of the class in the specified
dataset
Try it yourself!
● Output:
- Accuracy: number of correct predictions over all predictions
- Macro average: mean of the scores calculated per class
- Weighted average: mean of the per-class scores weighted by each class's support → gives more weight to classes with more examples, useful for unbalanced datasets
Questions?