Machine Learning Basics
Rachele Sprugnoli
[email protected]
Artificial Intelligence
Artificial Intelligence
● Artificial Intelligence attempts to build intelligent entities
○ Involves many disciplines, e.g. mathematics, computer science,
electronics, logic, philosophy, ethics, linguistics…
● What is intelligence?
○ Intelligence has been defined in many ways: the capacity for
abstraction, logic, understanding, self-awareness, learning,
emotional knowledge, reasoning, planning, creativity, critical
thinking, and problem-solving. It can be further described as the
ability to perceive or infer information; and to retain it as
knowledge to be applied to adaptive behaviors within an
environment or context. (Wikipedia)
Artificial Intelligence
● For AI we have two dimensions of intelligence:
○ thinking vs. acting (and, across them, human-like vs. rational)
● Four main views of AI in the literature:
○ Thinking Humanly: "The exciting new effort to make computers think … machines with minds" (Haugeland, 1985)
○ Thinking Rationally: "The study of mental faculties through the use of
computational models." (Charniak and McDermott, 1985)
○ Acting Humanly: "The study of how to make computers do things at
which, at the moment, people are better." (Rich and Knight, 1991) →
Turing test
○ Acting Rationally: "AI … is concerned with intelligent behavior in artifacts." (Nilsson, 1998)
Artificial Intelligence
[Diagram: the topics covered TODAY vs. those left for the NEXT LECTURE]
Evolution of NLP
From RULE-BASED SYSTEMS to MACHINE-LEARNING SYSTEMS
Source: https://blog.dataiku.com/nlp-metamorphosis
Rule-based approach
● The system performs a linguistic task using rules defined and formalized "by hand" by a linguist
○ Pros: based on linguistic evidence, accurate for limited domains
○ Cons: difficult to extend or adapt to new domains, slow development, language-dependent, not able to deal with non-literal meaning
Rule-based approach
Example: Part-of-Speech tagging (i.e. assign a grammatical category)
1) assign to each word all possible PoS tags using a dictionary
   «pity the dead» → pity: NOUN, VERB | the: DET | dead: NOUN, ADJ, ADV
2) apply rules to remove ambiguous labels
   - «choose NOUN if preceded by DET»
   «pity the dead» → pity: NOUN, VERB | the: DET | dead: NOUN
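As a rough sketch, the two-step approach above (dictionary lookup plus one hand-written disambiguation rule) could look like this; the mini-lexicon and the rule are illustrative, not a real tagger:

```python
# Toy rule-based PoS tagger: dictionary lookup + one hand-written disambiguation rule.
LEXICON = {                      # hypothetical mini-dictionary: word -> possible tags
    "pity": {"NOUN", "VERB"},
    "the":  {"DET"},
    "dead": {"NOUN", "ADJ", "ADV"},
}

def tag(sentence):
    tokens = sentence.split()
    # Step 1: assign to each word all possible PoS tags from the dictionary
    candidates = [set(LEXICON.get(tok, {"NOUN"})) for tok in tokens]
    # Step 2: apply rules to remove ambiguous labels
    for i in range(1, len(tokens)):
        if "DET" in candidates[i - 1] and "NOUN" in candidates[i]:
            candidates[i] = {"NOUN"}     # «choose NOUN if preceded by DET»
    return list(zip(tokens, candidates))

print(tag("pity the dead"))   # 'dead' is disambiguated to NOUN
```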
Rule-based approach
● Example: Named entity recognition (i.e. identify and classify proper
names)
Mikheev, A., Moens, M., & Grover, C. (1999). Named entity recognition without gazetteers. In Ninth Conference of the
European Chapter of the Association for Computational Linguistics (pp. 1-8). https://aclanthology.org/E99-1001.pdf
ML: definition
● Learning is any process by which a system improves performance
from experience. - Herbert Simon
● A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with
experience E. - Tom Mitchell
ML: definition
● 3 components:
1. task to be addressed by the system (e.g. assign a grammatical
category to each token of a sentence)
2. training experience to train the learning system (e.g. labelled
tokens)
3. performance measure to evaluate the learned system (e.g.
number of misclassified tokens)
● Example:
T: assign a grammatical category to each token of a sentence
E: a dataset with labelled tokens
P: percentage of tokens correctly classified
ML: main types
1. UNSUPERVISED: do not need annotated data for training the model
a. SELF-SUPERVISED: labels are generated automatically by using
patterns in the unlabeled data
2. SUPERVISED: use annotated data for training the model
3. SEMI-SUPERVISED: combine information from both annotated and
non-annotated data for training the model
4. REINFORCEMENT LEARNING: the system tries different actions, getting rewards
for good choices or penalties for bad ones
Supervised Approach
[Diagram: the MATTER cycle applied to a supervised learning workflow]
From Pustejovsky and Stubbs (2012) "Natural Language Annotation for Machine Learning". O'Reilly Media.
Semi-supervised Approach
[Diagram: the MATTER cycle with an additional step (3B) feeding UNLABELED DATA into training]
From Pustejovsky and Stubbs (2012) "Natural Language Annotation for Machine Learning". O'Reilly Media.
MATTER cycle
● The MATTER cycle:
1. Model: theoretical description of a linguistic phenomenon
2. Annotate: data annotation following a model-based annotation
scheme defined in step 1
3. Train: training of an ML algorithm on the annotated corpus
4. Test: test the trained system on a new sample of data
5. Evaluate: system performance evaluation
6. Revise: revision of the model, annotation, algorithm
How to use annotated data
● Annotated data is divided into 3 disjoint subsets:
○ training set (typically 80% of the total)
○ development or validation set (10%) → to reduce OVERFITTING
○ evaluation or test set (10%)
● After the model has been trained
on the training set, it is tested
and tuned on the dev set
● Once satisfactory results have
been achieved on the dev set,
the model is evaluated on the
test set
Source: https://www.v7labs.com/blog/train-validation-test-set
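A minimal sketch of the 80/10/10 split with scikit-learn; the texts and labels below are placeholders, not the course data:

```python
from sklearn.model_selection import train_test_split

texts  = [f"example sentence {i}" for i in range(100)]   # placeholder corpus
labels = [i % 2 for i in range(100)]                      # placeholder gold labels

# First split off the 10% test set, then carve a 10% dev set out of the remaining 90%.
X_tmp, X_test, y_tmp, y_test = train_test_split(texts, labels, test_size=0.10, random_state=42)
X_train, X_dev, y_train, y_dev = train_test_split(X_tmp, y_tmp, test_size=1/9, random_state=42)

print(len(X_train), len(X_dev), len(X_test))   # 80 10 10
```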
Cross-validation
● Dividing the available data into 3 subsets reduces the amount of data
available for training
● SOLUTION → cross-validation:
○ Split training+dev into k partitions (called folds, usually 5 or 10)
○ Choose k–1 folds as training set: the remaining fold is used for
evaluation
○ Repeat with the rest of the folds: each time use the remaining fold for
evaluation → in the end, you should have assessed the model on
every fold
○ To get the final score, average the results obtained on each test fold
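As a rough sketch, 5-fold cross-validation with scikit-learn might look like this (toy data and a simple bag-of-words classifier, purely illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline

texts  = ["good movie", "bad movie", "great film", "awful film"] * 10   # toy data
labels = [1, 0, 1, 0] * 10

model  = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(model, texts, labels, cv=KFold(n_splits=5, shuffle=True, random_state=0))

# One accuracy value per fold; report the mean and the standard deviation.
print(scores, scores.mean(), scores.std())
```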
Cross-validation
● 5-fold cross-validation: [diagram of the five train/test splits; the fold scores are averaged and reported with their standard deviation]
● 10-fold cross-validation: [the same scheme with ten folds]
Features
● Features: important properties of the text
that serve as input to machine learning
algorithms
- Lexical: words
- Syntactic: PoS, relations
- Semantic: embeddings
- Statistical: word frequency, TF-IDF
- Structural: capitalization, punctuation
Source: https://www.3rdisearch.com/blogs/content-classification-the-backbone-of-text-mining
Feature extraction
● Feature extraction: identifying and extracting relevant features from
raw data
● Feature engineering: process of transforming texts into structured
features that machine learning models can understand and use for
predictions
● Example: features for a Named Entity Recognition model:
○ Word itself
○ Capital letters
○ Part of speech
○ Context Words
N.B. Deep learning models automatically learn features from raw text!
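A sketch of hand-crafted features for a single token, along the lines of the list above (the example sentence and tags are made up):

```python
# Hand-crafted features for the token at position i (illustrative, not a full NER system).
def token_features(tokens, pos_tags, i):
    return {
        "word": tokens[i].lower(),                                    # the word itself
        "is_capitalized": tokens[i][0].isupper(),                     # capital letters
        "pos": pos_tags[i],                                           # part of speech
        "prev_word": tokens[i - 1].lower() if i > 0 else "<S>",       # left context word
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</S>",  # right context word
    }

tokens   = ["Maria", "lives", "in", "Parma"]
pos_tags = ["PROPN", "VERB", "ADP", "PROPN"]
print(token_features(tokens, pos_tags, 0))
```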
Supervised approach: example
● CLASSIFICATION: given a set of predefined classes, determine which class a certain linguistic element belongs to, e.g. grammatical category (PoS tagging)
[Diagram: training items labelled A or B; the trained model assigns A or B to unseen test items]
→ Will the algorithm be able to classify an element as belonging to class c?
Supervised approach: example
● REGRESSION: predict a numerical value (a continuous score, like 1-10) from text data instead of predicting a label
CLASSIFICATION vs. REGRESSION:
- Sentiment analysis: predict a label (POSITIVE, NEGATIVE, NEUTRAL) based on a product review vs. predict a sentiment score (e.g., from 1 to 10) based on a product review
- Readability: predict how easy or hard a text is to read by assigning a label (EASY, INTERMEDIATE, ADVANCED) vs. by assigning a score from 0 to 100, where higher means easier
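A sketch of the same sentiment task treated first as classification and then as regression (toy reviews; the scikit-learn models are chosen only for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.pipeline import make_pipeline

reviews = ["loved it", "terrible product", "really great", "worst purchase ever"]  # toy reviews
labels  = ["POSITIVE", "NEGATIVE", "POSITIVE", "NEGATIVE"]   # classification: predict a label
scores  = [9.0, 2.0, 8.5, 1.0]                               # regression: predict a 1-10 score

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(reviews, labels)
regressor  = make_pipeline(TfidfVectorizer(), Ridge()).fit(reviews, scores)

print(classifier.predict(["really great product"]))   # e.g. ['POSITIVE']
print(regressor.predict(["really great product"]))    # e.g. a score near the top of the scale
```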
Unsupervised approach: example
● CLUSTERING: grouping of the input based on some relationship of similarity between the data (no labels are present in the training data)
[Diagram: the same input items clustered by colour and, alternatively, by shape]
Unsupervised approach: example
● CLUSTERING: grouping of the input based on some relationship of
similarity between the data
○ K-MEANS: technique that divides data into K clusters by
minimizing the distance between points and their cluster center
(centroid)
1) Choose number of clusters (K)
2) Randomly select K distinct points as centroids
3) Assign each data point (word, sentence, or document) to the nearest
centroid
4) Recalculate the centroids
5) Repeat steps 3 and 4 until clusters are stable
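A minimal k-means sketch with scikit-learn on TF-IDF vectors (toy documents; steps 2-5 above run inside fit()):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cats purr softly", "kittens and cats sleep", "dogs bark loudly",
        "loud dogs barking", "the stock market rises", "shares fall on the market"]  # toy documents

X = TfidfVectorizer().fit_transform(docs)                         # texts -> vectors
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)   # step 1: K = 3; steps 2-5 inside fit()
print(kmeans.labels_)                                             # cluster id assigned to each document
```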
Unsupervised approach: example
● CLUSTERING with K-MEANS
[Figure: the k-means iterations step by step]
Source: https://www.blopig.com/blog/2020/07/k-means-clustering-made-simple/
Unsupervised approach: example
● CLUSTERING with K-MEANS: how to find the optimal number of clusters?
→ ELBOW METHOD: plot, for increasing values of K, the sum of the squared distances between each point and its cluster centroid; the "elbow" of the curve suggests the optimal K
Source: https://github.com/RihabFekii/clustering-methods
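A sketch of the elbow method: run k-means for several values of K and inspect the inertia (the sum of squared distances described above); toy documents again:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cats purr softly", "kittens and cats sleep", "dogs bark loudly",
        "loud dogs barking", "the stock market rises", "shares fall on the market"]
X = TfidfVectorizer().fit_transform(docs)

# inertia_ is the sum of squared distances between each point and its cluster centroid;
# plot it against K and look for the "elbow" of the curve.
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 3))
```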
Unsupervised approach: example
● SELF-SUPERVISION
○ GUESS-THE-NEXT-WORD: predict a word removed from the text
(masked word) → fill-in-the-gap task in which a model uses the
words surrounding a masked token to try to predict what it should
be
○ Demo: https://tweetnlp.org/demo/ (word prediction)
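A sketch of the fill-in-the-gap task with the Hugging Face transformers library; bert-base-uncased is just one possible masked language model and is not necessarily the model behind the demo above:

```python
from transformers import pipeline

# Masked word prediction: the model uses the surrounding words to fill the [MASK] slot.
unmasker = pipeline("fill-mask", model="bert-base-uncased")   # downloads the model on first run
for prediction in unmasker("Machine learning improves with more [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```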
Evaluation metrics for classification
● Confusion matrix: table used to compare the predictions of the system with the expected classifications annotated by experts (one along each axis)
- True Positive = TP, True Negative = TN → correct predictions
- False Positive = FP, False Negative = FN → wrong predictions
Evaluation metrics for classification
● The evaluation of the prediction of the system (output) is based on
manually annotated data: gold standard
● The simplest metric: ACCURACY
ACCURACY = (TP + TN) / N
Example:
- 150 tokens annotated in the test set
- 120 tokens correctly predicted
- accuracy = 120/150 = 0.8 (80%)
- it can be calculated on a general level or by class/tag
Evaluation metrics for classification
● PRECISION (P): it measures the ratio between the elements correctly
predicted by the system and the total of predicted elements
○ How many predicted items were actually correct?
- # correct predictions / # predictions given
PRECISION = TP / (TP + FP)
Evaluation metrics for classification
● RECALL (R): it measures the ratio between the elements correctly
predicted by the system and the total of the correct elements
○ How many correct items were predicted?
- # correct predictions / # possible correct elements
RECALL = TP / (TP + FN)
Evaluation metrics for classification
● Sometimes there is a gap between precision and recall: as precision
increases, recall often drops (and vice versa)
● F-MEASURE: harmonic mean between precision and recall
- 2*precision*recall / (precision + recall)
● Alternative metric: the parameterized F-measure, which lets you choose whether to give more importance to P or R: when β = 1 we speak of F1
F_β = (1 + β²)*P*R / (β²*P + R)
β = 1: P and R have the same weight
β > 1: R is more important
β < 1: P is more important
β = 0: only P is taken into consideration
Evaluation metrics for classification: example
                        ACTUAL (gold standard)
                        Positive    Negative
PREDICTED    Positive   70 (TP)     15 (FP)
(test set)   Negative   30 (FN)     45 (TN)
- Precision: 70 / (70+15) = 70 / 85 = 0.82
- Recall: 70 / (70+30) = 70 / 100 = 0.70
- F-measure: 2*0.82*0.70 / (0.82+0.70) = 0.76
- Accuracy: (70+45)/(70+15+30+45) = 0.72
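The same numbers checked in a few lines of Python:

```python
tp, fp, fn, tn = 70, 15, 30, 45

precision = tp / (tp + fp)                                  # 0.82
recall    = tp / (tp + fn)                                  # 0.70
f_measure = 2 * precision * recall / (precision + recall)   # 0.76
accuracy  = (tp + tn) / (tp + fp + fn + tn)                 # 0.72

print(round(precision, 2), round(recall, 2), round(f_measure, 2), round(accuracy, 2))
```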
Try it yourself!
● Clustering:
https://colab.research.google.com/drive/1V0oeXnBlr2frOzLGZDZG7iCBS-eOLy4Z?usp=sharing
Try it yourself!
● Evaluation metrics for classification:
https://colab.research.google.com/drive/1OQ21sLMNQvaBEldqtmb5bwgWz83mvv9Z?usp=sharing
● A file named “valutazione.tsv” is needed:
○ two tab-separated columns
○ first column: header P with predicted labels
○ second column: header GS with gold standard labels
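For reference, the notebook roughly amounts to something like the following sketch (the actual notebook may differ; it assumes valutazione.tsv is in the working directory):

```python
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report

df = pd.read_csv("valutazione.tsv", sep="\t")   # columns: P (predicted), GS (gold standard)

# Per-class precision, recall, F1 and support, plus macro and weighted averages
print(classification_report(df["GS"], df["P"]))
print("Accuracy:", accuracy_score(df["GS"], df["P"]))
```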
Try it yourself!
● Output:
- Precision: how many of the positive predictions made are correct
- Recall: how many of the positive cases the classifier correctly predicted,
over all the positive cases in the data
- F1-score: harmonic mean of precision and recall, with both given the same weight
- Support: number of actual occurrences of the class in the specified
dataset
Try it yourself!
● Output:
- Accuracy: number of correct predictions over all predictions
- Macro average: mean of the scores calculated per class
- Weighted average: mean of the per-class scores weighted by each class's support → gives more weight to classes with more examples, useful for unbalanced datasets
Questions?