05/02/2021
Introduction to Natural Language Processing (NLP)
Dr. Tran Hong-Viet
UET-VNU
Content
Course Information
Some achievements of NLP
Overview of NLP
Linguistic levels of description
Why is NLP difficult?
Conclusion
Course information
Course: Natural Language Processing (NLP)
Instructor: Dr. Tran Hong Viet, Information Faculty.
Email: [email protected]
Tel: 0975486888
Course web page: https://courses.uet.vnu.edu.vn/ (choose the NLP course).
Up-to-date information
Lecture notes
Relevant dates, links, etc.
Prerequisites: Programming principles, discrete mathematics for computing, software design and software engineering concepts, AI. Good knowledge of C++, Java, Python.
Python required for programming assignments.
Grading: 30% for midterm and homework/assignments + 10% for attendance + 60% for final
Policy & Practical issues
Discussion is encouraged, but assignments must be your individual work.
Code copied from books or other libraries must be explicitly acknowledged.
Sharing or copying code is strictly prohibited.
Reference
Slides
Textbooks:
1) Speech and Language Processing, Daniel Jurafsky & James H. Martin, second edition, Prentice Hall, 2009 (https://web.stanford.edu/~jurafsky/slp3/)
2) Natural Language Processing, Eisenstein, 2018 (https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf)
3) Foundations of Statistical Natural Language Processing, Christopher D. Manning & Hinrich Schütze, 2001
NLP in Industry
Communication With Machines
Google Translate & Vietgle Translate
Virtual Assistant
Conversational agents contain:
Speech recognition
Language analysis
Dialogue processing
Information retrieval
Text to speech
Google Now, Alexa, Siri, Cortana, VAV…
Machine Translation vs. Human
Watson system – IBM 2011 (Question-Answering)
• IBM built a computer that won Jeopardy! in 2011
• Question answering technology built on 200 million text pages, encyclopedias, dictionaries, thesauri, taxonomies, ontologies, and other databases
Google’s Knowledge Graph
Goal: move beyond keyword search document retrieval to directly
easier for mobile device users
Google’s Knowledge Graph (“things not strings”):
built on top of FreeBase
entries are synthesised from Wikipedia, news stories, etc.
Manually updating
Key Applications in 2019
Computational linguistics (i.e., modeling the human capacity for language computationally)
Information extraction, especially “open” IE
Question answering, chatbot (e.g., Watson, Google Now)
Machine translation
Summarization
Opinion and sentiment analysis
Social media analysis
Fake News Recognition
NLP Careers: So hot!
Industry
Government
Academia
What is NLP?
Natural language processing (NLP) is a subfield of artificial intelligence and computational linguistics. It studies the problems of automated generation and understanding of natural human languages.
Natural-language-generation systems convert information from computer databases into normal-sounding human language. Natural-language-understanding systems convert samples of human language into more formal representations that are easier for computer programs to manipulate.
What is Natural Language Processing?
Computers using natural language as input and/or output
[Diagram: language → Computer → language; Understanding (NLU) on the input side, Generation (NLG) on the output side]
Natural language processing and computational linguistics
Natural language processing (NLP) develops methods for solving practical problems involving language:
Automatic speech recognition
Machine Translation
Sentiment Analysis
Information extraction from documents
Computational linguistics (CL) focuses on using technology to support/implement linguistics:
how do we understand language?
how do we produce language?
how do we learn language?
Level Of Linguistic Knowledge
Phonetics and phonology
Phonetics studies the sounds of a language
Phonology studies the distributional properties of these sounds
Morphology
Morphology studies the structure of words
Morphological derivation exhibits hierarchical structure
Example: re+vital+ize+ation
The suffix usually determines the syntactic category of the derived word
Syntax
Syntax studies the ways words combine to form phrases and sentences
Syntactic parsing helps identify who did what to whom, a key step in understanding a sentence
Semantics and pragmatics
Semantics studies the meaning of words, phrases and sentences
Ex: I have dinner in/for an hour
Pragmatics studies how we use language to do things in the world
Ex: The duck ran up to Mary and licked her leg.
The lexicon
A language has a lexicon, which lists for each morpheme
how it is pronounced (phonology),
its distributional properties (morphology and syntax),
what it means (semantics), and
its discourse properties (pragmatics)
The lexicon interacts with all levels of linguistic representation
What’s driving NLP and CL research?
Tools for managing the “information explosion”
extracting information from and managing large text document collections
NLP often ships as free tools integrated with main products to sell more ads;
Ex: speech recognition, machine translation, document clustering (news), etc.
Mobile and portable computing
keyword search / document retrieval don’t work well on very small devices
we want to be able to talk to our computers (speech recognition) and have them say something intelligent back (NL generation)
Factors Changing NLP Landscape
Increases in computing power
The rise of the web, then the social web
Advances in machine learning
Advances in understanding of language in social context
Natural Language Processing
Applications
Machine Translation
Information Retrieval
Question Answering
Dialogue Systems
Information Extraction
Summarization
Sentiment Analysis
…
Core Technologies (NLP sub-problems)
Language modeling
Part-of-speech tagging
Syntactic parsing
Named-entity recognition
Word sense disambiguation
Semantic role labeling
…
NLP lies at the intersection of computational linguistics and machine learning.
Why is NLP difficult?
Ambiguity
Sparsity
Abstractly, most NLP applications can be viewed as prediction problems
Should be able to solve them with Machine Learning
The label set is often the set of all possible sentences
infinite (or at least astronomically large)
Training data for supervised learning is often not available
Unsupervised/semi-supervised techniques for training from available data
Algorithmic challenges
vocabulary can be large (e.g., 50K words)
data sets are often large (GB or TB)
Ambiguity ???
“At last, a computer that understands you like your mother”
“Ông già đi nhanh quá” (Vietnamese; can be read as “The old man walks very fast” or as “He has aged very quickly”)
Ambiguity
“At last, a computer that understands you like your mother”
It understands you as well as your mother understands you
It understands (that) you like your mother
It understands you as well as it understands your mother
Ambiguity at Many Levels
At the acoustic level (speech recognition):
“… a computer that understands you like your mother”
“… a computer that understands you lie cured mother”
Ambiguity at Many Levels
At the syntactic level:
Different structures lead to different interpretations
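To make the point about competing structures concrete, here is a minimal sketch of a CKY chart that counts how many parse trees a sentence has. The grammar, lexicon and example sentence are toy assumptions chosen for illustration (a classic prepositional-phrase attachment case), not material from the course.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (illustrative assumption).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # the PP modifies the verb phrase
    ("NP", "NP", "PP"),   # the PP modifies the noun phrase
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]
# Toy lexicon: word -> possible categories (illustrative assumption).
LEXICON = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"], "a": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}

def count_parses(words, root="S"):
    """Count the distinct parse trees of `words` rooted in `root`."""
    n = len(words)
    # chart[i, j, A] = number of ways category A can span words[i:j]
    chart = defaultdict(int)
    for i, word in enumerate(words):
        for category in LEXICON.get(word, []):
            chart[i, i + 1, category] += 1
    # Fill longer spans from shorter ones, trying every split point k.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for parent, left, right in BINARY_RULES:
                    chart[i, j, parent] += chart[i, k, left] * chart[k, j, right]
    return chart[0, n, root]

print(count_parses("I saw the man".split()))                    # 1
print(count_parses("I saw the man with a telescope".split()))   # 2
```

The two parses of the longer sentence correspond to the two readings: the telescope is either the instrument of seeing (VP attachment) or something the man has (NP attachment).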
More Syntactic Ambiguity
Ambiguity at Many Levels
At the semantic (meaning) level:
Two definitions of “bank”
an organization where people and businesses can invest or borrow money, change it to foreign money, etc., or a building where these services are offered
sloping raised land, especially along the sides of a river
This is an instance of word sense ambiguity
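One simple way to pick between the two senses of “bank” is a simplified Lesk heuristic: choose the sense whose definition shares the most content words with the sentence. The sketch below uses the two definitions given above; the sense labels, the stopword list and the test sentences are assumptions made for illustration, and real systems use much richer context.

```python
import re

# Small hand-picked stopword list (illustrative assumption).
STOPWORDS = {"a", "an", "the", "of", "in", "on", "to", "it", "or", "and",
             "can", "where", "these", "are", "etc"}

# The two definitions of "bank"; the sense labels are hypothetical names.
SENSES = {
    "financial institution": (
        "an organization where people and businesses can invest or borrow "
        "money, change it to foreign money, etc., or a building where these "
        "services are offered"
    ),
    "river bank": "sloping raised land, especially along the sides of a river",
}

def content_words(text):
    """Lowercase the text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def simplified_lesk(sentence, target="bank", senses=SENSES):
    """Return (best sense, overlap scores) for `target` in `sentence`."""
    context = content_words(sentence) - {target}
    scores = {name: len(context & content_words(gloss))
              for name, gloss in senses.items()}
    # On a tie, max() keeps the first sense listed in `senses`.
    return max(scores, key=scores.get), scores

print(simplified_lesk("They put money in the bank")[0])
# financial institution
print(simplified_lesk("We walked along the river to the bank")[0])
# river bank
```

The heuristic is brittle: a sentence that shares no words with either definition falls back to the first sense, which is one reason statistical models trained on sense-annotated data are preferred.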
More Word Sense Ambiguity
At the semantic (meaning) level:
They put money in the bank
I saw her duck with a telescope
Dealing with Ambiguity
How can we model ambiguity?
Non-probabilistic methods (CKY parsers for syntax) return all possible analyses
Probabilistic models (HMMs for POS tagging, PCFGs for syntax) and algorithms (Viterbi, probabilistic CKY) return the best possible analysis, i.e., the most probable one.
But the “best” analysis is only good if our probabilities are accurate. Where do they come from?
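As a rough illustration of how a probabilistic model returns a single most probable analysis, the following sketch runs the Viterbi algorithm on a tiny HMM part-of-speech tagger. The tag set and every probability in the tables are invented for this example; in practice they would be estimated from an annotated corpus.

```python
# Toy HMM parameters: all numbers are made-up assumptions for illustration.
TAGS = ["PRON", "VERB", "NOUN"]
START_P = {"PRON": 0.6, "VERB": 0.1, "NOUN": 0.3}
TRANS_P = {  # TRANS_P[previous_tag][tag]
    "PRON": {"PRON": 0.1, "VERB": 0.7, "NOUN": 0.2},
    "VERB": {"PRON": 0.4, "VERB": 0.1, "NOUN": 0.5},
    "NOUN": {"PRON": 0.1, "VERB": 0.5, "NOUN": 0.4},
}
EMIT_P = {  # EMIT_P[tag][word]; missing words get probability 0
    "PRON": {"they": 0.5, "her": 0.5},
    "VERB": {"saw": 0.6, "duck": 0.4},
    "NOUN": {"saw": 0.2, "duck": 0.8},
}

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return (most probable tag sequence, its joint probability)."""
    # best[i][t]: probability of the best tagging of words[:i+1] ending in t
    best = [{t: start_p[t] * emit_p[t].get(words[0], 0.0) for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] * trans_p[p][t])
            best[i][t] = (best[i - 1][prev] * trans_p[prev][t]
                          * emit_p[t].get(words[i], 0.0))
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    path.reverse()
    return path, best[-1][last]

SENTENCE = "they saw her duck".split()
best_tags, best_prob = viterbi(SENTENCE, TAGS, START_P, TRANS_P, EMIT_P)
print(best_tags)   # ['PRON', 'VERB', 'PRON', 'VERB']
```

With these invented numbers “duck” comes out as a verb, because the PRON-to-VERB transition outweighs the higher NOUN emission probability. Changing the tables changes the answer, which is exactly why the quality of the probabilities matters.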
Corpora
A corpus is a collection of text
Often annotated in some way
Sometimes just lots of text
Examples
Penn Treebank: 1M words of parsed WSJ
Canadian Hansards: 10M+ words of French/English sentences
Yelp reviews
VLSP Corpus (Vietnamese)
Statistical NLP
Like most other parts of AI, NLP is dominated by statistical methods
Typically more robust than rule-based methods
Relevant statistics/probabilities are learned from data
Normally requires lots of data about any particular phenomenon
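To show what “learned from data” can mean in the simplest case, here is a sketch that estimates bigram probabilities P(word | previous word) by counting in a corpus. The two-sentence corpus is a toy assumption; the optional `k` parameter applies add-k smoothing, one common way to avoid assigning zero probability to word pairs that never occurred.

```python
from collections import Counter

def train_bigram_counts(sentences):
    """Count contexts and bigrams, with <s> and </s> as sentence boundaries."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        context_counts.update(tokens[:-1])             # every token that has a successor
        bigram_counts.update(zip(tokens, tokens[1:]))  # (previous, word) pairs
        vocab.update(tokens[1:])                       # everything that can be predicted
    return context_counts, bigram_counts, vocab

def bigram_prob(prev, word, model, k=0.0):
    """P(word | prev); k=0 is the relative-frequency estimate, k>0 is add-k smoothing."""
    context_counts, bigram_counts, vocab = model
    numerator = bigram_counts[prev, word] + k
    denominator = context_counts[prev] + k * len(vocab)
    return numerator / denominator if denominator else 0.0

# Toy corpus (illustrative assumption).
corpus = ["They put money in the bank", "They sat on the bank"]
model = train_bigram_counts(corpus)

print(bigram_prob("the", "bank", model))          # 1.0
print(bigram_prob("the", "money", model))         # 0.0 (never seen after "the")
print(bigram_prob("the", "money", model, k=1.0))  # 1/11, about 0.09
```

The unsmoothed estimate says “money” can never follow “the”, which is clearly an artifact of the tiny corpus. More data or better smoothing is needed before such numbers can be trusted.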
Sparsity
Order words by frequency. What is the frequency of the nth-ranked word?
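A common empirical answer is Zipf's law: the frequency of the nth-ranked word is roughly proportional to 1/n, so a handful of words are very frequent and most words are rare. The sketch below builds the rank-frequency table for a toy sentence (an assumption for illustration; the pattern only becomes clear on a large corpus).

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, count) triples, most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending count, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]

# Toy text (illustrative assumption).
text = "the cat sat on the mat and the dog sat on the log"
table = rank_frequency(text.split())

for rank, word, count in table[:3]:
    print(rank, word, count)
# 1 the 4
# 2 on 2
# 3 sat 2

# Words that occur exactly once make up most of the vocabulary.
hapaxes = [word for _, word, count in table if count == 1]
print(len(hapaxes), "of", len(table), "word types occur only once")
# 5 of 8 word types occur only once
```

Running the same function on a real corpus and plotting count against rank on log-log axes typically gives a roughly straight line.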
Sparsity
Regardless of how large our corpus is, there will be a lot of infrequent words
This means we need to find clever ways to estimate probabilities for things we have rarely or never seen
Fields with Connections to NLP
Today’s Applications
Conversational agents
Information extraction and question answering
Machine translation
Summarization
Opinion and sentiment analysis
Social media analysis
Visual understanding
Essay evaluation
Mining legal, medical, or scholarly literature
…
What is this course?
Linguistic Issues
What is the range of language phenomena?
What are the knowledge sources that let us disambiguate?
What representations are appropriate?
How do you know what to model and what not to model?
Statistical Modeling Methods (almost Machine Learning)
Increasingly complex model structures
Learning and parameter estimation
Efficient inference: dynamic programming, search
Deep neural networks for NLP: LSTM, CNN, Seq2seq, Transformer
Outline of Topics
Words and Sequences
Text classification
Probabilistic language models
Vector semantics and word embeddings
Sequence labeling: POS tagging, NER
HMM
Parsers
Semantics
Applications
Machine translation, Question Answering, Dialog Systems
Goals of this Course
Learn about the problems and possibilities of natural language analysis:
What are the major issues?
What are the major solutions?
At the end you should:
Agree that language is difficult, interesting and important
Be able to assess language problems
Know which solutions to apply when, and how
Feel some ownership over the algorithms
Be able to use software to tackle some NLP language tasks
Know language resources
Be able to read papers in the field
Journals and Conferences in NLP
http://anthology.aclweb.org/
Conclusion
Computational linguistics and natural language processing:
were originally inspired by linguistics,
but now they are almost applications of machine learning and statistics
We solve these problems using standard methods from machine learning:
Define a probabilistic model over the relevant variables
Factor the model into small components that we can learn
Ex: HMMs, SVMs, CRFs and PCFGs
End2end: Deep Learning
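As one worked instance of “define a model, then factor it”, consider an HMM for part-of-speech tagging. The notation below is an assumption introduced for illustration: w_i is the i-th word, t_i its tag, and t_0 a start symbol. The joint probability of a sentence and its tags factors into small transition and emission terms, and tagging means finding the tag sequence that maximizes it.

```latex
\[
P(w_1,\dots,w_n,\; t_1,\dots,t_n)
  \;=\; \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i),
  \qquad t_0 = \langle s \rangle
\]
\[
\hat{t}_{1:n} \;=\; \operatorname*{arg\,max}_{t_1,\dots,t_n}\;
  P(w_1,\dots,w_n,\; t_1,\dots,t_n)
\]
```

Each factor P(t_i | t_{i-1}) or P(w_i | t_i) is a small table that can be estimated from counts in a tagged corpus, which is what makes the model learnable.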
References
Slides from NLP courses at CMU and the University of Toronto
Some NLP tutorials