NLP and the Web – WS 2024/2025
Lecture 1
Introduction
Dr. Thomas Arnold
Hovhannes Tamoyan
Kexin Wang
Ubiquitous Knowledge Processing Lab
Technische Universität Darmstadt
Introduction: Teaching Staff
▪ Dr. Thomas Arnold: Lectures
▪ Hovhannes Tamoyan: Practice Class
▪ Kexin Wang: Practice Class
WS24/25 | Computer Science Department | UKP - Dr. Thomas Arnold 2
Outline
UKP Lab: profile and projects
Administrative course issues
NLP 4 Web Introduction
NLP Basics / Linguistic Analysis
Who Are We?
▪ 1 Professor, ~5 Postdocs, ~35 Doctoral Researchers
▪ We mainly work in natural language processing (NLP)
▪ Research areas (growing every day!)
• Deep Learning for NLP
• Argument Mining
• Content Analytics for the Social Good
• Knowledge Graphs
• Interactive AI and NLP
• Writing Assistance and Language Learning
Teaching Concept – UKP (Lectures)
▪ Introductory: Information Management
▪ Application Oriented: NLP and the Web (Winter Term), Ethics in NLP (Summer Term)
▪ Advanced: Deep Learning for NLP
Teaching Concept – UKP (Seminars & Projects)
Data Analysis Software Project for Natural Language
Software Project (irregular schedule)
Winter 2023/24: Various Projects
Winter 2024/25: Various Projects
Regular Seminar Text Analytics / Large Language Models
Winter 2023/24: Generative AI
Summer 2024: LLMs for Mental Health
Winter 2024/25: Understanding LLMs
Complementary Lectures and Seminars
▪ Machine Learning
▪ Einführung in die künstliche Intelligenz (Kersting)
▪ Data Mining und maschinelles Lernen (Kersting)
▪ Deep Learning (Kersting)
▪ Computer Vision
▪ Computer Vision 1 and 2 (Roth)
▪ Natural Language Processing
▪ Deep Learning for NLP
▪ Ethics in NLP
Teaching Concept – UKP (PhD)
▪ Get involved early (HiWi, B.Sc. thesis, M.Sc. thesis)
More information
• Website:
[Link]
• GitHub:
[Link]/UKPLab
• Social Media:
@UKPLab
Course Goals
▪ Learn the basic principles underlying NLP systems
▪ Two big NLP topics:
▪ Information Retrieval (IR)
▪ Large Language Model (LLM) Applications
▪ Gain insight into open research problems in natural language
processing
Why Care?
Information Overload
Business Intelligence
Need for Robust, Intelligent Systems
Textbook
Constantly updated:
▪ Speech and Language Processing. An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition. Daniel
Jurafsky and James H. Martin. 3rd edition, 2023 (draft).
▪ [Link]
General Information
▪ All lectures and practice classes will be in person
Lectures: Tuesdays 13:30 – 15:10, S306 / 051
Practice Class: Thursdays 16:15 – 17:55, S103 / 221
▪ All slides, handouts, readings etc. can be found on the
Moodle e-Learning platform
▪ We also use Moodle as a central point for announcements and questions
▪ Please use the Moodle forum!
General Information – Practice Class
▪ In the practice classes, you will work on programming exercises
▪ Programming language is Python
▪ First practice session will include a brief introduction to Python
▪ This will give you some practical experience in NLP
▪ Practice class topics are relevant for the exam! (including Python)
▪ In addition, there are homework assignments for an exam bonus:
▪ Assignments will be bi-weekly – 6 exercises in total
▪ Each assignment is worth a maximum of 20 points
▪ If you get >= 75% of the points (>= 90 points), you get a bonus
▪ The bonus improves your grade by 0.3/0.4, but only if you pass the exam without the bonus
General Information – Practice Class
▪ First class: October 24th (no practice class this week)
▪ Details will be announced on Moodle
▪ If you need additional help regarding the practice class, use the Moodle forum
The assignments will require a significant amount of time, so start earlier
than the day before submission.
Final exam
Tuesday, 25.02.2025, 15:00
More info will be announced on Moodle
▪ Allowed: Non-programmable calculator, no other material
▪ Content: lecture, readings, practice class
Syllabus (tentative)
Nr. Lecture
01 Introduction / NLP basics
02 Foundations of Text Classification
03 IR – Introduction, Evaluation
04 IR – Word Representation, Data Collection
05 IR – Re-Ranking Methods
06 IR – Language Domain Shifts, Dense / Sparse Retrieval
07 LLM – Language Modeling Foundations
08 LLM – Neural LLM, Tokenization
09 LLM – Transformers, Self-Attention
10 LLM – Adaptation, LoRA, Prompting
11 LLM – Alignment, Instruction Tuning
12 LLM – Long Contexts, RAG
13 LLM – Scaling, Computation Cost
14 Review & Preparation for the Exam
Warm up
Now it is your turn:
Which degree programme are you studying?
▪ Computer Science?
▪ Bachelor?
▪ Master?
▪ Other disciplines?
Warm up
Now it is your turn:
Which other UKP courses did you already attend?
▪ FoLT
▪ Ethics in Natural Language Processing
▪ Deep Learning for NLP
▪ Data Analysis Software Project
▪ Text Analytics / LLM Seminar
NLP in the Web – Search Engines
NLP in the Web – Spelling Correction
Question Answering
NLP in the Web – Machine Translation
NLP in the Web – Speech Recognition
NLP in the Web – Plagiarism Detection
[Link]
NLP in the Web – Summarization
NLP in the Web – Diachronic Analysis
NLP in the Web – Text Generators
Natural Language Processing and the Web
▪ The web is an application area for NLP, e.g.:
▪ Information retrieval:
• Search engines
• Question answering
• News aggregation
• Recommender Systems
• Chatbots…
▪ The web is a resource to improve the quality of NLP, e.g.:
▪ Web as a corpus
▪ Analyzing web-based knowledge repositories
• Wikipedia
• Wiktionary
▪ Recognizing synonyms, paraphrases and the like
Challenges for NLP
• How to remove noise, e.g. duplicates?
• How to assess the quality of content?
• How to integrate heterogeneous and scattered content?
• How to deal with errors, e.g. spelling or grammar errors?
• How to „clean“ the data?
Data Cleansing is Necessary
▪ User-generated content contains errors, smileys, abbreviations, etc.
Hi
Micheal,
have u seen my
posting,last week u said that u
will look in to my problem thsi [Link] i ask u
now?
Data import Data cleansing
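A cleansing step for such a posting could be sketched as follows. This is only an illustration: the replacement table `SLANG` and the function name `clean` are assumptions made for this example, not part of the lecture material, and a real system would need far more careful rules.

```python
import re

# Hypothetical replacement table, chosen only for this illustration.
SLANG = {"u": "you", "thsi": "this"}


def clean(text):
    """Apply a few simple normalization rules to user-generated text."""
    # Collapse line breaks and repeated spaces into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Insert a missing space after , ? ! when a letter follows directly.
    text = re.sub(r"([,?!])(?=[A-Za-z])", r"\1 ", text)
    # Replace known slang or misspelled words, token by token.
    words = [SLANG.get(word.lower(), word) for word in text.split(" ")]
    return " ".join(words)


print(clean("have u seen my\nposting,last week u said"))
```

The rules are applied in a fixed order, so the missing space after the comma is restored before the word-level replacements run.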
Analysis Levels in Language Understanding
Phonetics and Phonology
Segmentation
Morphology
Syntax
Semantics
Pragmatics and Discourse
Phonetics and Phonology
Homophones: “night” and “knight” are both pronounced /naɪt/
(c) David Groome, 2006
Segmentation
(c) David Groome, 2006
Tokenization
▪ Segmenting an input stream into an ordered sequence of units is called
tokenization.
▪ A token can correspond to an inflected word form or sub-word units,
and may be subject to a subsequent morphological analysis.
▪ Tokens include punctuation!
▪ A system which splits texts into tokens is called a tokenizer
A very simple example:
▪ Input text:
John likes Mary and Mary likes John.
▪ Tokens:
{"John", "likes", "Mary", "and", "Mary", "likes", "John", "."}
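The simple example above can be reproduced with a minimal regular-expression tokenizer. This is a sketch under the assumption that every run of word characters is one token and every punctuation mark is its own token; the function name `tokenize` is chosen for illustration.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


print(tokenize("John likes Mary and Mary likes John."))
```

The result is an ordered list rather than a set, because the order of tokens and repeated tokens both matter.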
Tokenization
English Example
▪ Mr. Sherwood said, reaction to Sea Containers’ proposal has been “very
positive.” In New York Stock Exchange composite trading yesterday, Sea
Containers closed at $62.625, up 62.5 cents.
Where could a tokenizer run into problems?
▪ Split at whitespace characters? Punctuation then stays attached to the words:
cents. said, positive.” $62.625,
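Splitting at whitespace can be tried directly on the example sentence. The sketch below uses plain ASCII quotes in the text and lists every resulting token that is not purely alphanumeric; the variable names are chosen for illustration.

```python
text = (
    "Mr. Sherwood said, reaction to Sea Containers' proposal has been "
    '"very positive." In New York Stock Exchange composite trading '
    "yesterday, Sea Containers closed at $62.625, up 62.5 cents."
)

# str.split() without arguments splits at any run of whitespace.
tokens = text.split()

# Tokens that still carry punctuation, symbols, or decimal points.
suspicious = [token for token in tokens if not token.isalnum()]
print(suspicious)
```

Some of these tokens need further splitting (`said,`), while others such as `62.5` should stay intact, which is why whitespace alone is not a sufficient criterion.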
Tokenization Ambiguities
Period
▪ In most cases: sentence-final punctuation mark
▪ Part of an abbreviation, e.g. F.D.P.
▪ Numbers: ordinal numbers, e.g. 21.; numbers with fractions, e.g. 1.543
▪ References to resource locators (URLs), e.g.: [Link]
▪ To complicate things, if a sentence ends with an abbreviation which
ends with a period, only one period is written. “I go to Apple, Inc.”
▪…
Whitespace character
▪ Part of numbers, e.g. “1 543”
▪ No segmentation character in multi-word expressions
▪ “New York”
Ambiguities
Comma
▪ Part of numbers, e.g. 1,543
Single quote
▪ Within tokens to mark contractions and elisions, e.g. English: don’t,
won’t, you’ve, James’ new hat; German: Ich hab’s!
▪ Part of a token in French, e.g. aujourd’hui
▪ But in most cases: Enclosing quoted groups of words
Dash
▪ A delimiter, if it connects strings of digits, e.g. “see pages 100-101”
▪ In French: Signal a close connection between two tokens, e.g. verb and
personal pronoun: donne-le
▪ In most cases, however, it is part of the token, e.g. multi-word
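Some of the period, comma, quote, and dash cases listed above can be handled with ordered regular-expression alternatives. The following is a minimal sketch, not a complete tokenizer: the pattern and the name `tokenize_with_rules` are assumptions for this illustration, and cases such as a sentence ending in “Inc.” or multi-word expressions like “New York” remain unresolved.

```python
import re

# Alternatives are tried left to right, so more specific patterns come first.
TOKEN_RE = re.compile(
    r"""
    (?:[A-Za-z]\.){2,}        # abbreviations such as F.D.P.
  | \$?\d+(?:[.,]\d+)*        # numbers such as 1,543 or 62.5, optional $
  | \w+(?:['’]\w+)*           # words, including contractions like don't
  | [^\w\s]                   # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize_with_rules(text):
    """Tokenize text while keeping abbreviations, numbers, and contractions whole."""
    return TOKEN_RE.findall(text)


print(tokenize_with_rules("The F.D.P. won't pay 1,543 euros."))
print(tokenize_with_rules("see pages 100-101"))
```

Because a dash between digits matches none of the first three alternatives, it falls through to the last one and is emitted as a separate delimiter token.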
Tokenization in Other Languages
Chinese
爱国人
▪ No spaces
▪ Two possible segmentations, both syntactically and semantically correct
▪ Disambiguation can only be done with contextual information
爱国 / 人
country-loving person
爱 / 国人
love country-person
Bird et al., NLP with Python, p.113
German Compounds
German
STAUBECKEN
▪ No spaces within noun compounds
▪ Two possible segmentations, both syntactically and semantically correct
▪ Disambiguation can only be done with contextual information
STAU BECKEN
water reservoir
STAUB ECKEN
dusty corners
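That a string without spaces can have several valid segmentations can be made concrete by enumerating all splits into known words. The sketch below assumes a tiny hand-made lexicon; the lexicon entries and the function name `segmentations` are illustrative assumptions only.

```python
def segmentations(word, lexicon):
    """Return every way to split `word` into a sequence of lexicon entries."""
    if not word:
        return [[]]  # one way to segment the empty string: no parts
    results = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in lexicon:
            # Combine this prefix with every segmentation of the remainder.
            for rest in segmentations(word[end:], lexicon):
                results.append([prefix] + rest)
    return results


GERMAN = {"stau", "staub", "becken", "ecken"}
CHINESE = {"爱", "爱国", "国人", "人"}

print(segmentations("staubecken", GERMAN))
print(segmentations("爱国人", CHINESE))
```

The enumeration only lists the candidates; choosing between them still requires contextual information, as stated above.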
Morphology
• Morphology is the branch of linguistics that studies word forms and word
formation
• Words are composed of morphemes
• Morphemes are the smallest meaning-bearing units
(c) David Groome, 2006
Morphology
Words can be further decomposed into smaller units:
“pneumonoultramicroscopicsilicovolcanoconiosis”
lung disease caused by the inhalation of very fine
silica dust found in volcanoes
Bases and Affixes
• Remember: Morphemes are the smallest meaning-bearing units
• Examples:
▪ cats → cat (noun) + s (plural)
▪ unknowingly → un + know + ing + ly
▪ bedenken → be + denk + en
▪ Both cat and cats can be uttered in isolation but s cannot:
-s is a bound morpheme
▪ Minimal free morphemes = stems
▪ cat is a free morpheme
▪ Stems carry the main meaning of the word
▪ Affixes are bound morphemes
Types of Affixes
Suffixes: appear after the base
▪ cat + s, nice + ly
Prefixes: appear before the base
▪ un + true
Infixes: appear inside the base
▪ fan + bloody + tastic
Circumfixes: appear on both sides of the base
▪ ge + sag + t
Morphological Normalization
▪ Morphological normalization consists in identifying a single
canonical representative for morphologically related word forms
Methods
▪ Stemming
▪ Lemmatization
Stemming
Stemming is an algorithmic approach to strip off the endings of words
sitting → sitt
anarchism, anarchy, anarchistic → anarchi
Objective: group words belonging to the same morphological family by
transforming them into the same stemmed representation
▪ stemming does not distinguish between inflection and derivation
▪ the stems obtained do not necessarily correspond to a real word form
Well-known stemming algorithms for English have been developed by
Lovins and Porter
Algorithmic Stemming Method
Stemming is rule-based. Example rules from Porter:
*ATIONAL -> *ATE (relational -> relate)
*[> 0 vowels] + ING -> * (monitoring -> monitor)
*SSES -> *SS (grasses -> grass)
Rule-based stemming methods are hard to create, often yield arbitrary
distinctions, but can be executed very quickly at runtime.
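The three example rules can be written as a few lines of code. This is a toy sketch that implements only these three rules in a simplified form; it is not the full Porter algorithm, and the name `toy_stem` is chosen for illustration.

```python
import re

VOWEL = re.compile(r"[aeiou]")


def toy_stem(word):
    """Apply three simplified Porter-style suffix rules to a word."""
    w = word.lower()
    # *ATIONAL -> *ATE
    if w.endswith("ational"):
        return w[: -len("ational")] + "ate"
    # *SSES -> *SS
    if w.endswith("sses"):
        return w[:-2]
    # *ING -> * only if the remaining part contains at least one vowel
    if w.endswith("ing") and VOWEL.search(w[:-3]):
        return w[:-3]
    return w


for example in ["relational", "monitoring", "grasses", "sing"]:
    print(example, "->", toy_stem(example))
```

The vowel condition keeps short words such as “sing” intact, because the part before “ing” contains no vowel.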
Porter's Stemmer
Original Word Stemmed Word
vision vision
visible visibl
visibility visibl
visionary visionari
visioner vision
visual visual
Stemming Errors
Under-stemming: remove too little
▪ adhere → adher
▪ adhesion → adhes
Over-stemming: remove too much
▪ appendicitis → append
▪ append → append
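Both error types become visible when words are grouped by their stem. The sketch below does not run a stemmer; it takes the word-to-stem pairs from the examples above as given data, and the names `STEMS` and `group_by_stem` are chosen for illustration.

```python
from collections import defaultdict

# Word-to-stem pairs taken from the examples above.
STEMS = {
    "adhere": "adher",
    "adhesion": "adhes",
    "appendicitis": "append",
    "append": "append",
}


def group_by_stem(stems):
    """Group words that were mapped to the same stem."""
    groups = defaultdict(list)
    for word, stem in stems.items():
        groups[stem].append(word)
    return dict(groups)


print(group_by_stem(STEMS))
```

Related words that end up in different groups (adhere, adhesion) indicate under-stemming, while unrelated words in the same group (appendicitis, append) indicate over-stemming.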
Problem with Stemming: Syntactic Ambiguity
Homographs: words which have the same spelling but different meanings
I saw the saw
▪ First “saw”: past form of the verb SEE
▪ Second “saw”: singular form of the noun SAW
Such cases cannot be handled properly by stemming alone; the word's
grammatical category also has to be identified
Lemmatization
▪ “undo” the inflectional changes of a base form
▪ Usually needs lexical resources and part-of-speech tagging
▪ cats (NOUN) → cat
▪ left (VERB) → leave
▪ left (ADJ) → left
▪ Has to deal with irregularities
▪ sing, sang, sung → sing
▪ indices → index
▪ Bäume → Baum
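Why lemmatization needs both a lexical resource and the part of speech can be shown with a lookup keyed by word and tag. This is a minimal sketch: the table `IRREGULAR`, the fallback plural rule, and the function name `lemmatize` are assumptions for this illustration, not a real lemmatizer.

```python
# Hypothetical mini-lexicon of irregular forms, keyed by (word form, part of speech).
IRREGULAR = {
    ("left", "VERB"): "leave",
    ("sang", "VERB"): "sing",
    ("sung", "VERB"): "sing",
    ("indices", "NOUN"): "index",
}


def lemmatize(word, pos):
    """Return the lemma of `word`, given its part-of-speech tag."""
    key = (word.lower(), pos)
    if key in IRREGULAR:
        return IRREGULAR[key]
    # Very rough fallback rule: strip a plural -s from longer nouns.
    if pos == "NOUN" and word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


print(lemmatize("left", "VERB"), lemmatize("left", "ADJ"), lemmatize("cats", "NOUN"))
```

The same surface form “left” yields different lemmas depending on the tag, which a stemmer working on the string alone cannot do.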
Stemming vs. Lemmatization
Original Stemmed Lemmatized
visibilities visibl visibility
adhere adher adhere
adhesion adhes adhesion
appendicitis append appendicitis
oxen oxen ox
indices indic index
swum swum swim
Syntax
▪ Syntax refers to the way words are arranged together
▪ "Syntax is the study of the regularities and constraints of
word order and phrase structure"
(Manning & Schütze, 2003, p. 93)
▪ There is an infinite number of ways in which words can be
arranged together to form sentences
▪ Yet, we can understand sentences we have never heard or
read before
POS Tagging
▪ The process of assigning a part of speech or lexical class marker to
each word in a corpus
▪ The input to a tagging algorithm is a sequence of words and a tagset, and
the output is a sequence of tags, a single best tag for each word
Determiner Noun Verb Pronoun Adjective
(c) David Groome, 2006
Parts of Speech
▪ In English we traditionally have 8 parts of speech
▪ N Noun: chair, bandwidth, pacing
▪ V Verb: study, debate, munch
▪ ADJ Adjective: purple, tall, ridiculous
▪ ADV Adverb: unfortunately, slowly
▪ P Preposition: of, by, to
▪ PRO Pronoun: I, me, mine
▪ DET Determiner: the, a, that, those
▪ INTJ Interjection: oh!, m-hm, huh?
Penn Treebank Tagset
1. CC Coord. conjunc. | 25. TO to
2. CD Cardinal number | 26. UH Interjection
3. DT Determiner | 27. VB V, base form
4. EX Existential there | 28. VBD V, past tense
5. FW Foreign word | 29. VBG V, gerund/pres. part.
6. IN Prep./subord. conj. | 30. VBN V, past part.
7. JJ Adject. | 31. VBP V, non-3rd ps. sing. pres.
8. JJR Adject., comp. | 32. VBZ V, 3rd ps. sing. pres.
9. JJS Adject., superl. | 33. WDT wh-det.
10. LS List item marker | 34. WP wh-pronoun
11. MD Modal | 35. WP$ Poss. wh-pronoun
12. NN Noun, sing. or mass | 36. WRB wh-adverb
13. NNS Noun, plural | 37. # Pound sign
14. NNP Proper noun, sing. | 38. $ Dollar sign
15. NNPS Proper noun, plural | 39. . Sent.-final punct.
16. PDT Predeterminer | 40. , Comma
17. POS Possessive ending | 41. : Colon, semi-colon
18. PRP Personal pronoun | 42. ( L. bracket char.
19. PRP$ Poss. pronoun | 43. ) R. bracket char.
20. RB Adverb | 44. " Straight dbl. quote
21. RBR Adverb, comp. | 45. ‘ L. open sngl. quote
22. RBS Adverb, superl. | 46. “ L. open dbl. quote
23. RP Particle | 47. ’ R. close sngl. quote
24. SYM Symbol | 48. ” R. close dbl. quote
Tagset size by language (Hajič, 2000):
Language | Tagset Size
English | 139
Czech | 970
Estonian | 476
Hungarian | 401
Romanian | 486
Slovene | 1033
An Example
WORD LEMMA TAG
the the +DET
host host +NOUN
kissed kiss +VPAST
the the +DET
friend friend +NOUN
on on +PREP
the the +DET
cheek cheek +NOUN
Ambiguities
▪ POS Tagging is a disambiguation task
▪ Words are ambiguous: they can have more than one possible part of speech
▪ The word “book”:
▪ book that flight: verb
▪ hand me that book: noun
▪ The word “that”:
▪ Does that flight serve dinner? : determiner
▪ I thought that your flight was earlier: complementizer
▪ POS Tagging: resolves ambiguities, choosing the proper tag for the context
▪ Baseline: Most Frequent Class (accuracy 92.34% [Jurafsky & Martin])
▪ Outdated: Rule-based tagging, probabilistic tagging
▪ State of the art: Neural approaches, accuracy ~ 98%
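The most-frequent-class baseline mentioned above can be sketched in a few lines. The toy training corpus `TOY_CORPUS`, the tag names, and the default tag for unknown words are assumptions made for this illustration; the accuracy figure quoted above refers to real corpora, not to this toy data.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_words(words, model, default="NOUN"):
    """Tag each word with its most frequent tag; unknown words get `default`."""
    return [(word, model.get(word.lower(), default)) for word in words]


# Hypothetical toy corpus for illustration only.
TOY_CORPUS = [
    [("book", "VERB"), ("that", "DET"), ("flight", "NOUN")],
    [("hand", "VERB"), ("me", "PRON"), ("that", "DET"), ("book", "NOUN")],
    [("I", "PRON"), ("read", "VERB"), ("the", "DET"), ("book", "NOUN")],
]

model = train_baseline(TOY_CORPUS)
print(tag_words(["book", "that", "flight"], model))
```

In this toy data “book” is a noun twice and a verb once, so the baseline tags it as a noun even in “book that flight”. The baseline ignores context, which is exactly what contextual taggers improve on.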
Parsing
▪ The process of determining the grammatical structure with respect to a
given grammar.
(c) David Groome, 2006
Alternative representations
▪ Bracketed notation:
[S [NP [Det the] [N dog] ] [VP [V ate] [NP [Det a] [N cookie] ] ] ]
▪ Parenthesized notation:
(S
  (NP
    (Det the)
    (N dog))
  (VP
    (V ate)
    (NP
      (Det a)
      (N cookie))))
▪ Parse tree (figure)
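The parenthesized notation is easy to read back into a tree structure. The following is a minimal sketch of a recursive reader for this notation; the function names `parse_tree` and `leaves` and the tuple representation are assumptions for this illustration, and malformed input is not handled.

```python
import re


def parse_tree(text):
    """Read a parenthesized tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                 # skip "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1                 # skip ")"
            return (label, children)
        word = tokens[pos]           # a leaf, i.e. a word of the sentence
        pos += 1
        return word

    return read()


def leaves(node):
    """Collect the words of the sentence from left to right."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [word for child in children for word in leaves(child)]


tree = parse_tree("(S (NP (Det the) (N dog)) (VP (V ate) (NP (Det a) (N cookie))))")
print(tree[0], leaves(tree))
```

Reading off the leaves recovers the original sentence, while the nesting records which words form a phrase.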
Syntactic Ambiguity
▪ If you love money problems show up
▪ If you love, money problems show up.
▪ If you love money, problems show up.
▪ If you love money problems, show up.
▪ “I made her duck.”
▪ “We're eating grandpa!” vs. “We're eating, grandpa!”
▪ “Weil er drei Monate verfallene Medikamente nahm, ...” (German: “Because he took expired medication for three months, ...” or “Because he took medication that had expired three months earlier, ...”)
▪ Different interpretations are mainly caused by syntactic ambiguity.
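Syntactic Ambiguities: Two Possible Parses
“I saw the man with a telescope.”
▪ Reading 1: I used a telescope to see the man (the phrase attaches to the verb)
▪ Reading 2: I saw the man who had a telescope (the phrase attaches to the noun)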
Definition
Semantics:
▪ Study of the meaning of words, phrases, sentences, or documents
Lexical Semantics
▪ Study of the meaning of lexical units, i.e. words.
Lexical Ambiguity
He hit the ball with the bat.
Chuck Norris can hit a bat with a ball.
▪ Different interpretations are caused by lexical ambiguity.
Pragmatics
What is the purpose of an utterance?
“I never said she stole my money” (the meaning changes with the stressed word)
▪ Stress on “never”: I simply didn't ever say it.
▪ Stress on “I”: Someone else said it, but I didn't.
▪ Stress on “said”: I might have implied it in some way, but I never explicitly said it.
▪ Stress on “she”: I said someone took it; I didn't say it was she.
▪ Stress on “stole”: I just said she probably borrowed it.
▪ Stress on “my”: I said she stole someone else's money.
▪ Stress on “money”: I said she stole something of mine, but not my money.
Example from Wikipedia
Pragmatics
What is the purpose of an utterance?
Utterance: “Is it cold in here or is it just me?”
Intended meaning: “Please close the window!”
Utterance: “Oh, great! Another meeting.”
Intended meaning: The speaker likely means the opposite of what they are
literally saying: they probably dislike meetings, despite the positive tone.
Summary – Linguistic Analysis Levels
Elementary, my dear Watson
▪ Phonetics and Phonology: [ɛlɪˈmɛntəri, maɪ dɪə ˈwɒtsən]
▪ Segmentation: ["Elementary", ",", "my", "dear", "Watson"]
▪ Morphology: Base: Element, Suffix: -ary
▪ Syntax: ADJ, PRP$ ADJ NNP
▪ Semantics: Watson: Dr. John H. Watson (not IBM)
▪ Pragmatics and Discourse: "You are so stupid…"
Take-Home-Messages
▪ Natural language processing is an interesting topic ☺
▪ There are a lot of challenges
▪ Typical preprocessing steps:
▪ Tokenization for splitting texts into tokens
▪ Stemming / Lemmatization to normalize tokens
▪ PoS-Tagging and parsing analyze syntactic features
▪ PoS-tags roughly represent word classes
▪ Phrases group words to function as a single unit
▪ Ambiguity in language makes analysis a hard problem
Next Lecture
Text Classification