
Note: Please refer following books for detailed study:


 Daniel Jurafsky and James H. Martin, Speech and Language Processing, 3rd Edition, Prentice Hall, 2023.

 Steven Bird, Ewan Klein and Edward Loper, Natural Language Processing with Python, O’Reilly Media, First
edition, 2009.

Summary of Notes
NLP overview

Natural Language Processing is the study and engineering of computational systems that can analyze,
understand, generate, and interact using human language. It bridges unstructured linguistic signals—text
and speech—and structured machine representations by combining computational linguistics with
statistical and neural machine learning. A useful mental model is a pipeline: data collection and cleaning;
text normalization and tokenization; linguistic analysis (morphology, part-of-speech tagging, chunking,
parsing); feature learning or representation learning (from TF-IDF to embeddings like word2vec, GloVe,
contextual encoders); task modeling (classification, sequence labeling, sequence-to-sequence, retrieval-
augmented generation); and evaluation with appropriate metrics. This pipeline clarifies dependencies:
better normalization reduces vocabulary sparsity; better tagging and parsing reduce ambiguity propagated
to downstream tasks. Diagram to draw: a left-to-right pipeline with boxes “Raw text → Normalization →
Tokenization → Linguistic analysis → Representations → Model → Output,” with arrows; beneath each
box, write two concrete examples (e.g., Normalization: lowercasing, Unicode NFKC; Tokenization:
whitespace, WordPiece; Linguistic analysis: POS, dependency; Representations: TF-IDF, BERT
embeddings; Model: CRF, Transformer; Output: labels, summary, translation). Applications span
information extraction (NER, relations), search and ranking (query understanding, intent), text
classification (spam, sentiment, topic), conversational agents (dialogue state tracking, response
generation), machine translation, summarization, and question answering. Key historical phases: rule-
based systems (grammars, lexicons), statistical NLP (n-grams, HMMs/CRFs, PCFGs), and neural NLP
culminating in attention-based Transformers and large-scale pretraining. The field’s central challenges are
ambiguity, data quality, domain shift, and encoding of world knowledge; modern systems increasingly
use transfer learning, instruction tuning, and retrieval to address them.

Applications of NLP

Applications can be organized by input-output mapping and required linguistic depth. For classification,
sentiment analysis maps sentences or documents to polarity labels; spam filters map messages to
“spam/ham”; topic classifiers map documents to taxonomies. For extraction, NER identifies entities
(PER, ORG, LOC), relation extraction links entity pairs with semantic relations (works_for,
headquartered_in), and event extraction structures triggers and arguments. For generation, summarization
compresses content while preserving key facts; translation maps sequences across languages; dialogue
systems produce contextually appropriate responses. In search and recommendations, query rewriting,
entity linking, and semantic retrieval improve relevance; in enterprise, document intelligence
(OCR+NLP), contract analytics, and support automation are major use cases. Evaluation varies by task:
accuracy/F1 for classification and extraction; BLEU, ROUGE, or BERTScore for generation but
increasingly human preference or task success metrics; latency and cost matter in production. Typical
system patterns include retrieval-augmented generation (RAG), which uses dense retrievers to fetch
knowledge and then generates grounded responses, and tool-use agents that combine language models with
deterministic tools for reliable actions. Deployment considerations include multilingual coverage, fairness
and bias audits, privacy for sensitive data, and robustness to adversarial prompts or noisy inputs.

Ambiguity in NLP (hard problems)

Ambiguity drives the core difficulty in language understanding. Lexical ambiguity arises when a word
has multiple senses (bank=financial institution vs. river edge); resolving it requires context and
sometimes world knowledge, a task known as word sense disambiguation. Syntactic ambiguity occurs
when the same sequence yields multiple parse trees (prepositional phrase attachment: “I saw the man with
a telescope”—did the seeing happen with a telescope, or did the man have it?); probabilistic or neural
parsers score structures to prefer the most plausible attachment. Semantic ambiguity occurs at the
sentence-meaning level (“Visiting relatives can be boring” could mean the act of visiting them is boring,
or relatives who visit are boring); models need selectional preferences and event structure. Pragmatic
ambiguity depends on context, intention, and social norms (“Can you pass the salt?” is a request, not a
query about ability); resolving it involves speech act recognition. Referential/anaphoric ambiguity arises
when pronouns or definite descriptions have multiple candidates (“Alice told Jane that she would win”—
who is she?); coreference resolution and entity tracking aim to resolve this. Diagram to draw: a central
sentence bubble with five arrows to boxes labeled Lexical, Syntactic, Semantic, Pragmatic, Referential; in
each box, write a one-line example and the typical resolver (sense disambiguation, parser, semantic role
labeling, intent detection, coreference). Mathematically, ambiguity resolution often uses Bayesian or
discriminative scoring over alternatives. For syntactic ambiguity, a parser selects the tree T* = argmax_T P(T∣x), where P(T∣x) ∝ P(x∣T)P(T). For tagging, Viterbi selects the tag sequence t1…tn = argmax over sequences of ∏_i P(t_i∣t_{i−1}) P(w_i∣t_i) in HMMs, illustrating how context reduces ambiguity. Despite advances, residual errors concentrate on
rare senses, long-distance dependencies, idioms, sarcasm, and under-specified references; incorporating
external knowledge and discourse modeling remains active research.
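
A minimal Python sketch of this kind of scoring for HMM tagging (the tag set, words, and probability values below are invented purely for illustration; they are not from these notes):

# Toy Viterbi decoder for an HMM POS tagger (illustrative probabilities only).
from math import log

tags = ["NOUN", "VERB"]
# Hypothetical transition P(tag_i | tag_{i-1}) and emission P(word | tag) tables.
trans = {("<s>", "NOUN"): 0.7, ("<s>", "VERB"): 0.3,
         ("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.7,
         ("VERB", "NOUN"): 0.6, ("VERB", "VERB"): 0.4}
emit = {("NOUN", "flies"): 0.4, ("VERB", "flies"): 0.6,
        ("NOUN", "time"): 0.9, ("VERB", "time"): 0.1}

def viterbi(words):
    # best[t] = (log-probability, tag path) of the best sequence ending in tag t
    best = {t: (log(trans[("<s>", t)]) + log(emit.get((t, words[0]), 1e-6)), [t]) for t in tags}
    for w in words[1:]:
        best = {t: max((best[p][0] + log(trans[(p, t)]) + log(emit.get((t, w), 1e-6)),
                        best[p][1] + [t]) for p in tags)
                for t in tags}
    return max(best.values())[1]

print(viterbi(["time", "flies"]))   # ['NOUN', 'VERB'] under these toy numbers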

Algorithms and models

Algorithmic families align with tasks and data regimes. Rule-based methods (finite-state transducers for
tokenization and morphology; handcrafted grammars for parsing) offer interpretability and precision in
constrained domains. Statistical learning introduced n-gram language models that estimate P(w_i ∣ w_{i−n+1} … w_{i−1}), and sequence models like HMMs and CRFs for tagging and segmentation; decoding typically uses dynamic programming. Parsing used PCFGs with chart algorithms maximizing P(T, x) = ∏ P(rule) over the rules in the derivation. Neural
models began with RNNs and LSTMs for sequences, CNNs for character/subword features, attention for
long-range dependencies, and the Transformer architecture relying solely on self-attention.
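
A small sketch of the n-gram estimation idea (the toy corpus and add-one smoothing are illustrative assumptions):

# Estimate bigram probabilities P(w_i | w_{i-1}) by counting, with add-one smoothing.
from collections import Counter

corpus = [["i", "eat", "mango"], ["i", "eat", "rice"], ["you", "eat", "mango"]]  # toy corpus
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    unigrams.update(tokens[:-1])                  # counts of the conditioning context
    bigrams.update(zip(tokens[:-1], tokens[1:]))  # counts of adjacent word pairs

vocab_size = len(set(unigrams) | {w for _, w in bigrams})

def prob(word, prev):
    # Add-one (Laplace) smoothed conditional probability.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

print(prob("eat", "i"), prob("mango", "eat"))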

Knowledge bottlenecks in NLP

The knowledge bottleneck is the gap between what a model encodes and what robust language
understanding requires. Text alone seldom contains the full commonsense and world knowledge needed
to resolve implicature, presupposition, ellipsis, and pragmatic cues; labeled data is costly and uneven
across domains and languages; distribution shift causes failures when test data differs from training. This
bottleneck appears in classic errors: pronoun resolution requiring world knowledge (“The trophy doesn’t
fit in the suitcase because it is too small”—it=suitcase), temporal reasoning, spatial relations, or
procedural knowledge. Strategies to ease the bottleneck include pretraining on massive corpora to learn
linguistic and factual regularities; retrieval augmentation to ground generation in up-to-date sources;
integration of structured knowledge graphs (e.g., Wikidata) and differentiable memory; weak supervision
and data programming to expand coverage; and instruction tuning with preference optimization to better
follow task intent. Diagram to draw: a funnel where “Language Input” enters a “Model” box through a
narrow neck; side arrows inject “World Knowledge,” “Domain Data,” and “Context,” illustrating that
without these, the neck restricts throughput; outputs annotate typical error types (wrong referent, wrong
sense, hallucination).

Introduction to NLTK

The Natural Language Toolkit is a pedagogical and prototyping library in Python that bundles tokenizers,
stemmers, lemmatizers, POS taggers, chunkers, parsers, classifiers, corpus readers, and evaluation
metrics. It includes access to corpora like Gutenberg, Brown, and movie_reviews, and utilities like
concordance views. Typical workflow: install and import, run nltk.download() to fetch resources;
tokenize text (word_tokenize, sent_tokenize or PunktSentenceTokenizer); clean or normalize (stopwords
corpus, regexp); stem (PorterStemmer, SnowballStemmer) or lemmatize (WordNetLemmatizer, requiring
POS for best results); tag POS (nltk.pos_tag using a pretrained tagger); chunk noun phrases with
RegexpParser; parse with a probabilistic parser if models are available; evaluate with accuracy/F1 and
confusion matrices from nltk.metrics. A simple “mini-pipeline” could: take input text, sentence-tokenize,
word-tokenize each sentence, remove stopwords, lemmatize with POS derived from tags, tag POS, and
chunk NP patterns like “<DT>?<JJ>*<NN.+>+”.
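
A sketch of such a mini-pipeline with NLTK (the resource names are the standard NLTK downloads; the Penn-to-WordNet tag mapping is a common convention added here, not something defined in these notes):

# Mini-pipeline: sentence-tokenize, word-tokenize, remove stopwords, POS-tag, lemmatize, chunk NPs.
import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

for resource in ["punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"]:
    nltk.download(resource, quiet=True)

def wordnet_pos(treebank_tag):
    # Map Penn Treebank tags onto the POS classes WordNetLemmatizer expects.
    return {"J": wordnet.ADJ, "V": wordnet.VERB, "R": wordnet.ADV}.get(treebank_tag[0], wordnet.NOUN)

def mini_pipeline(text):
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))
    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
    results = []
    for sentence in sent_tokenize(text):
        tagged = nltk.pos_tag(word_tokenize(sentence))
        tagged = [(w, t) for w, t in tagged if w.casefold() not in stop_words]
        lemmas = [lemmatizer.lemmatize(w, wordnet_pos(t)) for w, t in tagged]
        results.append((lemmas, chunker.parse(tagged)))
    return results

for lemmas, tree in mini_pipeline("The quick brown foxes were jumping over the lazy dogs."):
    print(lemmas)
    print(tree)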

Limitations: NLTK lags spaCy and Transformer-based libraries in speed and industrial-scale performance; yet it is ideal for
demonstrations, assignments, and quick baselines.

Word-level analysis: text normalization, edit distance, parsing, syntax

Word-level analysis reduces variability and establishes structure. Normalization decisions are task-
dependent: lowercasing affects proper nouns; Unicode normalization like NFKC harmonizes visually
similar characters; punctuation handling differs between IR and sentiment tasks; number handling may
map digits to placeholders; contractions expansion improves downstream parsing; stopword removal
reduces noise for bag-of-words but can harm generation. Stemming aggressively chops suffixes (connected
→ connect, connections → connect) but may over-stem (conflating universe and university), while lemmatization
uses vocabulary and POS to return lemmas (better → good if tagged as JJR). Tokenization ranges from
whitespace to rule-based to subword (BPE/WordPiece), with trade-offs between OOV handling and
morphological fidelity. Edit distance formalizes similarity: Levenshtein distance counts insertions,
deletions, substitutions with unit costs; Damerau-Levenshtein adds transposition. Worked example:
“kitten”→“sitting” has 3 edits; draw a grid with rows “k i t t e n” and columns “s i t t i n g,” fill values
and circle the minimal path. Syntax and parsing sit above word-level but critically depend on tokenization
and normalization; constituency parsing builds hierarchical phrase-structure trees labeled NP/VP/PP,
whereas dependency parsing produces head-dependent arcs with labels like nsubj, obj, amod. CKY
parsing requires a CNF grammar and fills a triangular chart bottom-up; Earley parsing handles arbitrary
CFGs with dotted rules and three operations (predict, scan, complete). Neural parsers use encoders to
produce contextual token vectors and score arcs or transitions; decoding finds maximum spanning trees or
highest-probability transition sequences. Hand-drawable diagram: two alternative parses for “the boy saw
the man with a telescope,” showing PP attaching to NP vs VP, and a dependency graph with different
heads for “with.”
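
A short sketch of some of these normalization choices (every choice below is task-dependent and illustrative, not a recommendation):

# Task-dependent normalization: Unicode NFKC, optional lowercasing, digit mapping, crude tokenization.
import re
import unicodedata

def normalize(text, lowercase=True, map_digits=True):
    text = unicodedata.normalize("NFKC", text)     # harmonize visually similar Unicode forms
    if lowercase:
        text = text.lower()                        # may hurt tasks that rely on proper nouns
    if map_digits:
        text = re.sub(r"\d+", "<num>", text)       # map numbers to a placeholder
    return re.findall(r"<num>|\w+|[^\w\s]", text)  # simple rule-based tokenization

print(normalize("Café costs ５０ dollars!"))        # NFKC turns full-width ５０ into 50, then <num>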

Spelling: error detection and correction



Spelling correction decomposes into detection (is the token erroneous?) and correction (which candidate
is intended?). Nonword errors are easy to detect with a lexicon; real-word errors require contextual
modeling (“peace” vs “piece”). Candidate generation uses edit neighborhoods within distance 1–2,
keyboard adjacency graphs, phonetic hashing (Soundex, Metaphone), and morphological variants.
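
A minimal sketch of generating the edit-distance-1 neighborhood (the alphabet and example word are assumptions; a real corrector would rank these candidates with a language model or error model):

# All candidate strings within one edit: deletions, transpositions, substitutions, insertions.
import string

def edits1(word, alphabet=string.ascii_lowercase):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right) > 1]
    substitutions = [left + c + right[1:] for left, right in splits if right for c in alphabet]
    inserts = [left + c + right for left, right in splits for c in alphabet]
    return set(deletes + transposes + substitutions + inserts)

candidates = edits1("peice")
print(len(candidates), "piece" in candidates)   # the intended word is in the neighborhood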

Words, word classes, and POS tagging

Words belong to open classes (nouns, verbs, adjectives, adverbs) and closed classes (prepositions,
determiners, conjunctions, pronouns, particles), with tagsets like Penn Treebank (NN, NNS, NNP, VB,
VBD, VBG, JJ, RB, IN, DT, PRP). POS tagging assigns the most likely tag sequence given the token
sequence, supporting parsing, lemmatization (verb vs noun lemmas), and downstream tasks like NER.

Detailed Notes

Natural Language Processing (NLP) Unit-I

1. Natural Language Processing – Introduction

 Humans communicate through some form of language either by text or speech.


 To make interactions between computers and humans, computers need to
understand natural languages used by humans.
 Natural language processing is all about making computers learn, understand,
analyze, manipulate and interpret natural (human) languages.
 NLP stands for Natural Language Processing, which is a part of Computer Science,
Human languages or Linguistics, and Artificial Intelligence.
 Processing of Natural Language is required when you want an intelligent system
like a robot to perform as per your instructions, or when you want to hear a decision
from a dialogue-based clinical expert system, etc.
 The ability of machines to interpret human language is now at the core of many
applications that we use every day - chatbots, Email classification and spam filters,
search engines, grammar checkers, voice assistants, and social language
translators.
 The input and output of an NLP system can be Speech or Written Text.

2. Applications of NLP or Use cases of NLP

1. Sentiment analysis
 Sentiment analysis, also referred to as opinion mining, is an approach to natural
language processing (NLP) that identifies the emotional tone behind a body of text.
 This is a popular way for organizations to determine and categorize opinions about
a product, service or idea.
 Sentiment analysis systems help organizations gather insights into real-time
customer sentiment, customer experience and brand reputation.

 Generally, these tools use text analytics to analyze online sources such as emails, blog
posts, online reviews, news articles, survey responses, case studies, web chats, tweets,
forums and comments.
 Sentiment analysis uses machine learning models to perform text analysis of
human language. The metrics used are designed to detect whether the overall
sentiment of a piece of text is positive, negative or neutral.
2. Machine Translation
 Machine translation, sometimes referred to by the abbreviation MT, is a sub-field of
computational linguistics that investigates the use of software to translate text or
speech from one language to another.
 On a basic level, MT performs mechanical substitution of words in one language for
words in another, but that alone rarely produces a good translation because
recognition of whole phrases and their closest counterparts in the target language
is needed.
 Not all words in one language have equivalent words in another language, and
many words have more than one meaning.

 Solving this problem with corpus-based statistical and neural techniques is a rapidly growing field
that is leading to better translations, handling differences in linguistic typology,
translation of idioms, and the isolation of anomalies.
 Corpus: A collection of written texts, especially the entire works of a particular
author.

3. Text Extraction
 There are a number of natural language processing techniques that
can be used to extract information from text or unstructured data.
 These techniques can be used to extract information such as entity names,
locations, quantities, and more.
 With the help of natural language processing, computers can make
sense of the vast amount of unstructured text data that is generated
every day, and humans can reap the benefits of having this information
readily available.
 Industries such as healthcare, finance, and e-commerce are already
using natural language processing techniques to extract information and
improve business processes.
 As the machine learning technology continues to develop, we will only
see more and more information extraction use cases covered.

4. Text Classification

 Unstructured text is everywhere, such as emails, chat conversations, websites, and


social media. Nevertheless, it’s hard to extract value from this data unless it’s
organized in a certain way.
 Text classification, also known as text tagging or text categorization, is the process of
categorizing text into organized groups. By using Natural Language Processing (NLP), text
classifiers can automatically analyze text and then assign a set of pre-defined tags or
categories based on its content.
 Text classification is becoming an increasingly important part of businesses as it
allows to easily get insights from data and automate business processes.

5. Speech Recognition
 Speech recognition is an interdisciplinary subfield of computer
science and computational linguistics that develops methodologies and technologies
that enable the recognition and translation of spoken language into text
by computers.
 It is also known as automatic speech recognition (ASR), computer speech recognition
or speech to text (STT).
 It incorporates knowledge and research in the computer science, linguistics and
computer engineering fields. The reverse process is speech synthesis.

Speech recognition use cases
 A wide number of industries are utilizing different applications of speech technology
today, helping businesses and consumers save time and even lives. Some examples
include:
 Automotive: Speech recognizers improve driver safety by enabling voice-activated
navigation systems and search capabilities in car radios.
 Technology: Virtual agents are increasingly becoming integrated within our daily
lives, particularly on our mobile devices. We use voice commands to access them
through our smartphones, such as through Google Assistant or Apple’s Siri, for tasks,
such as voice search, or through our speakers, via Amazon’s Alexa or Microsoft’s
Cortana, to play music. They’ll only continue to integrate into the everyday products
that we use, fueling the “Internet of Things” movement.
 Healthcare: Doctors and nurses leverage dictation applications to capture and log
patient diagnoses and treatment notes.
 Sales: Speech recognition technology has a couple of applications in sales. It can help
a call center transcribe thousands of phone calls between customers and agents to
identify common call patterns and issues. AI chatbots can also talk to people via a
webpage, answering common queries and solving basic requests without needing to
wait for a contact center agent to be available. In both instances speech recognition
systems help reduce time to resolution for consumer issues.
6. Chatbot
 Chatbots are computer programs that conduct automatic conversations with people.
They are mainly used in customer service for information acquisition. As the name
implies, these are bots designed with the purpose of chatting and are also simply
referred to as “bots.”

 You’ll come across chatbots on business websites or messengers that give pre-scripted
replies to your questions. As the entire process is automated, bots can provide quick
assistance 24/7 without human intervention.

7. Email Filter
 One of the most fundamental and essential applications of NLP online is email
filtering. It began with spam filters, which identified specific words or phrases that
indicate a spam message. But, like early NLP adaptations, filtering has been improved.
 Gmail's email categorization is one of the more common, newer implementations of
NLP. Based on the contents of emails, the algorithm determines whether they
belong in one of three categories (main, social, or promotional).
 This keeps your inbox manageable for all Gmail users, with critical, relevant
emails you want to see and reply to fast.
8. Search Autocorrect and Autocomplete
 When you type 2-3 letters into Google to search for anything, it displays a list of
probable search keywords. Alternatively, if you search for anything with mistakes, it
corrects them for you while still returning relevant results. Isn't it incredible?

 Everyone uses Google search autocorrect and autocomplete on a regular basis but
seldom gives them any thought. They are a fantastic illustration of how natural language
processing is touching millions of people across the world, including you and me.
 Both search autocomplete and autocorrect make it much easier to locate accurate
results.
3. Components of NLP
 There are two components of NLP, Natural Language Understanding (NLU) and
Natural Language Generation (NLG).
 Natural Language Understanding (NLU) involves transforming human language
into a machine-readable format. It helps the machine to understand and analyze
human language by extracting keywords, emotions, relations, and semantics from
large amounts of data.
 Natural Language Generation (NLG) acts as a translator that
converts the computerized data into a natural language representation.
 It mainly involves Text planning, Sentence planning, and Text realization.
 NLU is harder than NLG.

4. Steps in NLP
There are in general five steps:

 1. Lexical Analysis
 2. Syntactic Analysis (Parsing)
 3. Semantic Analysis
 4. Discourse Integration
 5. Pragmatic Analysis

Lexical Analysis:
 The first phase of NLP is Lexical Analysis.
 This phase scans the source text as a stream of characters and converts it into
meaningful lexemes.
 It divides the whole text into paragraphs, sentences, and words.
 Lexeme: A lexeme is a basic unit of meaning. In linguistics, the abstract unit of
morphological analysis that corresponds to a set of forms taken by a single word is
called a lexeme.
 The way in which a lexeme is used in a sentence is determined by its grammatical
category.

 A lexeme can be an individual word or a multiword expression.

 For example, the word talk is an example of an individual word lexeme,
which may have many grammatical variants like talks, talked and talking.
 A multiword lexeme can be made up of more than one orthographic word. For
example, speak up, pull through, etc. are examples of multiword lexemes.

Syntax Analysis (Parsing)
 Syntactic Analysis is used to check grammar and word arrangements, and
shows the relationships among the words.
 A sentence such as “The school goes to boy” is rejected by an English syntactic
analyzer.

Semantic Analysis
 Semantic analysis is concerned with the meaning representation.
 It mainly focuses on the literal meaning of words, phrases, and sentences.
 The semantic analyzer disregards sentences such as “hot ice-cream”.
 Another example is “Manhattan calls out to Dave”, which passes a syntactic analysis because
it is a grammatically correct sentence. However, it fails a semantic analysis.
Because Manhattan is a place (and can’t literally call out to people), the sentence’s meaning
doesn’t make sense.

Discourse Integration
 Discourse Integration depends upon the sentences that precede it and also
invokes the meaning of the sentences that follow it.

 For instance, if one sentence reads, “Manhattan speaks to all its people,” and the
following sentence reads, “It calls out to Dave,” discourse integration checks the first
sentence for context to understand that “It” in the latter sentence refers to
Manhattan.

Pragmatic Analysis
 During this phase, what was said is re-interpreted as what it actually meant.
 It involves deriving those aspects of language which require real world knowledge.
 For instance, a pragmatic analysis can uncover the intended meaning of “Manhattan
speaks to all its people.” Methods like neural networks assess the context to
understand that the sentence isn’t literal, and most people won’t interpret it as such. A
pragmatic analysis deduces that this sentence is a metaphor for how people
emotionally connect with a place.

5. Finding the Structure of Words
Words and Their Components
 Words are defined in most languages as the smallest linguistic units that
can form a complete utterance by themselves.

 The minimal parts of words that deliver aspects of meaning to them are called
morphemes.

Tokens:
Suppose, for a moment, that words in English are delimited only by
whitespace and punctuation (the marks, such as full stop, comma,
and brackets).

 Example: Will you read the newspaper? Will you read it? I won’t
read it. If we confront our assumption with insights from syntax,
we notice two words here: newspaper and won’t.

Being a compound word, newspaper has an interesting derivational
structure.
In writing, newspaper and the associated concept is distinguished from
the isolated news and paper.
For reasons of generality, linguists prefer to analyze won’t as two
syntactic words, or tokens, each of which has its independent role and can be
reverted to its normalized form.

 The structure of won’t could be parsed as will followed by not.
 In English, this kind of tokenization and normalization may apply to just a
limited set of cases, but in other languages, these phenomena have to be
treated in a different way.

Lexemes
 By the term word, we often denote not just the one linguistic form in the
givencontext but also the concept behind the form and the set of alternative forms
that can express it.
 Such sets are called lexemes or lexical items, and they constitute the lexicon of a
language.
 Lexemes can be divided by their behaviour into the lexical categories of verbs, nouns,
adjectives, conjunctions or other parts of speech.
 The citation form of a lexeme, by which it is commonly identified, is also called its
lemma.
 When we convert a word into its other forms, such as turning the singular mouse into
the plural mice or mouses, we say we inflect the lexeme.
 When we transform a lexeme into another one that is morphologically related,
regardless of its lexical category, we say we derive the lexeme: for instance, the nouns
receiver and reception are derived from the verb receive.
 Example: Did you see him? I didn’t see him. I didn’t see anyone. This example presents the
problem of tokenization of didn’t and the investigation of the internal structure of
anyone.
 The difficulty with the definition of what counts as a word need not pose a problem
for the syntactic description if we understand no one as two closely connected
tokens treated as one fixed element.

Morphemes
These components are usually called segments or morphs.

Morphology
Morphology is the domain of linguistics that analyses the internal structure of words.
 Morphological analysis – exploring the structure of words
 Words are built up of minimal meaningful elements called morphemes:
played = play-ed
cats = cat-s
unfriendly = un-friend-ly
Two types of morphemes:
i. Stems: play, cat, friend
ii. Affixes: -ed, -s, un-, -ly
Two main types of affixes:
i. Prefixes precede the stem: un-
ii. Suffixes follow the stem: -ed, -s, -ly
Stemming = find the stem by stripping off affixes:
played = play
replayed = re-play-ed
computerized = comput-er-ize-d

Problems in morphological processing
Inflectional morphology: inflected forms are constructed from base forms and
inflectional affixes.
Inflection relates different forms of the same word, e.g.:
Lemma     Singular    Plural
cat       cat         cats
mouse     mouse       mice
Derivational morphology: words are constructed from roots (or stems) and
derivational affixes:
inter + national = international
international + ize = internationalize
internationalize + ation = internationalization

 The simplest morphological process concatenates morphs one by one, as in disagree-


ment-s, where agree is a free lexical morpheme and the other elements are bound
grammatical morphemes contributing some partial meaning to the whole word.
 In a more complex scheme, morphs can interact with each other, and their forms
may become subject to additional phonological and orthographic changes denoted
as morphophonemic.
 The alternative forms of a morpheme are termed allomorphs.
 The ending -s, indicating plural in “cats” and “dogs,” the -es in “dishes,” and the -en of
“oxen” are all allomorphs of the plural morpheme.


Typology
 Morphological typology divides languages into groups by characterizing the prevalent
morphological phenomena in those languages.
 It can consider various criteria, and during the history of linguistics, different
classifications have been proposed.
 Let us outline the typology that is based on quantitative relations between words, their
morphemes, and their features:

 Isolating, or analytic, languages include no or relatively few words that would


comprise more than one morpheme (typical members are Chinese, Vietnamese, and
Thai; analytic tendencies are also found in English).
 Synthetic languages can combine more morphemes in one word and are further
divided into agglutinative and fusional languages.
 Agglutinative languages have morphemes associated with only a single function at a
time (as in Korean, Japanese, Finnish, and Tamil, etc.)
 Fusional languages are defined by their feature-per-morpheme ratio higher than one
(as in Arabic, Czech, Latin, Sanskrit, German, etc.).
 In accordance with the notions about word formation processes mentioned earlier,
we can also classify languages as concatenative or nonlinear:
 Concatenative languages link morphs and morphemes one after another.
 Nonlinear languages allow structural components to merge nonsequentially to
apply tonal morphemes or change the consonantal or vocalic templates of words.

Morphological Typology
 Morphological typology is a way of classifying the languages of the world that groups
languages according to their common morphological structures.
 The field organizes languages on the basis of how those languages form words by combining
morphemes.
 Morphological typology classifies languages into two broad classes, synthetic
languages and analytic languages.
 The synthetic class is then further sub-classified as either agglutinative languages or
fusional languages.
 Analytic languages contain very little inflection, instead relying on features like word
order and auxiliary words to convey meaning.
 Synthetic languages, ones that are not analytic, are divided into two categories:
agglutinative and fusional languages.
 Agglutinative languages rely primarily on discrete particles (prefixes, suffixes, and
infixes) for inflection, e.g. inter+national = international, international+ize =
internationalize.
 Fusional languages "fuse" inflectional categories together, often allowing one
word ending to contain several categories, such that the original root can be difficult to
extract (anybody, newspaper).

6. Natural Language Processing With Python's NLTK Package

• NLTK, or Natural Language Toolkit, is a Python package that you can use for NLP.

• A lot of the data that you could be analyzing is unstructured data and contains human-
readable text.

• Before you can analyze that data programmatically, you first need to preprocess it.
• Now we are going to see the kinds of text preprocessing tasks you can do with NLTK so that
you'll be ready to apply them in future projects.

1. Tokenizing
 By tokenizing, you can conveniently split up text by word or by sentence.
 This will allow you to work with smaller pieces of text that are still relatively coherent
and meaningful even outside of the context of the rest of the text.
 It's your first step in turning unstructured data into structured data, which is easier to
analyze.
 When you're analyzing text, you'll be tokenizing by word and tokenizing by sentence.

Tokenizing by word
• Words are like the atoms of natural language. They're the smallest unit of meaning
that still makes sense on its own.
• Tokenizing your text by word allows you to identify words that come up particularly
often.
• For example, if you were analyzing a group of job ads, then you might find that the
word "Python" comes up often.
• That could suggest high demand for Python knowledge, but you'd need to look deeper to
know more.
Tokenizing by sentence
• When you tokenize by sentence, you can analyze how those words relate to one
another and see more context.
• Are there a lot of negative words around the word "Python" because the hiring
manager doesn't like Python?
• Are there more terms from the domain of herpetology than the domain of software
development, suggesting that you may be dealing with an entirely different kind of python
than you were expecting?

Python Program for Tokenizing by Sentence


from nltk.tokenize import sent_tokenize, word_tokenize
example_string = """
Muad'Dib learned rapidly because his first training was in how to
learn. And the first lesson of all was the basic trust that he could
learn. It's shocking to find how many people do not believe
they can learn, and how many more believe learning to be
difficult."""
sent_tokenize(example_string)
Output
["\nMuad'Diblearnedrapidlybecausehisfirsttrainingwasinhowtolearn.",
15

'Andthefirstlessonofallwasthebasictrustthathecouldlearn.’,
"It'sshockingtofindhow manypeopledo notbelievetheycanlearn,\nand how many
more believe learning to be difficult."]
Note:
import nltk
nltk.download('punkt')

Python Program for Tokenizing by Word


from nltk.tokenize import sent_tokenize, word_tokenize
example_string = """
Muad'Dib learned rapidly because his first training was in how to learn. And
the first lesson of all was the basic trust that he could learn. It's
shocking to find how many people do not believe they can learn, and
how many more believe learning to be difficult."""
word_tokenize(example_string)
Output:
["Muad'Dib",'learned','rapidly','because','his','first','training','was','in','how','to',
'learn','.','And','the','first','lesson','of','all','was','the','basic','trust','that','he',
'could','learn','.','It',"'s", 'shocking','to','find','how','many','people','do','not',
'believe','they','can','learn',',','and','how','many','more','believe','learning','to','be',
'difficult', '.']

2. Filtering Stop Words
 Stop words are words that you want to ignore, so you filter them out of your text
when you're processing it. Very common words like 'in', 'is', and 'an' are
often used as stop words since they don't add a lot of meaning to a text in and of
themselves.
 Note: nltk.download("stopwords")

Python program to eliminate stop words
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
worf_quote = "Sir, I protest. I am not a merry man!"
words_in_quote = word_tokenize(worf_quote)
print(words_in_quote)
stop_words = set(stopwords.words("english"))
filtered_list = []
for word in words_in_quote:
    if word.casefold() not in stop_words:
        filtered_list.append(word)
print(filtered_list)
Output:
• ['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']
• ['Sir', ',', 'protest', '.', 'merry', 'man', '!']
• 'I' is a pronoun, and pronouns are context words.
• Content words give you information about the topics covered in the text or the
sentiment that the author has about those topics.
• Context words give you information about writing style. You can observe patterns in how
authors use context words in order to quantify their writing style.
• Once you've quantified their writing style, you can analyze a text written by an
unknown author to see how closely it follows a particular writing style so you can try to
identify who the author is.

3. Stemming
 Stemming is a text processing task in which you reduce words to their root, which is the
core part of a word.
 For example, the words "helping" and "helper" share the root "help."
 Stemming allows you to zero in on the basic meaning of a word rather than all the
details of how it's being used.
 NLTK has more than one stemmer, but we'll be using the Porter stemmer.
Python program for Stemming
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()
string_for_stemming = "The crew of the USS Discovery discovered many discoveries. Discovering is what explorers do."
words = word_tokenize(string_for_stemming)
print(words)
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
Output
• ['The', 'crew', 'of', 'the', 'USS', 'Discovery', 'discovered', 'many', 'discoveries', '.',
'Discovering', 'is', 'what', 'explorers', 'do', '.']
• ['the', 'crew', 'of', 'the', 'uss', 'discoveri', 'discov', 'mani', 'discoveri', '.', 'discov', 'is',
'what', 'explor', 'do', '.']

Original word     Stemmed version
'Discovery'       'discoveri'
'discovered'      'discov'
'discoveries'     'discoveri'
'Discovering'     'discov'

4. Tagging Parts of Speech
Part of speech is a grammatical term that deals with the roles words play when
you use them together in sentences. Tagging parts of speech, or POS tagging, is
the task of labeling the words in your text according to their part of speech.

Part of speech   Role                                                              Examples
Noun             Is a person, place, or thing                                      mountain, bagel, Poland
Pronoun          Replaces a noun                                                   you, she, we
Adjective        Gives information about what a noun is like                       efficient, windy, colorful
Verb             Is an action or a state of being                                  learn, is, go
Adverb           Gives information about a verb, an adjective, or another adverb   efficiently, always, very
Preposition      Gives information about how a noun or pronoun is connected        from, about, at
                 to another word
Conjunction      Connects two other words or phrases                               so, because, and
Interjection     Is an exclamation                                                 yay, ow, wow

• Some sources also include the category articles (like "a" or "the") in the list of parts
of speech, but other sources consider them to be adjectives. NLTK uses the word
determiner to refer to articles.

Python program for Tagging Parts of Speech
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
sagan_quote = """
If you wish to make an apple pie from scratch, you
must first invent the universe."""
words_in_sagan_quote = word_tokenize(sagan_quote)
nltk.pos_tag(words_in_sagan_quote)
Output:
• [('If', 'IN'), ('you', 'PRP'), ('wish', 'VBP'), ('to', 'TO'), ('make', 'VB'), ('an', 'DT'), ('apple',
'NN'), ('pie', 'NN'), ('from', 'IN'), ('scratch', 'NN'), (',', ','), ('you', 'PRP'), ('must', 'MD'),
('first', 'VB'), ('invent', 'VB'), ('the', 'DT'), ('universe', 'NN'), ('.', '.')]
POS Tag information
• nltk uses the Penn Treebank's POS tags

nltk.download('tagsets')
nltk.help.upenn_tagset()

5. Lemmatizing
• Like stemming, lemmatizing reduces words to their core meaning, but it will give
you a complete English word that makes sense on its own instead of just a fragment of a
word like 'discoveri'.
• A lemma is a word that represents a whole group of words, and that group of words is called
a lexeme.
• For example, if you were to look up the word "blending" in a dictionary, then you'd
need to look at the entry for "blend," but you would find "blending" listed in that
entry.
• In this example, "blend" is the lemma, and "blending" is part of the lexeme. So when
you lemmatize a word, you are reducing it to its lemma.

Python Program for Lemmatization
import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer = WordNetLemmatizer()
string_for_lemmatizing = "The friends of DeSoto love scarves."
words = word_tokenize(string_for_lemmatizing)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
Output:
 lemmatizer.lemmatize("worst") o/p: 'worst'
 lemmatizer.lemmatize("worst", pos="a") o/p: 'bad'

6. Chunking
 Chunking allows you to identify phrases.
 A phrase is a word or group of words that works as a single unit to perform a
grammatical function. Noun phrases are built around a noun.
 Here are some examples:
 "A planet"
 "A tilting planet"
 "A swiftly tilting planet"
 Chunking makes use of POS tags to group words and apply chunk tags to those
groups. Chunks don't overlap, so one instance of a word can be in only one chunk at a
time.
 After getting a list of tuples of all the words in the quote along with their POS tags,
you need to define a chunk grammar in order to chunk.
 Note: A chunk grammar is a combination of rules on how sentences should be
chunked. It often uses regular expressions, or regexes.
 Create a chunk grammar with one regular expression rule:
 grammar = "NP: {<DT>?<JJ>*<NN>}"
 Create a chunk parser with this grammar:

Python program for chunking
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
quote = "It's a dangerous business, Frodo, going out your door."
words_quote = word_tokenize(quote)
print(words_quote)
nltk.download("averaged_perceptron_tagger")
tags = nltk.pos_tag(words_quote)
print(tags)
# Regular expression for Noun Phrase
grammar = "NP: {<DT>?<JJ>*<NN>}"
# Create a chunk parser with this grammar:
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(tags)
print(tree)

Output:
• ['It',"'s",'a','dangerous','business',',','Frodo',',','going','out','your','door','.']
• [('It','PRP'),("'s",'VBZ'),('a','DT'),('dangerous', 'JJ'),('business','NN'),(',',','),
('Frodo','NNP'),(',',','),('going','VBG'),('out','RP'),('your','PRP$'),('door','NN'), ('.', '.')]
• (S
• It/PRP
• 's/VBZ
• (NPa/DTdangerous/JJ business/NN)
• ,/, Frodo/NNP
• ,/, going/VBG
• out/RP
• your/PRP$
• (NPdoor/NN)
• ./.)

Tree Representation

7. Chinking
• Chinking is used together with chunking, but while chunking is used to include a
pattern, chinking is used to exclude a pattern.
Python program to perform chinking
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
quote = "It's a dangerous business, Frodo, going out your door."
words_quote = word_tokenize(quote)
print(words_quote)
nltk.download("averaged_perceptron_tagger")
tags = nltk.pos_tag(words_quote)
print(tags)
# Regular expression: chunk everything, then chink (exclude) adjectives
grammar = """
Chunk: {<.*>+}
       }<JJ>{"""
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(tags)
print(tree)
Output:
• ['It',"'s",'a','dangerous','business',',','Frodo',',','going','out','your','door','.']
• [('It','PRP'),("'s",'VBZ'),('a','DT'),('dangerous', 'JJ'),('business','NN'),(',',','),
('Frodo','NNP'),(',',','),('going','VBG'),('out','RP'),('your','PRP$'),('door','NN'), ('.', '.')]
22

• (S
• (ChunkIt/PRP's/VBZa/DT)
• dangerous/JJ
• (Chunkbusiness/NN ,/,Frodo/NNP,/,going/VBGout/RPyour/PRP$ door/NN./.))

Tree Representation

8. Using Named Entity Recognition (NER)
Some Examples of Named Entity Recognition (NER)

Python Program for Named Entity Recognition

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
quote = "It's a dangerous business, Frodo, going out your door."
words_quote = word_tokenize(quote)
print(words_quote)
nltk.download("averaged_perceptron_tagger")
tags = nltk.pos_tag(words_quote)
nltk.download("maxent_ne_chunker")
nltk.download("words")
tree = nltk.ne_chunk(tags)
print(tree)

Output
['It',"'s",'a','dangerous','business',',','Frodo',',','going','out','your','door','.']
1

(S
It/PRP
's/VBZ
a/DT
dangerous/JJ
business/NN
,/,
(PERSONFrodo/NNP)
,/,
going/VBG
out/RP
your/PRP$
door/NN
./.)

Note: If we use the following code, it simply specifies that it is a Named Entity without giving the
specific type:

• tree = nltk.ne_chunk(tags, binary=True)
• print(tree)
• print(tree)
Output

Natural Language Processing
Unit-II

Parsing in NLP is the process of determining the syntactic structure of a text by analysing its
constituent words based on an underlying grammar.
Example Grammar:

Then, the outcome of the parsing process would be a parse tree, where sentence is the root,
intermediate nodes such as noun_phrase, verb_phrase etc. have children - hence they are
called non-terminals - and finally, the leaves of the tree 'Tom', 'ate', 'an', 'apple' are called
terminals.
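
A small NLTK sketch consistent with this description (the notes do not list the grammar's productions, so the toy rules below are an assumption chosen to yield the parse tree described):

# A toy CFG for "Tom ate an apple" parsed with NLTK's chart parser.
import nltk

grammar = nltk.CFG.fromstring("""
sentence    -> noun_phrase verb_phrase
noun_phrase -> 'Tom' | determiner noun
verb_phrase -> verb noun_phrase
determiner  -> 'an'
noun        -> 'apple'
verb        -> 'ate'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("Tom ate an apple".split()):
    print(tree)   # prints the tree with sentence as root, non-terminals as inner nodes, words as leaves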

Parse Tree:

• A sentence is parsed by relating each word to other words in the sentence which depend on
it.
• The syntactic parsing of a sentence consists of finding the correct syntactic structure of that
sentence in the given formalism/grammar.
• Dependency grammar (DG) and phrase structure grammar (PSG) are two such formalisms.
• PSG breaks a sentence into constituents (phrases), which are then broken into
smaller constituents.
• It describes phrase and clause structure, for example: NP, PP, VP etc.
• DG: the syntactic structure consists of lexical items, linked by binary asymmetric relations
called dependencies.
• It is interested in grammatical relations between individual words.
• It does not propose a recursive structure, but rather a network of relations.
• These relations can also have labels.

• A treebank can be defined as a linguistically annotated corpus that includes some kind of
syntactic analysis over and above part-of-speech tagging.

Constituency tree vs Dependency tree
• Dependency structures explicitly represent
- Head-dependent relations (directed arcs)
- Functional categories (arc labels)
- Possibly some structural categories (POS)
• Phrase structures explicitly represent
- Phrases (non-terminal nodes)
- Structural categories (non-terminal labels)
- Possibly some functional categories (grammatical functions)

Defining candidate dependency trees for an input sentence
 Learning: scoring possible dependency graphs for a given sentence, usually by
factoring the graphs into their component arcs
 Parsing: searching for the highest scoring graph for a given sentence

Syntax:
• In NLP, the syntactic analysis of natural language input can vary from being very low-
level, such as simply tagging each word in the sentence with a part of speech (POS), to very
high level, such as full parsing.
• In syntactic parsing, ambiguity is a particularly difficult problem because the most
plausible analysis has to be chosen from an exponentially large number of alternative
analyses.
• From tagging to full parsing, algorithms that can handle such ambiguity have to be
carefully chosen.
• Here we explore the syntactic analysis methods from tagging to full parsing and the use of
supervised machine learning to deal with ambiguity.
Parsing Natural Language
• In a text-to-speech application, input sentences are to be converted to a spoken
output that should sound like it was spoken by a native speaker of the language.
• Example: He wanted to go for a drive in the country.
• There is a natural pause between the words drive and in in this sentence that reflects
an underlying hidden structure to the sentence.
• Parsing can provide a structural description that identifies such a break in the intonation.
• A simpler case: The cat who lives dangerously had nine lives.
• In this case, a text-to-speech system needs to know that the first instance of the
word lives is a verb and the second instance is a noun before it can begin to produce
the natural intonation for this sentence.
• This is an instance of the part-of-speech (POS) tagging problem where each word in
the sentence is assigned a most likely part of speech.
• Another motivation for parsing comes from the natural language task of summarization,
in which several documents about the same topic should be condensed down to a small
digest of information.
• Such a summary may be in response to a question that is answered in the set of
documents.

• In this case, a useful subtask is to compress an individual sentence so that only the
relevant portions of the sentence are included in the summary.
• For example: Beyond the basic level, the operations of the three products vary widely. →
The operations of the products vary.
• The elegant way to approach this task is to first parse the sentence to find the
various constituents, where we recursively partition the words in the sentence into
individual phrases such as a verb phrase or a noun phrase.

Treebanks: A Data-Driven Approach to Syntax
 Parsing recovers information that is not explicit in the input sentence.
 This implies that a parser requires some knowledge (syntactic rules) in addition to
the input sentence about the kind of syntactic analysis that should be produced as
output.
 One method to provide such knowledge to the parser is to write down a grammar of the
language – a set of rules of syntactic analysis, such as a CFG.
 In natural language, it is far too complex to simply list all the syntactic rules in terms
of a CFG.
 The second knowledge acquisition problem is that not only do we need to know the syntactic
rules for a particular language, but we also need to know which analysis is the most
plausible (probable) for a given input sentence.
 The construction of a treebank is a data-driven approach to syntax analysis that allows us to
address both of these knowledge acquisition bottlenecks in one stroke.
 A treebank is simply a collection of sentences (also called a corpus of text), where each
sentence is provided a complete syntax analysis.
 The syntactic analysis for each sentence has been judged by a human expert as the most
plausible analysis for that sentence.
 A lot of care is taken during the human annotation process to ensure that a
consistent treatment is provided across the treebank for related grammatical
phenomena.
 There is no set of syntactic rules or linguistic grammar explicitly provided by a
treebank, and typically there is no list of syntactic constructions provided explicitly in
a treebank.
 A detailed set of assumptions about the syntax is typically used as an annotation
guideline to help the human experts produce the single-most plausible syntactic
analysis for each sentence in the corpus.
 Treebanks provide a solution to the two kinds of knowledge acquisition bottlenecks.
 Treebanks solve the first knowledge acquisition problem of finding the grammar
underlying the syntax analysis because the syntactic analysis is directly given instead
of a grammar.
 In fact, the parser does not necessarily need any explicit grammar rules as long as it can
faithfully produce a syntax analysis for an input sentence.
 Treebanks solve the second knowledge acquisition problem as well.
 Because each sentence in a treebank has been given its most plausible (probable)
syntactic analysis, supervised machine learning methods can be used to learn a scoring
function over all possible syntax analyses.

 Two main approaches to syntax analysis are used to construct treebanks:
dependency graphs and phrase structure trees.
 These two representations are very closely related to each other and under some
assumptions, one representation can be converted to another.

 Dependency analysis is typically favoured for languages such as Czech and Turkish
that have free word order.
 Phrase structure analysis is often used to provide additional information about long-
distance dependencies, and is mostly used for languages like English and French.
 NLP is the capability of computer software to understand natural language.
 There are a variety of languages in the world.
 Each language has its own structure (SVO or SOV) -> called grammar -> has a certain set of
rules -> determines what is allowed and what is not allowed.
 English: SVO; other languages: SOV or OSV
I eat mango (Subject-Verb-Object)
 Grammar is defined as the rules for forming well-structured sentences.
 In a CFG, the left-hand side of every production belongs to VN, the set of non-terminals.
 Different Types of Grammar in NLP:
1. Context-Free Grammar (CFG)
2. Constituency Grammar (CG) or Phrase Structure Grammar (PSG)
3. Dependency Grammar (DG)

Representation of Syntactic Structure
Syntax Analysis Using Dependency Graphs
 The main philosophy behind dependency graphs is to connect a word - the head of
a phrase - with the dependents in that phrase.
 The notation connects a head with its dependents using directed (asymmetric) connections.
 Dependency graphs, just like phrase structure trees, are a representation that is consistent
with many different linguistic frameworks.
 The words in the input sentence are treated as the only vertices in the graph, which are linked
together by directed arcs representing syntactic dependencies.

 In dependency-based syntactic parsing, the task is to derive a syntactic structure for an
input sentence by identifying the syntactic head of each word in the sentence.
 This defines a dependency graph, where the nodes are the words of the input
sentence and arcs are the binary relations from head to dependent.

• The dependency tree analysis requires that each word depend on exactly one parent,
either another word or a dummy root symbol.
• By convention, in a dependency tree the index 0 is used to indicate the root symbol
and the directed arcs are drawn from the head word to the dependent word.
• The figure shows a dependency tree for a Czech sentence taken from the
Prague Dependency Treebank.

 Each node in the graph is a word, its part of speech and the position of the word in the
sentence. For example, [fakulte, N3, 7] is the seventh word in the sentence with POS
tag N3.
 The node [#, ZSB, 0] is the root node of the dependency tree.
 There are many variations of dependency syntactic analysis, but the basic textual format
for a dependency tree can be written in the following form,
where each dependent word specifies its head word in the sentence, and exactly
one word is dependent on the root of the sentence.
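
A small sketch of such a textual format using NLTK's DependencyGraph (the sentence, tags, head indices, and labels are an illustrative assumption; each line lists word, POS, head index, and relation, with head 0 marking the root):

# Build a dependency tree from a tab-separated "word  pos  head  relation" listing.
from nltk.parse import DependencyGraph

listing = """Tom\tNNP\t2\tnsubj
ate\tVBD\t0\tROOT
an\tDT\t4\tdet
apple\tNN\t2\tobj
"""

graph = DependencyGraph(listing)
tree = graph.tree()     # the word whose head index is 0 ("ate") becomes the root
tree.pretty_print()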

Syntax Analysis Using Phrase Structure Trees
 A phrase structure syntax analysis of a sentence derives from the traditional
sentence diagrams that partition a sentence into constituents; larger constituents
are formed by merging smaller ones.
 Phrase structure analyses also typically incorporate ideas from generative grammar
(from linguistics) to deal with displaced constituents or apparent long-distance
relationships between heads and constituents.
 A phrase structure tree can be viewed as implicitly having a predicate-argument
structure associated with it.
 A sentence includes a subject and a predicate. The subject is a noun phrase (NP) and the
predicate is a verb phrase.
 For example, consider the phrase structure analysis of "Mr. Baker seems especially sensitive",
taken from the Penn Treebank.
 The subject of the sentence is marked with the SBJ marker and the predicate of the sentence
is marked with the PRD marker.

• NNP: proper noun, singular; VBZ: verb, third person singular present; ADJP: adjective
phrase; RB: adverb; JJ: adjective
• The same sentence gets the following dependency tree analysis: some of the
information from the bracketing labels of the phrase structure analysis gets mapped
onto the labelled arcs of the dependency analysis.

• To explain some details of phrase structure analysis, we look at the Penn Treebank, a project
that annotated 40,000 sentences from the Wall Street Journal with phrase structure trees.

Parsing Algorithms
• Given an input sentence, a parser produces an output analysis of that sentence.
• Treebank parsers do not need to have an explicit grammar, but to keep the discussion of
parsing algorithms simple, we use a CFG.
• Consider a simple CFG G that can be used to derive strings such as "a and b or c" from the
start symbol N.

 An important concept for parsing is a derivation.
 For the input string "a and b or c", the following sequence of actions, separated by the
derivation symbol, represents a sequence of steps called a derivation.

 In this derivation, each line is called a sentential form.
 In the above derivation, we restricted ourselves to expanding only the rightmost
nonterminal in each sentential form.
 This method is called the rightmost derivation of the input using a CFG.
 This derivation sequence exactly corresponds to the construction of the
following parse tree from left to right, one symbol at a time.

 However, a unique derivation sequence is not guaranteed.
 There can be many different derivations.
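
A small NLTK sketch of this point: with an ambiguous grammar, a chart parser returns more than one tree for the same input (the toy grammar below is an assumption, not the grammar G from the notes):

# "I saw the man with a telescope" gets two parses: PP attached to the VP or to the NP.
import nltk

grammar = nltk.CFG.fromstring("""
S   -> NP VP
NP  -> 'I' | Det N | NP PP
VP  -> V NP | VP PP
PP  -> P NP
Det -> 'the' | 'a'
N   -> 'man' | 'telescope'
V   -> 'saw'
P   -> 'with'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("I saw the man with a telescope".split()):
    print(tree)   # one tree per attachment choice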

Edit distance is the minimum number of single-character edits needed to transform one string
into another; the most common version, the Levenshtein distance, counts insertions, deletions,
and substitutions as unit-cost operations. It is widely used in spell checking, DNA sequence
comparison, and general string similarity tasks, because it quantifies how “far apart” two
sequences are based on minimal edits needed for conversion.

The edit distance between strings quantifies dissimilarity as the minimal sequence of edits that
transforms one string into the other, with the classic Levenshtein variant allowing insertion,
deletion, and substitution of single characters, each with cost 1. This metric is foundational in
computational linguistics and bioinformatics for comparing words and sequences.

Levenshtein distance: allows insertions, deletions, substitutions; works for strings of different
lengths.

Damerau–Levenshtein distance: extends Levenshtein by also counting a transposition of two


adjacent characters as a single edit, capturing common typos.

Hamming distance: counts substitutions only and requires equal-length strings; it is a special
case of edit distance under restricted operations.

Applications include spell correction (choose dictionary words with smallest distance),
approximate string matching in search, plagiarism detection, and sequence alignment in
computational biology.

Example

Example: compute Levenshtein distance between “kitten” and “sitting”.

One optimal edit path: substitute k→s (“sitten”), substitute e→i (“sittin”), insert g at end
(“sitting”), totaling 3 edits.

Therefore, the edit distance d(“kitten”, “sitting”) = 3.

Another short example: “SEA” → “ATE”. One optimal sequence uses three substitutions: S→A (“AEA”),
E→T (“ATA”), A→E (“ATE”), so the Levenshtein distance is 3; concrete optimal edit
sequences can be read off the DP table.
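
A minimal dynamic-programming sketch of the computation described above:

# Levenshtein distance via the standard DP grid.
def levenshtein(source, target):
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i                      # deletions down to the empty string
    for j in range(cols):
        dist[0][j] = j                      # insertions up from the empty string
    for i in range(1, rows):
        for j in range(1, cols):
            substitution_cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,                      # deletion
                             dist[i][j - 1] + 1,                      # insertion
                             dist[i - 1][j - 1] + substitution_cost)  # substitution or match
    return dist[rows - 1][cols - 1]

print(levenshtein("kitten", "sitting"))   # 3
print(levenshtein("SEA", "ATE"))          # 3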

POS tagging (Part-of-Speech tagging) is the process of assigning a grammatical category (like
noun, verb, adjective, adverb, pronoun, preposition, conjunction, determiner, etc.) to each token
in a text based on its form and context, enabling downstream NLP tasks such as parsing, NER,
information extraction, and machine translation to operate on structured linguistic information
more accurately. POS tagging is the automatic labeling of each word in a sentence with its
appropriate part of speech, sometimes including subcategories like tense, number, or case
depending on the tagset used.

Common tagsets include the Penn Treebank (e.g., NN, VB, JJ) and Universal Dependencies,
providing standardized labels across corpora and languages.

Why it matters

It resolves lexical ambiguity by using context to decide the correct tag for words that can belong
to multiple categories (e.g., “book” as noun vs. verb), improving the quality of syntactic parsing
and semantic tasks. Accurate POS tags support applications such as information retrieval,
sentiment analysis, question answering, and translation by clarifying sentence structure and word
roles.

Workflow

Tokenization splits text into tokens; preprocessing (like lowercasing and punctuation handling)
prepares data for tagging. A model or rule system assigns tags considering the token, its
morphology, and surrounding context; outputs are evaluated and refined for accuracy.

Methods

Rule-based: Uses lexicons and hand-crafted context rules (e.g., if a word follows an article,
prefer noun), often in two stages: candidate tags from dictionary, then disambiguation rules.

Statistical/Stochastic: Learns probabilities from annotated corpora (e.g., HMMs, CRFs) to


choose tags that maximize likelihood given the context (a small training sketch follows this list).

Transformation-based: Starts with baseline tags and iteratively applies learned transformation
rules to fix errors, blending rule interpretability with data-driven learning.
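
A brief sketch of the statistical route with NLTK's n-gram taggers trained on its tagged Penn Treebank sample (the train/test split, backoff chain, and example sentence are illustrative choices):

# Train a bigram tagger with a unigram backoff and measure token-level accuracy.
import nltk
from nltk.corpus import treebank

nltk.download("treebank", quiet=True)
tagged_sents = treebank.tagged_sents()
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

unigram = nltk.UnigramTagger(train_sents)                  # most frequent tag per word
bigram = nltk.BigramTagger(train_sents, backoff=unigram)   # adds previous-tag context

correct = total = 0
for sent in test_sents:
    words = [w for w, _ in sent]
    for (_, gold), (_, predicted) in zip(sent, bigram.tag(words)):
        correct += gold == predicted
        total += 1
print(f"accuracy: {correct / total:.3f}")
print(bigram.tag("The can can hold water".split()))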

Examples

Penn Treebank includes tags like NN (singular noun), NNS (plural noun), VB (base verb), VBD
(past tense), JJ (adjective), RB (adverb), IN (preposition/subordinator), DT (determiner), PRP
(personal pronoun).

Universal Dependencies provides cross-lingual categories such as NOUN, VERB, ADJ, ADV,
ADP, PRON, DET, AUX, PART, SCONJ, CCONJ, NUM, PROPN, PUNCT.

Example (sentence-level)

Sentence: “The quick brown fox jumps over the lazy dog.”.

Tags (UD-style): The–DET, quick–ADJ, brown–ADJ, fox–NOUN, jumps–VERB, over–ADP,


the–DET, lazy–ADJ, dog–NOUN; or Penn Treebank: DT, JJ, JJ, NN, VBZ, IN, DT, JJ, NN.

Ambiguity handling

Context-sensitive tagging addresses words with multiple POS: “can” may be a modal (MD) in
“can go,” a noun (NN) in “a can,” or a verb (VB) in “can peaches,” resolved by surrounding
words and syntactic cues.

Performance is measured on annotated test sets using accuracy; precision/recall per tag can
diagnose specific weaknesses (e.g., adjectives vs. participles).

Applications

Parsing: POS tags guide phrase-structure or dependency parsers, improving syntactic analysis.

Information extraction and NER: Distinguishing proper nouns, verbs, and modifiers boosts entity
detection and relation extraction.

Example

Text: “Time flies like an arrow; fruit flies like a banana.” This classic ambiguity shows “flies” as
a verb in the first clause and a noun (plural) in the second; context and sequence modeling assign
VERB vs. NOUN appropriately, while “like” shifts between preposition and verb.

A statistical tagger leverages surrounding tags and learned transition/emission probabilities; a


rule-based tagger may apply patterns like “DET + NOUN” or “NOUN + VERB” to
disambiguate.

Limitations and challenges

Domain shift (e.g., social media slang, code-mixed text) degrades accuracy; adapting models or
fine-tuning on in-domain data helps.

Morphologically rich languages require more complex tagsets and features to capture agreement,
case, and derivation.
