Module 5

The document outlines the process of corpus creation in natural language processing (NLP), focusing on treebanks, which are annotated corpora that represent syntactic structures of sentences. It details the types of treebanks, their contents, and the steps involved in building them, including corpus collection, annotation, quality control, and release. Additionally, it discusses the applications of treebanks in various NLP tasks such as parsing, machine translation, and sentiment analysis.

Uploaded by

shanmukh899

Module -5 Corpus Creation

 Introduction and definition of Corpus in natural language processing

 Corpus Size, Balance, Representativeness, and Sampling, Data Capture and Copyright

 Corpus Markup and Annotation

 Multilingual Corpora, Multimodal Corpora

 Corpus Annotation Types, Morphosyntactic Annotation

 Treebanks: Syntactic, Semantic, and Discourse Annotation

 The Process of Building Treebanks, Applications

15/05/2025 Introduction to Natural Language Processing 1
Treebank

 In Natural Language Processing (NLP), a treebank is a corpus of sentences that have been parsed and annotated with their syntactic (and sometimes semantic) structure. The structure is usually represented as a tree, also known as a parse tree, which shows how the words in a sentence are grammatically related, hence the name "treebank".

 Treebanks are crucial for training and evaluating NLP models, particularly for tasks like parsing, part-of-speech tagging, and machine translation.

Corpus
Types of Treebanks

1. Constituency Treebanks:
   1. Also known as phrase structure treebanks.
   2. Represent sentences using constituent structures (e.g., noun phrases, verb phrases).
   3. Example: Penn Treebank (English).

2. Dependency Treebanks:
   1. Represent grammatical relationships between words using dependency relations (e.g., subject, object).
   2. More compact, since every node in the tree is a word, and often preferred in modern NLP models.
   3. Example: Universal Dependencies (UD).

What Do Treebanks Contain?

 Tokens: Words in the sentence.

 Part-of-speech (POS) tags: E.g., noun, verb, adjective.

 Syntactic structure: How words and phrases relate (either via constituency or dependency trees).

 (Optional) Semantic annotations: Roles like agent, patient, etc.

Treebanks can include three major types of annotations in NLP: Syntactic, Semantic, and
Discourse. Each captures a different level of linguistic information.
1. Syntactic Annotation
This refers to the grammatical structure of sentences—how words are grouped and related in phrases and clauses.
🔹 Example:
Sentence: "The cat sat on the mat."

Constituency Parse Tree

This shows how the sentence is structured in terms of phrases:

• S = Sentence
• NP = Noun Phrase → (The cat)
• VP = Verb Phrase → (sat on the mat)
• PP = Prepositional Phrase → (on the mat)
• DT = Determiner → (The)
• NN = Noun → (cat, mat)
• VBD = Verb (past tense) → (sat)
• IN = Preposition → (on)
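
The labels above are usually stored as a bracketed string, one pair of parentheses per constituent. The following is a minimal sketch of reading such a string in plain Python. The bracketing of the example sentence is written by hand from the labels above, and the helper names (`parse_bracketed`, `leaves`, `phrases`) are illustrative, not part of any treebank toolkit.

```python
import re

def parse_bracketed(text):
    """Parse '(S (NP (DT The) ...))' into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a constituent must start with '('"
        pos += 1                               # consume "("
        label = tokens[pos]                    # e.g. S, NP, DT
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                               # consume ")"
        return (label, children)

    return read_node()

def leaves(node):
    """Return the words of the tree in left-to-right order."""
    _, children = node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words

def phrases(node, wanted):
    """Collect the text of every constituent labelled `wanted`."""
    label, children = node
    found = [" ".join(leaves(node))] if label == wanted else []
    for child in children:
        if not isinstance(child, str):
            found.extend(phrases(child, wanted))
    return found

# Hand-written bracketing of the example sentence (final period omitted).
TREE = "(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
tree = parse_bracketed(TREE)
print(leaves(tree))         # ['The', 'cat', 'sat', 'on', 'the', 'mat']
print(phrases(tree, "NP"))  # ['The cat', 'the mat']
```

The nested tuples mirror the tree itself: each phrase label owns the words and sub-phrases beneath it, which is what distinguishes a constituency treebank from a flat list of POS tags.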

Dependency Tree

Relations:
• sat is the root (main verb).

• cat → subject of sat (nsubj)

• on → prepositional modifier of sat (prep)

• mat → object of on (pobj)

• The / the → determiners of cat and mat (det)

In labeled format: root(ROOT, sat), nsubj(sat, cat), det(cat, The), prep(sat, on), pobj(on, mat), det(mat, the)
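
A dependency analysis like the one above can be stored as one head index per word. The sketch below encodes the example sentence that way, prints the relations in `relation(head, dependent)` form, and checks that the heads really form a tree. The tuple layout and the function names are assumptions made for illustration, and the labels are the ones used on this slide.

```python
# Each token: (index, word, head index, relation). Index 0 is an artificial ROOT.
TOKENS = [
    (1, "The", 2, "det"),
    (2, "cat", 3, "nsubj"),
    (3, "sat", 0, "root"),
    (4, "on", 3, "prep"),
    (5, "the", 6, "det"),
    (6, "mat", 4, "pobj"),
]

def labeled_relations(tokens):
    """Format every arc as relation(head, dependent)."""
    words = {idx: word for idx, word, _, _ in tokens}
    words[0] = "ROOT"
    return [f"{rel}({words[head]}, {word})" for _, word, head, rel in tokens]

def is_well_formed(tokens):
    """True if there is exactly one root and every word reaches ROOT without a cycle."""
    heads = {idx: head for idx, _, head, _ in tokens}
    if sum(1 for head in heads.values() if head == 0) != 1:
        return False
    for start in heads:
        seen, node = set(), start
        while node != 0:
            if node in seen or node not in heads:
                return False        # cycle, or a head that is not a token
            seen.add(node)
            node = heads[node]
    return True

print(labeled_relations(TOKENS))
print(is_well_formed(TOKENS))  # True
```

A check of this kind is useful during annotation, because a single mistyped head index can silently turn a tree into a graph with a cycle.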

Examples of Syntactic Treebanks:

 Penn Treebank (English)

 Universal Dependencies (Many languages)

 TIGER Treebank (German)

2. Semantic Annotation
This includes information about meaning, such as predicate-argument structures, word senses, and named entities.

🔹 Example:
Sentence: "Mary gave John a book."
• Predicate: give
• Arguments:
   • A0 (giver): Mary
   • A1 (thing given): a book
   • A2 (recipient): John

This is often annotated using frameworks like PropBank or FrameNet.
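
One way to picture such an annotation is as a layer of labelled token spans stored alongside the sentence. The sketch below does this for "Mary gave John a book." The dictionary layout and the name `role_fillers` are hypothetical and do not reproduce the actual file format of PropBank or FrameNet.

```python
TOKENS = ["Mary", "gave", "John", "a", "book", "."]

# Spans are (start, end) token offsets, end exclusive.
ANNOTATION = {
    "lemma": "give",
    "predicate": (1, 2),
    "args": {
        "A0": (0, 1),   # giver
        "A1": (3, 5),   # thing given
        "A2": (2, 3),   # recipient
    },
}

def role_fillers(tokens, annotation):
    """Map each role label to the text of its token span."""
    return {
        label: " ".join(tokens[start:end])
        for label, (start, end) in annotation["args"].items()
    }

print(role_fillers(TOKENS, ANNOTATION))
# {'A0': 'Mary', 'A1': 'a book', 'A2': 'John'}
```

Keeping roles as spans over the original tokens means the semantic layer can be added without touching the syntactic annotation underneath it.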

Examples of Semantic Annotation Resources:

 PropBank: Adds semantic role labels on top of the Penn Treebank.

 FrameNet: Annotates text based on semantic frames (e.g., COMMERCIAL_TRANSACTION).

 VerbNet: A verb lexicon (rather than an annotated corpus) that provides verb classes with thematic roles.

3. Discourse Annotation
Discourse treebanks capture how sentences or clauses relate to each other in a larger context (coherence, topic shifts, discourse relations).

🔹 Example:
Text:

"Mary was hungry. She ate a sandwich."

Discourse relation: Cause

("She ate a sandwich" is caused by "Mary was hungry")

Annotated using RST (Rhetorical Structure Theory) or PDTB (Penn Discourse Treebank).

Examples of Discourse Treebanks:
 Penn Discourse Treebank (PDTB): Annotates discourse connectives and relations (e.g., contrast, causality).

 RST Discourse Treebank: Based on RST theory of text coherence.

The Process of Building Treebanks
Steps
1. Corpus Collection

2. Preprocessing

3. Syntactic Annotation

4. Semantic Annotation

5. Discourse Annotation

6. Quality Control

7. Formatting

8. Release

1. Corpus Collection (Raw Data Selection)
Start with a large, representative set of raw text from one or more domains (e.g., news articles, conversations, literature).

 Must reflect natural usage of the language.

 Should be diverse: genre, style, vocabulary.

Example sources:

 Wall Street Journal (Penn Treebank)

 Wikipedia or web text (Universal Dependencies)

 Spoken transcripts (Switchboard corpus)

2. Text Preprocessing
Prepare the raw text for annotation.

Tasks include:

 Tokenization: Splitting text into words/tokens.

 Sentence segmentation: Identifying sentence boundaries.

 POS tagging (optional): Pre-labeling parts of speech to guide parsers.
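
The first two tasks can be sketched with regular expressions. This is a deliberately naive illustration, not the tokenizer used by any particular treebank: it would wrongly split after abbreviations such as "Dr.", and it keeps contractions like "don't" as one token, whereas Penn Treebank conventions split them. The function names are illustrative.

```python
import re

def split_sentences(text):
    """Break after ., ! or ? when followed by whitespace and an uppercase letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [part for part in parts if part]

def tokenize(sentence):
    """Return word tokens and single punctuation marks, in order."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

text = "Mary was hungry. She ate a sandwich."
for sentence in split_sentences(text):
    print(tokenize(sentence))
# ['Mary', 'was', 'hungry', '.']
# ['She', 'ate', 'a', 'sandwich', '.']
```

Real projects fix these decisions in their annotation guidelines, because every later layer (POS tags, trees, roles) is attached to the tokens produced here.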

3. Syntactic Annotation
Add grammatical structure using either:

a) Constituency Parsing

Sentences are annotated as nested phrases (e.g., NP, VP).

b) Dependency Parsing

Annotates head-dependent word relationships (e.g., subject, object).

Who does this?

• Automatic parsers generate initial structures.

• Human linguists review and correct errors manually.

4. Semantic Annotation (Optional)
Add meaning-based labels:

 Semantic roles (PropBank/FrameNet)

 Named entities

 Word sense disambiguation

5. Discourse Annotation (Optional)

Analyze how sentences relate in a larger text:
 Annotate discourse connectives (e.g., "however", "because")
 Label discourse relations (e.g., contrast, cause)
Reference resources: RST-DT (RST Discourse Treebank), PDTB (Penn Discourse Treebank)

6. Quality Control
Ensures consistency and accuracy:
 Inter-annotator agreement (IAA) is measured.
 Disagreements resolved by expert annotators.
 Annotation guidelines are refined.
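
The slides do not say which agreement measure is used. One common choice for two annotators assigning categorical labels is Cohen's kappa, which corrects the raw agreement rate for agreement expected by chance. The sketch below computes it for two hypothetical annotators tagging the same six tokens; the tag sequences are invented for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Invented example: the annotators disagree only on the last token.
annotator_1 = ["DT", "NN", "VBD", "IN", "DT", "NN"]
annotator_2 = ["DT", "NN", "VBD", "IN", "DT", "VBD"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the raw agreement is 5/6 and the chance agreement is 0.25, so kappa is lower than the raw rate. For whole trees rather than per-token labels, projects often compare bracket or attachment overlap instead.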

7. Formatting and Conversion

Convert annotated data into standard formats like:
 Penn Treebank (PTB) bracketed format for constituency treebanks
 CoNLL-U format for dependency treebanks
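
CoNLL-U stores one token per line with ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), and `_` marks an empty field. The sketch below writes and re-reads "The cat sat on the mat." in that layout. The annotation is written by hand for illustration. Note that Universal Dependencies attaches "mat" directly to "sat" as `obl` and treats "on" as a `case` marker, unlike the `prep`/`pobj` labels used earlier in these slides. The reader handles plain token lines only, not multiword-token ranges or empty nodes.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")

# Hand-annotated example rows, one tuple per token.
ROWS = [
    ("1", "The", "the", "DET",   "DT",  "_", "2", "det",   "_", "_"),
    ("2", "cat", "cat", "NOUN",  "NN",  "_", "3", "nsubj", "_", "_"),
    ("3", "sat", "sit", "VERB",  "VBD", "_", "0", "root",  "_", "_"),
    ("4", "on",  "on",  "ADP",   "IN",  "_", "6", "case",  "_", "_"),
    ("5", "the", "the", "DET",   "DT",  "_", "6", "det",   "_", "_"),
    ("6", "mat", "mat", "NOUN",  "NN",  "_", "3", "obl",   "_", "SpaceAfter=No"),
    ("7", ".",   ".",   "PUNCT", ".",   "_", "3", "punct", "_", "_"),
]

def write_conllu(text, rows):
    """Serialize one sentence: a comment line, then tab-separated token lines."""
    lines = [f"# text = {text}"] + ["\t".join(row) for row in rows]
    return "\n".join(lines) + "\n"

def read_conllu(block):
    """Parse simple token lines back into dictionaries keyed by column name."""
    tokens = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue                      # skip blank and comment lines
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(values)}")
        token = dict(zip(FIELDS, values))
        token["id"], token["head"] = int(token["id"]), int(token["head"])
        tokens.append(token)
    return tokens

conllu = write_conllu("The cat sat on the mat.", ROWS)
tokens = read_conllu(conllu)
for tok in tokens:
    print(tok["id"], tok["form"], tok["head"], tok["deprel"])
```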

8. Release and Maintenance

 Treebanks are made publicly available for researchers.
 They may be updated over time with more data or corrections.

Example: Penn Treebank Pipeline
1. Source: Wall Street Journal text

2. Preprocessing and initial parsing

3. Human annotators corrected parse trees

4. Layers added: POS tags, syntax trees, PropBank roles

Applications of Treebanks in NLP

 Training and Evaluating Parsers

 Syntactic Analysis
 Semantic Role Labeling (SRL)
 Machine Translation
 Information Extraction (IE)
 Sentiment Analysis
 Text Summarization
 Question Answering Systems
 Discourse Analysis
 Linguistic Research & Language Modeling
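
For the first application in the list, evaluating a dependency parser against a treebank is commonly reported as two attachment scores: UAS (share of words with the correct head) and LAS (share with the correct head and the correct relation label). The sketch below computes both. The gold analysis is the example sentence from earlier in this module, and the predicted analysis is invented to contain two errors.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two lists of (head, relation) pairs, one per token."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty analyses of the same sentence")
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, predicted)) / n  # heads only
    las = sum(g == p for g, p in zip(gold, predicted)) / n        # head + label
    return uas, las

# "The cat sat on the mat": (head index, relation) per token, 0 = ROOT.
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "prep"), (6, "det"), (4, "pobj")]
# Invented parser output: wrong head for "on", wrong label for "mat".
pred = [(2, "det"), (3, "nsubj"), (0, "root"), (2, "prep"), (6, "det"), (4, "dobj")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.3f}, LAS = {las:.3f}")
```

LAS can never exceed UAS, since a word with the wrong head counts as an error for both scores.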

REFERENCES
Text Books:
1. Foundations & Text Preprocessing
"Speech and Language Processing" – Daniel Jurafsky & James H. Martin
 Covers fundamental NLP concepts, text processing, POS tagging, parsing, and machine learning models.
 Best for understanding both theoretical and practical aspects of NLP.
"Natural Language Processing with Python" (NLTK Book) – Steven Bird, Ewan Klein, & Edward Loper
 Great for hands-on coding, especially for text preprocessing, POS tagging, chunking, and named entity recognition.
 Uses Python with NLTK, making it ideal for implementing your programs.
2. Morphological Analysis & Language Models
"Introduction to Natural Language Processing" – Jacob Eisenstein
 Covers morphology, syntax, and probabilistic language models (N-grams, HMMs).
3. Syntactic & Semantic Processing
"Handbook of Natural Language Processing" – Nitin Indurkhya & Fred J. Damerau
 Covers advanced syntactic analysis, POS tagging, chunking, and information extraction techniques.
"Foundations of Statistical Natural Language Processing" – Christopher Manning & Hinrich Schütze
 Focuses on statistical approaches to NLP, including POS tagging, chunking, and language modeling.
4. Deep Learning & NLP Applications
"Deep Learning for Natural Language Processing" – Palash Goyal, Sumit Pandey, & Karan Jain
 Best for modern NLP applications like Named Entity Recognition, Transformers, and chatbot development.
"Natural Language Processing with Transformers" – Lewis Tunstall, Leandro von Werra, & Thomas Wolf
 Focuses on deep learning and Transformer-based models (BERT, GPT, etc.).
THANK YOU
