Module 5

The document outlines the process of corpus creation in natural language processing (NLP), focusing on treebanks, which are annotated corpora that represent syntactic structures of sentences. It details the types of treebanks, their contents, and the steps involved in building them, including corpus collection, annotation, quality control, and release. Additionally, it discusses the applications of treebanks in various NLP tasks such as parsing, machine translation, and sentiment analysis.

Uploaded by

shanmukh899

Module 5: Corpus Creation

 Introduction and definition of Corpus in natural language processing

 Corpus Size, Balance, Representativeness, and Sampling, Data Capture and Copyright

 Corpus Markup and Annotation

 Multilingual Corpora, Multimodal Corpora

 Corpus Annotation Types, Morphosyntactic Annotation

 Treebanks: Syntactic, Semantic, and Discourse Annotation

 The Process of Building Treebanks, Application

15/05/2025 Introduction to Natural Language Processing 1
Treebank

 In Natural Language Processing (NLP), a treebank is a corpus of text that has been annotated with syntactic structure, often represented as a tree. These trees, also known as parse trees, show how the words in a sentence are grammatically related.

 Treebanks are crucial for training and evaluating NLP models, particularly for tasks like parsing, part-of-speech tagging, and machine translation.

 A treebank is a collection of sentences that have been parsed to show their grammatical structure. Each sentence is annotated with its syntactic (and sometimes semantic) structure, usually in the form of a tree diagram, hence the name "treebank".

Types of Treebanks

1. Constituency Treebanks:
   1. Also known as phrase structure treebanks.
   2. Represent sentences using constituent structures (e.g., noun phrases, verb phrases).
   3. Example: Penn Treebank (English).

2. Dependency Treebanks:
   1. Represent grammatical relationships between words using dependency relations (e.g., subject, object).
   2. More compact than constituency trees (one head and one relation per word) and often preferred in modern NLP models.
   3. Example: Universal Dependencies (UD).

What Do Treebanks Contain?

 Tokens: Words in the sentence.

 Part-of-speech (POS) tags: E.g., noun, verb, adjective.

 Syntactic structure: How words and phrases relate (either via constituency or dependency trees).

 (Optional) Semantic annotations: Roles like agent, patient, etc.
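
The four kinds of content listed above can be pictured as fields of one record per token. The following is a minimal sketch, not the layout of any particular treebank; the class name `TreebankToken`, its field names, and the two-word sentence are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TreebankToken:
    """One annotated token (hypothetical layout, for illustration only)."""
    index: int                            # 1-based position in the sentence
    form: str                             # the token as written
    pos: str                              # part-of-speech tag
    head: int                             # index of the syntactic head (0 = root)
    relation: str                         # dependency relation to the head
    semantic_role: Optional[str] = None   # optional semantic annotation


# "Mary ate": a two-token sentence annotated on every layer.
sentence = [
    TreebankToken(1, "Mary", "NNP", 2, "nsubj", semantic_role="agent"),
    TreebankToken(2, "ate", "VBD", 0, "root"),
]

tokens = [t.form for t in sentence]
pos_tags = [t.pos for t in sentence]
print(tokens, pos_tags)
```

Keeping the semantic role optional mirrors the slide: every token carries a form, a tag, and a place in the syntactic structure, while semantic labels are present only where that layer has been annotated.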

Treebanks can include three major types of annotation: syntactic, semantic, and discourse. Each captures a different level of linguistic information.
1. Syntactic Annotation
This refers to the grammatical structure of sentences—how words are grouped and related in phrases and clauses.
🔹 Example:
Sentence: "The cat sat on the mat."

1. Top Half – Constituency Parse Tree

This shows how the sentence is structured in terms of phrases:

• S = Sentence
• NP = Noun Phrase → (The cat)
• VP = Verb Phrase → (sat on the mat)
• PP = Prepositional Phrase → (on the mat)
• DT = Determiner → (The)
• NN = Noun → (cat, mat)
• VBD = Verb (past tense) → (sat)
• IN = Preposition → (on)
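
Constituency trees are commonly stored as bracketed strings. The sketch below writes the tree for the example sentence in that style, using the labels listed above, and parses it with a small hand-written reader. The function names (`parse_brackets`, `leaves`, `phrase`) are assumptions for illustration, and the final period is left out to match the label list.

```python
import re


def parse_brackets(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with '('"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested phrase
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read_node()


def leaves(node):
    """Return the words under a node, left to right."""
    _, children = node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


def phrase(node, target):
    """Return the words under the first node labeled `target`, or None."""
    label, children = node
    if label == target:
        return leaves(node)
    for child in children:
        if not isinstance(child, str):
            found = phrase(child, target)
            if found is not None:
                return found
    return None


tree = parse_brackets(
    "(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"
)
print(phrase(tree, "NP"))  # words of the subject noun phrase
print(phrase(tree, "VP"))  # words of the verb phrase
print(phrase(tree, "PP"))  # words of the prepositional phrase
```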

2. Bottom Half – Dependency Tree

Relations:
• sat is the root (main verb).

• cat → subject of sat (nsubj)

• on → prepositional modifier of sat (prep)

• mat → object of on (pobj)

• the → determiners for cat and mat (det)

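In labeled format, written as relation(head, dependent):

det(cat, The)
nsubj(sat, cat)
root(ROOT, sat)
prep(sat, on)
det(mat, the)
pobj(on, mat)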
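
Because every word has exactly one head, the relations above can be stored as head indices and checked mechanically. The sketch below encodes the example sentence that way and tests that the arcs form a well-formed tree (exactly one root, no cycles). The tuple layout and the function name `is_tree` are assumptions for illustration.

```python
# (index, word, head index, relation); head 0 marks the root.
arcs = [
    (1, "The", 2, "det"),
    (2, "cat", 3, "nsubj"),
    (3, "sat", 0, "root"),
    (4, "on", 3, "prep"),
    (5, "the", 6, "det"),
    (6, "mat", 4, "pobj"),
]


def is_tree(arcs):
    """True if the arcs have exactly one root and every word reaches it."""
    heads = {index: head for index, _, head, _ in arcs}
    roots = [index for index, head in heads.items() if head == 0]
    if len(roots) != 1:
        return False
    for start in heads:
        seen = set()
        node = start
        while node != 0:
            # A repeated node means a cycle; an unknown node is a bad head.
            if node in seen or node not in heads:
                return False
            seen.add(node)
            node = heads[node]
    return True


print(is_tree(arcs))
```

A check like this is useful during annotation, since a head index that is mistyped by hand can silently create a cycle or a second root.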

Examples of Syntactic Treebanks:

 Penn Treebank (English)

 Universal Dependencies (Many languages)

 TIGER Treebank (German)

2. Semantic Annotation
This includes information about meaning, such as predicate-argument structures, word senses, and named entities.

🔹 Example:
Sentence: "Mary gave John a book."
• Predicate: give
• Arguments:
  • A0 (giver): Mary
  • A1 (thing given): a book
  • A2 (recipient): John
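
One way to picture this annotation is as labeled spans over the token sequence, which keeps the roles independent of word order (the recipient precedes the thing given in the sentence). This is a minimal sketch; the dictionary layout and the helper `span_text` are assumptions for illustration, not the file format of any particular resource.

```python
tokens = ["Mary", "gave", "John", "a", "book"]

# Spans are (start, end) token offsets, end exclusive.
frame = {
    "predicate": {"lemma": "give", "span": (1, 2)},
    "arguments": {
        "A0": (0, 1),   # giver
        "A1": (3, 5),   # thing given
        "A2": (2, 3),   # recipient
    },
}


def span_text(tokens, span):
    """Return the words covered by a (start, end) span as one string."""
    start, end = span
    return " ".join(tokens[start:end])


roles = {role: span_text(tokens, span) for role, span in frame["arguments"].items()}
print(roles)
```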

Semantic roles like these are often annotated using frameworks such as PropBank or FrameNet.

Examples of Semantic Annotation Resources:

 PropBank: Adds semantic role labels on top of the Penn Treebank.

 FrameNet: Annotates text based on semantic frames (e.g., COMMERCIAL_TRANSACTION).

 VerbNet: A verb lexicon (not itself a treebank) that provides verb classes with thematic roles.

3. Discourse Annotation
Discourse treebanks capture how sentences or clauses relate to each other in a larger context (coherence, topic shifts, discourse relations).

🔹 Example:
Text: "Mary was hungry. She ate a sandwich."

Discourse relation: Cause ("She ate a sandwich" is caused by "Mary was hungry")

Such relations are annotated using frameworks like RST (Rhetorical Structure Theory) or PDTB (Penn Discourse Treebank).
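
A discourse relation can be pictured as a record linking two text spans, with an optional connective word. The sketch below is only an illustration; the class name `DiscourseRelation`, its fields, and the second example with the connective "so" are assumptions, and neither RST nor PDTB stores its annotations in this form.

```python
from typing import NamedTuple, Optional


class DiscourseRelation(NamedTuple):
    """A relation between two spans (hypothetical layout)."""
    arg1: str
    arg2: str
    relation: str
    connective: Optional[str] = None   # None when no connective word is present


# The example from the slide: the relation is left implicit in the text.
implicit = DiscourseRelation(
    arg1="Mary was hungry.",
    arg2="She ate a sandwich.",
    relation="Cause",
)

# A made-up variant in which a connective signals the same relation.
explicit = DiscourseRelation(
    arg1="Mary was hungry",
    arg2="she ate a sandwich",
    relation="Cause",
    connective="so",
)

print(implicit.relation, implicit.connective)
print(explicit.relation, explicit.connective)
```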

Examples of Discourse Treebanks:
 Penn Discourse Treebank (PDTB): Annotates discourse connectives and relations (e.g., contrast, causality).

 RST Discourse Treebank: Based on RST theory of text coherence.

The Process of Building Treebanks
Steps
1. Corpus Collection

2. Preprocessing

3. Syntactic Annotation

4. Semantic Annotation

5. Discourse Annotation

6. Quality Control

7. Formatting

8. Release

1. Corpus Collection (Raw Data Selection)
Start with a large, representative set of raw text from one or more domains (e.g., news articles, conversations, literature).

 Must reflect natural usage of the language.

 Should be diverse: genre, style, vocabulary.

Example sources:

 Wall Street Journal (Penn Treebank)

 Wikipedia or web text (Universal Dependencies)

 Spoken transcripts (Switchboard corpus)

2. Text Preprocessing
Prepare the raw text for annotation.

Tasks include:

 Tokenization: Splitting text into words/tokens.

 Sentence segmentation: Identifying sentence boundaries.

 POS tagging (optional): Pre-labeling parts of speech to guide parsers.
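
The first two preprocessing tasks can be sketched with regular expressions. This is a deliberately naive illustration, not how any treebank was actually preprocessed: the function names are assumptions, the sentence rule breaks on abbreviations such as "Dr.", and real treebank tokenization follows detailed written guidelines.

```python
import re


def split_sentences(text):
    """Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Split into words (allowing internal apostrophes) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)


text = "Mary was hungry. She ate a sandwich."
sentences = split_sentences(text)
tokens = [tokenize(sentence) for sentence in sentences]

for sentence_tokens in tokens:
    print(sentence_tokens)
```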

3. Syntactic Annotation
Add grammatical structure using either:

a) Constituency Parsing

Sentences are annotated as nested phrases (e.g., NP, VP).

b) Dependency Parsing

Annotates head-dependent word relationships (e.g., subject, object).

Who does this?

• Automatic parsers generate initial structures.

• Human linguists review and correct errors manually.

4. Semantic Annotation (Optional)
Add meaning-based labels:

 Semantic roles (PropBank/FrameNet)

 Named entities

 Word sense disambiguation

5. Discourse Annotation (Optional)
Analyze how sentences relate in a larger text:
 Annotate discourse connectives (e.g., "however", "because")
 Label discourse relations (e.g., contrast, cause)
Annotation schemes and reference corpora: RST-DT, PDTB

6. Quality Control
Ensures consistency and accuracy:
 Inter-annotator agreement (IAA) is measured.
 Disagreements resolved by expert annotators.
 Annotation guidelines are refined.
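
The slide does not say which agreement measure is used. One common choice for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance: kappa = (observed - expected) / (1 - expected). The sketch below computes it for made-up POS labels; the function name and the label sequences are assumptions for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used the same single label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical tags from two annotators for the same six tokens.
annotator_1 = ["NN", "VBD", "IN", "DT", "NN", "NN"]
annotator_2 = ["NN", "VBD", "IN", "DT", "NN", "VBD"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 5 of 6 tokens (observed = 5/6) and chance agreement is 10/36, so kappa is 10/13. Items on which the annotators disagree are the ones passed to expert adjudication.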

7. Formatting and Conversion
Convert annotated data into standard formats like:
 Penn Treebank (PTB) bracketed format for constituency treebanks
 CoNLL-U format for dependency treebanks
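
CoNLL-U stores one token per line in ten tab-separated columns, with `_` for empty values and a blank line after each sentence. The sketch below writes and reads the example sentence in that layout; the helper names `to_conllu` and `read_conllu` are assumptions for illustration. Note that it uses Universal Dependencies labels, which attach "on" to "mat" as `case` and "mat" to "sat" as `obl`, rather than the prep/pobj labels used earlier in the slides.

```python
FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS", "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def to_conllu(rows):
    """Serialize token dictionaries as one CoNLL-U sentence block."""
    lines = []
    for row in rows:
        values = [str(row.get(name, "_")) for name in FIELDS]  # "_" = unspecified
        lines.append("\t".join(values))
    return "\n".join(lines) + "\n\n"  # a blank line ends the sentence


def read_conllu(block):
    """Parse a sentence block back into dictionaries (all values are strings)."""
    rows = []
    for line in block.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        rows.append(dict(zip(FIELDS, line.split("\t"))))
    return rows


rows = [
    {"ID": 1, "FORM": "The", "LEMMA": "the", "UPOS": "DET", "HEAD": 2, "DEPREL": "det"},
    {"ID": 2, "FORM": "cat", "LEMMA": "cat", "UPOS": "NOUN", "HEAD": 3, "DEPREL": "nsubj"},
    {"ID": 3, "FORM": "sat", "LEMMA": "sit", "UPOS": "VERB", "HEAD": 0, "DEPREL": "root"},
    {"ID": 4, "FORM": "on", "LEMMA": "on", "UPOS": "ADP", "HEAD": 6, "DEPREL": "case"},
    {"ID": 5, "FORM": "the", "LEMMA": "the", "UPOS": "DET", "HEAD": 6, "DEPREL": "det"},
    {"ID": 6, "FORM": "mat", "LEMMA": "mat", "UPOS": "NOUN", "HEAD": 3, "DEPREL": "obl"},
]

block = to_conllu(rows)
parsed = read_conllu(block)
print(block)
```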

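8. Release and Maintenance
 Treebanks are released to researchers, either openly (e.g., Universal Dependencies) or under license (e.g., the Penn Treebank through the Linguistic Data Consortium).
 They may be updated over time with more data or corrections.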

Example: Penn Treebank Pipeline
1. Source: Wall Street Journal text

2. Preprocessing and initial parsing

3. Human annotators corrected parse trees

4. Layers added: POS tags, syntax trees, PropBank roles

Applications of Treebanks in NLP

 Training and Evaluating Parsers


 Syntactic Analysis
 Semantic Role Labeling (SRL)
 Machine Translation
 Information Extraction (IE)
 Sentiment Analysis
 Text Summarization
 Question Answering Systems
 Discourse Analysis
 Linguistic Research & Language Modeling
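
For the first application in the list, a dependency parser is typically scored against treebank annotations with attachment scores: the unlabeled score (UAS) counts tokens with the correct head, and the labeled score (LAS) counts tokens with both the correct head and the correct relation. The sketch below is a minimal illustration; the function name and the predicted parse are made up.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two equal-length lists of (head, relation) pairs."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty analyses of the same sentence")
    n = len(gold)
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    full_hits = sum(g == p for g, p in zip(gold, predicted))
    return head_hits / n, full_hits / n


# Gold annotation for "The cat sat on the mat" as (head, relation) per token.
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "prep"), (6, "det"), (4, "pobj")]

# Hypothetical parser output: "on" has the wrong head, "mat" the wrong label.
predicted = [(2, "det"), (3, "nsubj"), (0, "root"), (2, "prep"), (6, "det"), (4, "dobj")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS = {uas:.3f}, LAS = {las:.3f}")
```

In this example five of six heads are correct (UAS = 5/6), and four of six tokens have both head and label correct (LAS = 4/6).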

THANK YOU
