COM 6115:
Text Processing
Lecture 1-1: Introduction to Text Processing
Based on slides by Varvara Papazoglou, Prof. Rob Gaizauskas and Dr Mark Hepple
Dr Xi Wang
Course Director for MSc Computer Science
with Speech & Natural Language Processing
Lecture Overview
Part A
• Information about the module
• What is Text Processing
• Examples of simple Text Processing tasks
• Example applications
Information about the Module
Module Details
Lectures: Mondays, 11:00 – 13:00, Lecture Theatre 4, The Diamond
Labs: Fridays, 13:00 – 15:00, Computer Room 06, The Diamond
Module Details
Taught by
• Dr Xi Wang ([email protected]): Weeks 1 – 3, 9 & 10
• Dr Zheng Yuan ([email protected]): Weeks 4 – 7, 11
We are both looking for PhD students and welcome PhD applications in the area of NLP and related fields.
Feel free to reach out if you are interested.
Related topics:
• Supervised by Xi: Conversational AI, Retrieval Augmented Generation, User Simulation, as well as topics related to NLP and IR.
• Supervised by Zheng: Educational NLP, Multilingual NLP, Low-resource NLP, as well as topics related to NLP and machine learning.
Module Details
Taught by
• Dr Xi Wang ([email protected]): Weeks 1 – 3, 9 & 10
• Dr Zheng Yuan ([email protected]): Weeks 4 – 7, 11
Guest Lecture: Week 8
Module Details
GTAs
Mingzi, Tyler, Yanwen, Xingyu
Module Details
All materials for COM6115 are (or will be) available on Blackboard:
• module details
• lecture slides + Encore recordings
• lab sheets and Python material (intro & setup help)
• assignment
• links to past exam papers
• announcements
• discussion board:
o Lectures
o Exams
o General questions
o Labs
o Assignments
Lab Sessions
Weeks 1 – 2:
Python Intro, for those with limited (Python) programming
background.
Weeks 3 – 11:
Specific Text Processing Topics, including some core skills for the
module assignment.
Assessment
30% Assignment
● Release Date: 24th of October, 2025 (Week 4)
● Submission Date: 20th of November, 2025, 15:00 (Week 8)
● Please do not attempt to plagiarise!
(more info: https://www.sheffield.ac.uk/new-students/unfair-means)
70% Final exam
● During the exam period in January
University Guidance on Use of Generative AI
● https://www.sheffield.ac.uk/academic-skills/using-generative-ai
● https://students.sheffield.ac.uk/digital-learning/ai-principles
● Google Gemini is the UoS institutionally approved GenAI tool.
○ Access using your Uni credentials
Module Overview
Module aims
• To develop an understanding of the fundamentals of how digitally stored text is represented and processed in a computer.
• To develop the ability to construct and refine systems for applying text processing techniques.
• To develop an understanding of the basic problems and principles underlying text processing applications.
Prerequisites
• Interest in language and basic knowledge of English.
• Some mathematical basics, e.g. basic probability theory.
• Programming skills, including basic familiarity with Python.
Module Outline
• Text Encoding
• Text Compression
• Information Retrieval
• Lexical Semantics
• Sentiment Analysis
• Information Extraction
• Introduction to Deep Learning for Text Processing
Wooclap -- Programme Experience
What is Text Processing?
Why Text Processing?
- What is ‘text’?
Something that can be ‘read’; a symbolic arrangement of
graphemes that conveys some meaning.
“In literary theory, a text is any object that can be "read", whether this object is a work of literature, a street sign, an arrangement of buildings on a city block, or styles of clothing. It is a set of signs that is available to be reconstructed by a reader (or observer) if sufficient interpretants are available. This set of signs is considered in terms of the informative message's content, rather than in terms of its physical form or the medium in which it is represented.”
-- “Text (literary theory)”, Wikipedia
- Is speech text?
If transcribed, yes.
Why Text Processing?
Text Processing concerns
The creation, storage and access of text in digital form by computer.
Motivation
Abundance of digital text collections/streams (the web, social
media, digital archives, etc.)
● Many different languages
● Many different types of text
○ news, literature, academic publications, social media
posts, etc., each with sub-types
● Analysis of existing text and generation of new text
○ e.g. textual expression of sensor data/unstructured data
Convergence with NLP
Goal
• NLP: understanding and interpreting language meaning.
• Text Processing: performing simpler engineering tasks (e.g. cleaning, formatting, extracting patterns).
• Similarity: both aim to process and handle text data.
Complexity
• NLP: high; requires linguistic and semantic knowledge.
• Text Processing: lower; often rule-based or statistical.
• Similarity: both may use shared techniques like tokenisation or parsing.
Examples
• NLP: machine translation, question answering, sentiment analysis.
• Text Processing: removing stop words, tokenisation, stemming.
• Similarity: both involve transforming text into a form suitable for computation.
Focus
• NLP: meaning and context of language.
• Text Processing: structure and surface form of text.
• Similarity: both deal with language as input to computational methods.
Common Text Processing/NLP Applications
• Information Retrieval (IR)
• Information Extraction (IE)
• Text Categorisation
• Automatic Summarisation
• Natural Language (NL) Generation
• Machine Translation (MT)
• Text Compression
Some Simple Text Processing Tasks
• Sentence splitting
• Normalisation
o case normalisation
o punctuation removal
• Tokenisation
• Stop-word removal
• Lemmatisation and stemming
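As a taster, here is a minimal Python sketch of a few of these steps (case normalisation, punctuation removal, stop-word removal and a crude suffix-stripping "stemmer"); the stop-word list and suffix rules are invented toy examples, not the tools used in the labs:

```python
import re

# Toy stop-word list and suffix rules -- purely illustrative choices.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in", "by", "are"}
SUFFIXES = ("ing", "ed", "s")

def normalise(text):
    text = text.lower()                     # case normalisation
    return re.sub(r"[^\w\s]", "", text)     # punctuation removal

def crude_stem(token):
    # Strip the first matching suffix (a toy stand-in for a real stemmer).
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "The proteins are activated by IL2."
tokens = normalise(text).split()                     # whitespace tokenisation
tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
print([crude_stem(t) for t in tokens])               # ['protein', 'activat', 'il2']
```

Note how the crude stemmer over-strips "activated" to "activat"; real stemmers and lemmatisers make more careful choices, as we will see later.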
Example Task 1: Sentence Splitting
Text Snippet
Current immunosuppression protocols to prevent lung transplant rejection
reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1 T-
cell pro-inflammatory cytokine production is important in host defense against
bacterial infection in the lungs. Excessive immunosuppression of Th1 T-cell pro-
inflammatory cytokines leaves patients susceptible to infection.
Any Ideas?
Example Task 1: Sentence Splitting
Text Snippet
Current immunosuppression protocols to prevent lung transplant rejection
reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1 T-
cell pro-inflammatory cytokine production is important in host defense against
bacterial infection in the lungs. Excessive immunosuppression of Th1 T-cell pro-
inflammatory cytokines leaves patients susceptible to infection.
Heuristic rule:
sentence boundary = period + space(s) + capital letter
Example Task 1: Sentence Splitting
Text Snippet
Current immunosuppression protocols to prevent lung transplant rejection
reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1 T-
cell pro-inflammatory cytokine production is important in host defense against
bacterial infection in the lungs. Excessive immunosuppression of Th1 T-cell pro-
inflammatory cytokines leaves patients susceptible to infection.
Current immunosuppression protocols to prevent lung transplant rejection
reduce pro-inflammatory and T-helper type 1 (Th1) cytokines.
However, Th1 T-cell pro-inflammatory cytokine production is important in host
defense against bacterial infection in the lungs.
Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves
patients susceptible to infection.
Example Task 1: Sentence Splitting
Heuristic rule:
sentence boundary = period + space(s) + capital letter
Is this the perfect solution? It doesn't always work:
IL-33 is known to induce the production of Th2-associated
cytokines (e.g. IL-5 and IL-13).
IL-33 is known to induce the production of Th2-associated
cytokines (e.g.
IL-5 and IL-13).
Example Task 1: Sentence Splitting
IL-33 is known to induce the production of Th2-associated
cytokines (e.g. IL-5 and IL-13).
IL-33 is known to induce the production of Th2-associated
cytokines (e.g.
IL-5 and IL-13).
Two Solutions
• Add more rules to handle exceptions
• Machine Learning
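A minimal Python sketch of the heuristic rule above, showing both a case where it works and the "e.g." failure case. The regex is one possible encoding of the rule, not the module's reference implementation:

```python
import re

# Heuristic: a sentence boundary is a period followed by whitespace
# and a capital letter. Sketch only; real splitters handle many more cases.
def naive_split(text):
    return re.split(r"(?<=\.)\s+(?=[A-Z])", text)

ok = "Protocols reduce Th1 cytokines. However, cytokine production matters."
print(naive_split(ok))   # two sentences, as intended

bad = "IL-33 induces Th2-associated cytokines (e.g. IL-5 and IL-13)."
print(naive_split(bad))  # wrongly splits after "e.g." -- the abbreviation fools the rule
```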
Example Task 2: Tokenisation
Example Sentence
The protein is activated by IL2.
The protein is activated by IL2 .
• Convert a sentence into a sequence of tokens.
• Why tokenise?
o Because we do not want to treat a sentence as a raw sequence of characters, nor as a single indivisible unit!
Example Task 2: Tokenisation
Example Sentence
The protein is activated by IL2.
The protein is activated by IL2 .
• Can you think of any problems?
o There are languages (e.g. German, Finnish) with many long compound words (e.g. 'ice cream' written as a single word). Should we decompound?
o How to tokenise Chinese text? (no spaces!)
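A sketch tokeniser for space-delimited languages like English (the regex is an invented illustration); note that it would do nothing useful for Chinese, where word boundaries are not marked by spaces:

```python
import re

# Keep alphanumeric runs (allowing internal hyphens) as tokens and
# split off each punctuation mark as its own token. Illustrative only.
def tokenise(sentence):
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)

print(tokenise("The protein is activated by IL2."))
# ['The', 'protein', 'is', 'activated', 'by', 'IL2', '.']
```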
Example
Applications
Application 1: Information Retrieval
Information Retrieval (IR)
concerned with developing
algorithms and models to retrieve
relevant documents from text
collections for a given query.
Application 1: Information Retrieval
● Text collection = a set of ‘documents’
○ originally, few hundred/thousand electronically stored documents (e.g.
journal paper abstracts)
○ now, billions of pages on the WWW
● Query: user indication of what they want
○ commonly, just 2 or 3 words — good basis for retrieval?
● How to decide what docs are relevant?
○ how to decide if one method works better than another?
● Much work is still left to the user:
○ task of selecting which of returned docs are relevant
○ task of extracting the relevant information
○ task of using retrieved documents -- retrieval augmented generation
Application 2: Information Extraction
● IR contrasts with Information Extraction (IE).
○ IE is about automatically extracting information from unstructured or semi-
structured data.
○ IR is about finding the most relevant or best matching documents given a
query.
● IE recognises specific information in documents, making it available to subsequent
automated processes
○ type of information to be extracted must be decided in advance.
■ entities (e.g. organisations, persons, locations) and
■ relations (e.g. person IS-EMPLOYED-BY organisation)
● Information recognised can be:
○ extracted and stored in a structured record, e.g. database system
■ sometimes called “knowledge base population”
○ stored in a document itself as embedded mark-up
Application 3: Text Categorisation
● Task: automatically assign texts to different categories
● Examples:
○ email — assign to categories:
junk vs. non-junk
○ newspaper articles — assign to categories:
sport vs. politics vs. other
○ product reviews — assign to sentiment categories:
positive vs. negative vs. neutral
● Usually: given a set of documents that are representative of each category,
○ use statistical/probabilistic/neural computational methods to learn a model that predicts which category a new, previously unseen document belongs to.
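A minimal sketch of this learn-then-predict pattern using scikit-learn's bag-of-words features and a Naive Bayes classifier; the tool choice and the tiny training set are assumptions for illustration, not the module's prescribed approach:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented training set for the junk vs. non-junk example above.
train_texts = ["win a free prize now", "cheap pills special offer",
               "meeting agenda attached", "lab session moved to Friday"]
train_labels = ["junk", "junk", "non-junk", "non-junk"]

vec = CountVectorizer()                     # bag-of-words features
X = vec.fit_transform(train_texts)
clf = MultinomialNB().fit(X, train_labels)  # learn a probabilistic model

print(clf.predict(vec.transform(["free prize offer"])))  # ['junk']
```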
Application 4: Text Summarisation
Text summarisation: To automatically
generate a concise and digestible
summary of a document (or multiple
documents).
Types of summarisation:
● Extractive: use a proportion of
sentences/phrases/words (25%?)
from the input document.
● Abstractive: create a paraphrased,
restructured, shortened version of
the input document.
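A toy extractive sketch: score each sentence by the average document-wide frequency of its words and keep the top scorer. Frequency scoring is just one simple heuristic, assumed here for illustration:

```python
from collections import Counter

# Score sentences by the average corpus frequency of their words. Toy sketch.
def extractive_summary(sentences, n=1):
    freq = Counter(w.lower() for s in sentences for w in s.split())
    def score(s):
        words = s.split()
        return sum(freq[w.lower()] for w in words) / len(words)
    return sorted(sentences, key=score, reverse=True)[:n]

doc = ["Immunosuppression protocols reduce pro-inflammatory cytokines.",
       "Cytokine production is important in host defense.",
       "Excessive immunosuppression leaves patients susceptible to infection."]
print(extractive_summary(doc))  # the single highest-scoring sentence
```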
Application 5: Natural Language Generation
● Generate natural language text from an abstract, structured
representation. For example:
○ generate an equipment maintenance manual in multiple languages from a
single abstract representation of steps to be carried out;
○ generate a readable summary of several hours of numerical output from
sensors on a patient in intensive care;
○ generate content for a personalised spoken tour for an individual visiting a
museum given the individual’s interests and stored, abstract
representations describing items held in the museum.
● Doing this well involves moving beyond printing canned text in a pre-specified
format.
● Need to make choices about: selection and ordering of content, which
words/syntactic structures to use, when to introduce anaphors (e.g. pronouns),
etc.
Application 6: Machine Translation
● Translate text from one language to another
e.g. English to French and/or vice versa
● One solution: write a computer program to do the translation.
○ Very challenging problem!
○ Requires immense amount of knowledge about language and
the world.
● Better solution: Learn from corpora that are translations of each
other.
Application 6: Machine Translation
Translation is hard for computers:
● The same word/phrase can have different senses depending on the context
(word sense disambiguation).
○ I would like to have another piece of cake.
○ The task is a piece of cake.
● Idioms and grammatical/syntactic differences between languages:
○ (German) “Ich verstehe nur Bahnhof” (*I understand only train station)
= “It’s all Greek to me” (or “I don’t understand anything”)
= “Das kommt mir Spanisch vor” (*This seems to me Spanish).
○ (Greek) “Πνίγομαι σε μια κουταλιά νερό” (*(I) drown (myself) in a spoonful
of water)
= blowing the issue out of proportion.
● New idioms and slang are constantly being introduced.
Application 6: Machine Translation
Translation is hard for computers: (cont)
● Which term best describes the intended meaning?
○ friend, acquaintance,...
○ sibling, brother, sister,...
● Gaps
○ No Japanese word for privacy.
○ No English word for German Schadenfreude (Greek: χαιρεκακία).
○ No English word for 阴阳 (yin yang).
Application 7: Text Compression
Compression types:
● Lossy compression
○ Reduces file size by discarding information.
○ Mostly in images, audio, video.
○ Original file cannot be reconstructed.
○ Example: Saving a photo as JPEG may blur fine details, but the file size
becomes much smaller.
● Lossless compression
○ Reduces file size by finding patterns or repetitions and encoding them more efficiently.
○ Original file can be reconstructed.
○ Example: A ZIP file reduces size, but when you unzip, you get the exact
original files back.
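A quick lossless round trip using Python's standard-library zlib module, showing that repetitive text compresses well and reconstructs exactly:

```python
import zlib

# Compress repetitive text, then decompress back to the exact original.
original = b"the cat sat on the mat. " * 200
compressed = zlib.compress(original)

print(len(original), "->", len(compressed))     # repetition compresses very well
assert zlib.decompress(compressed) == original  # lossless: exact reconstruction
```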
Ethical/social implications for research and innovation
● Biased, incorrect, or limited information in datasets
○ search engines/text mining tools may fail to retrieve relevant info
○ machine translation tools may mistranslate key passages
○ Decision making may be unfair/misguided/biased.
● Profiles of individuals may be gathered/inferred from the language they use
(under General Data Protection Regulation (GDPR); language is personal data)
○ opinions expressed on-line
○ decisions made by gender/age/IQ/race/sexual orientation classifiers
● See: UKRI Framework for responsible innovation (AREA: anticipate, reflect,
engage, act)
● See: ACL Ethics for NLP resources: https://aclweb.org/aclwiki/Ethics_in_NLP
Suggested Reading for the Module
● Major sources (in the reading list on BB):
o Information Retrieval
■ Baeza-Yates and Ribeiro-Neto, Modern Information Retrieval.
New York: Addison Wesley, 2011 (2nd ed.).
■ C. Manning, P. Raghavan and H. Schütze, Introduction to Information
Retrieval. Cambridge University Press, 2008.
o General
■ C. Manning and H. Schütze, Foundations of Statistical Natural
Language Processing. MIT Press, 1999.
■ D. Jurafsky and J. H. Martin, Speech and Language Processing.
2024 (3rd ed. draft). https://web.stanford.edu/~jurafsky/slp3/
● Python programming — see module homepage for suggestions.
Reading
D. Jurafsky and J. H. Martin,
Speech and Language Processing, 2025 (3rd ed. draft).
https://web.stanford.edu/~jurafsky/slp3/ed3book_aug25.pdf
● Chapter 2
End of Part A