COM 6115:
Text Processing
Lecture 1-1: Introduction to Text Processing
Based on slides by Varvara Papazoglou, Prof. Rob Gaizauskas and Dr Mark Hepple
Dr Xi Wang
Course Director for MSc Computer Science
with Speech & Natural Language Processing
Lecture Overview
Part A
• Information about the module
• What is Text Processing
• Examples of simple Text Processing tasks
• Example applications
Information about the Module
Module Details
Lectures: Mondays, 11:00 – 13:00, Lecture Theatre 4, The Diamond
Labs: Fridays, 13:00 – 15:00, Computer Room 06, The Diamond
Module Details
Taught by
• Dr Xi Wang ([email protected]): Weeks 1 – 3, 9 & 10
• Dr Zheng Yuan ([email protected]): Weeks 4 – 7, 11
We are both looking for PhD students and welcome PhD applications in the area of NLP and related fields.
Feel free to reach out if you are interested.
Related topics:
• Supervised by Xi: Conversational AI, Retrieval Augmented Generation, User Simulation, as well as topics related to NLP and IR.
• Supervised by Zheng: Educational NLP, Multilingual NLP, Low-resource NLP, as well as topics related to NLP and machine learning.
Module Details
Taught by
• Dr Xi Wang ([email protected]): Weeks 1 – 3, 9 & 10
• Dr Zheng Yuan ([email protected]): Weeks 4 – 7, 11
Guest Lecture: Week 8
Module Details
GTAs
Mingzi, Tyler, Yanwen, Xingyu
Module Details
All materials for COM6115 are (or will be) available on Blackboard:
• module details
• lecture slides + Encore recordings
• lab sheets and Python material (intro & setup help)
• assignment
• links to past exam papers
• announcements
• discussion board:
o Lectures
o Exams
o General questions
o Labs
o Assignments
Lab Sessions
Weeks 1 – 2:
Python Intro, for those with limited (Python) programming
background.
Weeks 3 – 11:
Specific Text Processing Topics, including some core skills for the
module assignment.
Assessment
30% Assignment
● Release Date: 24th of October, 2025 (Week 4)
● Submission Date: 20th of November, 2025, 15:00 (Week 8)
● Please do not attempt to plagiarise!
(more info: https://www.sheffield.ac.uk/new-students/unfair-means)
70% Final exam
● During the exam period in January
University Guidance on Use of Generative AI
● https://www.sheffield.ac.uk/academic-skills/using-generative-ai
● https://students.sheffield.ac.uk/digital-learning/ai-principles
● Google Gemini is the UoS institutionally approved GenAI tool.
○ Access using your Uni credentials
Module Overview
Module aims
• To develop an understanding of the fundamentals of how digitally stored text is represented and processed in a computer.
• To develop the ability to construct and refine systems for applying text processing techniques.
• To develop an understanding of the basic problems and principles underlying text processing applications.
Prerequisites
• Interest in language and basic knowledge of English.
• Some mathematical basics, e.g. basic probability theory.
• Programming skills, including basic familiarity with Python.
Module Outline
• Text Encoding
• Text Compression
• Information Retrieval
• Lexical Semantics
• Sentiment Analysis
• Information Extraction
• Introduction to Deep Learning for Text Processing
Wooclap -- Programme Experience
What is Text Processing?
Why Text Processing?
- What is ‘text’?
Something that can be ‘read’; a symbolic arrangement of
graphemes that conveys some meaning.
“In literary theory, a text is any object that can be "read", whether this object is a work of literature, a street sign, an arrangement of buildings on a city block, or styles of clothing. It is a set of signs that is available to be reconstructed by a reader (or observer) if sufficient interpretants are available. This set of signs is considered in terms of the informative message's content, rather than in terms of its physical form or the medium in which it is represented.”
-- “Text (literary theory)”, Wikipedia
- Is speech text?
If transcribed, yes.
Why Text Processing?
Text Processing concerns
The creation, storage and access of text in digital form by computer.
Motivation
Abundance of digital text collections/streams (the web, social
media, digital archives, etc.)
● Many different languages
● Many different types of text
○ news, literature, academic publications, social media
posts, etc., each with sub-types
● Analysis of existing text and generation of new text
○ e.g. textual expression of sensor data/unstructured data
Convergence with NLP
Goal
• NLP: understanding and interpreting language meaning.
• Text Processing: performing simpler engineering tasks (e.g. cleaning, formatting, extracting patterns).
• Similarity: both aim to process and handle text data.
Complexity
• NLP: high; requires linguistic and semantic knowledge.
• Text Processing: lower; often rule-based or statistical.
• Similarity: both may use shared techniques like tokenisation or parsing.
Examples
• NLP: machine translation, question answering, sentiment analysis.
• Text Processing: removing stop words, tokenisation, stemming.
• Similarity: both involve transforming text into a form suitable for computation.
Focus
• NLP: meaning and context of language.
• Text Processing: structure and surface form of text.
• Similarity: both deal with language as input to computational methods.
Common Text Processing/NLP Applications
• Information Retrieval (IR)
• Information Extraction (IE)
• Text Categorisation
• Automatic Summarisation
• Natural Language (NL) Generation
• Machine Translation (MT)
• Text Compression
Some Simple Text Processing Tasks
• Sentence splitting
• Normalisation
o case normalisation
o punctuation removal
• Tokenisation
• Stop-word removal
• Lemmatisation and stemming
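As a taster, here is a minimal Python sketch of a few of these steps (case normalisation, punctuation removal, stop-word removal and a crude suffix-stripping "stemmer"); the stop-word list and suffix rules are invented toy examples, not the tools used in the labs:

```python
import re

# Toy stop-word list and suffix rules -- purely illustrative choices.
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in", "by", "are"}
SUFFIXES = ("ing", "ed", "s")

def normalise(text):
    text = text.lower()                     # case normalisation
    return re.sub(r"[^\w\s]", "", text)     # punctuation removal

def crude_stem(token):
    # Strip the first matching suffix (a toy stand-in for a real stemmer).
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

text = "The proteins are activated by IL2."
tokens = normalise(text).split()                     # whitespace tokenisation
tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
print([crude_stem(t) for t in tokens])               # ['protein', 'activat', 'il2']
```

Note how the crude stemmer over-strips "activated" to "activat"; real stemmers and lemmatisers make more careful choices, as we will see later.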
Example Task 1: Sentence Splitting
Text Snippet
Current immunosuppression protocols to prevent lung transplant rejection
reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1 T-
cell pro-inflammatory cytokine production is important in host defense against
bacterial infection in the lungs. Excessive immunosuppression of Th1 T-cell pro-
inflammatory cytokines leaves patients susceptible to infection.
Any Ideas?
Example Task 1: Sentence Splitting
Text Snippet
Current immunosuppression protocols to prevent lung transplant rejection
reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1 T-
cell pro-inflammatory cytokine production is important in host defense against
bacterial infection in the lungs. Excessive immunosuppression of Th1 T-cell pro-
inflammatory cytokines leaves patients susceptible to infection.
Heuristic rule:
sentence boundary = period + space(s) + capital letter
Example Task 1: Sentence Splitting
Text Snippet
Current immunosuppression protocols to prevent lung transplant rejection
reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1 T-
cell pro-inflammatory cytokine production is important in host defense against
bacterial infection in the lungs. Excessive immunosuppression of Th1 T-cell pro-
inflammatory cytokines leaves patients susceptible to infection.
Current immunosuppression protocols to prevent lung transplant rejection
reduce pro-inflammatory and T-helper type 1 (Th1) cytokines.
However, Th1 T-cell pro-inflammatory cytokine production is important in host
defense against bacterial infection in the lungs.
Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves
patients susceptible to infection.
Example Task 1: Sentence Splitting
Heuristic rule:
sentence boundary = period + space(s) + capital letter
Is this the perfect solution? It doesn't always work:
IL-33 is known to induce the production of Th2-associated
cytokines (e.g. IL-5 and IL-13).
IL-33 is known to induce the production of Th2-associated
cytokines (e.g.
IL-5 and IL-13).
Example Task 1: Sentence Splitting
IL-33 is known to induce the production of Th2-associated
cytokines (e.g. IL-5 and IL-13).
IL-33 is known to induce the production of Th2-associated
cytokines (e.g.
IL-5 and IL-13).
Two Solutions
• Add more rules to handle exceptions
• Machine Learning
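A minimal Python sketch of the heuristic rule above, showing both a case where it works and the "e.g." failure case. The regex is one possible encoding of the rule, not the module's reference implementation:

```python
import re

# Heuristic: a sentence boundary is a period followed by whitespace
# and a capital letter. Sketch only; real splitters handle many more cases.
def naive_split(text):
    return re.split(r"(?<=\.)\s+(?=[A-Z])", text)

ok = "Protocols reduce Th1 cytokines. However, cytokine production matters."
print(naive_split(ok))   # two sentences, as intended

bad = "IL-33 induces Th2-associated cytokines (e.g. IL-5 and IL-13)."
print(naive_split(bad))  # wrongly splits after "e.g." -- the abbreviation fools the rule
```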
Example Task 2: Tokenisation
Example Sentence
The protein is activated by IL2.
The protein is activated by IL2 .
• Convert a sentence into a sequence of tokens.
• Why tokenise?
o Because we do not want to treat a sentence as a raw sequence of characters, nor as a single indivisible unit!
Example Task 2: Tokenisation
Example Sentence
The protein is activated by IL2.
The protein is activated by IL2 .
• Can you think of any problems?
o There are languages (e.g. German, Finnish) with many long compound words (e.g. 'ice cream' written as a single word). Should we decompound?
o How to tokenise Chinese text? (no spaces!)
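A sketch tokeniser for space-delimited languages like English (the regex is an invented illustration); note that it would do nothing useful for Chinese, where word boundaries are not marked by spaces:

```python
import re

# Keep alphanumeric runs (allowing internal hyphens) as tokens and
# split off each punctuation mark as its own token. Illustrative only.
def tokenise(sentence):
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", sentence)

print(tokenise("The protein is activated by IL2."))
# ['The', 'protein', 'is', 'activated', 'by', 'IL2', '.']
```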
Example
Applications
Application 1: Information Retrieval
Information Retrieval (IR)
concerned with developing
algorithms and models to retrieve
relevant documents from text
collections for a given query.
Application 1: Information Retrieval
● Text collection = a set of ‘documents’
○ originally, few hundred/thousand electronically stored documents (e.g.
journal paper abstracts)
○ now, billions of pages on the WWW
● Query: user indication of what they want
○ commonly, just 2 or 3 words — good basis for retrieval?
● How to decide what docs are relevant?
○ how to decide if one method works better than another?
● Much work is still left to the user:
○ task of selecting which of returned docs are relevant
○ task of extracting the relevant information
○ task of using retrieved documents -- retrieval augmented generation
Application 2: Information Extraction
● IR contrasts with Information Extraction (IE).
○ IE is about automatically extracting information from unstructured or semi-
structured data.
○ IR is about finding the most relevant or best matching documents given a
query.
● IE recognises specific information in documents, making it available to subsequent
automated processes
○ type of information to be extracted must be decided in advance.
■ entities (e.g. organisations, persons, locations) and
■ relations (e.g. person IS-EMPLOYED-BY organisation)
● Information recognised can be:
○ extracted and stored in a structured record, e.g. database system
■ sometimes called “knowledge base population”
○ stored in a document itself as embedded mark-up
Application 3: Text Categorisation
● Task: automatically assign texts to different categories
● Examples:
○ email — assign to categories:
junk vs. non-junk
○ newspaper articles — assign to categories:
sport vs. politics vs. other
○ product reviews — assign to sentiment categories:
positive vs. negative vs. neutral
● Usually: given a set of documents that are representative of each category,
○ use statistical/probabilistic/neural computational methods to learn a model that predicts which category a new, previously unseen document belongs to.
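A minimal sketch of this learn-then-predict pattern using scikit-learn's bag-of-words features and a Naive Bayes classifier; the tool choice and the tiny training set are assumptions for illustration, not the module's prescribed approach:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented training set for the junk vs. non-junk example above.
train_texts = ["win a free prize now", "cheap pills special offer",
               "meeting agenda attached", "lab session moved to Friday"]
train_labels = ["junk", "junk", "non-junk", "non-junk"]

vec = CountVectorizer()                     # bag-of-words features
X = vec.fit_transform(train_texts)
clf = MultinomialNB().fit(X, train_labels)  # learn a probabilistic model

print(clf.predict(vec.transform(["free prize offer"])))  # ['junk']
```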
Application 4: Text Summarisation
Text summarisation: To automatically
generate a concise and digestible
summary of a document (or multiple
documents).
Types of summarisation:
● Extractive: use a proportion of
sentences/phrases/words (25%?)
from the input document.
● Abstractive: create a paraphrased,
restructured, shortened version of
the input document.
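A toy extractive sketch: score each sentence by the average document-wide frequency of its words and keep the top scorer. Frequency scoring is just one simple heuristic, assumed here for illustration:

```python
from collections import Counter

# Score sentences by the average corpus frequency of their words. Toy sketch.
def extractive_summary(sentences, n=1):
    freq = Counter(w.lower() for s in sentences for w in s.split())
    def score(s):
        words = s.split()
        return sum(freq[w.lower()] for w in words) / len(words)
    return sorted(sentences, key=score, reverse=True)[:n]

doc = ["Immunosuppression protocols reduce pro-inflammatory cytokines.",
       "Cytokine production is important in host defense.",
       "Excessive immunosuppression leaves patients susceptible to infection."]
print(extractive_summary(doc))  # the single highest-scoring sentence
```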
Application 5: Natural Language Generation
● Generate natural language text from an abstract, structured
representation. For example:
○ generate an equipment maintenance manual in multiple languages from a
single abstract representation of steps to be carried out;
○ generate a readable summary of several hours of numerical output from
sensors on a patient in intensive care;
○ generate content for a personalised spoken tour for an individual visiting a
museum given the individual’s interests and stored, abstract
representations describing items held in the museum.
● Doing this well involves moving beyond printing canned text in a pre-specified
format.
● Need to make choices about: selection and ordering of content, which
words/syntactic structures to use, when to introduce anaphors (e.g. pronouns),
etc.
Application 6: Machine Translation
● Translate text from one language to another
e.g. English to French and/or vice versa
● One solution: write a computer program to do the translation.
○ Very challenging problem!
○ Requires immense amount of knowledge about language and
the world.
● Better solution: Learn from corpora that are translations of each
other.
Application 6: Machine Translation
Translation is hard for computers:
● The same word/phrase can have different senses depending on the context
(word sense disambiguation).
○ I would like to have another piece of cake.
○ The task is a piece of cake.
● Idioms and grammatical/syntactic differences between languages:
○ (German) “Ich verstehe nur Bahnhof” (*I understand only train station)
= “It’s all Greek to me” (or “I don’t understand anything”)
= “Das kommt mir Spanisch vor” (*This seems to me Spanish).
○ (Greek) “Πνίγομαι σε μια κουταλιά νερό” (*(I) drown (myself) in a spoonful
of water)
= blowing the issue out of proportion.
● New idioms and slang are constantly being introduced.
Application 6: Machine Translation
Translation is hard for computers: (cont)
● Which term best describes the intended meaning?
○ friend, acquaintance,...
○ sibling, brother, sister,...
● Gaps
○ No Japanese word for privacy.
○ No English word for German Schadenfreude (Greek: χαιρεκακία).
○ No English word for 阴阳 (yin yang).
Application 7: Text Compression
Compression types:
● Lossy compression
○ Reduces file size by discarding information.
○ Mostly in images, audio, video.
○ Original file cannot be reconstructed.
○ Example: Saving a photo as JPEG may blur fine details, but the file size
becomes much smaller.
● Lossless compression
○ Reduces file size by finding patterns or repetitions and encoding them more efficiently.
○ Original file can be reconstructed.
○ Example: A ZIP file reduces size, but when you unzip, you get the exact
original files back.
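A quick lossless round trip using Python's standard-library zlib module, showing that repetitive text compresses well and reconstructs exactly:

```python
import zlib

# Compress repetitive text, then decompress back to the exact original.
original = b"the cat sat on the mat. " * 200
compressed = zlib.compress(original)

print(len(original), "->", len(compressed))     # repetition compresses very well
assert zlib.decompress(compressed) == original  # lossless: exact reconstruction
```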
Ethical/social implications for research and innovation
● Biased, incorrect, or limited information in datasets
○ search engines/text mining tools may fail to retrieve relevant info
○ machine translation tools may mistranslate key passages
○ Decision making may be unfair/misguided/biased.
● Profiles of individuals may be gathered/inferred from the language they use
(under General Data Protection Regulation (GDPR); language is personal data)
○ opinions expressed on-line
○ decisions made by gender/age/IQ/race/sexual orientation classifiers
● See: UKRI Framework for responsible innovation (AREA: anticipate, reflect,
engage, act)
● See: ACL Ethics for NLP resources: https://aclweb.org/aclwiki/Ethics_in_NLP
Suggested Reading for the Module
● Major sources (in the reading list on BB):
o Information Retrieval
■ Baeza-Yates and Ribeiro-Neto, Modern Information Retrieval.
New York: Addison Wesley, 2011 (2nd ed.).
■ C. Manning, P. Raghavan and H. Schütze, Introduction to Information
Retrieval. Cambridge University Press, 2008.
o General
■ C. Manning and H. Schütze, Foundations of Statistical Natural
Language Processing. MIT Press, 1999.
■ D. Jurafsky and J. H. Martin, Speech and Language Processing.
2024 (3rd ed. draft). https://web.stanford.edu/~jurafsky/slp3/
● Python programming — see module homepage for suggestions.
Reading
D. Jurafsky and J. H. Martin,
Speech and Language Processing, 2025 (3rd ed. draft).
https://web.stanford.edu/~jurafsky/slp3/ed3book_aug25.pdf
● Chapter 2
End of Part A