Addis Ababa Science and Technology
University
Artificial Intelligence And Robotics
Center Of Excellence
Assignment of Natural Language
Processing
Name: Firomsa Abera Keba
Id No: GSR0234/17
May, 2025
Introduction
Natural language processing (NLP) is a multidisciplinary field of study concerned with building
machines that can understand and generate human languages. NLP is increasingly being used in
interactivity and productivity applications, such as creating spoken dialogue systems and
speech-to-speech engines, searching social networks for health or financial information, and
detecting moods and emotions towards products and services [1]. Natural language processing has
evolved significantly over the years, supported by advancements in technology: it started with
early rule-based systems and, through long and sustained effort, has progressed to today's deep
learning models.
NLP Approaches
In this section we classify NLP approaches into four main categories: rule-based methods,
statistical methods, machine learning methods, and deep learning methods.
Rule Based Methods
In the early days, natural language processing relied on handcrafted rules and linguistic
knowledge. Rule-based methods process language using specific rules and lexicons [2]: texts are
analysed with grammatical rules and word lists. Early systems used a simple form of pattern
matching to respond to certain keywords and phrases, and although their capabilities were quite
limited by today's standards, their impact was unmistakable. Rule-based systems have the
following drawbacks:
➢ Manual effort: creating and maintaining a set of rules requires manual effort.
➢ Scalability issues: they do not scale to large datasets or evolving language, so the system
must be updated manually, which is inefficient.
➢ Lack of generalization: they perform well within a limited scope but struggle in broader
domains.
➢ Limited learning capability: they rely entirely on handcrafted rules, and it is impossible to
enumerate every rule.
➢ Rigidity: they are rigid and difficult to customize.
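The keyword-and-pattern style described above can be sketched in a few lines. The rules and responses below are purely illustrative assumptions, not taken from any real system:

```python
import re

# A few hand-written pattern/response rules (purely illustrative).
RULES = [
    (re.compile(r"\bhello\b|\bhi\b", re.IGNORECASE), "Hello! How can I help you?"),
    (re.compile(r"\bweather\b", re.IGNORECASE), "I cannot check the weather, sorry."),
    (re.compile(r"\bbye\b", re.IGNORECASE), "Goodbye!"),
]

def respond(text):
    """Return the response of the first rule whose pattern matches the input."""
    for pattern, response in RULES:
        if pattern.search(text):
            return response
    return "I do not understand."  # no rule fired

print(respond("Hi there"))        # Hello! How can I help you?
print(respond("Tell me a joke"))  # I do not understand.
```

The fallback line illustrates the brittleness noted above: any input outside the hand-written patterns gets no meaningful response.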
Statistical Methods
To address the impossibility of listing all rules in a rule-based system, statistical methods
emerged. These methods rely on large corpora of text to learn patterns and probabilities
associated with different linguistic phenomena [3]. Statistical models, such as hidden Markov
models (HMMs) and n-gram models, were used for tasks like language modelling, part-of-speech
tagging, and machine translation. To build a statistical NLP system, we first provide a large
corpus from which the model learns the probabilities of words; the model then uses these
probabilities for further computation.
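As a minimal sketch of this idea, a bigram model estimates the probability of a word given the previous word by counting adjacent pairs. The corpus below is a made-up toy example; real systems use corpora of millions of words:

```python
from collections import Counter

# A made-up toy corpus; real systems learn from millions of words.
corpus = "the cat sat on the mat the cat ran".split()

# Count unigram and bigram frequencies (the last token never starts a bigram).
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    if unigrams[prev] == 0:
        return 0.0  # unseen history
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("the", "cat"))  # 2/3: "the" is followed by "cat" twice out of three times
```

This tiny example already shows the limitations listed next: any pair never seen in the corpus gets probability zero, and the model only ever looks one word back.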
Limitations of Statistical methods:
➢ Data dependency: performance depends heavily on the size and quality of the corpus.
➢ They do not capture semantic meaning or contextual nuances well.
➢ They generalize poorly to unseen domains.
➢ They fail to capture long-range dependencies due to fixed window sizes.
➢ As dataset size increases, these methods become computationally expensive because they
involve calculating probabilities for a vast number of word sequences.
Machine learning models
Machine learning approaches carry out natural language tasks by learning representations and
patterns directly from data, rather than relying on handcrafted rules, and are capable of
capturing patterns from a corpus.
Supervised Learning Methods: Supervised learning involves training models using labelled data
sets. In these methods, the correct output label is known for each data sample and the model learns
to predict these labels [2].
Unsupervised Learning Methods: Unsupervised learning aims to discover hidden structures within
the data using unlabeled data sets [2].
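A supervised method can be illustrated with a tiny Naive Bayes sentiment classifier, a classic statistical-ML baseline for text. The training sentences and labels below are invented for the sketch:

```python
import math
from collections import Counter, defaultdict

# Tiny labelled training set (invented for illustration).
train = [
    ("good great film", "pos"),
    ("great acting good plot", "pos"),
    ("bad boring film", "neg"),
    ("terrible bad acting", "neg"),
]

# Count word frequencies per class and build the vocabulary.
word_counts = defaultdict(Counter)
class_counts = Counter()
vocab = set()
for text, label in train:
    words = text.split()
    word_counts[label].update(words)
    class_counts[label] += 1
    vocab.update(words)

def classify(text):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # Log prior: fraction of training documents with this label.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            # Smoothed log likelihood of each word given the class.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("good film"))        # pos
print(classify("boring terrible"))  # neg
```

Note how the labels drive learning: the model predicts whichever class makes the observed words most probable, which is exactly the supervised setting described above.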
Machine learning has significantly transformed the field of natural language processing,
offering numerous advantages over traditional rule-based approaches: it scales to handle vast
amounts of textual data efficiently, it learns from data, and models can improve over time with
more data and feedback. However, it has some drawbacks, including:
➢ Data dependency: ML models need large datasets to capture patterns.
➢ Lack of context awareness and ambiguity: the meaning of some words differs with context.
➢ Domain specificity: a model trained on one domain remains challenging to adapt to another.
➢ Bias and fairness issues: ML models inherit biases from their training data.
Deep Learning Methods
Deep learning methods are advanced algorithms that perform complex language processing tasks
using artificial neural networks [2]. Neural networks, especially RNNs and LSTM, were widely
used in earlier NLP applications, helping with tasks like machine translation and text generation
[4].
Transformer Models
Transformer models, such as BERT (Bidirectional Encoder Representations from Transformers)
and GPT (Generative Pretrained Transformer), leverage an attention mechanism that enables
models to weigh the importance of different words in a sentence, regardless of their position [4].
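The attention mechanism can be sketched in miniature as scaled dot-product attention over toy vectors. The vectors below are arbitrary; real transformers use learned projections to produce queries, keys, and values:

```python
import math

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: each output mixes the values,
    weighted by how similar the query is to each key."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)  # one weight per position, summing to 1
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three toy two-dimensional token vectors; in self-attention the same
# vectors play the roles of queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, x, x)
print(out[0])  # a weighted mixture of all three vectors
```

Because every query attends to every key, the weighting is independent of word position, which is the property highlighted above.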
BERT
BERT has transformed the field by introducing bidirectional context, meaning it can understand
words in relation to all surrounding words in a sentence [4].
GPT
By pre-training on large corpora of text data, GPT can generate coherent and contextually relevant
text, making it suitable for applications like content creation, dialogue generation, and code
synthesis [4].
Deep learning has addressed several drawbacks of traditional machine learning models: it is
capable of capturing syntactic and semantic relationships, it reduces the burden of feature
engineering by learning features from raw text, and pre-trained language models can be
fine-tuned for specific tasks with less labelled data. However, it has drawbacks:
➢ It requires massive amounts of data and high-end computational hardware.
➢ Deep models act as "black boxes," making it difficult to interpret or explain predictions.
➢ Risk of significant bias.
➢ High energy consumption, among others.
Evaluation metrics in NLP
Evaluation metrics are crucial for assessing the performance of models across different NLP tasks.
These evaluation indices assist researchers in picking the most appropriate model for their research
circumstance [5]. The error rate is the proportion of misclassified samples to the total number
of samples. Precision, recall, and F1 scores may be computed from the confusion matrix below:
                     Actual Positive       Actual Negative
Predicted Positive   True Positive (TP)    False Positive (FP)
Predicted Negative   False Negative (FN)   True Negative (TN)
➢ Accuracy: the ratio of correctly estimated samples to the total number of samples. It is
particularly useful on balanced data sets [2]. It is calculated as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
➢ Precision: the ratio of correct positive predictions to total positive predictions. High
precision indicates few false positives [2]. It is calculated as follows:
Precision = TP / (TP + FP)
➢ Recall: the proportion of actual positive instances that are correctly predicted. High recall
indicates catching most of the true positives [2]. It is calculated as follows:
Recall = TP / (TP + FN)
➢ F-measure: balances differing precision/recall preferences, providing a single score that
considers both precision and recall:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
➢ ROUGE and BLEU Scores: ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
and BLEU (Bilingual Evaluation Understudy) are metrics used in tasks such as text
summarization and translation. ROUGE measures the quality of generated summaries, while BLEU
assesses translation quality [2].
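The accuracy, precision, recall, and F-measure formulas above can be checked with a few toy confusion-matrix counts (the numbers are arbitrary examples):

```python
# Toy confusion-matrix counts (arbitrary example numbers).
TP, FP, FN, TN = 40, 10, 5, 45

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(accuracy)          # 0.85
print(precision)         # 0.8
print(round(recall, 3))  # 0.889
print(round(f1, 3))      # 0.842
```

Note how precision and recall diverge when FP and FN differ, which is why the F1 score is reported as their harmonic mean.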
Limitations and Risks Associated with NLP Systems
There are several limitations and risks that need to be considered while using NLP systems.
Quality data: The effectiveness of the computational models relies on the quality and
comprehensiveness of the data. Although many political discourses are public, including data
sources such as news, press releases, legislation, and campaigns, when it comes to surveying public
opinions, social media might be a biased representation of the whole population [6].
Black box: many modern NLP models are black boxes. We therefore do not encourage decision-making
systems to depend fully on NLP, but suggest that NLP assist human decision-makers [6].
Linguistic and Contextual Challenges: Since human languages are ambiguous, it is challenging
for NLP models to understand the contextual meanings of human language.
The above-mentioned limitations may give rise to several risks:
Misinformation and Fake Content: NLP systems, especially generative models, can be misused
to create fake news, phishing emails, or deepfake text.
Accountability: When NLP systems make incorrect or harmful decisions, it is unclear who is
responsible: the developer, the deployer, or the data provider. There are many more such risks.
Ethical Issues in Natural Language Processing
Privacy Issues
Many natural language processing methods rely on user data to obtain better performance; this
poses potential risks when that data is sensitive, so it must be secured and not shared
publicly. Since natural language processing tasks usually involve collecting and processing
sensitive user information, appropriate privacy protection measures must be taken to protect
users' privacy [7].
Bias Issues
Bias refers to systematic and unfair discrimination embedded in datasets, models, or algorithms
that affects the performance of NLP systems. These biases may arise from multiple sources [8]:
➢ Data bias: due to an imbalance in the training data.
➢ Model bias: due to limitations of the model, such as the chosen algorithm or model
architecture.
➢ Assessment bias: due to the choice of assessment metrics or the limited nature of the
assessment dataset.
➢ Algorithm bias: due to the design or selection of a particular algorithm or process.
➢ Social bias: due to the fact that NLP systems are designed and applied in specific social
contexts and thus may reflect or reinforce real-world inequalities.
In order to mitigate these biases, we can use several approaches, such as data augmentation, model
adaptation, redefinition of assessment metrics, algorithm improvement, and social engagement [7].
Misinformation: Because NLP systems rely on data, and that data may contain biases, they can
later generate misinformation. For example, models trained on social media data, which often
contains hate speech and fake information, may produce biased or misleading output.
Societal Implications of Natural Language Processing
Advances in language processing technology provide many benefits to society, but they have also
introduced certain challenges and negative consequences.
Accessibility: NLP enhances digital accessibility for individuals. For example, it provides
text-to-speech for those with visual impairments and speech-to-text for people with hearing
impairments [9]. For people with physical disabilities, it enables machines to take orders in
natural language. Machine translation helps individuals who do not speak multiple languages by
automatically translating content from one language to another, among many other things [10].
Impact on Communication: The emergence of NLP has changed the way people communicate. NLP
applications such as chatbots, autocomplete, and virtual assistants influence how people write
and speak. For example, autocorrect and predictive text now reshape sentence construction and
vocabulary usage, so writing skill is less of a barrier. But this also has drawbacks: such tools
can reduce the need to develop strong writing and grammar skills, especially among younger
users who rely heavily on automation.
Impact on Culture: NLP systems are heavily dependent on large datasets (corpora), which are
often sourced from specific regions or dominant cultures. As a result, the cultural values and
norms embedded in these datasets can create algorithmic bias that favors certain cultures and
worldviews. This cultural and linguistic bias may contribute to a form of digital colonialism,
where the cultural dominance of a few groups is perpetuated and amplified by AI technologies.
References
[1] A. P. S. Maria A. Kazakova, "Analysis of natural language processing technology: modern
problems," 2022.
[2] A. Arisoy, "Natural Language Processing Algorithms and Performance Comparison," 2024.
[3] O. Masoumzadeh, "From Rule-Based Systems to Transformers: A Journey through the Evolution
of Natural Language Processing," 2023.
[4] A. V. Christopher Sola, "Advanced Natural Language Processing," 2025.
[5] M. S. H. T. A. K. J. Abdul Ahad ABRO, "Natural Language Processing Challenges and Issues," Journal
of Science, 2023.
[6] R. M. Zhijing Jin, "Natural Language Processing for Policymaking," 2023.
[7] Y. Ma, "A Study of Ethical Issues in Natural Language Processing with Artificial
Intelligence," Journal of Computer Science and Technology Studies, 2023.
[8] D. Hovy and S. Prabhumoye, "Five sources of bias in natural language processing," Language
and Linguistics Compass, 2021.
[9] T. V. K. P. k. V. Madhusudhana Reddy, "Speech-to-Text and Text-to-Speech Recognition using Deep
Learning," in Proceedings of the Second International Conference on Edge Computing and
Applications (ICECAA 2023), 2023.
[10] H. W. Z. H. L. H. K. W. C. Haifeng Wang, "Progress in Machine Translation," 2022.