Group Members: Nathaniel Dakin, Tianna-Lee Salmon, Calvin Stephenson, Pierre
Mannix
Title: The Role of Named Entity Recognition in Healthcare Data Analysis
Introduction
Named Entity Recognition (NER), a pivotal component of natural language processing
(NLP), refers to the methodical identification and categorization of key information elements
in text, such as names of persons, organizations, locations, and other specific terminologies.
In the realm of healthcare, the role of NER takes on a critical dimension, as it involves
parsing through complex, often unstructured medical data to extract vital entities like disease
names, symptoms, medication details, and patient information (Nadeau & Sekine, 2007). The
significance of NER in healthcare is profound, given its potential to revolutionize data
processing, enhance patient care, and streamline healthcare services.
This study specifically focuses on the application of NER in healthcare tasks, with an
emphasis on extracting and categorizing patient-specific information and identifying disease
names from various textual sources. This process is not only integral for effective data
management in healthcare systems but also pivotal in facilitating timely and accurate clinical
decision-making. By harnessing the power of NER, healthcare professionals can gain quicker
access to patient histories, treatment plans, and other critical data, thereby improving the
overall quality of care (Meystre et al., 2008).
The primary objective of this presentation is to provide a comprehensive overview of the
application and impact of NER in the healthcare domain. It aims to explore the various
methodologies and technologies employed in NER, specifically tailored for healthcare
settings, and to evaluate their effectiveness and efficiency. The scope of this presentation
encompasses a detailed analysis of the datasets used in training Clinical NER models,
specific applications and techniques of NER in healthcare, and the ethical considerations
associated with its use. By the end of this presentation, the audience will have a clearer
understanding of how NER is shaping the future of healthcare information management and
the challenges and opportunities that lie ahead in this rapidly evolving field.
Background
Named Entity Recognition (NER), a cornerstone in the field of natural language processing
(NLP), is designed to automatically detect and classify key information entities in text. These
entities can include names of people, organizations, locations, expressions of time, quantities,
monetary values, and more specific terminologies. In essence, NER works by parsing text
and identifying segments that correspond to predefined categories, essentially 'tagging' these
segments for further processing or analysis (Nadeau & Sekine, 2007). The operation of NER
involves complex algorithms that combine linguistic grammar-based techniques with
statistical models, including recent advances in machine learning and deep learning.
In the healthcare domain, the importance of NER cannot be overstated. It plays a critical role
in managing the vast amounts of unstructured data generated daily, such as clinical notes,
research reports, and patient records. By efficiently identifying and categorizing crucial
medical information, NER aids in streamlining data processing and analysis, thus
contributing significantly to health informatics. The application of NER in healthcare
facilitates more informed and faster decision-making processes, enhances patient care, and
supports the systematic study of medical conditions (Meystre et al., 2008).
One of the key examples of NER in healthcare is the analysis of Electronic Health Records
(EHRs). EHRs are rich sources of patient data, but their unstructured nature poses a challenge
for effective data extraction and analysis. NER systems are utilized to extract pertinent
information such as patient diagnoses, treatment plans, and medical histories, converting
them into a structured format suitable for further analysis and use in clinical decision support
systems (Murdoch & Detsky, 2013).
Another significant application of NER in healthcare is in monitoring drug interactions and
adverse effects. NER systems can scan through medical literature, patient records, and other
relevant documents to identify mentions of drugs and their associated reactions or
interactions. This is vital for pharmacovigilance and ensuring patient safety, as it allows
healthcare providers to quickly identify potential risks and take appropriate actions (Botsis et
al., 2010).
Furthermore, NER is instrumental in clinical trial data management. It enables the extraction
of relevant data points from clinical trial documents and patient records, facilitating the
aggregation and analysis of trial data. This enhances the efficiency of clinical trials by
streamlining data collection and analysis processes, thereby accelerating the pace of medical
research and the development of new treatments (Kreimeyer et al., 2017).
In conclusion, the background of NER in healthcare highlights its indispensable role in
various aspects of healthcare data management. From EHR analysis to drug safety
monitoring and clinical trial management, NER stands as a vital tool in transforming
unstructured medical data into actionable insights.
Dataset Description
Diverse Medical Datasets
In the realm of medical data science, diverse datasets are pivotal for training and refining
Named Entity Recognition (NER) systems in healthcare. One prominent dataset is the
GENIA corpus, composed of around 2,000 abstracts from MEDLINE. This corpus focuses
on entities relevant to molecular biology, such as DNA and proteins, making it a fundamental
resource for biomedical NER research (Data Science Central, n.d.).
Another significant dataset in this domain is the BioCreative II GENETAG dataset. It is
tailored specifically for gene mentions, with a unique approach of treating proteins, DNA,
and RNA as a single entity type. This methodology enhances the dataset's utility in
gene-centric NER applications (Data Science Central, n.d.).
The Arizona Disease Corpus (AZDC) and the NCBI disease corpus further enrich the
biomedical NER dataset landscape. Specializing in disease mentions, these datasets are
intricately linked to the Unified Medical Language System (UMLS), thereby expanding their
applicability to a wider range of medical terminologies and contexts. This connection to
UMLS broadens the datasets' utility, making them invaluable for disease-focused NER tasks
in healthcare (Data Science Central, n.d.).
Chemical and Drug Entity Recognition
Focusing on the pharmacological aspect of NER, datasets like the IUPAC training/test
datasets and the DrugDDI dataset come into play. These are essential for recognizing and
classifying chemical entities and understanding drug-drug interactions. They provide specific
data crucial for NER systems working in the fields of drug safety and pharmacology (Data
Science Central, n.d.).
Furthermore, the CHEMDNER dataset, emerging from the BioCreative IV challenge, is a
comprehensive resource containing over 85,000 chemical mentions extracted from PubMed
abstracts. It serves as a substantial resource for researchers and NER systems focusing on the
identification and classification of chemical compounds in biomedical texts (Data Science
Central, n.d.).
Clinical Data in Electronic Medical Records (EMRs)
Electronic Medical Records (EMRs) represent a rich source of clinical data for NER systems
in healthcare. A notable example is the CER dataset, which includes 5,160 clinical records.
These records are crucial for extracting patient-specific information, supporting medical
decision-making processes, and enhancing patient care. EMR datasets, despite the challenges
in their construction due to privacy and confidentiality concerns, are indispensable in the
development of NER systems tailored for the healthcare sector (Data Science Central, n.d.).
These datasets capture the intricacies and variations of clinical language and medical
scenarios, providing a realistic and diverse array of data for NER systems. They not only
facilitate the understanding and processing of complex medical texts but also aid in the
advancement of NER technology, ensuring its applicability and effectiveness in real-world
healthcare settings (Data Science Central, n.d.).
In conclusion, these diverse medical datasets, ranging from molecular biology to clinical
records, form the backbone of NER systems in healthcare. Their varied nature allows for
comprehensive training and refinement of NER models, ensuring their effectiveness in
diverse medical scenarios and applications. The continuous development and expansion of
such datasets are vital for the advancement of data science applications in healthcare,
particularly in the field of NER (Data Science Central, n.d.).
Specific Applications of NER in Healthcare
Named Entity Recognition (NER) applied to healthcare text, often called Clinical NER, addresses
the challenge of dealing with complex clinical text data. This data often contains lengthy
narratives that encompass patient histories, symptoms, diagnoses and treatments. One
application of Clinical NER is Clinical Entity Identification which involves the extraction of
entities such as diseases, medicines and symptoms from electronic health records (EHRs) or
clinical notes. Accurately identifying and classifying clinical data allows for efficient
information retrieval.
In healthcare, Segment Representation (SR) is often used to accurately identify clinical data.
This technique involves assigning suitable class labels to individual words within a given
text. SR is applied to aid with tasks such as part-of-speech tagging and noun-phrase
chunking. The tags used in SR techniques are Begin (B), End (E), Inside (I), Single (S), and
Outside (O). Different SR techniques have been employed within Clinical NER, such as the
IOBE model, which assigns the tags ‘B’ and ‘E’ to the first and last words of each named
entity, respectively. The ‘I’ tag is used for tokens inside named entities that consist of
more than two words, and the ‘O’ tag is for tokens outside of any named entity. An ‘S’ tag is
typically used for single-word entities (Nayel & Shashirekha, 2017). By segmenting the text
appropriately, NER models can capture the context surrounding disease mentions, improving
the accuracy of identification. This is useful because diseases may be mentioned across
multiple sentences or within specific sections of a clinical note. The table below illustrates
how SR is used to tag the text fragment “Treatment / stay IHSS AF ESRD on HD , IgA
nephropathy on”. Text fragments such as these are common in EHRs as clinicians often use
condensed text fragments to efficiently communicate essential patient information.
Table showing an example of using different Segment Representation models. Adapted from “Improving NER for
clinical texts by ensemble approach using segment representations” by Nayel and Shashirekha (2017).
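The tagging schemes described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by Nayel and Shashirekha (2017): the function name tag_tokens, the example sentence, and the entity spans are all hypothetical, and the choice to tag a single-word entity as ‘B’ under IOBE is an assumption, since conventions differ between papers.

```python
from typing import List, Tuple

# Each entity is (start_token_index, end_token_index_inclusive). The spans used
# below are hypothetical and are not taken from the cited table.
Span = Tuple[int, int]


def tag_tokens(n_tokens: int, spans: List[Span], scheme: str = "IOBES") -> List[str]:
    """Assign one segment-representation tag per token.

    "IOBES": B/I/E for multi-word entities, S for single-word entities.
    "IOBE":  B/I/E only; a single-word entity is tagged B (an assumption).
    "IOB2":  B for the first token of an entity, I for the rest.
    """
    tags = ["O"] * n_tokens  # every token starts outside any entity
    for start, end in spans:
        if start == end:  # single-word entity
            tags[start] = "S" if scheme == "IOBES" else "B"
            continue
        tags[start] = "B"
        for i in range(start + 1, end):
            tags[i] = "I"
        tags[end] = "E" if scheme in ("IOBES", "IOBE") else "I"
    return tags


tokens = "patient has chronic kidney disease and hypertension".split()
spans = [(2, 4), (6, 6)]  # hypothetical gold entities
for scheme in ("IOB2", "IOBE", "IOBES"):
    print(scheme, list(zip(tokens, tag_tokens(len(tokens), spans, scheme))))
```

Under IOBES, the three-word entity receives B, I, E and the single-word entity receives S, so entity boundaries are explicit in the tags themselves.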
Techniques such as word embeddings or contextual embeddings (e.g. BERT embeddings) can
be used to create segment representations that better capture the nuances of clinical language
as they help with capturing semantic relationships between words based on their context.
In terms of the machine learning models used for Clinical Entity Identification, these range
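from traditional models, such as support vector machines (SVMs) and conditional random fields
(CRFs), to deep learning models such as long short-term memory (LSTM)-CRF (Zhang et al., 2018).
In evaluating the performance of these models, the harmonic mean of precision and recall
(F1 score) is often the primary metric (Bose et al., 2021; Zhang et al., 2018). With TP, FP,
and FN denoting true positives, false positives, and false negatives, the metrics are
calculated as follows:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × Precision × Recall / (Precision + Recall)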
Image showing the calculations for machine learning metrics. Adapted from “A Survey on Recent Named Entity
Recognition and Relationship Extraction Techniques on Clinical Texts” by Bose et al. (2021).
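The precision, recall, and F1 calculations can also be illustrated in code. The sketch below assumes exact-match, entity-level scoring, where a predicted entity counts as correct only if its span and label both match a gold entity. The function name and the example entities are hypothetical and are not drawn from the cited studies.

```python
from typing import Set, Tuple

# An entity is (start_token, end_token, label). Exact-match scoring is assumed.
Entity = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Entity], predicted: Set[Entity]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for entity-level exact matching."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: 3 gold entities, 4 predictions, 2 exact matches.
gold = {(2, 4, "DISEASE"), (6, 6, "DISEASE"), (9, 9, "DRUG")}
predicted = {(2, 4, "DISEASE"), (6, 6, "DRUG"), (9, 9, "DRUG"), (11, 12, "SYMPTOM")}
p, r, f1 = precision_recall_f1(gold, predicted)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
# prints: precision=0.500 recall=0.667 f1=0.571
```

Note that the prediction (6, 6, "DRUG") has the right span but the wrong label, so under exact matching it counts as both a false positive and a missed gold entity.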
Zhang et al. (2018) conducted a comprehensive evaluation of studies that focused on the
optimal combination of features and models used within Clinical Named Entity Recognition.
These evaluations showed that the performance of a model was closely related to
the features selected. For example, experiments on 220 clinical tests revealed that the CRF
model performed best when used with a combination of part-of-speech features, dictionary
features and word clustering features, achieving an F1-score of 0.8915 (Zhang et al., 2018).
However, experiments combining the strengths of multiple models achieved the highest scores.
As an example, when the sentence classifier from an SVM model was combined with a CRF model,
an F1-score as high as 0.935 was achieved (Zhang et al., 2018). While this is not directly
comparable with the previously mentioned CRF model due to a lack of information on the
specific features used, it highlights the benefit of employing ensemble learning in Clinical
NER.
Ensemble learning involves combining predictions from multiple models to enhance overall
performance. One common ensemble technique used in Clinical NER is majority voting,
where each model in the ensemble provides a prediction, and the final prediction is
determined by the majority vote among these individual predictions. This is useful because
some classifiers might give good results on some datasets yet perform very badly on others.
Considering the decisions of multiple classifiers instead of a single classifier can help
mitigate errors from individual models and enhance the overall accuracy of the NER system
(Nayel & Shashirekha, 2017). Incorporating ensemble techniques, such as majority voting,
into Clinical NER systems proves to be a valuable strategy, capitalizing on the strengths of
multiple models.
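Majority voting over token-level tags can be sketched as follows. This is an illustrative example only: the three model outputs are hypothetical, and breaking ties in favour of the earliest model in the list is an assumption of this sketch rather than a rule taken from Nayel and Shashirekha (2017).

```python
from collections import Counter
from typing import List


def majority_vote(predictions: List[List[str]]) -> List[str]:
    """Combine per-token tag sequences from several models by majority vote.

    Ties are broken in favour of the earliest model in the list (an assumption).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(seq) != length for seq in predictions):
        raise ValueError("all models must tag the same number of tokens")
    voted = []
    for token_tags in zip(*predictions):  # tags from every model for one token
        counts = Counter(token_tags)
        best = max(counts.values())
        # Pick the first tag, in model order, that reaches the top count.
        voted.append(next(tag for tag in token_tags if counts[tag] == best))
    return voted


# Hypothetical outputs from three taggers over the same five tokens.
model_a = ["O", "B", "I", "E", "O"]
model_b = ["O", "B", "E", "O", "O"]
model_c = ["O", "B", "I", "E", "S"]
print(majority_vote([model_a, model_b, model_c]))
# prints: ['O', 'B', 'I', 'E', 'O']
```

Voting independently on each token can occasionally produce a tag sequence that is not well formed (for example, an ‘I’ with no preceding ‘B’), so a practical system would likely add a repair or validation step after voting.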
Ethical Concerns of NER in the Health Sector
Machine learning models are only as effective as the data they receive. Security concerns
about how data is handled by models such as NER should therefore be addressed, because
healthcare data is extremely sensitive and can be highly consequential in the wrong hands.
Within the healthcare sector, the use of machine learning models such as NER has become more
prominent, driven by the need to draw insights from large volumes of data. At the same time,
approximately 90% of healthcare organizations experience some form of data breach, according
to research reported by Silvestri et al. (2023). Data protection is therefore a constant and
real concern in the health sector, and the sensitivity of the information managed in the
healthcare industry makes it a persistent target.
McCradden et al. (2020) outlined the concept of deidentification as a means of making data
collection for AI research more ethical. Under deidentification, a large dataset remains
useful for modelling and insight but is no longer associated with an individual or group.
This concept is backed by HIPAA, a U.S. law that focuses on the protection of health
information, privacy, and security in the healthcare sector. The vulnerable information that
deidentification protects includes names, addresses, and other personal data, which can
reduce personal attacks or group-targeted approaches by large data processing companies.
Machine learning through NER can be effective in creating solutions for the healthcare
industry. However, because progress depends on large amounts of data from real medical cases,
it is essential that models such as NER are developed in a way that is consistent with the
aim of protecting the individuals on whom information is collected.
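As a rough illustration of deidentification, the sketch below replaces identifier spans in a clinical note with bracketed placeholders. It assumes the spans have already been found (in practice, by an NER model) and do not overlap. The function name, the sample note, and the labels are hypothetical, and this simple masking step is not a complete HIPAA-compliant procedure.

```python
from typing import List, Tuple

# (start_char, end_char_exclusive, label) for each identifier found in the text.
# In practice these spans would come from an NER model; here they are hand-written.
Span = Tuple[int, int, str]


def deidentify(text: str, spans: List[Span]) -> str:
    """Replace each identified span with a placeholder such as [NAME].

    Spans are assumed not to overlap.
    """
    result = text
    # Work from the end of the text so that earlier offsets stay valid.
    for start, end, label in sorted(spans, reverse=True):
        result = result[:start] + f"[{label}]" + result[end:]
    return result


note = "Jane Doe, 12 Elm Street, reports chest pain."  # fictional example
spans = [(0, 8, "NAME"), (10, 23, "ADDRESS")]
print(deidentify(note, spans))
# prints: [NAME], [ADDRESS], reports chest pain.
```

The clinical content of the note ("reports chest pain") is preserved for modelling, while the details that tie it to an individual are removed.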
Case Studies
Case Study 1: Deep Learning in Biomedical Named Entity Recognition
This case study, conducted by Ahmad, Shah, and Lee (2023), investigates the application of
deep learning (DL) techniques in biomedical Named Entity Recognition (bNER). The study
highlights the transformative impact of DL in enhancing the efficacy of bNER systems,
particularly in the analysis of electronic health records (EHRs). It underscores the shift from
traditional rule-based systems to advanced DL models capable of learning complex patterns
in biomedical texts automatically, thus marking a significant advancement in the field.
Key points include the categorization of bNER-based tools and their role in creating labeled
datasets for sentiment analysis. The study also discusses the challenges faced by bNER
systems, such as the need for efficient data processing methods and the handling of the
complexity inherent in biomedical texts. Furthermore, the study delves into the future
directions of bNER in healthcare, focusing on how these advanced systems can be leveraged
to gain deeper insights into patient data and improve clinical decision-making processes.
Case Study 2: Biomedical Named Entity Recognition in Healthcare Domain
This case study, authored by Ahmad, Shah, and Lee (2023), explores the role of biomedical
Named Entity Recognition (bNER) in the healthcare domain. It delves into how bNER,
critical in biomedical informatics, identifies entities with special meanings such as drugs,
proteins, genes, and diseases in electronic health records (EHRs). The study emphasizes the
growth of medical literature and the necessity of bNER systems to handle the complexity of
biomedical texts.
Key highlights include the shift from early manual configuration of bNER systems to more
advanced deep learning (DL)-based systems, which learn patterns of biomedical text
automatically. This advancement has made bNER systems more robust and efficient. The
study also categorizes bNER-based tools, showcasing their role in creating labeled datasets
for machine learning sentiment analyzers. Moreover, it addresses the challenges facing bNER
systems and provides future directions in the healthcare field.
Conclusion and Future Directions
References
Ahmad, P., Shah, S. Y., & Lee, Y. (2023). Biomedical Named Entity Recognition in
Healthcare Domain. [Case study from the provided PDF].
Ahmad, P., Shah, S. Y., & Lee, Y. (2023). Deep Learning in Biomedical Named Entity
Recognition. [Case study from the provided PDF].
Bose, P., Srinivasan, S., Sleeman, W. C., Palta, J., Kapoor, R., & Ghosh, P. (2021). A Survey
on Recent Named Entity Recognition and Relationship Extraction Techniques on
Clinical Texts. Applied Sciences, 11(18), 8319. https://doi.org/10.3390/app11188319
Botsis, T., Nguyen, M. D., Woo, E. J., Markatou, M., & Ball, R. (2010). Text mining for the
Vaccine Adverse Event Reporting System: medical text classification using
informative feature selection. Journal of the American Medical Informatics
Association, 17(5), 584-591.
Data Science Central. (n.d.). Mobilizing Data Science in Healthcare: Applications,
Challenges, and Solutions. Retrieved from Data Science Central.
Kim, J. D., Ohta, T., Tateisi, Y., & Tsujii, J. (2003). GENIA corpus--semantically annotated
corpus for bio-textmining. Bioinformatics (Oxford, England), 19 Suppl 1, i180–i182.
https://doi.org/10.1093/bioinformatics/btg1023
Kreimeyer, K., Foster, M., Pandey, A., Arya, N., Halford, G., Jones, S. F., ... & Botsis, T.
(2017). Natural language processing systems for capturing and standardizing
unstructured clinical information: A systematic review. Journal of Biomedical
Informatics, 73, 14-29.
McCradden, M. D., Baba, A., Saha, A., Ahmad, S., Boparai, K., Fadaiefard, P., & Cusimano,
M. D. (2020). Ethical concerns around use of artificial intelligence in health care
research from the perspective of patients with meningioma, caregivers and health care
providers: a qualitative study. CMAJ Open, 8(1), E90–E95.
https://doi.org/10.9778/cmajo.20190151
Meystre, S. M., Savova, G. K., Kipper-Schuler, K. C., & Hurdle, J. F. (2008). Extracting
information from textual documents in the electronic health record: a review of recent
research. Yearbook of Medical Informatics, 128-144.
Murdoch, T. B., & Detsky, A. S. (2013). The inevitable application of big data to health care.
JAMA, 309(13), 1351-1352.
Nadeau, D., & Sekine, S. (2007). A survey of named entity recognition and classification.
Linguisticae Investigationes, 30(1), 3-26.
Nayel, H., & Shashirekha, H. L. (2017, December). Improving NER for clinical texts by
ensemble approach using segment representations. In Proceedings of the 14th
International Conference on Natural Language Processing (ICON-2017) (pp.
197-204).
Silvestri, S., Islam, S., Amelin, D., Weiler, G., Papastergiou, S., & Ciampi, M. (2023). Cyber
threat assessment and management for securing healthcare ecosystems using natural
language processing. International Journal of Information Security.
https://doi.org/10.1007/s10207-023-00769-w
Zhang, Y., Wang, X., Hou, Z., & Li, J. (2018). Clinical Named Entity Recognition From
Chinese Electronic Health Records via Machine Learning Methods. JMIR Medical
Informatics, 6(4), e50. https://doi.org/10.2196/medinform.9965