Papers by Marco Basaldella

arXiv (Cornell University), Mar 29, 2024
Large Language Models (LLMs) have demonstrated remarkable capability in a variety of NLP tasks. However, LLMs are also prone to generating nonfactual content. Uncertainty Quantification (UQ) is pivotal in enhancing our understanding of a model's confidence in its generations, thereby aiding in the mitigation of nonfactual outputs. Existing research on UQ predominantly targets short text generation, typically yielding brief, word-limited responses. However, real-world applications frequently necessitate much longer responses. Our study first highlights the limitations of current UQ methods in handling long text generation. We then introduce LUQ with its two variations, LUQ-ATOMIC and LUQ-PAIR, a series of novel sampling-based UQ approaches specifically designed for long text. Our findings reveal that LUQ outperforms existing baseline methods in correlating with the model's factuality scores (a negative coefficient of -0.85 observed for Gemini Pro). To further improve the factuality of LLM responses, we propose LUQ-ENSEMBLE, a method that ensembles responses from multiple models and selects the response with the lowest uncertainty. The ensembling method greatly improves response factuality over the best standalone LLM.
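The sampling-based idea behind LUQ can be sketched as follows. This is an illustrative approximation, not the paper's implementation: LUQ checks each sentence of a long response for consistency against sampled responses, but here a simple token-overlap measure stands in for the consistency model, and all function names are hypothetical.

```python
def _support(sentence, sample):
    """Cheap stand-in for a learned consistency score: fraction of the
    sentence's tokens that also occur in a sampled response."""
    toks = set(sentence.lower().split())
    return len(toks & set(sample.lower().split())) / len(toks) if toks else 0.0

def luq_uncertainty(sentences, sampled_responses):
    """Average sentence-level disagreement with the samples: values near
    1.0 mean the samples rarely support the response's claims."""
    per_sentence = [
        1.0 - sum(_support(s, r) for r in sampled_responses) / len(sampled_responses)
        for s in sentences
    ]
    return sum(per_sentence) / len(per_sentence)

def luq_ensemble(candidates):
    """LUQ-ENSEMBLE idea: among (sentences, samples) pairs from several
    models, return the index of the least uncertain response."""
    return min(range(len(candidates)),
               key=lambda i: luq_uncertainty(*candidates[i]))
```

A claim the samples agree with scores as less uncertain than one they contradict, which is the signal the ensemble selects on.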

arXiv (Cornell University), Oct 7, 2020
Whilst there has been growing progress in Entity Linking (EL) for general language, existing datasets fail to address the complex nature of health terminology in layman's language. Meanwhile, there is a growing need for applications that can understand the public's voice in the health domain. To address this we introduce a new corpus called COMETA, consisting of 20k English biomedical entity mentions from Reddit expert-annotated with links to SNOMED CT, a widely-used medical knowledge graph. Our corpus satisfies a combination of desirable properties, from scale and coverage to diversity and quality, that to the best of our knowledge has not been met by any of the existing resources in the field. Through benchmark experiments on 20 EL baselines, from string-based to neural models, we shed light on the ability of these systems to perform complex inference on entities and concepts under two challenging evaluation scenarios. Our experimental results on COMETA illustrate that no golden bullet exists and that even the best mainstream techniques still have a significant performance gap to fill, while the best solution relies on combining different views of the data.

arXiv (Cornell University), Oct 22, 2020
Despite the widespread success of self-supervised learning via masked language models (MLM), accurately capturing fine-grained semantic relationships in the biomedical domain remains a challenge. This is of paramount importance for entity-level tasks such as entity linking, where the ability to model entity relations (especially synonymy) is pivotal. To address this challenge, we propose SAPBERT, a pretraining scheme that self-aligns the representation space of biomedical entities. We design a scalable metric learning framework that can leverage UMLS, a massive collection of biomedical ontologies with 4M+ concepts. In contrast with previous pipeline-based hybrid systems, SAPBERT offers an elegant one-model-for-all solution to the problem of medical entity linking (MEL), achieving a new state-of-the-art (SOTA) on six MEL benchmarking datasets. In the scientific domain, we achieve SOTA even without task-specific supervision. With substantial improvements over various domain-specific pretrained MLMs such as BIOBERT, SCIBERT and PUBMEDBERT, our pretraining scheme proves to be both effective and robust.
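The self-alignment objective can be illustrated with a toy metric-learning loss. This is a hedged sketch, not SAPBERT's actual objective (the paper uses a multi-similarity loss with online hard pair mining over UMLS synonyms); here a hard-mined triplet loss on plain Python vectors conveys the idea of pulling synonyms together and pushing non-synonyms apart.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def hard_triplet_loss(embs, labels, margin=0.2):
    """For each anchor, take the hardest positive (least similar synonym)
    and the hardest negative (most similar non-synonym), then apply a
    hinge: loss is zero once synonyms are closer by at least the margin."""
    loss, n = 0.0, 0
    for i, (e, y) in enumerate(zip(embs, labels)):
        pos = [cosine(e, embs[j]) for j in range(len(embs))
               if j != i and labels[j] == y]
        neg = [cosine(e, embs[j]) for j in range(len(embs)) if labels[j] != y]
        if pos and neg:
            loss += max(0.0, margin + max(neg) - min(pos))
            n += 1
    return loss / n if n else 0.0
```

A representation space where each concept's synonyms cluster tightly incurs zero loss; a space that mixes concepts is penalized, which is the self-alignment pressure in miniature.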
This short paper briefly presents an efficient implementation of a named entity recognition system for biomedical entities, which is also available as a web service. The approach is based on a dictionary-based entity recognizer combined with a machine-learning classifier that acts as a filter. We evaluated the efficiency of the approach through participation in the TIPS challenge (BioCreative V.5), where it obtained the best results among participating systems. We separately evaluated the quality of entity recognition and linking using a manually annotated corpus as a reference (CRAFT), where we obtained state-of-the-art results.
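The two-stage design described above, high-recall dictionary lookup followed by a high-precision learned filter, can be sketched as follows. The span-length limit, lexicon entries, and the stub classifier predicate are illustrative assumptions, not the system's actual components.

```python
def dictionary_candidates(tokens, lexicon):
    """Stage 1: greedy longest-match lookup of token spans (here capped at
    4 tokens, an assumption) in a term dictionary mapping surface forms to
    concept identifiers."""
    hits, i = [], 0
    while i < len(tokens):
        for j in range(min(len(tokens), i + 4), i, -1):
            span = " ".join(tokens[i:j]).lower()
            if span in lexicon:
                hits.append((i, j, lexicon[span]))
                i = j
                break
        else:
            i += 1
    return hits

def filter_candidates(hits, tokens, is_entity):
    """Stage 2: a high-precision classifier (stubbed here as a predicate)
    discards spurious dictionary hits."""
    return [h for h in hits if is_entity(" ".join(tokens[h[0]:h[1]]))]
```

The dictionary stage deliberately over-generates (e.g. matching the auxiliary "may" against a term list), and the filter restores precision.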

Companion Proceedings of The Web Conference 2018 (WWW '18)
In this paper we explore possible drawbacks in the use of wearable sensors, i.e., wearable devices used to detect different kinds of activity, from step and calorie counting to heart rate and sleep monitoring. These technologies, which in recent years have witnessed rapid development in terms of accuracy and diffusion, are now available on different platforms at reasonable prices and can lead to healthier behavior in the people using them. Nevertheless, we investigate potentially harmful behaviors related to these devices. We provide different scenarios in which wearable sensors, in connection with social media, data mining, or other technologies, could prove harmful for their users.
Conference of the European Chapter of the Association for Computational Linguistics, Apr 1, 2021
Cross-target generalization constitutes an important issue for news Stance Detection (SD). In this short paper, we investigate adversarial cross-genre SD, where knowledge from annotated user-generated data is leveraged to improve news SD on targets unseen during training. We implement a BERT-based adversarial network and show experimental performance improvements over a set of strong baselines. Given the abundance of user-generated data, which are considerably less expensive to retrieve and annotate than news articles, this constitutes a promising research direction.
The Digital Libraries community has shown a growing interest in Semantic Search technologies in recent years. Content analysis and annotation is a vital task, but for large corpora it is not feasible to do it manually. Several automatic tools are available, but such tools usually provide few tuning possibilities and do not support integration with different systems. Search and adaptation technologies, on the other hand, are becoming increasingly multi-lingual and cross-domain to tackle the continuous growth of the available information. We therefore claim that to tackle such criticalities a more systematic and flexible approach, such as the use of a framework, is needed. In this paper we present a novel framework for Knowledge Extraction, whose main goal is to support the development of new applications and to ease the integration of…
This paper presents an approach to high-performance extraction of biomedical entities from the literature, built by combining a high-recall dictionary-based technique with a high-precision machine learning filtering step. The technique is then evaluated on the CRAFT corpus. We present the performance we obtained, analyze the errors, and propose a possible follow-up to this work.
In this paper we analyze the effectiveness of using linguistic knowledge from coreference and anaphora resolution to improve the performance of supervised keyphrase extraction. In order to verify the impact of these features, we define a baseline keyphrase extraction system and evaluate its performance on a standard dataset using different machine learning algorithms. Then, we consider new sets of features by adding combinations of the linguistic features we propose, and we evaluate the new performance of the system. We also use anaphora and coreference resolution to transform the documents, trying to simulate the cohesion process performed by the human mind. We found that our approach has a slightly positive impact on the performance of automatic keyphrase extraction, in particular when considering the ranking of the results.
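The document-transformation idea, letting anaphoric mentions count toward a candidate phrase's frequency once pronouns are replaced by their antecedents, can be sketched as follows. This is a minimal illustration under assumed whitespace tokenization; it is not the feature set or scoring used in the paper.

```python
def phrase_count(tokens, phrase):
    """Count occurrences of a multi-word phrase in a tokenized document."""
    words = phrase.lower().split()
    n = len(words)
    return sum(1 for i in range(len(tokens) - n + 1)
               if [t.lower() for t in tokens[i:i + n]] == words)

def rank_candidates(candidates, resolved_tokens):
    """Rank candidate keyphrases by frequency in the coreference-resolved
    text, where pronouns have been substituted with their antecedents, so
    mentions via 'it'/'they' boost the phrase's score."""
    return sorted(candidates,
                  key=lambda c: phrase_count(resolved_tokens, c),
                  reverse=True)
```

On the raw text an entity mentioned once by name and once by pronoun looks infrequent; after resolution its frequency reflects both mentions.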

Crowdsourcing has become an alternative approach for collecting relevance judgments at scale, thanks to the availability of crowdsourcing platforms and quality control techniques that make it possible to obtain reliable results. Previous work has used crowdsourcing to ask multiple crowd workers to judge the relevance of a document with respect to a query, and has studied how best to aggregate multiple judgments of the same topic-document pair. This paper addresses an aspect that has been rather overlooked so far: we study how the time available to express a relevance judgment affects its quality. We also discuss the quality loss of making crowdsourced relevance judgments more efficient in terms of time taken to judge the relevance of a document. We use standard test collections to run a battery of experiments on the crowdsourcing platform CrowdFlower, studying how much time crowd workers need to judge the relevance of a document and what the effect of reducing the available time to judge is on the ov…
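For the aggregation of multiple judgments mentioned above, a common baseline is a per-pair majority vote. The tie-breaking rule below (toward the higher relevance grade) is an assumption made for this sketch, not necessarily the paper's aggregation method.

```python
from collections import Counter

def aggregate_judgments(labels):
    """Aggregate the crowd labels collected for one topic-document pair
    by majority vote; ties break toward the higher relevance grade
    (an illustrative assumption)."""
    counts = Counter(labels)
    # Sort by vote count first, then by grade, and take the winner.
    return max(counts.items(), key=lambda kv: (kv[1], kv[0]))[0]
```

With graded labels (0 = not relevant, 1 = relevant), three workers voting [0, 1, 1] aggregate to relevant.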

RANLP 2017 - Recent Advances in Natural Language Processing Meet Deep Learning
This paper evaluates different techniques for building a supervised, multi-language keyphrase extraction pipeline for languages which lack a gold standard. Starting from an unsupervised English keyphrase extraction pipeline, we implement pipelines for Arabic, Italian, Portuguese, and Romanian, and we build test collections for the languages which lack one. Then, we add a Machine Learning module trained on a well-known English language corpus and evaluate its performance not only on English but on the other languages as well. Finally, we repeat the same evaluation after training the pipeline on an Arabic language corpus to check whether using a language-specific corpus brings a further improvement in performance. On the five languages we analyzed, results show an improvement in performance when using a machine learning algorithm, even when that algorithm is not trained and tested on the same language.
Shut Up and Run: the Never-ending Quest for Social Fitness
In 2015, we introduced a novel knowledge extraction framework called the Distiller Framework, with the goal of offering the research community a flexible, multilingual information extraction framework [3]. Two years later, the project has significantly evolved, now supporting more languages and many machine learning algorithms. In this paper we present the current design of the framework and some of its applications.
Self-Alignment Pretraining for Biomedical Entity Representations
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Natural Language Processing for Achieving Sustainable Development: the Case of Neural Labelling to Enhance Community Profiling
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
In recent years, there has been increasing interest in the application of Artificial Intelligence, and especially Machine Learning, to the field of Sustainable Development (SD). However, until now, NLP has not been systematically applied in this context. In this paper, we show the high potential of NLP to enhance project sustainability. In particular, we focus on the case of community profiling in developing countries, where, in contrast to the developed world, a notable data gap exists. Here, NLP could help to address the cost and time barrier of structuring qualitative data that prohibits its widespread use and associated benefits. We propose the new extreme multi-class multi-label Automatic User-Perceived Value classification task. We release Stories2Insights (S2I), an expert-annotated dataset of interviews carried out in Uganda, provide a detailed corpus analysis, and implement a number of strong neural baselines to address the task. Experimental results show that the problem is challenging and leaves considerable room for future research at the intersection of NLP and SD.
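Multi-label tasks like the proposed classification task are commonly scored with micro-averaged F1 over per-instance label sets; a minimal sketch of that metric follows (the choice of metric here is an assumption for illustration, not something stated in the abstract).

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over parallel lists of gold and predicted label
    sets: pool true positives, false positives, and false negatives
    across all instances before computing precision and recall."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

Micro-averaging weights frequent labels more heavily, which matters in extreme multi-label settings where the label distribution is long-tailed.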
COMETA: A Corpus for Medical Entity Linking in the Social Media
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)
Word embeddings, in their different shapes and evolutions, have changed the natural language processing research landscape in recent years. The biomedical text processing field is no stranger to this revolution; however, researchers in the field have largely trained their embeddings on scientific documents, even when working on user-generated data. In this paper we show how training embeddings on a corpus of user-generated text from medical forums heavily influences performance on downstream tasks, outperforming embeddings trained on either general-purpose data or scientific papers when applied to user-generated content.