Papers by Francesco Gargiulo

In finance, portfolio optimization aims at finding the optimal investments that maximize a trade-off between return and risk, given some constraints. Classical formulations of this quadratic optimization problem have exact or heuristic solutions, but their complexity scales up as the market dimension increases. Recently, researchers have been evaluating the possibility of addressing this complexity-scaling issue by employing quantum computing. In this paper, the problem is solved using the Variational Quantum Eigensolver (VQE), which is in principle very efficient. The main outcome of this work is the definition of the best hyperparameters to set in order to perform portfolio optimization by VQE on real quantum computers. In particular, a fairly general formulation of the constrained quadratic problem is considered, which is translated into Quadratic Unconstrained Binary Optimization (QUBO) by the binary encoding of variables and by including the constraints in the objective function. This is converted ...
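The QUBO construction described in the abstract can be sketched compactly. Below is a minimal illustration (toy data and illustrative parameter names, not the paper's code) of binary-encoding the asset weights and folding a budget constraint into the objective as a penalty term:

```python
# Minimal sketch: build a QUBO matrix Q from a constrained mean-variance
# problem; a VQE backend would then minimize x' Q x over binary x.
import numpy as np

n_assets, bits = 4, 3                        # 3 bits encode each asset weight
rng = np.random.default_rng(0)
mu = rng.random(n_assets)                    # toy expected returns
sigma = np.cov(rng.random((n_assets, 20)))   # toy covariance matrix
q, penalty, budget = 0.5, 10.0, 1.0          # risk aversion, penalty, budget

# Binary encoding: w_i = sum_k 2^k x_{i,k} / (2^bits - 1), i.e. w = E x.
E = np.kron(np.eye(n_assets), 2.0 ** np.arange(bits) / (2 ** bits - 1))
nvars = n_assets * bits

# Objective: q * w'Sigma w - mu'w + penalty * (1'w - budget)^2.
# Linear terms go on the diagonal, since x_i^2 = x_i for binary x;
# the constant penalty * budget^2 is dropped (it does not affect argmin).
Q = q * E.T @ sigma @ E - np.diag(E.T @ mu)
o = E.T @ np.ones(n_assets)
Q += penalty * (np.outer(o, o) - 2 * budget * np.diag(o))

def energy(b):  # brute-force check, feasible at this toy size
    x = np.array([(b >> i) & 1 for i in range(nvars)], dtype=float)
    return x @ Q @ x

best = min(range(2 ** nvars), key=energy)
```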

IEEE Access, 2021
In recent years, the need to de-identify privacy-sensitive information within Electronic Health Records (EHRs) has become increasingly pressing, in order to encourage the sharing and publication of their content in accordance with the restrictions imposed by both national and supranational privacy authorities. In the field of Natural Language Processing (NLP), several deep learning techniques for Named Entity Recognition (NER) have been applied to this issue, significantly improving the effectiveness of identifying sensitive information in EHRs written in English. However, the lack of data sets in other languages has strongly limited their applicability and performance evaluation. To this aim, a new de-identification data set in Italian has been developed in this work, starting from the 115 COVID-19 EHRs provided by the Italian Society of Radiology (SIRM): 65 were used for training and development, and the remaining 50 for testing. The data set was labelled following the guidelines of the i2b2 2014 de-identification track. As an additional contribution, a stacked word representation, not previously tested in the Italian clinical de-identification scenario, was combined with the best-performing Bi-LSTM + CRF sequence labelling architecture; it is based both on a contextualized language model, to manage word polysemy and its morpho-syntactic variations, and on sub-word embeddings, to better capture latent syntactic and semantic similarities. Finally, other cutting-edge approaches were compared with the proposed model, which achieved the best performance, confirming the validity of the proposed approach. Index Terms: clinical de-identification, contextualized embedding, deep learning, Italian language, named entity recognition.
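One plausible realization of such a stacked representation (contextual character-level embeddings plus sub-word embeddings on top of classical word vectors, feeding a Bi-LSTM + CRF tagger) uses the Flair library. The snippet below is a hedged sketch: file names, embedding choices, and hyperparameters are assumptions, not the paper's exact configuration:

```python
# Hedged sketch of a stacked-embedding Bi-LSTM + CRF tagger with Flair;
# corpus paths and hyperparameters are illustrative assumptions.
from flair.datasets import ColumnCorpus
from flair.embeddings import (BytePairEmbeddings, FlairEmbeddings,
                              StackedEmbeddings, WordEmbeddings)
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

corpus = ColumnCorpus('data/', {0: 'text', 1: 'ner'},
                      train_file='train.txt', test_file='test.txt')
tag_dict = corpus.make_label_dictionary(label_type='ner')

embeddings = StackedEmbeddings([
    WordEmbeddings('it'),            # classical Italian word vectors
    FlairEmbeddings('it-forward'),   # contextualized: handles polysemy
    FlairEmbeddings('it-backward'),
    BytePairEmbeddings('it'),        # sub-word level similarities
])

tagger = SequenceTagger(hidden_size=256, embeddings=embeddings,
                        tag_dictionary=tag_dict, tag_type='ner', use_crf=True)
ModelTrainer(tagger, corpus).train('model/', max_epochs=50)
```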
Identification of traffic flows hiding behind TCP port 80

Using the Dempster-Shafer theory for network traffic labelling
When addressing the problem of network intrusion detection by means of supervised machine learning techniques, it may be necessary to have some data available for training. Suitably labelled datasets, in fact, may be used to build a proper model of the network environment to protect. In this abstract we propose an architecture for automatically building such a traffic database. We attach to each packet a label from the set L = {Normal, Attack}, and possibly an attribute from a set A describing the specific type of attack detected. The elements of A depend both on the operational context and on the type of targeted attacks. The obtained dataset can be used, for example, to train a supervised multi-classifier system such as the one described in [2]. We want to automate the process of traffic dataset labelling, starting from raw tcpdump traffic traces; to this purpose, we propose a multistage system [3]. In the preliminary stages, we take advantage of some classifiers which do not require any training. This allows us to obtain a coarse-grained classification of the packets. By using the preliminary results, we can train some supervised classifiers, which can later be involved in the labelling process, contributing their classification capabilities. Each stage consists of some Intrusion Detection Modules, called respectively Base IDS (B-IDS) and Supervised IDS (S-IDS). In order to obtain the best from different classification techniques, we propose to combine them by using the well-known Dempster-Shafer theory [1]. According to this theory, each classifier involved in the final decision must be assigned a Basic Probability Assignment (BPA), describing the subjective degree of confidence attributed to it by means of prior hypotheses or observations. Typical examples of B-IDS are signature-based IDSs, such as Snort, or IDSs based on unsupervised techniques. During the operating phase, a bank of B-IDS starts analyzing offline the packets contained in a dumped traffic database. No prior knowledge about such traffic is needed. Each decision from each of the B-IDS is supported by its associated BPA, which expresses the degree of belief in its decision. BPAs can be assigned to a B-IDS by considering the specific category it belongs to (e.g. signature-based, anomaly-based, etc.); some general criteria to do so have been presented in [3]. BPAs are combined by means of the Dempster-Shafer combination rule, in order to obtain the overall degree of confidence attributed to the ensemble of classifiers. Since the confidence degree is a real number, we also defined a criterion for obtaining a crisp classification of ...
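As a concrete illustration of the combination step, here is a minimal, self-contained sketch of Dempster's rule over the frame {Normal, Attack}. The two example BPAs are invented for illustration and are not values from the paper:

```python
# Minimal sketch of Dempster's rule of combination; masses below are
# illustrative, and total conflict (K = 1) is assumed not to occur.
from itertools import product

FRAME = frozenset({'Normal', 'Attack'})

def combine(m1, m2):
    """Combine two BPAs, given as dicts mapping frozensets to masses."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb          # mass assigned to disjoint sets
    # Normalize by 1 - K, where K is the total conflicting mass.
    return {s: w / (1.0 - conflict) for s, w in combined.items()}

# Example: a signature-based B-IDS combined with an anomaly-based one.
snort_bpa = {frozenset({'Attack'}): 0.7, FRAME: 0.3}
anomaly_bpa = {frozenset({'Attack'}): 0.5, frozenset({'Normal'}): 0.2, FRAME: 0.3}
print(combine(snort_bpa, anomaly_bpa))
```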
Network Protocol Verification by a Classifier Selection Ensemble
Lecture Notes in Computer Science, 2009
Applied Sciences
In the last few years, the rapid growth in available digitised medical data has opened new challenges for the scientific research community in the healthcare informatics field [...]

Applied Sciences
The large availability of clinical natural language documents, such as clinical narratives or diagnoses, requires the definition of smart automatic systems for their processing and analysis; however, the lack of annotated corpora in the biomedical domain, especially in languages other than English, makes it difficult to exploit state-of-the-art machine-learning systems to extract information from such documents. For these reasons, healthcare professionals miss the significant opportunities that can arise from the analysis of these data. In this paper, we propose a methodology to reduce the manual effort needed to annotate a biomedical named entity recognition (B-NER) corpus, exploiting both active learning and distant supervision, respectively based on deep learning models (e.g., Bi-LSTM, word2vec FastText, ELMo and BERT) and biomedical knowledge bases, in order to speed up the annotation task and limit class imbalance issues. We assessed this approach by creating an Italian-language el...
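The active-learning side of such a pipeline can be illustrated with one pool-based uncertainty-sampling round. The classifier below is a simple stand-in (the paper uses deep models) and all names are illustrative:

```python
# Minimal sketch of pool-based active learning with least-confidence
# uncertainty sampling; model, data and batch size are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learning_round(clf, X_labeled, y_labeled, X_pool, batch=10):
    clf.fit(X_labeled, y_labeled)
    proba = clf.predict_proba(X_pool)
    # Least-confidence score: 1 - max class probability per sample.
    uncertainty = 1.0 - proba.max(axis=1)
    # Indices of the `batch` most uncertain samples to send to annotators.
    return np.argsort(uncertainty)[-batch:]

# e.g.: query = active_learning_round(LogisticRegression(max_iter=1000),
#                                     X_lab, y_lab, X_pool)
```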

Proceedings of the 10th International Joint Conference on Biomedical Engineering Systems and Technologies, 2017
In this paper we propose a novel approach to reduce the complexity of defining and implementing a medical document validation model. Usually, the conformance requirements for specifications are contained in documents written in natural language, and it is necessary to manually translate them into a software model for validation purposes. It would be very useful to extract and group the conformance rules that share a similar pattern, so as to reduce the manual effort needed to accomplish this task. We present an innovative clustering approach that automatically determines the optimal number of groups using an iterative method based on the evaluation of internal cluster measures. We show the application of this method on two case studies of the Italian specification of the conformance rules: i) the Patient Summary (Profilo Sanitario Sintetico) and ii) the Hospital Discharge Letter (Lettera di Dimissione Ospedaliera).
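The "iterate over the number of groups and score with an internal measure" idea can be sketched as follows; the silhouette coefficient and k-means are generic stand-ins, not necessarily the measures and algorithm used in the paper:

```python
# Minimal sketch: pick the number of clusters by scanning k and scoring
# each partition with an internal measure (here, the silhouette).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

def best_k_clustering(rules_text, k_range=range(2, 15)):
    X = TfidfVectorizer().fit_transform(rules_text)
    scored = []
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scored.append((silhouette_score(X, labels), k, labels))
    # Return (best score, best k, labels) for the highest-scoring k.
    return max(scored, key=lambda t: t[0])
```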

Proceedings of the 11th International Joint Conference on Biomedical Engineering Systems and Technologies, 2018
In this paper we present an analysis of the use of Deep Neural Networks for extreme multi-label and multi-class text classification. We consider two network models: the first is formed by a word embeddings (WEs) stage followed by two dense layers, hereinafter Dense; the second adds a convolution stage between the WEs and the dense layers, hereinafter CNN-Dense. We take into account classification problems characterized by different numbers of labels, ranging from the order of 10 to the order of 30,000, showing how the performance of the neural networks varies with the total number of labels and the average number of labels per sample, exploiting the hierarchical structure of the label space of the dataset used for the experimental assessment. It is worth noting that multi-label classification is a harder problem than multi-class classification, due to the variable number of labels associated with each sample. We also investigate the behaviour of the neural networks as a function of the training hyperparameters, analysing the link between them and the dataset complexity. All the results are evaluated using the PubMed scientific article collection as a test case.
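The two topologies can be sketched in a few lines of Keras. Vocabulary size, sequence length and layer widths below are illustrative placeholders, and a pooling step is assumed to flatten the embedding output:

```python
# Hedged Keras sketches of the Dense and CNN-Dense topologies compared
# in the paper; all sizes are illustrative assumptions.
from tensorflow.keras import layers, models

VOCAB, EMB_DIM, N_LABELS = 50_000, 128, 30_000

def dense_model():
    return models.Sequential([
        layers.Embedding(VOCAB, EMB_DIM),          # WEs stage
        layers.GlobalAveragePooling1D(),
        layers.Dense(512, activation='relu'),
        layers.Dense(N_LABELS, activation='sigmoid'),  # multi-label head
    ])

def cnn_dense_model():
    return models.Sequential([
        layers.Embedding(VOCAB, EMB_DIM),
        layers.Conv1D(256, 5, activation='relu'),  # convolution stage
        layers.GlobalMaxPooling1D(),
        layers.Dense(512, activation='relu'),
        layers.Dense(N_LABELS, activation='sigmoid'),
    ])
```

With sigmoid outputs and a binary cross-entropy loss, the same head serves both the multi-class and the multi-label settings.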

Pattern recognition methods offer the technological background for a variety of applications in a modern information society. They are, however, undermined by several kinds of "adversarial" misuse, such as email and web spam, attacks on computer networks, etc. A classical example of such an "adversarial" environment is the set of evasion techniques used in the generation of spam emails. Similar problems arise in web search (web spam) and malware analysis (obfuscation and polymorphism). The underlying problem is that pattern recognition techniques, as well as data analysis techniques in general, have not been designed to work in adversarial environments. This consideration raises the problem of defining a general framework that prevents this kind of evasion. In this thesis we propose some techniques for dealing with "adversarial" environments. We first present a novel multiple classifier systems approach, called SOCIAL, and then we show some methodologies applied to differ...
An RDF-Based Framework for Semantic Indexing of Web Pages
2013 IEEE Seventh International Conference on Semantic Computing, 2013
Managing very large amounts of digital documents efficiently and effectively requires the definition of indexes able to capture and express the documents' semantics. In this work, we propose an RDF-based framework for the semantic indexing of web pages based on their textual information. In particular, we propose to capture the semantic nature of a given document, commonly expressed in natural language, by retrieving a number of RDF triples, and to semantically index the documents on the basis of the meaning of the triples' elements (i.e. subject, verb, object). Preliminary experiments are reported to evaluate the proposed indexing strategy.
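A minimal sketch of triple-based indexing with rdflib follows; the namespaces, helper function and sample triple are hypothetical illustrations of the idea, not the paper's schema:

```python
# Hedged rdflib sketch: store extracted <subject, verb, object> triples
# and link subjects to the pages they appear in.
from rdflib import Graph, Namespace

EX = Namespace('http://example.org/terms/')    # assumed vocabulary
DOC = Namespace('http://example.org/doc/')     # assumed page namespace
g = Graph()

def index_assertion(doc_id, subj, verb, obj):
    """Store one extracted assertion and tie its subject to the page."""
    g.add((EX[subj], EX[verb], EX[obj]))
    g.add((EX[subj], EX['appearsIn'], DOC[doc_id]))

index_assertion('page42', 'aspirin', 'treats', 'headache')

# Retrieval: pages whose extracted triples mention a given subject.
for _, _, page in g.triples((EX['aspirin'], EX['appearsIn'], None)):
    print(page)
```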

Proceedings of the International Conference on Health Informatics, 2013
The adoption of services for automatic information management is one of the most interesting open problems in various professional and social fields. We focus on the health domain, characterized by the production of huge amounts of documents, in which the adoption of innovative systems for information management can significantly improve both the tasks performed by the actors involved and the quality of the health services offered. In this work we propose a methodology for automatic document categorization based on unsupervised learning techniques. We extracted both semantic and syntactic features in order to define the vector space models, and we propose the use of a clustering ensemble to increase the discriminative power of our approach. Results on real medical records, digitised by means of a state-of-the-art OCR technique, demonstrate the effectiveness of the proposed approach.
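The clustering-ensemble step can be illustrated with the classical co-association (evidence accumulation) scheme; this is a generic sketch and not necessarily the consensus function used in the paper:

```python
# Minimal sketch of a clustering ensemble via a co-association matrix:
# count how often each pair of documents lands in the same cluster
# across base partitions, then cluster that agreement matrix.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def consensus_labels(partitions, n_clusters):
    """partitions: list of label arrays from different base clusterings."""
    partitions = [np.asarray(p) for p in partitions]
    n = len(partitions[0])
    co = np.zeros((n, n))
    for labels in partitions:
        co += (labels[:, None] == labels[None, :])
    co /= len(partitions)
    # Cluster the (1 - co-association) distances with average linkage.
    dist = 1.0 - co[np.triu_indices(n, k=1)]   # condensed form for scipy
    return fcluster(linkage(dist, method='average'),
                    n_clusters, criterion='maxclust')
```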

Improving Biomedical Information Extraction with Word Embeddings Trained on Closed-Domain Corpora
2019 IEEE Symposium on Computers and Communications (ISCC), 2019
Named Entity Recognition (NER) systems allow complex concept extraction and text mining from natural language documents. Currently, NER systems based on Deep Learning (DL) approaches are able to reach state-of-the-art performance when applied to general-domain texts. On the other hand, the performance of these systems decreases when they are applied to texts that belong to specific domains, such as the biomedical one. In particular, Biomedical NER (B-NER) is a crucial task for the automatic analysis of medical documents, such as Electronic Health Records (EHRs), in order to support the work of physicians and researchers. Thus, new approaches are required to boost B-NER system performance. In this paper we analyze the behaviour of a B-NER DL architecture specifically devoted to Italian EHRs, focusing on the contribution of different Word Embedding (WE) models used as the input text representation layer. The achieved results show the substantial contribution of WEs trained on a closed-domain corpus.
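Training closed-domain WEs is straightforward with gensim's word2vec (4.x API); the two sample sentences and hyperparameters below are placeholders:

```python
# Hedged sketch: train skip-gram word2vec on tokenized EHR sentences.
from gensim.models import Word2Vec

ehr_sentences = [
    ['paziente', 'ricoverato', 'per', 'polmonite'],
    ['dimesso', 'con', 'terapia', 'antibiotica'],
]  # placeholder: a real closed-domain corpus has many more sentences

model = Word2Vec(
    sentences=ehr_sentences,
    vector_size=300,   # embedding dimension (illustrative)
    window=5,
    min_count=1,       # a real corpus would use a higher threshold
    sg=1,              # skip-gram, common for small domain corpora
    workers=4,
)
model.wv.save('ehr_we.kv')  # feed these vectors to the B-NER input layer
```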

In this paper we propose a methodology based on a complex deep learning network topology, named Hierarchical Deep Neural Network (HDNN), applied to the eXtreme Multi-label Text Classification (XMTC) problem. The HDNN topology reproduces the label hierarchy. The main idea arises directly from the assumption that, if the label-set structure is defined, forcing this information into the network topology could improve classification performance and the interpretability of results. In this way, we define a method to force prior knowledge into the DNN. We perform the experimental assessment on an XMTC task related to a real application domain, namely the automatic labelling of biomedical scientific literature extracted from PubMed. The obtained preliminary results show that, despite the very high computational time needed to update the network weights, a slight performance improvement is obtained with respect to a classical approach based on Convolutional Neural Networks. Some considerations will ...
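The idea of mirroring the label hierarchy in the topology can be sketched with a two-level Keras model in which the leaf head is conditioned on the parent head's predictions; the sizes and the conditioning scheme are illustrative assumptions, not the exact HDNN:

```python
# Hedged sketch of a two-level hierarchy-aware topology (functional API).
from tensorflow.keras import Input, Model, layers

SEQ_LEN, VOCAB, N_PARENTS, N_LEAVES = 400, 50_000, 100, 20_000

inp = Input(shape=(SEQ_LEN,))
x = layers.Embedding(VOCAB, 128)(inp)
x = layers.GlobalAveragePooling1D()(x)

parent = layers.Dense(N_PARENTS, activation='sigmoid', name='parents')(x)
# The leaf head sees the parent predictions, mirroring the hierarchy.
leaf_in = layers.Concatenate()([x, parent])
leaf = layers.Dense(N_LEAVES, activation='sigmoid', name='leaves')(leaf_in)

model = Model(inp, [parent, leaf])
model.compile(optimizer='adam', loss='binary_crossentropy')
```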

Exploit Multilingual Language Model at Scale for ICD-10 Clinical Text Classification
2020 IEEE Symposium on Computers and Communications (ISCC), 2020
The automatic ICD-10 classification of medical documents is still an unresolved issue, despite its crucial importance. The existence of machine learning approaches devoted to this task is in contrast with the lack of annotated resources, especially for languages other than English. Recent Transformer-based multilingual neural language models at scale have provided an innovative approach for dealing with cross-lingual Natural Language Processing tasks. In this paper, we present a preliminary evaluation of the Cross-lingual Language Model (XLM) architecture, a recent multilingual Transformer-based model presented in the literature, tested on the cross-lingual ICD-10 multi-label classification of short medical notes. In detail, we analysed the performance obtained by fine-tuning the XLM model on English-language training data and testing it on ICD-10 code prediction for an Italian test set. The obtained results show that the use of the novel XLM multilingual neural language architecture is very promising and can be very useful in the case of low-resource languages.
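The fine-tune-on-English / test-on-Italian setting can be sketched with the Hugging Face transformers library; the checkpoint name, label count and decision threshold below are assumptions, not necessarily the paper's setup:

```python
# Hedged sketch of cross-lingual multi-label inference with an XLM
# checkpoint; the fine-tuning loop on English notes is omitted.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CKPT, N_CODES = 'xlm-mlm-100-1280', 2000   # assumed checkpoint and label count

tok = AutoTokenizer.from_pretrained(CKPT)
model = AutoModelForSequenceClassification.from_pretrained(
    CKPT, num_labels=N_CODES, problem_type='multi_label_classification')

# After fine-tuning on English data, predict codes for an Italian note.
batch = tok(['frattura del femore sinistro'], return_tensors='pt',
            truncation=True, padding=True)
with torch.no_grad():
    probs = torch.sigmoid(model(**batch).logits)
pred_codes = (probs > 0.5).nonzero()        # assumed 0.5 threshold
```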
A Big Data architecture for knowledge discovery in PubMed articles
2017 IEEE Symposium on Computers and Communications (ISCC), 2017
The need for smart information retrieval systems is in contrast with the difficulty of dealing with huge amounts of data. In this paper we present a Big Data analytics architecture used to implement a semantic similarity search tool for natural language texts in the biomedical domain. The implemented methodology is based on Word Embedding (WE) models obtained using the word2vec algorithm. The system has been assessed with documents extracted from the whole PubMed library. A user-friendly web front-end, which allows the methodology to be assessed in a real context, is also presented.
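The core of such a similarity search can be sketched as mean-of-word-vectors document representations ranked by cosine similarity; the model file name is a placeholder, and the distributed Big Data machinery is omitted:

```python
# Minimal sketch of WE-based semantic similarity search (gensim 4.x).
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load('pubmed_we.kv')   # assumed pre-trained vectors

def doc_vector(tokens):
    """Represent a document as the mean of its in-vocabulary vectors."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def search(query_tokens, corpus_tokens, top_k=5):
    q = doc_vector(query_tokens)
    scores = [
        float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-9))
        for d in map(doc_vector, corpus_tokens)
    ]
    return np.argsort(scores)[::-1][:top_k]   # indices of the best matches
```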

Proceedings of the 8th International Workshop on Pattern Recognition in Information Systems, 2008
The presence of unsolicited bulk emails, commonly known as spam, can seriously compromise normal user activities, forcing users to navigate through their mailboxes to find the (relatively few) interesting emails. Even though a wide variety of spam filters has been developed so far, this problem is far from being resolved, since spammers continuously modify their malicious techniques in order to bypass filters. In particular, in recent years spammers have begun conveying unsolicited commercial messages by means of images attached to emails whose textual part appears perfectly legitimate. In this paper we present a method for overcoming some of the problems that remain with state-of-the-art spam filters when checking images attached to emails. Results on both personal and publicly available email databases are presented, in order to assess the performance of the proposed approach.

A Method for Topic Detection in Great Volumes of Data
Communications in Computer and Information Science, 2015
Topic extraction has become increasingly important due to its effectiveness in many tasks, including information filtering, information retrieval and the organization of document collections in digital libraries. Topic detection consists in finding the most significant topics within a document corpus. In this paper we explore the adoption of a feature reduction methodology to highlight the most significant topics within a document corpus. We use an approach based on a clustering algorithm (X-means) over the tf-idf matrix calculated from the corpus, which describes the frequency of the terms, represented by the columns, that occur in the documents, represented by the rows. To extract the topics, we build n binary problems, where n is the number of clusters produced by the unsupervised clustering approach, and we operate a supervised feature selection over them, considering the top features as the topic descriptors. We show the results obtained on two different corpora, both in Italian: the first consists of documents of the University of Naples Federico II, the second is a collection of medical records.
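A compact sketch of this pipeline (tf-idf, clustering, then one binary feature-selection problem per cluster) follows; k-means and the chi-squared score are stand-ins for X-means and the paper's specific supervised selector:

```python
# Minimal sketch: cluster the tf-idf matrix, then solve one one-vs-rest
# feature-selection problem per cluster to get topic descriptors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

def topic_descriptors(docs, n_clusters=5, top=10):
    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)               # documents x terms
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    terms = np.array(vec.get_feature_names_out())
    topics = []
    for c in range(n_clusters):
        scores, _ = chi2(X, labels == c)      # binary problem: c vs rest
        topics.append(terms[np.argsort(scores)[-top:]])
    return topics
```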
A Topic Detection Method for high dimensional datasets

A Multimedia Summarizer Integrating Text and Images
Smart Innovation, Systems and Technologies, 2015
We present a multimedia summarizer system for retrieving relevant information from web repositories, based on the extraction of semantic descriptors of documents. In particular, the semantics attached to each document's textual sentences is expressed as a set of assertions in the ⟨subject, verb, object⟩ shape, as in the RDF data model, while the images' semantics is captured using a set of keywords derived from high-level information such as the related title, description and tags. We leverage an unsupervised clustering algorithm exploiting the notion of semantic similarity and use the centroids of the clusters to determine the most significant summary sentences. At the same time, several images are attached to each cluster on the basis of keyword term frequency. Finally, several experiments are presented and discussed.
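The centroid-based sentence selection can be illustrated with a minimal sketch; tf-idf vectors stand in here for the semantic sentence descriptors actually used:

```python
# Minimal sketch: cluster sentence vectors, then pick the sentence
# closest to each cluster centroid as a summary sentence.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def summarize(sentences, n_clusters=3):
    X = TfidfVectorizer().fit_transform(sentences).toarray()
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    summary = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1)
        summary.append(sentences[members[np.argmin(dists)]])
    return summary
```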