Papers by Mohamed Ben Aouicha

International Journal on Artificial Intelligence Tools
In the context of machine learning, an imbalanced classification problem refers to a dataset in which the classes are not evenly distributed. This problem commonly occurs when attempting to classify data in which the distribution of labels or classes is not uniform. Using resampling methods to add samples or entries to the minority class, or to drop those from the majority class, is widely considered the best solution to this problem. The focus of this study is to propose a framework pattern for handling any imbalanced dataset for fraud detection. For this purpose, undersampling (Random and NearMiss) and oversampling (Random, SMOTE, Borderline-SMOTE) were used as the resampling techniques at the centre of our experiments for balancing the evaluated dataset. For the first time, a large-scale unbalanced dataset collected from the Kaggle website was used to test both methods for detecting fraud in the Tunisian company for electricity and gas consumption. It was also evaluated with f...
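The random undersampling step mentioned in the abstract can be sketched in a few lines: drop majority-class samples until every class has as many entries as the minority class. This is a minimal illustration, not the paper's pipeline; real experiments would typically use a library such as imbalanced-learn, which also provides NearMiss and the SMOTE variants.

```python
import random
from collections import Counter

def random_undersample(X, y, seed=0):
    """Randomly drop majority-class samples until classes are balanced.

    A minimal sketch of Random Undersampling; `X` is a list of feature
    rows and `y` the matching list of class labels.
    """
    rng = random.Random(seed)
    target = min(Counter(y).values())      # size of the minority class
    kept, per_class = [], Counter()
    indices = list(range(len(y)))
    rng.shuffle(indices)                   # pick survivors at random
    for i in indices:
        if per_class[y[i]] < target:       # keep at most `target` per class
            per_class[y[i]] += 1
            kept.append(i)
    kept.sort()
    return [X[i] for i in kept], [y[i] for i in kept]

# Usage: 5 legitimate transactions (label 0) vs 2 fraudulent ones (label 1)
X = [[0], [1], [2], [3], [4], [5], [6]]
y = [0, 0, 0, 0, 0, 1, 1]
Xb, yb = random_undersample(X, y)
print(Counter(yb))  # both classes now have 2 samples
```

Oversampling is the mirror image: instead of discarding majority rows, minority rows are duplicated (Random) or synthesised by interpolation between neighbours (SMOTE).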
Zenodo (CERN European Organization for Nuclear Research), May 11, 2023

Online Information Review
Purpose: The intensive blooming of social media, specifically social networks, has pushed users to join more than one social network, and therefore many new “cross-network” scenarios have emerged, including cross-social-network content posting and recommendation systems. For this reason, it is highly necessary to identify implicit bridge users across social networks, known as the social network reconciliation problem, to deal with such scenarios. Design/methodology/approach: We propose the BUNet (Bridge Users for cross-social Networks analysis) dataset, built on the basis of a feature-based approach for identifying implicit bridge users across two popular social networks: Facebook and Twitter. The proposed approach leverages various similarity measures for identity matching. The Jaccard index is selected as the similarity measure, outperforming all the tested measures for computing the degree of similarity between the friend sets of two accounts of the same real person on two diffe...
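The Jaccard index that the abstract reports as the best-performing measure is simply the ratio of shared friends to total distinct friends. A minimal sketch, with made-up friend names for illustration:

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| between two friend sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0          # two empty sets carry no matching signal
    return len(a & b) / len(a | b)

# Usage: friend lists of one person's Facebook and Twitter accounts
fb_friends = {"ali", "leila", "sami", "nour"}
tw_friends = {"ali", "leila", "sami", "karim"}
print(jaccard(fb_friends, tw_friends))  # 3 shared / 5 distinct = 0.6
```

A higher score means the two accounts are more likely to belong to the same real person, which is what makes the index usable for identity matching.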

2017 IEEE/ACS 14th International Conference on Computer Systems and Applications (AICCSA)
Users are increasingly joining multiple online social networks simultaneously. The different accounts owned by the same user in multiple social networks are most of the time isolated from each other. Identifying the same user, the so-called i-bridge, across networks is an important task for many interesting inter-network applications such as viral marketing, presidential campaigns, product announcements, etc. In this paper, we tackle the problem of me edge identification across online social networks. For each user in the source network, we extract the set of similar accounts in the target network, and then a set of friends is extracted for each similar account in the target network. A comparison is then performed using similarity functions between the two sets of friends in the source and target networks to identify bridge users. Experiments are performed by extracting users from the two most popular social networks, Facebook and Twitter, and then extracting the list of friends for these users. The me edge link identification is then built through the exploitation of the friend sets assigned to user accounts in different social networks. Experiments on the two real-world social networks Facebook and Twitter yield a high identification rate of me edge links.
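The comparison step described above can be sketched as follows: given the friend set of a source-network account and the friend sets of candidate target-network accounts, pick the candidate with the highest similarity, subject to a threshold. The threshold value, the Jaccard choice of similarity function and the account names are all illustrative assumptions; the paper's actual candidate extraction step is not reproduced here.

```python
def jaccard(a, b):
    """Set similarity |A ∩ B| / |A ∪ B| (one of several usable functions)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def identify_me_edge(source_friends, candidates, threshold=0.3):
    """Return the target account whose friend set best matches the
    source account's, or None if no candidate reaches `threshold`.

    `candidates` maps each similar target-network account to its
    friend set.
    """
    best, best_score = None, 0.0
    for account, friends in candidates.items():
        score = jaccard(source_friends, friends)
        if score > best_score:
            best, best_score = account, score
    return (best, best_score) if best_score >= threshold else (None, best_score)

# Usage: one Facebook user's friends vs two candidate Twitter accounts
fb_friends = {"a", "b", "c", "d"}
candidates = {"@user1": {"a", "b", "c"}, "@user2": {"x", "y"}}
print(identify_me_edge(fb_friends, candidates))  # ('@user1', 0.75)
```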

Scientometrics
In this research letter, we build upon recent studies about the sleeping beauties awakened by the COVID-19 pandemic. We prove that a peak of citations for sleeping beauties is associated with a sharp increase in the number of citations received by their references. This demonstrates the existence of a cascading activation of citation-based sleeping beauties. Keywords: Sleeping beauties; References; Citation impact; Indirect impact. Sleeping beauties have always been one of the most challenging phenomena in scientometrics (Garfield, 1980; Haghani & Varamini, 2021; van Raan, 2021). Their citation patterns, characterized by a spooky delayed awakening, have urged multiple scientists to study the reasons behind late citation bursts (Garfield, 1980; Haghani & Varamini, 2021; van Raan, 2021). Such factors include the bibliographic features of sleeping beauties, such as keywords (Yang et al., 2022), as well as altmetric data like social media interactions (Hou & Yang, 2020). Delayed recognition of these research publications can also be due to their opposition to the mainstream knowledge of their field of interest or to the limited scientific rank of their authors when the research papers were published (Garfield, 1980). In this context, citation and co-citation data can bring insights into how an uncited paper can be transformed into a trendy one within a short period, particularly through the influence of a novel and trendy paper citing it (the so-called Prince) (Song et al., 2018). This is enabled thanks to machine learning models (Wang et al., 2021), network analysis (Song et al., 2018), the use of statistical metrics such as the Sleeping Beauty Index (Lin et al., 2022), and co-keyword analysis (Zhang et al., 2021).
During the COVID-19 pandemic, a notable number of sleeping beauties have been awakened due to the resurgence of several topics that had not been active in recent years and

Companion Proceedings of the Web Conference 2022
Semantic text annotations have been a key factor in supporting computer applications ranging from knowledge graph construction to biomedical question answering. In this systematic review, we provide an analysis of the data models that have been applied to semantic annotation projects for the scholarly publications available in the CORD-19 dataset, an open database of the full texts of scholarly publications about COVID-19. Based on Google Scholar and the screening of specific research venues, we retrieved seventeen publications on the topic, mostly from the United States of America. Subsequently, we outline and explain the inline semantic annotation models currently applied to the full texts of biomedical scholarly publications. Then, we discuss the data models currently used in semantic annotation projects on the CORD-19 dataset to provide interesting directions for the development of semantic annotation models and projects.
Online Information Review, May 4, 2022
Intelligent Systems Design and Applications, 2022

This dataset was created between 2017 and 2021 to provide a textual resource that can be used to study the behaviours of Tunisian people when writing Tunisian Arabic (ISO 639-3: aeb) in Latin script. This corpus consists of messages written using the Tunisian Arabic Chat Alphabet, or Arabizi, and was developed to address the lack of NLP databases about the use of Latin script for transcribing Tunisian Arabic. The messages are automatically pulled via web scraping of public Facebook pages and are kept as they are, without any annotation, spelling adjustments or morphological and syntactic labelling. Messages that are written in Latin script but not in Tunisian Arabic are then manually eliminated. Finally, every collection of messages retrieved from the same Facebook page in the same period is included in the same text file, where every message appears as one line.

This paper presents an information retrieval model for XML documents based on tree matching. Queries and documents are represented by extended trees. An extended tree is built from the original tree by adding weighted virtual links between each node and its indirect descendants, allowing each descendant to be reached directly. Therefore, only one level separates each node from its indirect descendants. This allows the user query and the document to be compared flexibly while respecting the structural constraints of the query. The content of each node is very important in deciding whether a document element is relevant or not, so the content should be taken into account in the retrieval process. We separate the structure-based and the content-based retrieval processes. The content-based score of each node is commonly based on the well-known Tf × Idf criterion. In this paper, we compare this criterion with another one we call Tf × Ief. The comparison...
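The two criteria can be contrasted in a small sketch. Idf is the standard inverse document frequency; the reading of Ief used below, inverse element frequency counted over individual XML elements instead of whole documents, is an assumption for illustration, not the paper's exact definition.

```python
import math

def tf(term, element_text):
    """Term frequency of `term` within one element's text."""
    words = element_text.lower().split()
    return words.count(term) / len(words) if words else 0.0

def idf(term, documents):
    """Inverse document frequency, counted over whole documents."""
    n = sum(1 for d in documents if term in d.lower().split())
    return math.log(len(documents) / n) if n else 0.0

def ief(term, elements):
    """Inverse *element* frequency: the same formula, but counted over
    individual XML elements (an assumed reading of the paper's Ief)."""
    n = sum(1 for e in elements if term in e.lower().split())
    return math.log(len(elements) / n) if n else 0.0

# Usage: three leaf elements from an illustrative XML collection
elements = ["xml retrieval model", "structured tree matching", "extended xml tree"]
score = tf("xml", elements[0]) * ief("xml", elements)
print(round(score, 3))
```

Because an element is much smaller than a document, Ief rewards terms that are concentrated in few elements even when they occur in every document of the collection.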
This dataset includes the biomedical abbreviations stated between parentheses in the titles of the scholarly publications indexed by PubMed. Each abbreviation is extracted using the parenthetic level count algorithm and is assigned to the title, PMID and year of publication of the corresponding research paper. In addition, every acronym is assigned its length and the number of upper- and lower-case letters it contains.
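The parenthetic level count idea can be sketched as a single pass over the title: increment a counter on "(", decrement on ")", and collect the characters seen while the counter is above zero. This is a minimal interpretation of the algorithm, handling nesting by emitting only top-level spans:

```python
def parenthetic_abbreviations(title):
    """Extract the text spans enclosed at the top parenthesis level.

    A one-pass depth counter: spans opened at depth 1 are collected;
    nested parentheses are kept verbatim inside their enclosing span.
    """
    out, depth, buf = [], 0, []
    for ch in title:
        if ch == "(":
            depth += 1
            if depth == 1:      # a new top-level span starts
                buf = []
                continue
        elif ch == ")":
            depth -= 1
            if depth == 0:      # the top-level span ends
                out.append("".join(buf))
                continue
        if depth >= 1:
            buf.append(ch)
    return out

# Usage on an illustrative title
title = "Coronavirus disease 2019 (COVID-19) and severe acute respiratory syndrome (SARS)"
print(parenthetic_abbreviations(title))  # ['COVID-19', 'SARS']
```

Filtering the extracted spans down to genuine abbreviations (rather than, say, parenthetic remarks) is a separate classification step.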

This image dataset has been derived from Wikimedia Commons (https://commons.wikimedia.org), a large-scale free and collaborative media repository currently including over 92 million items, including images, videos and recordings. The dataset includes images of several types of animals. The images are assigned to categories according to the Wikimedia Commons Category Graph and its direct links with the considered media. As an output of the development of this image dataset, we created three ZIP files that can be used to evaluate the effect of the semantic features of the considered labels on the efficiency of mono-label image classification algorithms: Class1 (the most general one): the labels are "Birds" and "mammals"; each category includes exactly 450 images. Class2: the labels are "Cat", "Cattle", "Columbidae", "Dog", "Phoenicopteridae", and "Psittacidae"; each category includes exactly 150 ima...

Journal of Information Science, 2021
In recent years, several infectious diseases have caused widespread nationwide epidemics that affected information-seeking behaviours, people's mobility, economics and research trends. Examples of these epidemics are the 2003 severe acute respiratory syndrome (SARS) epidemic in mainland China and Hong Kong, the 2014–2016 Ebola epidemic in Guinea and Sierra Leone, the 2015–2016 Zika epidemic in Brazil, Colombia and Puerto Rico, and the recent COVID-19 epidemic in China and other countries. In this research article, we investigate the effect of large-scale outbreaks of infectious diseases on the research productivity and landscape of nations through the analysis of the research outputs of the main countries affected by the SARS, Zika and Ebola epidemics, as returned by the Web of Science Core Collection. Despite the mobility restrictions and the limitations of work conditions due to the epidemics, we surprisingly found that the research characteristics and productivity of the countries that have excellent ...

Advances in Intelligent Systems and Computing, 2020
Co-citation analysis can be exploited as a bibliometric technique for mining information on the relationships between scientific papers. Proposed methods rely, however, on co-citation counting techniques that take the semantic aspect only slightly into consideration. The present study proposes a new technique based on measuring the Semantic Similarity (SS) between the titles of co-cited papers. Several computational measures rely on knowledge resources, such as the WordNet «is a» taxonomy, to quantify semantic similarity. Our proposal analyses the SS between the titles of co-cited papers using word-based SS measures. Two major analytical experiments are performed: the first includes the benchmarks designed for testing word-based SS measures; the second exploits the DBLP (Digital Bibliography & Library Project) citation network dataset. As a result, we found that the SS measures behave like human judgement of lexical similarity and can consequently be used for the automatic assessment of similarity between co-cited papers. The analysis of highly repeated co-citations demonstrates that the different SS measures display almost similar behaviours, with slight differences due to the distribution of the provided SS values. Furthermore, we note a low percentage of similar referred papers among the co-citations.
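A word-based title similarity of the kind described above typically aggregates a word-level measure: each word is matched to its best counterpart in the other title, and the two directions are averaged. The aggregation scheme below is a common stand-in, not the paper's exact formulation, and the exact-match word similarity is a toy placeholder for the WordNet-based measures it evaluates:

```python
def title_similarity(title_a, title_b, word_sim):
    """Lift a word-level similarity `word_sim(w1, w2)` to titles:
    best-match each word against the other title, average both ways."""
    wa, wb = title_a.lower().split(), title_b.lower().split()
    if not wa or not wb:
        return 0.0
    ab = sum(max(word_sim(x, y) for y in wb) for x in wa) / len(wa)
    ba = sum(max(word_sim(x, y) for x in wa) for y in wb) / len(wb)
    return (ab + ba) / 2

# Toy word similarity: exact match only; a real run would plug in a
# WordNet «is a» taxonomy-based measure here.
exact = lambda x, y: 1.0 if x == y else 0.0
print(title_similarity("semantic similarity measures",
                       "similarity measures survey", exact))
```

Swapping `word_sim` implementations is what allows the different SS measures to be compared on the same pairs of co-cited titles.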
Journal of the American Medical Informatics Association : JAMIA, 2021
This letter discusses the limitations of using filters to enhance the accuracy of the extraction of parenthetic abbreviations from scholarly publications. It proposes the use of the parentheses level count algorithm to efficiently extract entities between parentheses from raw texts, as well as machine learning-based supervised classification techniques for identifying biomedical abbreviations, in order to significantly reduce the removal of acronyms that include disallowed punctuation.

This dataset is a companion reproducibility package for the related paper submitted for publication, whose aim is to allow the exact replication of a very large experimental survey on word similarity between the families of ontology-based semantic similarity measures and word embedding models, as detailed in the ‘appendix-reproducible-experiments.pdf’ file. Our experiments are based on the evaluation of all methods with the HESML V1R4 semantic measures library and the recording of these experiments with ReproZip. HESML is a self-contained Java software library of semantic measures based on WordNet whose latest version, called HESML V1R4, also supports the evaluation of pre-trained word embedding files. HESML is a self-contained experimentation platform on word similarity which is especially well suited to running large experimental surveys, by supporting the execution of automatic reproducible experiment files on word similarity based on an XML-based file format called (*.exp). On the other han...

Trans. Comput. Collect. Intell., 2018
Word sense disambiguation (WSD) is the ability to identify the meaning of words in context in a computational manner. WSD is considered a task whose solution is at least as hard as the most difficult problems in artificial intelligence. It is used in applications such as information retrieval, machine translation and information extraction because of the semantic understanding it provides. This paper describes the proposed approach W3SD (this paper is an extended version of our work [4] published in the 8th International Conference on Computational Collective Intelligence), which is based on the words surrounding the polysemous word in a context. Each meaning of these words is represented by a vector composed of weighted nouns, using WordNet and Wiktionary features through the taxonomic information content from WordNet and the glosses from Wiktionary. The main emphasis of this paper is feature selection for disambiguation purposes. The assessment of WSD systems is discussed in the contex...
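The general idea of scoring senses by their gloss's fit with the surrounding context can be sketched with a simplified Lesk-style baseline. This is not the W3SD method itself, which weights nouns using WordNet information content and Wiktionary glosses; the senses and glosses below are illustrative:

```python
def lesk(context_words, sense_glosses):
    """Pick the sense whose gloss shares the most words with the
    context: a simplified Lesk baseline for gloss-based WSD."""
    context = set(w.lower() for w in context_words)
    best, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

# Usage: disambiguate "bank" from the words surrounding it
senses = {
    "bank#finance": "an institution that accepts deposits and lends money",
    "bank#river": "sloping land beside a body of water",
}
print(lesk(["deposits", "money", "interest"], senses))  # bank#finance
```

W3SD refines this kind of overlap by replacing raw gloss words with vectors of nouns weighted by taxonomic information content, so that semantically close but non-identical words still contribute to the score.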