Papers by Menna Ibrahim Gabr

International Journal of Advanced Computer Science and Applications, 2022
Most real-world datasets are contaminated by quality issues that have a severe effect on analysis results. Duplication is one of the main quality issues that hinder these results. Different studies have tackled the duplication issue from different perspectives. However, the sensitivity of supervised and unsupervised learning models to different types of duplicates, deterministic and probabilistic, has not been broadly addressed. Furthermore, a simple metric is typically used to estimate the ratio of both types of duplicates, regardless of the probability with which a record is considered a duplicate. In this paper, the sensitivity of five classifiers and four clustering algorithms to deterministic and probabilistic duplicates at different ratios (0%-15%) is tracked. Five evaluation metrics are used to accurately track the changes in the sensitivity of each learning model: MCC, F1-score, accuracy, the average silhouette coefficient, and the Dunn index. In addition, a metric to measure the ratio of probabilistic duplicates within a dataset is introduced. The results demonstrate the effectiveness of the proposed metric in reflecting the ratio of probabilistic duplicates within the dataset. All learning models, both classification and clustering, are sensitive to the existence of duplicates, but to different degrees. RF and K-means are positively affected by duplicates, meaning that their performance increases as the percentage of duplicates increases. The remaining classifiers and clustering algorithms are negatively affected by duplicates, especially at high duplication ratios.
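
The experimental setup described above can be sketched roughly as follows. This is an illustrative example only, not the authors' code: it injects exact (deterministic) duplicates into a stand-in binary dataset at ratios from 0% to 15%, retrains a random forest, and reports MCC and accuracy. The dataset (iris reduced to two classes), the classifier, and the helper functions are assumptions for illustration.

    # Illustrative sketch (not the paper's code) of tracking sensitivity to duplicates
    library(randomForest)

    # MCC computed directly from a 2x2 confusion matrix
    mcc <- function(pred, truth) {
      cm <- table(pred, truth)
      tp <- cm[2, 2]; tn <- cm[1, 1]; fp <- cm[2, 1]; fn <- cm[1, 2]
      (tp * tn - fp * fn) /
        sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    }

    # Append exact copies of randomly chosen records at a given ratio
    inject_duplicates <- function(df, ratio) {
      n_dup <- round(nrow(df) * ratio)
      rbind(df, df[sample(nrow(df), n_dup, replace = TRUE), ])
    }

    data(iris)                                        # stand-in binary task
    iris2 <- droplevels(subset(iris, Species != "setosa"))

    set.seed(1)
    test_idx <- sample(nrow(iris2), 30)
    test     <- iris2[test_idx, ]
    train0   <- iris2[-test_idx, ]

    for (ratio in c(0, 0.05, 0.10, 0.15)) {           # duplicate ratios 0%-15%
      train <- inject_duplicates(train0, ratio)
      fit   <- randomForest(Species ~ ., data = train)
      pred  <- predict(fit, test)
      cat(sprintf("ratio %.2f  MCC %.3f  accuracy %.3f\n",
                  ratio, mcc(pred, test$Species), mean(pred == test$Species)))
    }
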
Big data and cognitive computing, Mar 22, 2023
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.

Data Quality Dimensions
Internet of Things—Applications and Future, 2020
Data quality dimension is a term used to identify a quality measure related to many data elements, including attributes, records, tables, and systems, or more abstract groupings such as a business unit, company, or product range. This paper presents a thorough analysis of three data quality dimensions: completeness, relevance, and duplication. It also covers the most commonly used techniques for each dimension. Regarding completeness, predictive value imputation, distribution-based imputation, KNN, and other methods are investigated. The relevance dimension is explored via the filter and wrapper approaches, rough set theory, hybrid feature selection, and other techniques. Duplication is investigated through many techniques, such as K-medoids, the standard duplicate elimination algorithm, online record matching, and sorted blocks.
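
As a rough illustration of the kinds of techniques surveyed, the sketch below applies one representative method per dimension to a small, made-up data frame: KNN imputation for completeness, a correlation-based filter ranking for relevance, and exact-duplicate elimination for duplication. The toy data and the package choice (VIM::kNN) are assumptions, not the paper's implementation.

    library(VIM)        # kNN() imputation for the completeness dimension

    df <- data.frame(
      age    = c(25, NA, 31, 40, 40),
      income = c(3200, 4100, NA, 5000, 5000),
      churn  = factor(c("no", "yes", "no", "yes", "yes"))
    )

    # Completeness: fill missing values from the k nearest neighbours
    df_complete <- kNN(df, k = 2, imp_var = FALSE)

    # Relevance (filter approach): rank numeric features by correlation with the target
    target <- as.numeric(df_complete$churn)
    sapply(df_complete[, c("age", "income")],
           function(x) abs(cor(x, target)))

    # Duplication: drop exact (deterministic) duplicate records
    df_dedup <- df_complete[!duplicated(df_complete), ]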

Future Computing and Informatics Journal, 2021
Achieving a high level of data quality is considered one of the most important assets for any small, medium, or large organization. Data quality is a central concern for both practitioners and researchers who deal with traditional or big data. The level of data quality is measured through several quality dimensions. A high percentage of current studies focuses on assessing and applying data quality to traditional data. As we are in the era of big data, attention should be paid to the tremendous volume of generated and processed data, of which 80% is unstructured. However, initiatives for creating big data quality evaluation models are still under development. This paper investigates the data quality dimensions most commonly used for both traditional and big data, in order to identify the metrics and techniques used to measure and handle each dimension. A complete definition for each traditional and big data quality dimension, its metrics, and handling techniques...
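
For a concrete sense of what a dimension-level metric can look like, here is a tiny hypothetical example that computes a completeness ratio (share of non-missing cells) and a uniqueness ratio (share of non-duplicate records) for an arbitrary data frame. These are common formulations, not necessarily the exact metrics catalogued in the paper.

    # Hypothetical per-dimension quality metrics for any data frame
    quality_profile <- function(df) {
      completeness <- sum(!is.na(df)) / (nrow(df) * ncol(df))   # non-missing cells / all cells
      uniqueness   <- 1 - sum(duplicated(df)) / nrow(df)        # 1 - duplicate rows / all rows
      c(completeness = completeness, uniqueness = uniqueness)
    }

    quality_profile(airquality)   # built-in dataset with missing values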

International Journal of Advanced Research in Computer Science and Software Engineering, 2017
Data classification is one of the most important tasks in data mining; it identifies the category to which a new observation belongs on the basis of a training set. Preparing data before doing any data mining is an essential step to ensure the quality of the mined data. There are different algorithms used to solve classification problems. In this research, four algorithms, namely support vector machine (SVM), C5.0, K-nearest neighbor (KNN), and Recursive Partitioning and Regression Trees (rpart), are compared before and after applying two feature selection techniques: wrapper and filter. This comparative study is implemented using the R programming language. A direct marketing campaign dataset from a banking institution is used to predict whether a client will subscribe to a term deposit. The dataset is composed of 4521 instances: 3521 instances (78%) as the training set and 1000 instances (22%) as the testing set. The results show that C5.0 is superior to the other algorithms before applying feature selection, and SVM is superior after applying feature selection.
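
A minimal sketch of this comparison in R is shown below, assuming the publicly available semicolon-separated bank marketing file (bank.csv, 4521 rows, target column y) and standard packages; the seed, split, and package choices are illustrative rather than the authors' exact setup, and the feature selection step is omitted.

    library(e1071)   # svm()
    library(C50)     # C5.0()
    library(class)   # knn()
    library(rpart)   # rpart()

    bank <- read.csv("bank.csv", sep = ";", stringsAsFactors = TRUE)

    set.seed(42)
    idx   <- sample(seq_len(nrow(bank)), 3521)   # 3521 training instances (~78%)
    train <- bank[idx, ]
    test  <- bank[-idx, ]

    # Fit three of the classifiers on the same training set
    svm_fit   <- svm(y ~ ., data = train)
    c50_fit   <- C5.0(y ~ ., data = train)
    rpart_fit <- rpart(y ~ ., data = train, method = "class")

    acc <- function(pred, truth) mean(pred == truth)
    acc(predict(svm_fit, test), test$y)
    acc(predict(c50_fit, test), test$y)
    acc(predict(rpart_fit, test, type = "class"), test$y)

    # KNN works on numeric predictors only, so use the numeric columns (unscaled, for brevity)
    num_cols <- sapply(bank, is.numeric)
    knn_pred <- knn(train[, num_cols], test[, num_cols], cl = train$y, k = 5)
    acc(knn_pred, test$y)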

Big Data and Cognitive Computing
Data completeness is one of the most common challenges that hinder the performance of data analytics platforms. Different studies have assessed the effect of missing values on different classification models based on a single evaluation metric, namely accuracy. However, accuracy on its own is a misleading measure of classifier performance because it does not account for unbalanced datasets. This paper presents an experimental study that assesses the effect of incomplete datasets on the performance of five classification models. The analysis was conducted with different ratios of missing values in six datasets that vary in size, type, and balance. Moreover, for unbiased analysis, the performance of the classifiers was measured using three different metrics, namely the Matthews correlation coefficient (MCC), the F1-score, and accuracy. The results show that the sensitivity of the supervised classifiers to missing data differs according to a set of factors. The most significant factor i...
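
To see why accuracy alone can mislead on an unbalanced dataset, consider a hypothetical confusion matrix in which a classifier labels almost everything as the majority class (the counts below are invented for illustration): accuracy looks excellent while the F1-score and MCC expose the poor minority-class performance.

    tp <- 2; fn <- 48; fp <- 2; tn <- 948        # invented counts, ~5% minority class

    accuracy  <- (tp + tn) / (tp + tn + fp + fn)                       # 0.95
    precision <- tp / (tp + fp)                                        # 0.50
    recall    <- tp / (tp + fn)                                        # 0.04
    f1        <- 2 * precision * recall / (precision + recall)         # ~0.07
    mcc       <- (tp * tn - fp * fn) /
      sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))              # ~0.13

    c(accuracy = accuracy, F1 = f1, MCC = mcc)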
