Papers by Mohamed Eltabakh

arXiv (Cornell University), Mar 29, 2023
Can foundation models (such as ChatGPT) clean your data? In this proposal, we demonstrate that indeed ChatGPT can assist in data cleaning by suggesting corrections for specific cells in a data table (scenario 1). However, ChatGPT may struggle with datasets it has never encountered before (e.g., local enterprise data) or when the user requires an explanation of the source of the suggested clean values. To address these issues, we developed a retrieval-based method that complements ChatGPT's power with a user-provided data lake. The data lake is first indexed; we then retrieve the top-k relevant tuples for the user's query tuple and finally leverage ChatGPT to infer the correct value (scenario 2). Nevertheless, sharing enterprise data with ChatGPT, an externally hosted model, might not be feasible for privacy reasons. To assist with this scenario, we developed a custom RoBERTa-based foundation model that can be locally deployed. By fine-tuning it on a small number of examples, it can effectively make value inferences based on the retrieved tuples (scenario 3). Our proposed system, RetClean, seamlessly supports all three scenarios and provides a user-friendly GUI that enables the VLDB audience to explore and experiment with the system.
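
The retrieval step of scenario 2 can be illustrated with a minimal Python sketch. It is not RetClean's implementation: the token-overlap index, the function names, the toy data lake, and the majority vote that stands in for the model call are all assumptions made for illustration. It only shows the shape of the pipeline described above (index the data lake, retrieve the most relevant tuples, infer the missing value and keep its source tuples).

```python
from collections import Counter


def tokens(row):
    """Lowercased word tokens across all non-missing values of a tuple."""
    return {tok for value in row.values() if value is not None
            for tok in str(value).lower().split()}


def build_index(data_lake):
    """Inverted index: token -> positions of the tuples that contain it."""
    index = {}
    for pos, row in enumerate(data_lake):
        for tok in tokens(row):
            index.setdefault(tok, set()).add(pos)
    return index


def retrieve(query, data_lake, index, k=3):
    """Return up to k tuples sharing the most tokens with the query tuple."""
    overlap = Counter()
    for tok in tokens(query):
        for pos in index.get(tok, ()):
            overlap[pos] += 1
    return [data_lake[pos] for pos, _ in overlap.most_common(k)]


def infer_value(target, retrieved):
    """Stand-in for the model call (assumption): majority vote on `target`.

    Returns (value, supporting_tuples) so that a suggestion can be traced
    back to the tuples it came from.
    """
    candidates = [r for r in retrieved if r.get(target) is not None]
    if not candidates:
        return None, []
    value, _ = Counter(r[target] for r in candidates).most_common(1)[0]
    return value, [r for r in candidates if r[target] == value]


# Hypothetical data lake and dirty tuple with a missing "state" cell.
data_lake = [
    {"name": "Alice Smith", "city": "Boston", "state": "MA"},
    {"name": "Bob Jones", "city": "Boston", "state": "MA"},
    {"name": "Carol White", "city": "Austin", "state": "TX"},
]
index = build_index(data_lake)
query = {"name": "Dan Brown", "city": "Boston", "state": None}
value, sources = infer_value("state", retrieve(query, data_lake, index))
```

Returning the supporting tuples alongside the value reflects the abstract's point that users may need to know where a suggested clean value came from.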

at addressing recent research results and forecasting challenges on selected topics related to communications, computation, networks and technologies. Considering the importance of innovative topics in today's technology-driven society, there is a paradigm shift in classical-by-now approaches, such as networking, communications, resource sharing, collaboration and telecommunications. Recent achievements demand rethinking available technologies and considering the emerging ones. The conference had the following tracks: Networking; Mobility and Ubiquity; Security, Trust, and Privacy. We take here the opportunity to warmly thank all the members of the INNOV 2016 technical program committee, as well as the numerous reviewers. The creation of such a high quality conference program would not have been possible without their involvement. We also kindly thank all the authors who dedicated much of their time and effort to contribute to INNOV 2016. We truly believe that, thanks to all t...

a series of events addressing recent research results and forecasting challenges on selected topics. Considering the importance of innovative topics in today's technology-driven society, there is a paradigm shift in classical-by-now approaches, such as networking, communications, resource sharing, collaboration and telecommunications. Recent achievements demand rethinking available technologies and considering the emerging ones. We take here the opportunity to warmly thank all the members of the INNOV 2013 Technical Program Committee, as well as the numerous reviewers. The creation of such a high quality conference program would not have been possible without their involvement. We also kindly thank all the authors who dedicated much of their time and efforts to contribute to INNOV 2013. We truly believe that, thanks to all these efforts, the final conference program consisted of top quality contributions. Also, this event could not have been a reality without the support of many...

2017 IEEE 33rd International Conference on Data Engineering (ICDE), 2017
As datasets increase radically in size, highly scalable algorithms leveraging modern distributed infrastructures need to be developed for detecting outliers in massive datasets. In this work, we present the first distributed distance-based outlier detection approach using the MapReduce-based infrastructure, called DOD. DOD features a single-pass execution framework that minimizes communication overhead. Furthermore, DOD overturns two fundamental assumptions widely adopted in the distributed analytics literature, namely cardinality-based load balancing and one algorithm for all data. The multi-tactic strategy of DOD achieves a truly balanced workload by taking the data characteristics into account during data partitioning, and it assigns the most appropriate algorithm to each partition based on our theoretical cost models established for distinct classes of detection algorithms. Thus, DOD effectively minimizes the end-to-end execution time. Our experimental study confirms the efficiency of DOD and its scalability to terabytes of data, beating the baseline solutions by a factor of 20x.
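
For readers unfamiliar with distance-based outliers, the following Python sketch shows a single-machine, nested-loop baseline. It assumes one common definition (a point is an outlier if fewer than k other points lie within distance r of it); the abstract does not spell out the definition DOD uses, and this sketch is not DOD's distributed, multi-tactic algorithm. The function name and sample points are hypothetical.

```python
import math


def distance_outliers(points, r, k):
    """Return the points that have fewer than k neighbors within distance r.

    Nested-loop baseline: O(n^2) distance computations on one machine.
    """
    outliers = []
    for i, p in enumerate(points):
        neighbors = 0
        for j, q in enumerate(points):
            if i != j and math.dist(p, q) <= r:
                neighbors += 1
                if neighbors >= k:
                    break  # p already has k neighbors, so it is an inlier
        if neighbors < k:
            outliers.append(p)
    return outliers


# Hypothetical data: a tight cluster near the origin plus one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
result = distance_outliers(points, r=0.5, k=2)
```

The cost of the inner loop depends on how dense the data is around each point, which gives some intuition for why the abstract argues against balancing load by partition cardinality alone.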

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015
Annotation management and data curation have been extensively studied in the context of relational databases. However, existing annotation management techniques share a common limitation, which is that they are all passive engines, i.e., they only manage the annotations obtained from external sources such as DB admins, domain experts, and curation tools. They neither learn from the available annotations nor exploit the annotations-to-data correlations to further enhance the quality of the annotated database. Delegating such crucial and complex tasks to end-users, especially under large-scale databases and annotation sets, is clearly the wrong choice. In this paper, we propose the Nebula system, an advanced and proactive annotation management engine in relational databases. Nebula complements the state-of-the-art techniques in annotation management by learning from the available annotations, analyzing their content and semantics, and understanding their correlations with the data. Nebula then proactively discovers and recommends potentially missing annotation-to-data attachments. We propose context-aware ranking and prioritization of the discovered attachments that take into account the relationships among the data tuples and their annotations. We also propose approximation techniques and expert-enabled verification mechanisms that adaptively maintain high-accuracy predictions while minimizing the experts' involvement. Nebula is realized on top of an existing annotation management engine, and experimentally evaluated to illustrate the effectiveness of the proposed techniques, and to demonstrate the potential gain in enhancing the quality of annotated databases.

Proceedings of the VLDB Endowment, 2015
With the increasing complexity of data-intensive MapReduce workloads, Hadoop must often accommodate hundreds or even thousands of recurring analytics queries that periodically execute over frequently updated datasets, e.g., latest stock transactions, new log files, or recent news feeds. For many applications, such recurring queries come with user-specified service-level agreements (SLAs), commonly expressed as the maximum allowed latency for producing results before their merits decay. The recurring nature of these emerging workloads combined with their SLA constraints makes it challenging to share and optimize their execution. While some recent efforts on multi-job optimization in MapReduce have emerged, they focus only on sharing work among ad-hoc jobs on static datasets. Unfortunately, these sharing techniques neither take the recurring nature of the queries into account nor guarantee the satisfaction of the SLA requirements. In this work, we propose the first scalable multi-query...
22nd International Conference on Data Engineering (ICDE'06), 2006
19th International Conference on Scientific and Statistical Database Management (SSDBM 2007), 2007

Proceedings of the 22nd ACM International Conference on Information & Knowledge Management - CIKM '13, 2013
In this paper, we demonstrate the FusionDB system, an extended relational database engine for managing conflicts in small-science databases. In small sciences, groups, each consisting of a few scientists, may share and exchange parts of their own databases among each other to foster collaboration. The goal of such sharing, especially when done at early stages of the discovery process, is not to build a warehouse or a unified schema; instead, the goal is to compare and verify results, detect and assess conflicts, and possibly modify or redesign the discovery process. FusionDB is designed to meet the requirements and address the challenges of such a sharing model. We will demonstrate the key functionalities of FusionDB including: (1) Detecting conflicts using a rule-based model over heterogeneous schemas, (2) Assessing conflicts and providing probabilistic estimates for values' correctness, (3) Extended querying capabilities in the presence of conflicts, and (4) Providing curation operations to help scientists resolve and investigate conflicts according to different priorities. FusionDB is realized on top of the PostgreSQL DBMS.

Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology, 2009
Annotations play a key role in understanding and curating databases. Annotations may represent comments, descriptions, and lineage information, among several others. Annotation management is a vital mechanism for sharing knowledge and building an interactive and collaborative environment among database users and scientists. What makes it challenging is that annotations can be attached to database entities at various granularities, e.g., at the table, tuple, column, or cell level, or more generally, to any subset of cells that results from a select statement. Therefore, simple comment fields in tuples would not work because of the combinatorial nature of the annotations. In this paper, we present extensions to current database management systems to support annotations. We propose storage schemes to efficiently store annotations at multiple granularities, i.e., at the table, tuple, column, and cell levels. Compared to storing the annotations with the individual cells, the proposed schemes achieve more than an order-of-magnitude reduction in storage and up to 70% saving in the query execution time. We define types of annotations that inherit different behaviors. Through these types, users can specify, for example, whether or not an annotation is continuously applied over newly inserted data and whether or not an annotation is archived when the base data is modified. These annotation types raise several storage and processing challenges that are addressed in the paper. We propose declarative ways to add, archive, query, and propagate annotations. The proposed mechanisms are realized through extensions to the standard SQL. We implemented the proposed functionalities inside PostgreSQL with an easy-to-use Excel-based front-end graphical interface.
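
The idea of storing an annotation once at its own granularity, instead of copying it onto every cell it covers, can be sketched in Python as follows. This is an illustrative assumption, not the storage scheme proposed in the paper: the `Annotation` class, its fields, and the sample annotations are hypothetical, and the sketch covers only the table, tuple, column, and cell levels (not arbitrary subsets of cells produced by a select statement).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    """One stored annotation; None in a field means 'applies to all'."""
    text: str
    row: Optional[int] = None     # None: every row
    column: Optional[str] = None  # None: every column

    def covers(self, row, column):
        """True if this annotation applies to the cell (row, column)."""
        return self.row in (None, row) and self.column in (None, column)


def annotations_for(store, row, column):
    """Collect the texts of all annotations that cover the given cell."""
    return [a.text for a in store if a.covers(row, column)]


# Hypothetical store: four annotations, each stored exactly once.
store = [
    Annotation("imported from lab A"),                     # table level
    Annotation("verified by curator", row=2),              # tuple level
    Annotation("units are mg/L", column="dose"),           # column level
    Annotation("suspicious value", row=2, column="dose"),  # cell level
]
cell_notes = annotations_for(store, 2, "dose")
other_notes = annotations_for(store, 1, "name")
```

With this layout a table-level annotation occupies one entry regardless of table size, whereas a per-cell scheme would repeat it for every cell, which is the kind of redundancy the abstract's storage comparison refers to.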
2010 IEEE 26th International Conference on Data Engineering Workshops (ICDEW 2010), 2010
2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), 2010
2008 IEEE 24th International Conference on Data Engineering, 2008

Biologists are increasingly using databases for storing and managing their data. Biological databases typically consist of a mixture of raw data, metadata, sequences, annotations, and related data obtained from various sources. Current database technology lacks several functionalities that are needed by biological databases. In this paper, we introduce bdbms, an extensible prototype database management system for supporting biological data. bdbms extends the functionalities of current DBMSs to include: (1) Annotation and provenance management including storage, indexing, manipulation, and querying of annotation and provenance as first-class objects in bdbms, (2) Local dependency tracking to track the dependencies and derivations among data items, (3) Update authorization to support data curation via content-based authorization, in contrast to identity-based authorization, and (4) New access methods and their supporting operators that support pattern matching on various types of comp...
IEEE Transactions on Knowledge and Data Engineering, 2014

Most emerging applications, especially in science domains, maintain databases that are rich in metadata and annotation information, e.g., auxiliary exchanged comments, related articles and images, provenance information, corrections and versioning information, and even scientists’ thoughts and observations. To manage these annotated databases, numerous techniques have been proposed to extend the DBMSs and efficiently integrate the annotations into the data processing cycle, e.g., storage, indexing, extended query languages and semantics, and query optimization. In this paper, we address a new facet of annotation management, which is the discovery and exploitation of the hidden correlations that may exist in annotated databases. Such correlations can be either between the data and the annotations (data-to-annotation), or between the annotations themselves (annotation-to-annotation). We make the case that the discovery of these annotation-related correlations can be exploited in variou...