Papers by Carlos Eduardo Pires

HAL (Le Centre pour la Communication Scientifique Directe), Jun 5, 2018
Entity resolution is the task of identifying duplicate entities within a single dataset or across multiple datasets. In the era of Big Data, this task has gained considerable attention due to its intrinsic quadratic complexity with respect to the size of the dataset. In practice, the task can be outsourced to a cloud service, and thus a service customer may be interested in estimating the costs of an entity resolution solution before executing it. Since the execution time of an entity resolution solution depends on a combination of various algorithms, their respective parameter values, and the employed cloud infrastructure, it is hard to estimate the infrastructure costs of an ER task a priori. Besides estimating customer costs, ER cost estimation is also important for evaluating whether a set of ER parameter values can be used to execute a task that meets predefined time and budget restrictions. To tackle these challenges, we formalize the problem of estimating ER costs, taking into account the main parameters that may influence the execution time of the ER task. We also propose an algorithm, denominated TBF, for evaluating the feasibility of ER parameter values given a set of predefined customer restrictions. Since the efficacy of the proposed algorithm is strongly tied to the accuracy of the theoretical estimations of the ER costs, we also present a number of guidelines that can be further explored to improve the efficacy of the proposed model.
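
The TBF algorithm itself is not reproduced here; the following is a minimal sketch of the kind of feasibility check the abstract describes, under an assumed toy cost model (a linear blocking pass plus in-block pairwise comparisons) with hypothetical names and pricing, not the paper's formalization.

```python
# Sketch of a feasibility check for ER parameter values against customer
# restrictions. The cost model is an illustrative assumption, not TBF.

def estimate_er_time(n_entities, avg_block_size, secs_per_comparison, secs_per_entity):
    """Rough ER execution-time estimate: blocking pass + in-block comparisons."""
    n_blocks = max(1, n_entities // max(1, avg_block_size))
    comparisons = n_blocks * avg_block_size * (avg_block_size - 1) / 2
    return n_entities * secs_per_entity + comparisons * secs_per_comparison

def is_feasible(params, max_time_secs, max_budget, price_per_hour):
    """Check whether a parameter setting meets time and budget restrictions."""
    t = estimate_er_time(**params)
    cost = (t / 3600.0) * price_per_hour
    return t <= max_time_secs and cost <= max_budget

# Example: 1M entities, blocks of ~50 entities, on a $0.40/hour instance.
params = dict(n_entities=1_000_000, avg_block_size=50,
              secs_per_comparison=1e-5, secs_per_entity=1e-6)
print(is_feasible(params, max_time_secs=3600, max_budget=5.0, price_per_hour=0.40))
```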

Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, 2019
The increasing use of Web systems has turned them into a valuable source of semi-structured data. In this context, the Entity Resolution (ER) task emerges as a fundamental step to integrate multiple knowledge bases or identify similarities between data items (i.e., entities). Blocking techniques are widely applied as an initial step of ER approaches in order to avoid computing similarities between all pairs of entities (a quadratic cost). In practice, heterogeneous and noisy data increase the difficulties faced by blocking techniques, since these issues directly interfere with block generation. To address these challenges, we propose the NA-BLOCKER technique, which is capable of tolerating noisy data when extracting information about the data schema, and thus generates high-quality blocks. NA-BLOCKER applies Locality-Sensitive Hashing (LSH) to hash the attribute values of entities, enabling the generation of high-quality blocks even in the presence of noise in the attribute values. In our experimental evaluation over five real-world datasets, NA-BLOCKER achieves better effectiveness than the state-of-the-art technique. In terms of efficiency, NA-BLOCKER produces, on average, 34% fewer comparisons; however, due to the cost introduced by LSH, it increases execution time by around 30%, on average.
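
NA-BLOCKER itself is not sketched in the abstract; the snippet below is a minimal illustration, under simplifying assumptions (MinHash with banding over whitespace tokens, illustrative parameters), of how LSH can group entities with similar attribute values into blocks despite noise.

```python
import hashlib
from collections import defaultdict

def minhash_signature(tokens, num_hashes=8):
    """MinHash signature: for each seed, keep the minimum token hash."""
    return [min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
                for t in tokens)
            for seed in range(num_hashes)]

def lsh_blocks(entities, num_hashes=8, band_size=2):
    """Entities whose signatures agree on any band land in the same block."""
    blocks = defaultdict(list)
    for eid, value in entities.items():
        tokens = set(value.lower().split()) or {""}
        sig = minhash_signature(tokens, num_hashes)
        for b in range(0, num_hashes, band_size):
            blocks[(b, tuple(sig[b:b + band_size]))].append(eid)
    return {k: v for k, v in blocks.items() if len(v) > 1}

# Noisy variants of the same title still share tokens, hence likely a band.
entities = {1: "the matrix 1999", 2: "matrix the 1999 dvd", 3: "blade runner"}
print(lsh_blocks(entities))
```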

Proceedings of the 35th Annual ACM Symposium on Applied Computing, 2020
Currently, a large number of information systems continuously produce large amounts of data. Since these sources may have overlapping knowledge, the Entity Resolution (ER) task emerges as a fundamental step to integrate multiple knowledge bases or identify similarities between entities. Considering the quadratic cost of the ER task, blocking techniques are often used to improve efficiency. Such techniques face two main challenges related to data volume (i.e., large data sources) and variety (i.e., heterogeneous data), as well as two others: streaming data and incremental processing. To address these four challenges simultaneously, we propose PI-Block, a novel incremental schema-agnostic blocking technique that utilizes parallelism (through a distributed computational infrastructure) to enhance blocking efficiency. In our experimental evaluation over four real-world data source pairs, PI-Block achieves better results regarding efficiency and effectiveness compared to the state-of-the-art technique.
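
PI-Block's actual design is distributed; as a minimal single-process sketch (all names hypothetical), incremental schema-agnostic blocking can be pictured as an inverted index over tokens drawn from all attribute values, updated as each entity arrives:

```python
from collections import defaultdict

class IncrementalBlocker:
    """Toy incremental, schema-agnostic blocker: tokens from any attribute
    act as blocking keys; each arriving entity is compared only against
    entities already indexed under the same keys."""

    def __init__(self):
        self.index = defaultdict(set)  # token -> ids of entities seen so far

    def add(self, eid, attributes):
        tokens = {t for v in attributes.values() for t in str(v).lower().split()}
        candidates = set()
        for t in tokens:
            candidates |= self.index[t]
            self.index[t].add(eid)
        return candidates  # compare eid only against these entities

blocker = IncrementalBlocker()
blocker.add(1, {"title": "The Matrix", "year": 1999})
print(blocker.add(2, {"name": "Matrix (1999)"}))  # {1}: schema-agnostic match
```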
Entity Resolution (ER) emerges as a fundamental step to integrate multiple knowledge bases or identify similarities between entities. Blocking is widely applied as an initial step of ER to avoid computing similarities between all pairs of entities. Heterogeneous and noisy data increase the difficulties faced by blocking, since these issues directly interfere with block generation. In this work, we propose a technique capable of tolerating noisy data when extracting information about the data schema, generating high-quality blocks. We apply Locality-Sensitive Hashing (LSH) to hash the entities' attribute values, enabling the generation of high-quality blocks even in the presence of noise in the values. In the experiments, we highlight that our approach achieves better effectiveness than the state-of-the-art technique, as well as fewer produced comparisons.

Anais Estendidos do XXXVIII Simpósio Brasileiro de Banco de Dados (SBBD Estendido 2023)
Privacy-Preserving Record Linkage (PPRL) intends to integrate private/sensitive data from several data sources held by different parties. It aims to identify records (e.g., persons or objects) representing the same real-world entity over private data sources held by different custodians. Due to recent laws and regulations (e.g., the General Data Protection Regulation), PPRL approaches are increasingly demanded in real-world application areas such as health care, credit analysis, public policy evaluation, and national security. As a result, the PPRL process needs to deal with efficacy (linkage quality) and privacy problems. For instance, the PPRL process needs to be executed over data sources (e.g., a database containing personal information from governmental income distribution and assistance programs) with an accurate linkage of the entities while, at the same time, protecting the privacy of the information. In this context, our work presents contributions to improve the privacy and quality of the PPRL process.
Anais do XXXI Simpósio Brasileiro de Banco de Dados (SBBD 2016)
Privacy-Preserving Record Linkage (PPRL) consists of identifying entities that represent the same real-world object while preserving data privacy. In this context, different techniques for comparing private data have been employed, such as Bloom filters. However, Bloom filters do not achieve good accuracy when numeric or date values are compared. This work aims to empirically evaluate whether Asymmetric Homomorphic Encryption (AHE) can improve the accuracy of comparisons involving non-textual private data. The results indicate that using AHE to compare non-textual data can improve PPRL accuracy.
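
As background for the Bloom filter limitation discussed above, here is a minimal sketch (illustrative parameters, not the paper's configuration) of the standard PPRL encoding: strings are split into character bigrams, bigrams are hashed into a bit array, and two encodings are compared with the Dice coefficient. Small numeric differences change most bigrams, which is why accuracy degrades for numbers and dates.

```python
import hashlib

def bloom_encode(text, m=64, k=3):
    """Encode a string's character bigrams into an m-bit Bloom filter."""
    bits = [0] * m
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)] or [text]
    for g in bigrams:
        for seed in range(k):
            h = int(hashlib.sha1(f"{seed}:{g}".encode()).hexdigest(), 16)
            bits[h % m] = 1
    return bits

def dice(a, b):
    """Dice coefficient between two bit arrays."""
    common = sum(x & y for x, y in zip(a, b))
    return 2 * common / (sum(a) + sum(b))

# Similar names score high, but close numbers share almost no bigrams.
print(dice(bloom_encode("maria silva"), bloom_encode("marla silva")))  # high
print(dice(bloom_encode("1999"), bloom_encode("2000")))                # low
```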

Information
Web systems have become a valuable source of semi-structured and streaming data. In this sense, Entity Resolution (ER) has become a key solution for integrating multiple data sources or identifying similarities between data items, namely entities. To avoid the quadratic cost of the ER task and improve efficiency, blocking techniques are usually applied. Beyond the traditional challenges faced by ER and, consequently, by blocking techniques, there are also challenges related to streaming data, incremental processing, and noisy data. To address them, we propose a schema-agnostic blocking technique capable of handling noisy and streaming data incrementally through a distributed computational infrastructure. To the best of our knowledge, blocking techniques that address all of these challenges simultaneously are lacking. This work proposes two strategies (attribute selection and top-n neighborhood entities) to minimize resource consumption and improve blocking efficiency.

Proceedings of the 21st International Database Engineering & Applications Symposium (IDEAS 2017)
As cities become green and smart, public information systems are being revamped to adopt digital technologies. There are several sources (official or not) that can provide information related to a city. The availability of multiple sources enables the design of advanced analyses for offering valuable services to both citizens and municipalities. However, such analyses would fail if the considered data were affected by errors and uncertainties: data quality is one of the main requirements for the successful exploitation of the available information. This paper highlights the importance of data quality evaluation in the context of geographical data sources. Moreover, we describe how the Entity Matching task can provide additional information to refine the quality assessment and, consequently, obtain a better evaluation of the reliability of the data sources. Data gathered from the public transportation and urban areas of Curitiba, Brazil, are used to show the strengths and effectiveness of the presented approach.
Anais Estendidos do XXXVI Simpósio Brasileiro de Banco de Dados (SBBD Estendido 2021), 2021
Deep neural networks have been successfully applied to many different natural language processing tasks. A neural network model that boosted results across a wide range of NLP tasks is BERT (Bidirectional Encoder Representations from Transformers). In this research, we show how the BERT model can be used to summarize textual documents of the Brazilian Federal Police. The documents report a summary of investigative activities. Due to the size and complexity of the documents, reading and understanding their entire content is exhausting work. Thus, we aim to analyze the feasibility of using the BERT model to extract and synthesize the most important information from Federal Police documents.
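
The paper's pipeline is not reproduced here; as a minimal sketch of BERT-based extractive summarization (the sentence-transformers library and the multilingual model name are assumptions, not the authors' exact setup), one can rank sentences by the centrality of their embeddings and keep the top ones:

```python
# pip install sentence-transformers
import numpy as np
from sentence_transformers import SentenceTransformer

def extractive_summary(sentences, top_k=2):
    """Select the top_k sentences whose embeddings are most central."""
    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    emb = model.encode(sentences)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    centrality = (emb @ emb.T).mean(axis=1)  # mean cosine similarity
    best = np.argsort(-centrality)[:top_k]
    return [sentences[i] for i in sorted(best)]  # keep original order

doc = ["The investigation started in 2019.",
       "Agents seized documents at three addresses.",
       "Weather was mild that day.",
       "The seized documents link suspects to the scheme."]
print(extractive_summary(doc))
```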
iSys, 2011
This work proposes a conceptual model for a Trajectory Data Warehouse that allows analyzing the behavior of moving objects over and between regions at different levels of granularity. The model allows trajectories to be segmented into components, such as stops and moves, which can carry semantic information that gives meaning to the trajectory. To mitigate the problem of the large amount of data, trajectories are stored in compressed form by summarizing their stops and moves.
The increasing urban population sets new demands for mobility solutions. The impacts of traffic congestion or inefficient transit connectivity directly affect public health (e.g., emissions and stress) and the city economy (e.g., deaths in road accidents, productivity, and commuting). In parallel, advances in technology have made it easier to obtain data about the systems which make up city information systems. This paper takes advantage of GIS and real-time data to present: 1) a web application integrating multiple services; 2) an Android application for bus visualization and prediction; and 3) a dashboard focused on applying exploratory data analysis techniques to ticketing data.

Proceedings of the 33rd Annual ACM Symposium on Applied Computing, 2018
In many scenarios, it is necessary to identify records referring to the same real-world object across different data sources (Record Linkage). Yet, such a need is often in contrast with privacy requirements concerning the data involved (e.g., identifying patients with the same diseases, genome matching, and fraud detection). Thus, when the parties interested in the Record Linkage process need to preserve the privacy of their data, Privacy-Preserving Record Linkage (PPRL) approaches are applied to address the privacy problem. In this sense, the first step of PPRL is the agreement of the parties about the data (attributes) that will be used during the record linkage process. To reach an agreement, the parties must share information about their data schemas, which in turn can be exploited to break data privacy. To overcome the vulnerability caused by schema information sharing, we propose a novel privacy-preserving approach for attribute pairing to aid PPRL applications. Empirical experiments demonstrate that our privacy-preserving approach considerably improves efficiency and effectiveness in comparison to a state-of-the-art baseline.
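
The paper's protocol is not reproduced here; as an illustrative sketch (all names and parameters hypothetical), attribute pairing without schema disclosure can be pictured as each party publishing only bit fingerprints of hashed sample values per attribute, and pairing the attributes whose fingerprints overlap the most:

```python
import hashlib

def attribute_fingerprint(values, m=128):
    """Bit fingerprint of an attribute built from hashed sample values."""
    bits = [0] * m
    for v in values:
        h = int(hashlib.sha1(str(v).lower().encode()).hexdigest(), 16)
        bits[h % m] = 1
    return bits

def jaccard(a, b):
    inter = sum(x & y for x, y in zip(a, b))
    union = sum(x | y for x, y in zip(a, b))
    return inter / union if union else 0.0

# Each party shares fingerprints, not attribute names or raw values.
party_a = {"attr1": ["alice", "bob", "carol"], "attr2": ["recife", "natal"]}
party_b = {"x": ["recife", "natal", "maceio"], "y": ["bob", "alice", "dave"]}
pairs = [(n_a, max(party_b, key=lambda n_b: jaccard(
            attribute_fingerprint(party_a[n_a]),
            attribute_fingerprint(party_b[n_b]))))
         for n_a in party_a]
print(pairs)  # expected: attr1 pairs with y, attr2 pairs with x
```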

International Journal of Distributed Systems and Technologies, 2012
Peer Data Management Systems (PDMSs) are advanced P2P applications in which each peer represents an autonomous data source that makes an exported schema available to be shared with other peers. Query answering in PDMSs can be improved if peers are efficiently arranged in the overlay network according to the similarity of their content. The set of peers can be partitioned into clusters so that the semantic similarity among the peers in the same cluster is maximal. The creation and maintenance of clusters is a challenging problem at the current stage of PDMS development. This work proposes an incremental peer clustering process. The authors present a PDMS architecture designed to facilitate the connection of new peers according to their exported schema, described by an ontology, propose a clustering process and its underlying algorithm, and present and discuss experimental results on peer clustering using the approach.
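
A minimal sketch of incremental peer clustering follows (the similarity function and threshold are illustrative assumptions, not the paper's ontology-based measure): each arriving peer joins the most similar existing cluster, or founds a new one when no cluster is similar enough.

```python
def schema_similarity(schema_a, schema_b):
    """Toy similarity: Jaccard overlap of exported schema terms.
    (The paper matches ontologies; this stands in for that measure.)"""
    a, b = set(schema_a), set(schema_b)
    return len(a & b) / len(a | b)

def add_peer(clusters, peer_schema, threshold=0.3):
    """Assign the peer to the closest cluster, or create a new cluster."""
    best, best_sim = None, 0.0
    for cluster in clusters:
        sim = max(schema_similarity(peer_schema, s) for s in cluster)
        if sim > best_sim:
            best, best_sim = cluster, sim
    if best is not None and best_sim >= threshold:
        best.append(peer_schema)
    else:
        clusters.append([peer_schema])

clusters = []
for schema in [{"movie", "actor"}, {"movie", "actor", "director"},
               {"gene", "protein"}]:
    add_peer(clusters, schema)
print(len(clusters))  # 2: a media cluster and a biology cluster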
Anais do Simpósio Brasileiro de Banco de Dados (SBBD), 2019
In the Smart Cities scenario, to avoid conflicting geospatial records between official and non-official sources, it is necessary to detect inconsistencies in the geospatial data they provide. To this end, the map matching task, i.e., the task of identifying corresponding features between two geospatial data sources, should be applied. For spatial Big Data, the map matching task is confronted with challenges related to the volume and veracity of the data. In this sense, we propose a Spark-based map matching approach, called MATCH-UPS. To evaluate it, real-world data sources from New York (USA) and Curitiba (Brazil) were used. The results showed that MATCH-UPS improved precision by 26% and reduced execution time by one third.
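
MATCH-UPS itself is not sketched in the abstract; the toy PySpark snippet below (grid size, data, and matching rule are assumptions) illustrates the general shape of Spark-based spatial matching: block point features by grid cell, then pair features from both sources that fall in the same cell.

```python
# pip install pyspark
from pyspark.sql import SparkSession

CELL = 0.01  # grid cell size in degrees (illustrative)

def cell_key(lat, lon):
    return (round(lat / CELL), round(lon / CELL))

spark = SparkSession.builder.appName("map-matching-sketch").getOrCreate()
official = spark.sparkContext.parallelize(
    [("stop_a", -25.4284, -49.2733), ("stop_b", -25.4300, -49.2800)])
crowd = spark.sparkContext.parallelize(
    [("point_1", -25.4285, -49.2731), ("point_2", -25.5000, -49.3000)])

keyed_official = official.map(lambda r: (cell_key(r[1], r[2]), r[0]))
keyed_crowd = crowd.map(lambda r: (cell_key(r[1], r[2]), r[0]))

# Only features sharing a grid cell become candidate matches.
print(keyed_official.join(keyed_crowd).collect())
spark.stop()
```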

IEEE/WIC/ACM International Conference on Web Intelligence
The widespread use of information systems has made them a valuable source of semi-structured data. In this context, Entity Resolution (ER) emerges as a fundamental task to integrate multiple knowledge bases or identify similarities between data items (i.e., entities). Since ER is an inherently quadratic task, blocking techniques are often used to improve efficiency. Beyond the challenges related to data volume and heterogeneity, blocking techniques also face two other challenges: streaming data and incremental processing. To address these challenges, we propose PRIME, a novel incremental schema-agnostic blocking technique that utilizes parallelism to enhance blocking efficiency. The proposed technique deals with streaming and incremental data using a distributed computational infrastructure. To improve efficiency, the technique avoids unnecessary comparisons and applies a time-window strategy to prevent excessive memory consumption.
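
The time-window idea can be sketched as follows (a single-process toy with hypothetical names, not PRIME's distributed design): indexed entities older than the window are evicted, bounding memory while keeping recent entities comparable.

```python
import time
from collections import deque, defaultdict

class WindowedBlocker:
    """Toy blocking index that only retains entities seen within a window."""

    def __init__(self, window_secs=60.0):
        self.window = window_secs
        self.arrivals = deque()        # (timestamp, eid, tokens)
        self.index = defaultdict(set)  # token -> recent entity ids

    def _evict(self, now):
        while self.arrivals and now - self.arrivals[0][0] > self.window:
            _, eid, tokens = self.arrivals.popleft()
            for t in tokens:
                self.index[t].discard(eid)

    def add(self, eid, text, now=None):
        now = time.time() if now is None else now
        self._evict(now)
        tokens = set(text.lower().split())
        candidates = set().union(*(self.index[t] for t in tokens)) if tokens else set()
        self.arrivals.append((now, eid, tokens))
        for t in tokens:
            self.index[t].add(eid)
        return candidates

b = WindowedBlocker(window_secs=10)
b.add(1, "the matrix 1999", now=0.0)
print(b.add(2, "matrix reloaded", now=5.0))      # {1}: within the window
print(b.add(3, "matrix revolutions", now=20.0))  # set(): older entities evicted
```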
Anais do VII Simpósio Brasileiro de Sistemas de Informação (SBSI 2011), 2011
This work proposes a conceptual model for a Trajectory Data Warehouse that allows analyzing the behavior of moving objects over and between regions in space and time, according to different levels of granularity, through the use of aggregations. The model allows trajectories to be segmented into components, such as stops and moves. These components can carry semantic information that gives meaning to parts of the trajectory. To mitigate the problem of the large amount of data, trajectories are stored in compressed form by summarizing their stops and moves. Experiments were conducted to evaluate the level of compression obtained for these data.
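
As a minimal illustration of the stops-and-moves idea (the speed threshold and distance measure are assumptions, not the paper's definitions), a trajectory can be compressed by collapsing consecutive low-speed points into a single "stop" segment:

```python
import math

def segment_stops_moves(points, speed_threshold=0.5):
    """Compress a trajectory [(t, x, y), ...] into stop/move segments.
    Consecutive points slower than the threshold collapse into one stop."""
    segments = []
    for i in range(1, len(points)):
        (t0, x0, y0), (t1, x1, y1) = points[i - 1], points[i]
        speed = math.hypot(x1 - x0, y1 - y0) / (t1 - t0)
        kind = "stop" if speed < speed_threshold else "move"
        if segments and segments[-1][0] == kind:
            segments[-1] = (kind, segments[-1][1], t1)  # extend the segment
        else:
            segments.append((kind, t0, t1))
    return segments

track = [(0, 0, 0), (10, 0.1, 0), (20, 0.2, 0),  # barely moving: a stop
         (30, 20, 0), (40, 40, 0)]               # fast: a move
print(segment_stops_moves(track))  # [('stop', 0, 20), ('move', 20, 40)]
```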

Anais do IX Simpósio Brasileiro de Sistemas de Informação (SBSI 2013), 2013
Stored procedures are commonly used to provide access to and manipulation of database data for information systems and other applications. If procedures contain inefficient programming logic or data manipulation, they impose excessive delays on client applications. Such delays can cause, among other problems, significant financial losses to enterprises. In addition, if procedures are developed using bad programming practices, they may become complex to maintain and evolve. In general, attempts to minimize these problems through manual analysis of source code are labor- and time-consuming. In this work, we present PL/SQL Advisor, a static-analysis-based tool which automatically detects potential improvements in database stored procedures written in PL/SQL. The results of a case study, using real open source projects, show that our tool is able to suggest a reasonable number of code improvements at low cost.
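
PL/SQL Advisor's rule set is not reproduced here; the sketch below (with hypothetical rules) shows the general shape of such a static check: scan PL/SQL source line by line with patterns that flag common bad practices.

```python
import re

# Hypothetical rules in the spirit of a PL/SQL static analyzer; the real
# PL/SQL Advisor rule set is not reproduced here.
RULES = [
    (re.compile(r"\bSELECT\s+\*", re.I), "avoid SELECT *; list needed columns"),
    (re.compile(r"\bWHEN\s+OTHERS\s+THEN\s+NULL", re.I),
     "silently swallowed exception"),
    (re.compile(r"\bFOR\s+\w+\s+IN\s*\(\s*SELECT\b", re.I),
     "row-by-row cursor loop; consider a set-based statement or BULK COLLECT"),
]

def analyze(plsql_source):
    findings = []
    for lineno, line in enumerate(plsql_source.splitlines(), start=1):
        for pattern, message in RULES:
            if pattern.search(line):
                findings.append((lineno, message))
    return findings

sample = """BEGIN
  FOR r IN (SELECT * FROM orders) LOOP
    NULL;
  END LOOP;
EXCEPTION WHEN OTHERS THEN NULL;
END;"""
for lineno, msg in analyze(sample):
    print(f"line {lineno}: {msg}")
```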
Anais do Simpósio Brasileiro de Banco de Dados (SBBD), 2019
The General Transit Feed Specification (GTFS) is a standard data format generated by transportation agencies in most cities worldwide to provide schedule data for their services. Despite being the standard in the public transportation field, applications that consume GTFS data may face two problems: outdated versions, since some transportation agencies do not update GTFS at the same frequency as transit changes; and discrepancies with the positioning data sent by the buses. This paper provides a conformity analysis of GTFS routes and bus positioning data from multiple cities. We have found inconsistencies between GPS route labels and GTFS routes. We also classify the conformity of bus trajectories and enumerate the main inconsistencies found in the data analysis.
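
One of the checks described, matching GPS route labels against GTFS routes, can be pictured with a minimal sketch (the routes.txt layout follows the GTFS specification; the GPS feed format is an assumption):

```python
import csv

def gtfs_route_ids(routes_txt_path):
    """Read route identifiers from a GTFS routes.txt file."""
    with open(routes_txt_path, newline="", encoding="utf-8") as f:
        return {row["route_id"] for row in csv.DictReader(f)}

def unmatched_gps_labels(gps_records, gtfs_ids):
    """GPS route labels with no corresponding GTFS route."""
    return {rec["route_label"] for rec in gps_records} - gtfs_ids

# Hypothetical GPS feed: each record carries the route label the bus reports.
gps_records = [{"bus_id": "B1", "route_label": "022"},
               {"bus_id": "B2", "route_label": "999"}]
gtfs_ids = {"022", "023", "050"}  # in practice: gtfs_route_ids("routes.txt")
print(unmatched_gps_labels(gps_records, gtfs_ids))  # {'999'}
```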

Future Generation Computer Systems, 2019
Data analysis of public transportation data in large cities is a challenging problem. Managing data ingestion, data storage, data quality enhancement, modelling, and analysis requires intensive computing and a non-trivial amount of resources. In EUBra-BIGSEA (Europe-Brazil Collaboration of Big Data Scientific Research Through Cloud-Centric Applications), we address such problems in a comprehensive and integrated way. EUBra-BIGSEA provides a platform for building data analytic workflows on top of elastic cloud services without requiring skills related to either programming or cloud services. The approach combines cloud orchestration, Quality of Service, and automatic parallelisation on a platform that includes a toolbox for implementing privacy guarantees and data quality enhancement, as well as advanced services for sentiment analysis, traffic jam estimation, and trip recommendation based on estimated crowdedness. All developments are available under Open Source licenses.
Revista Brasileira de Informática na Educação, 2015
Learning how to handle data is fundamentally important to guarantee data quality. Given the subjective nature of data quality dimensions, learning them is a challenge for students. This article proposes the development of an educational game to reinforce the learning of databases and data quality. The game simulates several scenarios in which the player (a student) assumes the role of a data analyst and has to assess the quality of data in relational databases. To do so, the player must analyze the scenarios and indicate the data that present quality problems. The results obtained from our experiments indicate that the game is able to improve student learning while also serving as a form of entertainment.