Skip to main content

Vanessa Ayala-Rivera

Technological University Dublin, Ireland, School of Informatics and Cybersecurity, Faculty Member

Followers

105

Following

25

Co-author

1

Public Views

Wil van der Aalst

RWTH Aachen University

Dimitrios G . Katehakis

Jordi Ribes-González

Maryline Laurent

University of California, Santa Cruz

Christopher Millard

Queen Mary, University of London

University of Basel, Switzerland

Kiran Lakkaraju

Sandia National Laboratories

Interests

Uploads

Papers by Vanessa Ayala-Rivera

Towards an Efficient Log Data Protection in Software Systems through Data Minimization and Anonymization

2019 7th International Conference in Software Engineering Research and Innovation (CONISOFT), 2019

IT infrastructures of companies generate large amounts of log data every day. These logs are typi... more IT infrastructures of companies generate large amounts of log data every day. These logs are typically analyzed by software engineers to gain insights about activities occurring within a company (e.g., to debug issues exhibited by the production systems). To facilitate this process, log data management is often outsourced to cloud providers. However, logs may contain information that is sensitive by nature and considered personal identifiable under most of the new privacy protection laws, such as the European General Data Protection Regulation (GDPR). To ensure that companies do not violate regulatory compliance, they must adopt, in their software systems, appropriate data protection measures. Such privacy protection laws also promote the use of anonymization techniques as possible mechanisms to operationalize data protection. However, companies struggle to put anonymization in practice due to the lack of integrated, intuitive, and easy-to-use tools that accommodate effectively with their log management systems. In this paper, we propose an automatic approach (SafeLog) to filter out information and anonymize log streams to safeguard the confidentiality of sensitive data and prevent its exposure and misuse from third parties. Our results show that atomic anonymization operations can be effectively applied to log streams to preserve the confidentiality of information, while still allowing to conduct different types of analysis tasks such as users behavior, and anomaly detection. Our approach also reduces the amount of data sent to cloud vendors, hence decreasing the financial costs and the risk of overexposing information.

Towards an Efficient Performance Testing Through Dynamic Workload Adaptation

Testing Software and Systems, 2019

Performance testing is a critical task to ensure an acceptable user experience with software syst... more Performance testing is a critical task to ensure an acceptable user experience with software systems, especially when there are high numbers of concurrent users. Selecting an appropriate test workload is a challenging and time-consuming process that relies heavily on the testers' expertise. Not only are workloads application-dependent, but also it is usually unclear how large a workload must be to expose any performance issues that exist in an application. Previous research has proposed to dynamically adapt the test workloads in real-time based on the application behavior. By reducing the need for the trial-and-error test cycles required when using static workloads, dynamic workload adaptation can reduce the effort and expertise needed to carry out performance testing. However, such approaches usually require testers to properly configure several parameters in order to be effective in identifying workloaddependent performance bugs, which may hinder their usability among practitioners. To address this issue, this paper examines the different criteria needed to conduct performance testing efficiently using dynamic workload adaptation. We present the results of comprehensively evaluating one such approach, providing insights into how to tune it properly in order to obtain better outcomes based on different scenarios. We also study the effects of varying its configuration and how this can affect the results obtained.

DYNAMOJM: A JMeter Tool for Performance Testing Using Dynamic Workload Adaptation

Testing Software and Systems, 2019

Performance testing is a critical task to assure optimal experience for users, especially when th... more Performance testing is a critical task to assure optimal experience for users, especially when there are high loads of concurrent users. JMeter is one of the most widely used tools for load and stress testing. With JMeter, it is possible to test the performance of static and dynamic resources on the web. This paper presents DYNAMOJM, a novel tool built on top of JMeter that enables testers to create a dynamic workload for performance testing. This tool implements the DYNAMO approach, which has proven useful to find performance issues more efficiently than static testing techniques.

Improving the Testing of Java Garbage Collection Through an Efficient Benchmark Generation

2018 6th International Conference in Software Engineering Research and Innovation (CONISOFT), 2018

Garbage Collection (GC) is a core feature of multiple modern technologies (e.g., Java, Android). ... more Garbage Collection (GC) is a core feature of multiple modern technologies (e.g., Java, Android). On one hand, it offers significant software engineering benefits over explicitly memory management, like preventing most types of memory leaks. On the other hand, GC is a known cause of performance degradation. However, it is considerably challenging to understand its exact impact on the overall application performance. This is because the non-deterministic nature of GC makes very complex to properly model it and evaluate its performance impacts. To help tackling these problems, we present an engine to generate realistic GC benchmarks by enabling to effectively capture the GC/memory behaviours experienced by real-world Java applications. We also demonstrate, through a comprehensive experimental evaluation, how such benchmarks can be useful to strengthen the evaluation of GC-related advancements.

Improving the Testing of Clustered Systems Through the Effective Usage of Java Benchmarks

2017 5th International Conference in Software Engineering Research and Innovation (CONISOFT), 2017

A Requirements-Based Approach for the Evaluation of Emulated IoT Systems

2018 4th International Workshop on Requirements Engineering for Self-Adaptive, Collaborative, and Cyber Physical Systems (RESACS), 2018

The Internet of Things (IoT) has become a major technological revolution. Evaluating any IoT adva... more The Internet of Things (IoT) has become a major technological revolution. Evaluating any IoT advancements comprehensively is critical to understand the conditions under which they can be more useful, as well as to assess the robustness and efficiency of IoT systems to validate them before their deployment in real life. Nevertheless, the creation of an appropriate IoT test environment is a difficult, effort-intensive, and expensive task; typically requiring a significant amount of human effort and physical hardware to build it. To tackle this problem, emulation tools to test IoT devices have been proposed. However, there is a lack of systematic approaches for evaluating IoT emulation environments. In this paper, we present a requirements-based framework to enable the systematic evaluation of the suitability of an emulated IoT environment to fulfil the requirements that secure the quality of an adequate test environment for IoT.

Improving the Utility of Anonymized Datasets through Dynamic Evaluation of Generalization Hierarchies

2016 IEEE 17th International Conference on Information Reuse and Integration (IRI), 2016

The dissemination of textual personal information has become a key driver for innovation and valu... more The dissemination of textual personal information has become a key driver for innovation and value creation. However, due to the possible content of sensitive information, this data must be anonymized, which can reduce its usefulness for secondary uses. One of the most used techniques to anonymize data is generalization. However, its effectiveness can be hampered by the Value Generalization Hierarchies (VGHs) used to dictate the anonymization of data, as poorlyspecified VGHs can reduce the usefulness of the resulting data. To tackle this problem, we propose a metric for evaluating the quality of textual VGHs used in anonymization. Our evaluation approach considers the semantic properties of VGHs and exploits information from the input datasets to predict with higher accuracy (compared to existing approaches) the potential effectiveness of VGHs for anonymizing data. As a consequence, the utility of the resulting datasets is improved without sacrificing the privacy goal. We also introduce a novel rating scale to classify the quality of the VGHs into categories to facilitate the interpretation of our quality metric for practitioners.

Automatic Construction of Generalization Hierarchies for Publishing Anonymized Data

Lecture Notes in Computer Science, 2016

Concept hierarchies are widely used in multiple fields to carry out data analysis. In data privac... more Concept hierarchies are widely used in multiple fields to carry out data analysis. In data privacy, they are known as Value Generalization Hierarchies (VGHs), and are used by generalization algorithms to dictate the data anonymization. Thus, their proper specification is critical to obtain anonymized data of good quality. The creation and evaluation of VGHs require expert knowledge and a significant amount of manual effort, making these tasks highly error-prone and time-consuming. In this paper we present AIKA, a knowledge-based framework to automatically construct and evaluate VGHs for the anonymization of categorical data. AIKA integrates ontologies to objectively create and evaluate VGHs. It also implements a multi-dimensional reward function to tailor the VGH evaluation to different use cases. Our experiments show that AIKA improved the creation of VGHs by generating VGHs of good quality in less time than when manually done. Results also showed how the reward function properly captures the desired VGH properties.

COCOA: A Synthetic Data Generator for Testing Anonymization Techniques

Lecture Notes in Computer Science, 2016

Conducting extensive testing of anonymization techniques is critical to assess their robustness a... more Conducting extensive testing of anonymization techniques is critical to assess their robustness and identify the scenarios where they are most suitable. However, the access to real microdata is highly restricted and the one that is publicly-available is usually anonymized or aggregated; hence, reducing its value for testing purposes. In this paper, we present a framework (COCOA) for the generation of realistic synthetic microdata that allows to define multi-attribute relationships in order to preserve the functional dependencies of the data. We prove how COCOA is useful to strengthen the testing of anonymization techniques by broadening the number and diversity of the test scenarios. Results also show how COCOA is practical to generate large datasets.

Protecting organizational data confidentiality in the cloud using a high-performance anonymization engine

12th Information Technology &Telecommunications (IT&T) Conference, Athlone, Ireland, March, 2013

Ontology-Based Quality Evaluation of Value Generalization Hierarchies for Data Anonymization

ArXiv, 2015

In privacy-preserving data publishing, approaches using Value Generalization Hierarchies (VGHs) f... more In privacy-preserving data publishing, approaches using Value Generalization Hierarchies (VGHs) form an important class of anonymization algorithms. VGHs play a key role in the utility of published datasets as they dictate how the anonymization of the data occurs. For categorical attributes, it is imperative to preserve the semantics of the original data in order to achieve a higher utility. Despite this, semantics have not being formally considered in the specification of VGHs. Moreover, there are no methods that allow the users to assess the quality of their VGH. In this paper, we propose a measurement scheme, based on ontologies, to quantitatively evaluate the quality of VGHs, in terms of semantic consistency and taxonomic organization, with the aim of producing higher-quality anonymizations. We demonstrate, through a case study, how our evaluation scheme can be used to compare the quality of multiple VGHs and can help to identify faulty VGHs.

Enhancing the Utility of Anonymized Data by Improving the Quality of Generalization Hierarchies

Trans. Data Priv., 2017

The dissemination of textual personal information has become an important driver of innovation. H... more The dissemination of textual personal information has become an important driver of innovation. However, due to the possible content of sensitive information, this data must be anonymized. A commonly-used technique to anonymize data is generalization. Nevertheless, its effectiveness can be hampered by the Value Generalization Hierarchies (VGHs) used as poorly-specified VGHs can decrease the usefulness of the resulting data. To tackle this problem, in our previous work we presented the Generalization Semantic Loss (GSL), a metric that captures the quality of categorical VGHs in terms of semantic consistency and taxonomic organization. We validated the accuracy of GSL using an intrinsic evaluation with respect to a gold standard ontology. In this paper, we extend our previous work by conducting an extrinsic evaluation of GSL with respect to the performance that VGHs have in anonymization (using data utility metrics). We show how GSL can be used to perform an a priori assessment of the...

Synthetic Data Generation using Benerator Tool

ArXiv, 2013

Datasets of dierent characteristics are needed by the research community for experimental purpose... more Datasets of dierent characteristics are needed by the research community for experimental purposes. However, real data may be dicult to obtain due to privacy concerns. Moreover, real data may not meet specic characteristics which are needed to verify new approaches under certain conditions. Given these limitations, the use of synthetic data is a viable alternative to complement the real data. In this report, we describe the process followed to generate synthetic data using Benerator, a publicly available tool. The results show that the synthetic data preserves a high level of accuracy compared to the original data. The generated datasets correspond to microdata containing records with social, economic and demographic data which mimics the distribution of aggregated statistics from the 2011 Irish Census data.

Enhancing the utility of anonymized data in privacy-preserving data publishing

A Systematic Comparison and Evaluation of k-Anonymization Algorithms for Practitioners

The vast amount of data being collected about individuals has brought new challenges in protectin... more The vast amount of data being collected about individuals has brought new challenges in protecting their privacy when this data is disseminated. As a result, Privacy-Preserving Data Publishing has become an active research area, in which multiple anonymization algorithms have been proposed. However, given the large number of algorithms available and limited information regarding their performance, it is difficult to identify and select the most appropriate algorithm given a particular publishing scenario, especially for practitioners. In this paper, we perform a systematic comparison of three well-known k-anonymization algorithms to measure their efficiency (in terms of resources usage) and their effectiveness (in terms of data utility). We extend the scope of their original evaluation by employing a more comprehensive set of scenarios: different parameters, metrics and datasets. Using publicly available implementations of those algorithms, we conduct a series of experiments and a c...

Towards an Efficient Log Data Protection in Software Systems through Data Minimization and Anonymization

7th International Conference in Software Engineering Research and Innovation (CONISOFT), 2019

IT infrastructures of companies generate large amounts of log data every day. These logs are typi... more IT infrastructures of companies generate large amounts of log data every day. These logs are typically analyzed by software engineers to gain insights about activities occurring within a company (e.g., to debug issues exhibited by the production systems). To facilitate this process, log data management is often outsourced to cloud providers. However, logs may contain information that is sensitive by nature and considered personal identifiable under most of the new privacy protection laws, such as the European General Data Protection Regulation (GDPR). To ensure that companies do not violate regulatory compliance, they must adopt, in their software systems, appropriate data protection measures. Such privacy protection laws also promote the use of anonymization techniques as possible mechanisms to operationalize data protection. However, companies struggle to put anonymization in practice due to the lack of integrated, intuitive, and easy-to-use tools that accommodate effectively with...

Towards an Efficient Performance Testing through Dynamic Workload Adaptation

FIP International Conference on Testing Software and Systems, 2019

Performance testing is a critical task to ensure an acceptable user experience with software syst... more Performance testing is a critical task to ensure an acceptable user experience with software systems, especially when there are high numbers of concurrent users. Selecting an appropriate test work-load is a challenging and time-consuming process that relies heavily on the testers' expertise. Not only are workloads application-dependent, but also it is usually unclear how large a workload must be to expose any performance issues that exist in an application. Previous research has proposed to dynamically adapt the test workloads in real-time based on the application behavior. By reducing the need for the trial-and-error test cycles required when using static workloads, dynamic workload adaptation can reduce the effort and expertise needed to carry out performance testing. However, such approaches usually require testers to properly configure several parameters in order to be effective in identifying workload-dependent performance bugs, which may hinder their usability among practi...

DYNAMOJM: A JMeter Tool for Performance Testing using Dynamic Workload Adaptation

IFIP International Conference on Testing Software and Systems, 2019

Performance testing is a critical task to assure optimal experience for users, especially when th... more Performance testing is a critical task to assure optimal experience for users, especially when there are high loads of concurrent users. JMeter is one of the most widely used tools for load and stress testing. With JMeter, it is possible to test the performance of static and dynamic resources on the web. This paper presents DYNAMOJM, a novel tool built on top of JMeter that enables testers to create a dynamic workload for performance testing. This tool implements the DYNAMO approach, which has proven useful to find performance issues more efficiently than static testing techniques.

Improving the Testing of Java Garbage Collection Through an Efficient Benchmark Generation

International Conference in Software Engineering Research and Innovation (CONISOFT), 2018

Garbage Collection (GC) is a core feature of multiple modern technologies (e.g., Java, Android). ... more Garbage Collection (GC) is a core feature of multiple modern technologies (e.g., Java, Android). On one hand, it offers significant software engineering benefits over explicitly memory management, like preventing most types of memory leaks. On the other hand, GC is a known cause of performance degradation. However, it is considerably challenging to understand its exact impact on the overall application performance. This is because the non-deterministic nature of GC makes very complex to properly model it and evaluate its performance impacts. To help tackling these problems, we present an engine to generate realistic GC benchmarks by enabling to effectively capture the GC/memory behaviours experienced by real-world Java applications. We also demonstrate, through a comprehensive experimental evaluation, how such benchmarks can be useful to strengthen the evaluation of GC-related advancements.

Protecting Organizational Data Confidentiality in the Cloud using a High-Performance Anonymization Engine

Data security remains a top concern for the adoption of cloud-based delivery models, especially i... more Data security remains a top concern for the adoption of cloud-based delivery models, especially in the case of the Software as a Service (SaaS). This concern is primarily caused due to the lack of transparency on how customer data is managed. Clients depend on the security measures implemented by the service providers to keep their information protected. However, not many practical solutions exist to protect data from malicious insiders working for the cloud providers, a factor that represents a high potential for data breaches. This paper presents the High-Performance Anonymization Engine (HPAE), an approach to allow companies to protect their sensitive information from SaaS providers in a public cloud. This approach uses data anonymiza-tion to prevent the exposure of sensitive data in its original form, thus reducing the risk for misuses of customer information. This work involved the implementation of a prototype and an experimental validation phase, which assessed the performanc...

Towards an Efficient Log Data Protection in Software Systems through Data Minimization and Anonymization

2019 7th International Conference in Software Engineering Research and Innovation (CONISOFT), 2019

IT infrastructures of companies generate large amounts of log data every day. These logs are typi... more IT infrastructures of companies generate large amounts of log data every day. These logs are typically analyzed by software engineers to gain insights about activities occurring within a company (e.g., to debug issues exhibited by the production systems). To facilitate this process, log data management is often outsourced to cloud providers. However, logs may contain information that is sensitive by nature and considered personal identifiable under most of the new privacy protection laws, such as the European General Data Protection Regulation (GDPR). To ensure that companies do not violate regulatory compliance, they must adopt, in their software systems, appropriate data protection measures. Such privacy protection laws also promote the use of anonymization techniques as possible mechanisms to operationalize data protection. However, companies struggle to put anonymization in practice due to the lack of integrated, intuitive, and easy-to-use tools that accommodate effectively with their log management systems. In this paper, we propose an automatic approach (SafeLog) to filter out information and anonymize log streams to safeguard the confidentiality of sensitive data and prevent its exposure and misuse from third parties. Our results show that atomic anonymization operations can be effectively applied to log streams to preserve the confidentiality of information, while still allowing to conduct different types of analysis tasks such as users behavior, and anomaly detection. Our approach also reduces the amount of data sent to cloud vendors, hence decreasing the financial costs and the risk of overexposing information.

Towards an Efficient Performance Testing Through Dynamic Workload Adaptation

Testing Software and Systems, 2019

Performance testing is a critical task to ensure an acceptable user experience with software syst... more Performance testing is a critical task to ensure an acceptable user experience with software systems, especially when there are high numbers of concurrent users. Selecting an appropriate test workload is a challenging and time-consuming process that relies heavily on the testers' expertise. Not only are workloads application-dependent, but also it is usually unclear how large a workload must be to expose any performance issues that exist in an application. Previous research has proposed to dynamically adapt the test workloads in real-time based on the application behavior. By reducing the need for the trial-and-error test cycles required when using static workloads, dynamic workload adaptation can reduce the effort and expertise needed to carry out performance testing. However, such approaches usually require testers to properly configure several parameters in order to be effective in identifying workloaddependent performance bugs, which may hinder their usability among practitioners. To address this issue, this paper examines the different criteria needed to conduct performance testing efficiently using dynamic workload adaptation. We present the results of comprehensively evaluating one such approach, providing insights into how to tune it properly in order to obtain better outcomes based on different scenarios. We also study the effects of varying its configuration and how this can affect the results obtained.

DYNAMOJM: A JMeter Tool for Performance Testing Using Dynamic Workload Adaptation

Testing Software and Systems, 2019

Performance testing is a critical task to assure optimal experience for users, especially when th... more Performance testing is a critical task to assure optimal experience for users, especially when there are high loads of concurrent users. JMeter is one of the most widely used tools for load and stress testing. With JMeter, it is possible to test the performance of static and dynamic resources on the web. This paper presents DYNAMOJM, a novel tool built on top of JMeter that enables testers to create a dynamic workload for performance testing. This tool implements the DYNAMO approach, which has proven useful to find performance issues more efficiently than static testing techniques.

Improving the Testing of Java Garbage Collection Through an Efficient Benchmark Generation

2018 6th International Conference in Software Engineering Research and Innovation (CONISOFT), 2018

Garbage Collection (GC) is a core feature of multiple modern technologies (e.g., Java, Android). ... more Garbage Collection (GC) is a core feature of multiple modern technologies (e.g., Java, Android). On one hand, it offers significant software engineering benefits over explicitly memory management, like preventing most types of memory leaks. On the other hand, GC is a known cause of performance degradation. However, it is considerably challenging to understand its exact impact on the overall application performance. This is because the non-deterministic nature of GC makes very complex to properly model it and evaluate its performance impacts. To help tackling these problems, we present an engine to generate realistic GC benchmarks by enabling to effectively capture the GC/memory behaviours experienced by real-world Java applications. We also demonstrate, through a comprehensive experimental evaluation, how such benchmarks can be useful to strengthen the evaluation of GC-related advancements.

Improving the Testing of Clustered Systems Through the Effective Usage of Java Benchmarks

2017 5th International Conference in Software Engineering Research and Innovation (CONISOFT), 2017

A Requirements-Based Approach for the Evaluation of Emulated IoT Systems

2018 4th International Workshop on Requirements Engineering for Self-Adaptive, Collaborative, and Cyber Physical Systems (RESACS), 2018

The Internet of Things (IoT) has become a major technological revolution. Evaluating any IoT adva... more The Internet of Things (IoT) has become a major technological revolution. Evaluating any IoT advancements comprehensively is critical to understand the conditions under which they can be more useful, as well as to assess the robustness and efficiency of IoT systems to validate them before their deployment in real life. Nevertheless, the creation of an appropriate IoT test environment is a difficult, effort-intensive, and expensive task; typically requiring a significant amount of human effort and physical hardware to build it. To tackle this problem, emulation tools to test IoT devices have been proposed. However, there is a lack of systematic approaches for evaluating IoT emulation environments. In this paper, we present a requirements-based framework to enable the systematic evaluation of the suitability of an emulated IoT environment to fulfil the requirements that secure the quality of an adequate test environment for IoT.

Improving the Utility of Anonymized Datasets through Dynamic Evaluation of Generalization Hierarchies

2016 IEEE 17th International Conference on Information Reuse and Integration (IRI), 2016

The dissemination of textual personal information has become a key driver for innovation and valu... more The dissemination of textual personal information has become a key driver for innovation and value creation. However, due to the possible content of sensitive information, this data must be anonymized, which can reduce its usefulness for secondary uses. One of the most used techniques to anonymize data is generalization. However, its effectiveness can be hampered by the Value Generalization Hierarchies (VGHs) used to dictate the anonymization of data, as poorlyspecified VGHs can reduce the usefulness of the resulting data. To tackle this problem, we propose a metric for evaluating the quality of textual VGHs used in anonymization. Our evaluation approach considers the semantic properties of VGHs and exploits information from the input datasets to predict with higher accuracy (compared to existing approaches) the potential effectiveness of VGHs for anonymizing data. As a consequence, the utility of the resulting datasets is improved without sacrificing the privacy goal. We also introduce a novel rating scale to classify the quality of the VGHs into categories to facilitate the interpretation of our quality metric for practitioners.

Automatic Construction of Generalization Hierarchies for Publishing Anonymized Data

Lecture Notes in Computer Science, 2016

Concept hierarchies are widely used in multiple fields to carry out data analysis. In data privac... more Concept hierarchies are widely used in multiple fields to carry out data analysis. In data privacy, they are known as Value Generalization Hierarchies (VGHs), and are used by generalization algorithms to dictate the data anonymization. Thus, their proper specification is critical to obtain anonymized data of good quality. The creation and evaluation of VGHs require expert knowledge and a significant amount of manual effort, making these tasks highly error-prone and time-consuming. In this paper we present AIKA, a knowledge-based framework to automatically construct and evaluate VGHs for the anonymization of categorical data. AIKA integrates ontologies to objectively create and evaluate VGHs. It also implements a multi-dimensional reward function to tailor the VGH evaluation to different use cases. Our experiments show that AIKA improved the creation of VGHs by generating VGHs of good quality in less time than when manually done. Results also showed how the reward function properly captures the desired VGH properties.

COCOA: A Synthetic Data Generator for Testing Anonymization Techniques

Lecture Notes in Computer Science, 2016

Conducting extensive testing of anonymization techniques is critical to assess their robustness a... more Conducting extensive testing of anonymization techniques is critical to assess their robustness and identify the scenarios where they are most suitable. However, the access to real microdata is highly restricted and the one that is publicly-available is usually anonymized or aggregated; hence, reducing its value for testing purposes. In this paper, we present a framework (COCOA) for the generation of realistic synthetic microdata that allows to define multi-attribute relationships in order to preserve the functional dependencies of the data. We prove how COCOA is useful to strengthen the testing of anonymization techniques by broadening the number and diversity of the test scenarios. Results also show how COCOA is practical to generate large datasets.

Protecting organizational data confidentiality in the cloud using a high-performance anonymization engine

12th Information Technology &Telecommunications (IT&T) Conference, Athlone, Ireland, March, 2013

Ontology-Based Quality Evaluation of Value Generalization Hierarchies for Data Anonymization

ArXiv, 2015

In privacy-preserving data publishing, approaches using Value Generalization Hierarchies (VGHs) f... more In privacy-preserving data publishing, approaches using Value Generalization Hierarchies (VGHs) form an important class of anonymization algorithms. VGHs play a key role in the utility of published datasets as they dictate how the anonymization of the data occurs. For categorical attributes, it is imperative to preserve the semantics of the original data in order to achieve a higher utility. Despite this, semantics have not being formally considered in the specification of VGHs. Moreover, there are no methods that allow the users to assess the quality of their VGH. In this paper, we propose a measurement scheme, based on ontologies, to quantitatively evaluate the quality of VGHs, in terms of semantic consistency and taxonomic organization, with the aim of producing higher-quality anonymizations. We demonstrate, through a case study, how our evaluation scheme can be used to compare the quality of multiple VGHs and can help to identify faulty VGHs.

Enhancing the Utility of Anonymized Data by Improving the Quality of Generalization Hierarchies

Trans. Data Priv., 2017

The dissemination of textual personal information has become an important driver of innovation. H... more The dissemination of textual personal information has become an important driver of innovation. However, due to the possible content of sensitive information, this data must be anonymized. A commonly-used technique to anonymize data is generalization. Nevertheless, its effectiveness can be hampered by the Value Generalization Hierarchies (VGHs) used as poorly-specified VGHs can decrease the usefulness of the resulting data. To tackle this problem, in our previous work we presented the Generalization Semantic Loss (GSL), a metric that captures the quality of categorical VGHs in terms of semantic consistency and taxonomic organization. We validated the accuracy of GSL using an intrinsic evaluation with respect to a gold standard ontology. In this paper, we extend our previous work by conducting an extrinsic evaluation of GSL with respect to the performance that VGHs have in anonymization (using data utility metrics). We show how GSL can be used to perform an a priori assessment of the...

Synthetic Data Generation using Benerator Tool

ArXiv, 2013

Datasets of dierent characteristics are needed by the research community for experimental purpose... more Datasets of dierent characteristics are needed by the research community for experimental purposes. However, real data may be dicult to obtain due to privacy concerns. Moreover, real data may not meet specic characteristics which are needed to verify new approaches under certain conditions. Given these limitations, the use of synthetic data is a viable alternative to complement the real data. In this report, we describe the process followed to generate synthetic data using Benerator, a publicly available tool. The results show that the synthetic data preserves a high level of accuracy compared to the original data. The generated datasets correspond to microdata containing records with social, economic and demographic data which mimics the distribution of aggregated statistics from the 2011 Irish Census data.

Enhancing the utility of anonymized data in privacy-preserving data publishing

A Systematic Comparison and Evaluation of k-Anonymization Algorithms for Practitioners

The vast amount of data being collected about individuals has brought new challenges in protectin... more The vast amount of data being collected about individuals has brought new challenges in protecting their privacy when this data is disseminated. As a result, Privacy-Preserving Data Publishing has become an active research area, in which multiple anonymization algorithms have been proposed. However, given the large number of algorithms available and limited information regarding their performance, it is difficult to identify and select the most appropriate algorithm given a particular publishing scenario, especially for practitioners. In this paper, we perform a systematic comparison of three well-known k-anonymization algorithms to measure their efficiency (in terms of resources usage) and their effectiveness (in terms of data utility). We extend the scope of their original evaluation by employing a more comprehensive set of scenarios: different parameters, metrics and datasets. Using publicly available implementations of those algorithms, we conduct a series of experiments and a c...

Towards an Efficient Log Data Protection in Software Systems through Data Minimization and Anonymization

7th International Conference in Software Engineering Research and Innovation (CONISOFT), 2019

IT infrastructures of companies generate large amounts of log data every day. These logs are typi... more IT infrastructures of companies generate large amounts of log data every day. These logs are typically analyzed by software engineers to gain insights about activities occurring within a company (e.g., to debug issues exhibited by the production systems). To facilitate this process, log data management is often outsourced to cloud providers. However, logs may contain information that is sensitive by nature and considered personal identifiable under most of the new privacy protection laws, such as the European General Data Protection Regulation (GDPR). To ensure that companies do not violate regulatory compliance, they must adopt, in their software systems, appropriate data protection measures. Such privacy protection laws also promote the use of anonymization techniques as possible mechanisms to operationalize data protection. However, companies struggle to put anonymization in practice due to the lack of integrated, intuitive, and easy-to-use tools that accommodate effectively with...

Towards an Efficient Performance Testing through Dynamic Workload Adaptation

FIP International Conference on Testing Software and Systems, 2019

Performance testing is a critical task to ensure an acceptable user experience with software syst... more Performance testing is a critical task to ensure an acceptable user experience with software systems, especially when there are high numbers of concurrent users. Selecting an appropriate test work-load is a challenging and time-consuming process that relies heavily on the testers' expertise. Not only are workloads application-dependent, but also it is usually unclear how large a workload must be to expose any performance issues that exist in an application. Previous research has proposed to dynamically adapt the test workloads in real-time based on the application behavior. By reducing the need for the trial-and-error test cycles required when using static workloads, dynamic workload adaptation can reduce the effort and expertise needed to carry out performance testing. However, such approaches usually require testers to properly configure several parameters in order to be effective in identifying workload-dependent performance bugs, which may hinder their usability among practi...

DYNAMOJM: A JMeter Tool for Performance Testing using Dynamic Workload Adaptation

IFIP International Conference on Testing Software and Systems, 2019

Performance testing is a critical task to assure optimal experience for users, especially when th... more Performance testing is a critical task to assure optimal experience for users, especially when there are high loads of concurrent users. JMeter is one of the most widely used tools for load and stress testing. With JMeter, it is possible to test the performance of static and dynamic resources on the web. This paper presents DYNAMOJM, a novel tool built on top of JMeter that enables testers to create a dynamic workload for performance testing. This tool implements the DYNAMO approach, which has proven useful to find performance issues more efficiently than static testing techniques.

Improving the Testing of Java Garbage Collection Through an Efficient Benchmark Generation

International Conference in Software Engineering Research and Innovation (CONISOFT), 2018

Garbage Collection (GC) is a core feature of multiple modern technologies (e.g., Java, Android). ... more Garbage Collection (GC) is a core feature of multiple modern technologies (e.g., Java, Android). On one hand, it offers significant software engineering benefits over explicitly memory management, like preventing most types of memory leaks. On the other hand, GC is a known cause of performance degradation. However, it is considerably challenging to understand its exact impact on the overall application performance. This is because the non-deterministic nature of GC makes very complex to properly model it and evaluate its performance impacts. To help tackling these problems, we present an engine to generate realistic GC benchmarks by enabling to effectively capture the GC/memory behaviours experienced by real-world Java applications. We also demonstrate, through a comprehensive experimental evaluation, how such benchmarks can be useful to strengthen the evaluation of GC-related advancements.

Protecting Organizational Data Confidentiality in the Cloud using a High-Performance Anonymization Engine

Data security remains a top concern for the adoption of cloud-based delivery models, especially i... more Data security remains a top concern for the adoption of cloud-based delivery models, especially in the case of the Software as a Service (SaaS). This concern is primarily caused due to the lack of transparency on how customer data is managed. Clients depend on the security measures implemented by the service providers to keep their information protected. However, not many practical solutions exist to protect data from malicious insiders working for the cloud providers, a factor that represents a high potential for data breaches. This paper presents the High-Performance Anonymization Engine (HPAE), an approach to allow companies to protect their sensitive information from SaaS providers in a public cloud. This approach uses data anonymiza-tion to prevent the exposure of sensitive data in its original form, thus reducing the risk for misuses of customer information. This work involved the implementation of a prototype and an experimental validation phase, which assessed the performanc...