Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering
Automatically producing behavioral exception (BE) API documentation helps developers correctly use libraries. The state-of-the-art approaches are either rule-based, which is too restrictive in applicability, or deep learning (DL)-based, which requires large training datasets. To address this, we propose StatGen, a novel hybrid approach between statistical machine translation (SMT) and tree-structured translation that generates the BE documentation for any code and vice versa. We consider the documentation and the source code of an API method as two abstraction levels of the same intent. StatGen is specifically designed for this two-way inference and takes advantage of their structures for higher accuracy. We conducted several experiments to evaluate StatGen and show that it achieves high precision (75% and 75%) and recall (81% and 84%) in inferring BE documentation from source code and vice versa. StatGen achieves higher precision, recall, and BLEU scores than the state-of-the-art DL-based baseline models. We demonstrate StatGen's usefulness in two applications. First, we use it to generate BE documentation for Apache APIs that lack documentation by learning from the documentation of the equivalent APIs in the JDK; 44% of the generated documentation was rated as useful and 42% as somewhat useful. Second, we use StatGen to detect inconsistencies between the BE documentation and the corresponding implementations of several JDK 8 packages.
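To make the translation direction concrete, the sketch below greedily decodes code tokens into documentation phrases using a phrase table. The table entries, probabilities, and the `translate` helper are invented for illustration; StatGen's actual model additionally exploits tree structure.

```python
# Minimal phrase-based translation sketch: code tokens -> documentation.
# Phrase table and probabilities are hypothetical, not from the paper.
from typing import Dict, List, Tuple

PHRASE_TABLE: Dict[Tuple[str, ...], List[Tuple[str, float]]] = {
    ("if", "arg", "==", "null"): [("if the argument is null", 0.8)],
    ("throw", "new", "IllegalArgumentException"):
        [("throws IllegalArgumentException", 0.9)],
}

def translate(code_tokens: List[str]) -> Tuple[str, float]:
    """Greedy longest-match decoding over the phrase table."""
    i, out, score = 0, [], 1.0
    while i < len(code_tokens):
        for n in range(len(code_tokens) - i, 0, -1):  # longest phrase first
            key = tuple(code_tokens[i:i + n])
            if key in PHRASE_TABLE:
                phrase, p = max(PHRASE_TABLE[key], key=lambda e: e[1])
                out.append(phrase)
                score *= p
                i += n
                break
        else:
            i += 1  # skip tokens with no known translation
    return " ".join(out), score

doc, p = translate("if arg == null throw new IllegalArgumentException".split())
print(doc, p)  # "if the argument is null throws IllegalArgumentException", ~0.72
```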
Modern software systems are increasingly including machine learning (ML) as an integral component. However, we do not yet understand the difficulties faced by software developers when learning about ML libraries and using them within their systems. To that end, this work reports on a detailed (manual) examination of 3,243 highly-rated Q&A posts related to ten ML libraries, namely Tensorflow, Keras, scikit-learn, Weka, Caffe, Theano, MLlib, Torch, Mahout, and H2O, on Stack Overflow, a popular online technical Q&A forum. We classify these questions into seven typical stages of an ML pipeline to understand the correlation between the library and the stage. Then we study the questions and perform statistical analysis to address four research objectives: finding the most difficult stage, understanding the nature of the problems, understanding the nature of the libraries, and studying whether the difficulties have stayed consistent over time. Our findings reveal the urgent need for software engineering (SE) research in this area. Both static and dynamic analyses are mostly absent and badly needed to help developers find errors earlier. While there has been some early research on debugging, much more work is needed. API misuses are prevalent, and API design improvements are sorely needed. Last and somewhat surprisingly, a tug of war between providing higher levels of abstraction and the need to understand the behavior of the trained model is prevalent.
APIs are essential ingredients for developing complex software systems. However, they are difficult to learn and to use. Thus, developers may misuse them, which results in various types of issues. In this paper, we explore the use of a bio-inspired approach (the artificial immune system) to detect API misuses in client code. We built APIMMUNE, a novel API misuse detector. We collect normal usages of a given API from the set of client programs using the API, especially after some API usages were fixed in those programs. The normal API usages are considered as normal body cells. We transform them into normal-usage signatures. Then, artificial detectors are randomly generated by producing artificial deviations from these usages, with the objective of being different from the normal-usage signatures. The generated detectors have the ability to detect risky uses of APIs exactly as the immune system detects foreign cells of the organism. Moreover, for the detection purpose, only the artific...
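The immune-system analogy corresponds to the classic negative-selection algorithm, sketched minimally below. The bit-string signatures, Hamming-distance matching rule, and all parameters are assumptions for illustration, not APIMMUNE's actual signature encoding.

```python
# Toy negative selection: keep random detectors only if they do NOT match
# any normal ("self") signature, then flag usages that match a detector.
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def generate_detectors(self_set, n_detectors, length, radius, rng):
    """Discard candidates within `radius` of any normal signature."""
    detectors = []
    while len(detectors) < n_detectors:
        cand = tuple(rng.randint(0, 1) for _ in range(length))
        if all(hamming(cand, s) > radius for s in self_set):
            detectors.append(cand)
    return detectors

def is_risky(usage, detectors, radius):
    """A usage matching any detector is flagged, like a foreign cell."""
    return any(hamming(usage, d) <= radius for d in detectors)

rng = random.Random(0)
normal = {(0, 0, 0, 0, 0, 0, 0, 0), (1, 1, 0, 0, 0, 0, 0, 0)}
dets = generate_detectors(normal, n_detectors=20, length=8, radius=1, rng=rng)
# True if a detector near this abnormal signature happened to be generated
print(is_risky((1, 1, 1, 1, 1, 1, 1, 1), dets, radius=1))
```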
2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), 2019
To avoid exposing the original source code of a Web application, the variable names in JS code deployed in the wild are often replaced by short, meaningless names, making the code extremely difficult to understand and analyze manually. This paper presents JSNEAT, an information retrieval (IR)-based approach to recover the variable names in minified JS code. JSNEAT follows a data-driven approach to recover names by searching for them in a large corpus of open-source JS code. We use three types of contexts to match a variable in given minified code against the corpus: the context of the properties and roles of the variable, the context of that variable and its relations with other variables under recovery, and the context of the task of the function to which the variable contributes. We performed several empirical experiments to evaluate JSNEAT on a dataset of more than 322K JS files with 1M functions and 3.5M variables with 176K unique variable names. We found that JSNEAT achieves a high accuracy of 69.1%, a relative improvement of 66.1% and 43% over the two state-of-the-art approaches JSNice and JSNaughty, respectively. JSNEAT recovers the names in a file, or a single variable, twice as fast as JSNice and four times as fast as JSNaughty.
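A bare-bones view of the IR step: rank corpus names by the overlap between a minified variable's context features and the features mined for each known name. The feature encoding and the `CORPUS` entries below are hypothetical.

```python
# Rank candidate names by Jaccard similarity between context feature sets.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

CORPUS = {  # hypothetical mined contexts: name -> feature set
    "index":   {"prop:length", "op:++", "role:loop-counter"},
    "request": {"prop:url", "call:send", "role:argument"},
}

def rank_names(minified_var_context, k=2):
    return sorted(CORPUS, key=lambda n: jaccard(CORPUS[n], minified_var_context),
                  reverse=True)[:k]

# features observed for a minified variable used as a loop counter
print(rank_names({"prop:length", "op:++"}))  # ['index', 'request']
```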
Proceedings of the 40th International Conference on Software Engineering, 2018
Software developers often make use of online forums such as StackOverflow (SO) to learn how to use software libraries and their APIs. However, the code snippets in such a forum often contain undeclared, ambiguous, or largely unqualified external references. Such declaration ambiguity and external reference ambiguity present challenges for developers in learning to correctly use the APIs. In this paper, we propose STATTYPE, a statistical approach to resolve the fully qualified names (FQNs) for the API elements in such code snippets. Unlike existing approaches that are based on heuristics, STATTYPE has two well-integrated factors. We first learn from a large training code corpus the FQNs that often co-occur. Then, to derive the FQN for an API name in a code snippet, we use that knowledge and also leverage the context consisting of neighboring API names. To realize those factors, we treat the problem as statistical machine translation from source code with partially qualified names to source code with FQNs of the APIs. Our empirical evaluation on real-world code and StackOverflow posts shows that STATTYPE achieves very high accuracy, with 97.6% precision and 96.7% recall, which is 16.5% relatively higher than the state-of-the-art approach.
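A stripped-down illustration of the co-occurrence factor: among the candidate FQNs for a simple name, pick the one that co-occurs most often with the already-resolved names in the surrounding context. The candidate lists and counts are invented; STATTYPE itself realizes this via phrase-based SMT.

```python
# Hypothetical candidates and corpus co-occurrence counts (keys sorted).
CANDIDATES = {"List": ["java.util.List", "java.awt.List"]}
COOCCUR = {
    ("java.util.ArrayList", "java.util.List"): 950,
    ("java.awt.List", "java.util.ArrayList"): 3,
}

def resolve(simple_name, context_fqns):
    """Score each candidate FQN by co-occurrence with resolved neighbors."""
    def score(fqn):
        return sum(COOCCUR.get(tuple(sorted((fqn, c))), 0) for c in context_fqns)
    return max(CANDIDATES[simple_name], key=score)

print(resolve("List", ["java.util.ArrayList"]))  # java.util.List
```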
2017 IEEE/ACM 39th International Conference on Software Engineering: New Ideas and Emerging Technologies Results Track (ICSE-NIER), 2017
API documentation is useful for developers to better understand how to correctly use libraries. However, not all libraries provide good documentation on API usages. To provide better documentation, existing techniques have been proposed, including program analysis-based and data mining-based approaches. In this work, instead of mining, we aim to generate behavioral exception documentation for any given code. We treat the problem of automatically generating documentation from a novel perspective: statistical machine translation (SMT). We consider the documentation and the source code of an API method as two abstraction levels of the same intent. We use SMT to translate source code into documentation and vice versa. Our preliminary results show that this direction of statistical learning for inference between implementations and documentation is very promising.
2013 35th International Conference on Software Engineering (ICSE), 2013
In today's software-centric world, ultra-large-scale software repositories, e.g. SourceForge (350,000+ projects), GitHub (250,000+ projects), and Google Code (250,000+ projects), are the new Library of Alexandria. They contain an enormous corpus of software and information about software. Scientists and engineers alike are interested in analyzing this wealth of information, both out of curiosity and for testing important hypotheses. However, systematically extracting relevant data from these repositories and analyzing it to test hypotheses is hard, and best left to mining software repository (MSR) experts! The goal of Boa, a domain-specific language and infrastructure described here, is to ease testing MSR-related hypotheses. We have implemented Boa and provide a web-based interface to Boa's infrastructure. Our evaluation demonstrates that Boa substantially reduces programming effort, thus lowering the barrier to entry. We also see drastic improvements in scalability. Last but not least, reproducing an experiment conducted using Boa is just a matter of re-running the small Boa programs provided by previous researchers.
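As a rough feel for what such a query does (Boa's own syntax is not reproduced here), the Python stand-in below computes a typical MSR aggregation, counting projects per language, over invented metadata records.

```python
# Stand-in for an MSR aggregation a few lines of Boa can express:
# "how many projects use each programming language?"
from collections import Counter

projects = [  # invented records; Boa supplies real repository metadata
    {"name": "p1", "language": "Java"},
    {"name": "p2", "language": "C"},
    {"name": "p3", "language": "Java"},
]

counts = Counter(p["language"] for p in projects)
for lang, n in counts.most_common():
    print(lang, n)  # Java 2, then C 1
```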
Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, 2014
Modern software relies on existing application programming interfaces (APIs) from libraries. Formal specifications for the APIs enable many software engineering tasks and help developers correctly use them. In this work, we mine large-scale repositories of existing open-source software to derive potential preconditions for API methods. Our key idea is that an API's preconditions will appear frequently in an ultra-large code corpus with a large number of API usages, while project-specific conditions will occur less frequently. First, we find all client methods invoking APIs. We then compute a control dependence relation from each call site and mine the potential conditions used to reach those call sites. We use these guard conditions as a starting point to automatically infer the preconditions for each API. We analyzed almost 120 million lines of code from SourceForge and Apache projects to infer preconditions for the standard Java Development Kit (JDK) library. The results show that our technique can achieve high accuracy, with recall of 75-80% and precision of 82-84%. We also found 5 preconditions missing from human-written specifications; all were confirmed by a specification expert. In a user study, participants found 82% of the mined preconditions to be a good starting point for writing specifications. Using our mining results, we also built a benchmark of more than 4,000 precondition-related bugs.
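A minimal sketch of the frequency intuition: collect the guard conditions that dominate each call site of an API and keep those appearing at a sufficiently high fraction of sites. The condition strings and the support threshold are illustrative.

```python
# Keep guard conditions that recur across many call sites of one API.
from collections import Counter

def mine_preconditions(call_site_guards, min_support=0.6):
    """call_site_guards: one set of guard conditions per call site,
    mined from the control-dependent paths reaching that site."""
    counts = Counter(g for guards in call_site_guards for g in guards)
    n = len(call_site_guards)
    return {g for g, c in counts.items() if c / n >= min_support}

sites = [{"arg != null", "idx >= 0"}, {"arg != null"}, {"arg != null", "open"}]
print(mine_preconditions(sites))  # {'arg != null'} — the project-specific
# guards 'idx >= 0' and 'open' fall below the support threshold
```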
Proceedings of the 2013 companion publication for conference on Systems, programming, & applications: software for humanity, 2013
Mining source code has become a common task for researchers and has yielded significant benefits for the software engineering community. Mining source code, however, is a very difficult and time-consuming task. The Boa language and infrastructure was designed to ease mining of project and revision metadata. Recently, Boa was extended to support mining source code and currently contains source code for over 23k Java projects, including full revision histories. In this demonstration, we pose source code mining tasks and give solutions using Boa. We then execute these programs via our web-based infrastructure and show how to easily make the results available to future researchers.
Proceedings of the 3rd annual conference on Systems, programming, and applications: software for humanity, 2012
Analyzing the wealth of information contained in software repositories requires significant expertise in mining techniques as well as a large infrastructure. In order to make this information more accessible to non-experts, we present the Boa language and infrastructure. Using Boa, these mining tasks are much simpler to write, as the details are abstracted away. Boa programs also run on a distributed cluster to automatically provide massive parallelization to users and return results in minutes instead of potentially days.
Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on The foundations of software engineering, 2009
The interplay of multiple objects in object-oriented programming often follows specific protocols, for example, certain orders of method calls and/or control-structure constraints among them that are part of the intended object usages. Unfortunately, this information is not always documented. That creates a long learning curve and, more importantly, leads to subtle problems due to the misuse of objects. In this paper, we propose GrouMiner, a novel graph-based approach for mining the usage patterns of one or multiple objects. The GrouMiner approach includes a graph-based representation for multiple object usages, a pattern mining algorithm, and an anomaly detection technique that are efficient, accurate, and resilient to software changes. Our experiments on several real-world programs show that our prototype is able to find useful usage patterns with multiple objects and control structures and to translate them into user-friendly code skeletons to assist developers in programming. It could also detect usage anomalies that caused yet-undiscovered defects and code smells in those programs.
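A toy version of the mining/detection split, assuming usage graphs are reduced to sets of directed edges between API actions: frequent edges become patterns, and a usage that reaches a pattern's source but never its target is flagged. GrouMiner's real graphs and algorithms are considerably richer.

```python
# Mine frequent usage edges, then flag usages that omit a pattern's target.
from collections import Counter

def mine_edge_patterns(usages, min_support=2):
    """usages: list of sets of (from_action, to_action) edges."""
    counts = Counter(e for u in usages for e in u)
    return {e for e, c in counts.items() if c >= min_support}

def anomalies(usage, patterns):
    """Pattern edges whose source appears in the usage but whose target
    is missing entirely, i.e., a likely omitted step."""
    nodes = {n for e in usage for n in e}
    return [e for e in patterns
            if e[0] in nodes and e not in usage and e[1] not in nodes]

usages = [{("Iterator.hasNext", "Iterator.next")},
          {("Iterator.hasNext", "Iterator.next")},
          {("Iterator.next", "List.add")}]
patterns = mine_edge_patterns(usages)
print(anomalies({("Iterator.hasNext", "Map.get")}, patterns))
# [('Iterator.hasNext', 'Iterator.next')] — next() was never called
```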
Existing version control systems are often based on text line-oriented models of change representation, which do not help software developers understand code evolution. Other, more advanced change representation models that encompass more program semantics and structure are still not quite practical due to their high computational complexity. This paper presents OperV, a novel operation-based version control model that is able to support both coarse and fine levels of granularity in program source code. In OperV, a software system is represented by a project tree whose nodes represent all program entities, such as packages, classes, methods, etc. The changes of the system are represented via edit operations on the tree. OperV also provides the algorithms to diff, store, and retrieve the versions of such entities. These algorithms are based on the mapping of nodes between versions of the project tree. This mapping technique uses 1) a divide-and-conquer technique to map coarse- and fine-grained entities separately, 2) unchanged text regions to map unchanged leaf nodes, and 3) structure-based similarity of the sub-trees to map their root nodes bottom-up and then top-down. The empirical evaluation of OperV has shown that it is scalable and efficient, and that it could be useful in understanding program evolution.
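A minimal sketch of the structure-based similarity used for mapping subtree roots, here reduced to a Dice coefficient over leaf texts; OperV's actual mapping also uses unchanged text regions and a bottom-up-then-top-down pass.

```python
# Map roots of two project-tree versions by overlap of their leaf texts.
class Node:
    def __init__(self, label, text="", children=()):
        self.label, self.text, self.children = label, text, list(children)

def leaf_texts(n):
    return {n.text} if not n.children else {t for c in n.children
                                            for t in leaf_texts(c)}

def similarity(a, b):
    """Dice coefficient over the leaf-text sets of two subtrees."""
    ta, tb = leaf_texts(a), leaf_texts(b)
    return 2 * len(ta & tb) / (len(ta) + len(tb)) if ta or tb else 1.0

m1 = Node("method", children=[Node("stmt", "x = 0"), Node("stmt", "return x")])
m2 = Node("method", children=[Node("stmt", "x = 1"), Node("stmt", "return x")])
print(similarity(m1, m2))  # 0.5; map the roots when above a chosen threshold
```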
2009 IEEE/ACM International Conference on Automated Software Engineering, 2009
Recent research results show several benefits of the management of code clones. In this paper, we introduce Clever, a novel clone-aware software configuration management (SCM) system. In addition to traditional SCM functionality, Clever provides clone management support, including clone detection and update, clone change management, clone consistency validation, clone synchronization, and clone merging. Clever represents source code and clones as (sub)trees in Abstract Syntax Trees (ASTs), measures code similarity based on structural characteristic vectors, and describes code changes as tree editing scripts. The key techniques of Clever include the algorithms to compute tree editing scripts; to detect and update code clones and their groups; and to analyze the changes of cloned code to validate their consistency and recommend the relevant synchronization. Our empirical study on many real-world programs shows that Clever is highly efficient and accurate in clone detection and updating, and provides useful analysis of clone changes.
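The characteristic-vector idea can be sketched as counting AST node types and comparing the counts; the node-type alphabet and the similarity formula below are illustrative stand-ins for Clever's actual vectors.

```python
# Compare code fragments by counts of their AST node types.
from collections import Counter

def characteristic_vector(ast_node_types):
    """Counts of AST node types in a fragment."""
    return Counter(ast_node_types)

def similarity(v1, v2):
    """1 minus the normalized L1 distance between two count vectors."""
    keys = set(v1) | set(v2)
    l1 = sum(abs(v1[k] - v2[k]) for k in keys)
    total = sum(v1.values()) + sum(v2.values())
    return 1 - l1 / total if total else 1.0

a = characteristic_vector(["If", "Call", "Call", "Return"])
b = characteristic_vector(["If", "Call", "Return"])
print(similarity(a, b))  # ~0.857; clones above a chosen threshold
```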
Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, 2010
Previous research confirms the existence of recurring bug fixes in software systems. Analyzing such fixes manually, we found that a large percentage of them occur in code peers, classes/methods having similar roles in the systems, such as providing similar functions and/or participating in similar object interactions. Based on a graph-based representation of object usages, we have developed several techniques to identify code peers, recognize recurring bug fixes, and recommend changes for code units from the bug ...
2009 IEEE 31st International Conference on Software Engineering, 2009
Model-Driven Engineering (MDE) has become an important development framework for many large-scale software systems. Previous research has reported that, as in traditional code-based development, cloning also occurs in MDE. However, there has been little work on clone detection in models, and existing work is limited in detection precision and completeness. This paper presents ModelCD, a novel clone detection tool for Matlab/Simulink models that is able to efficiently and accurately detect both exactly matched and approximate model clones. The core of ModelCD is two novel graph-based clone detection algorithms that are able to systematically and incrementally discover clones with a high degree of completeness, accuracy, and scalability. We have conducted an empirical evaluation with various experimental studies on many real-world systems to demonstrate the usefulness of our approach and to compare the performance of ModelCD with existing tools.
Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, 2012
Software building is an important task during software development. However, program analysis support for build code is still limited, especially for build code written in a dynamic language such as Make. We introduce SYMake, a novel program analysis and refactoring tool for build code in Makefiles. SYMake is capable of detecting several types of code smells and errors, such as cyclic dependencies, rule inclusion, duplicate prerequisites, and recursive variable loops. It also supports automatic build code refactoring, e.g. rule extraction/removal, target creation, target/variable renaming, and prerequisite extraction. In addition, SYMake provides analysis of defined rules, targets, prerequisites, and associated information to help developers better understand build code in a Makefile and its included files.
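One of the named smells, cyclic dependencies, reduces to cycle detection on the target-prerequisite graph. Here is a sketch, assuming the Makefile has already been parsed into a target-to-prerequisites map:

```python
# DFS-based cycle detection over a Makefile's target dependency graph.
def find_cycle(deps):
    """deps: target -> list of prerequisite targets. Returns one cycle
    as a list of targets, or None if the graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color, stack = {}, []
    def dfs(t):
        color[t] = GRAY
        stack.append(t)
        for p in deps.get(t, []):
            if color.get(p, WHITE) == GRAY:      # back edge: cycle found
                return stack[stack.index(p):] + [p]
            if color.get(p, WHITE) == WHITE:
                cyc = dfs(p)
                if cyc:
                    return cyc
        color[t] = BLACK
        stack.pop()
        return None
    for t in list(deps):
        if color.get(t, WHITE) == WHITE:
            cyc = dfs(t)
            if cyc:
                return cyc
    return None

print(find_cycle({"all": ["build"], "build": ["gen"], "gen": ["build"]}))
# ['build', 'gen', 'build']
```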
DRC: A detection tool for dangling references in PHP-based web applications
2013 35th International Conference on Software Engineering (ICSE), 2013
PHP is a server-side language that is widely used for creating dynamic Web applications. However, as a dynamic language, PHP may induce certain programming errors that reveal themselves only at run time. A common type of error is dangling references, which occur if the referred program entities have not been declared in the current program execution. To prevent the run-time errors caused by such dangling references, we introduce Dangling Reference Checker (DRC), a novel tool to statically detect those references in the source code of PHP-based Web applications. DRC first identifies the path constraints of the program executions in which a program entity appears and then matches the path constraints of the entity's declarations and references to detect dangling ones. DRC is able to detect dangling reference errors in several real-world PHP systems with high accuracy. The video demonstration for DRC is available at http://www.youtube.com/watch?v=y_AKZYhLlU4.
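A minimal sketch of the constraint-matching idea: a reference is safe on a path only if some declaration's path constraints are implied by (here, a subset of) the reference's. The constraint atoms are invented, and real PHP path constraints call for symbolic reasoning rather than simple set inclusion.

```python
# Flag references whose path constraints are not covered by any declaration.
def dangling_references(decls, refs):
    """decls/refs: entity name -> list of path-constraint sets (frozensets
    of branch conditions under which the declaration/reference executes)."""
    dangling = []
    for name, ref_paths in refs.items():
        decl_paths = decls.get(name, [])
        for rp in ref_paths:
            # safe only if some declaration is guaranteed on this path
            if not any(dp <= rp for dp in decl_paths):
                dangling.append((name, rp))
    return dangling

decls = {"f": [frozenset({"mode == 'admin'"})]}
refs = {"f": [frozenset({"mode == 'admin'", "logged_in"}), frozenset()]}
print(dangling_references(decls, refs))
# [('f', frozenset())] — f may be referenced on a path where it was
# never declared
```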
Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, 2013
Recent research has successfully applied the statistical n-gram language model to show that source code exhibits a good level of repetition. The n-gram model is shown to have good predictability in supporting code suggestion and completion. However, the state-of-the-art n-gram approach to capturing source code regularities/patterns is based only on the lexical information in a local context of the code units. To improve predictability, we introduce SLAMC, a novel statistical semantic language model for source code. It incorporates semantic information into code tokens and models the regularities/patterns of such semantic annotations, called sememes, rather than their lexemes. It combines the local context in semantic n-grams with global technical concerns/functionality in an n-gram topic model, together with pairwise associations of program elements. Based on SLAMC, we developed a new code suggestion method, which was empirically evaluated on several projects to have 18-68% relatively higher accuracy than the state-of-the-art approach.
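A toy sememe-level n-gram model: train trigram counts over TYPE:method annotations instead of raw lexemes and suggest the next token from the current context. The sememe encoding here is a simplification of SLAMC's.

```python
# Trigram counts over semantic annotations (sememes), not raw lexemes.
from collections import Counter, defaultdict

def train_ngrams(sequences, n=3):
    model = defaultdict(Counter)
    for seq in sequences:
        padded = ["<s>"] * (n - 1) + list(seq)
        for i in range(len(seq)):
            ctx = tuple(padded[i:i + n - 1])
            model[ctx][padded[i + n - 1]] += 1
    return model

def suggest(model, context, n=3, k=3):
    ctx = tuple((["<s>"] * (n - 1) + list(context))[-(n - 1):])
    return model[ctx].most_common(k)

corpus = [["FileReader:new", "BufferedReader:new", "BufferedReader:readLine"],
          ["FileReader:new", "BufferedReader:new", "BufferedReader:close"]]
m = train_ngrams(corpus)
print(suggest(m, ["FileReader:new", "BufferedReader:new"]))
# [('BufferedReader:readLine', 1), ('BufferedReader:close', 1)]
```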
Detecting recurring and similar software vulnerabilities
Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 2, 2010
New software security vulnerabilities are discovered on an almost daily basis, and it is vital to be able to identify and resolve them as early as possible. Fortunately, many software vulnerabilities are recurring or very similar; thus, one could effectively detect and fix a vulnerability in a system by consulting the similar vulnerabilities and fixes from other systems. In this paper, we propose SecureSync, an automatic approach to detect and provide suggested resolutions for recurring software vulnerabilities on multiple systems ...