CodeCleaner is an automated code refactoring toolkit for mitigating data contamination in evaluating code language models (CLMs).
- It refactors code to resolve data contamination, following the idea of disrupting consecutive characters/tokens as much as possible while keeping the semantics unchanged.
- It includes 11 refactoring operators applicable to Python and Java code, covering structure, semantics, and code style refactoring.
An example illustrating the key idea: the N-gram matching overlap ratio between the given code and the Stack-v1 training corpus drops dramatically after refactoring.
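To make the overlap ratio concrete, the sketch below computes the fraction of a snippet's token N-grams that also occur in a reference corpus. The whitespace tokenization, the choice of `n=3`, the toy one-document corpus, and the function names `ngrams` and `overlap_ratio` are all assumptions for illustration; the scripts in this artifact may compute the ratio differently.

```python
def ngrams(tokens, n):
    """Return every window of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_ratio(code, corpus_docs, n=3):
    """Fraction of the n-grams in `code` that also occur in the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams.update(ngrams(doc.split(), n))
    code_grams = ngrams(code.split(), n)
    if not code_grams:
        return 0.0
    hits = sum(1 for gram in code_grams if gram in corpus_grams)
    return hits / len(code_grams)


# Toy corpus (hypothetical): one pre-tokenized snippet.
corpus = ["if x > 0 : return x else : return - x"]
original = "if x > 0 : return x else : return - x"
refactored = "if not x > 0 : return - x else : return x"

print(overlap_ratio(original, corpus))
print(overlap_ratio(refactored, corpus))
```

The original snippet matches the corpus completely, while the refactored version (condition negated, branches swapped) shares only some of its N-grams with it.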
This is the artifact of CodeCleaner - ICSE Industry Challenge 2025. It includes:
- implementation of the CodeCleaner refactoring toolkit (9 method-level and 2 class-level Python refactoring operators);
- toolkit demo Jupyter notebooks;
- data for obtaining the study results presented in our paper;
- scripts for reproducing our study (e.g., the scripts for calculating overlap ratio and perplexity).
This operator negates the if-condition and flips the statements in if- and else-branches.
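A minimal sketch of this transformation, written with Python's standard `ast` module (Python 3.9+ for `ast.unparse`). The class name `IfFlipper` and the sample function are hypothetical, and the toolkit's own implementation may differ.

```python
import ast


class IfFlipper(ast.NodeTransformer):
    """Negate each if-condition and swap the if- and else-branches."""

    def visit_If(self, node):
        self.generic_visit(node)  # transform nested if-statements first
        if not node.orelse:
            return node  # no else-branch, so there is nothing to swap
        node.test = ast.UnaryOp(op=ast.Not(), operand=node.test)
        node.body, node.orelse = node.orelse, node.body
        return node


source = """
def sign(x):
    if x >= 0:
        return "non-negative"
    else:
        return "negative"
"""

tree = IfFlipper().visit(ast.parse(source))
flipped = ast.unparse(ast.fix_missing_locations(tree))
print(flipped)
```

The flipped function tests `not x >= 0` and returns the two strings in swapped branches, so its behavior is unchanged.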
This operator replaces while loops with their equivalent for loops.
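As an illustration, the two hand-written functions below show a while loop and the equivalent for loop. The function names are hypothetical and the pair is a before/after sketch, not output produced by the toolkit.

```python
def sum_below_while(n):
    """Sum the integers 0..n-1 with a while loop."""
    total = 0
    i = 0
    while i < n:
        total += i
        i += 1
    return total


def sum_below_for(n):
    """Same computation with the equivalent for loop."""
    total = 0
    for i in range(n):
        total += i
    return total


print(sum_below_while(5), sum_below_for(5))
```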
In Python, there are two ways to iterate over an iterable object, i.e., to directly access each element or use indices within a range. This operator transforms code between these two ways.
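The two iteration styles look like this side by side. The function names are hypothetical; note that the index-based form assumes a sequence that supports `len()` and indexing, such as a list.

```python
def total_length_direct(words):
    """Access each element directly."""
    total = 0
    for word in words:
        total += len(word)
    return total


def total_length_indexed(words):
    """Access each element through an index in a range."""
    total = 0
    for i in range(len(words)):
        total += len(words[i])
    return total


words = ["code", "cleaner"]
print(total_length_direct(words), total_length_indexed(words))
```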
This operator applies the commutative law to logical operations, i.e., it swaps the operands of a logical operator.
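A minimal sketch of operand swapping with the standard `ast` module (Python 3.9+). The class name `CommutativeFlipper` and the sample function are hypothetical. The sketch assumes the operands are side-effect-free boolean expressions; because `and`/`or` short-circuit in Python, reordering operands with side effects could change behavior.

```python
import ast


class CommutativeFlipper(ast.NodeTransformer):
    """Reverse the operand order of every `and` / `or` expression."""

    def visit_BoolOp(self, node):
        self.generic_visit(node)  # handle nested boolean expressions first
        node.values = list(reversed(node.values))
        return node


source = """
def in_range(x, lo, hi):
    return lo <= x and x <= hi
"""

flipped = ast.unparse(CommutativeFlipper().visit(ast.parse(source)))
print(flipped)
```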
This operator appends these parameters if they do not exist in the original method declarations.
This operator adds two decorators that do not affect a method's functionality, i.e., @timing (measuring the execution time) and @measure_memory_usage (measuring the memory usage), to Python methods.
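The decorator names come from the description above; their bodies below are an assumption, sketched with the standard `time` and `tracemalloc` modules, and may differ from the toolkit's own definitions. The decorated function `squares` is hypothetical.

```python
import functools
import time
import tracemalloc


def timing(func):
    """Print the wall-clock time of each call without changing its result."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            print(f"{func.__name__} took {elapsed:.6f} s")
    return wrapper


def measure_memory_usage(func):
    """Print the peak traced memory of each call without changing its result."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        try:
            return func(*args, **kwargs)
        finally:
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            print(f"{func.__name__} peak memory: {peak} bytes")
    return wrapper


@timing
@measure_memory_usage
def squares(n):
    return [i * i for i in range(n)]


print(squares(4))
```

Both wrappers pass arguments and return values through untouched, so the decorated method computes the same result while its source text changes.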
For classes that inherit from superclasses, this operator copies methods from the superclasses into the subclass if these methods are not overridden in the subclass.
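A hand-written before/after pair illustrating the effect. The classes `Shape`, `Square`, and `SquareRefactored` are hypothetical examples, not toolkit output.

```python
class Shape:
    def __init__(self, name):
        self.name = name

    def area(self):
        return 0

    def describe(self):
        return f"{self.name} with area {self.area()}"


# Before: `describe` is inherited and does not appear in the subclass body.
class Square(Shape):
    def __init__(self, side):
        super().__init__("square")
        self.side = side

    def area(self):
        return self.side ** 2


# After: the non-overridden `describe` is copied into the subclass.
class SquareRefactored(Shape):
    def __init__(self, side):
        super().__init__("square")
        self.side = side

    def area(self):
        return self.side ** 2

    def describe(self):
        return f"{self.name} with area {self.area()}"


print(Square(3).describe())
print(SquareRefactored(3).describe())
```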
This operator replaces the identifiers in the source code with their corresponding synonyms.
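A minimal sketch of renaming with the standard `ast` module (Python 3.9+). The synonym table `SYNONYMS`, the class name `Renamer`, and the sample function are hypothetical; how the toolkit actually selects synonyms is not shown here. The sketch renames variables and parameters only and does not handle keyword arguments at call sites.

```python
import ast

# Hypothetical synonym table, for illustration only.
SYNONYMS = {"total": "sum_value", "items": "elements", "item": "element"}


class Renamer(ast.NodeTransformer):
    """Rename variables and parameters according to SYNONYMS."""

    def visit_Name(self, node):
        node.id = SYNONYMS.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = SYNONYMS.get(node.arg, node.arg)
        return node


source = """
def add_up(items):
    total = 0
    for item in items:
        total += item
    return total
"""

renamed = ast.unparse(Renamer().visit(ast.parse(source)))
print(renamed)
```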
This operator performs normalization on the code styles, such as unifying single and double quote marks, regularizing the number of spaces, and using parentheses to indicate operation precedence.
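One simple way to approximate quote and whitespace normalization is to round-trip the code through Python's `ast` module (Python 3.9+). This is an assumption for illustration, not necessarily how the toolkit normalizes code; in particular, `ast.unparse` drops redundant parentheses rather than adding them, and it discards comments. The helper name `normalize` and the sample snippets are hypothetical.

```python
import ast


def normalize(source):
    """Re-emit the code from its AST, which unifies quotes and spacing."""
    return ast.unparse(ast.parse(source))


messy = 'def greet( name ):\n    return   "Hello, "+name+\'!\'\n'
tidy = "def greet(name):\n    return 'Hello, ' + name + '!'\n"

print(normalize(messy))
```

Two snippets that differ only in quote style and spacing normalize to the same text.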
Camel case (e.g., camelCase) and lower snake case (e.g., lower_snake) are two widely used naming conventions in programming. This operator transforms identifier names between these two styles.
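The conversion itself can be sketched with two small helpers. The function names are hypothetical, and the sketch covers plain identifiers only; names with leading underscores or runs of capitals (acronyms) would need extra handling.

```python
import re


def camel_to_snake(name):
    """camelCase -> lower_snake: insert '_' before each inner capital."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def snake_to_camel(name):
    """lower_snake -> camelCase: capitalize every part after the first."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


print(camel_to_snake("camelCase"))
print(snake_to_camel("lower_snake"))
```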
We provide Jupyter notebooks to illustrate the usage of our toolkit to perform refactoring based on the operators introduced above.

    /CodeCleaner's Artifact
        /src                                  # Source files of tools, experiment scripts, demos, etc.
            /utils                            # Key source files of tools
            {xxx}.ipynb                       # Demo Jupyter Notebooks
            {xxx}.py                          # Source files & scripts
        /data                                 # Data used by the study conducted in the paper
            /data-{type}                      # {type} in {pymethod2021, pyclass324, year, java} corresponding to the data for RQ1-4
                /{opr}/question.jsonl         # Storing the corresponding code entities refactored with {opr}
                /model_score                  # The models' MIA-related scores for the codes refactored with {opr}
                /overlap_v1/overlap{opr}.txt  # The overlap calculation record for the codes refactored with {opr}
        /Figures                              # Illustrations for our toolkit
        README.md
        LICENSE
        requirements.txt                      # or other dependency management files







