
# code_transformed: The Influence of Large Language Models on Code


## Contents

- [Data Collection](#data-collection)
- [Naming Patterns](#naming-patterns)
- [Complexity and Maintainability](#complexity-and-maintainability)
- [Code Similarity](#code-similarity)
- [Labels in the Reasoning Process](#labels-in-the-reasoning-process)
- [Citation](#citation)

## Data Collection

### GitHub Data

We collect a total of 19,898 GitHub repositories and 926,935 source code files, corresponding to arXiv papers from the first quarter of 2020 to the first quarter of 2025. Our arXiv dataset is organized across two GitHub repositories: Python files are in LLM_code/arxiv_dataset, and C/C++ code is in LLM_code/arxiv_dataset_cpp.

```
├── 2020                   // Year
│   ├── Q1                 // Quarter
│   │   ├── repo_name      // Repository name
│   │   │   ├── xxx.py     // Project Python file
│   │   │   ├── ...
│   │   │   └── time_info.txt  // File creation/modification time information
```
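The layout above can be traversed with a few lines of Python. The sketch below is illustrative only: the root directory name and the `iter_python_files` helper are our assumptions, not part of the repository.

```python
from pathlib import Path

def iter_python_files(root: str = "arxiv_dataset"):
    """Yield (year, quarter, repo_name, path) for every .py file in the tree."""
    for year_dir in sorted(Path(root).iterdir()):        # 2020, 2021, ...
        if not year_dir.is_dir():
            continue
        for quarter_dir in sorted(year_dir.iterdir()):   # Q1 .. Q4
            for repo_dir in sorted(quarter_dir.iterdir()):
                if not repo_dir.is_dir():
                    continue
                for py_file in repo_dir.rglob("*.py"):
                    yield year_dir.name, quarter_dir.name, repo_dir.name, py_file

if __name__ == "__main__":
    for year, quarter, repo, path in iter_python_files():
        print(year, quarter, repo, path.name)
```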

### Human-Written Code

We utilize Code4Bench, a multidimensional benchmark built on Codeforces data. This dataset contains user submissions to Codeforces made before 2020, which predate widespread LLM adoption and were therefore barely impacted by LLMs. For comparison, we generate code using LLMs under several prompting strategies, sketched below.
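As a rough illustration of the two generation settings referenced later (direct generation, "ANS", and reference-guided generation, "REF"), the prompts might be built as follows. The wording and the `build_prompts` helper are hypothetical, not the exact prompts used in the study.

```python
def build_prompts(problem_description: str, human_solution: str) -> dict:
    """Hypothetical prompt templates for the two generation settings."""
    direct = (  # description only -> "ANS" in the similarity analysis
        "Solve the following Codeforces problem in C++.\n\n"
        + problem_description
    )
    reference_guided = (  # description + human code -> "REF"
        "Solve the following Codeforces problem in C++, "
        "using the reference solution below as guidance.\n\n"
        "Problem:\n" + problem_description + "\n\n"
        "Reference solution:\n" + human_solution
    )
    return {"ANS": direct, "REF": reference_guided}
```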

## Naming Patterns

We categorize variable, function, and file names into several distinct formats (e.g., snake_case), and also take name length into account.
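As an illustration, naming formats can be detected with simple regular expressions. The category set and patterns below are our assumptions (first match wins, so order matters), not necessarily the exact definitions used in the paper.

```python
import re

NAME_PATTERNS = {
    "snake_case": re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$"),
    "camelCase":  re.compile(r"^[a-z][a-z0-9]*([A-Z][a-z0-9]*)+$"),
    "UPPER_CASE": re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$"),
    "PascalCase": re.compile(r"^([A-Z][a-z0-9]*){2,}$"),
    "lowercase":  re.compile(r"^[a-z][a-z0-9]*$"),
}

def classify_name(name: str) -> str:
    """Return the first matching naming format, or 'other'."""
    for fmt, pattern in NAME_PATTERNS.items():
        if pattern.match(name):
            return fmt
    return "other"

# Format and length are tracked together:
print(classify_name("max_value"), len("max_value"))  # snake_case 9
```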

> [!IMPORTANT]
> Finding 1: The coding style of human-written code may be influenced by LLMs: they may not only mirror existing norms but also subtly reshape them, gradually pushing human developers toward greater stylistic alignment with LLM-preferred conventions.

## Complexity and Maintainability

Cyclomatic complexity is a metric that measures the number of linearly independent paths through a program's source code.
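For Python files, such metrics can be computed with off-the-shelf tooling. The sketch below uses the `radon` package as one possible choice; it is not necessarily the toolchain used in the paper.

```python
# pip install radon
from radon.complexity import cc_visit
from radon.metrics import mi_visit

source = '''
def classify(n):
    if n < 0:
        return "negative"
    elif n == 0:
        return "zero"
    return "positive"
'''

# Cyclomatic complexity per function: 1 + number of decision points.
for block in cc_visit(source):
    print(block.name, block.complexity)  # classify 3

# Maintainability Index for the whole module (higher is more maintainable).
print(mi_visit(source, multi=True))
```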

> [!IMPORTANT]
> Finding 2: For I/O algorithm problems, LLM-generated code tends to exhibit higher maintainability, lower difficulty, and fewer bugs than human-written solutions, which aligns with the evolution of GitHub code after 2023 Q1. Moreover, the quality of reference-guided code is generally inferior to that of directly generated code.

## Code Similarity

We compare three versions of each problem’s code: the original human-authored solution (AC), the LLM’s output given only the problem description (ANS), and the LLM’s output when additionally conditioned on the human solution (REF). We compute pairwise cosine and Jaccard similarities among AC, ANS, and REF.
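A minimal sketch of the two measures is given below, using token-level Jaccard similarity and bag-of-tokens cosine similarity; the exact tokenization and featurization used in the paper may differ.

```python
import math
import re
from collections import Counter

def tokenize(code: str) -> list[str]:
    # Crude lexer: identifiers/keywords, numbers, single operator characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)

def jaccard(a: str, b: str) -> float:
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def cosine(a: str, b: str) -> float:
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) \
         * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 1.0

# Toy example standing in for an (AC, ANS) pair:
ac  = "int main(){int n;cin>>n;cout<<n*n;}"
ans = "int main(){int x;cin>>x;cout<<x*x;}"
print(round(jaccard(ac, ans), 3), round(cosine(ac, ans), 3))
```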

> [!IMPORTANT]
> Finding 3: LLMs can effectively mimic human coding style when given reference code, but without such guidance, their generated solutions diverge significantly from human-written code, especially in I/O algorithm tasks.

## Labels in the Reasoning Process

To further refine our analysis, we examine, for each question, whether the labels mentioned in the reasoning process match the question's true labels.

Let $T$ denote the set of all labels. For each question $q$, let $A_q \subseteq T$ be the set of true labels in the question description, and let $R_q \subseteq T$ be the set of labels in the reasoning process.

We then define the $\mathrm{match}$ and $\mathrm{error}$ metrics as follows:

$$
\begin{align}
\mathrm{match}(q) &= \mathbf{1}\left( A_q \cap R_q \ne \varnothing \right), \\
\mathrm{error}(q) &= \mathbf{1}\left( \left( T \setminus A_q \right) \cap R_q \ne \varnothing \right),
\end{align}
$$

where $\mathbf{1}(\cdot)$ is the indicator function: it returns 1 if the condition is met, and 0 otherwise.
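In code, the two metrics reduce to simple set operations. A direct translation (the label vocabulary below is made up for illustration):

```python
ALL_LABELS = {"dp", "greedy", "graphs", "math", "strings"}  # T (illustrative)

def match(true_labels: set, reasoning_labels: set) -> int:
    """1 if the reasoning mentions at least one true label (A_q ∩ R_q ≠ ∅)."""
    return int(bool(true_labels & reasoning_labels))

def error(true_labels: set, reasoning_labels: set,
          all_labels: set = ALL_LABELS) -> int:
    """1 if the reasoning mentions a label absent from the question
    ((T \\ A_q) ∩ R_q ≠ ∅)."""
    return int(bool((all_labels - true_labels) & reasoning_labels))

# A_q = {dp, math}; the reasoning mentions R_q = {dp, greedy}:
print(match({"dp", "math"}, {"dp", "greedy"}))  # 1 (dp is a true label)
print(error({"dp", "math"}, {"dp", "greedy"}))  # 1 (greedy is spurious)
```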

> [!IMPORTANT]
> Finding 4: LLMs show limited algorithm-analysis capability, are more inclined to approach C/C++ code from an algorithmic perspective, and harder problems may better activate their algorithmic reasoning capabilities.

## Citation

```bibtex
@article{xu2025code_transformed,
  title={code\_transformed: The Influence of Large Language Models on Code},
  author={Xu, Yuliang and Huang, Siming and Geng, Mingmeng and Wan, Yao and Shi, Xuanhua and Chen, Dongping},
  journal={arXiv preprint arXiv:2506.12014},
  year={2025}
}
```
