Self-collaboration Code Generation via ChatGPT
YIHONG DONG∗ , XUE JIANG∗ , ZHI JIN, and GE LI† , Key Laboratory of High Confidence Software
Technologies (Peking University), Ministry of Education; School of Computer Science, Peking University,
Beijing, China
Although Large Language Models (LLMs) have demonstrated remarkable code-generation ability, they still
struggle with complex tasks. In real-world software development, humans usually tackle complex tasks through
collaborative teamwork, a strategy that significantly controls development complexity and enhances software
quality. Inspired by this, we present a self-collaboration framework for code generation employing LLMs,
exemplified by ChatGPT. Specifically, through role instructions, 1) multiple LLM agents act as distinct ‘experts’, each responsible for a specific subtask within a complex task; 2) the way the agents collaborate and interact is specified, so that different roles form a virtual team that facilitates each other’s work; ultimately, the virtual team addresses code generation tasks collaboratively without the need for human intervention. To effectively organize and
manage this virtual team, we incorporate software-development methodology into the framework. Thus,
we assemble an elementary team consisting of three LLM roles (i.e., analyst, coder, and tester) responsible
for software development’s analysis, coding, and testing stages. We conduct comprehensive experiments on
various code-generation benchmarks. Experimental results indicate that self-collaboration code generation achieves relative Pass@1 improvements of 29.9%–47.1% compared to the base LLM agent. Moreover, we showcase that self-collaboration could potentially enable LLMs to efficiently handle complex repository-level tasks that are not readily solved by a single LLM agent.
CCS Concepts: • Software and its engineering → Software creation and management; • Computing
methodologies → Artificial intelligence.
Additional Key Words and Phrases: Code Generation, Large Language Models, Multi-Agent Collaboration,
Software Development.
ACM Reference Format:
Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. 2023. Self-collaboration Code Generation via ChatGPT. ACM
Trans. Softw. Eng. Methodol. 1, 1 (May 2023), 38 pages. [Link]
1 INTRODUCTION
Code generation aims to generate code that satisfies human requirements expressed in the form of
some specification. Successful code generation can improve the efficiency and quality of software
development, even causing changes in social production modes. Therefore, code generation has
been a significant research hotspot in the fields of artificial intelligence, natural language process-
ing, and software engineering. Recently, code generation has made substantial advancements in
∗ Equal Contribution
† Corresponding author
Authors’ address: Yihong Dong, dongyh@[Link]; Xue Jiang, jiangxue@[Link]; Zhi Jin, zhijin@[Link];
Ge Li, lige@[Link], Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of
Education; School of Computer Science, Peking University, Beijing, China.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@[Link].
© 2023 Association for Computing Machinery.
1049-331X/2023/5-ART $15.00
[Link]
both academic and industrial domains [8, 16, 30, 49]. In particular, LLMs have achieved excellent
performance and demonstrate promising potential on code generation tasks [15, 18, 38, 64].
Nonetheless, generating correct code for complex requirements poses a substantial challenge,
even for experienced human programmers. Intuitively, humans, as social beings, tend to rely on
collaborative teamwork when encountering complex tasks. Teamwork, which tackles complex problems through division of labor, interaction, and collaboration, has been theorized to play an important role in dealing with complexity, as posited in both teamwork theory [5, 26] and software engineering practice [4, 13, 34]. The benefits of collaborative teamwork are manifold: 1) It breaks
down complex tasks into smaller subtasks, making the entire code generation process more effi-
cient and controllable. 2) It assists with error detection and quality control. Team members can
review and test the generated code, providing feedback and suggestions for improvement, thus
reducing potential errors and defects. 3) It ensures that the generated code is consistent with the
expected requirements. Team members can offer different viewpoints to solve problems and reduce
misunderstandings.
A straightforward way to implement collaborative teamwork entails training different models
to handle the corresponding subtasks, subsequently conducting joint training to foster mutual
understanding of behaviors to assemble them into a team [48]. However, this training approach
is costly, especially for LLMs. The scarcity of relevant training data further exacerbates the diffi-
culty of achieving collaborative code generation. Revolutionary advancements in artificial general
intelligence (AGI), especially LLMs represented by ChatGPT [39], provide a turning point. These
LLMs perform commendably across tasks in various stages of software development, laying the
groundwork for division of labor. Furthermore, LLMs use language as the foundation for input and
output and align with human needs through instructions or prompts, offering the potential for
inter-model interaction and collaboration.
Requirement: prime_fib returns n-th number that is a Fibonacci number and it's also prime.
User: I want you to act as a requirement analyst.
ChatGPT: "Alright, let's analyze the requirements for the prime_fib function that returns the n-th number that is both a Fibonacci number and a prime number. Here are some key points to consider: …"
User: I want you to act as a developer.
ChatGPT: "Sure! Here's an example implementation of a function called prime_fib that returns the n-th number that is both a Fibonacci number and prime: def prime_fib(n): …"
User: I want you to act as a tester.
ChatGPT: "Sure! I can help you test the prime_fib function. Could you please provide me with the code for the prime_fib function? Once I have the code, I can run some tests to verify its correctness."
Fig. 1. An example of role-playing. Through role-playing, the LLM transforms into an expert within a specific domain, delivering a professional-perspective response to the same requirement.
[Fig. 2 (left). The self-collaboration framework: role instructions initialize several LLM agents into distinct roles, and the initialized roles form a team that collaborates on the task (requirements, analysis, coding, testing, and finally codes that pass) by exchanging intermediate results through a shared blackboard. Fig. 2 (right) shows the instantiated team described in Section 3.]
To effectively organize and manage this virtual team, we incorporate software-development methodology (SDM) into the framework, which helps with controlling development complexity and improving software quality [1, 46]. Following SDM, we instantiate an elementary
team composed of three roles (i.e., analyst, coder, and tester) to achieve this goal. These roles
adhere to an SDM-defined workflow where the stages of analysis, coding, and testing are performed
sequentially, with each stage providing feedback to its predecessor. Specifically, the analyst breaks
down requirements and develops high-level plans for guiding the coder; the coder creates or
improves code based on the plans or the tester’s feedback; the tester compiles test reports based on
the coder’s outcome and documents any issues found during testing. We employ three ChatGPT1
agents to respectively play the three roles through role instructions, and then they collaborate to
address code generation tasks under the guidance of self-collaboration framework. The primary
contributions of our work can be summarized as follows:
(1) We propose a self-collaboration framework with role instruction, which allows LLM agents
to collaborate with each other to generate code for complex requirements.
(2) Following software-development methodology, we instantiate an elementary team, which
comprises three LLM roles (i.e., analyst, coder, and tester) responsible for their respective
stages in the software development process.
(3) Building on self-collaboration framework, the virtual team formed by ChatGPT (GPT-3.5)
can achieve significant improvements compared to the single LLM agent on multiple code-
generation benchmarks.
(4) In some real-world scenarios, self-collaboration code generation exhibits notable effectiveness
on more complex code generation tasks (such as repository-level code generation) that are
challenging for the single LLM agent.
2 SELF-COLLABORATION FRAMEWORK
Our self-collaboration framework consists of two parts: division of labor (DOL) and collaboration,
which is shown in Fig. 2 (left). Given a requirement 𝑥, we propose to perform self-collaboration
with LLMs to generate the output 𝑦. The task is defined as T : 𝑥 → 𝑦.
1 The ChatGPT referenced throughout our paper defaults to the GPT-3.5 version.
2.1 Division of Labor
It is widely acknowledged that LLMs are sensitive to context, as they are trained to predict subsequent tokens based on preceding ones. Consequently, it is prevalent to control LLM generation using instructions or prompts [11, 40, 41]. To achieve division of labor, we craft a specific type of instruction that assigns roles and responsibilities to LLMs, which we refer to as role instructions. Specifically, we ask an LLM to act as a particular role that has a strong correlation with its responsibilities. Furthermore, we need to convey the detailed tasks, i.e., responsibilities, this role should perform. In general, a clear and concise task description in the instruction makes the LLM's behavior more consistent with expectations. One case where it may not be necessary to outline a role's responsibilities is when the division of labor is common enough that matching roles can be found in reality.
Through role-playing, we can effectively situate the LLM within a specific domain, thereby eliciting its expertise within that domain. Our empirical evidence suggests that this role-playing approach yields superior results compared to directly engaging the LLM in the task without a pre-defined contextual setting. Thus, role-playing can be harnessed as an efficient tool to enhance the performance of the LLM in specialized tasks.
Note that the role instruction only needs to be provided once, at the initialization of each LLM agent, to guide its behavior throughout subsequent interactions, thereby enhancing the overall efficiency and clarity of collaboration.
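For concreteness, a role instruction can be thought of as a system-level message assembled from the role, its responsibilities, and optionally a team description and the user requirement. The following minimal Python sketch illustrates this composition; the helper name and message layout are our own illustration, not the paper's released implementation.

```python
# Minimal sketch of composing a role instruction. The helper name and
# message layout are illustrative assumptions, not the paper's exact code.

def make_role_instruction(role: str, responsibilities: str,
                          team: str = "", requirement: str = "") -> str:
    """Compose a role instruction from a role, its responsibilities, and an
    optional team description and user requirement."""
    parts = []
    if team:
        parts.append(team)
    parts.append(f"I want you to act as {role}.")
    parts.append(responsibilities)
    if requirement:
        parts.append(f"The user requirement is as follows:\n{requirement}")
    return "\n".join(parts)

# Provided once at agent initialization and kept for all subsequent turns.
analyst_instruction = make_role_instruction(
    role="a requirement analyst on our development team",
    responsibilities=("Decompose the requirement into several easily "
                      "solvable subtasks and develop a high-level plan "
                      "to guide the developer in writing the code."),
)
```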
2.2 Collaboration
After assigning roles to LLMs in the DOL part, the roles exchange their outputs with one another as the stages progress, refining the work and ensuring an accurate and thoughtful output 𝑦. In the collaboration part, we focus on facilitating effective interactions among distinct roles to ensure that they mutually enhance each other’s work.
The interaction among roles occurs in the form of natural language (NL), which is supported by the foundational aspects of the language model. In the role instructions, we specify which roles each role interacts with, what information it exchanges, and in what format, which allows the whole collaboration process to be well controlled. The collaboration part can be formalized as follows:
\[
\arg\max_{s_t} P\left(s_t \mid s_{\{<t\}}, R_{m(S_t)}, x\right), \tag{1}
\]
where $s_t$ is the output of stage $S_t$, $s_{\{<t\}}$ denotes the prerequisite-stage outputs of $S_t$,³ and $R_{m(S_t)}$ represents the role corresponding to $S_t$. We consider the computation of $P(s_t \mid s_{\{<t\}}, R_{m(S_t)}, x)$ as the collaboration, wherein role $R_{m(S_t)}$ collaborates with the roles of each preceding stage to generate $s_t$. Output $y$ is iteratively updated along with the progression of $S_t$:
\[
y_t = f(s_t, y_{<t}), \tag{2}
\]
² In our self-collaboration framework, each stage is managed by a specific type of role, but the number of roles of that type can vary. The order of the stages determines the sequence in which different types of roles are executed, but the roles within a single stage operate in parallel.
³ Note that our self-collaboration framework can be parallelized if the relationship between the stages $\{S_i\}_{i=1}^{l}$ is not a straightforward linear one. In other words, if one stage does not depend on the results of another stage, the two can be executed in parallel. A specific scenario is a software development project including front-end, back-end, and database development: the analysis stage only needs to define the interfaces in advance, and the corresponding coding stages can then be carried out in parallel.
Algorithm 1 The self-collaboration framework.
# Collaboration Part
 3: Initialize blackboard B and stage index 𝑡.
 4: repeat
 5:    Obtain 𝑠_{<𝑡} from B.
 6:    Sample 𝑠𝑡 via Eq. (1).
 7:    Add 𝑠𝑡 to B.
 8:    Compute 𝑦𝑡 via Eq. (2).
 9:    Update 𝑦 and 𝑡.
10: until the end condition is satisfied
11: return 𝑦
where 𝑓 is an update function. Once the end condition is satisfied⁴, the final output 𝑦 is derived. To
coordinate collaboration between different roles, we set up a shared blackboard [37], from which
each role exchanges the required information to accomplish their respective tasks 𝑠𝑡 via Eq. (1).
The pseudocode of our self-collaboration framework is outlined in Algorithm 1.
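Read as code, Algorithm 1 amounts to the loop below. This is a minimal sketch: `query`, `role_of`, `update`, and `end_condition` are assumed callables standing in for one LLM call per stage (Eq. (1)), the stage-to-role mapping $R_{m(S_t)}$, the update function $f$ of Eq. (2), and the termination test, respectively; none of them is a released API of the paper.

```python
# Minimal sketch of Algorithm 1. All four callables are assumptions that
# stand in for components described in the text, not a released API.

def self_collaboration(x, role_of, query, update, end_condition, max_stages=16):
    """x: requirement; role_of(t): role for stage S_t; query: one LLM call;
    update: Eq. (2); end_condition: termination test."""
    blackboard = []          # shared blackboard B of stage outputs
    y, t = None, 0           # current output y and stage index t
    while t < max_stages:
        s_prev = list(blackboard)                # obtain s_{<t} from B
        s_t = query(role_of(t), x, s_prev)       # sample s_t via Eq. (1)
        blackboard.append((role_of(t), s_t))     # add s_t to B
        y = update(s_t, y)                       # compute y_t via Eq. (2)
        t += 1                                   # update y and t
        if end_condition(s_t, y, t):             # e.g., tester reports no issues
            break
    return y
```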
3 INSTANCE
We introduce the classic waterfall model [43] from software-development methodology into self-
collaboration framework to make the teamwork for code generation more efficient. Specifically, we
design a simplified waterfall model consisting of three stages, i.e. analysis, coding, and testing, as an
instance for self-collaboration code generation. The workflow of this instance follows the waterfall
model flowing from one stage to the next, and if issues are found, it returns to the previous stage to
refine. Thus, we establish an elementary team, comprising an analyst, coder, and tester, responsible
for the analysis, coding, and testing stages, as illustrated in Fig. 2 (right). These three roles are
assigned the following tasks:
Analyst. The goal of the analyst is to reduce the difficulty of coding by abstracting and decom-
posing the task from a high level, rather than delving into the details of the implementation. Given
a requirement 𝑥, the analyst breaks 𝑥 down into several easily solvable subtasks to facilitate the
division of functional units and develops a high-level plan to guide the coder in writing the code.
Coder. As the central role of this team, the coder is responsible for writing the code, but its
work is carried out with the assistance and supervision of the analyst and tester. Thus, we assign
two responsibilities to the coder: 1) Write code that fulfills the specified requirements, adhering to
the plan provided by the analyst. 2) Repair or refine the code, taking into account the feedback in the test reports provided by the tester.
Tester. The tester is responsible for inspecting the code and generating a test report covering various aspects, such as functionality, readability, and maintainability, to help the coder improve the quality of its
code. Rather than directly introducing a compiler and test cases to execute the code, we use the
model to simulate the testing process and produce test reports, thereby avoiding external efforts.
4 The end condition is defined by prior knowledge, and an example can be found in the last paragraph of Section 3.
Coder (role description):
I want you to act as a developer on our development team. You will receive plans from a requirements analyst or test reports from a tester. Your job is split into two parts:
1. If you receive a plan from a requirements analyst, write code in Python that meets the requirements following the plan. Ensure that the code you write is efficient, readable, and follows best practices.
2. If you receive a test report from a tester, fix or improve the code based on the content of the report. Ensure that any changes made to the code do not introduce new bugs or negatively impact the performance of the code.
Remember, do not need to explain the code you wrote.
Fig. 3. An example of role instruction for the coder in the instance of the self-collaboration framework.
We customize role instructions for LLMs (exemplified by ChatGPT) to play the three roles. The
role instruction includes not only the role description (role and its responsibilities) but also the
team description and the user requirements, which will work together to initialize the ChatGPT
agent, thereby setting the behavior of ChatGPT. An example of role instruction for coder is shown
in Fig. 3. In addition, interactions occur between the roles responsible for two successive stages, and we limit the maximum number of interactions to 𝑛. We update the output 𝑦𝑡 only when the stage 𝑆𝑡 is coding, and this workflow terminates when 𝑛 is reached or the tester confirms that no issues persist with 𝑦𝑡. When 𝑦𝑡 is composed of multiple components, it is recommended to ask LLMs to generate outputs directly in JSON format, which can reduce omissions and enhance the quality of the generated outputs.
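Putting the instance together, the analyst–coder–tester workflow described above can be sketched as follows. The `chat(role_instruction, message)` helper is an assumption standing in for one message exchange with a ChatGPT agent initialized by the given role instruction, and the termination check on the tester's report is likewise a simplification of the behavior described in the text.

```python
# Sketch of the elementary analyst-coder-tester workflow. `chat` and the
# "code test passed" check are illustrative assumptions, not the paper's code.

def generate_with_team(requirement, chat, analyst_inst, coder_inst,
                       tester_inst, max_interactions=4):
    plan = chat(analyst_inst, requirement)                  # analysis stage
    code = chat(coder_inst, f"Plan:\n{plan}")               # coding stage
    for _ in range(max_interactions):
        report = chat(tester_inst, f"Code:\n{code}")        # testing stage
        if "code test passed" in report.lower():            # tester finds no issues
            break
        code = chat(coder_inst, f"Test report:\n{report}")  # repair/refine code
    return code   # the output y is updated only at the coding stage
```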
4 EVALUATION
We aim to answer the following research questions (RQs):
• RQ1: What is the performance of self-collaboration approach compared to the various
baselines on public code-generation benchmarks?
• RQ2: What is the effect of roles in self-collaboration? Specifically, it can be divided into three
questions: 1. What is the contribution of each role in the virtual team? 2. What is the effect
of other virtual teams? 3. What is the effect of role-playing?
• RQ3: What is the performance of self-collaboration based on different LLMs, especially the
most powerful LLM GPT-4?
• RQ4: What is the impact of interaction numbers for self-collaboration?
• RQ5: What are the results of more detailed analysis (specifically, error analysis and cost
analysis) for self-collaboration?
• CodeX (175B) + CodeT [7] is a previous state-of-the-art (SOTA) approach before GPT-4.
CodeT employs LLMs to automatically generate test cases for code samples. It executes code
samples with these test cases and conducts a dual execution agreement, taking into account
output consistency against test cases and output agreement among code samples.
• CodeLlama (34B) [45] is an open foundation model for code generation tasks, derived from continued training and fine-tuning of Llama 2 [55].
II. Generalist LLMs are LLMs trained on data widely collected from the Internet that show strong performance on a variety of tasks; they are used as the base models of our approach.
This kind of baseline includes:
• ChatGPT [39] is a sibling model to InstructGPT [42], which is trained to follow an instruction
in a prompt and provide a detailed response. We access ChatGPT through OpenAI’s API.
Since ChatGPT receives regular updates, we employ a fixed version ‘gpt-3.5-turbo-0301’ as
our base model, which will not receive updates, to minimize the risk of unexpected model
changes affecting the results.
• GPT-4 [40] is a large-scale, multimodal model which can accept image and text inputs and
produce text outputs. GPT-4 exhibits human-level performance on various benchmarks.
III. Prompting approaches include general prompting approaches as well as those specifically designed for code generation, representing a category of our related work. This kind of
baseline includes:
• Chain-of-thought (CoT) [57] generates a chain of thought for each question and then
generates the corresponding code. For CoT, we use the instruction “Let’s think step by step.”
[27] to implement it.
• Self-planning [23] and Self-debugging [9] teach LLMs to perform planning and debugging
with few-shot prompting. We use the prompts provided in their original papers to implement
them.
• Iter-improving is a baseline approach proposed in this paper, which allows the base model to continuously improve the generated code until it can no longer be modified. This approach is used to demonstrate that the effectiveness of our method is not solely due to repeated improvements of the LLM's output. We use the instruction ‘Please improve this code’ to implement iter-improving, and the maximum number of iterations is set to 10 (a sketch of this loop is shown after the list).
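The iter-improving loop referenced above can be sketched as follows, reusing the same assumed `chat(system, user)` helper as in the Section 3 sketch; the stopping test on unchanged output is an illustrative reading of "until it can no longer be modified."

```python
# Sketch of the iter-improving baseline: the base model repeatedly improves
# its own output. `chat` is the same assumed single-call helper as before.

def iter_improving(requirement, chat, max_iters=10):
    code = chat("You are a developer.", requirement)        # initial generation
    for _ in range(max_iters):
        improved = chat("You are a developer.",
                        f"Please improve this code\n{code}")
        if improved.strip() == code.strip():                # no further changes
            break
        code = improved
    return code
```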
4.1.3 Evaluation Metric. In this paper, we mainly focus on Pass@1: the probability that the model
solves the problem in one attempt, since in real-world scenarios we usually only consider one
generated code. We adopt the unbiased variant of Pass@1 [8] to measure the functional correctness
of top-1 generated codes by executing test cases, which can be formulated as:
\[
\text{Pass@1} = \mathop{\mathbb{E}}_{\text{Problems}}\left[\chi(\text{generated code})\right], \tag{3}
\]
where $\mathbb{E}_{\text{Problems}}$ denotes the expectation over all problems and $\chi$ is an indicator function that outputs 1 if the generated code passes all test cases of the corresponding problem, and 0 otherwise.
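For reference, the unbiased pass@k estimator of [8] reduces to the fraction of correct samples when k = 1; a minimal sketch of the computation:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from [8]: n samples per problem, c of which
    pass all test cases. For k = 1 this reduces to c / n."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def pass_at_1(results):
    """Average the per-problem estimates; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```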
4.1.4 Implementation Details. For self-collaboration and all prompting approaches, we employ
ChatGPT as the base model. We access ChatGPT using the ‘gpt-3.5-turbo’ API with fixed version
‘0301’, which will not receive updates. For self-collaboration, the maximum number of interactions
between roles is limited to 4. In all experiments, we set max tokens to 512 and temperature to 0 for
code generation. We only reference the results reported in their original papers for AlphaCode (1.1B) [30] and PaLM Coder (540B) [10], which are inaccessible to us. The results of all other baselines are
evaluated under the same settings as self-collaboration for fairness. All these results are comparable with the results reported in their original papers.
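Concretely, the decoding setup corresponds to a call of roughly the following shape under the legacy (pre-1.0) openai Python SDK; the surrounding variable names are placeholders, and the snippet is shown only to make the settings explicit, not as the paper's released code.

```python
import openai  # legacy (0.x) SDK; requires openai.api_key to be set

role_instruction = "I want you to act as a developer on our development team. ..."  # e.g., Fig. 3
message = "Plan:\n..."  # content handed over from the previous stage

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",   # fixed snapshot that does not receive updates
    messages=[
        {"role": "system", "content": role_instruction},
        {"role": "user", "content": message},
    ],
    max_tokens=512,
    temperature=0,                # deterministic decoding for code generation
)
reply = response["choices"][0]["message"]["content"]
```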
Table 1. Comparison of self-collaboration and baselines, where the green highlights indicate the improvements
in comparison to ChatGPT (GPT-3.5).
Table 2. The performance of self-collaboration code generation on APPS, where the green highlights indicate
the improvements in comparison to ChatGPT (GPT-3.5).
with ChatGPT except for HumanEval-ET. This is because ChatGPT tends to optimize for edge
test cases. While this helps in HumanEval-ET, it sometimes leads to unnecessary modifications of
correct edge cases. Such adjustments might be safer in some scenarios, like throwing exceptions,
but deviate from the expected output.
We evaluate self-collaboration on a more algorithmically intricate benchmark, namely, APPS.
This benchmark comprises three levels of difficulty: introductory, interview, and competition. In
this experiment, we select the first 1000 code generation tasks from the most challenging two
levels: interview and competition. Following the settings in CodeT [7], we use the NL description in APPS as a comment and then concatenate the signature ‘def solution(stdin: str) -> str:’ to form the input to the model.
Experimental results shown in Table 2 indicate that self-collaboration enhances the performance of
ChatGPT substantially and exceeds the performance of the previous SOTA approach CodeX (175B)
+ CodeT [7].
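The APPS input construction described above amounts to commenting out the problem description and appending the fixed signature; a minimal sketch follows (the exact formatting string is our assumption):

```python
def build_apps_prompt(nl_description: str) -> str:
    """Wrap the APPS problem description as a comment block and append the
    fixed solution signature used in the CodeT-style setting."""
    commented = "\n".join(f"# {line}" for line in nl_description.splitlines())
    return commented + "\ndef solution(stdin: str) -> str:\n"
```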
To evaluate self-collaboration on a more challenging and realistic benchmark, we conduct
experiments on CoderEval [61] based on ChatGPT (GPT-3.5), following the settings in its paper. As
shown in Table 3, self-collaboration code generation substantially outperforms the single ChatGPT,
achieving relative improvements of 47.1% on Pass@1.
Table 4. Effectiveness of ChatGPT roles in self-collaboration code generation, where the green highlights
indicate the improvements in comparison to Coder.
team. The coder-analyst-tester team achieved the best results on HumanEval and HumanEval-ET
benchmarks, with relative improvements of 40.8% and 47.1%, respectively.
Second, we find that the performance of ‘Analyst + Coder + Tester + Compiler’ is comparable to that of ‘Analyst + Coder + Tester’. The reason may be that some mistakes can already be resolved by the tester without relying on the compiler’s results. Moreover, the pair-programming team is superior to the single LLM agent, but still performs slightly worse than the ‘Analyst + Coder + Tester’ team.
Third, the results show that the role-playing approach substantially outperforms the baselines without role-playing. We suppose that the reason for the better performance of role-playing is that it provides a specific context that constrains the generation space of LLMs, making them reason within the constraints of the scenario and generate responses that align with the perspectives an LLM in that role might have. Therefore, role-playing serves to evoke the latent abilities of LLMs rather than directly improving them. Moreover, in the two scenarios without role-playing, we observe that instruction (zero-shot) prompting is slightly better than few-shot prompting. We identified two potential factors that could lead to this observation: 1) Few-shot prompting may bias the LLMs' understanding of human intent, because the limited selection of examples might not fully reflect that intent. 2) The long prompt (about 14 times the instruction length in our experiment) used in few-shot prompting could hinder the LLMs' effective extraction of relevant information.
[Fig. 4. Pass@1 of self-collaboration code generation on different base LLMs: Fastchat (3B), ChatGLM (6B), Vicuna (7B), MPT (7B), Vicuna (13B), HuggingChat (30B), Dromedary (65B), ChatGPT (GPT-3.5), and GPT-4.]
The benefit brought by self-collaboration emerges at a model scale of around 7B parameters and subsequently continues to grow, serving to evoke latent intelligence within LLMs.
For the most powerful LLM GPT-4, we conduct additional experiments to evaluate self-collaboration
code generation based on GPT-4, following the settings in the GPT-4 technical report [40]. The
experimental results are shown in Table 5, and we can find that the enhancement effect of self-
collaboration on GPT-4 is significant.
Table 5. The performance of self-collaboration code generation with GPT-4. The result in brackets is reported in the GPT-4 technical report [40].
To figure out the abilities required for self-collaboration, we conduct experiments on a series
of models including CodeLlama 7B and 34B [45], Llama2 [55] (the base model of CodeLlama) and
their Instruct version (model with instruction tuning). As illustrated in Figure 5, the enhancement
observed in Llama2-7B using self-collaboration falls short when compared to CodeLlama-7B. This
discrepancy emphasizes the critical role of domain-specific expertise. The performance improvement
of the Instruct version using self-collaboration generally exceeds that of the original version,
highlighting the significance of in-context learning capabilities. Furthermore, the improvement of CodeLlama-34B using self-collaboration eclipses that of CodeLlama-7B for both versions, underscoring the importance of reasoning abilities. Therefore, self-collaboration may require the following abilities from LLMs: strong domain-specific expertise for role-playing, strong in-context learning ability to follow instructions, and strong reasoning ability to solve problems effectively.
[Fig. 5. Pass@1 of self-collaboration code generation on Llama2 (7B), Llama2-Instruct (7B), CodeLlama (7B), CodeLlama-Instruct (7B), CodeLlama (34B), and CodeLlama-Instruct (34B).]
Table 6. The effect of maximum interaction (MI) for self-collaboration code generation.
[Fig. 6. Error analysis: the proportion of tasks passed and unpassed by the coder alone, and the breakdown of the remaining error types (function not generated, runtime error, parameter error (type or number), and return value error (type or number)), attributed to the responsible roles.]
We divide the tasks into three categories: the first category includes tasks that can be correctly generated by the coder itself; the second category includes tasks that cannot be correctly generated by the coder itself but can be generated correctly through introducing self-collaboration; and the third category includes tasks that cannot be correctly generated even with self-collaboration. By manually checking the tasks in the third category, we attribute the responsibility for the generation errors to different roles and further subdivide the types of errors in Figure 6. From the findings, it is evident that the predominant errors originate from the coder. These errors are inherently tied to the performance of the base model. However, the introduction of self-collaboration markedly improves the quality of code generation, where 18.4% of tasks unpassed by the coder are corrected through self-collaboration. Moreover, the errors caused by the analyst mainly stem from ambiguous or incomplete requirement descriptions, which cannot be solely attributed to the analyst. This issue can be alleviated by providing more precise and complete requirement descriptions or by incorporating a small amount of human guidance. On
the other hand, errors associated with testers predominantly result from exception handling. These
adjustments usually do not introduce new errors and tend to enhance the safety of the program.
However, they may cause some instances to throw an error instead of returning a value, thus failing
the test case.
Beyond the preceding problems, the probability that the additional introduction of the analyst and tester causes a deviation from the requirements is less than 1%, so the whole system is relatively reliable.
4.6.2 Cost Analysis. As illustrated in Figure 7, we measure the cost (prompt tokens + generated
tokens) and performance of self-collaboration, coder, and other prompting approaches, where we
normalize the cost of coder as 1. The experimental results indicate that the improvement of the
self-collaboration approach is significant, and its token usage is moderate among all prompting
approaches. However, considering the high labor expenses of software development teams, the
cost-effectiveness of self-collaboration is obvious.
[Fig. 7. Cost versus performance of prompting approaches, where cost (prompt tokens + generated tokens) is normalized so that the coder's cost is 1. Self-collaboration attains the highest Pass@1 among CoT, Iter-improving, Coder, Self-planning, and Self-debugging, at a moderate cost.]
Requirement:
As an adventurous player, I wanted to play a breakout game where I can control a character that starts from the starting point and goes through various obstacles to reach the ending point so that I can experience an exciting and challenging journey.
I want to be able to control the character by moving the mouse in any direction in order to avoid obstacles that block his progress, such as rocks and bushes, as well as bombs that fall from top. If the character collides with an obstacle or is hit by a bomb, the game is lost. If the character reaches the end, the game is won.
Attention:
— The starting point is marked as a gray circle in the bottom left corner and the ending point is marked as a green circle in the top right corner.
— Load the images of the characters, bombs and obstacles from the specified asset path (assets/[Link], assets/[Link], assets/[Link], assets/[Link]) and scale them to the right size.
— The initial position of the character is at the starting point.
— The game starts when I click the mouse.
— The bombs need to keep falling from the top.
— Use pygame to create the game in python.
[The accompanying screenshot annotates the starting point, the ending point, the obstacles, the character (controlled by the mouse), and the bombs that keep falling from the top.]
Fig. 9. Case study on complex tasks in real-world scenarios. Red markers are added to denote specific objects.
We also conduct a case study related to website development. The requirement of this case
requires the model to develop a weather forecast website, involving the generation of multiple
types of files (HTML, CSS, Javascript), which can be considered as a micro development project. The
case study of the website development is shown in Fig. 10. Our self-collaboration code generation
approach produces a website that is superior to ChatGPT direct generation approach in terms of
Requirement:
The requirement from users is as following:
Create a weather forecast application where users can enter a city name and get real-time weather information for that city. Build the user interface using HTML, CSS and JavaScript and get the real-time data by calling the weather API.
Notes:
— Uses weather data provided by OpenWeatherMap, a third-party weather API. The API key that will be used is ×××××××.
— Displays today's weather information in Beijing, London, New York and Tokyo on the page at all times.
— Displays the weather information for the searched city, or "No city searched" if no city is searched.
— The weather information includes temperature, weather descriptions and icons, wind speed and direction, humidity, and barometric pressure.
— Show world wide temperature weather map on the page all the time through [Link] {x}/{y}.png?appid={API key}.
— Use CSS styles to beautify the application. Elements in the page should be centered if possible.
functionality and visual appeal. The analyst makes the generated code comprehensive by considering every requirement, including search functionality, weather information display, the temperature weather map, etc. The tester ensures that each requirement has not been misunderstood and is truly in line with the user’s intent. In contrast, the direct generation approach occasionally falls short, either by missing certain requirements or by misunderstanding them. For instance, it often neglects to include some weather information and fails to display the worldwide temperature weather map.
5 RELATED WORK
In this section, we outline the most relevant directions and associated papers of this work to
highlight our research’s innovative and advanced nature.
The work [25] explored using LLMs to reproduce a given bug report and found that this approach holds great potential in enhancing
developer efficiency. The work [47] also demonstrates the effectiveness of applying LLMs to test
generation. In a comprehensive study conducted by the work [59], the direct application of LLMs
for APR was explored and it was shown that LLMs outperform all existing APR techniques by
a substantial margin. Additionally, the work [60] successfully implemented conversational APR
using ChatGPT.
The applications of LLMs in software development, as highlighted above, have shown numerous successful outcomes in different stages. However, these successes are limited to individual tasks (stages) of software development. These tasks could instead be performed synergistically through LLMs to maximize their overall impact and thus achieve a higher level of automation in software development.
7 CONCLUSION
In this paper, we have proposed a self-collaboration framework designed to enhance the problem-
solving capability of LLMs in a collaborative and interactive way. We investigate the potential of
LLMs in facilitating collaborative code generation within software development processes. Specif-
ically, based on the proposed framework, we assemble an elementary team consisting of three
distinct LLM agents, designed to address code generation tasks collaboratively. Extensive experi-
mental results demonstrate the effectiveness and generalizability of the self-collaboration framework. In conclusion, the self-collaboration framework provides an effective approach to automatic code generation. This innovative approach has the potential to substantially improve the quality of generated code, reduce human intervention, and accelerate the development of complex software systems.
Moreover, our work can serve as a foundation for future research on multi-agent collaboration
approaches in various domains and the development of more advanced and specialized virtual
teams to tackle more complex tasks.
REFERENCES
[1] Pekka Abrahamsson, Outi Salo, Jussi Ronkainen, and Juhani Warsta. 2002. Agile software development methods:
Review and analysis. (2002).
[2] Chetan Arora, John Grundy, and Mohamed Abdelrazek. 2023. Advancing Requirements Engineering through Generative
AI: Assessing the Role of LLMs. CoRR abs/2310.13976 (2023).
[3] Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang,
Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. 2021. Program Synthesis with Large Language Models.
CoRR abs/2108.07732 (2021).
[4] Kent Beck, Mike Beedle, Arie Van Bennekum, Alistair Cockburn, Ward Cunningham, Martin Fowler, James Grenning,
Jim Highsmith, Andrew Hunt, Ron Jeffries, et al. 2001. Manifesto for agile software development. (2001).
[5] R Meredith Belbin. 2012. Team roles at work. Routledge.
[6] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan,
Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford,
Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In NeurIPS 2020.
[7] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2022. CodeT: Code
Generation with Generated Tests. CoRR abs/2207.10397 (2022).
[8] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison
Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy
Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz
Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert,
Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak,
Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan
Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati,
Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba.
2021. Evaluating Large Language Models Trained on Code. CoRR (2021). [Link]
[9] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching Large Language Models to Self-Debug.
CoRR abs/2304.05128 (2023).
[10] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham,
Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua
Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben
Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke,
Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson,
Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan
Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankaranarayana Pillai,
Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi
Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff
Dean, Slav Petrov, and Noah Fiedel. 2022. PaLM: Scaling Language Modeling with Pathways. CoRR abs/2204.02311
(2022).
[11] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa De-
hghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha
Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Y. Zhao, Yanping Huang, Andrew M. Dai, Hongkun
Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. 2022.
Scaling Instruction-Finetuned Language Models. CoRR abs/2210.11416 (2022).
[12] Caroline Claus and Craig Boutilier. 1998. The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems.
In AAAI/IAAI. AAAI Press / The MIT Press, 746–752.
[13] Tom DeMarco and Tim Lister. 2013. Peopleware: productive projects and teams. Addison-Wesley.
[14] Yihong Dong, Jiazheng Ding, Xue Jiang, Zhuo Li, Ge Li, and Zhi Jin. 2023. CodeScore: Evaluating Code Generation by
Learning Code Execution. CoRR abs/2301.09043 (2023).
[15] Yihong Dong, Xue Jiang, Huanyu Liu, Zhi Jin, and Ge Li. 2024. Generalization or Memorization: Data Contamination
and Trustworthy Evaluation for Large Language Models. CoRR abs/2402.15938 (2024).
[16] Yihong Dong, Ge Li, and Zhi Jin. 2023. CODEP: Grammatical Seq2Seq Model for General-Purpose Code Generation. In
ISSTA. ACM, 188–198.
[17] Yihong Dong, Kangcheng Luo, Xue Jiang, Zhi Jin, and Ge Li. 2023. PACE: Improving Prompt with Actor-Critic Editing
for Large Language Model. CoRR abs/2308.10088 (2023).
[18] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke
Zettlemoyer, and Mike Lewis. 2022. InCoder: A Generative Model for Code Infilling and Synthesis. CoRR abs/2204.05999
(2022).
[19] GitHub. 2022. Copilot. [Link]
[20] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik,
Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring Coding Challenge Competence With APPS. In NeurIPS
Datasets and Benchmarks.
[21] Huggingface. 2023. HuggingChat. [Link]
[22] IBM. 2023. Dromedary. [Link]
[23] Xue Jiang, Yihong Dong, Lecheng Wang, Qiwei Shang, and Ge Li. 2023. Self-planning Code Generation with Large
Language Model. CoRR abs/2303.06689 (2023).
[24] Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How Can We Know What Language Models Know.
Trans. Assoc. Comput. Linguistics 8 (2020), 423–438.
[25] Sungmin Kang, Juyeon Yoon, and Shin Yoo. 2022. Large Language Models are Few-shot Testers: Exploring LLM-based
General Bug Reproduction. CoRR abs/2209.11515 (2022).
[26] Jon R Katzenbach and Douglas K Smith. 2015. The wisdom of teams: Creating the high-performance organization.
Harvard Business Review Press.
[27] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large Language Models
are Zero-Shot Reasoners. In NeurIPS.
[28] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023. CAMEL:
Communicative Agents for "Mind" Exploration of Large Scale Language Model Society. CoRR abs/2303.17760 (2023).
[29] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone,
Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier
Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade,
Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo
Wang, Rudra Murthy V, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan
Zhang, Nour Moustafa-Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas,
Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey
Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-
Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean
Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. 2023. StarCoder: may the source be with
you! CoRR abs/2305.06161 (2023).
[30] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling,
Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with alphacode. Science 378, 6624
(2022), 1092–1097.
[31] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, Prompt,
and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Comput. Surv. 55, 9
(2023), 195:1–195:35.
[32] LMSYS. 2023. Fastchat. [Link]
[33] LMSYS. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. [Link]2023-03-30-vicuna/
[34] Ian R McChesney and Seamus Gallagher. 2004. Communication and co-ordination practices in software engineering
projects. Information and Software Technology 46, 7 (2004), 473–489.
[35] Marvin Minsky. 2007. The emotion machine: Commonsense thinking, artificial intelligence, and the future of the human
mind. Simon and Schuster.
[36] MosaicML. 2023. Introducing MPT-7B: A New Standard for Open-Source, Commercially Usable LLMs. [Link]/blog/mpt-7b
[37] H Penny Nii. 1986. Blackboard Systems. (1986).
[38] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022.
Codegen: An open large language model for code with multi-turn program synthesis. CoRR abs/2203.13474 (2022).
[39] OpenAI. 2022. ChatGPT. [Link]
[40] OpenAI. 2023. GPT-4 Technical Report. CoRR abs/2303.08774 (2023).
[41] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini
Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda
Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow
instructions with human feedback. CoRR abs/2203.02155 (2022).
[42] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini
Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda
Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow
instructions with human feedback. CoRR abs/2203.02155 (2022).
[43] Kai Petersen, Claes Wohlin, and Dejan Baca. 2009. The Waterfall Model in Large-Scale Development. In PROFES
(Lecture Notes in Business Information Processing, Vol. 32). Springer, 386–400.
[44] Laria Reynolds and Kyle McDonell. 2021. Prompt Programming for Large Language Models: Beyond the Few-Shot
Paradigm. In CHI Extended Abstracts. ACM, 314:1–314:7.
[45] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal
Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton-Ferrer, Aaron
Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier,
Thomas Scialom, and Gabriel Synnaeve. 2023. Code Llama: Open Foundation Models for Code. CoRR abs/2308.12950
(2023).
[46] Nayan B. Ruparelia. 2010. Software development lifecycle models. ACM SIGSOFT Softw. Eng. Notes 35, 3 (2010), 8–13.
[47] Max Schäfer, Sarah Nadi, Aryaz Eghbali, and Frank Tip. 2024. An Empirical Evaluation of Using Large Language
Models for Automated Unit Test Generation. IEEE Trans. Software Eng. 50, 1 (2024), 85–105.
[48] Timo Schick, Jane Dwivedi-Yu, Zhengbao Jiang, Fabio Petroni, Patrick S. H. Lewis, Gautier Izacard, Qingfei You,
Christoforos Nalmpantis, Edouard Grave, and Sebastian Riedel. 2022. PEER: A Collaborative Language Model. CoRR
abs/2208.11663 (2022).
[49] Sijie Shen, Xiang Zhu, Yihong Dong, Qizhi Guo, Yankun Zhen, and Ge Li. 2022. Incorporating domain knowledge
through task augmentation for front-end JavaScript code generation. In ESEC/SIGSOFT FSE. ACM, 1533–1543.
[50] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. Hugginggpt: Solving ai
tasks with chatgpt and its friends in huggingface. CoRR abs/2303.17580 (2023).
[51] Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowl-
edge from Language Models with Automatically Generated Prompts. In EMNLP (1). Association for Computational
Linguistics, 4222–4235.
[52] Stephen W. Smoliar. 1991. Marvin Minsky, The Society of Mind. Artif. Intell. 48, 3 (1991), 349–370.
[53] Chang-You Tai, Ziru Chen, Tianshu Zhang, Xiang Deng, and Huan Sun. 2023. Exploring Chain-of-Thought Style
Prompting for Text-to-SQL. CoRR abs/2305.14215 (2023).
[54] THUDM. 2023. ChatGLM. [Link]
[55] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya
Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem
Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman
Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa,
Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich,
Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton,
Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian,
Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov,
Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov,
and Thomas Scialom. 2023. Llama 2: Open Foundation and Fine-Tuned Chat Models. CoRR abs/2307.09288 (2023).
[56] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain of
thought prompting elicits reasoning in large language models. CoRR abs/2201.11903 (2022).
[57] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny
Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In NeurIPS.
[58] Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023. Visual chatgpt:
Talking, drawing and editing with visual foundation models. CoRR abs/2303.04671 (2023).
[59] Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2022. Practical Program Repair in the Era of Large Pre-trained
Language Models. CoRR abs/2210.14179 (2022).
[60] Chunqiu Steven Xia and Lingming Zhang. 2023. Conversational Automated Program Repair. CoRR abs/2301.13246
(2023).
[61] Hao Yu, Bo Shen, Dezhi Ran, Jiaxin Zhang, Qi Zhang, Yuchi Ma, Guangtai Liang, Ying Li, Qianxiang Wang, and Tao
Xie. 2024. CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models. In ICSE. ACM,
37:1–37:12.
[62] Eric Zelikman, Qian Huang, Gabriel Poesia, Noah D. Goodman, and Nick Haber. 2022. Parsel: A Unified Natural
Language Framework for Algorithmic Reasoning. CoRR abs/2212.10561 (2022).
[63] Tianyi Zhang, Tao Yu, Tatsunori B. Hashimoto, Mike Lewis, Wen-tau Yih, Daniel Fried, and Sida I. Wang. 2022. Coder
Reviewer Reranking for Code Generation. CoRR abs/2211.16490 (2022).
[64] Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zi-Yuan Wang, Lei Shen, Andi Wang, Yang Li,
Teng Su, Zhilin Yang, and Jie Tang. 2023. CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual
Evaluations on HumanEval-X. CoRR abs/2303.17568 (2023).
[65] Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022.
Large Language Models Are Human-Level Prompt Engineers. CoRR abs/2211.01910 (2022).
A PRELIMINARY KNOWLEDGE
A.1 Code Generation
Code generation is a technology that automatically generates source code in accordance with user requirements, thereby facilitating automatic machine programming. It is regarded as a significant approach
to enhancing the automation and overall quality of software development. Existing code generation
approaches demonstrate relative proficiency in addressing "minor requirements" scenarios, such
as function completion and line-level code generation. However, when confronted with complex
requirements and software system design, they fall short of offering a comprehensive solution.
To this end, we also explore the second setting, namely NL-only, which is more consistent with
real-world development scenarios.
Table 7. Effectiveness of ChatGPT roles in self-collaboration code generation with ‘NL + signature + use
cases’ setting.
We also performed ablation studies with roles on the MBPP benchmark in the ‘NL-only’ setting. The
results are shown in Table 8. From the results, it is clear that different roles perform differently on
MBPP than on HumanEval. The coder-tester team attained the highest performance on MBPP and
MBPP-ET benchmarks, with relative improvements of 36.7% and 39.4%, respectively.
Table 8. Effectiveness of ChatGPT roles in self-collaboration code generation on MBPP with ‘NL-only’ Setting.
The suboptimal performance of the analyst on the MBPP and MBPP-ET benchmarks may be attributed to the fact that some tasks in MBPP feature inputs and outputs that diverge from conventional human coding practices and cognitive processes, and the requirements lack any hints about this. Analysis of such unconventional data usually yields results that align with human habits, whereas coding directly from the requirements can handle this unconventional data. Based on our experimental results and the phenomenon we observed, we suspect there might be some degree of data contamination in MBPP. Nevertheless, considering this fact, we still choose to publish these results. Under the ‘NL + signature + use cases’ setting, the LLM has access to more comprehensive input-output information, and we find that the role of the analyst is significant.
Requirement:
Write a python function to find the sum of even numbers at even positions.

Analyst:
1. Decomposition:
- Identify even positions in the list
- Check if the number at the identified position is even
- If the number is even, add it to the sum
- Return the sum
2. High-level plan:
- Define a function that takes a list as input
- Initialize a variable to store the sum of even numbers at even positions
- Loop through the list and identify even positions
- Check if the number at the identified position is even
- If the number is even, add it to the sum
- Return the sum
Note: The high-level plan can be further refined and optimized during the development process.

Tester:
After testing the code, I have found the following issues:
1. The function only considers even positions starting from index 1, which means it will miss the even numbers at the first position (index 0).
2. The function does not handle the case where the input list is empty. It will raise an IndexError exception.
3. The function does not handle the case where there are no even numbers at even positions. It will return 0, which may not be a reasonable return value in some cases.
To address these issues, I suggest the following changes to the code:
```
def sum_even_at_even_positions(lst):
    even_pos_sum = 0
    for i in range(0, len(lst), 2):
        if i < len(lst) and lst[i] % 2 == 0:
            even_pos_sum += lst[i]
    return even_pos_sum if even_pos_sum != 0 else None
```
With these changes, the function will consider even positions starting from index 0, handle the case where the input list is empty, and return None if there are no even numbers at even positions.

Tester:
After testing the updated code with various inputs, I can confirm that it satisfies the requirements and passes all edge cases. Therefore, my conclusion is: Code Test Passed.

Tester:
After testing the code, I can confirm that it satisfies the requirements and passes all edge cases. The function considers even positions starting from index 0, handles the case where the input list is empty, and returns None if there are no even numbers at even positions. Therefore, my conclusion is: Code Test Passed.
Requirement:
We have an array 'arr' of N integers arr[1], arr[2], ..., arr[N].The numbers in the
array will be randomly ordered. Your task is to determine if it is possible to get an
array sorted in non-decreasing order by performing the following operation on the
given array: You are allowed to perform right shift operation any number of times.
After testing the code, I have found the following issues:
One right shift operation means shifting all elements of the array by one position in
the right direction. The last element of the array will be moved to the starting
1. The code only checks if it is possible to sort the array by performing right shift
position in the array i.e. 0th index. If it is possible to obtain the sorted array by
operations. It does not actually perform the operations to sort the array.
performing the above operation then return True else return False. If the given array
is empty then return True. Note: The given list is guaranteed to have unique
2. The code assumes that the minimum element in the array needs to be moved to
elements.
the first position to sort the array. This is not always true. For example, if the array is
[3, 2, 1], the minimum element is 1, but it needs to be moved to the last position to
sort the array.
Tester
The user requirement is to determine if it is possible to sort an array in non- Based on these issues, I cannot conclude that the code has passed the tests. The
decreasing order by performing a right shift operation any number of times. The code needs to be modified to handle these edge cases and to actually perform the
requirement can be decomposed into the following subproblems: right shift operations to sort the array.
DRIVER = “I want you to act as the driver in this team. Your job is as follows:
1. You are responsible for writing code, i.e. translating your understanding of the requirement
into code.
2. You need to explain the code to help the observer understand what you have written.
3. If you receive suggestions from the observer, you need to fix or improve your code based
on his suggestions. Ensure that any changes made to the code do not introduce new bugs
or negatively impact the performance of the code.”
OBSERVER = “I want you to act as the observer in this team. You will receive the code
written by the driver, and your job is as follows:
1. You are primarily responsible for reviewing code written by drivers to ensure its quality
and accuracy. You need to provide suggestions on the code written by the drivers.
2. You also need to think about the needs that the code meets.
3. You also need to predict possible problems and errors and instruct drivers to correct
them.”
TEAM = “There is a pair programming team that includes a driver, and an observer. The
team needs to develop a program that meets a requirement. The different roles have different
divisions of labor and need to cooperate with each others.”
ANALYSIS = “1. Decompose the requirement into several easy-to-solve subproblems that
can be more easily implemented by the developer.
2. Develop a high-level plan that outlines the major steps of the program.
Remember, your plan should be high-level and focused on guiding the developer in writing
code, rather than providing implementation details.”
CODING = “Write code in Python that meets the requirements following the plan. Ensure
that the code you write is efficient, readable, and follows best practices.
Remember, do not need to explain the code you wrote.”
REPAIRING= “Fix or improve the code based on the content of the report. Ensure that any
changes made to the code do not introduce new bugs or negatively impact the performance
of the code. Remember, do not need to explain the code you wrote.”
TESTING = “1. Test the functionality of the code to ensure it satisfies the requirements.
2. Write reports on any issues or bugs you encounter.
3. If the code or the revised code has passed your tests, write a conclusion "Code Test
Passed".
Remember, the report should be as concise as possible, without sacrificing clarity and
completeness of information. Do not include any error handling or exception handling
suggestions in your report.”
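To illustrate how these stage instructions could drive the virtual team, the following is a minimal sketch, not the paper's released implementation, of an analyst-coder-tester loop. It assumes the instruction strings above are bound to the Python variables ANALYSIS, CODING, REPAIRING, and TESTING, that a call_llm(messages) helper returning the model's reply is supplied by the caller, and that the cap of four repair rounds is an arbitrary illustrative choice.

```python
MAX_ROUNDS = 4  # assumed cap on coder-tester repair rounds (illustrative)

def self_collaborate(requirement: str, call_llm) -> str:
    """Sketch of one analyst -> coder -> tester pass with repair rounds."""
    # Analysis stage: decompose the requirement and draft a high-level plan.
    plan = call_llm([{"role": "system", "content": ANALYSIS},
                     {"role": "user", "content": requirement}])
    # Coding stage: write code that follows the plan.
    code = call_llm([{"role": "system", "content": CODING},
                     {"role": "user", "content": f"Requirement:\n{requirement}\nPlan:\n{plan}"}])
    for _ in range(MAX_ROUNDS):
        # Testing stage: test the code and write a report.
        report = call_llm([{"role": "system", "content": TESTING},
                           {"role": "user", "content": f"Requirement:\n{requirement}\nCode to test:\n{code}"}])
        if "Code Test Passed" in report:  # conclusion required by the TESTING instruction
            break
        # Repairing stage: fix or improve the code based on the report.
        code = call_llm([{"role": "system", "content": REPAIRING},
                         {"role": "user", "content": f"Requirement:\n{requirement}\nCode:\n{code}\nTest Report:\n{report}"}])
    return code
```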
Few-shot prompting conveys the task of each stage to the model by example. We sample four examples from the dataset for prompting. For fairness, we exclude these four examples from the evaluation and keep all approaches consistent in the experiments. The prompt for each stage is as follows:
ANALYSIS = “
Requirement:
prime_fib returns n-th number that is a Fibonacci number and it’s also prime.
Plan:
1. Create a function to check if a number is prime.
2. Generate a Fibonacci sequence.
3. Check if each number in the Fibonacci sequence is prime, decrement the counter.
4. If the counter is 0, return the Fibonacci number.
end
Requirement:
Create a function that takes integers, floats, or strings representing real numbers, and
returns the larger variable in its given variable type. Return None if the values are equal.
Note: If a real number is represented as a string, the floating point might be . or ,
Plan:
1. Store the original inputs.
2. Check if inputs are strings and convert to floats.
3. Compare the two inputs and return the larger one in its original data type.
end
Requirement:
Create a function which takes a string representing a file’s name, and returns ’Yes’ if the
file’s name is valid, and returns ’No’ otherwise. A file’s name is considered to be valid if
and only if all the following conditions are met: - There should not be more than three
digits (’0’-’9’) in the file’s name. - The file’s name contains exactly one dot ’.’ - The substring
before the dot should not be empty, and it starts with a letter from the latin alphapet (’a’-’z’
and ’A’-’Z’). - The substring after the dot should be one of these: [’txt’, ’exe’, ’dll’]
Plan:
1. Check if the file name is valid according to the conditions.
2. Return "Yes" if valid, otherwise return "NO".
end
Requirement:
This function takes a list l and returns a list l’ such that l’ is identical to l in the odd indicies,
while its values at the even indicies are equal to the values of the even indicies of l, but
sorted.
Plan:
1. Create a list of all the even indices of the given list.
2. Sort the list of even indices.
3. Create a new list that is identical to the original list in the odd indicies, and equal to the
sorted even indices in the even indicies.
end
Requirement:
{x_test}
Plan:
”
CODING = “
Requirement: prime_fib returns n-th number that is a Fibonacci number and it’s also prime.
Plan: 1. Create a function to check if a number is prime. 2. Generate a Fibonacci sequence. 3.
Check if each number in the Fibonacci sequence is prime, decrement the counter. 4. If the
counter is 0, return the Fibonacci number.
Code:
def is_prime(n: int):
    if n < 2:
        return False
    for i in range(2, n):
        if n % i == 0:
            return False
    return True

def prime_fib(n: int):
    fib_seq = [1, 1]
    counter = n
    while counter > 0:
        fib_seq.append(fib_seq[-1] + fib_seq[-2])
        if is_prime(fib_seq[-1]):
            counter -= 1
    return fib_seq[-1]
end
Requirement: Create a function that takes integers, floats, or strings representing real
numbers, and returns the larger variable in its given variable type. Return None if the values
are equal. Note: If a real number is represented as a string, the floating point might be . or ,
Plan: 1. Store the original inputs. 2. Check if inputs are strings and convert to floats. 3.
Compare the two inputs and return the larger one in its original data type.
Code:
def compare_one(a, b):
    a_original = a
    b_original = b
    if isinstance(a, str):
        a = float(a.replace(',', '.'))
    if isinstance(b, str):
        b = float(b.replace(',', '.'))
    if a > b:
        return a_original
    elif b > a:
        return b_original
    else:
        return None
end
Requirement: Create a function which takes a string representing a file’s name, and returns
’Yes’ if the file’s name is valid, and returns ’No’ otherwise. A file’s name is considered to
be valid if and only if all the following conditions are met: - There should not be more
than three digits (’0’-’9’) in the file’s name. - The file’s name contains exactly one dot ’.’ -
The substring before the dot should not be empty, and it starts with a letter from the latin
alphapet (’a’-’z’ and ’A’-’Z’). - The substring after the dot should be one of these: [’txt’, ’exe’,
’dll’]
Plan: 1. Check if the file name is valid according to the conditions. 2. Return "Yes" if valid,
otherwise return "NO".
Code:
def file_name_check(file_name):
    if len(re.findall(r'\d', file_name)) > 3:
        return 'No'
    if len(re.findall(r'\.', file_name)) != 1:
        return 'No'
    if not re.match(r'[a-zA-Z]', file_name.split('.')[0]):
        return 'No'
    if file_name.split('.')[1] not in ['txt', 'exe', 'dll']:
        return 'No'
    return 'Yes'
end
Requirement: This function takes a list l and returns a list l’ such that l’ is identical to l in
the odd indicies, while its values at the even indicies are equal to the values of the even
indicies of l, but sorted.
Plan: 1. Create a list of all the even indices of the given list. 2. Sort the list of even indices. 3.
Create a new list that is identical to the original list in the odd indicies, and equal to the
sorted even indices in the even indicies.
Code:
def sort_even(l: list):
    even_indices = [x for x in range(len(l)) if x % 2 == 0]
    sorted_even_indices = sorted(l[i] for i in even_indices)
    result = []
    for i in range(len(l)):
        if i % 2 == 0:
            result.append(sorted_even_indices.pop(0))
        else:
            result.append(l[i])
    return result
end
{x_test}
Code: ”
REPAIRING = “
Requirement: prime_fib returns n-th number that is a Fibonacci number and it’s also prime.
Code:
def is_prime(n):
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True

def prime_fib(n):
    if n <= 0:
Requirement: Create a function that takes integers, floats, or strings representing real
numbers, and returns the larger variable in its given variable type. Return None if the values
are equal. Note: If a real number is represented as a string, the floating point might be . or ,
Code:
def compare_one(a, b):
    if a == b:
        return None
    if isinstance(a, str):
        a = float(a.replace(',', '.'))
    if isinstance(b, str):
        b = float(b.replace(',', '.'))
Requirement: Create a function which takes a string representing a file’s name, and returns ’Yes’ if the file’s name is valid, and returns ’No’ otherwise. A file’s name is considered to be valid if and only if all the following conditions are met: - There should not be more than three digits (’0’-’9’) in the file’s name. - The file’s name contains exactly one dot ’.’ - The substring before the dot should not be empty, and it starts with a letter from the latin alphapet (’a’-’z’ and ’A’-’Z’). - The substring after the dot should be one of these: [’txt’, ’exe’, ’dll’]
Code:
def file_name_check(file_name):
    if len(re.findall(r'\d', file_name)) > 3:
        return 'No'
    if len(re.findall(r'\.', file_name)) != 1:
        return 'No'
    if not re.match(r'[a-zA-Z]', file_name.split('.')[0]):
        return 'No'
    if file_name.split('.')[1] not in ['txt', 'exe', 'dll']:
        return 'No'
    return 'Yes'
Test Report: Upon careful examination and testing of the ‘file_name_check‘ function, it was
found to perform as expected. The function successfully verifies whether a file name meets
certain criteria: - It contains no more than three digits. - It has exactly one period. - It starts
with a letter before the period. - The extension after the period is one of ’txt’, ’exe’, or ’dll’. If
the file name meets all of these conditions, the function returns ’Yes’. Otherwise, it returns
’No’. In conclusion, no issues were found during testing. The ‘file_name_check‘ function
has passed all tests and is working as expected. Code Test Passed
Modified Code: ’file_name_check’ function passed all tests and is working as expected. So
the code has no need to change.
end
Requirement: This function takes a list l and returns a list l’ such that l’ is identical to l in
the odd indicies, while its values at the even indicies are equal to the values of the even
indicies of l, but sorted.
Code:
def sort_even(l: list):
    even_indices = [l[i] for i in range(len(l)) if i % 2 == 0]
    sorted_even_indices = sorted(even_indices)
    l_prime = l.copy()
    for i in range(len(l_prime)):
        if i % 2 == 0:
            l_prime[i] = sorted_even_indices.pop(0)
    return l_prime
Test Report: Upon careful examination and testing of the ‘sort_even‘ function in the provided
code, the following issues were discovered: 1. The function sorts elements at even indices
in the list, rather than sorting even numbers themselves. Addressing these issues will result
in a function that correctly sorts all even numbers in the input list while maintaining the
order and position of odd numbers.
Modified Code:
def sort_even(l: list):
    even_indices = [x for x in range(len(l)) if x % 2 == 0]
{x_test}
Modified Code: ”
TESTING = “
Requirement: prime_fib returns n-th number that is a Fibonacci number and it’s also prime.
Code to test:
def is_prime(n):
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True

def prime_fib(n):
    if n <= 0:
        raise ValueError("n must be a positive integer.")
Requirement: Create a function that takes integers, floats, or strings representing real
numbers, and returns the larger variable in its given variable type. Return None if the values
are equal. Note: If a real number is represented as a string, the floating point might be . or ,
Code to test:
def compare_one(a, b):
    if a == b:
        return None
    if isinstance(a, str):
        a = float(a.replace(',', '.'))
    if isinstance(b, str):
        b = float(b.replace(',', '.'))
Requirement: Create a function which takes a string representing a file’s name, and returns
’Yes’ if the file’s name is valid, and returns ’No’ otherwise. A file’s name is considered to
be valid if and only if all the following conditions are met: - There should not be more
than three digits (’0’-’9’) in the file’s name. - The file’s name contains exactly one dot ’.’ -
The substring before the dot should not be empty, and it starts with a letter from the latin
alphapet (’a’-’z’ and ’A’-’Z’). - The substring after the dot should be one of these: [’txt’, ’exe’,
’dll’]
Code to test:
def file_name_check(file_name):
    if len(re.findall(r'\d', file_name)) > 3:
        return 'No'
    if len(re.findall(r'\.', file_name)) != 1:
        return 'No'
    if not re.match(r'[a-zA-Z]', file_name.split('.')[0]):
        return 'No'
    if file_name.split('.')[1] not in ['txt', 'exe', 'dll']:
        return 'No'
    return 'Yes'
Test Report: Upon careful examination and testing of the ‘file_name_check‘ function, it
was found to perform as expected. The function successfully verifies whether a file name
meets certain criteria: - It contains no more than three digits. - It has exactly one period. - It
starts with a letter before the period. - The extension after the period is one of ’txt’, ’exe’, or
’dll’. If the file name meets all of these conditions, the function returns ’Yes’. Otherwise, it
returns ’No’. In conclusion, no issues were found during testing. The ‘file_name_check‘
function has passed all tests and is working as expected. Code Test Passed
end
Requirement: This function takes a list l and returns a list l’ such that l’ is identical to l in
the odd indicies, while its values at the even indicies are equal to the values of the even
indicies of l, but sorted.
Code to test:
def sort_even(l: list):
    even_indices = [l[i] for i in range(len(l)) if i % 2 == 0]
    sorted_even_indices = sorted(even_indices)
    l_prime = l.copy()
    for i in range(len(l_prime)):
        if i % 2 == 0:
            l_prime[i] = sorted_even_indices.pop(0)
    return l_prime
Test Report: Upon careful examination and testing of the ‘sort_even‘ function in the
provided code, the following issues were discovered: 1. The function sorts elements at
even indices in the list, rather than sorting even numbers themselves. Addressing these
issues will result in a function that correctly sorts all even numbers in the input list while
maintaining the order and position of odd numbers.
end
{x_test}
Test Report: ”
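For concreteness, here is a minimal sketch, purely illustrative rather than the paper's code, of how the {x_test} placeholder in the few-shot stage prompts above could be filled with a new requirement before querying the model. The build_prompt helper and the abbreviated CODING_PROMPT string are hypothetical.

```python
def build_prompt(stage_prompt_template: str, x_test: str) -> str:
    """Substitute the {x_test} slot of a few-shot stage prompt with a concrete requirement."""
    # The templates above end with "... {x_test} Code:" (or "Plan:", "Test Report:"),
    # so a plain string substitution is sufficient.
    return stage_prompt_template.replace("{x_test}", x_test)


# Abbreviated stand-in for the full CODING prompt reproduced above.
CODING_PROMPT = "...\nend\nRequirement: {x_test}\nCode: "

if __name__ == "__main__":
    requirement = "Write a python function to find the sum of even numbers at even positions."
    print(build_prompt(CODING_PROMPT, requirement))
```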
The self-collaboration framework has a significant impact on repository-level code generation tasks by enabling collaborative problem-solving, which is more effective than single-agent solutions. In real-world scenarios, the framework is notably effective at handling complex code generation tasks, as it allows the distributed roles to manage different aspects of the codebase systematically. This approach overcomes the limitations of individual models by synthesizing multiple perspectives and areas of expertise, thus improving the handling of intricate, large-scale tasks.
Evidence indicates that role-playing instructions enhance clarity and efficiency by guiding LLMs to align their behavior with specific roles, which keeps that behavior consistent with expectations throughout the interaction. The instructions need to be set only once, improving overall efficiency and clarity while the LLMs carry out their designated tasks, as demonstrated by improved task outcomes and engagement in specialized fields.
Role-playing enhances performance by situating language models within specific domains, thus harnessing their expertise in those areas. By providing role instructions at initialization, LLMs align their behavior with specific expectations, resulting in task performance that is more consistent with predefined goals. This approach has been shown to yield superior results compared to engaging LLMs without predefined contextual settings.
The self-collaboration framework might outperform the previous SOTA, CodeX (175B) + CodeT, on complex tasks because of its structured division of tasks and its collaborative use of roles such as analyst, coder, and tester. This division allows for specialized, focused problem-solving, in which individual roles contribute expertise to different aspects of code generation. Such a structured, team-based approach effectively handles complex logical and operational components of tasks that a single model or a simplistic combination might struggle with, resulting in superior performance on benchmarks such as CoderEval.
The self-collaboration framework improves the performance of code generation by leveraging a division of labor and collaboration among multiple LLM roles. This approach enables different roles, such as analyst, coder, and tester, to handle specific stages of software development, thereby surpassing the performance of a single LLM agent. Experiments show that self-collaboration achieves significant improvements on benchmarks such as CoderEval, with a reported relative improvement of up to 47.1% in Pass@1 over a single ChatGPT agent.
Introducing software-development methodology (SDM) into the self-collaboration framework addresses the challenge of organizing and managing the virtual team effectively. SDM provides defined stages, principles, and practices that control task complexity and improve software quality. By structuring the development process, it ensures efficient collaboration among LLM roles and reduces potential ambiguities in task execution.
The 'prime_fib' function is designed to return the n-th number that is both a Fibonacci number and prime. This involves generating Fibonacci numbers sequentially and checking each for primality, decrementing a counter until the desired n-th such number is found. Within the self-collaboration framework, it serves as a task that combines sequential number generation with primality checking, showcasing the LLM roles' ability to handle precise, domain-specific requirements.
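As a concrete illustration, the following self-contained script restates the prime_fib logic from the few-shot CODING example above, with a slightly tighter trial-division primality check; the printed values are the first five prime Fibonacci numbers.

```python
def is_prime(n: int) -> bool:
    """Trial-division primality check up to sqrt(n)."""
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True


def prime_fib(n: int) -> int:
    """Return the n-th number that is both a Fibonacci number and prime."""
    fib_seq = [1, 1]
    counter = n
    while counter > 0:
        fib_seq.append(fib_seq[-1] + fib_seq[-2])
        if is_prime(fib_seq[-1]):
            counter -= 1  # one more prime Fibonacci number found
    return fib_seq[-1]


if __name__ == "__main__":
    print([prime_fib(i) for i in range(1, 6)])  # [2, 3, 5, 13, 89]
```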
The primary roles in the self-collaboration framework are the analyst, the coder, and the tester. The analyst breaks down requirements and develops high-level plans that guide the coder. The coder writes or improves code based on the analyst's plans or the tester's feedback. The tester compiles test reports on the coder's work and documents any issues. This division allows each stage to be handled by a specialist, enhancing efficiency and code quality.
Alternative team configurations, such as adding a compiler role or adopting pair programming (a driver and an observer), have also been tested. While incorporating a compiler adds some value by constructing use cases and providing execution results, it did not significantly outperform the standard analyst-coder-tester configuration; a likely reason is that the tester already resolves errors efficiently without compiler output. Pair programming with a driver and an observer likewise did not surpass the standard setup, which indicates the strength of the three-role team's synergy.
Instructions and few-shot prompting are two strategies used to enhance code generation. Role instructions provide guidance aligned with the expected roles and structure the model's behavior in a predictable manner. Few-shot prompting instead provides worked examples of each task, demonstrating the expected results before the model begins generating. Both methods enhance performance, but role instructions, which frame tasks in familiar terms, have shown performance advantages in certain settings.