Safurai-001: Advancements in Code LLMs
ABSTRACT
This paper presents Safurai-001, a new Large Language Model (LLM) with significant potential in the domain of coding assistance. Driven by recent advancements in coding LLMs, Safurai-001 competes in performance with the latest models such as WizardCoder [Xu et al. (2023)], PanguCoder [Shen et al. (2023)] and Phi-1 [Gunasekar et al. (2023)], but aims to deliver a more "conversational" interaction. By capitalizing on progress in data engineering (the latest techniques of data transformation and prompt engineering) and instruction tuning, this new model promises to stand toe-to-toe with recent closed- and open-source developments. Recognizing the need for an efficacious evaluation metric for coding LLMs, this paper also introduces GPT4-based MultiParameters: an evaluation benchmark that harnesses varied parameters to present a comprehensive insight into the model's functioning and performance. Our assessment shows that Safurai-001 outperforms GPT-3.5 by 1.58% and WizardCoder by 18.78% on the Code Readability parameter, among others.
1 INTRODUCTION
Code large language models are one of the most promising applications of LLMs, and they have drawn a lot of interest from both academia and industry because of their extraordinary aptitude for code-related tasks.
The closed-source landscape is dominated by OpenAI models: GPT-3.5 and GPT-4 [OpenAI (2023)], the latter currently the best-ranked model on the HumanEval pass@1 chart. Before the release of StarCoder [Li et al. (2023)], the open-source world fell far behind commercial models in terms of model size, capability, and performance.
However, this paradigm started changing with the advent of StarCoder. It has frequently been employed as a foundational model in the development of other models with great results, such as WizardCoder [Xu et al. (2023)] and PanguCoder [Shen et al. (2023)], significantly narrowing the performance gap between open- and closed-source coding LLMs. Lately, Meta also introduced a new set of 12 LLMs available for commercial use with the LLAMA2 release [Touvron et al. (2023)]. Teams from all over the world can use LLAMA2 as a new foundation model for coding LLMs, in competition with StarCoder.
In the latest publications in the coding LLMs field, many efforts have been made regarding data engineering (Phi-1) and instruction tuning (WizardCoder).
We have tried to capitalize on all the latest innovations in the field of Coding LLMs to develop a
high-performance model that is in line with the latest open-source releases.
In a nutshell, we make the following contribution:
∗ https://www.safurai.com/team
1 https://openai.com/blog/introducing-chatgpt-and-whisper-apis
• We present Safurai-001, a model that competes with WizardCoder in performance while aiming for a more "conversational" approach.
2 RELATED WORK
The impressive Codex model, with its 12 billion parameters, illustrates a remarkable capacity to
solve approximately 72% of Python programming challenges. This achievement has paved the
way for the development of other advanced code generation models, including AlphaCode [Li et al.
(2022)], PaLM-Coder [Chowdhery et al. (2022)], and PanGu-Coder [Shen et al. (2023)]. However,
one notable drawback is the lack of open-source availability of these state-of-the-art models, a void
that has subsequently been filled by the release of several open-source variants such as CodeParrot2,
PolyCoder3, PyCodeGPT4, SantaCoder [Allal et al. (2023)], and StarCoder [Li et al. (2023)]. This new wave of open-source models has reinvigorated the code generation field.
Furthermore, the sequential expansions of code generation application scopes are reflective of
the field’s ever-growing practicality. For instance, CodeGeeX [Zheng et al. (2023)], BLOOM
[Workshop (2022)] and ERNIE-Code [Chai et al. (2022)] have been developed to enable multilin-
gual modeling. JuPyT5 [Chandel et al. (2022)] was trained using an extensive corpus of Jupyter
notebooks, its primary objective being to enhance the process of interactive programming. Mod-
els like DocCoder and APICoder [Zan et al. (2022)] have also been constructed to equip language
models with the functionality to call APIs. Moreover, a number of models, including InCoder
[Fried et al. (2022)], SantaCoder, and StarCoder, support code generation at arbitrary locations.
Recently, some groups have been utilizing instructional tuning techniques to tap into the vast po-
tential knowledge contained within extensive language models. This process involves carefully
refining these models with high-quality datasets. In terms of code generation, WizardCoder (15B),
PanguCoder and phi-1 (1.3B) models stand out for their exemplary performance. This was achieved
through careful fine-tuning with data generated by OpenAI’s GPT-3.5 and GPT-4.
The landscape of code, logic, and algebra datasets is teeming with new resources that can be used for finetuning coding LLMs (the majority of them are open source).
The most important coding dataset in this field is CodeAlpaca-20k5. Many models, such as PanGu-Coder and WizardCoder, have also built their datasets by manipulating Code Alpaca with data augmentation techniques. The Phi-1 [Gunasekar et al. (2023)] coding model was likewise trained with a filtered code-language dataset, which is a subset of The Stack6 (it contains over 6TB of permissively-licensed source code files covering 358 programming languages).
In terms of datasets for mathematics and logic, the open-source community offers a variety of resources in Q&A format that are helpful for fine-tuning LLMs. The majority of these datasets were produced by T57, GPT-3.5, GPT-4, or a combination of these models (although how OpenAI's policies apply in this context remains open to interpretation).
2 https://huggingface.co/codeparrot/codeparrot
3 https://huggingface.co/NinedayWang/PolyCoder-2.7B
4 https://github.com/microsoft/PyCodeGPT
5 https://huggingface.co/datasets/HuggingFaceH4/CodeAlpaca_20K
6 https://huggingface.co/datasets/bigcode/the-stack
7 https://huggingface.co/docs/transformers/model_doc/t5
2.3 LATEST TECHNIQUES FOR PROMPT ENGINEERING
In this section, we outline the primary prompt engineering methods applied to the coding LLMs field:
• Chain of Thoughts (CoT): Wei et al.[2023] report that large language models can enable
the emergence of reasoning abilities when prompted in this way. A chain of thought is a
series of intermediate natural language reasoning steps that lead to the final output.
• CoT and Self-Consistency: this is the natural evolution of the CoT technique. It first samples a diverse set of reasoning paths instead of only taking the greedy one, and then selects the most consistent answer by marginalizing out the sampled reasoning paths (see the sketch after this list). Self-consistency leverages the intuition that a complex reasoning problem typically admits multiple different ways of thinking that lead to its unique correct answer (Wang et al.[2022]).
• Tree of Thoughts (ToT): Yao et al.[2023] report that ToT allows LMs to perform deliber-
ate decision making by considering multiple different reasoning paths and self-evaluating
choices to decide the next course of action, as well as looking ahead or backtracking when
necessary to make global choices.
• Teacher CoT: Ho et al.[2023] demonstrated that by augmenting the prompt with an "educational" explanation generated by a larger model, excellent results are obtained in the finetuning of smaller models. Mukherjee et al.[2023] also used this "teaching" approach to develop the Orca model.
• EvolInstruct: Luo et al.[2023] proposed a new approach to data augmentation that achieved important results. They found that LLMs can make given instructions more complex and difficult using specific prompts. Additionally, models can generate entirely new instructions that are equally complex but completely different. Using this discovery, the WizardCoder creators could iteratively evolve an initial instruction dataset, improving its difficulty level and expanding its richness and diversity.
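To make the self-consistency idea above more concrete, the following is a minimal Python sketch: sample several chain-of-thought completions, discard the reasoning, and majority-vote over the final answers. The sample_fn and extract_answer callables are hypothetical placeholders for a sampling-enabled model call and an answer parser, not functions from the paper.

import collections
from typing import Callable, List

def self_consistency_answer(
    prompt: str,
    sample_fn: Callable[[str], str],       # hypothetical: returns one sampled CoT completion (temperature > 0)
    extract_answer: Callable[[str], str],  # hypothetical: pulls the final answer out of a completion
    n_samples: int = 10,
) -> str:
    """Sample several reasoning paths and return the most consistent final answer."""
    answers: List[str] = []
    for _ in range(n_samples):
        completion = sample_fn(prompt)      # one chain-of-thought reasoning path
        answers.append(extract_answer(completion))
    # Marginalize out the reasoning paths: keep only the final answers and majority-vote over them.
    most_common_answer, _count = collections.Counter(answers).most_common(1)[0]
    return most_common_answer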
2.4 LATEST EVALUATION TECHNIQUES FOR CODING LLMS (HUMANEVAL, MBPP, MULTIPL-E, HUMANEVAL PACK)
This subchapter provides an overview of the benchmarks currently being used to evaluate Coding
LLMs.
1. HumanEval8: This general standard benchmark holds a set of 164 problems restricted to the Python language. It assesses whether the model's code successfully passes all the tests and provides binary, quantitative results only. Generally, there are three types of HumanEval evaluation: pass@1, pass@10 and pass@100. They differ in the number of "chances" given to the tested model to generate the right answer to the problem (see the pass@k sketch after this list).
2. MultiPL-E9: Based on the premise of HumanEval, MultiPL-E takes this benchmark and translates its problems into numerous programming languages such as C++, Rust, Go, Java and more. With the same ranking structure as HumanEval, this tool also provides a quantitative, binary evaluation.
3. MBPP10: Consisting of approximately 1000 programming problems sourced from Python programmers, this benchmark is geared towards beginners. It offers a description of tasks, corresponding code solutions, and three automatic test cases. Its focus is on programming fundamentals and the application of standard library functions.
4. HumanEval Pack: This innovative evaluation method by BigCode’s11 team brings a fresh
perspective to the assessment of Coding LLMs. It expands the HumanEval by engaging
three different stages: Fix, Explain, and Synthesize. The “Fix” stage evaluates the model’s
ability to rectify code functions containing subtle bugs, the “Explain” stage assesses the
model’s capacity to generate clear code explanations, while the “Synthesize” stage gauges
how effectively the model synthesizes code given a natural language instruction.
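For reference, pass@k in HumanEval-style evaluations is typically computed with the unbiased estimator introduced together with the benchmark; the short sketch below follows that standard formulation, and the example numbers are purely illustrative.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: total completions sampled for the problem
    c: completions that pass all unit tests
    k: evaluation budget (1, 10 or 100)
    """
    if n - c < k:
        return 1.0  # every size-k sample necessarily contains a correct completion
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 9 of which pass the tests.
print(pass_at_k(n=20, c=9, k=1))   # 0.45
print(pass_at_k(n=20, c=9, k=10))  # ~0.99994

The per-problem estimates are then averaged over the benchmark to obtain the reported score.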
8 https://huggingface.co/datasets/openai_humaneval
9 https://huggingface.co/datasets/nuprl/MultiPL-E
10 https://huggingface.co/datasets/mbpp
11 https://huggingface.co/bigcode
3 METHODS
3.1 DATASET OVERVIEW
Overall, for the generation of Safurai-001 (starting from StarCoder 15B [Li et al. (2023)]) we used a dataset of 200,000 Q&A examples.
As we have seen from the publications on WizardCoder [Xu et al. (2023)] and Phi-1 [Gunasekar et al. (2023)], data quality is essential for producing a performant coding LLM. For this reason, we used the latest data augmentation and prompt engineering techniques to generate the datasets. Furthermore, we included some datasets and data related to basic logical and algebraic reasoning, in order to boost StarCoder's comprehension abilities.
The proprietary datasets that we selected for Safurai-001 training are listed in Section 3.4.
We employed an additional LLM to enhance the educational potential present within the model. By providing both a problem and its solution, we prompted the model to elucidate the reasoning process leading to the solution.
Our experimentation with various techniques led to the creation of a diverse dataset. The following are some of the methods we harnessed to augment the educational value.
Transformation techniques used for our initial datasets include a Tree-of-Thoughts-style prompt, excerpted below:
"But there’s a twist: envisage a collaboration between
three experts, each adding a piece to the puzzle. "
"After contributing a step, they discuss it with the
group before proceeding. "
"If an expert determines their step is incorrect, they
step away from the task. "
"The exercise concludes when a comprehensive correct
answer has been achieved, or all experts have
withdrawn."
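To make the teacher-style transformation described above more concrete (a larger model is given a problem together with its solution and asked to explain the reasoning), here is a minimal sketch. The template wording and the ask_teacher helper are illustrative assumptions, not the exact prompts used to build the Safurai datasets; the output follows the dialogue format shown later in this section.

from typing import Callable

TEACHER_TEMPLATE = """You are an expert programming tutor.
Problem:
{problem}

Reference solution:
{solution}

Explain, step by step, the reasoning that leads from the problem to this solution,
as if teaching a junior developer."""

def build_teacher_example(problem: str, solution: str,
                          ask_teacher: Callable[[str], str]) -> dict:
    """Turn a (problem, solution) pair into a dialogue-format training example
    whose assistant turn carries the teacher model's explanation."""
    teacher_prompt = TEACHER_TEMPLATE.format(problem=problem, solution=solution)
    explanation = ask_teacher(teacher_prompt)  # call to the larger "teacher" LLM (hypothetical helper)
    return {
        "messages": [
            {"content": problem, "role": "user"},
            {"content": f"{explanation}\n\n{solution}", "role": "assistant"},
        ]
    }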
3.4 SAFURAI-001 DATASETS USED FOR FINETUNING:
• Safurai Code Instructor (16k) (filtered examples from initial Safurai Code Dataset, not
transformed)
• Logic Q&A Dataset (22k)
• Teacher Code Instructor (70k)
• Math Q&A Dataset (15k)
• Teacher Code Instructor with Potential Errors (21k)
• ToT Code Instructor (30k)
• CoT Code Instructor (26k)
The training process for the StarCoder model was carried out on eight A100 80GB graphics cards, with the full network trained for 10 hours. This process was implemented using the DeepSpeed ZeRO-3 framework. In order to fine-tune the StarCoder models, we leveraged the Big Code GitHub repository: https://github.com/bigcode-project/starcoder/tree/main/chat
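For orientation, a minimal fine-tuning sketch in the spirit of that setup is shown below. The hyperparameters, the dataset handling and the ds_zero3.json path are illustrative placeholders rather than the settings actually used for Safurai-001; in practice such a script would be launched with the DeepSpeed or torchrun launcher across the eight GPUs.

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

def finetune(train_dataset, base_model: str = "bigcode/starcoder"):
    """train_dataset: tokenized dialogue-format examples (placeholder argument)."""
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForCausalLM.from_pretrained(base_model)

    # Illustrative settings only; the actual Safurai-001 hyperparameters are not reproduced here.
    args = TrainingArguments(
        output_dir="safurai-001-ft",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
        deepspeed="ds_zero3.json",  # DeepSpeed ZeRO-3 config file (assumed local path)
    )
    trainer = Trainer(model=model, args=args,
                      train_dataset=train_dataset, tokenizer=tokenizer)
    trainer.train()
    return model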
Given that the focus was on instruction tuning, the data was wrapped in a dialogue format. An
example of this is illustrated below:
{
    "messages": [
        {
            "content": "Is it possible to imagine a society without law?",
            "role": "user"
        },
        {
            "content": "It is difficult to imagine a society that is able to be maintained without any semblance of Law.",
            "role": "assistant"
        },
        {
            "content": "It seems like you consider the absence of law equal to the absence of anything that could guide the behaviour of the individual.",
            "role": "user"
        },
        {
            "content": "You are correct that there are other factors that can guide behavior in a society and play a role in shaping individuals' behavior and interactions with each other. However, even in societies where these factors are present, laws still serve an important role in maintaining social order and resolving conflicts.",
            "role": "assistant"
        }
    ]
}
We also wrapped the dialogue with special tokens. Here’s an example of how it looks:
<|system|>
Below is a dialogue between a human and AI assistant
called StarChat.
<|end|>
<|user|>
Is it possible to imagine a society without law?<|end|>
<|assistant|>
It is difficult to imagine ...<|end|>
<|user|>
It seems like you ...<|end|>
<|assistant|>
You are correct ...<|end|>
<|user|>
Yeah, but laws are complicated ...<|end|>
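A minimal sketch of how a messages list can be wrapped into this token format is shown below; the function name is ours, and the system prompt is the one from the example above.

from typing import Dict, List

SYSTEM_PROMPT = "Below is a dialogue between a human and AI assistant called StarChat."

def to_chat_prompt(messages: List[Dict[str, str]]) -> str:
    """Wrap a list of {"content": ..., "role": ...} turns with the special tokens shown above."""
    parts = [f"<|system|>\n{SYSTEM_PROMPT}\n<|end|>"]
    for turn in messages:
        role_token = "<|user|>" if turn["role"] == "user" else "<|assistant|>"
        parts.append(f"{role_token}\n{turn['content']}<|end|>")
    return "\n".join(parts)

example = [
    {"content": "Is it possible to imagine a society without law?", "role": "user"},
    {"content": "It is difficult to imagine ...", "role": "assistant"},
]
print(to_chat_prompt(example))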
The training process involved setting up several hyperparameters. The hyperparameter settings for the training are detailed as follows:
3.6 EVALUATION
Deepening our grasp of the capabilities and scope of LLMs is essential to refining their application in the real world. However, we found the currently available evaluation methods such as HumanEval12 to be limited in their ability to provide a comprehensive analysis of these models' abilities. This led us to develop the GPT4-based MultiParameters Evaluation method, a qualitative alternative designed to provide a more nuanced understanding of the performance of coding LLMs.
12 https://huggingface.co/datasets/openai_humaneval
These new qualitative criteria enable us to explore more use-cases beyond the conventional binary pass-fail result of the existing quantitative methods, thus providing a more detailed narrative that identifies the unique strengths (or weaknesses) of each model. Existing benchmarks such as HumanEval, MBPP13 and MultiPL-E14 lean towards a quantitative rather than a qualitative evaluation, leaving out crucial aspects of the models' capabilities. As such, we justify the innovation and necessity of our GPT4-based MultiParameters Evaluation method in addressing this gap.
Seeking to explore the qualitative aspects of our LLM model Safurai, we experimented with a new
evaluation approach based on GPT-4 [OpenAI (2023)].
The experiments detailed above provided a holistic process for comparative model evaluation. By
evaluating 20 (GPT-4 HE-20) and 40 (GPT-4 HE-40) responses from each compared model using
the HumanEval dataset, we generated valuable quantitative data and underlying qualitative insights
on model performance.
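As an illustration, the sketch below shows how such a subset can be drawn from the public HumanEval dataset with the Hugging Face datasets library; the seed and the resulting selection are illustrative, not the exact problems used in our evaluation.

from datasets import load_dataset

# Load the 164 HumanEval problems from the Hugging Face Hub.
humaneval = load_dataset("openai_humaneval", split="test")

# Draw a reproducible subset for GPT-4 HE-20 / HE-40 style evaluations
# (seed and subset are illustrative, not the paper's exact selection).
he_40 = humaneval.shuffle(seed=42).select(range(40))
he_20 = he_40.select(range(20))

for problem in he_20:
    print(problem["task_id"])  # e.g. "HumanEval/0"
    # problem["prompt"] holds the signature and docstring sent to each model;
    # problem["test"] holds the reference unit tests used in the rating prompts.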
13 https://huggingface.co/datasets/mbpp
14 https://huggingface.co/datasets/nuprl/MultiPL-E
15 https://www.anthropic.com/index/introducing-claude
16 https://openai.com/blog/chatgpt
17 https://huggingface.co/HuggingFaceH4/starchat-alpha
However, we recognized that the comprehensive ratings provided by GPT-4, while integral to the evaluation process, cannot fully capture the nuanced specificities inherent in each model. Comprehensive ratings capture a model's ability to resolve a problem and generate correct code, but they fall short in illuminating aspects such as efficiency, readability, best coding practices, and relevance to the problem. These key dimensions, though less evident, are equally vital to a model's utility and impact in real-world software development scenarios.
To alleviate these shortcomings and provide a more detailed, multidimensional, and nuanced ap-
praisal of the models’ functionalities, we introduced a four-parameter rating system.
1. Code Correctness and Completeness: This involved gauging whether the code runs without
errors and if it fully solves the problem, considering all potential edge cases.
2. Efficiency: This measurement determined the optimization level of the code. It scrutinized
whether the code utilizes resources capably, and whether it scales efficiently as input size
increases.
3. Readability and Best Practices: This criterion evaluated the clarity of the written code,
whether it’s easily comprehensible, and if it conforms to established coding conventions
and best practices.
4. Relevance to Problem (On-point Answer): This parameter evaluated how directly the code
solves the given problem, assessing whether the solution implemented is efficacious and
appropriate.
1. I asked this to 4 different AI models: [problem] This is the first model answer: [answer]
This is the second model answer: [answer] This is the third model answer: [answer]
This is the fourth model answer: [answer] These are the tests for the code solution of the
problem: [tests] Please rate each answer from 0 to 100 (best answer possible) based on
Code Completeness. Consider whether the code fully solves the problem, if it handles all
edge cases, and if it contains all necessary functionalities. Also, provide a short explanation
for each rating.
2. I asked this to 4 different AI models: [problem] This is the first model answer: [answer]
This is the second model answer: [answer] This is the third model answer: [answer] This is
the fourth model answer: [answer] These are the tests for the code solution of the problem:
[tests] Please rate each answer from 0 to 100 (best answer possible) on Efficiency. This
entails considering how well-optimized the code is, how frugally it uses system resources,
and its scalability or robustness for larger inputs. Consider both its time complexity (ability
to perform tasks quickly) and space complexity (how much memory the program uses).
Also, provide a short explanation for each rating.
3. I asked this to 4 different AI models: [problem] This is the first model answer: [answer]
This is the second model answer: [answer] This is the third model answer: [answer]
This is the fourth model answer: [answer] These are the tests for the code solution of
the problem: [tests] Please rate each answer from 0 to 100 (best answer possible) based
on its Helpfulness and Educational Value. Consider whether the answer provides clear
explanations, whether it’s easy to follow and understand, whether it teaches you something
valuable about the problem or the coding concepts involved, and whether it gives you new
insights that could help you in future similar problems. Also, provide a short explanation
for each rating.
4. I asked this to 4 different AI models: [problem] This is the first model answer: [answer]
This is the second model answer: [answer] This is the third model answer: [answer]
This is the fourth model answer: [answer] These are the tests for the code solution of the
problem: [tests] Please rate each answer from 0 to 100 (best answer possible) based on
its Relevance to Problem (On-point answer). Consider how directly the code answers the
problem, if it provides an adequate and appropriate solution, and whether it implements
the requirements stated in the problem. Also, provide a short explanation for each rating.
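To make the workflow concrete, the sketch below fills in an abridged version of the third template and sends it to GPT-4. It assumes the v1-style openai Python client with an OPENAI_API_KEY set in the environment; the function and template names are ours, not part of the original pipeline.

from typing import List
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Abridged version of the third template above (Helpfulness and Educational Value).
TEMPLATE = (
    "I asked this to 4 different AI models: {problem}\n"
    "This is the first model answer: {a1}\n"
    "This is the second model answer: {a2}\n"
    "This is the third model answer: {a3}\n"
    "This is the fourth model answer: {a4}\n"
    "These are the tests for the code solution of the problem: {tests}\n"
    "Please rate each answer from 0 to 100 (best answer possible) based on its "
    "Helpfulness and Educational Value. Also, provide a short explanation for each rating."
)

def rate_answers(problem: str, answers: List[str], tests: str, model: str = "gpt-4") -> str:
    """Fill the template with four model answers and ask GPT-4 for ratings."""
    prompt = TEMPLATE.format(problem=problem, a1=answers[0], a2=answers[1],
                             a3=answers[2], a4=answers[3], tests=tests)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic ratings are easier to compare across runs
    )
    return response.choices[0].message.content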
3.7.3 WHY GPT-4?
Choosing GPT-4 as the foundation for our novel evaluation method was a clear and strategic decision. At the current state of the art, GPT-4 showcases the strongest performance among all existing coding models.
In addition to its exemplary robustness and efficiency, GPT-4 boasts an extensive range of applica-
tions, making it an ideal model for evaluating less potent ones in diverse facets. This ability allows
the evaluation process to be as comprehensive as possible, scrutinizing every model under multiple
parameters and therefore obtaining an all-encompassing performance profile.
However, it’s worth noting that there are inherent limitations associated with GPT-4. These limi-
tations, reflected as weaknesses in our newly proposed evaluation system, include potential biases
carried over from GPT-4 behaviours. Nevertheless, as long as no other model matches or surpasses
GPT-4’s aptitude, these weaknesses remain relatively negligible. We reckon that at this stage, GPT-
4 presents the most accurate measure for performance indicators considered in our GPT4-based
MultiParameters Evaluation methodology.
Thus, until a more advanced model than GPT-4 enters the arena, our focus remains on deriving
maximum benefit from this superior model, optimizing its use in revealing the strengths and areas
of improvement of each evaluated coding model.
4 EXPERIMENTAL RESULTS
Safurai-001: As an AI language model, I have been created and trained to have a strong
understanding of Python programming language. I can help with a wide range of topics and
tasks in Python, from basic syntax and concepts to more advanced algorithms or frameworks.
Feel free to ask me anything related to Python coding, and I’ll do my best to assist you!
USER: How can I write a function in Python to find the median of a list? Please include the
time and space complexity of the solution.
Safurai-001: The median of a list is the middle value when the list is sorted in ascending
order. If the list has an odd number of elements, the median is the middle element. If the list
has an even number of elements, the median is the average of the two middle elements.
To solve this problem, we can follow these steps:
1. Sort the list in ascending order.
2. Check if the length of the list is odd or even.
3. If the length is odd, return the middle element.
4. If the length is even, calculate the average of the two middle elements and return it.
Here is the implementation of the median function:
def median(l: list):
    l.sort()                   # Sort the list in ascending order
    length = len(l)
    if length % 2 == 1:        # Check if the length is odd
        return l[length // 2]  # Return the middle element
    else:
        mid1 = l[length // 2]        # First middle element
        mid2 = l[length // 2 - 1]    # Second middle element
        return (mid1 + mid2) / 2     # Return the average of the two middle elements

Test cases:

print(median([3, 1, 2, 4, 5]))            # Output: 3
print(median([-10, 4, 6, 1000, 10, 20]))  # Output: 15.0
The time complexity of this solution is O(n log n), where n is the length of the input list.
This is because the sorting operation takes O(n log n) time. The space complexity is O(1) as
we are not using any additional data structures.
Original Input
Write a JavaScript code to loop over all elements in the given array. Input: numbersArray =
[45, 6, 23, 12, 35]
Original Output
Augmented Output
Sure, I'd be happy to help!
1. Start by declaring the array variable: const numbersArray = [45, 6, 23, 12, 35]
2. We want to loop over all elements in the array, so we need to use a for loop.
3. Declare our index variable i and set it to 0.
4. Define the loop condition; we want to continue looping as long as the index variable is less than the length of the array.
5. Use the console.log() method to print out the value of the current element in the array.
6. Finally, increment the index variable by 1 to move onto the next element in the array.
Here's the final code:

const numbersArray = [45, 6, 23, 12, 35];
for (let i = 0; i < numbersArray.length; i++) {
    console.log(numbersArray[i]);
}
This code will loop over all elements in the array and print out their values to the console.
HumanEval18, established by OpenAI, has been the go-to benchmark for evaluating coding AI models. It focuses on the model's code-generation capacity based on precise requests and provides standard solutions and tests. MultiPL-E19 extends the boundaries of the current benchmarks by translating them into new languages, thereby becoming a massive multi-language benchmarking platform. It is continuously expanding in terms of the number of programming languages, providing a useful comparison point for models like Safurai.
Our model, Safurai-001, achieved a pass@1 score of 50.61% on the HumanEval benchmark with n=20. (Table 1)
However, adopting only these standards limits our analysis to quantitative metrics, thereby missing some critical qualities of the models.
18 https://github.com/openai/human-eval
19 https://huggingface.co/datasets/nuprl/MultiPL-E/viewer/humaneval-rs/test?row=0
4.3.2 NEW QUALITATIVE EVALUATION BENCHMARK
We tested the models on the 40 selected problems from HumanEval already used in the GPT4-based Analysis. The GPT4-based MultiParameters Evaluation method elucidates areas for optimization, explains why a specific response is superior, and provides significant insight into the specific code-generation abilities of each model, thus providing a detailed qualitative metric.
We found that this method reveals a plethora of valuable insights into each model’s strengths and
weaknesses, enabling the development of targeted strategies for enhancement. (Table 3)
We put our proposed GPT4-based MultiParameters Evaluation method to the test, using the same 40 selected problems from HumanEval that had previously been used in our GPT4-based Analysis. The results obtained were intriguing, enlightening and informative, revealing areas of optimization and superiority in specific responses and highlighting the need to explore code-generation abilities at a profound level. The qualitative data provided by this method was a treasure trove of information, reaching depths that previous evaluation methods did not venture into.
Interestingly, this assessment unveiled nuances in model performance that were not entirely predic-
tive of functionality during actual deployment. For instance, despite WizardCoder [Xu et al. (2023)]
achieving higher scores in the HumanEval evaluation, it was observed that real-world day-to-day
usage, especially for developers, was not as smooth. The model’s conversational abilities seemed to
be somewhat lacking, making it hard to interact with it effectively. This was reflected in its score of 67.1 in the Code Readability category, a stark contrast with Safurai-001's impressive score of 85.88.
Complementing conventional quantitative benchmarks like HumanEval and MultiPL-E, we developed a new qualitative evaluation method: the GPT4-based MultiParameters Evaluation. This approach provides a broader perspective on the nuances and intricacies of LLMs, broadening the spectrum of their functionality and applications.
Models like Phi-1 [Gunasekar et al. (2023)], developed by Microsoft researchers, StarCoder [Li et al. (2023)], and WizardCoder [Xu et al. (2023)] are mainly evaluated using conventional methods. While efficient, these methods lack the ability to provide an exhaustive understanding of the models' capabilities, thus justifying the necessity of developing our new evaluation method.
The GPT4-based MultiParameters Evaluation method breaks new ground in the LLM models eval-
uation field, enabling researchers to delve deeper into the functionality of these models and dramat-
ically enhancing their potential for improvement.
of the problem. This inconsistency presents challenges in providing a robust evaluation of
the Coding AI models.
• Time-intensive: The GPT4-based MultiParameters Evaluation is a relatively time-consuming process. Due to the complex and multifaceted nature of this evaluation method, it requires substantial time and resources compared to conventional methods. This could limit its scalability and practical application in rapidly iterating development cycles.
• Qualitative nature: The GPT4-based MultiParameters Evaluation method mainly pro-
vides a qualitative evaluation. It may not effectively handle quantitative data or metrics,
and comparing those could be challenging. This could present limitations when models’
evaluation requires numerical comparisons or precise quantifiable results.
This is an example of a GPT4-based MultiParameters Evaluation response. The problem is number 0 of the HumanEval benchmark. Regarding the answers, the first model is Safurai-001, the second ChatGPT, the third WizardCoder, and the fourth GPT-4.
First Parameter: Code Correctness
First Model Answer: Rating: 90 Explanation: The code provided by the first model is
thorough, complete and well-explained. It correctly implements the function to solve the
problem. It handles the basic functionality of checking if any two numbers in the list are
closer than the threshold. However, it does not explicitly handle the edge case where the list
of numbers is empty.
Second Model Answer: Rating: 90 Explanation: The second model provides a complete
solution that is functional and concise. However, it does not handle edge case where the list
could be empty or include only one element.
Third Model Answer: Rating: 80 Explanation: The third model provides a functional solu-
tion to the problem, but it lacks in explanation and handling edge case when the list is empty
or includes only one element.
Fourth Model Answer: Rating: 100 Explanation: The fourth answer provides two complete solutions, one with a time complexity of O(n^2) and another, more efficient one with O(n log n). It correctly handles all scenarios and edge cases and provides a good balance between explanation, completeness and efficiency.
Second Parameter: Efficiency
Answer 1: 75/100 This answer provides a brute-force solution by comparing each element with every other element in the list. Although clear and correct, this solution has a time complexity of O(n^2), which means it may not be efficient for large lists of numbers, as it doesn't scale well with increasing input size.
Answer 2: 75/100 The solution given in Answer 2 is almost identical to the one in Answer 1, having a time complexity of O(n^2) with a pairwise comparison. This lacks efficiency for large inputs, and could benefit from using more optimized approaches, such as sorting the list first.
Answer 3: 75/100 Much like the first two answers, answer 3 also uses a brute-force approach in the solution. As it compares all pairs of numbers, it has a time complexity of O(n^2), meaning it won't be efficient for larger lists.
Answer 4: 90/100 Answer 4 is the best of the lot in terms of efficiency. The answer actually lists two different approaches: a brute-force approach mentioned in the first three answers, and a more efficient approach that first sorts the list before comparing adjacent numbers. The sorted approach has a time complexity of O(n log n), which scales better with a growing list of numbers than the O(n^2) complexity of the brute-force methods. The only point deduction is due to not explicitly explaining this crucial difference in complexity between the two methods.
Third Parameter: Code Readability
First model’s answer: Rating: 90 Explanation: The first model provides a detailed expla-
nation of the problem, the method used to solve it, and even breaks down the entire code,
making it easy to understand for beginners too. The example provided at the end adds an
extra value to the answer as well. Yet, it still fails to explain or consider any efficiency
measures or alternate ways to solve the problem.
Second model’s answer: Rating: 70 Explanation: The second model also correctly interprets
the problem and comes up with a valid approach. However, it’s less detailed than the first
model, doesn’t provide a clear breakdown of the code, and offers no example of how to use
the function.
Third model’s answer: Rating: 35 Explanation: The third model just provides the code
needed to solve the problem without any explanations or breakdown of the code. Its answer
is correct but not very helpful or educational, especially for someone trying to learn.
Fourth model’s answer: Rating: 95 Explanation: The fourth model does a great job of
explaining the problem and also introduces two different approaches to solve it, one being
more efficient than the other. It not only offers insight into a more effective solution but also
explains why it’s more efficient. But despite these advantages, the model did not break down
the code or provide examples of usage.
Fourth Parameter: Relevance to Problem
Rating for the first model: 90. The answer from the first model is quite thorough and directly addresses the problem. The explained code solves the problem as required and perfectly implements the requirement. However, it does not consider the efficiency of the solution.
Rating for the second model: 85. The second model also provides an adequate answer,
explaining the solution fairly well and implementing the described function correctly. How-
ever, much like the first model, this model does not consider any performance or efficiency
aspects in its solution.
Rating for the third model: 80. This AI model provided a correct and direct solution, but
with less explanation about the approach or notes on performance or usage. Therefore, the
rating is lower compared to the first and second models.
Rating for the fourth model: 100. The fourth model provided two solutions: the first one
similar to the previous models and the second one that considers performance improvements
by sorting the list first. This model shows a greater understanding of real-world requirements
by considering efficiency, thus making it the best answer.
4.3.5 COMPARISON OF GPT4-BASED MULTIPARAMETERS AND HUMANEVAL BENCHMARKS
Problem
Safurai-001 Response
First, we need to understand what the problem is asking us to do. We have a list of numbers and we
need to check if any two of those numbers are closer than a certain threshold.
To solve this problem, we can use a loop to compare each number in the list to every other number
in the list. If the difference between the two numbers is less than or equal to the threshold, we know
that they are closer than the threshold.
Here is the code to solve this problem:
# Define a function that takes in a list of numbers and a threshold
def has_close_elements(numbers: List[float], threshold: float) -> bool:
    # Loop through each number in the list
    for i in range(len(numbers)):
        # Loop through every other number in the list
        for j in range(i + 1, len(numbers)):
            # Check if the difference between the two numbers is less than or equal to the threshold
            if abs(numbers[i] - numbers[j]) <= threshold:
                # If it is, return True
                return True
    # If we have checked every pair of numbers and none of them are closer than the threshold, return False
    return False
In this example, the list of numbers `[1.0, 2.8, 3.0, 4.0, 5.0, 2.0]` has two numbers that are closer than the threshold of `0.3`. The function returns `True` because at least one pair of numbers is closer than the threshold.
ChatGPT Response
The code you provided defines a function called has_close_elements that takes two arguments: numbers, which is a list of float numbers, and threshold, which is a float representing the maximum allowed difference between two numbers. The function checks if there are any two numbers in the numbers list that are closer to each other than the given threshold. It returns True if such a pair exists, and False otherwise. Here's an implementation of the has_close_elements function based on the code you provided:
from typing import List
This implementation uses nested loops to compare each pair of numbers in the numbers list. The
abs() function is used to calculate the absolute difference between the two numbers, and if this dif-
ference is less than the threshold, the function returns True. If no such pair is found after examining
all combinations, the function returns False. You can use this function to check if there are any close
elements in a list of numbers as demonstrated in the examples you provided.
For the HumanEval tests, both the Safurai-001 and ChatGPT answers are correct in the same way.
Code Readability
First model answer: Rating: 95 The answer elaboratively explains the problem and its solution. It
clearly describes how the function works, and the step-by-step breakdown of the code is beneficial in
understanding the algorithm. The example at the end, which demonstrates how to use the function,
is also very helpful.
Second model answer: Rating: 80 This answer also gives a clear explanation of the solution and
function. However, it lacks an in-depth breakdown of the code and doesn’t provide an example of
how to use the function, which would have been useful for demonstrating its functionality.
Question Relevance
First model answer rating: 95/100 This response not only provides the correct implementation of the
function, but also explains the solution in a detailed manner. It thoroughly covers how the function
works, including explaining its various parts like two nested loops, use of abs() function and the if
conditions. The example usage was an added bonus. I’ve docked a few points because the model
does not talk about the time complexity of this solution and also does not provide the full code
snippet which is importing necessary modules from the typing module.
Second model answer rating: 85/100 This response also delivers a correct code implementation
with a succinct explanation. Though it explains the working of the function effectively, it’s not as
detailed as the first model’s answer. This response is also short on providing information on usage
of the function and discussing the time complexity of the solution. Importantly, it doesn’t restate
the question’s example into code for demonstrating purposes, which could be highly informative for
users who are learning or new to Python. Therefore, I’ve rated it slightly lower.
Model       | Code Correctness | Efficiency | Code Readability | Relevance to Problem
Safurai-001 | 95               | 90         | 95               | 95
ChatGPT     | 85               | 90         | 80               | 85
Although the two responses are quite similar, as previously shown, GPT4-based MultiParameters
Evaluation is able to identify small details and differences that HumanEval would not be able to
identify given its quantitative nature.
5 CONCLUSION
This research reveals deep insights into how the utilization of data transformation can lead to the creation of complex, high-quality datasets. Following the stance of the Orca team (Mukherjee et al. [2023]), we stress the profound significance of the datasets chosen for fine-tuning. Through the use of innovative data transformation and prompt engineering techniques, these datasets can critically enhance the efficiency of foundation models. This, in turn, can lead to sizeable gains on classic benchmarks.
Our study presented Safurai-001 as a potent alternative to WizardCoder [Xu et al. (2023)] for multi-turn conversations, adeptly managing user interactions and providing support where needed. We also presented a distinct advance in the area of coding LLM evaluation tools, the GPT4-based MultiParameters Evaluation Benchmark, moving beyond purely quantitative measures to offer more qualitative indicators. This new method provides a richer, more nuanced means of evaluation that can expose insights beyond the reach of traditional standards such as HumanEval20 and MultiPL-E21.
20 https://github.com/openai/human-eval
21 https://huggingface.co/datasets/nuprl/MultiPL-E
Drawing on these findings, we aim to shape future discourse and innovation in this field, accentu-
ating the paramount importance of robust evaluation tools and the fullest utilization of datasets in
optimizing language model performance.
Question
How can I write a Python function that takes a list of integers as an input and returns a tuple
representing the sum and product of all integers in that list?
WizardCoder Response
We can use the built-in ‘sum()‘ function in Python to get a sum of all integers in the list. We
can also use a variable initialized as 1 for product and then multiply it with each integer in
the list. Here’s the solution:
from typing import List, Tuple

def sum_product(numbers):
    total_sum = sum(numbers) or 0  # if numbers is empty, set total_sum to 0
    total_product = 1
    for num in numbers:
        total_product *= num
    return (total_sum, total_product)
Safurai-001 Response
First, we need to understand what a list is. A list is a collection of items, like a grocery list.
In this case, our list is made up of numbers.
Next, we need to understand what a tuple is. A tuple is like a list, but once it’s made, we
can’t change it. It’s like a locked box that we can’t open.
Now, let’s look at the problem. We need to write a function that takes a list of numbers and
returns a tuple with the sum and product of those numbers. Here’s the code:
from typing import List, Tuple
ACKNOWLEDGMENTS
REFERENCES
Loubna Ben Allal, Raymond Li, Denis Kocetkov, Chenghao Mou, Christopher Akiki, Carlos Munoz
Ferrandis, Niklas Muennighoff, Mayank Mishra, Alex Gu, Manan Dey, Logesh Kumar Umapathi,
Carolyn Jane Anderson, Yangtian Zi, Joel Lamy Poirier, Hailey Schoelkopf, Sergey Troshin,
Dmitry Abulkhanov, Manuel Romero, Michael Lappert, Francesco De Toni, Bernardo Garcı́a
del Rı́o, Qian Liu, Shamik Bose, Urvashi Bhattacharyya, Terry Yue Zhuo, Ian Yu, Paulo Villegas,
Marco Zocca, Sourab Mangrulkar, David Lansky, Huu Nguyen, Danish Contractor, Luis Villa, Jia
Li, Dzmitry Bahdanau, Yacine Jernite, Sean Hughes, Daniel Fried, Arjun Guha, Harm de Vries,
and Leandro von Werra. Santacoder: don’t reach for the stars! arXiv:2301.03988, 2023.
Yekun Chai, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, and Hua Wu. Ernie-code: Beyond
english-centric cross-lingual pretraining for programming languages. arXiv:2212.06742, 2022.
Shubham Chandel, Colin B. Clement, Guillermo Serrato, and Neel Sundaresan. Training and eval-
uating a jupyter notebook data science assistant. arXiv:2201.12901, 2022.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh,
Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam
Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James
Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Lev-
skaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin
Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret
Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick,
Andrew M. Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica
Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Bren-
nan Saeta, Mark Diaz, Orhan Firat, Michele Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas
Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling language modeling with pathways.
arXiv:2204.02311, 2022.
Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong,
Wen tau Yih, Luke Zettlemoyer, and Mike Lewis. Incoder: A generative model for code infilling
and synthesis. arXiv:2204.05999, 2022.
Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth
Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital
Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai,
Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need. arXiv:2306.11644, 2023.
Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao
Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii,
Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João
Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Lo-
gesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra
Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey,
Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luc-
cioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor,
Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex
Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva
Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes,
Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. Starcoder: may the source
be with you! arXiv:2305.06161, 2023.
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom
Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien
de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven
Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Push-
meet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. Competition-level code
generation with alphacode. arXiv:2203.07814, 2022.
OpenAI. Gpt-4 technical report. 2023.
Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu,
Jichuan Ji, Jingyang Zhao, Yuenan Guo, and Qianxiang Wang. Pangu-coder2: Boosting large
language models for code with ranking feedback. arXiv:2307.14936, 2023.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Niko-
lay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher,
Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy
Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn,
Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel
Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee,
Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra,
Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi,
Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh
Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen
Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic,
Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models.
arXiv:2307.09288, 2023.
BigScience Workshop. Bloom: A 176b-parameter open-access multilingual language model.
arXiv:2211.05100, 2022.
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and
Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions.
arXiv:2304.12244, 2023.
Daoguang Zan, Bei Chen, Zeqi Lin, Bei Guan, Yongji Wang, and Jian-Guang Lou. When language
model meets private library. arXiv:2210.17236, 2022.
Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Zihan Wang, Lei Shen,
Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. Codegeex: A pre-trained model for
code generation with multilingual evaluations on humaneval-x. arXiv:2303.17568, 2023.