
An Overview and Discussion on Using Large Language Models for Implementation Generation of Solutions to Open-Ended Problems

Hashmath Shaik and Alex Doboli
Department of ECE, Stony Brook University, Stony Brook, NY 11794-2350

arXiv:2501.00562v1, 31 Dec 2024

Abstract—Large Language Models offer new opportunities to devise automated implementation generation methods that can tackle problem solving activities beyond traditional methods, which require algorithmic specifications and can use only static domain knowledge, like performance metrics and libraries of basic building blocks. Large Language Models could support creating new methods to support problem solving activities for open-ended problems, like problem framing, exploring possible solving approaches, feature elaboration and combination, more advanced implementation assessment, and handling unexpected situations. This report summarizes the current work on Large Language Models, including model prompting, Reinforcement Learning, and Retrieval-Augmented Generation. Future research requirements are also discussed.

Index Terms—implementation generation, Large Language Models, open-ended problem solving, prompting, Reinforcement Learning, Retrieval-Augmented Generation

I. INTRODUCTION

Problem solving is the process of creating a solution for a problem description [1]–[4]. The solution can be an explanation for a set of properties exhibited by a static or dynamic situation, e.g., a mathematical proof, or an implementation (realization), which is the construction of a new materialization (e.g., a design) that exhibits the required properties as a result of its operation (functioning, execution). This report focuses on the implementation (realization) side of problem solving.

Creating an implementation can pertain to three general-purpose problem-solving situations: well-defined problems, ill-defined problems, and open-ended problems [5]–[7]:

1) Well-defined problem solving for implementation construction describes situations in which an existing solution can be reused with some incremental changes to solve a new problem. For example, textbook algorithms are utilized to solve a new problem by selecting proper data structures and customizing the algorithm parameters, like the conditions of conditional statements and the iterations of loops. Using parameterized templates for circuit design [8]–[11] belongs to this category too.

2) Ill-defined problem solving for implementation construction represents cases in which the existing implementations cannot solve all requirements, i.e., they satisfy some but not others [12], [13]. Changing the parameters of the implementation does not address the issue. Problem solving includes options like producing a description of the implementation trade-offs by parameter sampling and selecting the best compromise, exploring implementation alternatives for specific fragments of the implementation so that better trade-offs result for the overall solution, and selecting a different approach (principle) for an implementation, including situations when a new implementation must be built, similar to open-ended solving for building a new implementation.

3) Open-ended problem solving for implementation generation requires devising new solutions with a significant departure in characteristics from previous implementations. The understanding of this process is still limited [14], [15]. Also, there are insufficient metrics to describe the degree to which the process is systematically progressing towards success, e.g., building a new implementation. Typical activities include problem framing and problem understanding, identifying and selecting the solving approach, divide and conquer (e.g., problem partitioning into sub-problems), implementation elaboration through trial-and-error, feature combination, adjustment, abstraction and insight gaining, implementation analysis to find pros and cons and the impact of features on the implementation operation, implementation modification, error correction, and handling unexpected situations.
As summarized in the next section, traditional automated implementation generation focuses mainly on elaboration and parameter trade-off exploration, for which the domain knowledge of the implementation is captured by customized metrics [16] or in a library of basic building blocks [16], [17]. The library is static and does not evolve to incorporate new knowledge either from external sources or as a byproduct of implementation generation. Moreover, traditional methods assume the existence of a problem specification expressing at least functional and performance requirements, but more often the algorithm or architecture (structure) of the implementation [17], [18]. Hence, it can be argued that existing methods focus mainly on well-defined and ill-defined problems but less on implementation generation for open-ended problem solving. Existing approaches cannot tackle problem framing and exploring solution approaches, even though trial-and-error and rapid prototyping are essential in understanding new opportunities and limitations. Moreover, there is little automated support for divide and conquer and architecture creation, combination of features from different solutions, and handling unexpected situations. In general, traditional methods struggle with any activity conducted at a level above an algorithmic description of an implementation.

However, recent advances in Large Language Models (LLMs) created opportunities to devise novel automated implementation generation methods that can tackle problems beyond algorithmic specifications and may use domain knowledge that is dynamically learned over time. Arguably, LLMs could contain knowledge that is continuously updated by learning new features either from external documents or based on their own previously generated implementations. Implementation assessment could be improved by comparing an implementation to similar, externally available implementations and by considering collective feedback and preferences expressed for other solutions. The opportunities and limitations of an implementation can be better understood by embedding it into the trend of related designs. Moreover, support can be offered for problem framing and exploring possible solution approaches, activities that are often collective, performed in a team. LLMs can process multi-modal descriptions, including natural language and images, with certain degrees of specification completeness, unknowns, and ambiguity. Hence, understanding the capabilities of LLMs for implementation generation, possibly in conjunction with traditional methods, is required. These capabilities mostly emerge from LLMs being able to learn a broad range of associations in multi-modal data and diverse contexts.

This report studies the degree to which LLMs, possibly using prompting, Reinforcement Learning (RL), and Retrieval-Augmented Generation (RAG), can model the activities of implementation generation for open-ended problem solving. The goal is to identify how LLMs and their extensions can contribute to implementing problem-solving activities that are not addressed in traditional methods. The report offers an extensive presentation of prompting methods, RAG techniques, and RL approaches. Then, the use of LLMs to implement problem-solving activities not available in traditional automated implementation generation is discussed. New research requirements are also offered. The report argues that these requirements refer to topics like constructing the implementation approach, effectively controlling elaboration, robust qualitative and quantitative assessment across abstraction levels, knowledge memorizing during learning, and managing the problem solving process.

The report has the following structure. Section II offers an overview of the work on traditional, automated implementation generation. Section III presents an overview of LLMs. Section IV discusses the similarities of LLMs and traditional automated implementation generation methods and summarizes the related research needs. Conclusions end the report.

II. OVERVIEW OF TRADITIONAL AUTOMATED IMPLEMENTATION GENERATION

Traditional approaches to automatically generate implementations can be grouped into four broad categories: (i) approaches based on high-level specifications, (ii) methods using evolutionary algorithms, (iii) agent-based methods, and (iv) cognitive architectures. The four categories are summarized next.

(i) Approaches based on high-level specifications: These approaches include traditional compiling methods to generate executable code [16], high-level synthesis methods [18]–[21], and template-based synthesis [10], [11] to create electronic circuits and systems. They use high-level specifications described using a programming language. Conceptually, specifications serve as parameterized descriptions of the target implementation architecture. Specifically, internal representations are built using a set of predefined rules (e.g., a language grammar) applied to the specifications and then used to create an optimized hardware design by exploring different optimization possibilities. Prediction models or simulation tools are integrated to evaluate the performance of possible implementation alternatives.

These methods address the problem-solving activities in the following ways. The specification gives an unambiguous, complete description of the parameterized architecture. Thus, there is no problem framing step, and problem understanding is fully addressed during specification creation. Divide and conquer is defined by the structuring of the specification. Also, there is no step of exploring possible implementation alternatives, as the specification explicitly describes the data processing steps, including the connections between the sequences of processing steps, i.e., using the processing outputs as inputs for the next processing steps. Hence, feature combination during elaboration only connects predefined operators, which do not change their function based on the connections. From the point of view of cognitive psychology, these combinations are relation-based combinations but do not reflect feature-based combinations, in which features of a concept are transferred to another concept [22]. Hence, there are no unexpected situations, including emerging features. Implementation analysis uses performance models and simulation, even though the pros and cons of an implementation are rarely causally linked to the implementation fragments responsible for them. Hence, the insight gained is limited. Trial-and-error (possibly guided by priority functions), implementation modification, and adjustment occur only at the level of optimizing the architecture parameters. There is no abstraction or summarization during the process. Error correction requires modifying the specification and then repeating the problem-solving process.
(ii) Methods using evolutionary algorithms: These methods create a dynamic process in which large populations of solutions originate new populations through traditional operators, i.e., selection, crossover, and mutation [23]. Selection means propagating high-fitness individuals from the current to the next population, crossover combines features of a set of solutions to produce new solutions, and mutation randomly changes solution features.

These methods do not include problem framing and understanding. Identifying and selecting the implementation approach has been studied less, even though it is possible to maintain separate sub-populations, each for a different approach, and then give higher priority to the sub-populations that include more high-quality implementations. There is no divide-and-conquer to separate a problem into sub-problems and no explicit error correction. Trial-and-error is mimicked through the mutation operator, even though mutation does not implement a systematic exploration process guided by the learned knowledge. There is no insight gaining during the process, no abstraction or summarization of the learned knowledge, and no explicit identification of unexpected situations. Crossover implements combination, including feature and relation combination. Similar to the previous category, implementation analysis uses performance models and simulation to produce a fitness value that controls the selection of the better implementations. However, there is no explicit identification of the causal features that produce the pros and cons of an implementation, thus there is no implementation adjustment, modification, or correction guided by causal information. There is no explicit memory mechanism, features being implicitly memorized through a population, and there is no possibility to backtrack to previous states to attempt exploring a different path.

(iii) Agent-based methods: These methods utilize multiple interacting agents, each agent having its own memory and running its own decision-making algorithm [24], [25]. Even though traditional agents realize simple decision-making algorithms, e.g., through a set of simple rules in response to specific inputs, it is possible to consider more complex methods, such as each agent running its own synthesis algorithm or population-based evolution. Agents interact with each other by communicating high-quality implementations and features, or implementation steps, which can then be utilized by the other agents, too.

Depending on their decision-making procedure, agent-based methods have characteristics similar to the methods of the previous two categories. Their main advantage is their capacity to simultaneously maintain multiple perspectives about the implementation creation process, e.g., through their local memory, preferences, priorities, etc., and then aggregate these perspectives to improve problem solving. It can be argued that they mimic the implementation creation process by a team (team problem solving) [14], [26].

(iv) Cognitive architectures: Cognitive architectures (CAs) mimic brain activities during problem solving [27]–[31]. Architectures include modules for knowledge representation, knowledge memory, knowledge classification, summarization, comparison, decision-making, prediction, learning, and goal setting. For example, the SOAR CA models cognition-based problem solving [28], using operation selection and application (e.g., state elaboration, operator proposal and evaluation, and decision). Knowledge is represented as procedural if-then rules selected through matching. Learning stores short-cuts to solutions, conditions for applying the rules, and utility updates. The ACT-R CA uses multiple symbolic knowledge representations, declarative and procedural information learning, and utility-based decision making [27]. The EPIC CA matches production rules in parallel against the working memory, followed by the selection of firing rules for multiple goals [30]. The Sigma CA includes mixed symbolic-probabilistic, discrete-continuous representations, knowledge summarization and integration, and inference-based reasoning [29]. The Clarion CA maintains explicit and implicit cognition, each having different representations and processing methods, e.g., rule extraction, generalization, specialization, backpropagation, and reinforcement learning [31]. InnovA is a CA for the automated design of electronic circuits [32].

III. OVERVIEW OF LARGE LANGUAGE MODELS AND DIFFUSION MODELS

A. Large Language Models

Large Language Models (LLMs), primarily those built on transformer architectures, have made significant strides in producing coherent, contextually relevant text [33]. They excel at pattern recognition and can generate fluent natural language by leveraging billions of parameters trained on massive corpora [34]. However, their computational principle—self-attention over sequential data—imposes fundamental limitations that hinder their ability to perform the rich, open-ended problem-solving tasks described in the previous sections.

At the core of these limitations is the reliance on statistical correlations rather than genuine logical or conceptual understanding. While self-attention excels at identifying relevant tokens in a sequence, it does not inherently encode hierarchical structures, domain-specific causal rules, or strict logical constraints. This stands in contrast to open-ended problem solving, where the concept space can be segmented into three main categories—hierarchical concepts, alternative concepts, and fundamental concepts—and the action space encompasses complex operations, such as feature combination, dynamic adjustment, abstraction, insight generation, and summarization [35]. LLMs struggle to engage these conceptual spaces in a principled way because they are not grounded in mechanisms that ensure hierarchical reasoning, strategic problem decomposition, or the flexible reuse of insights and intermediate representations [36].
Another critical shortcoming is that LLMs tend to produce generalized answers aligned with the statistical patterns seen in their training data [37]. They are not inherently equipped to execute a true divide-and-conquer approach to complex tasks, nor can they systematically apply trial-and-error strategies. For example, while open-ended problem solving may demand iterative refinement—where a solver explores a space of possible solutions, backtracks as necessary, and learns from failed attempts—an LLM's output is typically a single forward pass [38]. Without an internal model of logical inference, memory structures that accumulate knowledge over multiple steps, or explicit strategy formulations, LLMs cannot easily correct their reasoning or adapt their approach based on previous mistakes [39]. This leads to issues such as hallucinations, where models confidently assert falsehoods; distractions, where irrelevant details are emphasized; and a general inability to build complex, causally grounded explanations.

Some researchers have explored techniques like constraint-based decoding to enforce logical or linguistic rules at inference time [40]. This can improve consistency and coherence to some extent, but it remains an add-on rather than a fundamental solution. Constraint-based methods do not grant the model a deeper conceptual understanding; they merely prune outputs that violate predetermined constraints. Similarly, improvements like sparse attention mechanisms reduce computational complexity, adapter layers can inject domain-specific knowledge [41], and memory-augmented transformers attempt to store and reuse intermediate reasoning steps. While these approaches enhance performance on certain tasks, they do not fully overcome the inherent limitations of attention-based architectures or enable robust open-ended problem solving. The models are still limited by their training data, biased toward patterns present therein, and lack the ability to intentionally search concept space, systematically test hypotheses, or derive new conceptual abstractions beyond what is statistically suggested [42].

In response to these challenges, a body of methods has emerged to push LLMs closer toward more sophisticated reasoning and problem-solving behaviors. This work can be broadly divided into three interrelated categories: prompting engineering, knowledge retrieval strategies, and model refinement techniques (RL).

B. Prompting Engineering

Prompting techniques utilize carefully constructed input prompts to guide the model's response generation process. Techniques can be grouped into five categories discussed next.

a) Single-stage prompting (SSP): SSP methods directly instruct the model without iterative refinement. For example, Basic + Annotation Guideline-Based Prompting + Error Analysis-Based Prompting [43] uses formally defined entity annotation guidelines to specify how clinical terms should be identified and categorized, ensuring clarity in entity recognition. In addition, it incorporates instructions derived from analyzing common model errors, such as addressing ambiguous entity boundaries or redefining prompts for overlapping terms. This strategy significantly improves clinical Named Entity Recognition, with relaxed F1 scores reported as 0.794 for GPT-3.5 and 0.861 for GPT-4 on the MTSamples dataset [44], and 0.676 for GPT-3.5 and 0.736 for GPT-4 on the VAERS dataset [45], demonstrating its effectiveness.

b) Reasoning strategies: These methods are of three types: linear, branching, and iterative reasoning.

Linear reasoning methods such as Chain-of-Thought (CoT) [46], Complex CoT [47], Thread-of-Thought (ThoT) [48], Chain-of-Knowledge (CoK) [49], Chain-of-Code (CoC) [50], Logical Thoughts (LoT) [51], Chain-of-Event (CoE) [52], and Chain-of-Table [53] generate a single, step-by-step sequence (chain) of responses toward the final answer. Methods differ in the type of task they target, i.e., code generation, summarization, and logical inference, and in how they refine or represent intermediate steps. CoT shows that using intermediate prompting steps can enhance accuracy, e.g., up to 39% gains in mathematical problem solving. An example in-context prompt for CoT might be: "If the problem is 'Calculate 123 × 456,' break it down as (100 + 20 + 3) × 456 and compute step-by-step." Complex CoT uses more involved in-context examples, improving performance by as much as 18% on harder tasks. ThoT tackles long or chaotic contexts by breaking them into manageable parts (e.g., dividing long passages into sections for sequential summarization), while CoK strategically adapts and consolidates knowledge from multiple sources to ensure coherence and reduce hallucination. CoC specializes in code-oriented reasoning by simulating key code outputs (e.g., predicting intermediate variable states for debugging), whereas LoT integrates logical equivalences and reductio ad absurdum checks to refine reasoning chains (e.g., validating statements by identifying contradictions in their negations). CoE handles summarization by extracting, generalizing, filtering, and integrating key events (e.g., pinpointing main events from news articles), and Chain-of-Table extends CoT principles to tabular data by dynamically planning and transforming tables—such as filtering or aggregation—before generating the final answer.
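To make the linear-reasoning pattern concrete, the sketch below assembles a CoT-style few-shot prompt in Python around the 123 × 456 exemplar quoted above. The llm() function is a hypothetical placeholder for whatever completion interface is available, and the exemplar wording is illustrative rather than taken from the cited papers.

    # Chain-of-Thought prompting sketch; llm() is a placeholder, not a real API.
    def llm(prompt: str) -> str:
        """Placeholder for an LLM text-completion call."""
        raise NotImplementedError

    COT_EXEMPLAR = (
        "Q: Calculate 123 x 456.\n"
        "A: Break 123 into 100 + 20 + 3.\n"
        "   100 x 456 = 45600; 20 x 456 = 9120; 3 x 456 = 1368.\n"
        "   45600 + 9120 + 1368 = 56088. The answer is 56088.\n"
    )

    def cot_answer(question: str) -> str:
        # The worked exemplar demonstrates step-by-step reasoning; the trailing
        # "A:" cues the model to emit its own chain of steps before the answer.
        prompt = COT_EXEMPLAR + f"\nQ: {question}\nA:"
        return llm(prompt)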
Branching reasoning methods, like Self-Consistency [54], Contrastive CoT (or Contrastive Self-Consistency) [55], Federated Same/Different Parameter Self-Consistency/CoT (Fed-SP/DP-SC/COT) [56], Tree-of-Thoughts (ToT) [57], and Maieutic Prompting [58], explore multiple possible reasoning paths in parallel. Branching techniques vary in how they sample or fuse paths, some relying on consensus votes and others on dynamic adaptation or tree-based elimination. Self-Consistency, for instance, samples diverse solution paths and selects the most consistent final answer, achieving gains of over 11% on math tasks. Contrastive CoT incorporates both correct and incorrect in-context examples to broaden the model's understanding, improving performance by over 10% compared to standard CoT. Fed-SP-SC leverages paraphrased queries to crowdsource additional hints, while ToT maintains a tree of partial solutions and systematically explores them with breadth-first or depth-first strategies, offering up to 65% higher success rates than CoT on challenging math tasks. Maieutic Prompting likewise generates a tree of propositions to reconcile contradictory statements, surpassing linear methods by 20% on common-sense benchmarks.
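A minimal sketch of the Self-Consistency idea follows, assuming a hypothetical sample_llm() call that returns one sampled reasoning chain per invocation and a parse_final_answer() helper for extracting the final answer; the most frequent answer across samples is returned.

    from collections import Counter

    def sample_llm(prompt: str, temperature: float = 0.8) -> str:
        """Placeholder: returns one sampled completion (reasoning chain plus final answer)."""
        raise NotImplementedError

    def parse_final_answer(completion: str) -> str:
        # Simplistic extraction: assume the final answer sits on the last line.
        return completion.strip().splitlines()[-1]

    def self_consistent_answer(prompt: str, n_samples: int = 10) -> str:
        # Sample several diverse reasoning paths and keep only their final answers.
        answers = [parse_final_answer(sample_llm(prompt)) for _ in range(n_samples)]
        # Majority vote: the answer reached by the most independent chains wins.
        return Counter(answers).most_common(1)[0][0]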
Iterative reasoning approaches, such as Plan-and-Solve (PS), Program-of-Thoughts (PoT), Chain-of-Symbol (CoS), Structured Chain-of-Thought (SCoT), and Three-Hop Reasoning (THOR), refine solutions step by step, often by passing intermediate outputs back into the model to enhance accuracy. PS explicitly decomposes tasks into planning and execution phases, where the planning phase structures the problem into smaller sub-tasks, and the execution phase solves them sequentially. This reduces semantic and calculation errors, outperforming Chain-of-Thought (CoT) prompting by up to 5% [59]. PoT enhances performance by separating reasoning from computation: the model generates programmatic solutions executed by a Python interpreter, achieving up to 12% accuracy gains in numerical and QA tasks [60]. CoS encodes spatial and symbolic relationships using concise symbolic representations, which improves reasoning in spatial tasks by up to 60.8% [61]. SCoT introduces structured reasoning through program-like branching and looping, significantly improving code generation accuracy by up to 13.79% [62]. Finally, THOR addresses emotion and sentiment analysis by splitting queries into three stages—aspect identification, opinion analysis, and polarity inference—resulting in superior performance over prior supervised and zero-shot models [63]. These approaches exemplify the power of iterative methods in breaking complex problems into manageable components, thereby reducing errors and improving overall performance.
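The separation of reasoning from computation in PoT can be sketched as follows: the model is asked to emit Python code that stores its result in a variable, and the host interpreter executes it. The llm() call and the convention that the generated program assigns its result to a variable named answer are assumptions of this sketch, not part of the published interface.

    def llm(prompt: str) -> str:
        """Placeholder for an LLM call that returns Python source code."""
        raise NotImplementedError

    def program_of_thoughts(question: str):
        # Ask the model for executable code instead of a natural-language derivation.
        code = llm(
            "Write Python code that computes the answer to the question below "
            "and stores it in a variable named `answer`.\n"
            f"Question: {question}\n# Python code:"
        )
        namespace = {}
        exec(code, namespace)            # calculation is delegated to the interpreter
        return namespace.get("answer")   # reasoning (code) and arithmetic are decoupled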
c) Multi-Stage Prompting (MSP): MSP techniques rely on iterative feedback loops or ensemble strategies. MSP methods systematically refine outputs and incorporate multiple response paths, e.g., through voting or iterative analysis, to yield more robust and accurate solutions, particularly in domains requiring deeper reasoning or tailored task adaptation. Ensemble Refinement (ER) [64] builds on Chain-of-Thought (CoT) and Self-Consistency by generating multiple CoT-based responses at high temperature (introducing diversity) and then iteratively conditioning on generated responses to produce a more coherent and accurate output, leveraging insights from the strengths and weaknesses of initial explanations and majority voting. Auto-CoT [65] constructs demonstrations automatically by clustering queries from a dataset and generating reasoning chains for representative queries using Zero-Shot-CoT. Clustering is achieved by partitioning questions into groups based on semantic similarity, ensuring that representative queries capture the diversity of the dataset. ReAct [66] interleaves reasoning traces—thought processes that explain intermediate steps—with action steps that execute operations, enabling superior performance in complex tasks by seamlessly combining reasoning and action. Moreover, Active-Prompt [67] adaptively selects the most uncertain training queries, identified via confidence metrics like entropy or variance, for human annotation, boosting few-shot learning performance by focusing on areas with the highest uncertainty.
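Among these, ReAct's interleaving of reasoning and acting is straightforward to sketch as a loop in which the model alternates thought and action lines, actions are dispatched to external tools, and observations are appended to the context. The llm() call, the tools dictionary, and the exact line format used here are illustrative assumptions.

    def llm(prompt: str) -> str:
        """Placeholder: returns the next 'Thought:'/'Action: tool[input]'/'Finish[answer]' lines."""
        raise NotImplementedError

    def react(question: str, tools: dict, max_steps: int = 8) -> str:
        # `tools` maps a tool name (e.g., "search") to a callable taking one string argument.
        context = f"Question: {question}\n"
        for _ in range(max_steps):
            step = llm(context)                   # model emits a thought and an action
            context += step + "\n"
            if "Finish[" in step:                 # the model decided it has the final answer
                return step.split("Finish[", 1)[1].rstrip("]")
            if "Action:" in step and "[" in step:
                action = step.split("Action:", 1)[1].strip()   # e.g., search[Stony Brook]
                name, arg = action.split("[", 1)
                observation = tools[name.strip()](arg.rstrip("]"))
                context += f"Observation: {observation}\n"     # feed the tool result back
        return context                            # fallback: return the full trace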
d) Knowledge Enhancement: These approaches use high-quality examples and strategic self-monitoring to improve LLM performance. They pertain to two types, example-based and meta-level guidance methods.

Example-based methods leverage auxiliary examples or synthesized instances to guide the response creation process of LLMs. MathPrompter [68] focuses on creating a symbolic template of the given mathematical query, solving it analytically or via Python, and then validating the derived solution with random variable substitutions before finalizing the answer. The approach boosts accuracy from 78.7% to 92.5%. Analogical Reasoning [69] prompts LLMs to generate and solve similar examples before addressing the main problem, resulting in a 4% average accuracy gain across various tasks. Synthetic Prompting [70] involves a backward step, where a new query is generated from a self-constructed reasoning chain, and a forward step, where this query is re-solved; this strategy selects the most complex examples for few-shot prompts, leading to up to 15.6% absolute improvements in mathematical problem solving, common-sense reasoning, and logical reasoning.

Meta-Level Guidance (MLG) methods enhance Large Language Models (LLMs) by promoting self-reflection and focusing on pertinent information, thereby reducing errors. Self-Reflection involves the model evaluating its own outputs to identify and correct mistakes, leading to improved performance. For example, in translation tasks, self-reflection enables LLMs to retrieve bilingual knowledge, facilitating the generation of higher-quality translations. Focusing is achieved through techniques like System 2 Attention (S2A) [71], which filters out irrelevant content by prompting the model to regenerate the context to include only essential information before producing a final response. This two-step approach enhances reasoning by concentrating on relevant details, thereby improving accuracy. S2A has been shown to outperform basic prompting methods, including Chain-of-Thought (CoT) and instructed prompting, particularly on truthfulness-oriented datasets. Metacognitive Prompting (MP) [72] introduces a five-stage process to further enhance LLM performance: (1) Comprehension: the model attempts to understand the input, ensuring clarity before proceeding; (2) Preliminary Judgment: an initial assessment is made based on the understood information; (3) Critical Evaluation: the initial judgment is scrutinized, considering alternative perspectives and potential errors; (4) Final Decision with Explanation: a conclusive decision is reached, accompanied by a rationale to support it; and (5) Self-Assessment of Confidence: the model evaluates its confidence in the final decision, reflecting on the reasoning process. This structured approach enables LLMs to perform consistently better than methods like CoT and Program Synthesis (PS) across various natural language processing tasks, including paraphrasing, natural language inference, and named entity recognition.

e) Task Decomposition: These approaches break down complex tasks into smaller steps but vary in how they orchestrate and execute the sub-problems. They include problem breakdown and sequential solving methods.

Problem Breakdown approaches include the Least-to-Most method [73], which addresses the challenge of Chain-of-Thought (CoT) failing on problems more difficult than its exemplars by first prompting the LLM to decompose a query into sub-problems and then solving them sequentially, demonstrating notable improvements over CoT and basic prompting on tasks like commonsense reasoning and mathematical problem solving. The decompositions are characterized by their hierarchical structure, breaking down complex problems into simpler, manageable sub-tasks that build upon each other to facilitate step-by-step reasoning. Decomposed Prompting (DecomP) [74] advances this idea by delegating sub-problems to different LLMs with specialized prompts and decomposers, potentially incorporating hierarchical or recursive decomposition, or external API calls. Specialized prompts are finely tuned instructions crafted to guide each LLM toward solving specific sub-problems efficiently, focusing on unique task requirements or data contexts. Decomposers act as modular intermediaries that analyze the overall problem, partition it into distinct, logically dependent sub-tasks, and distribute them to LLMs or external tools for execution. DecomP achieves an average 25% gain over CoT and Least-to-Most in commonsense reasoning. Program-Aided Language Models (PAL) [75] further leverage interleaved natural language and programmatic steps to enable Python-based execution of the reasoning process, surpassing CoT and basic methods for mathematical and commonsense tasks.
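The decompose-then-solve pattern shared by Least-to-Most and DecomP can be sketched as two prompting stages: one call asks for a list of sub-problems, and later calls solve them in order while feeding earlier answers forward. The llm() function and the one-sub-problem-per-line output format are assumptions of the sketch.

    def llm(prompt: str) -> str:
        """Placeholder for an LLM call."""
        raise NotImplementedError

    def least_to_most(question: str) -> str:
        # Stage 1: ask for a decomposition, one sub-problem per line, easiest first.
        plan = llm(
            "List, one per line and from easiest to hardest, the sub-problems "
            f"needed to answer: {question}"
        )
        subproblems = [line.strip() for line in plan.splitlines() if line.strip()]

        # Stage 2: solve the sub-problems sequentially, feeding earlier answers forward.
        solved, answer = "", ""
        for sub in subproblems:
            answer = llm(f"{solved}\nNow solve: {sub}\nAnswer:")
            solved += f"\nQ: {sub}\nA: {answer}"
        return answer   # the last sub-problem's answer addresses the original question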
Sequential Solving includes methods like the Binder and Dater algorithms. Binder [76] integrates neural and symbolic parts by using an LLM both as a parser and executor for natural language queries, leveraging programming languages like Python or SQL for structured execution. Binding is achieved through a unified API that enables the LLM to generate, interpret, and execute code using a few in-context examples, leading to higher accuracy on table-based tasks compared to fine-tuned approaches. Dater [77] focuses on few-shot table reasoning by splitting a large table into relevant sub-tables, translating complex queries into SQL sub-queries, and combining partial outcomes into a final solution. These three steps aim to systematically extract meaningful data, execute precise operations, and integrate results to address complex queries, outperforming fine-tuned methods by at least 2% on Table-Based Truthfulness and 1% on Table-Based QA, and surpassing Binder on these tasks.

C. Knowledge Retrieval

Retrieval-Augmented Generation (RAG) addresses one of the major issues of LLMs, which is their lack of a persistent, reliable memory and factual grounding [78]. RAG methods integrate external knowledge sources into the generation process. Instead of relying solely on learned representations within the model's parameters, the system retrieves relevant documents, facts, or structured data at inference time and incorporates this information into its output. This grounding reduces hallucinations, ensures that the model's reasoning steps reference accurate and up-to-date information, and can improve the alignment of the solution with real-world constraints [79].
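The retrieve-then-generate loop behind RAG reduces to a few lines. The sketch below substitutes a naive token-overlap score for a learned retriever and uses a hypothetical llm() call for generation; a real system would plug in dense embeddings and a vector index instead.

    def llm(prompt: str) -> str:
        """Placeholder for an LLM call."""
        raise NotImplementedError

    def retrieve(query: str, documents: list, k: int = 3) -> list:
        # Naive lexical retriever: rank documents by shared-token count with the query.
        q_tokens = set(query.lower().split())
        ranked = sorted(documents,
                        key=lambda d: len(q_tokens & set(d.lower().split())),
                        reverse=True)
        return ranked[:k]

    def rag_answer(query: str, documents: list) -> str:
        # Ground the generation step in retrieved snippets instead of parametric memory alone.
        context = "\n\n".join(retrieve(query, documents))
        prompt = ("Answer the question using only the context below.\n\n"
                  f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
        return llm(prompt)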
The versatility of RAG has led to significant advancements in various domains, such as healthcare, finance, education, and scientific research, facilitated by novel frameworks tailored to address challenges in reasoning, problem-solving, and knowledge integration. This review categorizes these advancements into four areas: task-specific and schema-based techniques, self-aware and adaptive mechanisms, long-term memory integration, and multi-hop and multi-modal reasoning. The four areas are discussed next.

a) Task-Specific and Schema-Based Retrieval (TSR): TSR approaches leverage structured methods to solve problems in domains such as mathematics and knowledge-intensive tasks. For instance, Schema-Based Instruction Retrieval-Augmented Generation (SBI-RAG) [80] employs schema-based instruction to solve math word problems by predicting relevant schemas, offering a structured problem-solving paradigm. Schemas, which act as templates for organizing and applying domain-specific knowledge, are inherently tied to knowledge graphs that map relationships between concepts, enhancing reasoning capabilities. The model selects the most suitable schema by aligning the problem context with predefined patterns and uses it to guide the solution process in a systematic manner. Similarly, the Knowledge Graph-Enhanced RAG Framework (KRAGEN) [81] employs advanced prompting techniques, notably the graph-of-thoughts (GoT) method, to dynamically decompose complex problems into smaller subproblems. Each subproblem is addressed using relevant knowledge retrieved through the RAG framework, minimizing hallucinations and enhancing solution accuracy. The individual solutions are then consolidated to form a comprehensive answer, with KRAGEN's graph visualization enabling users to interact with and assess the quality of the solution's GoT structure and logic [81]. These techniques stand out for their ability to address domain-specific challenges while ensuring adaptability through schema-guided reasoning. The use of schemas not only structures the solution process but also facilitates explainability.

In data-driven tasks, Generative Retrieval-Augmented Matching (GRAM) [82] addresses schema matching by employing a hierarchical classification model that dynamically generates prompts for matching attributes across schemas. Specifically, GRAM utilizes a two-step process: first, it performs a coarse-grained classification to identify potential attribute matches, and then it refines these matches through fine-grained classification, enhancing the precision of schema alignment. The prompt generation is guided by large language models (LLMs), which facilitate zero-shot and few-shot learning, thereby improving efficiency and accuracy in database integration [82]. Similarly, TableRAG [83] focuses on reasoning over tabular data by retrieving and processing row-column relationships to interpret structured datasets accurately. It conducts reasoning by leveraging query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the language models, enabling efficient data encoding and precise retrieval. This approach allows TableRAG to handle large-scale tables effectively, reducing prompt lengths and mitigating information loss during the reasoning process [83].

b) Self-Aware and Adaptive Retrieval: Recent RAG frameworks emphasize self-awareness and adaptive mechanisms to address uncertainties in LLMs. Self-aware Knowledge Retrieval (SeaKR) [84] activates retrieval during high uncertainty and re-ranks snippets to ensure reliability. Specifically, SeaKR addresses uncertainties arising from the LLM's internal state inconsistencies, triggering retrieval when the model's self-assessed confidence is low. The re-ranking process involves selecting knowledge snippets that most effectively reduce the model's uncertainty, thereby enhancing response accuracy [84]. Self-RAG [85] introduces iterative refinement, where retrieval queries generated during the response process enable reassessment and improvement of outputs. This reassessment involves evaluating the relevance of retrieved information during generation, allowing the model to iteratively refine its responses for enhanced accuracy. Critic-Guided Planning (CR-Planner) [86] leverages critic models to iteratively guide retrieval and reasoning toward task-specific goals. The critic model operates by evaluating potential sub-goals and their executions, assigning rewards to guide the selection of the most promising reasoning paths. This guidance ensures that the reasoning process aligns with task objectives, effectively navigating complex problem spaces [86]. For domain-specific adaptation, SimRAG [87] employs self-training, generating and filtering synthetic data to fine-tune models for specialized fields. In biomedical applications, Self-Rewarding Tree Search (SeRTS) [88] combines Monte Carlo Tree Search and Reinforcement Learning to optimize retrieval. Speculative RAG [89] improves efficiency with a two-stage process: a smaller model drafts responses, while a larger model evaluates and finalizes them. This two-step process allows the system to balance efficiency and accuracy by leveraging the strengths of both models.

These approaches offer distinct benefits and limitations. SeaKR and Self-RAG provide dynamic adaptability and accuracy but demand significant computational resources. CR-Planner and SeRTS enhance task-specific precision but increase complexity. SimRAG excels in domain-specific tuning; however, it is constrained by the need for high-quality synthetic data. Speculative RAG effectively reduces latency through parallel drafting and verification, but requires accurate evaluation by generalist models.
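A common thread in these adaptive frameworks is to retrieve only when the model's own confidence is low. The deliberately generic sketch below shows that control flow without claiming to reproduce any of the cited systems; the confidence() scorer, the retrieve() stub, and the 0.5 threshold are all assumptions.

    def llm(prompt: str) -> str:
        """Placeholder for an LLM call."""
        raise NotImplementedError

    def confidence(answer: str) -> float:
        """Placeholder: self-assessed confidence in [0, 1], e.g., from token log-probabilities."""
        raise NotImplementedError

    def retrieve(query: str, k: int = 3) -> list:
        """Placeholder for a retriever over an external corpus."""
        raise NotImplementedError

    def adaptive_answer(query: str, threshold: float = 0.5) -> str:
        draft = llm(f"Question: {query}\nAnswer:")
        if confidence(draft) >= threshold:
            return draft                      # confident enough: skip retrieval entirely
        # Low confidence: fetch supporting snippets and answer again with them in context.
        context = "\n".join(retrieve(query))
        return llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")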
c) Long-Term Memory for Knowledge Retrieval: Long-term memory integration in RAG frameworks addresses the limitations of purely query-specific retrieval by enabling the retention and reuse of knowledge across tasks. HippoRAG [90], inspired by neurobiological memory structures, incorporates long-term memory directly into the retrieval process. This enables the system to consolidate and utilize past context effectively, enhancing performance in repetitive or longitudinal tasks. It transitions RAG systems from single-use retrieval mechanisms to dynamic knowledge retainers.

Various architectures embed long-term memory into RAG. MemLong [91] employs a dual-network design where a frozen LLM backbone serves as a memory encoder, while a residual side-network manages retrieval, enabling efficient caching and updating of extensive contexts (up to 65k tokens). Its key advantage is scalability without data staleness, though managing large contexts may introduce overhead. HAT [92] uses a hierarchical tree-based memory structure for recursive aggregation, enhancing coherence and summary quality through conditional traversals, but may face challenges in balancing tree depth with performance. MemoRAG [93] pairs a lightweight global memory model for broad context with a powerful retrieval-generation model for focused answers, managing up to one million tokens efficiently but requiring careful system tuning. Pistis-RAG [94] emphasizes adaptability, leveraging online learning and user feedback to align responses with user preferences, though its reliance on continuous feedback may introduce variability.

d) Multi-Hop and Multi-Modal Reasoning Retrieval: Multi-hop and multi-modal reasoning approaches broaden Retrieval-Augmented Generation (RAG)'s capacity to handle tasks that require complex, step-by-step deliberation and data from diverse sources. This involves performing reasoning that integrates information across multiple steps or modalities to derive comprehensive answers. Deliberation refers to the systematic process of considering various pieces of information and reasoning paths to arrive at a well-founded conclusion.

Multi-layered Thoughts Enhanced RAG (METRAG) [95] integrates similarity- and utility-based reasoning for deeper contextual understanding. It does so by combining similarity-oriented retrieval with utility-oriented assessments, where a utility model, supervised by an LLM, evaluates the usefulness of retrieved documents beyond mere similarity, enhancing the relevance and quality of the information utilized in generation. RAG-Star [96] combines retrieval augmentation with Monte Carlo Tree Search (MCTS) to plan intermediate sub-queries that iteratively improve problem-solving accuracy. Accuracy, in this context, refers to the model's ability to generate correct and relevant responses to complex queries. Retrieval augmentation involves incorporating external information into the model's reasoning process. RAG-Star uses MCTS to explore various reasoning paths by generating and evaluating intermediate sub-queries and their potential answers, effectively guiding the model toward more accurate solutions.

The Knowledge Graph-Enhanced RAG Framework (KRAGEN) [81] uses Graph-of-Thoughts (GoT) methods to decompose multi-hop problems into explainable components. A GoT is a structured representation that maps out the reasoning process, storing knowledge in the form of interconnected concepts and their relationships, often derived from knowledge graphs. The GoT is constructed by the model during the reasoning process, enabling it to break down complex queries into smaller, manageable parts. This decomposition allows the model to tackle each component systematically, enhancing interpretability and the overall reasoning process.

Recent research has introduced specialized frameworks that tackle the sequential nature of multi-hop queries and the integration of text and vision data. These frameworks aim to address limitations in handling complex reasoning tasks that require multiple inferential steps and the seamless combination of information from different modalities. MultiHop-RAG provides a dedicated dataset and benchmarks to rigorously assess RAG systems on multi-step queries [97], facilitating the evaluation of retrieval-augmented generation models in scenarios that necessitate reasoning across multiple documents. Retrieval-Augmented Multi-modal Chain-of-Thoughts Reasoning [98] extends Chain-of-Thought (CoT) approaches to handle images and text in tandem, enabling models to process and reason over visual and textual data simultaneously. For purely textual multi-hop question answering, HOP, UNION, GENERATE (HUG) [99] offers a three-step method that models rationales as sets of sentences, enhancing explainability without requiring explicit rationale supervision. In this framework, "Hop" involves selecting relevant sentences, "Union" aggregates these sentences into a coherent rationale set, and "Generate" produces the final answer based on the aggregated rationale. The rationales modeled are the sets of sentences that collectively support the answer, providing transparency in the reasoning process by explicitly outlining the evidence considered. Multimodal-CoT and Multi-Chain Reasoning (MCR) [100] further advance reasoning by respectively separating rationale generation from answer inference for science question answering, and by prompting large language models to examine multiple parallel chains of thought before synthesizing final solutions. These approaches address complex reasoning types that require integrating diverse information sources and evaluating multiple reasoning pathways. The rationale generated includes intermediate reasoning steps that elucidate the thought process leading to the answer. Prompting is generated by designing specific instructions that guide the model to consider various perspectives and reasoning chains, thereby enhancing the robustness and accuracy of the final output.

Although RAG improves factual correctness and can help the model explore a broader concept space by tapping into external repositories, it still does not imbue the model with a genuine, internal problem-solving strategy.

e) Self-Reflection Methods: Recent advancements underscore the value of Large Language Models (LLMs) engaging in reflective reasoning before generating a final answer. Reflective reasoning involves the model's introspection and evaluation of its own thought processes to enhance decision-making and output quality.

Implicit Retrieval-Augmented Generation (RAG) [78], [101], [102] instructs LLMs to first retrieve key chunks of context, specifying the number of sections and words in each section, then use these snippets to answer queries. The selection of the number of snippets and their lengths is typically determined through empirical tuning, balancing the need for comprehensive context with the constraints of the model's input capacity. This method has achieved near state-of-the-art results in both general and biomedical contextual question-answering tasks.

Metacognitive Prompting (MP) [72] draws on the concept of metacognition, comprising five phases:

1. Interpreting the Input: The model analyzes the input text to grasp its context and meaning, ensuring a clear understanding of the task at hand. This is implemented by prompting the model to restate or summarize the input, confirming comprehension.

2. Forming an Initial Judgment: Based on the interpreted input, the model generates a preliminary response or hypothesis, reflecting its immediate understanding. This involves producing an initial answer without external validation.

3. Critically Assessing that Judgment: The model evaluates its preliminary response, identifying potential errors or uncertainties. This is achieved by prompting the model to question its initial answer, consider alternative interpretations, and assess the confidence level of its response.

4. Presenting a Final Decision with Reasoning: After critical assessment, the model formulates a refined answer, providing a rationale that outlines the reasoning process. This step ensures transparency and allows users to understand the basis of the model's conclusion.

5. Gauging Confidence in the Entire Process: The model reflects on the overall process, assigning a confidence score to its final answer, indicating the reliability of the response. This is implemented by having the model express its certainty level, guiding users in decision-making.

MP consistently outperforms Chain-of-Thought (CoT) and Plan-and-Solve methods across paraphrasing, natural language inference, and relation extraction tasks.

f) Self-Critique Methods/Evaluation- and Verification-Focused Methods: To improve reliability and reduce factual inaccuracies, Chain-of-Verification (CoVe) [103] uses a four-step process: (1) generating an initial response, (2) formulating verification questions to identify potential errors or inconsistencies, (3) answering these questions to produce supporting evidence or rationale, and (4) revising the original response based on validated findings. CoVe has demonstrated over 10% performance improvements compared to basic prompting and Chain-of-Thought (CoT) methods in both context-free and contextual question-answering tasks.
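The four CoVe steps map directly onto four model calls, sketched below with a hypothetical llm() function; the prompt wording and the one-question-per-line convention are illustrative choices rather than the published prompts.

    def llm(prompt: str) -> str:
        """Placeholder for an LLM call."""
        raise NotImplementedError

    def chain_of_verification(question: str) -> str:
        # (1) Draft an initial response.
        draft = llm(f"Question: {question}\nAnswer:")
        # (2) Plan verification questions targeting possible errors in the draft.
        checks = llm(f"Draft answer:\n{draft}\nList verification questions, one per line:")
        # (3) Answer each verification question independently of the draft.
        evidence = "\n".join(f"{q} -> {llm(q)}"
                             for q in checks.splitlines() if q.strip())
        # (4) Revise the original response based on the validated findings.
        return llm(f"Question: {question}\nDraft: {draft}\n"
                   f"Verification results:\n{evidence}\nWrite a corrected final answer:")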
Verify-and-Edit (VE) [104] enhances uncertain CoT outputs by integrating external knowledge from reliable sources such as encyclopedias, knowledge graphs, or domain-specific repositories. Self-consistency identifies weak points in reasoning by generating multiple reasoning paths for the same problem and comparing their outputs for discrepancies or logical contradictions, revealing areas of low confidence or errors. The response is then revised by incorporating validated evidence, ensuring factual accuracy and logical coherence. Cross-referencing further verifies the revised response by re-checking it against retrieved knowledge to confirm it resolves inconsistencies while maintaining alignment across all reasoning steps, avoiding the introduction of new errors or contradictions. VE evaluates the reliability of the final output by analyzing agreement across revised reasoning paths and ensuring alignment with external knowledge. This approach has achieved up to 10% gains in multi-hop reasoning tasks and 2% improvements in truthfulness evaluations over CoT and self-consistency techniques.

In summary, self-reflection techniques, e.g., Implicit RAG and MP, emphasize reflective reasoning for deepening understanding and clarity before producing an answer, while self-critique methods, i.e., CoVe and VE, concentrate on verifying and refining initial outputs to reduce inaccuracies. Implicit RAG and MP differ in execution: Implicit RAG systematically retrieves the most relevant textual evidence for enhanced context, whereas MP focuses on iterative introspection and confidence evaluation. CoVe and VE diverge in methodology: CoVe generates verification queries for self-checking, whereas VE specifically pinpoints uncertain outputs and edits them using external knowledge.

D. Reinforcement Learning

Reinforcement Learning (RL) provides a systematic framework for refining LLM behavior by guiding models toward desired objectives through iterative feedback and carefully designed reward signals [105]. There are six main components: agent, environment, state, action, reward, and policy [106]. To apply RL for fine-tuning LLMs, the first step maps the six components to the LLM framework: the LLM represents the policy, the current textual sequence is the state, and, based on this state, the LLM generates an action, the next token. This action updates the state, creating a new state that incorporates the newly added token. After generating a complete textual sequence, a reward is determined by assessing the quality of the LLM output. This reward can be used to train a pre-trained reward model or can be directly integrated into the alignment process to guide the behavior of the model.
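Read as code, the mapping above treats decoding as an RL episode: the state is the token sequence generated so far, each action appends one token, and a scalar reward is assigned once the sequence is complete. The sketch assumes hypothetical sample_next_token() and reward_model() functions.

    def sample_next_token(state: list) -> str:
        """Placeholder: the LLM (policy) samples the next token given the sequence so far (state)."""
        raise NotImplementedError

    def reward_model(text: str) -> float:
        """Placeholder: scores a complete generation, e.g., a model trained on preference data."""
        raise NotImplementedError

    def rollout(prompt_tokens: list, max_new_tokens: int = 128):
        state = list(prompt_tokens)              # state: the textual sequence so far
        for _ in range(max_new_tokens):
            action = sample_next_token(state)    # action: the next token
            state.append(action)                 # the new state incorporates the token
            if action == "<eos>":
                break
        reward = reward_model(" ".join(state))   # reward: assigned to the finished sequence
        return state, reward                     # trajectory plus scalar reward for the RL update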
The RL methods adopted by these models can be divided into two main categories, model-based RL approaches and model-free approaches, which are discussed next.

a) Model-based RL Approaches: The methods in this category can be grouped into three groups: RLHF, RLAIF, and exploration, which are discussed next.

Reinforcement Learning from Human Feedback (RLHF): RLHF re-trains LLMs by incorporating a reward signal derived from human evaluations. RLHF involves three fundamental stages: it initially performs supervised fine-tuning (SFT) using labeled datasets, followed by training a reward model (RM) based on human-evaluated outputs, and finally uses this reward signal to inform the model's policy fine-tuning using the Proximal Policy Optimization (PPO) algorithm [107].
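The reward-model stage is commonly trained with a pairwise objective over ranked examples: for a prompt x with a preferred response y_w and a rejected response y_l, the loss is -log sigmoid(r(x, y_w) - r(x, y_l)). A minimal PyTorch sketch, with reward_model standing in for any scalar-scoring network:

    import torch.nn.functional as F

    def pairwise_reward_loss(reward_model, x, y_chosen, y_rejected):
        # reward_model(prompt, response) -> scalar tensor r(x, y); higher means better.
        r_chosen = reward_model(x, y_chosen)
        r_rejected = reward_model(x, y_rejected)
        # Encourage a positive margin between preferred and rejected responses:
        # loss = -log sigmoid(r_chosen - r_rejected)
        return -F.logsigmoid(r_chosen - r_rejected).mean()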
[108] pioneered fine-tuning models like InstructGPT using human feedback to better adhere to user instructions. Building on this approach, [109] and [110] explored reward modeling and methods to address challenges such as length bias, ensuring outputs are concise and aligned with human expectations. Frameworks like trlX [111] and high-quality datasets introduced by [112] have scaled RLHF applications, improving the performance of large language models (LLMs) in tasks such as summarization, translation, and dialogue generation. Summarization tasks, for example, leverage reinforcement learning (RL) through both extractive and abstractive methods; extractive summarization selects key sentences from the source, while abstractive summarization generates novel sentences to convey the essence of the content [113]. RL optimizes summarization by using rewards based on metrics like ROUGE to iteratively enhance the quality of outputs. Policy optimization, on the other hand, employs pairwise preference comparisons over candidate outputs. Tool-augmented reward models [117] incorporate external resources like calculators and search engines to refine alignment. Recent generative reward models use synthetic preferences, which are artificially created by sampling and ranking model outputs using a base preference model, to reduce reliance on extensive human feedback. [118] examined efficient methods for collecting pairwise human preferences, optimizing reward model design within RLHF frameworks. Additionally, research on over-optimization risks underscores the importance of balanced training to prevent performance degradation [119]. [114] propose novel pairwise feedback pipelines that improve preference learning and policy optimization by comparing response pairs to better capture human preferences.

RLHF's multi-step process remains resource-intensive and reliant on extensive human feedback [120]. Over-optimization risks may cause models to exploit weaknesses in the reward function rather than achieving genuine alignment with human preferences [119].

RL from AI Feedback (RLAIF) is a training method designed to replace human evaluators with AI systems, offering better scalability and consistency by mitigating the variability of human judgment [121]. In RLAIF, a Reward Model (RM) is trained using preference labels generated by a Large Language Model (LLM). These labels are transformed into a probability distribution through a softmax function and optimized via cross-entropy loss, enabling the RM to guide the training of the target AI model [122]. Various approaches have been proposed to address the specific challenges of RLAIF. One strategy distills AI feedback to train reward models, leveraging AI-generated insights to fine-tune reward systems and create scalable feedback mechanisms. For example, UltraFeedback compiles a large-scale dataset of over one million GPT-4 feedback annotations on 250,000 user-assistant conversations to train reward models [112]. Magpie employs a self-synthesis method, where an aligned LLM generates large-scale alignment data that fine-tunes reward models [123]. HelpSteer2 introduces a permissively licensed preference dataset to train reward models, demonstrating improved alignment with human preferences [124]. Another approach focuses on prompting LLMs to function as reward functions, directly guiding model training through reward scores, as seen in Exploring with LLMs (ELLM) Rewards [125]. Additional work, such as Reward Design with Language Models, emphasizes constructing reward mechanisms that align model outputs with desired outcomes by leveraging LLM capabilities [126]. Self-supervised feedback mechanisms have also been explored; for instance, the Eureka framework introduces a novel approach to reward optimization through self-generated feedback loops [127]. Self-rewarding systems, including Self-Refined LLMs [128] and Self-Rewarding Language Models (SRLM) [129], enable iterative refinement of model outputs based on their own evaluations.
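The core RLAIF training signal described above can be sketched as follows: the reward model's scores for two candidate responses are converted into a probability distribution with a softmax and trained with cross-entropy against the LLM judge's preference label. The scores and the soft label below are placeholders, not outputs of any specific cited system.

```python
import math

# A minimal sketch, assuming a generic RLAIF setup: an LLM judge assigns a
# preference label over two candidate responses; the reward model's two scores
# are turned into a probability distribution via a softmax and trained with
# cross-entropy against that AI-generated label.

def softmax(scores):
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(pred_probs, target_probs):
    eps = 1e-12
    return -sum(t * math.log(p + eps) for p, t in zip(pred_probs, target_probs))

rm_scores = [0.7, 1.5]       # reward-model scores for (response_a, response_b), placeholders
ai_label = [0.1, 0.9]        # soft preference from the LLM judge, e.g. "B preferred with prob 0.9"

probs = softmax(rm_scores)
loss = cross_entropy(probs, ai_label)
print(f"RM distribution: {probs}, cross-entropy vs. AI label: {loss:.3f}")
```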
RLAIF remains less widely adopted compared to RLHF. This discrepancy stems from challenges, such as difficulties in achieving alignment and the risk of propagating biases inherent in AI-generated feedback [112], [127]. These challenges can create feedback loops that amplify existing biases, constraining model diversity and limiting its ability to generalize effectively [129]. Moreover, the absence of human evaluators in RLAIF can result in a lack of nuance, leading to a narrower latent space influenced by the biases of the training AI [128].
Exploration techniques in RL involve seeking new information to improve future decisions, whereas exploitation capitalizes on current knowledge to maximize immediate rewards [130]. In these algorithms, each action decision can be made stochastic via epsilon-greedy [131] or entropy regularization [132] to ensure diverse coverage of the environment, but excessive exploration can be inefficient. Traditional approaches, such as epsilon-greedy [133] and Boltzmann exploration [134], introduce randomness without leveraging prior knowledge, slowing convergence. Recent methods, like ExploRLLM [135], use LLMs hierarchically to generate high-level plans and low-level affordance-based policies to efficiently explore high-value states while minimizing reliance on step-by-step LLM invocation. This structured approach enhances efficiency but struggles with adaptability in open-ended domains [136].
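The two classic exploration rules mentioned above can be summarized in a few lines; the Q-values in the sketch are illustrative placeholders.

```python
import math
import random

# A minimal sketch contrasting epsilon-greedy and Boltzmann exploration.

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def boltzmann(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q/temperature)."""
    exps = [math.exp(q / temperature) for q in q_values]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(q_values)), weights=probs)[0]

q = [0.2, 1.0, 0.5]             # toy action values (assumption)
print(epsilon_greedy(q), boltzmann(q, temperature=0.5))
```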
Soft RLLF integrates natural language as logical feedback to balance exploration and exploitation, enabling improved performance in reasoning tasks such as negation understanding and logical consistency in high-stakes applications [137]. This is achieved by encoding logical consistency checks and negation handling into the learning process, utilizing feedback loops to iteratively refine the agent's decision-making. However, its effectiveness diminishes when tackling problems requiring broader adaptability and creativity, as it is optimized for structured reasoning [137]. Another recent approach, LLM+Exp [138], employs a dual-LLM framework: one LLM analyzes action-reward trajectories to derive exploration strategies, while the other dynamically adjusts action probabilities to refine future decisions. Action-reward trajectories represent sequences of actions taken by an agent and the corresponding rewards, offering insights into the learning process. Action probabilities define the likelihood of selecting specific actions based on learned patterns and anticipated outcomes. While this adaptive approach excels in structured environments, it faces scalability issues and struggles to generalize effectively to unpredictable or unstructured tasks. Guided Pretraining RL [125] incorporates LLMs to influence exploration, providing contextual background knowledge to prioritize relevant actions and improve sample efficiency. The method combines LLM-generated contextually relevant trajectories with reinforcement learning, allowing the agent to pretrain on meaningful, structured action sequences before fine-tuning its policies in the target environment. However, it falls short in addressing the complexity and variability of problems that demand broader generalization and creative reasoning.

b) Model-Free Approaches: These methods can be grouped into three categories, DPO, IPO, and actor-critic. Their discussion follows next.

Direct Preference Optimization (DPO) addresses the limitations of RLHF/PPO, which necessitates meticulous oversight and significant computational resources due to the initial phase of training a reward model using a preference dataset, followed by training an RL policy with the pre-trained reward model serving as the environment. DPO offers a simpler alternative by directly optimizing LLM parameters using preference data, bypassing the need for a reward model [139]. DPO relies on a preference loss function trained on datasets of paired human preferences (e.g., "Response A is better than Response B"). Several extensions to DPO improve upon this baseline. For instance, DPOP [140] (also termed DPO-positive) introduces a margin-based term to prevent rewarding both preferred and disfavored outputs concurrently, thereby improving performance on tasks with small edit distances. Specifically, the margin-based term in DPOP introduces a penalty for assigning high probabilities to both preferred and disfavored outputs, ensuring that the model distinctly favors the preferred response to improve task performance. Iterative DPO [129] (also known as online DPO) mitigates distribution shifts by continually updating the policy on newly generated responses, an advantage over vanilla DPO, which can overfit to a narrower distribution. Meanwhile, β-DPO [141] adaptively tunes the regularization term based on the data quality, making it more robust to noisy preferences. Stepwise DPO (sDPO) [142] partitions the preference dataset to perform incremental updates, leveraging a stronger intermediate reference model at each phase.

DPO methods are advantageous for structured problem solving, like in creative writing or complex reasoning, because they can directly incorporate human preferences and avoid undesired behavior without heavily relying on large-scale reward modeling or complex RL training loops [139]. However, a recurring drawback is their sensitivity to distribution shifts, e.g., when the model starts generating out-of-domain responses, alignment performance can drop unless the reference model or preference data is iteratively updated [143]. Moreover, purely relying on pairwise or setwise human judgments can still introduce label noise or ambiguity, especially for creative or unstructured tasks [144]. Despite these limitations, DPO-based techniques are promising for balancing helpfulness and correctness in open-ended LLM outputs [145].
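For concreteness, the DPO objective for a single preference pair can be sketched as below; the log-probabilities are placeholders for per-response sums of token log-likelihoods under the current policy and the frozen reference model.

```python
import math

# A minimal sketch of the DPO loss under simplifying assumptions: inputs are
# log-probabilities of the preferred response y_w and the dispreferred
# response y_l under the current policy and the reference model; no reward
# model is involved.

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Placeholder log-probabilities for one preference pair.
loss = dpo_loss(logp_w=-12.0, logp_l=-11.5, ref_logp_w=-12.5, ref_logp_l=-11.0, beta=0.1)
print(f"DPO loss: {loss:.4f}")
```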
Identity Preference Optimization (IPO) [146] was introduced to address the overfitting inherent in RLHF and DPO. Unlike traditional methods that transform pairwise preferences into pointwise rewards using the Bradley–Terry (BT) model [147], IPO directly optimizes preferences without relying on nonlinear transformations, which are known to exacerbate overfitting. The objective function of IPO, as defined in eq. (1), aims to directly optimize preference probabilities while mitigating overfitting issues inherent in methods like RLHF and DPO. The function maximizes the expected preference utility, represented by $\mathbb{E}_x[\mathbb{E}_{y,y'}\,\Psi(P_\theta(y \succ y'))]$, where $\Psi(P_\theta(y \succ y'))$ captures the model's ability to predict and optimize preference probabilities for pairs of outputs $(y, y')$. To prevent excessive deviation from a reference policy, the KL divergence term $D_{\mathrm{KL}}(\pi\,\|\,\pi_{\mathrm{ref}})$ imposes a regularization constraint, controlled by the coefficient $\beta$. By balancing preference optimization and regularization, this approach avoids transforming pairwise preferences into pointwise rewards, which can exacerbate overfitting, and directly aligns the model's behavior with human preferences while maintaining stability.

$$\pi_\theta^{*} = \max_{\pi}\; \mathbb{E}_x\big[\mathbb{E}_{y,y'}\,\Psi(P_\theta(y \succ y')) - \beta\, D_{\mathrm{KL}}(\pi\,\|\,\pi_{\mathrm{ref}})\big]. \qquad (1)$$

To address the overfitting caused by the nonlinear transformation $\Psi(x)$, IPO simplifies $\Psi(x)$ to the identity function, $\Psi(x) = x$, and formulates a robust loss function, as defined in eq. (2). This loss function, $L_{\mathrm{IPO}}$, directly optimizes the policy $\pi_\theta$ by aligning it with human preferences while mitigating overfitting. The expectation is taken over pairs of outputs $(y_w, y_l)$, where $y_w$ represents the preferred (winning) output and $y_l$ the less preferred (losing) output. The terms $\log\frac{\pi_\theta(y_w)}{\pi_{\mathrm{ref}}(y_w)}$ and $\log\frac{\pi_\theta(y_l)}{\pi_{\mathrm{ref}}(y_l)}$ measure how well the current policy $\pi_\theta$ aligns with the reference policy $\pi_{\mathrm{ref}}$, accounting for both preferred and less preferred outputs. A regularization term, $\frac{1}{2\beta}$, balances the trade-off between optimizing preferences and maintaining adherence to the reference policy, ensuring model stability and reducing the risk of overfitting. By incorporating a squared penalty term, $L_{\mathrm{IPO}}$ captures and penalizes deviations from ideal preference alignment, whether positive or negative. The simplified approach avoids the complexity and instability of nonlinear transformations, providing a stable and effective framework for aligning policies with human preferences. This makes IPO a robust and efficient alternative to traditional preference-based learning methods that rely on pointwise rewards or complex transformations.

$$L_{\mathrm{IPO}} = \mathbb{E}_{(y_w, y_l)}\left[\left(\log\frac{\pi_\theta(y_w)}{\pi_{\mathrm{ref}}(y_w)} - \log\frac{\pi_\theta(y_l)}{\pi_{\mathrm{ref}}(y_l)} - \frac{1}{2\beta}\right)^{2}\right]. \qquad (2)$$

This approach proves particularly robust in scenarios with deterministic or near-deterministic feedback, where existing methods often struggle due to unstable gradients [148]. By leveraging a simpler optimization framework and incorporating strong regularization, IPO effectively mitigates overfitting and outperforms DPO in experimental settings [149]. However, IPO faces challenges due to its reliance on static preference distributions, which limits adaptability to dynamic or diverse scenarios. Additionally, its sensitivity to noise and dependence on high-quality data reduce robustness in complex, evolving environments [150].
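Eq. (2) can be written directly in code; the sketch below evaluates the squared IPO penalty for a single preference pair, with placeholder log-probabilities.

```python
# A minimal sketch of the IPO loss in eq. (2): the squared penalty pushes the
# log-ratio gap between the preferred and the dispreferred response toward the
# target 1/(2*beta), instead of pushing it to infinity as a sigmoid-based loss
# would. The log-probabilities below are placeholders.

def ipo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    gap = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)   # log-ratio difference
    target = 1.0 / (2.0 * beta)                           # regularization target
    return (gap - target) ** 2                            # squared penalty term

print(ipo_loss(logp_w=-12.0, logp_l=-11.5, ref_logp_w=-12.5, ref_logp_l=-11.0))
```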
Actor-critic methods, such as Advantage Actor-Critic (A2C) and Deep Deterministic Policy Gradient (DDPG), have been effectively adapted to optimize prompts for large language models (LLMs). Frameworks like Prompt Actor-Critic Editing (PACE) [151] employ an iterative process where the actor (the LLM) generates a response $a$ based on a prompt $p$ and input $X$. This process is formalized as

$$a = f_{\mathrm{actor}}([p; X], M),$$

where $f_{\mathrm{actor}}$ represents the decision-making mechanism of the actor, $[p; X]$ is the concatenated context consisting of the prompt $p$ and the specific input $X$, and $M$ is the LLM being optimized. The actor function processes the concatenated context to produce the response $a$, guided by the prompt $p$ and the input $X$.

The critic, another LLM or evaluation mechanism, evaluates the relevance, coherence, and task-specific accuracy of the response against the objective $Y$. The critique is calculated as follows [151]:

$$c = f_{\mathrm{critic}}([p; X; a; Y], M),$$

where $f_{\mathrm{critic}}$ represents the evaluation function of the critic. The input $[p; X; a; Y]$ consists of the prompt $p$, the input $X$, the actor-generated response $a$, and the objective $Y$, which defines the desired or target output. The critic processes this concatenated input using the language model $M$ to generate a critique $c$. This critique assesses how well the response $a$ aligns with the objective $Y$, considering both the input $X$ and prompt $p$. [152] leverages KL-regularization to balance fidelity to the original prompt while allowing modifications that improve task-specific performance. By iterating on this actor-critic loop, PACE enhances prompt effectiveness and guides LLMs toward better alignment with task objectives.

Additionally, actor-critic methods assume well-structured feedback loops, which might be unreasonable for problems with sparse or noisy signals. Recent work addresses these challenges. [153] explores open-ended learning in the context of unsupervised skill discovery, highlighting the need for more flexible reward functions in high-dimensional environments. HDFlow [154] combines fast and slow thinking modes to enhance complex reasoning. [155] introduces Direct Q-function Optimization (DQO), which formulates response generation as a Markov Decision Process (MDP), allowing each token generation to be treated as a state transition. Leveraging the soft actor-critic (SAC) framework, DQO directly parameterizes the Q-function within the language model, enabling it to learn effectively from offline data, including unbalanced or negative samples, which helps improve multi-step reasoning.
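The PACE-style loop above can be sketched as two LLM calls per iteration, one acting and one critiquing; `call_llm` below is a hypothetical stand-in for the model M, and the prompt-editing step is reduced to appending the critique.

```python
# A minimal sketch, not the PACE implementation, of the actor-critic prompt
# editing loop described above.

def call_llm(text):
    """Hypothetical LLM call; here it just echoes a canned answer."""
    return f"[LLM output for: {text[:40]}...]"

def actor(prompt, x, model=call_llm):
    """a = f_actor([p; X], M): generate a response from the prompt and input."""
    return model(f"{prompt}\nInput: {x}")

def critic(prompt, x, answer, objective, model=call_llm):
    """c = f_critic([p; X; a; Y], M): critique the response against objective Y."""
    return model(
        f"Prompt: {prompt}\nInput: {x}\nResponse: {answer}\n"
        f"Objective: {objective}\nCritique the response and suggest prompt edits."
    )

def pace_iteration(prompt, x, objective):
    a = actor(prompt, x)                      # actor step
    c = critic(prompt, x, a, objective)       # critic step
    # In PACE the critique would drive an edit of the prompt; here we simply
    # append it as a revision note (placeholder for the editing step).
    return prompt + f"\n# revision note: {c}"

print(pace_iteration("Summarize the text.", "Analog circuits ...", "A two-sentence summary"))
```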
IV. SIMILARITIES OF LLMS AND TRADITIONAL AUTOMATED IMPLEMENTATION GENERATION METHODS AND RELATED RESEARCH NEEDS

A broad analogy can be identified between using Genetic Algorithms (GAs) and LLMs for implementation creation, as illustrated by the short sketch that follows the list below.

1) Selection: GA selection chooses the fittest individuals to pass their genes to the next generation. In fine-tuning or training data selection, LLMs prioritize coherence and relevance when generating text, similar to selecting relevant context for responses. Like choosing the best seeds from a harvest, LLMs select the most relevant words or sentences to continue a conversation [156].

2) Crossover (Recombination): GA crossover combines the genomes of two parents to create a new individual. This is similar to blending knowledge from different domains during text generation, for example, merging insights from literature and science in a single response. Crossover is like an LLM writing poetry about quantum physics, e.g., combining Shakespearean elegance with scientific rigor [157].

3) Mutation: GA mutation introduces random changes in a genome to explore new possibilities. This is similar to the slight randomness added during sampling techniques like top-k or temperature settings, which allow LLMs to produce diverse responses. Mutation in GAs is like LLMs occasionally breaking patterns to say something unexpected or creative [158].

4) Inversion: GA inversion reverses a segment of the genome to explore new configurations. This parallels rephrasing or reordering sentences during text generation while preserving the original meaning. Like flipping a playlist order for a new vibe, LLMs rephrase "The car is fast" into "A fast car it is" [159].

5) Elitism: GA elitism ensures the best solutions carry over unchanged to the next generation. This is similar to checkpointing the best-performing weights during training or favoring high-confidence outputs in decoding strategies. Like archiving the best answers during an essay edit, LLMs retain their most confident responses for the final output [160].

6) Replacement: GA replacement decides how much of the old population to keep versus the new one. This is similar to parameter updates during fine-tuning, where new knowledge replaces older information incrementally. Replacement is like LLMs balancing old facts while integrating new updates, ensuring a model doesn't "forget" but adapts to current knowledge [161].

7) Fitness Evaluation: GA fitness evaluation scores individuals based on quality to determine their survival. This is similar to evaluating model outputs using metrics like BLEU, ROUGE, or user feedback in RLHF. Fitness evaluation is like an LLM receiving human feedback to improve its responses based on relevance, coherence, or creativity [162].
8) Exploration vs. Exploitation: GA balances trying new possibilities (exploration) and refining known solutions (exploitation). This is similar to balancing randomness and coherence during response generation: parameters like temperature encourage exploration, while context relevance drives exploitation. Just as genetic algorithms search for novel solutions, LLMs strike a balance between playful creativity and logical reasoning in ambiguous prompts [163].

Fig. 1. Five strategies for automated implementation creation.
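As a concrete illustration of the analogy, the following sketch runs a tiny evolutionary loop in which candidate texts are the individuals, a stubbed `llm` call plays the role of the crossover and mutation operators, and a toy scoring function plays the role of fitness; all names and prompts are illustrative assumptions rather than a method proposed in the cited works.

```python
import random

# A minimal sketch of the GA-LLM analogy: selection, crossover, mutation,
# elitism, and replacement over a small population of candidate texts.

def llm(instruction):
    """Hypothetical LLM call used as a variation operator; stubbed out here."""
    return instruction.split(":", 1)[-1].strip() + " (revised)"

def fitness(candidate):
    return -abs(len(candidate) - 60)          # toy objective: ~60 characters

def crossover(a, b):
    return llm(f"Combine the ideas of these two texts into one: {a} || {b}")

def mutate(a):
    return llm(f"Rephrase with a small creative variation: {a}")

def evolve(population, generations=3, elite=2):
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)      # selection / fitness
        parents = population[:elite]                     # elitism
        children = [mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(len(population) - elite)]
        population = parents + children                  # replacement
    return max(population, key=fitness)

seeds = ["Sort an array with bubble sort.", "Compare adjacent items and swap them."]
print(evolve(seeds * 2))
```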
The next part discusses using Cognitive Architectures for implementation creation, possibly using the features of LLMs. Similar to [32], this report considers that devising an implementation for a problem specification utilizes the five strategies shown in Figure 1. The problem solving process is a mixture of the five strategies.

Each strategy starts from a kernel, which is the invariant set of features used in the process. The problem solving process creates a solution cluster corresponding to the kernel features, e.g., each implementation in the cluster includes the features. Implementations are created through implementation elaboration by exploring a sequence of detailing alternatives. For example, the principle of the bubble sorting algorithm can be described as repeatedly comparing the adjacent values of an array and swapping them if they are in the wrong order until no more value swaps are needed. The kernel includes three features: (i) the values of an array, (ii) the swapping of adjacent values if they are in the wrong order, and (iii) the repetition of the process until no more swaps are needed. The corresponding cluster includes all implementations obtained by elaborating the three kernel features.

The five strategies are as follows [32]:

• Strategy 1 describes the elaboration process in which each kernel is elaborated without changing the kernel. A set of detailing alternatives can be used for each elaboration step to produce an implementation envelope. The envelopes are incrementally elaborated until the final implementation is created.

• Strategy 2 represents the process which, in addition to the elaboration steps of Strategy 1, also uses elaboration results corresponding to a different implementation cluster. Figure 1 shows the use of features from Implementation cluster 1 (red arrow in the figure) to build the implementations of Implementation cluster 2. Hence, the subsequent solutions include the elaboration of all kernel features and the features adopted from another cluster.

• Strategy 3 uses a kernel that combines kernel features from two different implementation clusters. The blue arrows in Figure 1 illustrate the combination.

• Strategy 4 presents an elaboration process in which the selected detailing alternatives are excluded from the elaboration steps used for building other implementation clusters. It represents the excluded niche in Figure 1 (green arrow).

• Strategy 5 creates a kernel bottom-up by identifying and generalizing the features of individual implementations. The individual implementations were produced through less-structured methods, like, for example, experimental trial-and-error.
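One possible way to make the kernel and cluster notions operational, using the bubble-sort example above, is sketched below; the data model (field names, the form of the detailing choices) is an illustrative assumption rather than a representation prescribed by [32].

```python
from dataclasses import dataclass, field

# A minimal sketch of kernels (invariant features shared by a cluster) and the
# elaborations that turn them into concrete implementations.

@dataclass
class Kernel:
    name: str
    features: list           # invariant features present in every implementation

@dataclass
class Implementation:
    kernel: Kernel
    detailing_choices: dict   # detailing alternatives chosen during elaboration
    code: str = ""

@dataclass
class Cluster:
    kernel: Kernel
    implementations: list = field(default_factory=list)

bubble_kernel = Kernel(
    name="bubble sort",
    features=[
        "values stored in an array",
        "swap adjacent values that are out of order",
        "repeat passes until no swap occurs",
    ],
)

cluster = Cluster(kernel=bubble_kernel)
cluster.implementations.append(
    Implementation(
        kernel=bubble_kernel,
        detailing_choices={"order": "ascending", "early_exit": True, "language": "Python"},
        code="def bubble_sort(a): ...",
    )
)
print(len(cluster.implementations), cluster.kernel.features[0])
```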
While the five strategies provide templates for the implementation elaboration process, automated implementation generation requires the following additional activities:

1) Divide and conquer: The activity partitions a problem into sub-problems and then provides ways to integrate the implementations of the sub-problems. Task decomposition methods in LLM prompting [73], [74] can produce certain decompositions, especially in situations in which the sub-problems are less coupled. However, design problems are often strongly coupled, so that even though there are specialized modules to implement a certain function, their operation and performance are tightly related. Decomposition requires not only a static problem partitioning based on the items in the prompts (i.e., words) but also the interpretation of a sub-problem within the context set up by the interpretations of other sub-problems. LLM fine-tuning through RL is likely infeasible due to the huge space of possible decompositions in real life. A mechanism is also needed to track the analyzed decompositions, so that the information can be used to improve future decompositions. This capability is absent in current methods.

2) Kernel creation: The method creates kernels either by assembling the features likely to address the problem requirements and then elaborating them top-down, or in a bottom-up process as detailed in Strategy 5. Separate kernels can be created for different sub-problems, followed by integrating them into a single kernel and its elaboration, or by separately elaborating each kernel and integrating their implementations. Ideas on LLM self-reflection and focus on the main information [71] can help identify the features to be included in a kernel. However, finding kernels, e.g., the invariant features present in all implementations pertaining to a cluster, remains mostly a manual process. Methods similar to RLHF [107] can help retrieve similar features, but their scalability is likely low. Moreover, combining features from different kernels to generate a new kernel (Strategy 3) has not been studied by current LLM methods. The combination of features needs a way to predict the expected performance at a high level (possibly a qualitative evaluation), which can be offered to some degree by LLMs, similar to the use of LLMs to solve ambiguities [82], [83]. However, it is likely that the current methods are insufficient for this purpose.

3) Elaboration: Executing the five strategies requires devising additional methods for detailing alternatives, predicting the effectiveness of each alternative in the context of a partial implementation, assigning a priority (preference) to each alternative, and incorporating the alternative into the partial implementation. A possible approach is to use schemas for elaboration, similar to RAG methods for LLMs [80], [81]. Schema matching can benefit from LLMs to clarify certain ambiguities, such as in [82], [83]. However, schemas are static structures, useful in analogical reasoning, even though problem solving often requires performing new sequences of decisions beyond a static schema.

4) Implementation assessment: LLMs can be used for two kinds of performance assessment. Qualitative assessment, including comparing implementations, such as pairs of circuit designs, can be obtained by prompting traditional LLMs. CoT prompting can be used to obtain performance assessment at a finer granularity. RLHF can fine-tune assessment by adding human feedback about the quality of the implementations [112]. Moreover, self-critique methods could be used to improve the correctness and completeness of the LLM responses, like the self-consistency and cross-referencing methods in VE [104]. A second approach uses datasets of characterized implementations to train an LLM, similar to exploration techniques in RL [133], [134]. Then, the generalization capacity of LLMs is used to quantitatively predict the performance parameters of a new implementation. Nevertheless, the two approaches do not scale beyond the samples used in training an LLM, including situations in which a new implementation uses a nonlinear combination of the features of different implementations. There is no mechanism similar to setting up precise physical models of an implementation, so that the models can be solved to produce quantitative performance assessment, like in traditional automated implementation creation methods.

5) Memory and learning: Similar to using long-term memory for knowledge retrieval in RAG, memory systems are needed for learning to store associations, like kernel features, their most relevant implementation fragments, and their performance values, or between high-level features and their detailed elaborations, the causal relationships of main features and performance attributes, and elaboration sequences that produced high-quality implementations. Similar to schema-based retrieval, memory cueing must solve semantic ambiguities.

6) Adaptive process: It includes the sequence of automated activities performed to create an implementation. It requires devising new means to predict the expected outcomes of the available activities, selecting and adapting an activity to the current context, understanding the degree to which the sequence advances towards creating an implementation, and learning new knowledge available during the process. Also, when addressing collaboration between humans and LLMs to tackle unexpected challenges, such as handling zero-day attacks, the process necessitates reasoning, understanding of prior instructions, and intuitive decision-making within the context of new parameters and constraints. To automate this process, exploring reasoning techniques, including deductive reasoning, inductive reasoning, analogical reasoning, common-sense reasoning, tree-of-thoughts, multiple chains of thought, causal reasoning, heuristic reasoning, and symbolic reasoning, is required. Among these, the primary human thought process often involves mapping the current problem to a previously encountered one or identifying similarities with analogous problems, like in analogical reasoning. Consequently, an effective approach to problem modeling could involve neuro-symbolic representations that allow LLMs to dynamically learn and adapt in real time. Techniques such as grokking, which enable models to discover relationships and patterns through iterative refinement, and masked LLMs are promising methods to achieve this goal. These approaches empower the model to derive connections on the fly, effectively merging learned representations with reasoning capabilities.

V. CONCLUSIONS

Recent advances in Large Language Models (LLMs) offer the opportunity to extend automated implementation generation techniques beyond the current methods that require algorithmic specifications as input and can use only static domain knowledge. LLMs can process multi-modal descriptions, including ideas communicated in natural language and through images, and with certain degrees of specification completeness, unknowns, and ambiguity. LLMs learn a broad range of associations for diverse contexts. These new capabilities might offer intriguing paths beyond traditional implementation generation, such as supporting problem framing and the exploration of possible solution approaches, improved implementation assessment across abstraction levels through comprehensive comparison to similar, externally available implementations, collective feedback and preferences, and enhanced elaboration by incorporating continuously updated domain knowledge. These features are critical in solving open-ended problems, currently hard to address with existing methods. Summarizing the state-of-the-art on LLMs and their related improvements is a first step towards devising novel LLM-based methods for implementation generation.

This report offers a comprehensive overview of existing LLM techniques and studies the degree to which they can model the activities needed for implementation generation for open-ended problem solving. The overview presents LLM enhancements, like prompting, Reinforcement Learning (RL) and Retrieval-Augmented Generation (RAG). Then the report discusses the possibility of using LLMs to realize problem solving activities that are not available in traditional automated implementation generation methods. New research requirements are also presented, e.g., support for problem framing, creating an implementation approach, effective elaboration control, robust qualitative and quantitative assessment across abstraction levels, knowledge memorizing during learning, and managing the problem solving process.

REFERENCES

[1] S. Fiore, M. Rosen, K. Smith-Jentsch, E. Salas, L. M., and N. Warner, "Toward an understanding of macrocognition in teams: Predicting processes in complex collaborative contexts," Human Factors, vol. 52, no. 2, pp. 203–224, 2010.
[2] A. Fischer, S. Greiff, and J. Funke, "The process of solving complex problems," Journal of Problem Solving, vol. 4, 2012.
[3] C. Sun, V. Shute, A. Stewart, J. Yonehiro, N. Duran, and S. D'Mello, "Towards a generalized competency model of collaborative problem solving," Computers & Education, vol. 143, p. 103672, 2020.
[4] T. Wiltshire, J. Butner, and S. Fiore, "Problem-solving phase transitions during team collaboration," Cognitive Science, vol. 42, no. 1, pp. 129–167, 2018.
[5] G. Schraw, M. E. Dunkle, and L. D. Bendixen, "Cognitive processes in well-defined and ill-defined problem solving," Applied Cognitive Psychology, vol. 9, no. 6, pp. 523–538, 1995.
[6] A. Doboli and A. Umbarkar, "The role of precedents in increasing creativity during iterative design of electronic embedded systems," Design Studies, vol. 35, no. 3, pp. 298–326, 2014.
[7] A. Doboli, A. Umbarkar, S. Doboli, and J. Betz, "Modeling semantic knowledge structures for creative problem solving: Studies on expressing concepts, categories, associations, goals and context," Knowledge-based Systems, vol. 78, pp. 34–50, 2015.
[8] J. R. Koza, F. H. Bennett, D. Andre, and M. A. Keane, "Reuse, parameterized reuse, and hierarchical reuse of substructures in evolving electrical circuits using genetic programming," in Evolvable Systems: From Biology to Hardware: First International Conference, ICES96, Tsukuba, Japan, October 7–8, 1996, Proceedings 1. Springer, 1997, pp. 312–326.
[9] R. Wirfs-Brock, P. Taylor, and J. Noble, "Problem frame patterns: an exploration of patterns in the problem space," Proc. Conference on Pattern Languages of Programs, 2006.
[10] H. Tang and A. Doboli, "High-level synthesis of delta-sigma modulators optimized for complexity, sensitivity and power consumption," IEEE Transactions on CADICS, vol. 25, no. 3, pp. 597–607, 2006.
[11] Y. Wei, H. Tang, and A. Doboli, "Systematic methodology for designing reconfigurable delta sigma modulator topologies for multimode communication systems," IEEE Transactions on CADICS, vol. 26, no. 3, pp. 480–496, 2007.
[12] G. A. Klein and J. Weitzenfeld, "Improvement of skills for solving ill-defined problems," Educational Psychologist, vol. 13, no. 1, pp. 31–41, 1978.
[13] J. P. Leighton, W. T. Rogers, and T. O. Maguire, "Assessment of student problem-solving on ill-defined tasks," Alberta Journal of Educational Research, vol. 45, no. 4, 1999.
[14] A. Doboli and S. Doboli, "A novel agent-based, evolutionary model for expressing the dynamics of creative open-problem solving in small groups," Applied Intelligence, vol. 51, pp. 2094–2127, 2021.
[15] R. Wang, J. Lehman, A. Rawal, J. Zhi, Y. Li, J. Clune, and K. Stanley, "Enhanced poet: Open-ended reinforcement learning through unbounded invention of learning challenges and their solutions," in International conference on machine learning. PMLR, 2020, pp. 9940–9951.
[16] A. Aho, J. Ullman, R. Sethi, and M. Lam, The SOAR Cognitive Architecture. Addison Wesley, 2006.
[17] A. Doboli, N. Dhanwada, A. Nunez-Aldana, and R. Vemuri, "A library-based approach to analog synthesis from vhdl-ams specifications," ACM Transactions on Design Automation, vol. 9, no. 2, pp. 238–271, 2004.
[18] M. Fingeroff, High-Level Synthesis Blue Book. Xlibris Us, 2010.
[19] T. McConaghy, P. Palmers, P. Gao, M. Steyaert, and G. Gielen, Variation-aware Analog Structural Synthesis. Springer, 2009.
[20] A. Doboli and R. Vemuri, "Behavioral modeling for high-level synthesis of analog and mixed-signal systems from vhdl-ams," IEEE Transactions on CADICS, vol. 22, no. 11, 2003.
[21] ——, “Exploration-based high-level synthesis of linear analog systems clinical named entity recognition via prompt engineering,” Journal of
operating at low/medium frequencies,” IEEE Transactions on CADICS, the American Medical Informatics Association, p. ocad259, 2024.
vol. 22, no. 22, 2003. [44] T. Boyle, “Medical transcriptions,” 2018,
[22] E. Wisniewski, “When concepts combine,” Psychonomic Bulletin & accessed: 2024-12-26. [Online]. Available:
Review, vol. 4, no. 2, pp. 167–183, 1997. [Link]
[23] W. Kruiskamp and D. Leenaerts, “Darwin: Cmos opamp synthesis by [45] “Vaccine adverse event reporting system (vaers),”
means of genetic algorithm,” Proc. Design Automation Conference, pp. [Link] Centers for Disease Control
433–438, 1995. and Prevention (CDC) and U.S. Food and Drug Administration (FDA),
[24] A. Chopra, A. Artikis, J. Bentahar, M. Colombetti, F. Dignum, 2024, accessed: 2024-12-26.
N. Fornara, A. Jones, M. Singh, and P. Yolum, “Research directions in [46] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le,
agent communication,” ACM Trans. Intell. Syst. Technol., vol. 4, no. 2, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large
2013. language models,” Advances in neural information processing systems,
[25] E. Bonabeau, “Agent-based modeling: methods and techniques for vol. 35, pp. 24 824–24 837, 2022.
simulating human systems,” Proceedings of the National Academy of [47] Y. Fu, H. Peng, A. Sabharwal, P. Clark, and T. Khot, “Complexity-
Sciences, vol. 9, no. 3, 7280. based prompting for multi-step reasoning,” in The Eleventh Interna-
[26] S. Lapp, K. Jablokow, and C. McComb, “Collaborating with style: tional Conference on Learning Representations, 2022.
Using an agent-based model to simulate cognitive style diversity in [48] Y. Zhou, X. Geng, T. Shen, C. Tao, G. Long, J.-G. Lou, and
problem solving teams,” in Proc. ASME International Design En- J. Shen, “Thread of thought unraveling chaotic contexts,” arXiv preprint
gineering Technical Conferences and Computers and Information in arXiv:2311.08734, 2023.
Engineering Conference, Vol. 7, 2017, pp. 1–7. [49] X. Li, R. Zhao, Y. K. Chia, B. Ding, S. Joty, S. Poria, and
[27] J. Anderson, “Act: A simple theory of complex cognition,” American L. Bing, “Chain-of-knowledge: Grounding large language models
Psychologist, vol. 51, pp. 355–365, 1996. via dynamic knowledge adapting over heterogeneous sources,” arXiv
[28] J. Laird, Compilers: Principles, Techniques, and Tools. The MIT preprint arXiv:2305.13269, 2023.
Press, 2012. [50] C. Li, J. Liang, A. Zeng, X. Chen, K. Hausman, D. Sadigh,
[29] P. Rosenbloom, A. Demski, and U. Volkan, “The sigma cognitive ar- S. Levine, L. Fei-Fei, F. Xia, and B. Ichter, “Chain of code: Reasoning
chitecture and system: towards functionally elegant grand unification,” with a language model-augmented code emulator,” arXiv preprint
Journal of Artificial General Intelligence, 2016. arXiv:2312.04474, 2023.
[30] D. Kieras and D. Meyer, “An overview of the epic architecture [51] X. Zhao, M. Li, W. Lu, C. Weber, J. H. Lee, K. Chu, and S. Wermter,
for cognition and performance with application to human-computer “Enhancing zero-shot chain-of-thought reasoning in large language
interaction,” Journal Human-Computer Interaction, vol. 12, no. 4, pp. models through logic,” arXiv preprint arXiv:2309.13339, 2023.
391–438, 1997. [52] S. Bao, T. Li, and B. Cao, “Chain-of-event prompting for multi-
[31] R. Sun, A tutorial on clarion 5.0. Cognitive Science document summarization by large language models,” International
Department, Rensselaer Polytechnic ([Link] Journal of Web Information Systems, no. ahead-of-print, 2024.
rsun/[Link]), 2003.
[53] Z. Wang, H. Zhang, C.-L. Li, J. M. Eisenschlos, V. Perot, Z. Wang,
[32] H. Li, X. Liu, F. Jiao, A. Doboli, and S. Doboli, “Innova: A cognitive
L. Miculicich, Y. Fujii, J. Shang, C.-Y. Lee et al., “Chain-of-table:
architecture for computational innovation through robust divergence
Evolving tables in the reasoning chain for table understanding,” arXiv
and its application for analog circuit design,” IEEE Transactions on
preprint arXiv:2401.04398, 2024.
CADICS, vol. 37, no. 10, pp. 1943–1956, 2018.
[54] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowd-
[33] A. Vaswani, “Attention is all you need,” Advances in Neural Informa-
hery, and D. Zhou, “Self-consistency improves chain of thought rea-
tion Processing Systems, 2017.
soning in language models,” arXiv preprint arXiv:2203.11171, 2022.
[34] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal,
A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language mod- [55] Y. K. Chia, G. Chen, L. A. Tuan, S. Poria, and L. Bing, “Contrastive
els are few-shot learners,” Advances in neural information processing chain-of-thought prompting,” arXiv preprint arXiv:2311.09277, 2023.
systems, vol. 33, pp. 1877–1901, 2020. [56] X. Liu, T. Pang, and C. Fan, “Federated prompting and chain-of-
[35] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A thought reasoning for improving llms answering,” in International
review and new perspectives,” IEEE transactions on pattern analysis Conference on Knowledge Science, Engineering and Management.
and machine intelligence, vol. 35, no. 8, pp. 1798–1828, 2013. Springer, 2023, pp. 3–11.
[36] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman, [57] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and
“Building machines that learn and think like people,” Behavioral and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with
brain sciences, vol. 40, p. e253, 2017. large language models,” Advances in Neural Information Processing
[37] L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.-S. Huang, Systems, vol. 36, 2024.
M. Cheng, M. Glaese, B. Balle, A. Kasirzadeh et al., “Ethical and social [58] J. Jung, L. Qin, S. Welleck, F. Brahman, C. Bhagavatula, R. L. Bras,
risks of harm from language models,” arXiv preprint arXiv:2112.04359, and Y. Choi, “Maieutic prompting: Logically consistent reasoning with
2021. recursive explanations,” arXiv preprint arXiv:2205.11822, 2022.
[38] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, [59] L. Wang, W. Xu, Y. Lan, Z. Hu, Y. Lan, R. K.-W. Lee, and E.-P.
E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg et al., “Sparks of Lim, “Plan-and-solve prompting: Improving zero-shot chain-of-thought
artificial general intelligence: Early experiments with gpt-4,” arXiv reasoning by large language models,” arXiv preprint arXiv:2305.04091,
preprint arXiv:2303.12712, 2023. 2023.
[39] R. Zellers, A. Holtzman, H. Rashkin, Y. Bisk, A. Farhadi, F. Roesner, [60] W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Program of thoughts
and Y. Choi, “Defending against neural fake news,” Advances in neural prompting: Disentangling computation from reasoning for numerical
information processing systems, vol. 32, 2019. reasoning tasks,” arXiv preprint arXiv:2211.12588, 2022.
[40] M. Post and D. Vilar, “Fast lexically constrained decoding with dy- [61] H. Hu, H. Lu, H. Zhang, Y.-Z. Song, W. Lam, and Y. Zhang, “Chain-
namic beam allocation for neural machine translation,” arXiv preprint of-symbol prompting elicits planning in large langauge models,” arXiv
arXiv:1804.06609, 2018. preprint arXiv:2305.10276, 2023.
[41] N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, [62] J. Li, G. Li, Y. Li, and Z. Jin, “Structured chain-of-thought prompting
A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer for code generation,” ACM Transactions on Software Engineering and
learning for nlp,” in International conference on machine learning. Methodology, 2023.
PMLR, 2019, pp. 2790–2799. [63] H. Fei, B. Li, Q. Liu, L. Bing, F. Li, and T.-S. Chua, “Reasoning
[42] R. Geirhos, J.-H. Jacobsen, C. Michaelis, R. Zemel, W. Brendel, implicit sentiment with chain-of-thought prompting,” arXiv preprint
M. Bethge, and F. A. Wichmann, “Shortcut learning in deep neural arXiv:2305.11255, 2023.
networks,” Nature Machine Intelligence, vol. 2, no. 11, pp. 665–673, [64] K. Singhal, T. Tu, J. Gottweis, R. Sayres, E. Wulczyn, L. Hou,
2020. K. Clark, S. Pfohl, H. Cole-Lewis, D. Neal et al., “Towards expert-
[43] Y. Hu, Q. Chen, J. Du, X. Peng, V. K. Keloth, X. Zuo, Y. Zhou, level medical question answering with large language models,” arXiv
Z. Li, X. Jiang, Z. Lu et al., “Improving large language models for preprint arXiv:2305.09617, 2023.
[65] Z. Zhang, A. Zhang, M. Li, and A. Smola, “Automatic chain of thought for adapting large language models to specialized domains,” arXiv
prompting in large language models,” arXiv preprint arXiv:2210.03493, preprint arXiv:2410.17952, 2024.
2022. [88] M. Hu, L. Zong, H. Wang, J. Zhou, J. Li, Y. Gao, K.-F. Wong,
[66] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, Y. Li, and I. King, “SeRTS: Self-rewarding tree search for biomedical
“React: Synergizing reasoning and acting in language models,” arXiv retrieval-augmented generation,” in Findings of the Association for
preprint arXiv:2210.03629, 2022. Computational Linguistics: EMNLP 2024, Y. Al-Onaizan, M. Bansal,
[67] S. Diao, P. Wang, Y. Lin, R. Pan, X. Liu, and T. Zhang, “Active and Y.-N. Chen, Eds. Miami, Florida, USA: Association for
prompting with chain-of-thought for large language models,” arXiv Computational Linguistics, Nov. 2024, pp. 1321–1335. [Online].
preprint arXiv:2302.12246, 2023. Available: [Link]
[68] S. Imani, L. Du, and H. Shrivastava, “Mathprompter: Mathematical rea- [89] Z. Wang, Z. Wang, L. Le, H. S. Zheng, S. Mishra, V. Perot, Y. Zhang,
soning using large language models,” arXiv preprint arXiv:2303.05398, A. Mattapalli, A. Taly, J. Shang et al., “Speculative rag: Enhanc-
2023. ing retrieval augmented generation through drafting,” arXiv preprint
[69] M. Yasunaga, X. Chen, Y. Li, P. Pasupat, J. Leskovec, P. Liang, E. H. arXiv:2407.08223, 2024.
Chi, and D. Zhou, “Large language models as analogical reasoners,” [90] B. J. Gutiérrez, Y. Shu, Y. Gu, M. Yasunaga, and Y. Su, “Hipporag:
arXiv preprint arXiv:2310.01714, 2023. Neurobiologically inspired long-term memory for large language mod-
[70] Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan, and W. Chen, “Syn- els,” arXiv preprint arXiv:2405.14831, 2024.
thetic prompting: Generating chain-of-thought demonstrations for large [91] W. Wang, L. Dong, H. Cheng, X. Liu, X. Yan, J. Gao, and F. Wei,
language models,” in International Conference on Machine Learning. “Augmenting language models with long-term memory,” Advances in
PMLR, 2023, pp. 30 706–30 775. Neural Information Processing Systems, vol. 36, 2024.
[71] J. Weston and S. Sukhbaatar, “System 2 attention (is something you [92] A. Aadhithya A et al., “Enhancing long-term memory using hierarchi-
might need too),” arXiv preprint arXiv:2311.11829, 2023. cal aggregate tree for retrieval augmented generation,” arXiv e-prints,
[72] Y. Wang and Y. Zhao, “Metacognitive prompting improves understand- pp. arXiv–2406, 2024.
ing in large language models,” arXiv preprint arXiv:2308.05342, 2023. [93] H. Qian, P. Zhang, Z. Liu, K. Mao, and Z. Dou, “Memorag: Moving
[73] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schu- towards next-gen rag via memory-inspired knowledge discovery,” arXiv
urmans, C. Cui, O. Bousquet, Q. Le et al., “Least-to-most prompting preprint arXiv:2409.05591, 2024.
enables complex reasoning in large language models,” arXiv preprint [94] Y. Bai, Y. Miao, L. Chen, D. Wang, D. Li, Y. Ren, H. Xie, C. Yang,
arXiv:2205.10625, 2022. and X. Cai, “Pistis-rag: Enhancing retrieval-augmented generation with
[74] T. Khot, H. Trivedi, M. Finlayson, Y. Fu, K. Richardson, P. Clark, human feedback,” arXiv preprint arXiv:2407.00072, 2024.
and A. Sabharwal, “Decomposed prompting: A modular approach for [95] C. Gan, D. Yang, B. Hu, H. Zhang, S. Li, Z. Liu, Y. Shen, L. Ju,
solving complex tasks,” arXiv preprint arXiv:2210.02406, 2022. Z. Zhang, J. Gu et al., “Similarity is not all you need: Endowing
[75] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and retrieval augmented generation with multi layered thoughts,” arXiv
G. Neubig, “Pal: Program-aided language models,” in International preprint arXiv:2405.19893, 2024.
Conference on Machine Learning. PMLR, 2023, pp. 10 764–10 799. [96] J. Jiang, J. Chen, J. Li, R. Ren, S. Wang, W. X. Zhao, Y. Song,
[76] Z. Cheng, T. Xie, P. Shi, C. Li, R. Nadkarni, Y. Hu, C. Xiong, D. Radev, and T. Zhang, “Rag-star: Enhancing deliberative reasoning with
M. Ostendorf, L. Zettlemoyer et al., “Binding language models in retrieval augmented verification and refinement,” arXiv preprint
symbolic languages,” arXiv preprint arXiv:2210.02875, 2022. arXiv:2412.12881, 2024.
[77] Y. Ye, B. Hui, M. Yang, B. Li, F. Huang, and Y. Li, “Large language [97] Y. Tang and Y. Yang, “Multihop-rag: Benchmarking retrieval-
models are versatile decomposers: Decompose evidence and questions augmented generation for multi-hop queries,” arXiv preprint
for table-based reasoning,” arXiv preprint arXiv:2301.13808, 2023. arXiv:2401.15391, 2024.
[78] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, [98] B. Liu, C. Lyu, Z. Min, Z. Wang, J. Su, and L. Wang, “Retrieval-
H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., “Retrieval- augmented multi-modal chain-of-thoughts reasoning for large language
augmented generation for knowledge-intensive nlp tasks,” Advances in models,” arXiv preprint arXiv:2312.01714, 2023.
Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020. [99] W. Zhao, J. T. Chiu, C. Cardie, and A. M. Rush, “Hop, union, generate:
[79] P. Béchard and O. M. Ayala, “Reducing hallucination in struc- Explainable multi-hop reasoning without rationale supervision,” arXiv
tured outputs via retrieval-augmented generation,” arXiv preprint preprint arXiv:2305.14237, 2023.
arXiv:2404.08189, 2024. [100] Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola,
[80] P. Dixit and T. Oates, “Sbi-rag: Enhancing math word problem solving “Multimodal chain-of-thought reasoning in language models,” arXiv
for students through schema-based instruction and retrieval-augmented preprint arXiv:2302.00923, 2023.
generation,” arXiv preprint arXiv:2410.13293, 2024. [101] S. Vatsal, A. Singh, and S. Tafreshi, “Can gpt improve the state of prior
[81] N. Matsumoto, J. Moran, H. Choi, M. E. Hernandez, M. Venkatesan, authorization via guideline based automated question answering?” in
P. Wang, and J. H. Moore, “Kragen: a knowledge graph-enhanced AI for Health Equity and Fairness: Leveraging AI to Address Social
rag framework for biomedical problem solving using large language Determinants of Health. Springer, 2024, pp. 147–158.
models,” Bioinformatics, vol. 40, no. 6, 2024. [102] S. Vatsal and A. Singh, “Can gpt redefine medical understanding?
[82] X. Liu, R. Wang, Y. Song, and L. Kong, “Gram: Generative retrieval evaluating gpt on biomedical machine reading comprehension,” arXiv
augmented matching of data schemas in the context of data security,” preprint arXiv:2405.18682, 2024.
in Proceedings of the 30th ACM SIGKDD Conference on Knowledge [103] S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz,
Discovery and Data Mining, 2024, pp. 5476–5486. and J. Weston, “Chain-of-verification reduces hallucination in large
[83] S.-A. Chen, L. Miculicich, J. M. Eisenschlos, Z. Wang, Z. Wang, language models,” arXiv preprint arXiv:2309.11495, 2023.
Y. Chen, Y. Fujii, H.-T. Lin, C.-Y. Lee, and T. Pfister, “Tablerag: [104] R. Zhao, X. Li, S. Joty, C. Qin, and L. Bing, “Verify-and-edit:
Million-token table understanding with language models,” arXiv A knowledge-enhanced chain-of-thought framework,” arXiv preprint
preprint arXiv:2410.04739, 2024. arXiv:2305.03268, 2023.
[84] Z. Yao, W. Qi, L. Pan, S. Cao, L. Hu, W. Liu, L. Hou, and J. Li, “Seakr: [105] Y. Zhai, H. Bai, Z. Lin, J. Pan, S. Tong, Y. Zhou, A. Suhr, S. Xie,
Self-aware knowledge retrieval for adaptive retrieval augmented gen- Y. LeCun, Y. Ma et al., “Fine-tuning large vision-language models
eration,” arXiv preprint arXiv:2406.19215, 2024. as decision-making agents via reinforcement learning,” arXiv preprint
[85] A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self-rag: Self- arXiv:2405.10292, 2024.
reflective retrieval augmented generation,” in NeurIPS 2023 Workshop [106] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.
on Instruction Tuning and Instruction Following, 2023. MIT press, 2018.
[86] X. Li, W. Xu, R. Zhao, F. Jiao, S. Joty, and L. Bing, “Can [107] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox-
we further elicit reasoning in llms? critic-guided planning with imal policy optimization algorithms,” arXiv preprint arXiv:1707.06347,
retrieval-augmentation for solving challenging tasks,” arXiv preprint 2017.
arXiv:2410.01428, 2024. [108] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin,
[87] R. Xu, H. Liu, S. Nag, Z. Dai, Y. Xie, X. Tang, C. Luo, Y. Li, J. C. Ho, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language
C. Yang et al., “Simrag: Self-improving retrieval-augmented generation models to follow instructions with human feedback,” Advances in
neural information processing systems, vol. 35, pp. 27 730–27 744, [130] H. Wang, T. Zariphopoulou, and X. Zhou, “Exploration versus exploita-
2022. tion in reinforcement learning: A stochastic control approach,” arXiv
[109] W. Shen, R. Zheng, W. Zhan, J. Zhao, S. Dou, T. Gui, Q. Zhang, preprint arXiv:1812.01552, 2018.
and X. Huang, “Loose lips sink ships: Mitigating length bias [131] C. Dann, Y. Mansour, M. Mohri, A. Sekhari, and K. Sridharan,
in reinforcement learning from human feedback,” arXiv preprint “Guarantees for epsilon-greedy reinforcement learning with function
arXiv:2310.05199, 2023. approximation,” in International conference on machine learning.
[110] B. Wang, R. Zheng, L. Chen, Y. Liu, S. Dou, C. Huang, W. Shen, PMLR, 2022, pp. 4666–4689.
S. Jin, E. Zhou, C. Shi et al., “Secrets of rlhf in large language models [132] V. Mnih, “Asynchronous methods for deep reinforcement learning,”
part ii: Reward modeling,” arXiv preprint arXiv:2401.06080, 2024. arXiv preprint arXiv:1602.01783, 2016.
[111] A. Havrilla, M. Zhuravinskyi, D. Phung, A. Tiwari, J. Tow, S. Bi- [133] M. Tokic, “Adaptive ε-greedy exploration in reinforcement learning
derman, Q. Anthony, and L. Castricato, “trlx: A framework for large based on value differences,” in Annual conference on artificial intelli-
scale reinforcement learning from human feedback,” in Proceedings gence. Springer, 2010, pp. 203–210.
of the 2023 Conference on Empirical Methods in Natural Language [134] N. Cesa-Bianchi, C. Gentile, G. Lugosi, and G. Neu, “Boltzmann
Processing, 2023, pp. 8578–8595. exploration done right,” Advances in neural information processing
[112] G. Cui, L. Yuan, N. Ding, G. Yao, W. Zhu, Y. Ni, G. Xie, Z. Liu, and systems, vol. 30, 2017.
M. Sun, “Ultrafeedback: Boosting language models with high-quality [135] R. Ma, J. Luijkx, Z. Ajanovic, and J. Kober, “Explorllm: Guiding
feedback,” 2023. exploration in reinforcement learning with large language models,”
[113] N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss, arXiv preprint arXiv:2403.09583, 2024.
A. Radford, D. Amodei, and P. F. Christiano, “Learning to summarize [136] Q. Zhao, H. Fu, C. Sun, and G. Konidaris, “Epo: Hierarchical
with human feedback,” Advances in Neural Information Processing llm agents with environment preference optimization,” arXiv preprint
Systems, vol. 33, pp. 3008–3021, 2020. arXiv:2408.16090, 2024.
[114] R. Munos, M. Valko, D. Calandriello, M. G. Azar, M. Rowland, Z. D. [137] H.-T. Nguyen and K. Satoh, “Balancing exploration and exploitation in
Guo, Y. Tang, M. Geist, T. Mesnard, A. Michi et al., “Nash learning llm using soft rllf for enhanced negation understanding,” arXiv preprint
from human feedback,” arXiv preprint arXiv:2312.00886, 2023. arXiv:2403.01185, 2024.
[115] C. Y. Liu, L. Zeng, J. Liu, R. Yan, J. He, C. Wang, S. Yan, Y. Liu, [138] F. Yang, P. Zhao, Z. Wang, L. Wang, J. Zhang, M. Garg, Q. Lin,
and Y. Zhou, “Skywork-reward: Bag of tricks for reward modeling in S. Rajmohan, and D. Zhang, “Empower large language model to
llms,” arXiv preprint arXiv:2410.18451, 2024. perform better on industrial domain-specific question answering,” arXiv
[116] H. Ivison, Y. Wang, V. Pyatkin, N. Lambert, M. Peters, P. Dasigi, preprint arXiv:2305.11541, 2023.
J. Jang, D. Wadden, N. A. Smith, I. Beltagy et al., “Camels in a [139] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and
changing climate: Enhancing lm adaptation with tulu 2,” arXiv preprint arXiv:2311.10702, 2023.
[117] L. Li, Y. Chai, S. Wang, Y. Sun, H. Tian, N. Zhang, and H. Wu, “Tool-augmented reward modeling,” arXiv preprint arXiv:2310.01045, 2023.
[118] A. Scheid, E. Boursier, A. Durmus, M. I. Jordan, P. Ménard, E. Moulines, and M. Valko, “Optimal design for reward modeling in rlhf,” arXiv preprint arXiv:2410.17055, 2024.
[119] L. Gao, J. Schulman, and J. Hilton, “Scaling laws for reward model overoptimization,” in International Conference on Machine Learning. PMLR, 2023, pp. 10835–10866.
[120] T. Kaufmann, P. Weng, V. Bengs, and E. Hüllermeier, “A survey of reinforcement learning from human feedback,” arXiv preprint arXiv:2312.14925, 2023.
[121] H. Lee, S. Phatale, H. Mansoor, K. R. Lu, T. Mesnard, J. Ferret, C. Bishop, E. Hall, V. Carbune, and A. Rastogi, “Rlaif: Scaling reinforcement learning from human feedback with ai feedback,” 2023.
[122] A. Li, Q. Xiao, P. Cao, J. Tang, Y. Yuan, Z. Zhao, X. Chen, L. Zhang, X. Li, K. Yang et al., “Hrlaif: Improvements in helpfulness and harmlessness in open-domain reinforcement learning from ai feedback,” arXiv preprint arXiv:2403.08309, 2024.
[123] Z. Xu, F. Jiang, L. Niu, Y. Deng, R. Poovendran, Y. Choi, and B. Y. Lin, “Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing,” arXiv preprint arXiv:2406.08464, 2024.
[124] Z. Wang, A. Bukharin, O. Delalleau, D. Egert, G. Shen, J. Zeng, O. Kuchaiev, and Y. Dong, “Helpsteer2-preference: Complementing ratings with preferences,” arXiv preprint arXiv:2410.01257, 2024.
[125] Y. Du, O. Watkins, Z. Wang, C. Colas, T. Darrell, P. Abbeel, A. Gupta, and J. Andreas, “Guiding pretraining in reinforcement learning with large language models,” in International Conference on Machine Learning. PMLR, 2023, pp. 8657–8677.
[126] M. Kwon, S. M. Xie, K. Bullard, and D. Sadigh, “Reward design with language models,” arXiv preprint arXiv:2303.00001, 2023.
[127] Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar, “Eureka: Human-level reward design via coding large language models,” arXiv preprint arXiv:2310.12931, 2023.
[128] J. Song, Z. Zhou, J. Liu, C. Fang, Z. Shu, and L. Ma, “Self-refined large language model as automated reward function designer for deep reinforcement learning in robotics,” arXiv preprint arXiv:2309.06687, 2023.
[129] W. Yuan, R. Y. Pang, K. Cho, S. Sukhbaatar, J. Xu, and J. Weston, “Self-rewarding language models,” arXiv preprint arXiv:2401.10020, 2024.
C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[140] A. Pal, D. Karkhanis, S. Dooley, M. Roberts, S. Naidu, and C. White, “Smaug: Fixing failure modes of preference optimisation with dpo-positive,” arXiv preprint arXiv:2402.13228, 2024.
[141] J. Wu, Y. Xie, Z. Yang, J. Wu, J. Gao, B. Ding, X. Wang, and X. He, arXiv preprint arXiv:2407.08639, 2024.
[142] D. Kim, Y. Kim, W. Song, H. Kim, Y. Kim, S. Kim, and C. Park, “sdpo: Don’t use your data all at once,” arXiv preprint arXiv:2403.19270, 2024.
[143] Z. Yang, F. Wan, L. Zhong, T. Shi, and X. Quan, “Weighted-reward preference optimization for implicit model fusion,” arXiv preprint arXiv:2412.03187, 2024.
[144] S. R. Chowdhury, A. Kini, and N. Natarajan, “Provably robust dpo: Aligning language models with noisy feedback,” arXiv preprint arXiv:2403.00409, 2024.
[145] W. Xiao, Z. Wang, L. Gan, S. Zhao, W. He, L. A. Tuan, L. Chen, H. Jiang, Z. Zhao, and F. Wu, “A comprehensive survey of datasets, theories, variants, and applications in direct preference optimization,” arXiv preprint arXiv:2410.15595, 2024.
[146] M. G. Azar, Z. D. Guo, B. Piot, R. Munos, M. Rowland, M. Valko, and D. Calandriello, “A general theoretical paradigm to understand learning from human preferences,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2024, pp. 4447–4455.
[147] H. Sun, Y. Shen, and J.-F. Ton, “Rethinking bradley-terry models in preference-based reward modeling: Foundations, theory, and alternatives,” arXiv preprint arXiv:2411.04991, 2024.
[148] Y. Wu, Z. Sun, H. Yuan, K. Ji, Y. Yang, and Q. Gu, “Self-play preference optimization for language model alignment,” arXiv preprint arXiv:2405.00675, 2024.
[149] C. Wang, Z. Zhao, C. Zhu, K. A. Sankararaman, M. Valko, X. Cao, Z. Chen, M. Khabsa, Y. Chen, H. Ma et al., “Preference optimization with multi-sample comparisons,” arXiv preprint arXiv:2410.12138, 2024.
[150] A. Fisch, J. Eisenstein, V. Zayats, A. Agarwal, A. Beirami, C. Nagpal, P. Shaw, and J. Berant, “Robust preference optimization through reward model distillation,” arXiv preprint arXiv:2405.19316, 2024.
[151] Y. Dong, K. Luo, X. Jiang, Z. Jin, and G. Li, “Pace: Improving prompt with actor-critic editing for large language model,” arXiv preprint arXiv:2308.10088, 2023.
[152] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving, “Fine-tuning language models from human preferences,” arXiv preprint arXiv:1909.08593, 2019.
[153] R. Meier and A. Mujika, “Open-ended reinforcement learning with neural reward functions,” Advances in Neural Information Processing Systems, vol. 35, pp. 2465–2479, 2022.
[154] W. Yao, H. Mi, and D. Yu, “Hdflow: Enhancing llm complex problem-solving with hybrid thinking and dynamic workflows,” arXiv preprint arXiv:2409.17433, 2024.
[155] G. Liu, K. Ji, R. Zheng, Z. Wu, C. Dun, Q. Gu, and L. Yan, “Enhancing multi-step reasoning abilities of language models through direct q-function optimization,” arXiv preprint arXiv:2410.09302, 2024.
[156] P. Aryan, “Llms as debate partners: Utilizing genetic algorithms and adversarial search for adaptive arguments,” arXiv preprint arXiv:2412.06229, 2024.
[157] X. Wu, S.-h. Wu, J. Wu, L. Feng, and K. C. Tan, “Evolutionary computation in the era of large language model: Survey and roadmap,” arXiv preprint arXiv:2401.10034, 2024.
[158] H. Yin, A. V. Kononova, T. Bäck, and N. van Stein, “Controlling the mutation in large language models for the efficient evolution of algorithms,” arXiv preprint arXiv:2412.03250, 2024.
[159] Q. Guo, R. Wang, J. Guo, B. Li, K. Song, X. Tan, G. Liu, J. Bian, and Y. Yang, “Connecting large language models with evolutionary algorithms yields powerful prompt optimizers,” arXiv preprint arXiv:2309.08532, 2023.
[160] S. Brahmachary, S. M. Joshi, A. Panda, K. Koneripalli, A. K. Sagotra, H. Patel, A. Sharma, A. D. Jagtap, and K. Kalyanaraman, “Large language model-based evolutionary optimizer: Reasoning with elitism,” arXiv preprint arXiv:2403.02054, 2024.
[161] W. Chao, J. Zhao, L. Jiao, L. Li, F. Liu, and S. Yang, “A match made in consistency heaven: when large language models meet evolutionary algorithms,” arXiv preprint arXiv:2401.10510, 2024.
[162] H. Hao, X. Zhang, and A. Zhou, “Large language models as surrogate models in evolutionary algorithms: A preliminary study,” arXiv preprint arXiv:2406.10675, 2024.
[163] X. Huang, W. Liu, X. Chen, X. Wang, D. Lian, Y. Wang, R. Tang, and E. Chen, “Wese: Weak exploration to strong exploitation for llm agents,” arXiv preprint arXiv:2404.07456, 2024.