
arXiv:2508.06433v2 [cs.CL] 13 Aug 2025

Memp: Exploring Agent Procedural Memory

Runnan Fang♠♡*, Yuan Liang♠*, Xiaobin Wang♡, Jialong Wu♡,
Shuofei Qiao♠♡, Pengjun Xie♡, Fei Huang♡, Huajun Chen♠, Ningyu Zhang♠†
♠ Zhejiang University   ♡ Alibaba Group
{rolnan,zhangningyu}@zju.edu.cn
* Equal Core Contributors.  † Corresponding Author.

Abstract

Large Language Models (LLMs) based agents excel at diverse tasks, yet they suffer from brittle procedural memory that is manually engineered or entangled in static parameters. In this work, we investigate strategies to endow agents with a learnable, updatable, and lifelong procedural memory. We propose Memp, which distills past agent trajectories into both fine-grained, step-by-step instructions and higher-level, script-like abstractions, and explore the impact of different strategies for Build, Retrieval, and Update of procedural memory. Coupled with a dynamic regimen that continuously updates, corrects, and deprecates its contents, this repository evolves in lockstep with new experience. Empirical evaluation on TravelPlanner and ALFWorld shows that as the memory repository is refined, agents achieve steadily higher success rates and greater efficiency on analogous tasks. Moreover, procedural memory built from a stronger model retains its value: migrating the procedural memory to a weaker model yields substantial performance gains.

Figure 1: With procedural memory, agents can improve both the success rate (accuracy ↑) and execution efficiency (steps ↓) when solving similar tasks.
Introduction
As large language models (LLMs) grow ever more powerful, LLM-based agents augmented by their own reasoning and external tools are taking on increasingly sophisticated work (Zhao et al. 2023; Wang et al. 2024a; Xi et al. 2025; Qiao et al. 2023). No longer mere assistants, these agents now trawl the web for elusive insights and weave them into comprehensive, publication-ready reports, like Deep Research (OpenAI 2025; x.ai 2025) and WebDancer (Wu et al. 2025a). Moreover, they can handle complex data analyses (Lan et al. 2025; Ifargan et al. 2025; Ou et al. 2025), navigate multi-step GUI workflows (Luo et al. 2025; Qin et al. 2025), and sustain long-horizon, tool-rich interactions (Yao et al. 2025; Barres et al. 2025; Chen et al. 2025; Fang et al. 2025; Gur et al. 2023) with precision. Yet executing such intricate, long-horizon tasks demands dozens of steps and protracted runtimes. Along the way, unpredictable external events such as network glitches, UI changes, and shifting data schemas can derail the entire process. Restarting from scratch every time is a punishing ordeal for present-day agents. Beneath their surface diversity, many complex tasks share deep structural commonalities and a similar environment. Instead of starting fresh each time, an agent should extract its experience from past successes. By turning earlier trajectories into reusable templates, like patterns of reasoning, tool sequences, and recovery tactics, it can progress step by step, learning from every failure and success, until even the most convoluted missions become routine.

The capacity to distill, chronicle, and re-apply lessons from one's own experiential trajectory is the bedrock of human learning and the pivotal gateway through which an agent ascends toward self-directed refinement (Liu et al. 2025a; Sumers et al. 2023a; Li et al. 2023). Procedural memory (Gupta and Cohen 2002; Cohen and Squire 1980) silently compiles habitual skills into executable subroutines, enabling unconscious, fluent action. While contemporary agents built on LLMs can compose short action plans or call external tools, their procedural knowledge is either hand-crafted, stored as brittle prompt templates, or implicitly entangled in model parameters that are expensive to update. Existing memory-augmented frameworks such as LangGraph (Mavroudis 2024), AutoGPT (Yang, Yue, and He 2023), or agent cognitive architectures like Memory Bank (Zhong et al. 2024a; Sumers et al. 2023b) and Soar (Laird 2022) provide coarse abstractions (buffers, rule chunks, production systems) but leave the optimization of procedural memory life-cycle operations (how skills are built, indexed, patched, and eventually pruned) largely unexamined. Consequently, there is no principled way to quantify how efficiently an agent evolves its procedural repertoire or to guarantee that new experiences improve rather than erode performance.

To close this gap, we present Memp, a task-agnostic framework that treats procedural memory as a first-class optimization object. The core exploration of Memp lies in how different strategies for memory construction, retrieval, and updating affect overall performance. During the construction phase, we follow the majority of traditional memory architectures and agent-based memory designs by leveraging either the full historical trajectory or explicit guidelines to guide the process. In the retrieval phase, we experiment with various key-building strategies, such as query-vector matching and keyword-vector matching, to investigate how procedural memory can be constructed more precisely. Unlike prior memory mechanisms or learning-from-experience approaches, Memp introduces diverse procedural-memory update strategies: in the realm of agents, memory updating is crucial for adapting to dynamic environments. By incorporating diverse strategies like ordinary addition, validation filtering, reflection, and dynamic discarding, agents can efficiently manage their knowledge base. This ensures they stay updated with new information, discard outdated data, and optimize memory resources. Such strategies enhance learning efficiency, improve decision-making quality, and boost adaptability, allowing agents to perform optimally in various tasks and scenarios.

We instantiate Memp on top of strong LLMs (GPT-4o, Claude, and Qwen) and evaluate on two diverse domains: the long-horizon housework benchmark ALFWorld (Shridhar et al. 2021) and the long-horizon information-seeking task TravelPlanner (Xie et al. 2024). On these two benchmark datasets that rigorously evaluate agent capabilities, we demonstrate that constructing and retrieving procedural memory during training empowers an agent to distill and reuse its prior experience. When this memory is exploited at test time, the agent's task accuracy rises and, compared with tackling each instance in isolation, it eliminates most fruitless exploration on unfamiliar tasks, yielding substantial reductions in both step count and token consumption. Further, by equipping the agent with a set of memory-update mechanisms, we allow it to build and refine its procedural memory while acting in the test environment. This endows the agent with a continual, almost linear mastery of the task. Extensive ablations reveal that procedural memory also scales gracefully and transfers effectively to new, related tasks.

Related Works

Memory in Language Agents. Memory is a foundational component in language agents, enabling them to retain and utilize past information across multiple timescales, including short-term, episodic, and long-term memory, to enhance their performance and adaptability (Zhou et al. 2023, 2024; Zhang et al. 2024; Liu et al. 2025a; Li et al. 2025). These systems aim to mimic aspects of human memory to improve coherence, personalization, and learning capabilities (Chhikara et al. 2025; Wu et al. 2025b; Xia et al. 2025). Current approaches include end-to-end memory systems (Yu et al. 2025; Zhou et al. 2025), external memory systems (Chhikara et al. 2025; Zhong et al. 2024b), and hierarchical memory structures (Hu et al. 2024a; Xu et al. 2025). These methods involve encoding and storing information in various formats, using retrieval mechanisms like vector embeddings and semantic search, and implementing memory updating and forgetting strategies to maintain relevance and efficiency. Despite its importance, memory in multi-turn agent interactions remains underexplored, and enabling agents to effectively learn and utilize memory across trajectories poses a significant challenge. Procedural memory is a type of long-term memory that involves the retention of procedures and skills, such as typing or riding a bike, which are performed automatically without conscious thought. The agent utilizes procedural memory to internalize and automate repetitive tasks, decision-making processes, and interaction patterns, leading to more efficient and context-aware responses over time. Although there have been several works, such as Voyager (Wang et al. 2023), AWM (Wang et al. 2024b), and AutoManual (Chen et al. 2024), that utilize procedural memory to enhance agents' capabilities on similar tasks, there is still a lack of systematic analysis on how to construct, retrieve, and update such procedural memory as in (Wu et al. 2024). Therefore, our work mainly focuses on exploring how to build an effective procedural memory system for agents performing cross-trajectory tasks.

Learning from Experience. LLM-based agents that learn from experience continuously improve their decision-making capabilities through interaction with environments and utilization of past experiences (Tan et al. 2025; Tang et al. 2025; Zhou et al. 2025; Qiao et al. 2025; Su et al. 2025; Wang et al. 2024b). This approach is crucial for developing adaptive and intelligent agents capable of handling dynamic real-world scenarios, as it allows them to optimize behaviors, reduce manual programming needs, and enhance performance across various tasks (Zheng et al.; Liu et al. 2025c; Wang et al. 2025). Agents typically employ mechanisms such as reinforcement learning (Lu et al. 2025; Dong et al. 2025), experience replay (Feng et al. 2025; Liu et al. 2025b), imitation learning (Sun et al. 2024; Yang et al. 2024b), memory management (Hou, Tamoto, and Miyashita 2024; Hu et al. 2024b), and multi-agent learning to achieve this. However, current methods face limitations including low sample efficiency, poor generalization across tasks, catastrophic forgetting when learning new information, and very few provisions for memory update. The key distinction of our work lies in systematically investigating optimal strategies for the construction, retrieval, and update modules of an agent's procedural knowledge; during the update phase, we enhance the agent's capabilities by maintaining an editable repository of procedural knowledge. Additionally, collecting high-quality training data can be challenging and may introduce biases. Addressing these limitations is essential for advancing the capabilities of LLM-based agents and ensuring their effective application in real-world contexts.
Preliminary

When an agent influences its external environment by invoking external tools or executing prescribed actions, and iteratively refines its behavior over multiple rounds to accomplish a complex multi-step objective, this paradigm can be modeled as a Markov Decision Process (MDP) (Puterman 1990). Under this view, at each discrete time step t, the agent, situated in state s_t ∈ S, chooses an action a_t ∈ A according to its policy π(a_t | s_t), where A is the action space of the task. The environment then transitions to a new state s_{t+1} ∈ S and emits an observation O_t. Consequently, the entire interaction trajectory may be compactly expressed as:

    τ = (s_0, a_0, o_1, s_1, a_1, o_2, ..., s_T),    (1)

where τ is the complete exploration trajectory of this task. Moreover, a reward function R will evaluate the task's completion r within this environment env by assigning a score based on the final state s_T or the entire trajectory τ:

    r = R(env, s_T, τ) ∈ [0, 1].    (2)
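As a point of reference, a minimal interaction loop matching this formulation is sketched below in Python; the env and policy interfaces (reset, step, score) are illustrative placeholders and not APIs of the benchmarks used later.

```python
def rollout(env, policy, max_steps: int = 40):
    """Collect one trajectory tau = (s_0, a_0, o_1, s_1, ...) and its reward r, as in Eqs. (1)-(2)."""
    state = env.reset()
    trajectory = [state]
    for _ in range(max_steps):
        action = policy(state)                 # sample a_t from pi(a_t | s_t), e.g. one LLM call
        observation, state, done = env.step(action)
        trajectory += [action, observation, state]
        if done:
            break
    reward = env.score(trajectory)             # r = R(env, s_T, tau) in [0, 1]
    return trajectory, reward
```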
Although approaches resembling Markov Decision Processes inevitably contain erroneous actions and exploratory attempts, the contextual information they generate becomes valuable for decision-making as the model's reasoning and reflective capabilities improve. Nevertheless, this benefit comes at a high test-time cost, both in time and in token consumption. When facing an entirely new and complex environment, many actions (or tokens) are spent simply understanding the environment and the task itself. This leads to redundancy when similar tasks are executed within the same environment: the agent has already acquired partial procedural knowledge about the environment or task during earlier episodes, yet fails to transfer that knowledge effectively to subsequent tasks.

By shifting from parallel task completion to sequential task completion, the agent can learn and distill experience from earlier tasks, thereby reducing repetitive exploration. Inspired by human procedural memory, we propose to equip the agent with a procedural memory module. This module transforms the conventional policy π(a_t | s_t) into π_mp(a_t | s_t), where mp is the agent's learned procedural memory.

Agent Procedural Memory

Procedural memory is the type of long-term memory responsible for knowing how to perform tasks and skills, such as typing or riding a bike. By mastering this type of procedural memory, humans avoid the need to relearn the process each time. For an agent, this means that for a task trajectory τ and its reward r, a memory mp is constructed by a builder B, thereby achieving the acquisition of memory, namely

    Mem = Σ_{t=1}^{T} mp_t,  where mp_t = B(τ_t, r_t),    (3)

where Mem is the procedural memory library acquired by the agent over the T tasks. After constructing the procedural memory library, when facing a new task t_new, we need a good procedural memory retriever to recall a memory that fits t_new. Generally speaking, we would choose the task t ∈ T that is most similar to t_new, because similar experiences are more helpful for the agent to complete the new task:

    m_retrieved = argmax_{mp_i ∈ Mem} S(t_new, t_i).    (4)

As we use cosine similarity over a vector embedding model ϕ of the task in our experiments, the retrieval process becomes:

    m_retrieved = argmax_{mp_i ∈ Mem} (ϕ(t_new) · ϕ(t_i)) / (‖ϕ(t_new)‖ ‖ϕ(t_i)‖).    (5)
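To make Eqs. (3)-(5) concrete, here is a minimal Python sketch of a procedural memory store with cosine-similarity retrieval. The embed callable (mapping a task description to a vector) and the text produced by the builder are assumptions standing in for components the paper leaves unspecified.

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors, as in Eq. (5)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

class ProceduralMemory:
    """A library Mem of (key embedding, memory text) pairs built from past trajectories."""

    def __init__(self, embed):
        self.embed = embed       # assumed: callable mapping str -> np.ndarray
        self.entries = []        # list of (key_vector, memory_text)

    def add(self, key_text: str, memory_text: str) -> None:
        """Store one memory mp_t = B(tau_t, r_t) under an embedded key."""
        self.entries.append((self.embed(key_text), memory_text))

    def retrieve(self, new_task: str, top_k: int = 1):
        """Return the top-k memories whose keys are most similar to the new task."""
        query = self.embed(new_task)
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        return [text for _, text in ranked[:top_k]]
```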
Moreover, as the number of completed tasks continuously increases, simply augmenting the agent's procedural memory is inconsistent with common sense. A well-designed procedural memory system should have a reasonable update mechanism; that is, it should dynamically perform addition, deletion, modification, and retrieval based on the task execution context.

Let M(t) denote the agent's procedural memory at time t, and τ_t represent the set of tasks completed up to time t. Then, the update mechanism can be modeled as a function U that takes the current procedural memory and task execution feedback to produce the updated memory:

    M(t + 1) = U(M(t), E(t), τ_t),    (6)

where E(t) encapsulates the execution feedback (e.g., success, failure, performance metrics). A more sophisticated implementation of U could be represented as:

    U = Add(M_new) ⊖ Remove(M_obso) ⊕ Update(M_exist),    (7)

where M_new represents new procedural memory to be added, M_obso indicates procedural memory to be removed, and M_exist denotes existing memories to be revised based on the execution feedback E(t). This formula captures the essential add, delete, and modify operations within the update mechanism.
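The following is a minimal sketch of the update function U in Eqs. (6)-(7), under the assumption that execution feedback reduces to a per-task success flag plus the trajectory, and that build and revise are callables (e.g., LLM prompts) that distill or correct a memory; none of these names come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class MemoryEntry:
    key: str            # task description used for retrieval
    value: str          # distilled procedural knowledge
    failures: int = 0   # how often this entry misled the agent

@dataclass
class Feedback:
    task: str
    trajectory: str
    success: bool
    used_entry: Optional[MemoryEntry] = None   # memory retrieved for this task, if any

def update_memory(memory: List[MemoryEntry],
                  feedback: List[Feedback],
                  build: Callable[[str, str], str],
                  revise: Callable[[str, str], str],
                  max_failures: int = 3) -> List[MemoryEntry]:
    """One application of U: Add new memories, Update misleading ones, Remove obsolete ones."""
    for fb in feedback:
        if fb.success:
            # Add(M_new): distill the successful trajectory into a new entry.
            memory.append(MemoryEntry(key=fb.task, value=build(fb.task, fb.trajectory)))
        elif fb.used_entry is not None:
            # Update(M_exist): revise the retrieved memory with the erroneous trajectory.
            fb.used_entry.value = revise(fb.used_entry.value, fb.trajectory)
            fb.used_entry.failures += 1
    # Remove(M_obso): discard entries that have repeatedly led to failure.
    return [m for m in memory if m.failures < max_failures]
```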

Experiment

In this section, we will introduce the Procedural Memory framework in detail (Figure 2), covering the storage, retrieval, and update modules of memory, as well as analyzing which strategies perform better within each module.

Experimental Settings

Datasets. For our experiments, we adapt the TravelPlanner (Xie et al. 2024) and ALFWorld (Shridhar et al. 2021) benchmarks. TravelPlanner is a benchmark designed to evaluate agents' ability to use tools and perform complex planning under intricate constraints. In contrast, ALFWorld comprises household tasks. In each interaction round, the agent outputs an action, and the environment responds with textual feedback describing the resulting state. This process repeats for multiple turns until the task is completed or the maximum number of rounds is reached. ALFWorld includes a test split to evaluate the agent's generalization ability.
Figure 2: The procedural memory framework consists of Build, Retrieve, and Update, which respectively involve encoding stored procedural memory, forming new procedural memories, and modifying existing ones in light of new experiences. (In the figure, each memory is stored as a key, e.g., "Put a clean cup in microwave", paired with a value such as "I've successfully solved this task. Next time when I meet new similar tasks, I know that I should first do ..., and then do ...".)

Backbones. In our experiments, we benchmarked our procedural memory on three base models. Specifically, we adopt the two proprietary frontier models that have consistently dominated public leaderboards, OpenAI's GPT-4o (OpenAI 2022) and Anthropic's Claude (Anthropic 2022), and complement them with the open-sourced Qwen2.5-72B-Instruct (Yang et al. 2024a). The first two provide state-of-the-art closed-source performance, while the third allows us to verify that our findings generalize beyond proprietary systems and remain valid in the open-source regime.

Evaluation. For the ALFWorld dataset, task completion is evaluated by the execution environment. After a task is completed or the maximum number of execution steps is reached, the environment provides a reward of 0 or 1 to indicate whether the task has been successfully completed. For TravelPlanner, we conduct experiments on the test set in a two-stage mode. After multiple rounds of interaction to obtain the travel trajectory and the final plan, GPT-4o converts the travel plan into a specified JSON format. The converted plan is then compared with the gold standard to obtain scores for both Common Sense and Hard Constraint.

Memory Storage & Retrieval

Procedural knowledge is typically stored in two main formats: (1) trajectories are kept verbatim, round by round, in memory, or (2) high-level abstractions are extracted from these trajectories and then stored. Once a similar procedural memory is retrieved, it is appended to the task as part of the context, serving as prior knowledge to assist the model in completing the task.

Inspired by this, we designed the following experimental conditions (a sketch of how these conditions assemble the memory context follows the list):

• No Memory: The model tackles the assigned task in a ReAct fashion without any external memory.
• Trajectory: We first filter the gold trajectories from the training set and store them. At inference time, the system retrieves the top-k trajectories whose query vectors are most similar to the current task's vector, supplying them as procedural memories before execution.
• Script: The model analyzes and summarizes the gold trajectories from the training set, distilling them into abstract procedural knowledge that is provided as a prompt before each task.
• Proceduralization: This condition combines the full retrieved trajectories with the high-level script generated by the model, integrating both concrete examples and abstract guidance as the procedural memory.
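The snippet below is an illustrative sketch, not the paper's actual prompts: it shows one way the four conditions above could assemble the context prepended to each task, with summarize_to_script standing in for an LLM call that abstracts gold trajectories into a script.

```python
from typing import Callable, List

def build_context(condition: str,
                  task: str,
                  retrieved: List[str],
                  summarize_to_script: Callable[[List[str]], str]) -> str:
    """Assemble the prompt prefix for the No Memory / Trajectory / Script / Proceduralization settings."""
    if condition == "no_memory":
        return task                                    # plain ReAct, no external memory
    if condition == "trajectory":
        memory = "\n\n".join(retrieved)                # verbatim top-k retrieved trajectories
    elif condition == "script":
        memory = summarize_to_script(retrieved)        # abstract, script-like guideline
    elif condition == "proceduralization":
        memory = summarize_to_script(retrieved) + "\n\n" + "\n\n".join(retrieved)
    else:
        raise ValueError(f"unknown condition: {condition}")
    return f"Procedural memory:\n{memory}\n\nTask:\n{task}"
```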
Model               Granularity         | TravelPlanner              | ALFWorld
                                        | #CS ↑   #HC ↑   Steps ↓    | Dev ↑   Test ↑   Steps ↓
GPT-4o              No Memory           | 71.93   12.88   17.84      | 39.28   42.14    23.76
                    Script              | 72.08    5.50   15.79      | 66.67   56.43    18.52
                    Trajectory          | 76.02    8.25   14.64      | 67.17   74.29    16.49
                    Proceduralization   | 79.94    9.76   14.62      | 87.14   77.86    15.01
Claude-3.5-sonnet   No Memory           | 63.49   33.06   18.84      | 39.20   34.97    24.12
                    Script              | 62.08   29.61   19.21      | 56.13   53.59    19.38
                    Trajectory          | 65.76   29.61   17.72      | 69.28   71.78    15.97
                    Proceduralization   | 65.46   30.14   15.29      | 82.50   74.72    15.79
Qwen2.5-72b         No Memory           | 56.57    7.34   18.32      | 44.91   41.25    21.38
                    Script              | 58.59    7.34   18.53      | 66.24   61.88    17.13
                    Trajectory          | 63.41   12.66   18.12      | 64.49   69.57    16.40
                    Proceduralization   | 63.82   14.19   17.94      | 85.71   77.19    15.32

Table 1: Results on Build Policy. #CS and #HC denote Commonsense and Hard Constraint, respectively. ↑ indicates that higher values are better, and ↓ that lower values are better. The best results among all methods with similar settings are bolded, and the second-best results are underlined.

As shown in Table 1, all memory construction methods outperform the no-memory baseline, achieving higher scores on both datasets while also reducing the number of steps required. This indicates that procedural memory built during training is beneficial when tasks are tackled directly during testing. Furthermore, we observe that the approach of abstracting trajectories into scripts during training yields relatively better performance on the ALFWorld test set compared to the dev set. Conversely, trajectories that utilize complete execution traces as procedural memory achieve higher scores on the dev set, suggesting that scripts are more capable of generalizing to different test tasks, while trajectories are better suited for scenarios involving tasks similar to those already completed. By combining procedural knowledge from both methods, employing abstracted guidelines along with concrete execution trajectories, we attain the optimal performance.

After converting a set of completed trajectories into procedural memory, the next critical challenge is to retrieve the most accurate and relevant procedural knowledge when a new task arrives. We have designed several different key construction methods for memory storage to facilitate subsequent vector-based matching and retrieval (a code sketch follows the list):

• Random Sample: Does not utilize keys for vector retrieval; instead, randomly extracts a few memories from procedural memory.
• Query: Employs the query description as the key for storage, leveraging the semantic similarity of queries for retrieval.
• AveFact: We apply a large model to extract keywords from the task's query, then compute the average similarity across matched keywords for retrieval.
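As an illustration of the three key-construction strategies: the storage layout (per-entry query_key and keyword_keys embeddings) and the embed and extract_keywords helpers are assumptions of this sketch rather than details given in the paper.

```python
import random
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def retrieve(strategy, new_task, memory, embed, extract_keywords, top_k=1):
    """memory: list of dicts holding 'query_key' and 'keyword_keys' embeddings plus a 'value' text."""
    if strategy == "random_sample":
        return random.sample(memory, k=min(top_k, len(memory)))      # no keys used at all
    if strategy == "query":
        q = embed(new_task)                                           # Key=Query: match whole queries
        return sorted(memory, key=lambda e: cosine(q, e["query_key"]), reverse=True)[:top_k]
    if strategy == "avefact":
        kws = [embed(k) for k in extract_keywords(new_task)]          # Key=AveFact: keyword keys
        def avefact_score(e):
            # average, over the new task's keywords, of the best match among stored keyword keys
            return sum(max(cosine(k, s) for s in e["keyword_keys"]) for k in kws) / len(kws)
        return sorted(memory, key=avefact_score, reverse=True)[:top_k]
    raise ValueError(f"unknown strategy: {strategy}")
```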
During the retrieval process, we evaluate the similarity by calculating the cosine similarity between the corresponding vectors. Our experiments show that these different retrieval strategies produce varying results. Specifically, compared to random sampling, employing the query-based and AveFact methods for precise retrieval significantly improves performance. The query-based approach benefits from capturing semantic contexts, enabling more accurate matches. The AveFact method, by extracting key features and averaging their similarities, effectively focuses on core task elements, leading to better retrieval efficacy. Overall, our findings suggest that incorporating semantic understanding and key feature extraction in retrieval strategies substantially enhances memory access accuracy and the effectiveness of downstream task performance.

Model               Policy          | #CS ↑   #HC ↑   Steps ↓
GPT-4o              No Memory       | 71.93   12.88   17.84
                    Random Sample   | 74.59    6.72   15.12
                    Key=Query       | 73.38    8.95   15.44
                    Key=AveFact     | 76.02    8.25   14.64
Claude-3.5-sonnet   No Memory       | 63.49   33.06   18.84
                    Random Sample   | 63.99   29.91   17.93
                    Key=Query       | 64.93   28.56   17.60
                    Key=AveFact     | 65.76   29.61   17.72
Qwen2.5-72b         No Memory       | 56.57    7.34   18.32
                    Random Sample   | 59.76    8.43   18.31
                    Key=Query       | 61.71   11.97   18.54
                    Key=AveFact     | 63.41   12.66   18.12

Table 2: Results on Retrieve Policy on TravelPlanner.

Memory Update

While many prior efforts have focused on developing reusable procedural knowledge, enabling models to learn from prior experiences rather than solving each test task in isolation, most existing memory update methods remain quite rudimentary. Typically, they simply append newly acquired memories to the existing store, a so-called "merge" strategy. In this work, we explore several online memory-update mechanisms to identify which dynamic strategy delivers the best performance on our tasks. Beyond end-to-end evaluation metrics, we also analyze how both accuracy and efficiency evolve as the number of executed tasks increases, explicitly measuring the benefits conferred by our procedural memory.

To facilitate systematic comparison, we designed several memory-update scenarios. In each, the agent's episodic memory is refreshed after every t test-set tasks. The specific update strategies are as follows (a code sketch follows the list):

• Vanilla Memory Update: After every t tasks, all trajectories from these tasks are consolidated into procedural memories and directly appended to the memory bank.
• Validation: After every t tasks, only the trajectories of successfully completed tasks are retained and converted into procedural memories for storage.
• Adjustment: When a retrieved procedural memory results in a failed execution, the erroneous trajectory is combined with the original memory and then revised in place, yielding an updated procedural memory.
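A sketch of the three update strategies, reusing the MemoryEntry type, the feedback records, and the build/revise callables from the earlier update sketch; it is meant to show the control flow only, not the exact prompts or bookkeeping used in the experiments.

```python
def vanilla_update(memory, feedback, build):
    # Vanilla: consolidate every trajectory from the last batch of t tasks and append it.
    for fb in feedback:   # each fb carries .task, .trajectory, .success, .used_entry (see earlier sketch)
        memory.append(MemoryEntry(key=fb.task, value=build(fb.task, fb.trajectory)))
    return memory

def validation_update(memory, feedback, build):
    # Validation: keep only trajectories of successfully completed tasks.
    for fb in feedback:
        if fb.success:
            memory.append(MemoryEntry(key=fb.task, value=build(fb.task, fb.trajectory)))
    return memory

def adjustment_update(memory, feedback, build, revise):
    # Adjustment: when a retrieved memory led to a failure, merge the erroneous trajectory
    # into that memory and revise it in place; successes are stored as in Validation.
    for fb in feedback:
        if fb.success:
            memory.append(MemoryEntry(key=fb.task, value=build(fb.task, fb.trajectory)))
        elif fb.used_entry is not None:
            fb.used_entry.value = revise(fb.used_entry.value, fb.trajectory)
    return memory
```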
Figure 3: Reward gain and steps reduction vs. trajectory group index with procedural memory (left panel: reward improvement over the no-memory baseline; right panel: steps reduction; curves: Vanilla, Validation, and Adjustment updates).

As depicted in Figure 3, we systematically divided the tasks within our testbed into several distinct groups, with each group comprising a diverse set of individual tasks. Upon the completion of tasks within each group, we employed the previously described strategies to construct, store, and update the procedural memory. The experimental results reveal a clear trend: as we sequentially progress through more groups and iteratively refresh the memory, all strategies contribute to improved performance on subsequent tasks. Specifically, this is reflected not only in higher overall scores but also in a reduction in the number of steps required to complete the tasks.

A closer comparison of different strategies exposes significant disparities in their effectiveness. Notably, the reflexion-based update mechanism stands out as the most effective approach. By the time the final group of tasks is reached, this method delivers a substantial advantage: it surpasses the second-best strategy by an impressive margin of +0.7 points and achieves a reduction of 14 steps. These improvements underscore the value of continually updating the memory, particularly when the update is guided by an error-correction mechanism embedded in the reflexion process.

Figure 4: (a) Transfer result of GPT-4o's procedural memory to Qwen2.5-14B-Instruct and its performance on the TravelPlanner dataset. (b) The relationship between the quantity of procedural memory retrieved and GPT-4o's performance on the ALFWorld dataset.

Analysis

Procedural Memory Boosts Accuracy and Cuts Trials. Figure 5 presents a case study demonstrating how Procedural Memory enhances both accuracy and efficiency. In the absence of Procedural Memory, when facing a complex task that has not been performed before, there are usually two situations. In the first scenario (left), the model repeatedly attempts illegal or incorrect actions, causing the context to become increasingly complex and eventually exceeding the model's understanding capacity. In the second scenario (middle), after multiple attempts, the model completes the task, but at a cost significantly higher than the optimal path. In contrast, once Procedural Memory is available for similar tasks, the model spends less time on trial and error. For example, in the egg-heating problem, Procedural Memory can indicate the approximate location of the egg, saving the aimless search. During the heating process, it provides clear guidance, ensuring that the heating actions are performed consecutively and correctly, thereby allowing the task to be completed in fewer steps.
Figure 5: Comparison of trajectories with and without procedural memory on the task "heat some egg and put it in garbagecan"; procedural memory shortens the process by 9 steps and saves 685 tokens. Without memory, one trajectory wanders through the countertop, cabinets, toaster, and stoveburner, fails, and costs 27 steps and 3,635 tokens; a second memory-free trajectory eventually succeeds but needs 23 steps and 3,274 tokens. With a memory recalling that a similar task was solved by taking the egg from the fridge, heating it in the microwave, and then placing it at the designated location, the agent succeeds in 14 steps and 2,589 tokens.

Procedural memory exhibits transferability from strong models to weaker ones. For a procedural memory constructed from a strong model in an offline memory library, we aim to verify whether this form of procedural memory can be effectively transferred to other models, or even weaker models. This exploration underscores the significance of memory transfer, as it could potentially enhance the adaptability and efficiency of various models by leveraging the knowledge and experience encapsulated within the strong model's memory structure. As shown in Figure 4 (a), procedural memory generated by GPT-4o was employed by Qwen2.5-14B. On the TravelPlanner benchmark, the 14-billion-parameter model raised its task completion rate by 5% and cut the average number of steps by 1.6. Similar gains, both in success rate and trajectory length, appeared on ALFWorld. These outcomes confirm that procedural knowledge from a stronger model can be distilled into a reusable memory bank and transferred to a smaller model with minimal overhead, giving that smaller model a clear boost in task-solving ability. Moreover, by leveraging procedural memory transfer, we can rapidly migrate the experiential knowledge that one model has acquired to another, which is highly beneficial for agents as they adapt to new tasks with greater efficiency and robustness.

Scaling Memory Retrieval Improves Agent Performance. While our main experiment has already demonstrated that procedural memory improves an agent's task accuracy and reduces the number of steps required, vector-based storage and retrieval confer an advantage over human procedural memory: they can be scaled both in total capacity and in the number of memories retrieved. To investigate whether an agent's performance continues to rise as the procedural-memory store and the number of retrieved memories increase, we designed a set of follow-up experiments. As shown in Figure 4 (b), as the number of retrieved procedural memories increases, the agent's performance also improves steadily, exhibiting an upward trend followed by a plateau. However, retrieving too many memories can lead to a decline in the agent's performance. This is because excessive retrieval can affect the context length and also introduce less accurate procedural memories, which can interfere with the overall effectiveness.

Conclusion and Future Work

We introduce Memp, a task-agnostic framework that elevates procedural memory to a core optimization target in LLM-based agents. By systematically studying strategies for memory construction, retrieval, and updating, Memp enables agents to distill, reuse, and refine their own past experiences across diverse, long-horizon tasks. Empirical results on housework automation and information-seeking benchmarks show that leveraging procedural memory significantly boosts task success rates and efficiency. Beyond improving individual episodes, Memp supports continual learning and robust generalization, marking a step toward self-improving, resilient agents.

In our experiments, Memp has achieved promising results in both construction and retrieval. Moving forward, we plan to enhance this work in several ways. Firstly, we will develop more diverse retrieval strategies. The current approach involves constructing different keys for vector-based retrieval; however, traditional methods like BM25 could also be explored to retrieve precise memories more effectively.
Secondly, in Memp we currently rely on the standard reward signals provided by the benchmark. However, in real-world scenarios, many tasks do not have clear reward signals, making it difficult for the agent to determine whether a task has been completed successfully. In such cases, using a large language model (LLM) as a judge to assess task completion could be a viable solution. This would transform the agent's lifecycle into a continuous loop of executing tasks, self-assessing completion, building memories, and then proceeding to new tasks.
References

Anthropic. 2022. Claude 3.5 Sonnet System Card.
Barres, V.; Dong, H.; Ray, S.; Si, X.; and Narasimhan, K. 2025. τ2-Bench: Evaluating Conversational Agents in a Dual-Control Environment.
Chen, C.; Hao, X.; Liu, W.; Huang, X.; Zeng, X.; Yu, S.; Li, D.; Wang, S.; Gan, W.; Huang, Y.; et al. 2025. ACEBench: Who Wins the Match Point in Tool Usage?
Chen, M.; Li, Y.; Yang, Y.; Yu, S.; Lin, B.; and He, X. 2024. AutoManual: Constructing Instruction Manuals by LLM Agents via Interactive Environmental Learning. arXiv:2405.16247.
Chhikara, P.; Khant, D.; Aryan, S.; Singh, T.; and Yadav, D. 2025. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory.
Cohen, N. J.; and Squire, L. R. 1980. Preserved learning and retention of pattern-analyzing skill in amnesia: Dissociation of knowing how and knowing that.
Dong, G.; Chen, Y.; Li, X.; Jin, J.; Qian, H.; Zhu, Y.; Mao, H.; Zhou, G.; Dou, Z.; and Wen, J.-R. 2025. Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning.
Fang, R.; Wang, X.; Liang, Y.; Qiao, S.; Wu, J.; Xi, Z.; Zhang, N.; Jiang, Y.; Xie, P.; Huang, F.; et al. 2025. SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement.
Feng, E.; Zhou, W.; Liu, Z.; Chen, L.; Dong, Y.; Zhang, C.; Zhao, Y.; Du, D.; Hua, Z.; Xia, Y.; et al. 2025. Get Experience from Practice: LLM Agents with Record & Replay.
Gupta, P.; and Cohen, N. J. 2002. Theoretical and computational analysis of skill learning, repetition priming, and procedural memory.
Gur, I.; Furuta, H.; Huang, A.; Safdari, M.; Matsuo, Y.; Eck, D.; and Faust, A. 2023. A real-world webagent with planning, long context understanding, and program synthesis.
Hou, Y.; Tamoto, H.; and Miyashita, H. 2024. "My agent understands me better": Integrating dynamic human-like memory recall and consolidation in llm-based agents. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, 1–7.
Hu, M.; Chen, T.; Chen, Q.; Mu, Y.; Shao, W.; and Luo, P. 2024a. Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large language model.
Hu, M.; Chen, T.; Chen, Q.; Mu, Y.; Shao, W.; and Luo, P. 2024b. Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large language model.
Ifargan, T.; Hafner, L.; Kern, M.; Alcalay, O.; and Kishony, R. 2025. Autonomous llm-driven research—from data to human-verifiable research papers.
Laird, J. E. 2022. Introduction to the soar cognitive architecture.
Lan, W.; Tang, Z.; Liu, M.; Chen, Q.; Peng, W.; Chen, Y. P.; and Pan, Y. 2025. The large language models on biomedical data analysis: a survey.
Li, G.; Hammoud, H.; Itani, H.; Khizbullin, D.; and Ghanem, B. 2023. Camel: Communicative agents for "mind" exploration of large language model society.
Li, Z.; Song, S.; Xi, C.; Wang, H.; Tang, C.; Niu, S.; Chen, D.; Yang, J.; Li, C.; Yu, Q.; et al. 2025. Memos: A memory os for ai system.
Liu, B.; Li, X.; Zhang, J.; Wang, J.; He, T.; Hong, S.; Liu, H.; Zhang, S.; Song, K.; Zhu, K.; et al. 2025a. Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems.
Liu, Y.; Si, C.; Narasimhan, K.; and Yao, S. 2025b. Contextual Experience Replay for Self-Improvement of Language Agents.
Liu, Z.; Chai, J.; Zhu, X.; Tang, S.; Ye, R.; Zhang, B.; Bai, L.; and Chen, S. 2025c. Ml-agent: Reinforcing llm agents for autonomous machine learning engineering.
Lu, F.; Zhong, Z.; Liu, S.; Fu, C.-W.; and Jia, J. 2025. ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay.
Luo, R.; Wang, L.; He, W.; and Xia, X. 2025. Gui-r1: A generalist r1-style vision-language action model for gui agents.
Mavroudis, V. 2024. LangChain v0.3.
OpenAI. 2022. GPT-4 System Card.
OpenAI. 2025. Deep Research System Card.
Ou, Y.; Luo, Y.; Zheng, J.; Wei, L.; Qiao, S.; Zhang, J.; Zheng, D.; Chen, H.; and Zhang, N. 2025. AutoMind: Adaptive Knowledgeable Agent for Automated Data Science.
Puterman, M. L. 1990. Markov decision processes.
Qiao, S.; Fang, R.; Qiu, Z.; Wang, X.; Zhang, N.; Jiang, Y.; Xie, P.; Huang, F.; and Chen, H. 2025. Benchmarking Agentic Workflow Generation. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net.
Qiao, S.; Ou, Y.; Zhang, N.; Chen, X.; Yao, Y.; Deng, S.; Tan, C.; Huang, F.; and Chen, H. 2023. Reasoning with Language Model Prompting: A Survey. In Rogers, A.; Boyd-Graber, J. L.; and Okazaki, N., eds., Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, 5368–5393. Association for Computational Linguistics.
Qin, Y.; Ye, Y.; Fang, J.; Wang, H.; Liang, S.; Tian, S.; Zhang, J.; Li, J.; Li, Y.; Huang, S.; et al. 2025. Ui-tars: Pioneering automated gui interaction with native agents.
Shridhar, M.; Yuan, X.; Côté, M.; Bisk, Y.; Trischler, A.; and Hausknecht, M. J. 2021. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
Su, H.; Sun, R.; Yoon, J.; Yin, P.; Yu, T.; and Arık, S. Ö. 2025. Learn-by-interact: A data-centric framework for self-adaptive agents in realistic environments.
Sumers, T.; Yao, S.; Narasimhan, K.; and Griffiths, T. 2023a. Cognitive architectures for language agents.
Sumers, T.; Yao, S.; Narasimhan, K.; and Griffiths, T. 2023b. Cognitive architectures for language agents.
Sun, J.; Zhang, Q.; Duan, Y.; Jiang, X.; Cheng, C.; and Xu, R. 2024. Prompt, plan, perform: Llm-based humanoid control via quantized imitation learning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 16236–16242. IEEE.
Tan, X.; Li, B.; Qiu, X.; Qu, C.; Chu, W.; Xu, Y.; and Qi, Y. 2025. Meta-Agent-Workflow: Streamlining Tool Usage in LLMs through Workflow Construction, Retrieval, and Refinement. In Companion Proceedings of the ACM on Web Conference 2025, WWW '25, 458–467. New York, NY, USA: Association for Computing Machinery. ISBN 9798400713316.
Tang, X.; Qin, T.; Peng, T.; Zhou, Z.; Shao, D.; Du, T.; Wei, X.; Xia, P.; Wu, F.; Zhu, H.; Zhang, G.; Liu, J.; Wang, X.; Hong, S.; Wu, C.; Cheng, H.; Wang, C.; and Zhou, W. 2025. Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving. arXiv:2507.06229.
Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu, Y.; Fan, L.; and Anandkumar, A. 2023. Voyager: An open-ended embodied agent with large language models.
Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. 2024a. A survey on large language model based autonomous agents.
Wang, Z.; Xu, H.; Wang, J.; Zhang, X.; Yan, M.; Zhang, J.; Huang, F.; and Ji, H. 2025. Mobile-agent-e: Self-evolving mobile assistant for complex tasks.
Wang, Z. Z.; Mao, J.; Fried, D.; and Neubig, G. 2024b. Agent workflow memory.
Wu, J.; Li, B.; Fang, R.; Yin, W.; Zhang, L.; Tao, Z.; Zhang, D.; Xi, Z.; Fu, G.; Jiang, Y.; et al. 2025a. WebDancer: Towards Autonomous Information Seeking Agency.
Wu, J.; Yin, W.; Jiang, Y.; Wang, Z.; Xi, Z.; Fang, R.; Zhang, L.; He, Y.; Zhou, D.; Xie, P.; and Huang, F. 2025b. WebWalker: Benchmarking LLMs in Web Traversal. In Che, W.; Nabende, J.; Shutova, E.; and Pilehvar, M. T., eds., Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 10290–10305. Vienna, Austria: Association for Computational Linguistics. ISBN 979-8-89176-251-0.
Wu, X.; Bu, Y.; Cai, Y.; and Wang, T. 2024. Updating Large Language Models' Memories with Time Constraints.
x.ai. 2025. Grok 3 Beta — The Age of Reasoning Agents.
Xi, Z.; Chen, W.; Guo, X.; He, W.; Ding, Y.; Hong, B.; Zhang, M.; Wang, J.; Jin, S.; Zhou, E.; et al. 2025. The rise and potential of large language model based agents: A survey.
Xia, M.; Ruehle, V.; Rajmohan, S.; and Shokri, R. 2025. Minerva: A Programmable Memory Test Benchmark for Language Models.
Xie, J.; Zhang, K.; Chen, J.; Zhu, T.; Lou, R.; Tian, Y.; Xiao, Y.; and Su, Y. 2024. TravelPlanner: A Benchmark for Real-World Planning with Language Agents. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net.
Xu, W.; Liang, Z.; Mei, K.; Gao, H.; Tan, J.; and Zhang, Y. 2025. A-mem: Agentic memory for llm agents.
Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. 2024a. Qwen2.5 technical report.
Yang, H.; Yue, S.; and He, Y. 2023. Auto-gpt for online decision making: Benchmarks and additional opinions.
Yang, Y.; Zhou, T.; Li, K.; Tao, D.; Li, L.; Shen, L.; He, X.; Jiang, J.; and Shi, Y. 2024b. Embodied multi-modal agent trained by an llm from a parallel textworld. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 26275–26285.
Yao, S.; Shinn, N.; Razavi, P.; and Narasimhan, K. R. 2025. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. In The Thirteenth International Conference on Learning Representations.
Yu, H.; Chen, T.; Feng, J.; Chen, J.; Dai, W.; Yu, Q.; Zhang, Y.-Q.; Ma, W.-Y.; Liu, J.; Wang, M.; and Zhou, H. 2025. MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent. arXiv:2507.02259.
Zhang, Z.; Bo, X.; Ma, C.; Li, R.; Chen, X.; Dai, Q.; Zhu, J.; Dong, Z.; and Wen, J.-R. 2024. A survey on the memory mechanism of large language model based agents.
Zhao, W. X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. 2023. A survey of large language models.
Zheng, L.; Wang, R.; Wang, X.; and An, B. Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control. In The Twelfth International Conference on Learning Representations.
Zhong, W.; Guo, L.; Gao, Q.; Ye, H.; and Wang, Y. 2024a. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 19724–19731.
Zhong, W.; Guo, L.; Gao, Q.; Ye, H.; and Wang, Y. 2024b. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 19724–19731.
Zhou, W.; Jiang, Y. E.; Li, L.; Wu, J.; Wang, T.; Qiu, S.; Zhang, J.; Chen, J.; Wu, R.; Wang, S.; Zhu, S.; Chen, J.; Zhang, W.; Tang, X.; Zhang, N.; Chen, H.; Cui, P.; and Sachan, M. 2023. Agents: An Open-source Framework for Autonomous Language Agents. arXiv:2309.07870.
Zhou, W.; Ou, Y.; Ding, S.; Li, L.; Wu, J.; Wang, T.; Chen, J.; Wang, S.; Xu, X.; Zhang, N.; Chen, H.; and Jiang, Y. E. 2024. Symbolic Learning Enables Self-Evolving Agents. arXiv:2406.18532.
Zhou, Z.; Qu, A.; Wu, Z.; Kim, S.; Prakash, A.; Rus, D.; Zhao, J.; Low, B. K. H.; and Liang, P. P. 2025. MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents. arXiv:2506.15841.
