Exploring Agent Procedural Memory
Figure 2: The procedural memory framework consists of Build, Retrieve, and Update, which respectively involve encoding new procedural memories, retrieving stored procedural memory, and modifying existing ones in light of new experiences.
Backbones. In our experiments, we benchmarked our procedural memory on three base models. Specifically, we adopt the two proprietary frontier models that have consistently dominated public leaderboards, OpenAI's GPT-4o (OpenAI 2022) and Anthropic's Claude (Anthropic 2022), and complement them with the open-sourced Qwen2.5-72B-Instruct (Yang et al. 2024a). The first two provide state-of-the-art closed-source performance, while the third allows us to verify that our findings generalize beyond proprietary systems and remain valid in the open-source regime.

Evaluation. For the ALFWorld dataset, task completion is evaluated by the execution environment. After a task is completed or the maximum number of execution steps is reached, the environment provides a reward of 0 or 1 to indicate whether the task has been successfully completed. For TravelPlanner, we conduct experiments on the test set in a two-stage mode. After multiple rounds of interaction to obtain the travel trajectory and the final plan, GPT-4o converts the travel plan into a specified JSON format. The converted plan is then compared with the gold standard to obtain scores for both Commonsense and Hard Constraint.

Memory Storage & Retrieval

Procedural knowledge is typically stored in two main formats: (1) trajectories are kept verbatim, round by round, in memory, or (2) high-level abstractions are extracted from these trajectories and then stored. Once a similar procedural memory is retrieved, it is appended to the task as part of the context, serving as prior knowledge to assist the model in completing the task.

Inspired by this, we designed the following experimental conditions:
• No Memory: The model tackles the assigned task in a ReAct fashion without any external memory.
• Trajectory: We first filter the gold trajectories from the training set and store them. At inference time, the system retrieves the top-k trajectories whose query vectors are most similar to the current task's vector, supplying them as procedural memories before execution.
• Script: The model analyzes and summarizes the gold trajectories from the training set, distilling them into abstract procedural knowledge that is provided as a prompt before each task.
• Proceduralization: This condition combines the full retrieved trajectories with the high-level script generated by the model, integrating both concrete examples and abstract guidance as the procedural memory.
| Model | Granularity | TravelPlanner #CS ↑ | #HC ↑ | Steps ↓ | ALFWorld Dev ↑ | Test ↑ | Steps ↓ |
|---|---|---|---|---|---|---|---|
| GPT-4o | No Memory | 71.93 | 12.88 | 17.84 | 39.28 | 42.14 | 23.76 |
| GPT-4o | Script | 72.08 | 5.50 | 15.79 | 66.67 | 56.43 | 18.52 |
| GPT-4o | Trajectory | 76.02 | 8.25 | 14.64 | 67.17 | 74.29 | 16.49 |
| GPT-4o | Proceduralization | 79.94 | 9.76 | 14.62 | 87.14 | 77.86 | 15.01 |
| Claude-3.5-sonnet | No Memory | 63.49 | 33.06 | 18.84 | 39.20 | 34.97 | 24.12 |
| Claude-3.5-sonnet | Script | 62.08 | 29.61 | 19.21 | 56.13 | 53.59 | 19.38 |
| Claude-3.5-sonnet | Trajectory | 65.76 | 29.61 | 17.72 | 69.28 | 71.78 | 15.97 |
| Claude-3.5-sonnet | Proceduralization | 65.46 | 30.14 | 15.29 | 82.50 | 74.72 | 15.79 |
| Qwen2.5-72b | No Memory | 56.57 | 7.34 | 18.32 | 44.91 | 41.25 | 21.38 |
| Qwen2.5-72b | Script | 58.59 | 7.34 | 18.53 | 66.24 | 61.88 | 17.13 |
| Qwen2.5-72b | Trajectory | 63.41 | 12.66 | 18.12 | 64.49 | 69.57 | 16.40 |
| Qwen2.5-72b | Proceduralization | 63.82 | 14.19 | 17.94 | 85.71 | 77.19 | 15.32 |

Table 1: Results on Build policy. #CS and #HC denote Commonsense and Hard Constraint, respectively. ↑ indicates that higher values are better, and ↓ that lower values are better. The best results among all methods with similar settings are bolded, and the second-best results are underlined.
As shown in Table 1, all memory construction methods outperform the no-memory baseline, achieving higher scores on both datasets while also reducing the number of steps required. This indicates that procedural memory built during training is beneficial when directly applied to tasks at test time. Furthermore, we observe that abstracting trajectories into scripts during training yields relatively better performance on the ALFWorld test set compared to the dev set. Conversely, trajectories that use complete execution traces as procedural memory achieve higher scores on the dev set, suggesting that scripts are more capable of generalizing to different test tasks, while trajectories are better suited to scenarios involving tasks similar to those already completed. By combining procedural knowledge from both methods, employing abstracted guidelines alongside concrete execution trajectories, we attain the best performance.

After converting a set of completed trajectories into procedural memory, the next critical challenge is to retrieve the most accurate and relevant procedural knowledge when a new task arrives. We designed several key-construction methods for memory storage to facilitate subsequent vector-based matching and retrieval:
• Random Sample: Does not use keys for vector retrieval; instead, a few memories are randomly drawn from procedural memory.
• Query: Employs the query description as the key for storage, leveraging the semantic similarity of queries for retrieval.
• AveFact: Applies a large model to extract keywords from the task's query, then computes the average similarity across matched keywords for retrieval.

| Model | Policy | #CS ↑ | #HC ↑ | Steps ↓ |
|---|---|---|---|---|
| GPT-4o | No Memory | 71.93 | 12.88 | 17.84 |
| GPT-4o | Random Sample | 74.59 | 6.72 | 15.12 |
| GPT-4o | Key=Query | 73.38 | 8.95 | 15.44 |
| GPT-4o | Key=AveFact | 76.02 | 8.25 | 14.64 |
| Claude-3.5-sonnet | No Memory | 63.49 | 33.06 | 18.84 |
| Claude-3.5-sonnet | Random Sample | 63.99 | 29.91 | 17.93 |
| Claude-3.5-sonnet | Key=Query | 64.93 | 28.56 | 17.60 |
| Claude-3.5-sonnet | Key=AveFact | 65.76 | 29.61 | 17.72 |
| Qwen2.5-72b | No Memory | 56.57 | 7.34 | 18.32 |
| Qwen2.5-72b | Random Sample | 59.76 | 8.43 | 18.31 |
| Qwen2.5-72b | Key=Query | 61.71 | 11.97 | 18.54 |
| Qwen2.5-72b | Key=AveFact | 63.41 | 12.66 | 18.12 |

Table 2: Results on Retrieve policy on TravelPlanner.

During retrieval, we evaluate similarity by computing the cosine similarity between the corresponding vectors. Our experiments show that these retrieval strategies produce varying results. Specifically, compared to random sampling, the query-based and AveFact methods for precise retrieval significantly improve performance. The query-based approach benefits from capturing semantic context, enabling more accurate matches. The AveFact method, by extracting key features and averaging their similarities, effectively focuses on core task elements, leading to better retrieval efficacy. Overall, our findings suggest that incorporating semantic understanding and key-feature extraction into retrieval strategies substantially enhances memory access accuracy and downstream task performance.
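The sketch below illustrates the retrieval side under stated assumptions: each stored memory carries a text key, an external `embed` function maps text to vectors (any sentence-embedding model could serve), and the AveFact score averages, over the task's extracted keywords, each keyword's best cosine match against the memory's keywords. The exact matching rule is not spelled out above, so treat this as one plausible reading; the function names are ours.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]

def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def avefact_score(task_kw_vecs: List[Vector], mem_kw_vecs: List[Vector]) -> float:
    # For each task keyword, take its best match among the memory's keywords,
    # then average: one reading of "average similarity across matched keywords".
    if not task_kw_vecs or not mem_kw_vecs:
        return 0.0
    best = [max(cosine(t, m) for m in mem_kw_vecs) for t in task_kw_vecs]
    return sum(best) / len(best)

def retrieve(task_query: str,
             memories: List[Dict[str, str]],                  # each entry has at least a "key" field
             embed: Callable[[str], Vector],                  # hypothetical text-embedding function
             extract_keywords: Callable[[str], List[str]],    # hypothetical LLM keyword extractor
             policy: str = "avefact",
             k: int = 3) -> List[Dict[str, str]]:
    """Return the top-k memories for a new task under the Query or AveFact policy."""
    scored = []
    for mem in memories:
        if policy == "query":
            score = cosine(embed(task_query), embed(mem["key"]))
        else:  # "avefact"
            score = avefact_score([embed(w) for w in extract_keywords(task_query)],
                                  [embed(w) for w in extract_keywords(mem["key"])])
        scored.append((score, mem))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [mem for _, mem in scored[:k]]
```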
Memory Update

While many prior efforts have focused on developing reusable procedural knowledge, enabling models to learn from prior experiences rather than solving each test task in isolation, most existing memory-update methods remain quite rudimentary. Typically, they simply append newly acquired memories to the existing store, a so-called "merge" strategy. In this work, we explore several online memory-update mechanisms to identify which dynamic strategy delivers the best performance on our tasks. Beyond end-to-end evaluation metrics, we also analyze how both accuracy and efficiency evolve as the number of executed tasks increases, explicitly measuring the benefits conferred by our procedural memory.

To facilitate systematic comparison, we designed several memory-update scenarios. In each, the agent's episodic memory is refreshed after every t test-set tasks. The specific update strategies are as follows:
• Vanilla Memory Update: After every t tasks, all trajectories from these tasks are consolidated into procedural memories and directly appended to the memory bank.
• Validation: After every t tasks, only the trajectories of successfully completed tasks are retained and converted into procedural memories for storage.
• Adjustment: When a retrieved procedural memory results in a failed execution, the erroneous trajectory is combined with the original memory and then revised in place, yielding an updated procedural memory.
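A minimal sketch of these three update policies follows. The batch interface, the success flag, the optional index of the retrieved source memory, and the `revise` callable (which in practice would prompt the backbone LLM with the old memory plus the erroneous trajectory) are illustrative assumptions rather than the paper's interface; we also assume that Adjustment still appends successful episodes, as Validation does.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Memory = Dict[str, str]   # e.g. {"key": ..., "trajectory": ..., "script": ...}

@dataclass
class Episode:
    memory: Memory                       # memory distilled from this episode's trajectory
    success: bool                        # did the environment return reward 1?
    retrieved_idx: Optional[int] = None  # index of the memory retrieved for this episode, if any

def update_bank(bank: List[Memory],
                batch: List[Episode],
                revise: Callable[[Memory, Memory], Memory],
                policy: str = "adjustment") -> List[Memory]:
    """Apply one batch update after every t tasks.

    "vanilla":    append every newly distilled memory.
    "validation": append only memories from successful episodes.
    "adjustment": like validation, but when a retrieved memory led to a failure,
                  combine it with the erroneous trajectory and revise it in place.
    """
    for ep in batch:
        if policy == "vanilla":
            bank.append(ep.memory)
        elif policy == "validation" and ep.success:
            bank.append(ep.memory)
        elif policy == "adjustment":
            if ep.success:
                bank.append(ep.memory)
            elif ep.retrieved_idx is not None:
                # revise the stored memory using the failed episode's trajectory
                bank[ep.retrieved_idx] = revise(bank[ep.retrieved_idx], ep.memory)
    return bank
```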
[Figure 3: reward gain and step reduction over the "w/o Memory" baseline for the Vanilla, Validation, and Adjustment update strategies.]
[Figure 4: (a) counts for delivery, steps, and commonsense, without memory versus with GPT-4o-built memory; (b) final scores.]

As depicted in Figure 3, we systematically divided the tasks within our testbed into several distinct groups, with each group comprising a diverse set of individual tasks. Upon the completion of the tasks within each group, we employed the previously described strategies to construct, store, and update the procedural memory. The experimental results reveal a clear trend: as we sequentially progress
Procedural memory exhibits transferability from strong models to weaker ones. For a procedural memory constructed from a strong model into an offline memory library, we aim to verify whether this form of procedural memory can be effectively transferred to other, even weaker, models. This exploration underscores the significance of memory transfer, as it could potentially enhance the adaptability and efficiency of various models by leveraging the knowledge and experience encapsulated within the strong model's memory structure. As shown in Figure 4 (b), procedural memory generated by GPT-4o was employed by Qwen2.5-14B. On the TravelPlanner benchmark, the 14-billion-parameter model raised its task completion rate by 5% and cut the average number of steps by 1.6. Similar gains, both in success rate and trajectory length, appeared on ALFWorld. These outcomes confirm that procedural knowledge from a stronger model can be distilled into a reusable memory bank and transferred to a smaller model with minimal overhead, giving that smaller model a clear boost in task-solving ability. Moreover, by leveraging procedural memory transfer, we can rapidly migrate the experiential knowledge that one model has acquired to another, which is highly beneficial as agents adapt to new tasks with greater efficiency and robustness.

Scaling Memory Retrieval Improves Agent Performance. While our main experiment has already demonstrated that procedural memory improves an agent's task accuracy and reduces the number of steps required, vector-based storage and retrieval confer an advantage over human procedural memory: they can be scaled both in total capacity and in the number of memories retrieved. To investigate whether an agent's performance continues to rise as the procedural-memory store and the number of retrieved memories increase, we designed a set of follow-up experiments. As shown in Figure 4 (b), as the number of retrieved procedural memories increases, the agent's performance also improves steadily, exhibiting an upward trend followed by a plateau. However, retrieving too many memories can lead to a decline in the agent's performance, because excessive retrieval inflates the context length and also introduces less accurate procedural memories, which interfere with overall effectiveness.

Conclusion and Future Work

We introduce Memp, a task-agnostic framework that elevates procedural memory to a core optimization target in LLM-based agents. By systematically studying strategies for memory construction, retrieval, and updating, Memp enables agents to distill, reuse, and refine their own past experiences across diverse, long-horizon tasks. Empirical results on housework-automation and information-seeking benchmarks show that leveraging procedural memory significantly boosts task success rates and efficiency. Beyond improving individual episodes, Memp supports continual learning and robust generalization, marking a step toward self-improving, resilient agents.

In our experiments, Memp has achieved promising results in both construction and retrieval. Moving forward, we plan to enhance this work in several ways. First, we will develop more diverse retrieval strategies. The current approach constructs different keys for vector-based retrieval; however, traditional methods such as BM25 could also be explored to retrieve precise memories more effectively. Second, Memp currently relies on the standard reward signals provided by the benchmark. In real-world scenarios, however, many tasks do not have clear reward signals, making it difficult for the agent to determine whether a task has been completed successfully. In such cases, using a large language model (LLM) as a judge to assess task completion could be a viable solution. This would transform the agent's lifecycle into a continuous loop of executing tasks, self-assessing completion, building memories, and then proceeding to new tasks.
References

Anthropic. 2022. Claude 3.5 Sonnet System Card.
Barres, V.; Dong, H.; Ray, S.; Si, X.; and Narasimhan, K. 2025. τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment.
Chen, C.; Hao, X.; Liu, W.; Huang, X.; Zeng, X.; Yu, S.; Li, D.; Wang, S.; Gan, W.; Huang, Y.; et al. 2025. ACEBench: Who Wins the Match Point in Tool Usage?
Chen, M.; Li, Y.; Yang, Y.; Yu, S.; Lin, B.; and He, X. 2024. AutoManual: Constructing Instruction Manuals by LLM Agents via Interactive Environmental Learning. arXiv:2405.16247.
Chhikara, P.; Khant, D.; Aryan, S.; Singh, T.; and Yadav, D. 2025. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory.
Cohen, N. J.; and Squire, L. R. 1980. Preserved learning and retention of pattern-analyzing skill in amnesia: Dissociation of knowing how and knowing that.
Dong, G.; Chen, Y.; Li, X.; Jin, J.; Qian, H.; Zhu, Y.; Mao, H.; Zhou, G.; Dou, Z.; and Wen, J.-R. 2025. Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning.
Fang, R.; Wang, X.; Liang, Y.; Qiao, S.; Wu, J.; Xi, Z.; Zhang, N.; Jiang, Y.; Xie, P.; Huang, F.; et al. 2025. SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement.
Feng, E.; Zhou, W.; Liu, Z.; Chen, L.; Dong, Y.; Zhang, C.; Zhao, Y.; Du, D.; Hua, Z.; Xia, Y.; et al. 2025. Get Experience from Practice: LLM Agents with Record & Replay.
Gupta, P.; and Cohen, N. J. 2002. Theoretical and computational analysis of skill learning, repetition priming, and procedural memory.
Gur, I.; Furuta, H.; Huang, A.; Safdari, M.; Matsuo, Y.; Eck, D.; and Faust, A. 2023. A real-world webagent with planning, long context understanding, and program synthesis.
Hou, Y.; Tamoto, H.; and Miyashita, H. 2024. "My agent understands me better": Integrating dynamic human-like memory recall and consolidation in llm-based agents. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, 1–7.
Hu, M.; Chen, T.; Chen, Q.; Mu, Y.; Shao, W.; and Luo, P. 2024a. Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large language model.
Hu, M.; Chen, T.; Chen, Q.; Mu, Y.; Shao, W.; and Luo, P. 2024b. Hiagent: Hierarchical working memory management for solving long-horizon agent tasks with large language model.
Ifargan, T.; Hafner, L.; Kern, M.; Alcalay, O.; and Kishony, R. 2025. Autonomous llm-driven research—from data to human-verifiable research papers.
Laird, J. E. 2022. Introduction to the soar cognitive architecture.
Lan, W.; Tang, Z.; Liu, M.; Chen, Q.; Peng, W.; Chen, Y. P.; and Pan, Y. 2025. The large language models on biomedical data analysis: a survey.
Li, G.; Hammoud, H.; Itani, H.; Khizbullin, D.; and Ghanem, B. 2023. Camel: Communicative agents for "mind" exploration of large language model society.
Li, Z.; Song, S.; Xi, C.; Wang, H.; Tang, C.; Niu, S.; Chen, D.; Yang, J.; Li, C.; Yu, Q.; et al. 2025. Memos: A memory os for ai system.
Liu, B.; Li, X.; Zhang, J.; Wang, J.; He, T.; Hong, S.; Liu, H.; Zhang, S.; Song, K.; Zhu, K.; et al. 2025a. Advances and challenges in foundation agents: From brain-inspired intelligence to evolutionary, collaborative, and safe systems.
Liu, Y.; Si, C.; Narasimhan, K.; and Yao, S. 2025b. Contextual Experience Replay for Self-Improvement of Language Agents.
Liu, Z.; Chai, J.; Zhu, X.; Tang, S.; Ye, R.; Zhang, B.; Bai, L.; and Chen, S. 2025c. Ml-agent: Reinforcing llm agents for autonomous machine learning engineering.
Lu, F.; Zhong, Z.; Liu, S.; Fu, C.-W.; and Jia, J. 2025. ARPO: End-to-End Policy Optimization for GUI Agents with Experience Replay.
Luo, R.; Wang, L.; He, W.; and Xia, X. 2025. Gui-r1: A generalist r1-style vision-language action model for gui agents.
Mavroudis, V. 2024. LangChain v0.3.
OpenAI. 2022. GPT-4 System Card.
OpenAI. 2025. Deep Research System Card.
Ou, Y.; Luo, Y.; Zheng, J.; Wei, L.; Qiao, S.; Zhang, J.; Zheng, D.; Chen, H.; and Zhang, N. 2025. AutoMind: Adaptive Knowledgeable Agent for Automated Data Science.
Puterman, M. L. 1990. Markov decision processes.
Qiao, S.; Fang, R.; Qiu, Z.; Wang, X.; Zhang, N.; Jiang, Y.; Xie, P.; Huang, F.; and Chen, H. 2025. Benchmarking Agentic Workflow Generation. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net.
Qiao, S.; Ou, Y.; Zhang, N.; Chen, X.; Yao, Y.; Deng, S.; Tan, C.; Huang, F.; and Chen, H. 2023. Reasoning with Language Model Prompting: A Survey. In Rogers, A.; Boyd-Graber, J. L.; and Okazaki, N., eds., Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, 5368–5393. Association for Computational Linguistics.
Qin, Y.; Ye, Y.; Fang, J.; Wang, H.; Liang, S.; Tian, S.; Zhang, J.; Li, J.; Li, Y.; Huang, S.; et al. 2025. Ui-tars: Pioneering automated gui interaction with native agents.
Shridhar, M.; Yuan, X.; Côté, M.; Bisk, Y.; Trischler, A.; and Hausknecht, M. J. 2021. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
Su, H.; Sun, R.; Yoon, J.; Yin, P.; Yu, T.; and Arık, S. Ö. 2025. Learn-by-interact: A data-centric framework for self-adaptive agents in realistic environments.
Sumers, T.; Yao, S.; Narasimhan, K.; and Griffiths, T. 2023a. Cognitive architectures for language agents.
Sumers, T.; Yao, S.; Narasimhan, K.; and Griffiths, T. 2023b. Cognitive architectures for language agents.
Sun, J.; Zhang, Q.; Duan, Y.; Jiang, X.; Cheng, C.; and Xu, R. 2024. Prompt, plan, perform: Llm-based humanoid control via quantized imitation learning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 16236–16242. IEEE.
Tan, X.; Li, B.; Qiu, X.; Qu, C.; Chu, W.; Xu, Y.; and Qi, Y. 2025. Meta-Agent-Workflow: Streamlining Tool Usage in LLMs through Workflow Construction, Retrieval, and Refinement. In Companion Proceedings of the ACM on Web Conference 2025, WWW '25, 458–467. New York, NY, USA: Association for Computing Machinery. ISBN 9798400713316.
Tang, X.; Qin, T.; Peng, T.; Zhou, Z.; Shao, D.; Du, T.; Wei, X.; Xia, P.; Wu, F.; Zhu, H.; Zhang, G.; Liu, J.; Wang, X.; Hong, S.; Wu, C.; Cheng, H.; Wang, C.; and Zhou, W. 2025. Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving. arXiv:2507.06229.
Wang, G.; Xie, Y.; Jiang, Y.; Mandlekar, A.; Xiao, C.; Zhu, Y.; Fan, L.; and Anandkumar, A. 2023. Voyager: An open-ended embodied agent with large language models.
Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. 2024a. A survey on large language model based autonomous agents.
Wang, Z.; Xu, H.; Wang, J.; Zhang, X.; Yan, M.; Zhang, J.; Huang, F.; and Ji, H. 2025. Mobile-agent-e: Self-evolving mobile assistant for complex tasks.
Wang, Z. Z.; Mao, J.; Fried, D.; and Neubig, G. 2024b. Agent workflow memory.
Wu, J.; Li, B.; Fang, R.; Yin, W.; Zhang, L.; Tao, Z.; Zhang, D.; Xi, Z.; Fu, G.; Jiang, Y.; et al. 2025a. WebDancer: Towards Autonomous Information Seeking Agency.
Wu, J.; Yin, W.; Jiang, Y.; Wang, Z.; Xi, Z.; Fang, R.; Zhang, L.; He, Y.; Zhou, D.; Xie, P.; and Huang, F. 2025b. WebWalker: Benchmarking LLMs in Web Traversal. In Che, W.; Nabende, J.; Shutova, E.; and Pilehvar, M. T., eds., Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 10290–10305. Vienna, Austria: Association for Computational Linguistics. ISBN 979-8-89176-251-0.
Wu, X.; Bu, Y.; Cai, Y.; and Wang, T. 2024. Updating Large Language Models' Memories with Time Constraints.
x.ai. 2025. Grok 3 Beta — The Age of Reasoning Agents.
Xi, Z.; Chen, W.; Guo, X.; He, W.; Ding, Y.; Hong, B.; Zhang, M.; Wang, J.; Jin, S.; Zhou, E.; et al. 2025. The rise and potential of large language model based agents: A survey.
Xia, M.; Ruehle, V.; Rajmohan, S.; and Shokri, R. 2025. Minerva: A Programmable Memory Test Benchmark for Language Models.
Xie, J.; Zhang, K.; Chen, J.; Zhu, T.; Lou, R.; Tian, Y.; Xiao, Y.; and Su, Y. 2024. TravelPlanner: A Benchmark for Real-World Planning with Language Agents. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net.
Xu, W.; Liang, Z.; Mei, K.; Gao, H.; Tan, J.; and Zhang, Y. 2025. A-mem: Agentic memory for llm agents.
Yang, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Li, C.; Liu, D.; Huang, F.; Wei, H.; et al. 2024a. Qwen2.5 technical report.
Yang, H.; Yue, S.; and He, Y. 2023. Auto-gpt for online decision making: Benchmarks and additional opinions.
Yang, Y.; Zhou, T.; Li, K.; Tao, D.; Li, L.; Shen, L.; He, X.; Jiang, J.; and Shi, Y. 2024b. Embodied multi-modal agent trained by an llm from a parallel textworld. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 26275–26285.
Yao, S.; Shinn, N.; Razavi, P.; and Narasimhan, K. R. 2025. τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. In The Thirteenth International Conference on Learning Representations.
Yu, H.; Chen, T.; Feng, J.; Chen, J.; Dai, W.; Yu, Q.; Zhang, Y.-Q.; Ma, W.-Y.; Liu, J.; Wang, M.; and Zhou, H. 2025. MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent. arXiv:2507.02259.
Zhang, Z.; Bo, X.; Ma, C.; Li, R.; Chen, X.; Dai, Q.; Zhu, J.; Dong, Z.; and Wen, J.-R. 2024. A survey on the memory mechanism of large language model based agents.
Zhao, W. X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. 2023. A survey of large language models.
Zheng, L.; Wang, R.; Wang, X.; and An, B. 2024. Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control. In The Twelfth International Conference on Learning Representations.
Zhong, W.; Guo, L.; Gao, Q.; Ye, H.; and Wang, Y. 2024a. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 19724–19731.
Zhong, W.; Guo, L.; Gao, Q.; Ye, H.; and Wang, Y. 2024b. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 19724–19731.
Zhou, W.; Jiang, Y. E.; Li, L.; Wu, J.; Wang, T.; Qiu, S.; Zhang, J.; Chen, J.; Wu, R.; Wang, S.; Zhu, S.; Chen, J.; Zhang, W.; Tang, X.; Zhang, N.; Chen, H.; Cui, P.; and Sachan, M. 2023. Agents: An Open-source Framework for Autonomous Language Agents. arXiv:2309.07870.
Zhou, W.; Ou, Y.; Ding, S.; Li, L.; Wu, J.; Wang, T.; Chen, J.; Wang, S.; Xu, X.; Zhang, N.; Chen, H.; and Jiang, Y. E. 2024. Symbolic Learning Enables Self-Evolving Agents. arXiv:2406.18532.
Zhou, Z.; Qu, A.; Wu, Z.; Kim, S.; Prakash, A.; Rus, D.; Zhao, J.; Low, B. K. H.; and Liang, P. P. 2025. MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents. arXiv:2506.15841.