Building Self-Evolving Agents via Experience-Driven Lifelong Learning

A Framework and Benchmark

Yuxuan Cai¹, Yipeng Hao¹, Jie Zhou¹,², Hang Yan³, Zhikai Lei¹, Rui Zheng⁴, Zhenhua Han,

Yutao Yang¹, Junsong Li¹, Qianjun Pan¹, Tianyu Huai¹, Qin Chen¹, Xin Li², Kai Chen², Bo Zhang², Xipeng Qiu⁴, Liang He¹

¹ School of Computer Science and Technology, East China Normal University, Shanghai

² Shanghai AI Laboratory, ³ The Chinese University of Hong Kong, ⁴ Fudan University

About

As AI advances toward general intelligence, the focus is shifting from systems optimized for static tasks to open-ended agents that learn continuously and adapt autonomously. This vision prioritizes long-term memory, skill transfer, and strategic planning, driven by an intrinsic curiosity to learn in dynamic, unpredictable environments.

In this paper, we introduce Experience-driven Lifelong Learning (ELL), a framework for building self-evolving agents capable of continuous growth through real-world interaction. The framework is built on four core principles: (1) Experience Exploration: Agents learn through self-motivated interaction with dynamic environments, navigating interdependent tasks and generating rich experiential trajectories. (2) Long-term Memory: Agents preserve and structure historical knowledge, including personal experiences, domain expertise, and commonsense reasoning, into a persistent memory system. (3) Skill Learning: Agents autonomously improve by abstracting recurring patterns from experience into reusable skills, which are actively refined and validated for application in new tasks. (4) Knowledge Internalization: Agents internalize explicit and discrete experiences into implicit and intuitive capabilities as "second nature".

We also introduce StuLife, a benchmark dataset for ELL that simulates a student's holistic college journey, from enrollment to academic and personal development, across three core phases and ten detailed sub-scenarios. StuLife is designed around three key paradigm shifts: From Passive to Proactive, From Context to Memory, and From Imitation to Learning. In this dynamic environment, agents must acquire and distill practical skills and maintain persistent memory to make decisions based on evolving state variables (e.g., resource availability and time). Critically, these agents are also expected to demonstrate intrinsic motivation by setting their own goals and initiating actions without external prompting. To this end, StuLife provides a comprehensive evaluation platform featuring our novel metrics (e.g., StuGPA) to specifically assess these critical capabilities.

Our evaluation reveals that even the best model, GPT-5, scores only 17.9/100, exposing a vast gap on the road to AGI: current agents show fundamental deficiencies in retaining long-term memory and in acting with proactive, self-motivated initiative. Beyond evaluating state-of-the-art LLMs on StuLife, we also explore the role of context engineering in advancing AGI. Our results suggest that optimizing how we guide models may be as crucial as improving the models themselves, positioning context engineering as a key enabler of progress toward AGI.

What is ELL?

ELL Framework

We introduce Experience-driven Lifelong Learning (ELL), a framework for building self-evolving agents capable of continuous growth through real-world interaction. Unlike traditional continual learning approaches, ELL emphasizes learning from experience: agents acquire knowledge not from static, labeled datasets, but through dynamic interaction with their environment.

The framework is built on four core principles:

Experience Exploration

The agent must be capable of sequentially decomposing and executing complex, long-horizon tasks that involve continuous interaction over minutes to hours with unquantifiable rewards. Through sustained and self-motivated engagement, it generates rich experiential data, enabling iterative learning and self-correction. This persistent interaction allows the agent to progressively refine strategies and adapt behavior based on dynamic feedback, mimicking the trial-and-error process of real-world learning.
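As a concrete illustration, below is a minimal sketch of such an exploration loop; the `agent`/`env` interfaces are our own assumptions for exposition, not the paper's API.

```python
# Minimal sketch of an experience-exploration loop. The Agent/Environment
# interfaces below are assumptions for exposition, not the paper's API.
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    steps: list = field(default_factory=list)   # (observation, action, feedback)

def explore(agent, env, max_steps=100):
    """Run one long-horizon episode, recording experience for later learning."""
    traj = Trajectory()
    obs = env.observe()
    for _ in range(max_steps):
        action = agent.act(obs)        # decompose and execute the task step by step
        feedback = env.step(action)    # qualitative feedback, not a scalar reward
        traj.steps.append((obs, action, feedback))
        agent.reflect(traj)            # self-correction against the growing trajectory
        obs = env.observe()
    return traj                        # raw material for memory and skill learning
```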

Long-term Memory

Experiential data is systematically processed and consolidated into persistent and structured memory, including raw observations, key events, learned facts, temporal contexts, and self-reflective insights. Memory is not passive storage but an active resource: it supports retrieval over long time spans, enables context-aware reasoning, and forms the foundation for future decision-making.
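A minimal sketch of what such a structured memory store could look like follows; the record schema and the naive keyword retriever are illustrative assumptions (a real system would use embedding or graph indexes).

```python
# Sketch of a structured long-term memory store with a deliberately naive
# keyword-overlap retriever; schema and method names are assumptions.
from dataclasses import dataclass

@dataclass
class MemoryRecord:
    timestamp: str        # temporal context
    kind: str             # "observation" | "event" | "fact" | "reflection"
    content: str

class LongTermMemory:
    def __init__(self):
        self.records = []

    def consolidate(self, record: MemoryRecord):
        """Fold a processed experience into persistent storage."""
        self.records.append(record)

    def retrieve(self, query: str, k: int = 5):
        """Rank records by keyword overlap with the query (stand-in for a real index)."""
        words = set(query.lower().split())
        scored = [(len(words & set(r.content.lower().split())), r) for r in self.records]
        return [r for score, r in sorted(scored, key=lambda x: -x[0])[:k] if score > 0]
```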

Skill Learning

The agent abstracts recurring patterns from experience into reusable skills, such as decision rules, functional modules, or problem-solving heuristics. These skills are explicitly constructed through reflection and validated through application in new and evolving tasks. The agent actively manages its skill repertoire, adding, refining, combining, or deprecating skills based on performance, creating a dynamic, self-improving system.
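The sketch below shows one way such a self-managed repertoire could be implemented; the `Skill`/`SkillLibrary` names and the reliability thresholds are assumptions for exposition.

```python
# Sketch of a self-managed skill repertoire: skills are added, tracked by
# outcome, and deprecated when they keep failing. Names/thresholds are assumed.
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    procedure: str          # e.g., a prompt template or decision rule
    successes: int = 0
    attempts: int = 0

    @property
    def reliability(self):
        return self.successes / self.attempts if self.attempts else 0.0

class SkillLibrary:
    def __init__(self, min_reliability=0.3, min_attempts=5):
        self.skills = {}
        self.min_reliability = min_reliability
        self.min_attempts = min_attempts

    def add(self, skill: Skill):
        self.skills[skill.name] = skill

    def record_outcome(self, name: str, success: bool):
        s = self.skills[name]
        s.attempts += 1
        s.successes += int(success)
        # Deprecate skills that keep failing once there is enough evidence.
        if s.attempts >= self.min_attempts and s.reliability < self.min_reliability:
            del self.skills[name]
```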

Knowledge Internalization

Beyond storing memories and reusing skills, the agent undergoes a process of knowledge internalization, transforming explicit and discrete knowledge into implicit and intuitive understanding. Over time, frequently used rules, patterns, and strategies are distilled into the agent's core reasoning process, reducing reliance on external retrieval or step-by-step reflection. This shift from deliberate application to automatic execution mirrors the cognitive transition from novice to expert, where learned behavior becomes "second nature".
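Continuing the `SkillLibrary` sketch above, one possible internalization mechanism (an assumption for illustration, not the paper's method) promotes skills that are used often and reliably out of explicit storage and into the agent's standing prompt, so they no longer require retrieval or deliberate application:

```python
# Assumed internalization mechanism: frequently and reliably used skills are
# distilled into the agent's standing system prompt and dropped from lookup.
def internalize(library, system_prompt, use_threshold=20, reliability_threshold=0.8):
    promoted = []
    for name, skill in list(library.skills.items()):
        if skill.attempts >= use_threshold and skill.reliability >= reliability_threshold:
            system_prompt += f"\n# Internalized habit: {skill.procedure}"
            promoted.append(name)
            del library.skills[name]    # no longer needs explicit retrieval
    return system_prompt, promoted
```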

What is StuLife?

StuLife Overview

We also introduce StuLife, a benchmark dataset for ELL that simulates a student's holistic college journey—from enrollment to academic and personal development—across three core phases and ten detailed sub-scenarios.

StuLife is designed around three key paradigm shifts:

From Passive to Proactive
From Context to Memory
From Imitation to Learning

StuLife is a new benchmark designed to evaluate the long-term memory, planning, adaptation, and autonomous decision-making capabilities of AI agents. It immerses agents in a persistent, stateful, and dynamic virtual university campus environment where their actions have lasting consequences.

Unlike traditional benchmarks that focus on stateless, single-turn tasks, StuLife creates a "virtual world" that evolves over a simulated academic year. An agent's success is not just about solving the immediate problem, but about managing its time, remembering its commitments, and navigating a complex web of academic and social responsibilities that persist across hundreds of tasks.

It features a dynamic, interactive environment in which tasks are highly interconnected, and critical state variables—such as GPA, course availability, advisor relationships, and time—evolve based on the agent's decisions. Agents must: 1) Autonomously acquire practical skills (e.g., course registration, scheduling, navigation, and communication), 2) Distill experiences into reusable knowledge, and 3) Maintain persistent memory to support future decision-making. Crucially, they are expected to exhibit intrinsic motivation by setting goals, anticipating future needs, and initiating actions without external prompting.

StuLife provides a comprehensive platform for evaluating lifelong learning capabilities, including memory retention, skill transfer, and autonomous, goal-directed behavior.


StuLife Architecture

Dataset Overview

The benchmark includes a comprehensive dataset of 1,284 tasks spanning a full academic year, covering scenarios such as:

  • Academic integrity and rule learning
  • Campus exploration and facility location
  • Course selection and schedule management
  • Attending different courses
  • Interacting with academic advisors
  • Library resource usage and seat booking
  • Midterm and final exams
  • Joining and participating in student clubs

Core Concepts

StuLife is founded on three key principles to challenge the frontiers of agent intelligence:

Persistent World

The campus environment is a single, continuous Python object (`CampusEnvironment`). Every action an agent takes—from sending an email to reserving a study room—permanently alters the state of this world. A booked room remains booked for all subsequent tasks. This creates a single source of truth and forces the agent to deal with the long-term consequences of its decisions.
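The snippet below illustrates this persistence. `CampusEnvironment` is the benchmark's actual object name, but `book_room` and its arguments are hypothetical stand-ins for whatever action interface a task exposes.

```python
# Illustrative interaction with the persistent world; `book_room` and its
# arguments are hypothetical stand-ins, not the benchmark's real interface.
env = CampusEnvironment()          # one continuous world object for all tasks

env.book_room("Library-3F-201", date="Week 3, Monday", slot="14:00-16:00")

# Hundreds of tasks later, the same object still carries that state:
env.book_room("Library-3F-201", date="Week 3, Monday", slot="14:00-16:00")
# -> rejected: the room is already booked, and the agent must plan around it.
```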

Stateful & Dynamic Subsystems

The world is composed of multiple interconnected subsystems (e.g., calendar, course selection, geography) that are dynamic and stateful. Course popularity fluctuates, room availability changes, and the agent's location persists between tasks. This requires the agent to constantly query the latest state of the world before acting, rather than relying on outdated information.
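A query-before-act sketch makes the point concrete; the subsystem and method names here are illustrative assumptions.

```python
# Query-before-act (illustrative interfaces): the world drifts between tasks,
# so the agent re-reads live state instead of trusting stale observations.
seats = env.course_system.query_seats("MATH101")   # fresh state, not a cached value
if seats > 0:
    env.course_system.register("MATH101")
else:
    agent.remember("MATH101 was full; re-check before the add/drop deadline")
```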

Time-Driven & Self-Directed Tasks

Agents are not always given explicit instructions. Instead, they operate on a simulated clock and must autonomously consult their internal calendar to understand "what to do next." Whether it's attending a class at 8:00 AM or a club meeting in the evening, the agent must demonstrate a sense of time and initiative, driven by the schedule it builds for itself.
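A sketch of this time-driven control loop follows; every interface in it is an illustrative assumption. The key property is that no task prompt arrives: the agent wakes on the simulated clock and consults the calendar it built for itself.

```python
# Assumed time-driven control loop: the simulated clock, not a user prompt,
# decides when the agent acts, and its own calendar decides what it does.
while not env.semester_over():
    now = env.clock.now()                      # simulated campus time
    event = agent.calendar.next_event(now)     # e.g., "08:00 class, Building A"
    if event is not None and event.starts_within(now, minutes=15):
        agent.execute(event)                   # initiative: act without being told
    else:
        agent.pursue_goals(now)                # self-directed work between events
```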

Citation

If this work is helpful to you, please cite our paper:

@article{cai2025building,
  title={Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark},
  author={Cai, Yuxuan and Hao, Yipeng and Zhou, Jie and Yan, Hang and Lei, Zhikai and Zheng, Rui and Han, Zhenhua and Yang, Yutao and Li, Junsong and Pan, Qianjun and others},
  journal={arXiv preprint arXiv:2508.19005},
  year={2025}
}

Contact

For questions, please contact us by email, or visit our GitHub Repository.

Leaderboard

The following are the latest results on the StuLife benchmark. Metrics include StuGPA, Long-Term Task Success Rate (LTRR), Proactive Interaction Score (PIS), and success rates and average turns for each task category.

SOTA LLMs Performance (Default Setting)

| Model | Type | StuGPA | LTRR | PIS | In-Class Success | In-Class AvgTurn | Daily Campus Success | Daily Campus AvgTurn | Exam Success | Exam AvgTurn | Total Success | Total AvgTurn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5 | thinking | 17.90 | 6.50 | 4.68 | 7.78 | 12.70 | 14.16 | 14.31 | 16.88 | 6.24 | 12.35 | 12.69 |
| Grok4 | thinking | 17.38 | 10.65 | 4.50 | 4.79 | 6.31 | 21.80 | 11.25 | 18.75 | 5.69 | 15.23 | 8.68 |
| DeepSeek-V3.1-thinking | open-source, thinking | 17.04 | 6.14 | 3.78 | 6.29 | 9.83 | 12.58 | 13.03 | 17.50 | 5.54 | 11.18 | 10.88 |
| Gemini-2.5-Pro | thinking | 16.43 | 7.04 | 3.24 | 5.39 | 14.94 | 18.88 | 12.78 | 15.63 | 9.51 | 13.53 | 13.19 |
| Qwen3-30B-A3B | open-source, thinking | 16.30 | 5.05 | 0.72 | 0.60 | 9.45 | 10.79 | 11.75 | 17.50 | 5.46 | 8.31 | 10.09 |
| Qwen3-235B-A22B | open-source | 16.03 | 5.42 | 1.80 | 2.10 | 18.71 | 10.34 | 17.17 | 16.88 | 10.75 | 8.52 | 16.95 |
| DeepSeek-V3.1 | open-source | 14.26 | 4.51 | 0.54 | 0.90 | 14.03 | 12.81 | 12.62 | 15.00 | 6.78 | 8.95 | 12.43 |
| DeepSeek-R1 | open-source, thinking | 14.25 | 8.30 | 3.96 | 5.09 | 8.04 | 13.26 | 13.02 | 18.13 | 4.56 | 11.18 | 10.08 |
| Qwen3-8B | open-source | 13.31 | 4.33 | 0.54 | 0.90 | 10.12 | 8.31 | 10.25 | 14.38 | 6.31 | 6.71 | 9.71 |
| QwQ-32B | open-source | 13.21 | 5.78 | 3.42 | 4.79 | 7.72 | 6.97 | 13.25 | 16.88 | 4.52 | 7.88 | 10.06 |
| Qwen3-32B | open-source, thinking | 12.67 | 5.42 | 1.26 | 1.80 | 8.31 | 7.64 | 10.74 | 17.50 | 4.94 | 7.24 | 9.10 |
| DeepSeek-V3 | open-source | 11.22 | 6.14 | 2.88 | 3.59 | 5.84 | 6.74 | 11.87 | 16.25 | 4.26 | 7.24 | 8.64 |
| Qwen3-32B | open-source | 7.36 | 3.97 | 0.54 | 0.60 | 7.80 | 2.25 | 13.79 | 13.13 | 4.88 | 3.51 | 10.41 |
| Llama-3.1-8B | open-source | 5.81 | 3.30 | 0.90 | 0.90 | 61.34 | 0.00 | 35.91 | 10.63 | 28.46 | 2.13 | 44.62 |

Context Engineering Methods Performance

All rows use the open-source Qwen3-235B-A22B as the base model.

| Method | Model | StuGPA | LTRR | PIS | In-Class Success | In-Class AvgTurn | Daily Campus Success | Daily Campus AvgTurn | Exam Success | Exam AvgTurn | Total Success | Total AvgTurn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla | Qwen3-235B-A22B | 16.03 | 5.42 | 1.80 | 2.10 | 18.71 | 10.34 | 17.17 | 16.88 | 10.75 | 8.52 | 16.95 |
| Proactive | Qwen3-235B-A22B | 16.90 | 5.96 | 3.06 | 5.09 | 16.70 | 10.34 | 16.38 | 16.88 | 7.73 | 9.58 | 15.42 |
| Skill | Qwen3-235B-A22B | 17.28 | 6.86 | 0.90 | 1.50 | 16.89 | 15.28 | 16.51 | 17.50 | 9.28 | 10.76 | 15.75 |
| Memory | Qwen3-235B-A22B + Vanilla RAG | 10.98 | 4.69 | 0.18 | 0.00 | 17.87 | 5.84 | 14.20 | 16.25 | 10.04 | 5.54 | 15.07 |
| Memory | Qwen3-235B-A22B + Graph RAG | 15.34 | 4.87 | 0.72 | 0.90 | 20.68 | 10.11 | 14.03 | 16.25 | 10.61 | 7.88 | 16.13 |
| Memory | Qwen3-235B-A22B + MemGPT | 19.99 | 6.86 | 1.44 | 2.40 | 17.28 | 13.03 | 13.59 | 23.75 | 9.02 | 11.08 | 14.42 |
| Memory | Qwen3-235B-A22B + MemoryBank | 17.64 | 5.96 | 1.62 | 0.90 | 16.68 | 12.36 | 14.15 | 20.00 | 8.04 | 9.58 | 14.35 |
| All-in-One | Qwen3-235B-A22B | 21.07 | 9.39 | 3.76 | 2.69 | 16.82 | 17.75 | 15.65 | 25.63 | 6.30 | 13.74 | 14.93 |

Note: StuGPA (Student GPA), LTRR (Long-Term Task Success Rate), PIS (Proactive Interaction Score), and the Success columns are reported on a 0-100 scale; AvgTurn is the average number of turns taken to complete a task.

Submit Your Results

We welcome submissions from researchers and practitioners working on the StuLife benchmark. Your contributions help advance the field of Experience-driven Lifelong Learning!

How to Submit

There are two ways to submit your results to the StuLife leaderboard:

Method 1: Email Submission

Send your results directly to our team via email:

1. Run your model on the StuLife benchmark and generate the runs.json file.

2. Prepare any additional information about your method (model description, hyperparameters, etc.).

3. Send both files to the team's email address.

Method 2: GitHub Pull Request

Contribute directly to our repository:

1. Fork the repository.

2. Run your model and output results to the ./result folder.

3. Create a pull request with your submission.

Review Process

Weekly Updates

We conduct manual reviews of all submissions every Monday to update the leaderboard with the latest results. Please allow up to one week for your submission to appear on the leaderboard.

Submission Guidelines

Required Files

Include the runs.json output file from your benchmark run

Method Description

Provide a brief description of your method, including model architecture and key innovations

Reproducibility

Include hyperparameters and implementation details for reproducibility

Validation

Ensure your results are generated using the official StuLife evaluation framework