01 – OVERVIEW
EvoAgentBench
A Unified Evaluation Framework for AI Agent Self-Evolution
EvoAgentBench enables standardized comparison of agent self-evolution methods: techniques that allow agents to improve their performance by learning from past experience. It provides pluggable abstractions for domains, agents, and skill-evaluation methods, making it easy to evaluate how different self-evolution approaches generalize across information retrieval, reasoning, software engineering, code implementation, and knowledge work.
Multi-Domain Evaluation
5 diverse evaluation domains (information retrieval, reasoning, software engineering, code implementation, and knowledge work) with clustered train/test splits and a unified evaluation pipeline.
Multi-Agent Support
Plug in any CLI-based agent: Nanabot, OpenClaw, or your own. Each task runs with an isolated config and an independent workspace, with support for concurrent execution and automatic retry.
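The isolated-workspace, retry, and concurrency behavior could be sketched as follows. This is an illustrative stand-in, not the framework's real runner; the command-line interface of the agent is assumed to take the task prompt as its final argument:

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor


def run_cli_agent(cmd: list[str], prompt: str,
                  retries: int = 2, timeout: int = 600) -> str:
    """Run a CLI agent on one task, retrying on failure or timeout.

    Each attempt gets a fresh, independent workspace directory.
    """
    for _attempt in range(retries + 1):
        workspace = tempfile.mkdtemp(prefix="evo-task-")
        try:
            result = subprocess.run(
                cmd + [prompt],
                cwd=workspace,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            if result.returncode == 0:
                return result.stdout
        except subprocess.TimeoutExpired:
            pass  # fall through and retry in a clean workspace
    return ""  # all attempts failed


def run_batch(cmd: list[str], prompts: list[str], max_workers: int = 4) -> list[str]:
    """Execute many tasks concurrently, each in its own workspace."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: run_cli_agent(cmd, p), prompts))
```

Per-task temporary directories keep concurrent runs from clobbering each other's files, which is the point of the "independent workspace" design.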
Self-Evolution Comparison
Standardized train → extract → evaluate protocol for comparing skill-based self-evolution methods. Supports both offline (batch extraction) and online (learn-as-you-go) evaluation modes.
02 – RESULTS
Agent Performance
Partial results shown below. More agents, domains, and methods coming soon.
72.4%
Best With Skills
+10.9%
Avg. Improvement
20
Configurations
| # | Agent | Base Model | Domain | Self-Evolution Method | Without Skills | With Skills | Δ | Cost |
|---|---|---|---|---|---|---|---|---|
| 1 | OpenClaw | Qwen3.5-397B | Knowledge Work | Human Design | 50.0% | 72.4% | +22.4 | ↓ 91.8% turns |
| 2 | OpenClaw | Qwen3.5-27B | Knowledge Work | Human Design | 51.7% | 65.5% | +13.8 | ↓ 43.8% turns |
| 3 | OpenClaw | Qwen3.5-397B | Code Implementation | Human Design | 56.4% | 64.1% | +7.7 | ↓ 6.7% turns |
| 4 | OpenClaw | Qwen3.5-27B | Code Implementation | Human Design | 53.8% | 64.1% | +10.3 | ↓ 0% turns |
| 5 | OpenClaw | Qwen3.5-397B | Software Engineering | Human Design | 26.9% | 61.5% | +34.6 | ↓ 0.5% turns |
| 6 | OpenClaw | Qwen3.5-397B | Reasoning & Problem Decomposition | Human Design | 45.0% | 60.0% | +15.0 | ↓ 2.7% chars |
| 7 | OpenClaw | Qwen3.5-397B | Knowledge Work | EverOS | 50.0% | 56.9% | +6.9 | ↓ 24.7% turns |
| 8 | OpenClaw | Qwen3.5-397B | Code Implementation | EverOS | 56.4% | 56.4% | +0.0 | ↓ 3.3% turns |
| 9 | OpenClaw | Qwen3.5-397B | Information Retrieval | Human Design | 32.3% | 55.4% | +23.1 | ↓ 21.8% turns |
| 10 | OpenClaw | Qwen3.5-27B | Knowledge Work | EverOS | 51.7% | 55.2% | +3.5 | ↓ 5.4% turns |
| 11 | OpenClaw | Qwen3.5-27B | Code Implementation | EverOS | 53.8% | 51.3% | -2.5 | ↓ 4.8% turns |
| 12 | OpenClaw | Qwen3.5-397B | Reasoning & Problem Decomposition | EverOS | 45.0% | 49.0% | +4.0 | ↓ 32.1% chars |
| 13 | OpenClaw | Qwen3.5-397B | Information Retrieval | EverOS | 32.3% | 43.1% | +10.8 | ↓ 33.1% turns |
| 14 | OpenClaw | Qwen3.5-27B | Reasoning & Problem Decomposition | EverOS | 37.0% | 42.0% | +5.0 | ↓ 6.2% chars |
| 15 | OpenClaw | Qwen3.5-397B | Software Engineering | EverOS | 26.9% | 38.5% | +11.6 | ↓ 11.4% turns |
| 16 | OpenClaw | Qwen3.5-27B | Software Engineering | EverOS | 11.5% | 38.5% | +27.0 | ↓ 41.2% turns |
| 17 | OpenClaw | Qwen3.5-27B | Software Engineering | Human Design | 11.5% | 38.5% | +27.0 | ↓ 62.5% turns |
| 18 | OpenClaw | Qwen3.5-27B | Information Retrieval | Human Design | 30.8% | 35.4% | +4.6 | ↓ 14.2% turns |
| 19 | OpenClaw | Qwen3.5-27B | Information Retrieval | EverOS | 30.8% | 32.3% | +1.5 | ↓ 4.7% turns |
| 20 | OpenClaw | Qwen3.5-27B | Reasoning & Problem Decomposition | Human Design | 37.0% | 29.0% | -8.0 | ↓ 13.0% chars |
03 – DOMAINS
Evaluation Domains
EvoAgentBench builds on existing benchmarks by clustering tasks into domains with train/test splits for self-evolution training and evaluation.
| Domain | Base Benchmark | Description | Clusters | Train | Test |
|---|---|---|---|---|---|
| Information Retrieval | BrowseCompPlus | Search a local corpus to answer complex multi-constraint questions. | 10 (by topic) | 154 | 65 |
| Reasoning & Problem Decomposition | OmniMath | Solve competition-level math problems across multiple subdisciplines. | By subdiscipline | 478 | 100 |
| Software Engineering | SWE-Bench | Fix real-world bugs in open-source Python repositories. | 19 (by repo) | 101 | 26 |
| Code Implementation | LiveCodeBench | Solve competitive programming problems with code execution. | 39 (by type) | 97 | 39 |
| Knowledge Work | GDPVal | Perform real-world occupational tasks (Excel, PDF, Word). | 29 (by occupation) | 87 | 58 |
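The clustered splits in the table above (by topic, repo, problem type, or occupation) could be produced with a stratified scheme like the following sketch. The exact splitting policy is an assumption; this version splits within each cluster so that both train and test cover every cluster:

```python
import random
from collections import defaultdict


def clustered_split(tasks, cluster_key, test_frac=0.3, seed=0):
    """Group tasks by cluster, then split within each cluster.

    Every cluster contributes to both splits (stratified split), so skills
    learned on train-set clusters are evaluated on held-out tasks from the
    same clusters.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible split
    clusters = defaultdict(list)
    for t in tasks:
        clusters[cluster_key(t)].append(t)

    train, test = [], []
    for _key, members in sorted(clusters.items()):
        rng.shuffle(members)
        k = max(1, round(len(members) * test_frac))  # at least 1 test task
        test.extend(members[:k])
        train.extend(members[k:])
    return train, test
```

An alternative policy would hold out entire clusters to test cross-cluster generalization; the stratified version above matches the "Clusters / Train / Test" layout where every cluster has tasks on both sides.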
04 – SELF-EVOLUTION
Skill Extraction Methods
EvoAgentBench provides a standardized protocol for evaluating agent self-evolution methods: techniques that let agents learn from past experience and improve future performance. Five methods are currently integrated.
EverOS
Memory-based extraction. A memory OS that makes agents more personal while saving tokens. It extracts and stores long-term memory from session trajectories, then injects reusable skills as domain-specific strategies.
GitHub

EvoSkill
Evolutionary optimization. An agent-agnostic toolkit for automatically creating and improving AI skills. It runs an evolution loop in which an Executor collects failures and a Proposer analyzes patterns to synthesize reusable skills.
GitHub

Memento
Retrieval-based (CBR). A memory-based continual-learning framework using Case-Based Reasoning. It logs successful and failed trajectories into a Case Bank and retrieves by Q-value with SimCSE embeddings to steer planning.
GitHub

OpenSpace
Continuous accumulation. A self-evolving engine where every task makes every agent smarter. Skills automatically select, apply, monitor, analyze, and evolve themselves via three evolution modes (FIX, DERIVED, CAPTURED).
GitHub

ReasoningBank
Memory + reasoning. A memory mechanism that learns from both successful and failed trajectories, storing reasoning as memory content. It introduces memory-aware test-time scaling: experience-driven memory as an additional scaling dimension for agent systems.
GitHub
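The retrieval-style methods above (Memento's Case Bank in particular) share a common core: embed past trajectories and fetch the most similar cases for a new task. A minimal sketch, with a toy bag-of-words embedding standing in for a learned model like SimCSE (the `CaseBank` class and its methods are illustrative, not any library's actual API):

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; real systems use a learned encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class CaseBank:
    """Store past (task, outcome) cases; retrieve the k most similar."""

    def __init__(self):
        self.cases = []  # (embedding, task_text, outcome)

    def add(self, task_text: str, outcome: str) -> None:
        self.cases.append((embed(task_text), task_text, outcome))

    def retrieve(self, query: str, k: int = 3):
        q = embed(query)
        ranked = sorted(self.cases, key=lambda c: cosine(q, c[0]), reverse=True)
        return [(text, outcome) for _, text, outcome in ranked[:k]]
```

Retrieved cases (both successes and failures) are then injected into the agent's context to steer its planning on the new task.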