EvoAgentBench

01 - OVERVIEW


A Unified Evaluation Framework for AI Agent Self-Evolution

EvoAgentBench enables standardized comparison of agent self-evolution methods: techniques that allow agents to improve their performance by learning from past experience. It provides pluggable abstractions for domains, agents, and skill-evaluation methods, making it easy to evaluate how different self-evolution approaches generalize across information retrieval, reasoning, software engineering, code implementation, and knowledge work.

🌐

Multi-Domain Evaluation

Five diverse evaluation domains (information retrieval, reasoning, software engineering, code implementation, and knowledge work) with clustered train/test splits and a unified evaluation pipeline.

🤖

Multi-Agent Support

Plug in any CLI-based agent, such as Nanabot, OpenClaw, or your own. Each task runs with an isolated config and an independent workspace, with support for concurrent execution and automatic retry.
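The per-task isolation described above can be sketched as a small adapter. This is illustrative only: the `CLIAgent` class name, retry policy, and workspace handling are assumptions, not EvoAgentBench's actual API.

```python
import subprocess
import tempfile


class CLIAgent:
    """Illustrative adapter for a CLI-based agent (hypothetical names,
    not EvoAgentBench's real interface)."""

    def __init__(self, command: list[str], max_retries: int = 2):
        self.command = command          # e.g. ["my-agent", "--json"]
        self.max_retries = max_retries  # automatic retry on failure

    def run_task(self, prompt: str) -> str:
        # Each task gets its own temporary working directory, so
        # concurrently running tasks never share state.
        with tempfile.TemporaryDirectory() as workspace:
            for _ in range(self.max_retries + 1):
                result = subprocess.run(
                    self.command,
                    input=prompt,
                    capture_output=True,
                    text=True,
                    cwd=workspace,  # isolate the agent in its workspace
                )
                if result.returncode == 0:
                    return result.stdout
        raise RuntimeError(f"{self.command[0]} failed after retries")
```

Running the agent in a throwaway `cwd` is one simple way to get the "independent workspace" property; a real harness would also snapshot the agent's config per task.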

🧬

Self-Evolution Comparison

Standardized train → extract → evaluate protocol for comparing skill-based self-evolution methods. Supports both offline (batch extraction) and online (learn-as-you-go) evaluation modes.
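The two modes can be sketched as follows. `agent.run` and `method.extract` are stand-ins for whatever the agent and self-evolution method actually expose; none of these names come from the benchmark itself.

```python
def evaluate_offline(agent, method, train_tasks, test_tasks):
    """Offline mode: run all training tasks, extract skills in one
    batch, then evaluate on the held-out test split."""
    trajectories = [agent.run(task) for task in train_tasks]
    skills = method.extract(trajectories)  # batch extraction
    return [agent.run(task, skills=skills) for task in test_tasks]


def evaluate_online(agent, method, test_tasks):
    """Online mode: learn as you go -- skills extracted after each
    task become available to all subsequent tasks."""
    skills, results = [], []
    for task in test_tasks:
        trajectory = agent.run(task, skills=skills)
        results.append(trajectory)
        skills = method.extract([trajectory], prior=skills)
    return results
```

The key difference is where extraction happens: once after the whole training split (offline), or incrementally inside the evaluation loop (online).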

02 - RESULTS

Agent Performance

Partial results are shown below; more agents, domains, and methods are coming soon.


Best With Skills: 72.4%

Avg. Improvement: +10.9%

Configurations: 20

| # | Agent | Base Model | Domain | Self-Evolving Method | Without | With Skills | Δ | Cost |
|---|-------|------------|--------|----------------------|---------|-------------|---|------|
| 1 | OpenClaw | Qwen3.5-397B | Knowledge Work | Human Design | 50.0% | 72.4% | +22.4 | ↑ 91.8% turns |
| 2 | OpenClaw | Qwen3.5-27B | Knowledge Work | Human Design | 51.7% | 65.5% | +13.8 | ↑ 43.8% turns |
| 3 | OpenClaw | Qwen3.5-397B | Code Implementation | Human Design | 56.4% | 64.1% | +7.7 | ↓ 6.7% turns |
| 4 | OpenClaw | Qwen3.5-27B | Code Implementation | Human Design | 53.8% | 64.1% | +10.3 | ± 0% turns |
| 5 | OpenClaw | Qwen3.5-397B | Software Engineering | Human Design | 26.9% | 61.5% | +34.6 | ↓ 0.5% turns |
| 6 | OpenClaw | Qwen3.5-397B | Reasoning & Problem Decomposition | Human Design | 45.0% | 60.0% | +15.0 | ↑ 2.7% chars |
| 7 | OpenClaw | Qwen3.5-397B | Knowledge Work | EverOS | 50.0% | 56.9% | +6.9 | ↑ 24.7% turns |
| 8 | OpenClaw | Qwen3.5-397B | Code Implementation | EverOS | 56.4% | 56.4% | +0.0 | ↓ 3.3% turns |
| 9 | OpenClaw | Qwen3.5-397B | Information Retrieval | Human Design | 32.3% | 55.4% | +23.1 | ↓ 21.8% turns |
| 10 | OpenClaw | Qwen3.5-27B | Knowledge Work | EverOS | 51.7% | 55.2% | +3.5 | ↑ 5.4% turns |
| 11 | OpenClaw | Qwen3.5-27B | Code Implementation | EverOS | 53.8% | 51.3% | -2.5 | ↓ 4.8% turns |
| 12 | OpenClaw | Qwen3.5-397B | Reasoning & Problem Decomposition | EverOS | 45.0% | 49.0% | +4.0 | ↓ 32.1% chars |
| 13 | OpenClaw | Qwen3.5-397B | Information Retrieval | EverOS | 32.3% | 43.1% | +10.8 | ↓ 33.1% turns |
| 14 | OpenClaw | Qwen3.5-27B | Reasoning & Problem Decomposition | EverOS | 37.0% | 42.0% | +5.0 | ↓ 6.2% chars |
| 15 | OpenClaw | Qwen3.5-397B | Software Engineering | EverOS | 26.9% | 38.5% | +11.6 | ↓ 11.4% turns |
| 16 | OpenClaw | Qwen3.5-27B | Software Engineering | EverOS | 11.5% | 38.5% | +27.0 | ↑ 41.2% turns |
| 17 | OpenClaw | Qwen3.5-27B | Software Engineering | Human Design | 11.5% | 38.5% | +27.0 | ↑ 62.5% turns |
| 18 | OpenClaw | Qwen3.5-27B | Information Retrieval | Human Design | 30.8% | 35.4% | +4.6 | ↑ 14.2% turns |
| 19 | OpenClaw | Qwen3.5-27B | Information Retrieval | EverOS | 30.8% | 32.3% | +1.5 | ↑ 4.7% turns |
| 20 | OpenClaw | Qwen3.5-27B | Reasoning & Problem Decomposition | Human Design | 37.0% | 29.0% | -8.0 | ↓ 13.0% chars |

03 - DOMAINS

Evaluation Domains

EvoAgentBench builds on existing benchmarks by clustering tasks into domains with train/test splits for self-evolution training and evaluation.

| Domain | Base Benchmark | Description | Clusters | Train | Test |
|--------|----------------|-------------|----------|-------|------|
| Information Retrieval | BrowseCompPlus | Search a local corpus to answer complex multi-constraint questions. | 10 (by topic) | 154 | 65 |
| Reasoning & Problem Decomposition | OmniMath | Solve competition-level math problems across multiple subdisciplines. | By subdiscipline | 478 | 100 |
| Software Engineering | SWE-Bench | Fix real-world bugs in open-source Python repositories. | 19 (by repo) | 101 | 26 |
| Code Implementation | LiveCodeBench | Solve competitive programming problems with code execution. | 39 (by type) | 97 | 39 |
| Knowledge Work | GDPVal | Perform real-world occupational tasks (Excel, PDF, Word). | 29 (by occupation) | 87 | 58 |
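One way to realize clustered train/test splits is a stratified split that keeps every cluster represented in both halves. This is a sketch under that assumption; the exact splitting rule EvoAgentBench uses is not specified here.

```python
from collections import defaultdict


def clustered_split(tasks, cluster_key, train_frac=0.7):
    """Stratified split: each cluster contributes proportionally to
    both train and test, so skills learned on the train split are
    evaluated on unseen tasks from the same clusters.
    (Illustrative; the benchmark's actual rule may differ.)"""
    by_cluster = defaultdict(list)
    for task in tasks:
        by_cluster[cluster_key(task)].append(task)

    train, test = [], []
    for cluster_tasks in by_cluster.values():
        cut = round(len(cluster_tasks) * train_frac)
        train.extend(cluster_tasks[:cut])
        test.extend(cluster_tasks[cut:])
    return train, test
```

An alternative design holds out entire clusters for test, which instead measures cross-cluster generalization; the two choices answer different questions.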

04 - SELF-EVOLUTION

Skill Extraction Methods

EvoAgentBench provides a standardized protocol for evaluating agent self-evolution methods: techniques that let agents learn from past experience and improve future performance. Five methods are currently integrated.
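A pluggable skill-extraction method might conform to an interface along these lines. The names here are hypothetical, not the benchmark's real API; the no-op baseline corresponds to running without skills.

```python
from abc import ABC, abstractmethod


class SkillMethod(ABC):
    """Minimal interface a self-evolution method could implement to
    plug into a harness like this (hypothetical names)."""

    @abstractmethod
    def extract(self, trajectories: list) -> list[str]:
        """Turn past trajectories into reusable skill strings."""

    @abstractmethod
    def inject(self, skills: list[str], prompt: str) -> str:
        """Attach skills to the agent's task prompt."""


class NoOpMethod(SkillMethod):
    """Baseline: no learning, i.e. the 'without skills' condition."""

    def extract(self, trajectories):
        return []

    def inject(self, skills, prompt):
        return prompt
```

Splitting the interface into `extract` and `inject` lets the same harness cover memory-based, retrieval-based, and evolutionary methods: they differ mainly in what `extract` produces.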

EverOS

Memory-based extraction

A memory OS that makes agents more personal while saving tokens. Extracts and stores long-term memory from session trajectories, then injects reusable skills as domain-specific strategies.

GitHub

EvoSkill

Evolutionary optimization

An agent-agnostic toolkit for automatically creating and improving AI skills. Runs an evolution loop where an Executor collects failures and a Proposer analyzes patterns to synthesize reusable skills.

GitHub

Memento

Retrieval-based (CBR)

A memory-based continual-learning framework using Case-Based Reasoning. Logs successful and failed trajectories into a Case Bank, then retrieves relevant cases by Q-value, using SimCSE embeddings, to steer planning.

GitHub
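The retrieval step of such a case-based approach can be sketched with plain cosine similarity. Memento itself uses SimCSE embeddings and Q-value-weighted retrieval, so this is only a simplified stand-in with made-up data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def retrieve_cases(query_vec, case_bank, k=2):
    """Return the k stored trajectories whose embedding is most
    similar to the query. case_bank: list of (embedding, trajectory)
    pairs. Illustrative only -- Memento additionally weights
    retrieval by a learned Q-value."""
    ranked = sorted(case_bank,
                    key=lambda c: cosine(query_vec, c[0]),
                    reverse=True)
    return [traj for _, traj in ranked[:k]]
```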

OpenSpace

Continuous accumulation

A self-evolving engine where every task makes every agent smarter. Skills are automatically selected, applied, monitored, and analyzed, and evolve via three evolution modes (FIX, DERIVED, CAPTURED).

GitHub

ReasoningBank

Memory + reasoning

A memory mechanism that learns from both successful and failed trajectories, storing reasoning as memory content. Introduces memory-aware test-time scaling, which treats experience-driven memory as an additional scaling dimension for agent systems.

GitHub