EvoAgentBench

01 — OVERVIEW

A Unified Evaluation Framework for AI Agent Self-Evolution

EvoAgentBench enables standardized comparison of agent self-evolution methods — techniques that allow agents to improve their performance by learning from past experience. It provides pluggable abstractions for domains, agents, and skill evaluation methods, making it easy to evaluate how different self-evolution approaches generalize across information retrieval, reasoning, software engineering, code implementation, and knowledge work.

Multi-Domain Evaluation

Five diverse evaluation domains (information retrieval, reasoning, software engineering, code implementation, and knowledge work) with clustered train/test splits and a unified evaluation pipeline.

Multi-Agent Support

Plug in any CLI-based agent: Nanobot, OpenClaw, or your own. Each task runs in an isolated configuration with its own workspace, supporting concurrent execution and automatic retry.

Self-Evolution Comparison

Standardized train → extract → evaluate protocol for comparing skill-based self-evolution methods. Supports both offline (batch extraction) and online (learn-as-you-go) evaluation modes.

02 — LEADERBOARD

Method ranking per configuration

For each (agent, model, domain) cell, methods are sorted by Δ, the gain from adding skills (pass rate with skills minus pass rate without). The baseline pass rate is shown next to each method.

20 configurations; Δ can be positive or negative.

Information Retrieval

OpenClaw · 27B
| method | base | Δ |
| --- | --- | --- |
| GEPA | 10.8 | +14 |
| Memento | 10.8 | +6 |
| EverOS | 10.8 | +5 |
| ReasoningBank | 10.8 | +2 |

Reasoning & Problem Decomposition

OpenClaw · 27B
| method | base | Δ |
| --- | --- | --- |
| EverOS | 44.0 | -5 |
| Memento | 44.0 | -12 |
| GEPA | 44.0 | -12 |
| ReasoningBank | 44.0 | -23 |

Software Engineering

OpenClaw · 27B
| method | base | Δ |
| --- | --- | --- |
| ReasoningBank | 38.5 | +12 |
| EverOS | 38.5 | +8 |
| Memento | 38.5 | -8 |
| GEPA | 38.5 | -23 |

Code Implementation

OpenClaw · 27B
| method | base | Δ |
| --- | --- | --- |
| GEPA | 46.2 | +5 |
| EverOS | 46.2 | -3 |
| Memento | 46.2 | -13 |
| ReasoningBank | 46.2 | -18 |

Knowledge Work

OpenClaw · 27B
| method | base | Δ |
| --- | --- | --- |
| EverOS | 37.3 | +13 |
| GEPA | 37.3 | +4 |
| Memento | 37.3 | -2 |
| ReasoningBank | 37.3 | -3 |

Information Retrieval

OpenClaw · 397B
| method | base | Δ |
| --- | --- | --- |
| EverOS | 30.8 | +20 |
| ReasoningBank | 30.8 | +11 |
| Memento | 30.8 | -2 |
| GEPA | 30.8 | -5 |

Reasoning & Problem Decomposition

OpenClaw · 397B
| method | base | Δ |
| --- | --- | --- |
| EverOS | 48.0 | +2 |
| ReasoningBank | 48.0 | -5 |
| GEPA | 48.0 | -6 |
| Memento | 48.0 | -10 |

Software Engineering

OpenClaw · 397B
| method | base | Δ |
| --- | --- | --- |
| ReasoningBank | 25.0 | +40 |
| GEPA | 25.0 | +25 |
| EverOS | 25.0 | +21 |
| Memento | 25.0 | +17 |

Code Implementation

OpenClaw · 397B
| method | base | Δ |
| --- | --- | --- |
| GEPA | 46.2 | +18 |
| Memento | 46.2 | +5 |
| EverOS | 46.2 | -8 |
| ReasoningBank | 46.2 | -8 |

Knowledge Work

OpenClaw · 397B
| method | base | Δ |
| --- | --- | --- |
| EverOS | 45.1 | +8 |
| GEPA | 45.1 | +4 |
| Memento | 45.1 | -2 |
| ReasoningBank | 45.1 | -2 |

Information Retrieval

Nanobot · 27B
| method | base | Δ |
| --- | --- | --- |
| EverOS | 6.2 | +8 |
| ReasoningBank | 6.2 | +3 |
| GEPA | 6.2 | -2 |
| Memento | 6.2 | -3 |

Reasoning & Problem Decomposition

Nanobot · 27B
| method | base | Δ |
| --- | --- | --- |
| GEPA | 47.0 | -3 |
| EverOS | 47.0 | -4 |
| Memento | 47.0 | -5 |
| ReasoningBank | 47.0 | -5 |

Software Engineering

Nanobot · 27B
| method | base | Δ |
| --- | --- | --- |
| EverOS | 38.5 | +19 |
| Memento | 38.5 | +12 |
| GEPA | 38.5 | +12 |
| ReasoningBank | 38.5 | |

Code Implementation

Nanobot · 27B
| method | base | Δ |
| --- | --- | --- |
| ReasoningBank | 25.6 | +10 |
| GEPA | 25.6 | +10 |
| Memento | 25.6 | +5 |
| EverOS | 25.6 | -8 |

Knowledge Work

Nanobot · 27B
| method | base | Δ |
| --- | --- | --- |
| EverOS | 43.1 | +17 |
| ReasoningBank | 43.1 | |
| Memento | 43.1 | -4 |
| GEPA | 43.1 | -8 |

Information Retrieval

Nanobot · 397B
| method | base | Δ |
| --- | --- | --- |
| Memento | 10.8 | +15 |
| EverOS | 10.8 | +9 |
| ReasoningBank | 10.8 | +3 |
| GEPA | 10.8 | +2 |

Reasoning & Problem Decomposition

Nanobot · 397B
| method | base | Δ |
| --- | --- | --- |
| Memento | 53.0 | +2 |
| GEPA | 53.0 | +2 |
| EverOS | 53.0 | -2 |
| ReasoningBank | 53.0 | -4 |

Software Engineering

Nanobot · 397B
| method | base | Δ |
| --- | --- | --- |
| ReasoningBank | 46.2 | +12 |
| Memento | 46.2 | +4 |
| EverOS | 46.2 | -4 |
| GEPA | 46.2 | -4 |

Code Implementation

Nanobot · 397B
| method | base | Δ |
| --- | --- | --- |
| GEPA | 51.3 | +10 |
| ReasoningBank | 51.3 | |
| Memento | 51.3 | -8 |
| EverOS | 51.3 | -15 |

Knowledge Work

Nanobot · 397B
| method | base | Δ |
| --- | --- | --- |
| ReasoningBank | 54.9 | +11 |
| EverOS | 54.9 | +9 |
| Memento | 54.9 | +8 |
| GEPA | 54.9 | +6 |

Bar length encodes Δ magnitude. See the full leaderboard for per-cell numbers including cost.
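
The ranking rule used in each cell (sort methods by Δ, the with-skills pass rate minus the no-skill baseline) can be sketched as below. The function name `rank_methods`, the method labels, and the pass rates are illustrative assumptions, not values taken from the leaderboard.

```python
from typing import Dict, List, Tuple


def rank_methods(base: float, with_skills: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (method, delta) pairs sorted by delta, largest gain first.

    delta = with-skills pass rate - baseline pass rate, in percentage points.
    """
    deltas = [(name, round(rate - base, 1)) for name, rate in with_skills.items()]
    # sorted() is stable, so methods with equal delta keep their input order.
    return sorted(deltas, key=lambda pair: pair[1], reverse=True)


# Hypothetical pass rates (percent) for one (agent, model, domain) cell.
example = rank_methods(40.0, {"A": 45.0, "B": 32.5, "C": 52.0})
print(example)  # [('C', 12.0), ('A', 5.0), ('B', -7.5)]
```

A negative Δ means the method made the agent worse than running with no skills at all, which is why the baseline is shown alongside every row.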

03 — WHY THIS MATTERS

Why Self-Evolution Matters, and What We Learned Running This

Why self-evolution matters

If an agent solves a problem today, it shouldn’t have to start from zero on a similar one tomorrow. Useful experience isn’t just a log of what happened — it’s a way of working: a search habit, a debugging move, a verification step, a recipe for producing something useful. Self-evolution is the question of whether an agent can pick up these habits on its own, from its own past attempts, without retraining the model underneath.

Most benchmarks today don’t really test this. They either ask “can the agent solve a fresh task?” or “can the agent remember what it saw?” Neither tells you whether yesterday’s way of working actually shows up when the agent tries something new today. EvoAgentBench is built around that specific question.

Three things to watch out for if you’re building a self-evolution method

We ran a lot of combinations. If you’re working on a self-evolution method, three patterns showed up over and over. They’re worth thinking about before you start optimizing.

The bottleneck is usually what you remember, not how you search. When skills help, it’s mostly because what got captured had real structure to it, not because the search step got smarter. Tuning the retriever, swapping rerankers, adding hybrid search: none of it makes a difference if the stored skills are vague to begin with. Better to spend that energy on what to write down in the first place.

When different problems look alike but need different methods, search will mislead you. Math is the cleanest example: combinatorics, generating-function problems, and group-theory problems all use words like “triangle”, “vertex”, “configuration”. A search-based system matches on the words and pulls up a skill from a different family; the agent follows it and now gets wrong a problem it would have solved at baseline. The fix isn’t a better retriever. It’s giving each skill a way to say “I’m for this kind of problem”.

Don’t inject by default. A wrong skill can take a problem the agent would have solved and break it. A missing skill just leaves things where they were. So on tricky domains, the safer default isn’t “always use a skill”, it’s “use one only when you’re sure it fits”. Most methods today inject too eagerly.
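
The last two points (let each skill declare what it is for, and abstain unless the fit is clear) can be combined into a small gate. This is a minimal sketch under stated assumptions: the `Skill` class, its `applies_to` field, the `select_skill` helper, and the 0.8 threshold are all hypothetical and not part of EvoAgentBench or any of the evaluated methods.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Skill:
    name: str
    applies_to: str  # problem family the skill declares itself for (hypothetical field)
    text: str


def select_skill(
    problem_family: str,
    candidates: List[Tuple[Skill, float]],  # (skill, retrieval score) pairs
    min_score: float = 0.8,                 # assumed confidence threshold
) -> Optional[Skill]:
    """Return a skill only if it declares the right family and scores high enough."""
    eligible = [
        (skill, score)
        for skill, score in candidates
        if skill.applies_to == problem_family and score >= min_score
    ]
    if not eligible:
        return None  # abstain: the agent runs exactly as it would at baseline
    return max(eligible, key=lambda pair: pair[1])[0]


candidates = [
    (Skill("count-by-symmetry", "combinatorics", "Count orbits, then divide."), 0.91),
    (Skill("orbit-stabilizer", "group-theory", "Relate orbit size to stabilizer."), 0.95),
]
chosen = select_skill("combinatorics", candidates)
print(chosen.name if chosen else None)       # count-by-symmetry
print(select_skill("geometry", candidates))  # None
```

Note that the group-theory skill has the higher retrieval score but is never injected into a combinatorics problem, which is the failure mode described above.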

What we did to make the comparisons fair

For numbers to be worth comparing, the setup has to be tight. Three things we cared about most:

Nobody gets to peek at the answers. Methods can only look at training tasks: the question, what the agent tried, and whether it worked. The actual test answers, and the test trajectories that succeeded, are off-limits during evolution. We didn’t make this a rule for people to follow; we wired it in so the evolution code and the test answers simply never meet.

Training and test tasks are related, not random. For each domain, we grouped problems by what they have in common: jobs (for knowledge work), topics (for web search), subdisciplines (for math), repository patterns (for software issues), problem neighborhoods (for code challenges). Then we made sure every test problem has training problems in the same group. That way, if a method doesn’t help, you can’t blame it on the test being unrelated to anything the agent ever saw.

Every cell on the leaderboard is real. Two agent frameworks, two model sizes, five domains, four self-evolution methods: that’s 80 combinations, each measured against the same no-skill baseline. Every cell ran with its method’s default settings and the same task, tool, and scoring setup, and we tracked how many turns it took alongside accuracy. No placeholders, no “not evaluated under this setting” gaps.
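
The arithmetic behind the grid can be checked by enumerating it. The names below come from the leaderboard and method sections of this page; the variable names are only illustrative.

```python
from itertools import product

AGENTS = ["OpenClaw", "Nanobot"]
MODELS = ["27B", "397B"]
DOMAINS = [
    "Information Retrieval",
    "Reasoning & Problem Decomposition",
    "Software Engineering",
    "Code Implementation",
    "Knowledge Work",
]
METHODS = ["EverOS", "Memento", "ReasoningBank", "GEPA"]

# One configuration per (agent, model, domain); one cell per method within it.
configurations = list(product(AGENTS, MODELS, DOMAINS))
cells = list(product(AGENTS, MODELS, DOMAINS, METHODS))

print(len(configurations))  # 20
print(len(cells))           # 80
```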

04 — DOMAINS

Evaluation Domains

EvoAgentBench builds on existing benchmarks: within each domain, tasks are clustered and divided into train/test splits for self-evolution training and evaluation.

| Domain | Base Benchmark | Description | Clusters | Train | Test |
| --- | --- | --- | --- | --- | --- |
| Information Retrieval | BrowseCompPlus | Search a local corpus to answer complex multi-constraint questions. | 10 (by topic) | 154 | 65 |
| Reasoning & Problem Decomposition | OmniMath | Solve competition-level math problems across multiple subdisciplines. | By subdiscipline | 478 | 100 |
| Software Engineering | SWE-Bench | Fix real-world bugs in open-source Python repositories. | 19 (by repo) | 101 | 26 |
| Code Implementation | LiveCodeBench | Solve competitive programming problems with code execution. | 39 (by type) | 97 | 39 |
| Knowledge Work | GDPVal | Perform real-world occupational tasks (Excel, PDF, Word). | 29 (by occupation) | 87 | 58 |
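
A clustered split has one property that is easy to verify mechanically: every test task should share a cluster with at least one training task. The check below is a minimal sketch; the function name `uncovered_test_tasks`, the task ids, and the cluster labels are hypothetical and do not reflect the benchmark's actual data format.

```python
from typing import Iterable, List, Tuple


def uncovered_test_tasks(
    train: Iterable[Tuple[str, str]],  # (task_id, cluster) pairs
    test: Iterable[Tuple[str, str]],
) -> List[str]:
    """Return ids of test tasks whose cluster has no training task."""
    train_clusters = {cluster for _, cluster in train}
    return [task_id for task_id, cluster in test if cluster not in train_clusters]


# Hypothetical split: cluster "repo-c" appears only in the test set.
train = [("t1", "repo-a"), ("t2", "repo-a"), ("t3", "repo-b")]
test = [("e1", "repo-a"), ("e2", "repo-b"), ("e3", "repo-c")]
print(uncovered_test_tasks(train, test))  # ['e3']
```

An empty result means the split satisfies the coverage property; any id returned marks a test task the agent had no related experience for.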

05 — SELF-EVOLUTION

Skill Extraction Methods

EvoAgentBench provides a standardized protocol for evaluating agent self-evolution methods: techniques that let agents learn from past experience and improve future performance. Four methods are currently integrated.
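
To make the shape of such a protocol concrete, here is a minimal sketch of an offline (batch) train → extract → evaluate loop. Everything named here (`TrainRecord`, `EvolutionMethod`, `pass_rate`, `run_offline`, the toy agent and method) is a hypothetical illustration, not EvoAgentBench's actual API. The one property it is meant to show is that the method only ever receives training records, while test questions are passed to the agent alone.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


@dataclass(frozen=True)
class TrainRecord:
    question: str
    trajectory: str  # what the agent tried
    success: bool    # whether it worked


class EvolutionMethod(Protocol):
    def extract(self, records: List[TrainRecord]) -> List[str]:
        """Turn training records into reusable skills (plain strings here)."""
        ...


# An agent is modeled as a function: (question, skills) -> did it pass?
Agent = Callable[[str, List[str]], bool]


def pass_rate(agent: Agent, questions: List[str], skills: List[str]) -> float:
    passed = sum(agent(q, skills) for q in questions)
    return 100.0 * passed / len(questions)


def run_offline(
    agent: Agent,
    method: EvolutionMethod,
    train_records: List[TrainRecord],
    test_questions: List[str],
) -> float:
    """Extract skills from training data, then return delta in percentage points."""
    skills = method.extract(train_records)        # sees training data only
    base = pass_rate(agent, test_questions, [])   # no-skill baseline
    evolved = pass_rate(agent, test_questions, skills)
    return evolved - base


class KeepSuccesses:
    """Toy method: keep the trajectories of successful training attempts."""

    def extract(self, records: List[TrainRecord]) -> List[str]:
        return [r.trajectory for r in records if r.success]


def toy_agent(question: str, skills: List[str]) -> bool:
    # Toy rule: pass only when some skill mentions the question's first word.
    return any(question.split()[0] in s for s in skills)


records = [
    TrainRecord("sort a list", "sort with sorted()", True),
    TrainRecord("parse a date", "guess the format", False),
]
print(run_offline(toy_agent, KeepSuccesses(), records, ["sort numbers", "parse logs"]))  # 50.0
```

An online (learn-as-you-go) variant would interleave extraction with evaluation instead of running extraction once up front; it is not shown here.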

EverOS

Memory-based extraction

A memory OS that makes agents more personal while saving tokens. Extracts and stores long-term memory from session trajectories, then injects reusable skills as domain-specific strategies.

Memento

Retrieval-based (CBR)

A memory-based continual-learning framework using Case-Based Reasoning. Logs successful and failed trajectories into a Case Bank, then retrieves cases by Q-value with SimCSE embeddings to steer planning.

ReasoningBank

Memory + reasoning

A memory mechanism that learns from both successful and failed trajectories, storing reasoning as memory content. Introduces memory-aware test-time scaling — experience-driven memory as an additional scaling dimension for agent systems.

GEPA

Prompt evolution

Reflective prompt evolution — evolves a single domain-level prompt by iteratively proposing variants from LLM reflection over failed cases and selecting via a Pareto frontier on the training set. No runtime memory or retrieval required.

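
The idea of selecting prompt variants via a Pareto frontier can be illustrated generically: a variant stays on the frontier if no other variant is at least as good on every training task and strictly better on at least one. The sketch below shows only that generic idea; the function names and scores are hypothetical, and GEPA's actual selection procedure may differ.

```python
from typing import Dict, List


def dominates(a: List[float], b: List[float]) -> bool:
    """True if a is at least as good as b on every task and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(scores: Dict[str, List[float]]) -> List[str]:
    """Return the candidates that no other candidate dominates."""
    return [
        name
        for name, vec in scores.items()
        if not any(
            dominates(other, vec)
            for other_name, other in scores.items()
            if other_name != name
        )
    ]


# Hypothetical per-task scores for three prompt variants on three training tasks.
scores = {
    "prompt_v1": [1.0, 0.0, 1.0],
    "prompt_v2": [1.0, 0.0, 0.0],  # dominated by prompt_v1
    "prompt_v3": [0.0, 1.0, 0.0],  # wins a task that prompt_v1 loses
}
print(pareto_frontier(scores))  # ['prompt_v1', 'prompt_v3']
```

Keeping `prompt_v3` despite its lower total is the point of a frontier: it preserves a variant that solves a task none of the stronger variants solve.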