01 — OVERVIEW

EvoAgentBench

A Unified Evaluation Framework for AI Agent Self-Evolution

EvoAgentBench enables standardized comparison of agent self-evolution methods — techniques that allow agents to improve their performance by learning from past experience. It provides pluggable abstractions for domains, agents, and skill evaluation methods, making it easy to evaluate how different self-evolution approaches generalize across information retrieval, reasoning, software engineering, code implementation, and knowledge work.


🌐

Multi-Domain Evaluation

Five diverse evaluation domains (information retrieval, reasoning, software engineering, code implementation, and knowledge work) with clustered train/test splits and a unified evaluation pipeline.

🤖

Multi-Agent Support

Plug in any CLI-based agent: Nanobot, OpenClaw, or your own. Each task runs in an isolated configuration with an independent workspace, with support for concurrent execution and automatic retry.
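A minimal sketch of what isolated, concurrent execution with retry could look like. The names `run_task` and `run_all`, the retry count, and the timeout are assumptions made for illustration and are not EvoAgentBench's actual interface; the example command is a stand-in for a real agent CLI.

```python
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor


def run_task(command: list[str], max_attempts: int = 3, timeout_s: float = 60.0) -> dict:
    """Run one CLI command in a fresh workspace, retrying on failure (hypothetical helper)."""
    for attempt in range(1, max_attempts + 1):
        # Every attempt gets its own temporary working directory,
        # so tasks cannot see each other's files.
        with tempfile.TemporaryDirectory(prefix="task_") as workspace:
            try:
                proc = subprocess.run(
                    command,
                    cwd=workspace,
                    capture_output=True,
                    text=True,
                    timeout=timeout_s,
                )
            except subprocess.TimeoutExpired:
                continue  # a timeout counts as a failed attempt
            if proc.returncode == 0:
                return {"ok": True, "attempts": attempt, "stdout": proc.stdout}
    return {"ok": False, "attempts": max_attempts, "stdout": ""}


def run_all(commands: list[list[str]], workers: int = 4) -> list[dict]:
    """Run many tasks concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_task, commands))


if __name__ == "__main__":
    # Stand-in commands: one succeeds, one always exits with an error.
    demo = [
        [sys.executable, "-c", "print('done')"],
        [sys.executable, "-c", "import sys; sys.exit(1)"],
    ]
    for result in run_all(demo):
        print(result["ok"], result["attempts"])
```

Threads are enough here because each task spends its time waiting on a child process rather than running Python code.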

🧬

Self-Evolution Comparison

Standardized train → extract → evaluate protocol for comparing skill-based self-evolution methods. Supports both offline (batch extraction) and online (learn-as-you-go) evaluation modes.
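The train → extract → evaluate protocol can be sketched roughly as follows. All types and function names here (`Task`, `Attempt`, `evaluate_offline`, `evaluate_online`) are hypothetical, and the online loop is only one plausible reading of "learn-as-you-go": skills are refreshed after every task and applied to the next one.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Task:
    task_id: str
    prompt: str


@dataclass(frozen=True)
class Attempt:
    task: Task
    trajectory: str
    passed: bool


# Hypothetical interfaces: an agent attempts a task given some skills,
# and an extractor turns past attempts into reusable skills.
Agent = Callable[[Task, list[str]], Attempt]
Extractor = Callable[[list[Attempt]], list[str]]


def pass_rate(attempts: list[Attempt]) -> float:
    """Percentage of attempts that passed (0.0 for an empty list)."""
    if not attempts:
        return 0.0
    return 100.0 * sum(a.passed for a in attempts) / len(attempts)


def evaluate_offline(agent: Agent, extract: Extractor,
                     train: list[Task], test: list[Task]) -> float:
    """Batch mode: run all training tasks, extract once, then evaluate on test."""
    train_attempts = [agent(task, []) for task in train]      # train
    skills = extract(train_attempts)                          # extract
    return pass_rate([agent(task, skills) for task in test])  # evaluate


def evaluate_online(agent: Agent, extract: Extractor, stream: list[Task]) -> float:
    """Learn-as-you-go: skills are re-extracted after every task in the stream."""
    history: list[Attempt] = []
    skills: list[str] = []
    for task in stream:
        history.append(agent(task, skills))
        skills = extract(history)
    return pass_rate(history)
```

In offline mode the test tasks never feed back into extraction, while in online mode each task is attempted once using only skills learned from earlier tasks.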

02 — LEADERBOARD

Method ranking per configuration

For each (agent, model, domain) cell, methods are sorted by Δ, the pass-rate gain from adding skills (with skills − without skills). The baseline pass rate is shown next to each method.
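As a small illustration of the ranking rule, the sketch below computes Δ for each method and sorts in descending order. The function name `rank_methods`, the method names, and the numbers are made up for the example and are not leaderboard results.

```python
def rank_methods(baseline: float, with_skills: dict[str, float]) -> list[tuple[str, float]]:
    """Return (method, delta) pairs sorted from largest gain to largest loss."""
    deltas = {method: score - baseline for method, score in with_skills.items()}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)


# Illustrative pass rates only (in percent).
ranking = rank_methods(
    baseline=40.0,
    with_skills={"method_a": 52.0, "method_b": 35.0, "method_c": 40.0},
)
print(ranking)  # [('method_a', 12.0), ('method_c', 0.0), ('method_b', -5.0)]
```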

Legend: positive Δ, negative Δ · 20 configurations

Information Retrieval

OpenClaw · 27B

Method	Base	Δ
GEPA	10.8	+14
Memento	10.8	
EverOS	10.8	
ReasoningBank	10.8	

Reasoning & Problem Decomposition

OpenClaw · 27B

Method	Base	Δ
EverOS	44.0	-5
Memento	44.0	-12
GEPA	44.0	-12
ReasoningBank	44.0	-23

Software Engineering

OpenClaw · 27B

Method	Base	Δ
ReasoningBank	38.5	+12
EverOS	38.5	
Memento	38.5	-8
GEPA	38.5	-23

Code Implementation

OpenClaw · 27B

Method	Base	Δ
GEPA	46.2	
EverOS	46.2	-3
Memento	46.2	-13
ReasoningBank	46.2	-18

Knowledge Work

OpenClaw · 27B

Method	Base	Δ
EverOS	37.3	+13
GEPA	37.3	
Memento	37.3	-2
ReasoningBank	37.3	-3

Information Retrieval

OpenClaw · 397B

Method	Base	Δ
EverOS	30.8	+20
ReasoningBank	30.8	+11
Memento	30.8	-2
GEPA	30.8	-5

Reasoning & Problem Decomposition

OpenClaw · 397B

Method	Base	Δ
EverOS	48.0	
ReasoningBank	48.0	-5
GEPA	48.0	-6
Memento	48.0	-10

Software Engineering

OpenClaw · 397B

Method	Base	Δ
ReasoningBank	25.0	+40
GEPA	25.0	+25
EverOS	25.0	+21
Memento	25.0	+17

Code Implementation

OpenClaw · 397B

Method	Base	Δ
GEPA	46.2	+18
Memento	46.2	
EverOS	46.2	-8
ReasoningBank	46.2	-8

Knowledge Work

OpenClaw · 397B

Method	Base	Δ
EverOS	45.1	
GEPA	45.1	
Memento	45.1	-2
ReasoningBank	45.1	-2

Information Retrieval

Nanobot · 27B

Method	Base	Δ
EverOS	6.2	
ReasoningBank	6.2	
GEPA	6.2	-2
Memento	6.2	-3

Reasoning & Problem Decomposition

Nanobot · 27B

Method	Base	Δ
GEPA	47.0	-3
EverOS	47.0	-4
Memento	47.0	-5
ReasoningBank	47.0	-5

Software Engineering

Nanobot · 27B

Method	Base	Δ
EverOS	38.5	+19
Memento	38.5	+12
GEPA	38.5	+12
ReasoningBank	38.5	—

Code Implementation

Nanobot · 27B

Method	Base	Δ
ReasoningBank	25.6	+10
GEPA	25.6	+10
Memento	25.6	
EverOS	25.6	-8

Knowledge Work

Nanobot · 27B

Method	Base	Δ
EverOS	43.1	+17
ReasoningBank	43.1	—
Memento	43.1	-4
GEPA	43.1	-8

Information Retrieval

Nanobot · 397B

Method	Base	Δ
Memento	10.8	+15
EverOS	10.8	
ReasoningBank	10.8	
GEPA	10.8	

Reasoning & Problem Decomposition

Nanobot · 397B

Method	Base	Δ
Memento	53.0	
GEPA	53.0	
EverOS	53.0	-2
ReasoningBank	53.0	-4

Software Engineering

Nanobot · 397B

Method	Base	Δ
ReasoningBank	46.2	+12
Memento	46.2	
EverOS	46.2	-4
GEPA	46.2	-4

Code Implementation

Nanobot · 397B

Method	Base	Δ
GEPA	51.3	+10
ReasoningBank	51.3	—
Memento	51.3	-8
EverOS	51.3	-15

Knowledge Work

Nanobot · 397B

Method	Base	Δ
ReasoningBank	54.9	+11
EverOS	54.9	
Memento	54.9	
GEPA	54.9	

Bar length encodes Δ magnitude. See the full leaderboard for per-cell numbers including cost.

03 — WHY THIS MATTERS

Why Self-Evolution Matters, and What We Learned Running This

Why self-evolution matters

If an agent solves a problem today, it shouldn’t have to start from zero on a similar one tomorrow. Useful experience isn’t just a log of what happened — it’s a way of working: a search habit, a debugging move, a verification step, a recipe for producing something useful. Self-evolution is the question of whether an agent can pick up these habits on its own, from its own past attempts, without retraining the model underneath.

Most benchmarks today don’t really test this. They either ask “can the agent solve a fresh task?” or “can the agent remember what it saw?” Neither tells you whether yesterday’s way of working actually shows up when the agent tries something new today. EvoAgentBench is built around that specific question.

Three things to watch out for if you’re building one

We ran a lot of combinations. If you’re working on a self-evolution method, three patterns showed up over and over. They’re worth thinking about before you start optimizing.

The bottleneck is usually what you remember, not how you search. When skills help, it’s mostly because what got captured had real structure to it, not because the search step got smarter. Tuning the retriever, swapping rerankers, hybrid search — none of them make a difference if the stored skills are vague to begin with. Better to spend that energy on what to write down in the first place.

When different problems look alike but need different methods, search will mislead you. Math is the cleanest example: combinatorics, generating-function problems, and group-theory problems all use words like “triangle”, “vertex”, “configuration”. A search-based system matches on the words and pulls up a skill from a different family; the agent follows it, and a problem it would have solved at baseline now comes out wrong. The fix isn’t a better retriever. It’s giving each skill a way to say “I’m for this kind of problem”.

Don’t inject by default. A wrong skill can take a problem the agent would have solved and break it. A missing skill just leaves things where they were. So on tricky domains, the safer default isn’t “always use a skill”, it’s “use one only when you’re sure it fits”. Most methods today inject too eagerly.
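One way to picture the last two points together is a gate that injects a skill only when the skill's declared problem family matches and the retrieval score is high. This is a sketch under assumptions: the `Skill` fields, the `select_skill` helper, the family labels, and the 0.8 threshold are all hypothetical and are not taken from any of the benchmarked methods.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Skill:
    name: str
    family: str  # the kind of problem this skill says it is for
    text: str


def select_skill(problem_family: Optional[str],
                 candidates: list[tuple[Skill, float]],
                 min_score: float = 0.8) -> Optional[Skill]:
    """Return the best matching skill, or None to run without a skill.

    `candidates` pairs each retrieved skill with a similarity score in [0, 1].
    Returning None is the safe default: a missing skill leaves the baseline intact.
    """
    if problem_family is None:
        return None  # we do not know what kind of problem this is
    eligible = [
        (skill, score)
        for skill, score in candidates
        if skill.family == problem_family and score >= min_score
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda pair: pair[1])[0]


# A lexically similar skill from the wrong family is rejected despite its high score.
counting = Skill("count-by-symmetry", "combinatorics", "Fix one vertex, then count orbits.")
groups = Skill("orbit-stabilizer", "group-theory", "Apply the orbit-stabilizer theorem.")
retrieved = [(groups, 0.93), (counting, 0.85)]

print(select_skill("combinatorics", retrieved).name)  # count-by-symmetry
print(select_skill("generating-functions", retrieved))  # None
```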

What we did to make the comparisons fair

For numbers to be worth comparing, the setup has to be tight. Three things we cared about most:

Nobody gets to peek at the answers. Methods can only look at training tasks: the question, what the agent tried, and whether it worked. The actual test answers, and the test trajectories that succeeded, are off-limits during evolution. We didn’t make this a rule for people to follow; we wired it in so the two paths simply don’t meet.

Training and test tasks are related, not random. For each domain, we grouped problems by what they have in common: jobs (for knowledge work), topics (for web search), subdisciplines (for math), repository patterns (for software issues), problem neighborhoods (for code challenges). Then we made sure every test problem has training problems in the same group. That way, if a method doesn’t help, you can’t blame it on the test being unrelated to anything the agent ever saw.
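The coverage property described here (every test problem shares a group with at least one training problem) can be sketched as below. The helper names, the test fraction, and the toy cluster labels are assumptions for illustration; the benchmark's own split procedure may differ.

```python
import random
from collections import defaultdict


def clustered_split(cluster_of: dict[str, str], test_fraction: float = 0.3,
                    seed: int = 0) -> tuple[list[str], list[str]]:
    """Split task ids so that each cluster keeps at least one training task."""
    by_cluster: dict[str, list[str]] = defaultdict(list)
    for task_id, cluster in cluster_of.items():
        by_cluster[cluster].append(task_id)

    rng = random.Random(seed)
    train: list[str] = []
    test: list[str] = []
    for cluster in sorted(by_cluster):
        members = sorted(by_cluster[cluster])
        rng.shuffle(members)
        # Never move every member to test; a singleton cluster stays in train.
        n_test = min(len(members) - 1, round(len(members) * test_fraction))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def uncovered_test_tasks(cluster_of: dict[str, str],
                         train: list[str], test: list[str]) -> list[str]:
    """Test tasks whose cluster has no training task (should be empty)."""
    train_clusters = {cluster_of[t] for t in train}
    return [t for t in test if cluster_of[t] not in train_clusters]


# Toy cluster assignment: task id -> cluster label.
clusters = {
    "q1": "history", "q2": "history", "q3": "history", "q4": "history",
    "q5": "sports", "q6": "sports", "q7": "sports",
    "q8": "music",
}
train_ids, test_ids = clustered_split(clusters)
print(uncovered_test_tasks(clusters, train_ids, test_ids))  # []
```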

Every cell on the leaderboard is real. Two agent frameworks, two model sizes, five domains, four self-evolution methods: that’s 80 combinations, each measured against the same no-skill baseline. Every cell ran with its method’s default settings and the same task, tool, and scoring setup, and we tracked how many turns it took alongside accuracy. No placeholders, no “not evaluated under this setting” gaps.
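The size of the grid follows directly from the four axes. The snippet below simply enumerates them, using the agent, model, domain, and method names that appear on this page; the variable names are illustrative.

```python
from itertools import product

AGENTS = ["OpenClaw", "Nanobot"]
MODELS = ["27B", "397B"]
DOMAINS = [
    "Information Retrieval",
    "Reasoning & Problem Decomposition",
    "Software Engineering",
    "Code Implementation",
    "Knowledge Work",
]
METHODS = ["EverOS", "Memento", "ReasoningBank", "GEPA"]

# One leaderboard configuration per (agent, model, domain); one run per method in it.
cells = list(product(AGENTS, MODELS, DOMAINS))
runs = list(product(AGENTS, MODELS, DOMAINS, METHODS))
print(len(cells), len(runs))  # 20 80
```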

04 — DOMAINS

Evaluation Domains

EvoAgentBench builds on existing benchmarks by clustering tasks into domains with train/test splits for self-evolution training and evaluation.

Domain	Base Benchmark	Description	Clusters	Train	Test
Information Retrieval	BrowseCompPlus	Search a local corpus to answer complex multi-constraint questions.	10 (by topic)	154	65
Reasoning & Problem Decomposition	OmniMath	Solve competition-level math problems across multiple subdisciplines.	By subdiscipline	478	100
Software Engineering	SWE-Bench	Fix real-world bugs in open-source Python repositories.	19 (by repo)	101	26
Code Implementation	LiveCodeBench	Solve competitive programming problems with code execution.	39 (by type)	97	39
Knowledge Work	GDPVal	Perform real-world occupational tasks (Excel, PDF, Word).	29 (by occupation)	87	58

05 — SELF-EVOLUTION

Skill Extraction Methods

EvoAgentBench provides a standardized protocol for evaluating agent self-evolution methods: techniques that let agents learn from past experience and improve future performance. Four methods are currently integrated.

EverOS

Memory-based extraction

A memory OS that makes agents more personal while saving tokens. Extracts and stores long-term memory from session trajectories, then injects reusable skills as domain-specific strategies.


Memento

Retrieval-based (CBR)

A memory-based continual-learning framework built on Case-Based Reasoning (CBR). It logs successful and failed trajectories into a Case Bank and retrieves cases by Q-value, using SimCSE embeddings, to steer planning.


ReasoningBank

Memory + reasoning

A memory mechanism that learns from both successful and failed trajectories, storing reasoning as memory content. Introduces memory-aware test-time scaling — experience-driven memory as an additional scaling dimension for agent systems.


GEPA

Prompt evolution

Reflective prompt evolution: evolves a single domain-level prompt by iteratively proposing variants from LLM reflection over failed cases and selecting among them via a Pareto frontier on the training set. No runtime memory or retrieval is required.
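To make "selecting via a Pareto frontier" concrete, here is a generic non-dominated filter over per-task training scores. It is a sketch of the general idea only and may differ from how GEPA itself builds and samples its frontier; the prompt names and scores are invented for the example.

```python
def dominates(a: list[float], b: list[float]) -> bool:
    """True if `a` is at least as good as `b` on every task and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(scores: dict[str, list[float]]) -> list[str]:
    """Keep every candidate prompt that no other candidate dominates."""
    return [
        name
        for name, vec in scores.items()
        if not any(
            dominates(other, vec)
            for other_name, other in scores.items()
            if other_name != name
        )
    ]


# Rows are candidate prompts, columns are training tasks (1.0 = solved).
candidates = {
    "p0": [1.0, 0.0, 0.0],
    "p1": [0.0, 1.0, 1.0],
    "p2": [0.0, 1.0, 0.0],  # dominated by p1
    "p3": [1.0, 0.0, 0.0],  # ties with p0, so neither dominates the other
}
print(pareto_frontier(candidates))  # ['p0', 'p1', 'p3']
```

Keeping the whole frontier, rather than only the candidate with the best average, preserves prompts that are the only ones to solve some particular task.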
