01 — OVERVIEW
EvoAgentBench
A Unified Evaluation Framework for AI Agent Self-Evolution
EvoAgentBench enables standardized comparison of agent self-evolution methods — techniques that allow agents to improve their performance by learning from past experience. It provides pluggable abstractions for domains, agents, and skill evaluation methods, making it easy to evaluate how different self-evolution approaches generalize across information retrieval, reasoning, software engineering, code implementation, and knowledge work.
Multi-Domain Evaluation
Five diverse evaluation domains (information retrieval, reasoning, software engineering, code implementation, and knowledge work) with clustered train/test splits and a unified evaluation pipeline.
Multi-Agent Support
Plug in any CLI-based agent: Nanobot, OpenClaw, or your own. Each task runs in an isolated config with an independent workspace, with support for concurrent execution and automatic retry.
Self-Evolution Comparison
Standardized train → extract → evaluate protocol for comparing skill-based self-evolution methods. Supports both offline (batch extraction) and online (learn-as-you-go) evaluation modes.
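As a rough illustration of the train → extract → evaluate protocol, here is a minimal Python sketch. Everything in it is an assumption made for illustration: `Task`, `Trajectory`, `run`, and `extract` are hypothetical names, not EvoAgentBench's actual API. The online mode is shown in one possible reading, where skills are refreshed after each training task and then frozen for the test set.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    task_id: str
    question: str


@dataclass
class Trajectory:
    task_id: str
    steps: List[str]
    passed: bool


# Hypothetical interfaces: an agent runs one task with a list of skills,
# and an extractor turns past trajectories into a list of skill strings.
RunFn = Callable[[Task, List[str]], Trajectory]
ExtractFn = Callable[[List[Trajectory]], List[str]]


def pass_rate(trajectories: List[Trajectory]) -> float:
    """Fraction of trajectories that passed (0.0 for an empty list)."""
    if not trajectories:
        return 0.0
    return sum(t.passed for t in trajectories) / len(trajectories)


def evaluate_offline(run: RunFn, extract: ExtractFn,
                     train: List[Task], test: List[Task]) -> float:
    """Batch mode: run every training task, extract once, then evaluate."""
    train_trajectories = [run(task, []) for task in train]
    skills = extract(train_trajectories)
    return pass_rate([run(task, skills) for task in test])


def evaluate_online(run: RunFn, extract: ExtractFn,
                    train: List[Task], test: List[Task]) -> float:
    """Learn-as-you-go (assumed reading): skills are re-extracted after
    each training task, then frozen before the test tasks are run."""
    history: List[Trajectory] = []
    skills: List[str] = []
    for task in train:
        history.append(run(task, skills))
        skills = extract(history)
    return pass_rate([run(task, skills) for task in test])
```

In both modes the extractor only ever sees training trajectories, so test outcomes cannot leak into the skills.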
02 — LEADERBOARD
Method ranking per configuration
For each (agent, model, domain) cell, methods are sorted by Δ gain (with-skills − without). Baseline pass-rate is shown next to each method.
[Leaderboard chart: four panels, each with one row per domain (Information Retrieval, Reasoning & Problem Decomposition, Software Engineering, Code Implementation, Knowledge Work).]
Bar length encodes Δ magnitude. See the full leaderboard for per-cell numbers including cost.
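The ranking rule described above (Δ = with-skills pass rate − baseline pass rate, sorted within one cell) can be sketched as follows. The `CellResult` type and the method names in the usage note are hypothetical, and any numbers you feed it are your own; none are taken from the leaderboard.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CellResult:
    """One method's result inside a single (agent, model, domain) cell."""
    method: str
    with_skills: float  # pass rate when extracted skills are injected
    baseline: float     # pass rate of the same configuration without skills

    @property
    def delta(self) -> float:
        # Positive means the skills helped, negative means they hurt.
        return self.with_skills - self.baseline


def rank_cell(results: List[CellResult]) -> List[CellResult]:
    """Sort the methods of one cell by delta gain, largest first."""
    return sorted(results, key=lambda r: r.delta, reverse=True)
```

For example, `rank_cell([CellResult("method_a", 0.50, 0.40), CellResult("method_b", 0.65, 0.40)])` places `method_b` first. A method with a negative delta sorts last, which makes regressions against the baseline easy to spot.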
03 — WHY THIS MATTERS
Why Self-Evolution Matters, and What We Learned Running This
Why self-evolution matters
If an agent solves a problem today, it shouldn’t have to start from zero on a similar one tomorrow. Useful experience isn’t just a log of what happened — it’s a way of working: a search habit, a debugging move, a verification step, a recipe for producing something useful. Self-evolution is the question of whether an agent can pick up these habits on its own, from its own past attempts, without retraining the model underneath.
Most benchmarks today don’t really test this. They either ask “can the agent solve a fresh task?” or “can the agent remember what it saw?” Neither tells you whether yesterday’s way of working actually shows up when the agent tries something new today. EvoAgentBench is built around that specific question.
Three things to watch out for if you’re building one
We ran a lot of combinations. If you’re working on a self-evolution method, three patterns showed up over and over. They’re worth thinking about before you start optimizing.
The bottleneck is usually what you remember, not how you search. When skills help, it’s mostly because what got captured had real structure to it, not because the search step got smarter. Tuning the retriever, swapping rerankers, hybrid search — none of them make a difference if the stored skills are vague to begin with. Better to spend that energy on what to write down in the first place.
When different problems look alike but need different methods, search will mislead you. Math is the cleanest example: combinatorics, generating-function problems, and group-theory problems all use words like “triangle”, “vertex”, “configuration”. A search-based system matches on the words and pulls up a skill from a different family, the agent follows it, and a problem the agent would have gotten right at baseline now comes out wrong. The fix isn’t a better retriever. It’s giving each skill a way to say “I’m for this kind of problem”.
Don’t inject by default. A wrong skill can take a problem the agent would have solved and break it. A missing skill just leaves things where they were. So on tricky domains, the safer default isn’t “always use a skill”, it’s “use one only when you’re sure it fits”. Most methods today inject too eagerly.
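The last two patterns can be combined into a small injection gate, sketched below under stated assumptions. The `Skill` type, its `applies_to` field, the `select_skill` function, and the `min_score` threshold are all hypothetical. The sketch also assumes the problem family is already known, for example from a separate classifier.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class Skill:
    name: str
    applies_to: FrozenSet[str]  # problem families the skill declares itself for
    text: str


def select_skill(problem_family: str,
                 candidates: List[Tuple[Skill, float]],
                 min_score: float = 0.8) -> Optional[Skill]:
    """Return a skill only if it declares the family AND scores high enough.

    `candidates` are (skill, retrieval_score) pairs in any order.
    Returning None means: inject nothing and keep baseline behaviour.
    """
    eligible = [
        (skill, score)
        for skill, score in candidates
        if problem_family in skill.applies_to and score >= min_score
    ]
    if not eligible:
        return None
    # Among eligible skills, take the one with the highest retrieval score.
    return max(eligible, key=lambda pair: pair[1])[0]
```

The gate is deliberately asymmetric. A high lexical match from the wrong family is rejected outright, and a low-confidence match from the right family is also skipped, because a missing skill costs nothing while a wrong one can break a problem the agent would otherwise solve.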
What we did to make the comparisons fair
For numbers to be worth comparing, the setup has to be tight. Three things we cared about most:
Nobody gets to peek at the answers. Methods can only look at training tasks: the question, what the agent tried, and whether it worked. The actual test answers, and the test trajectories that succeeded, are off-limits during evolution. We didn’t leave this as a rule for people to follow; we wired it into the framework so that the evolution step and the test data never meet.
Training and test tasks are related, not random. For each domain, we grouped problems by what they have in common: jobs (for knowledge work), topics (for web search), repository patterns (for software issues), problem neighborhoods (for code challenges). Then we made sure every test problem has training problems in the same group. That way, if a method doesn’t help, you can’t blame it on the test being unrelated to anything the agent ever saw.
Every cell on the leaderboard is real. Two agent frameworks, two model sizes, five domains, four self-evolution methods: that’s 80 combinations, each measured against the same no-skill baseline. Every cell ran with its method’s default settings and the same task, tool, and scoring setup, and we tracked how many turns it took alongside accuracy. No placeholders, no “not evaluated under this setting” gaps.
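The "related, not random" property of the splits can be illustrated with a cluster-aware split and a coverage check. This is a sketch only: the page does not describe the procedure EvoAgentBench actually uses, and `clustered_split`, `covers`, and `test_fraction` are hypothetical names.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def clustered_split(task_clusters: Dict[str, str],
                    test_fraction: float = 0.3,
                    seed: int = 0) -> Tuple[List[str], List[str]]:
    """Split task ids so every cluster keeps at least one training task.

    `task_clusters` maps task id -> cluster label (topic, repo, occupation...).
    """
    rng = random.Random(seed)
    by_cluster: Dict[str, List[str]] = defaultdict(list)
    for task_id, cluster in task_clusters.items():
        by_cluster[cluster].append(task_id)

    train: List[str] = []
    test: List[str] = []
    for cluster in sorted(by_cluster):
        members = sorted(by_cluster[cluster])
        rng.shuffle(members)
        # Cap the test share so the cluster never loses all its training tasks.
        n_test = min(int(len(members) * test_fraction), len(members) - 1)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def covers(task_clusters: Dict[str, str],
           train: List[str], test: List[str]) -> bool:
    """True if every test task has a training task in the same cluster."""
    train_clusters = {task_clusters[t] for t in train}
    return all(task_clusters[t] in train_clusters for t in test)
```

A singleton cluster ends up entirely in the training set, so `covers` holds by construction. Running the check on a finished split is still a cheap guard when clusters are edited by hand.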
04 — DOMAINS
Evaluation Domains
EvoAgentBench builds on existing benchmarks by clustering tasks into domains with train/test splits for self-evolution training and evaluation.
| Domain | Base Benchmark | Description | Clusters | Train | Test |
|---|---|---|---|---|---|
| Information Retrieval | BrowseCompPlus | Search a local corpus to answer complex multi-constraint questions. | 10 (by topic) | 154 | 65 |
| Reasoning & Problem Decomposition | OmniMath | Solve competition-level math problems across multiple subdisciplines. | By subdiscipline | 478 | 100 |
| Software Engineering | SWE-Bench | Fix real-world bugs in open-source Python repositories. | 19 (by repo) | 101 | 26 |
| Code Implementation | LiveCodeBench | Solve competitive programming problems with code execution. | 39 (by type) | 97 | 39 |
| Knowledge Work | GDPVal | Perform real-world occupational tasks (Excel, PDF, Word). | 29 (by occupation) | 87 | 58 |
05 — SELF-EVOLUTION
Skill Extraction Methods
EvoAgentBench provides a standardized protocol for evaluating agent self-evolution methods: techniques that let agents learn from past experience and improve future performance. Four methods are currently integrated.
EverOS
Memory-based extraction
A memory OS that makes agents more personal while saving tokens. Extracts and stores long-term memory from session trajectories, then injects reusable skills as domain-specific strategies.
Memento
Retrieval-based (CBR)
A memory-based continual-learning framework using Case-Based Reasoning. Logs successful and failed trajectories into a Case Bank, retrieves by Q-value with SimCSE embeddings to steer planning.
ReasoningBank
Memory + reasoning
A memory mechanism that learns from both successful and failed trajectories, storing reasoning as memory content. Introduces memory-aware test-time scaling, which treats experience-driven memory as an additional scaling dimension for agent systems.
GEPA
Prompt evolution
Reflective prompt evolution: evolves a single domain-level prompt by iteratively proposing variants from LLM reflection over failed cases and selecting via a Pareto frontier on the training set. No runtime memory or retrieval required.
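To make "selecting via a Pareto frontier on the training set" concrete, here is a generic sketch of Pareto-frontier selection over per-task scores. It illustrates the general idea only and is not GEPA's implementation. The function names and the score layout (one score per training task for each candidate prompt) are assumptions.

```python
from typing import Dict, List


def dominates(a: List[float], b: List[float]) -> bool:
    """True if `a` is at least as good as `b` on every task and better on one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(scores: Dict[str, List[float]]) -> List[str]:
    """Names of candidate prompts that no other candidate dominates.

    `scores` maps a candidate name to its per-task scores on the training
    set, with tasks in the same order for every candidate.
    """
    front = []
    for name, vec in scores.items():
        dominated = any(
            dominates(other_vec, vec)
            for other_name, other_vec in scores.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front
```

Keeping the whole frontier rather than the single best average preserves candidates that win on different tasks, so later variants can build on more than one of them.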