EvoAgentBench

Agent Performance

Pass rates on EvoAgentBench · 5 domains · multiple self-evolution methods

Partial results. More agents, domains, and methods coming soon.

Best With Skills: 65.5%

Avg. Improvement: +2.3%

Configurations: 80

| # | Agent | Base Model | Domain | Self-Evolving Methods | Without | With Skills | Δ | Cost |
|---|---|---|---|---|---|---|---|---|
| 1 | Nanobot | Qwen3.5-397B | Knowledge Work | ReasoningBank | 54.9% | 65.5% | +10.6 | ↑ 17.6% turns |
| 2 | OpenClaw | Qwen3.5-397B | Software Engineering | ReasoningBank | 25.0% | 65.4% | +40.4 | ↑ 89.0% turns |
| 3 | OpenClaw | Qwen3.5-397B | Code Implementation | GEPA | 46.2% | 64.1% | +17.9 | ↑ 90.0% turns |
| 4 | Nanobot | Qwen3.5-397B | Knowledge Work | EverOS | 54.9% | 63.8% | +8.9 | ↑ 1.2% turns |
| 5 | Nanobot | Qwen3.5-397B | Knowledge Work | Memento | 54.9% | 62.7% | +7.8 | ↑ 14.5% turns |
| 6 | Nanobot | Qwen3.5-397B | Code Implementation | GEPA | 51.3% | 61.5% | +10.2 | ↑ 65.2% turns |
| 7 | Nanobot | Qwen3.5-397B | Knowledge Work | GEPA | 54.9% | 60.8% | +5.9 | ↑ 39.6% turns |
| 8 | Nanobot | Qwen3.5-27B | Knowledge Work | EverOS | 43.1% | 60.3% | +17.2 | ↑ 10.9% turns |
| 9 | Nanobot | Qwen3.5-27B | Software Engineering | EverOS | 38.5% | 57.7% | +19.2 | ↓ 4.2% turns |
| 10 | Nanobot | Qwen3.5-397B | Software Engineering | ReasoningBank | 46.2% | 57.7% | +11.5 | ↑ 12.0% turns |
| 11 | Nanobot | Qwen3.5-397B | Reasoning & Problem Decomposition | Memento | 53.0% | 55.0% | +2.0 | ↑ 1484.9% chars |
| 12 | Nanobot | Qwen3.5-397B | Reasoning & Problem Decomposition | GEPA | 53.0% | 55.0% | +2.0 | ↑ 1.9% chars |
| 13 | OpenClaw | Qwen3.5-397B | Knowledge Work | EverOS | 45.1% | 53.4% | +8.3 | ↑ 4.4% turns |
| 14 | OpenClaw | Qwen3.5-27B | Code Implementation | GEPA | 46.2% | 51.3% | +5.1 | ↑ 20.4% turns |
| 15 | OpenClaw | Qwen3.5-397B | Code Implementation | Memento | 46.2% | 51.3% | +5.1 | ↑ 8.0% turns |
| 16 | Nanobot | Qwen3.5-397B | Code Implementation | ReasoningBank | 51.3% | 51.3% | +0.0 | ↑ 4.3% turns |
| 17 | Nanobot | Qwen3.5-397B | Reasoning & Problem Decomposition | EverOS | 53.0% | 51.0% | -2.0 | ↓ 2.8% chars |
| 18 | OpenClaw | Qwen3.5-397B | Information Retrieval | EverOS | 30.8% | 50.8% | +20.0 | ↓ 1.5% turns |
| 19 | OpenClaw | Qwen3.5-27B | Knowledge Work | EverOS | 37.3% | 50.0% | +12.7 | ↑ 5.5% turns |
| 20 | OpenClaw | Qwen3.5-27B | Software Engineering | ReasoningBank | 38.5% | 50.0% | +11.5 | ↑ 5.1% turns |
| 21 | OpenClaw | Qwen3.5-397B | Reasoning & Problem Decomposition | EverOS | 48.0% | 50.0% | +2.0 | ↓ 18.0% chars |
| 22 | OpenClaw | Qwen3.5-397B | Software Engineering | GEPA | 25.0% | 50.0% | +25.0 | ↑ 59.8% turns |
| 23 | Nanobot | Qwen3.5-27B | Software Engineering | Memento | 38.5% | 50.0% | +11.5 | ↑ 24.9% turns |
| 24 | Nanobot | Qwen3.5-27B | Software Engineering | GEPA | 38.5% | 50.0% | +11.5 | ↑ 29.2% turns |
| 25 | Nanobot | Qwen3.5-397B | Software Engineering | Memento | 46.2% | 50.0% | +3.8 | ↑ 16.2% turns |
| 26 | OpenClaw | Qwen3.5-397B | Knowledge Work | GEPA | 45.1% | 49.0% | +3.9 | ↑ 5.9% turns |
| 27 | Nanobot | Qwen3.5-397B | Reasoning & Problem Decomposition | ReasoningBank | 53.0% | 49.0% | -4.0 | ↑ 30.0% chars |
| 28 | OpenClaw | Qwen3.5-27B | Software Engineering | EverOS | 38.5% | 46.2% | +7.7 | ↑ 0.5% turns |
| 29 | OpenClaw | Qwen3.5-397B | Software Engineering | EverOS | 25.0% | 46.2% | +21.2 | ↑ 32.0% turns |
| 30 | Nanobot | Qwen3.5-27B | Reasoning & Problem Decomposition | GEPA | 47.0% | 44.0% | -3.0 | ↓ 0.2% chars |
| 31 | OpenClaw | Qwen3.5-27B | Code Implementation | EverOS | 46.2% | 43.6% | -2.6 | ↑ 32.7% turns |
| 32 | Nanobot | Qwen3.5-397B | Code Implementation | Memento | 51.3% | 43.6% | -7.7 | ↓ 17.4% turns |
| 33 | OpenClaw | Qwen3.5-397B | Knowledge Work | Memento | 45.1% | 43.1% | -2.0 | ↓ 8.1% turns |
| 34 | OpenClaw | Qwen3.5-397B | Knowledge Work | ReasoningBank | 45.1% | 43.1% | -2.0 | ↓ 5.9% turns |
| 35 | Nanobot | Qwen3.5-27B | Knowledge Work | ReasoningBank | 43.1% | 43.1% | +0.0 | ↑ 64.4% turns |
| 36 | OpenClaw | Qwen3.5-397B | Reasoning & Problem Decomposition | ReasoningBank | 48.0% | 43.0% | -5.0 | ↓ 13.8% chars |
| 37 | Nanobot | Qwen3.5-27B | Reasoning & Problem Decomposition | EverOS | 47.0% | 43.0% | -4.0 | ↑ 2.8% chars |
| 38 | OpenClaw | Qwen3.5-397B | Software Engineering | Memento | 25.0% | 42.3% | +17.3 | ↑ 28.9% turns |
| 39 | Nanobot | Qwen3.5-397B | Software Engineering | EverOS | 46.2% | 42.3% | -3.9 | ↓ 15.9% turns |
| 40 | Nanobot | Qwen3.5-397B | Software Engineering | GEPA | 46.2% | 42.3% | -3.9 | ↑ 8.3% turns |
| 41 | OpenClaw | Qwen3.5-397B | Reasoning & Problem Decomposition | GEPA | 48.0% | 42.0% | -6.0 | ↓ 18.3% chars |
| 42 | Nanobot | Qwen3.5-27B | Reasoning & Problem Decomposition | Memento | 47.0% | 42.0% | -5.0 | ↑ 1843.1% chars |
| 43 | Nanobot | Qwen3.5-27B | Reasoning & Problem Decomposition | ReasoningBank | 47.0% | 42.0% | -5.0 | ↑ 42.4% chars |
| 44 | OpenClaw | Qwen3.5-397B | Information Retrieval | ReasoningBank | 30.8% | 41.5% | +10.7 | ↓ 10.3% turns |
| 45 | OpenClaw | Qwen3.5-27B | Knowledge Work | GEPA | 37.3% | 41.2% | +3.9 | ↑ 41.9% turns |
| 46 | Nanobot | Qwen3.5-27B | Knowledge Work | Memento | 43.1% | 39.2% | -3.9 | ↑ 66.5% turns |
| 47 | OpenClaw | Qwen3.5-27B | Reasoning & Problem Decomposition | EverOS | 44.0% | 39.0% | -5.0 | ↓ 31.8% chars |
| 48 | OpenClaw | Qwen3.5-397B | Code Implementation | EverOS | 46.2% | 38.5% | -7.7 | ↑ 42.0% turns |
| 49 | OpenClaw | Qwen3.5-397B | Code Implementation | ReasoningBank | 46.2% | 38.5% | -7.7 | ↑ 2.0% turns |
| 50 | Nanobot | Qwen3.5-27B | Software Engineering | ReasoningBank | 38.5% | 38.5% | +0.0 | ↓ 6.5% turns |
| 51 | OpenClaw | Qwen3.5-397B | Reasoning & Problem Decomposition | Memento | 48.0% | 38.0% | -10.0 | ↑ 68.5% chars |
| 52 | Nanobot | Qwen3.5-27B | Code Implementation | ReasoningBank | 25.6% | 35.9% | +10.3 | ↑ 95.5% turns |
| 53 | Nanobot | Qwen3.5-27B | Code Implementation | GEPA | 25.6% | 35.9% | +10.3 | ↑ 68.2% turns |
| 54 | Nanobot | Qwen3.5-397B | Code Implementation | EverOS | 51.3% | 35.9% | -15.4 | ↑ 17.4% turns |
| 55 | OpenClaw | Qwen3.5-27B | Knowledge Work | Memento | 37.3% | 35.3% | -2.0 | ↑ 100.0% turns |
| 56 | Nanobot | Qwen3.5-27B | Knowledge Work | GEPA | 43.1% | 35.3% | -7.8 | ↑ 72.3% turns |
| 57 | OpenClaw | Qwen3.5-27B | Knowledge Work | ReasoningBank | 37.3% | 34.5% | -2.8 | ↑ 29.1% turns |
| 58 | OpenClaw | Qwen3.5-27B | Code Implementation | Memento | 46.2% | 33.3% | -12.9 | ↓ 6.1% turns |
| 59 | OpenClaw | Qwen3.5-27B | Reasoning & Problem Decomposition | Memento | 44.0% | 32.0% | -12.0 | ↑ 21.1% chars |
| 60 | OpenClaw | Qwen3.5-27B | Reasoning & Problem Decomposition | GEPA | 44.0% | 32.0% | -12.0 | ↓ 37.9% chars |
| 61 | OpenClaw | Qwen3.5-27B | Software Engineering | Memento | 38.5% | 30.8% | -7.7 | ↓ 41.7% turns |
| 62 | Nanobot | Qwen3.5-27B | Code Implementation | Memento | 25.6% | 30.8% | +5.2 | ↑ 27.3% turns |
| 63 | OpenClaw | Qwen3.5-397B | Information Retrieval | Memento | 30.8% | 29.2% | -1.6 | ↓ 8.6% turns |
| 64 | OpenClaw | Qwen3.5-27B | Code Implementation | ReasoningBank | 46.2% | 28.2% | -18.0 | ↓ 8.2% turns |
| 65 | OpenClaw | Qwen3.5-397B | Information Retrieval | GEPA | 30.8% | 26.2% | -4.6 | ↑ 14.1% turns |
| 66 | Nanobot | Qwen3.5-397B | Information Retrieval | Memento | 10.8% | 26.2% | +15.4 | ↓ 33.6% turns |
| 67 | OpenClaw | Qwen3.5-27B | Information Retrieval | GEPA | 10.8% | 24.6% | +13.8 | ↓ 0.6% turns |
| 68 | OpenClaw | Qwen3.5-27B | Reasoning & Problem Decomposition | ReasoningBank | 44.0% | 21.0% | -23.0 | ↓ 28.7% chars |
| 69 | Nanobot | Qwen3.5-397B | Information Retrieval | EverOS | 10.8% | 20.0% | +9.2 | ↓ 44.0% turns |
| 70 | Nanobot | Qwen3.5-27B | Code Implementation | EverOS | 25.6% | 17.9% | -7.7 | ↑ 131.8% turns |
| 71 | OpenClaw | Qwen3.5-27B | Information Retrieval | Memento | 10.8% | 16.9% | +6.1 | ↓ 19.2% turns |
| 72 | OpenClaw | Qwen3.5-27B | Information Retrieval | EverOS | 10.8% | 15.4% | +4.6 | ↑ 76.5% turns |
| 73 | OpenClaw | Qwen3.5-27B | Software Engineering | GEPA | 38.5% | 15.4% | -23.1 | ↓ 39.9% turns |
| 74 | Nanobot | Qwen3.5-27B | Information Retrieval | EverOS | 6.2% | 13.8% | +7.6 | ↓ 34.2% turns |
| 75 | Nanobot | Qwen3.5-397B | Information Retrieval | ReasoningBank | 10.8% | 13.8% | +3.0 | ↓ 3.6% turns |
| 76 | OpenClaw | Qwen3.5-27B | Information Retrieval | ReasoningBank | 10.8% | 12.3% | +1.5 | ↑ 17.8% turns |
| 77 | Nanobot | Qwen3.5-397B | Information Retrieval | GEPA | 10.8% | 12.3% | +1.5 | ↓ 47.6% turns |
| 78 | Nanobot | Qwen3.5-27B | Information Retrieval | ReasoningBank | 6.2% | 9.2% | +3.0 | ↑ 40.6% turns |
| 79 | Nanobot | Qwen3.5-27B | Information Retrieval | GEPA | 6.2% | 4.6% | -1.6 | ↓ 47.4% turns |
| 80 | Nanobot | Qwen3.5-27B | Information Retrieval | Memento | 6.2% | 3.1% | -3.1 | ↑ 18.9% turns |
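
The page does not define the Δ column. In every row, though, it matches the With Skills pass rate minus the Without pass rate, in percentage points, and the headline Avg. Improvement figure is presumably the mean of that column across configurations. The sketch below reproduces that arithmetic for three rows copied from the table (ranks 1, 2, and 17). The function name `delta_points` and the tuple layout are illustrative assumptions, not part of EvoAgentBench.

```python
from statistics import mean

# (label, without %, with skills %) copied from table rows 1, 2, and 17.
rows = [
    ("Nanobot / Knowledge Work / ReasoningBank", 54.9, 65.5),
    ("OpenClaw / Software Engineering / ReasoningBank", 25.0, 65.4),
    ("Nanobot / Reasoning & Problem Decomposition / EverOS", 53.0, 51.0),
]


def delta_points(without: float, with_skills: float) -> float:
    """Change in pass rate, in percentage points (hypothetical helper)."""
    # Round to one decimal to match the table's precision and to hide
    # binary floating-point noise such as 10.600000000000001.
    return round(with_skills - without, 1)


deltas = [delta_points(without, with_skills) for _, without, with_skills in rows]

# Mean change over this three-row sample only, not the full leaderboard.
avg_delta = round(mean(deltas), 1)

for (label, _, _), delta in zip(rows, deltas):
    print(f"{label}: {delta:+.1f}")
print(f"sample mean: {avg_delta:+.1f}")
```

A negative value, as in the third row, means the self-evolving method lowered the pass rate for that configuration. The sample mean here covers only these three rows, so it will not match the page-wide average.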