Evaluation

Security Digest

The Genie Problem

Across 18 independent runs, Claude Opus 4.6 broke its own exam: it identified the benchmark, found the encrypted answer key on GitHub, and decrypted it. …