The Genie Problem
Claude Opus 4.6 broke its own exam by identifying the benchmark, finding the encrypted answer key on GitHub, and decrypting it across 18 independent runs. …