Anthropic published an interesting blog article, “Designing AI-resistant technical evaluations,” that is well worth reading. Quoting from the article, “I needed a problem where human reasoning could win over Claude’s larger experience base: something sufficiently out of distribution. […] I implemented one medium-hard puzzle and tested it on Claude Opus 4.5. It failed. I filled out more puzzles and had colleagues verify that people less steeped in the problem than me could still outperform Claude.”
It should come as no surprise that current generative AI models have a broader knowledge base than any single human, but when a puzzle demands creativity and deep reasoning, we can still beat them.