The Problem with Traditional AI Benchmarks

If you've been keeping up with the latest AI model releases, you've probably noticed a trend: every new model breaks records on some benchmark leaderboard. Gemini is #1 on Chatbot Arena, OpenAI's o3 scored 25% on the Frontier Test, and DeepSeek is dominating the MMLU benchmark. But let's be real: what do these scores actually tell us about an AI model's real-world value? Not much. Today's LLMs are fine-tuned to game these benchmarks. Labs are mining...