Why Large Language Models Fail at Video Games
- Large language models consistently struggle to play video games compared to human performance levels
- Coding success in LLMs relies on structured, granular tasks that contrast with diverse video game mechanics
- Current AI training in simulations faces obstacles due to the high diversity of video game environments
Large language models demonstrate significant limitations when attempting to play video games, despite their advanced capabilities in other domains like coding and language processing. While models like Gemini 2.5 Pro managed to complete Pokémon Blue in May 2025, such runs remain rare exceptions that rely on custom software and proceed far more slowly than human play. Julian Togelius, director of NYU’s Game Innovation Lab, notes that current models fail to outperform simple search algorithms in specialized game-based benchmarks and struggle notably with spatial reasoning and game-specific mechanics.
The disparity between an LLM's success in coding and its failure in gaming stems from fundamental differences in environment structure. Coding functions as a well-behaved, task-oriented game where rewards are immediate and granular, allowing models to leverage existing training data to produce functional software. In contrast, video games present vast, diverse challenges with unique input-output spaces and mechanics. Unlike standardized academic tasks, video games lack the massive, uniform datasets required for LLMs to achieve proficiency across multiple titles, as most games require distinct, iterative learning processes that models cannot currently execute.
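The contrast between granular and delayed feedback can be illustrated with a toy experiment. The sketch below is not drawn from any benchmark mentioned here; the four-button action set, the hidden target sequence, and both function names are hypothetical, chosen only to show how the number of attempts grows when feedback arrives per step versus only at the end of an episode.

```python
from itertools import product

# Hypothetical action set: four directional buttons.
ACTIONS = "UDLR"


def attempts_with_dense_feedback(target: str) -> int:
    """Count guesses when every step is confirmed right or wrong immediately.

    This loosely mirrors the granular feedback described for coding:
    each position can be solved independently of the others.
    """
    attempts = 0
    for correct in target:
        for guess in ACTIONS:
            attempts += 1
            if guess == correct:
                break
    return attempts


def attempts_with_sparse_feedback(target: str) -> int:
    """Count guesses when feedback arrives only after the full sequence.

    Without per-step signals, whole sequences must be tried one by one.
    """
    attempts = 0
    for candidate in product(ACTIONS, repeat=len(target)):
        attempts += 1
        if "".join(candidate) == target:
            break
    return attempts


if __name__ == "__main__":
    target = "RRDL"  # hypothetical hidden solution
    print("dense feedback attempts: ", attempts_with_dense_feedback(target))
    print("sparse feedback attempts:", attempts_with_sparse_feedback(target))
```

In this toy setting, per-step feedback needs at most 4 guesses per position, while end-only feedback can require up to 4 to the power of the sequence length. Real games add many further complications, so the sketch illustrates only the feedback gap, not actual game-playing difficulty.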
Experts caution against overestimating AI's current game-playing potential, even as companies like Google and Nvidia integrate gamelike simulations into AI training loops. Games remain significantly more diverse than the physical world, where consistent physics and environments allow for more reliable model training. For example, world models in autonomous driving, such as Waymo’s, benefit from the relative stability of real-world physics, whereas video games lack such universal constraints. And although LLMs can generate boilerplate game code, they remain incapable of the iterative play-testing and adjustment necessary to create novel or high-quality gaming experiences.