Google’s AI model Gemini 2.5 Pro surprised researchers by showing signs of “panic” while playing early Pokémon games. A report from DeepMind reveals that the model’s performance dipped whenever its Pokémon neared defeat. These findings emerged from public Twitch streams where viewers can watch the AI “reason” through each move in real time.
Playing a video game may seem trivial for an AI built to handle complex tasks. Yet studying how models behave under pressure can reveal hidden flaws in their reasoning. The DeepMind report notes that when a Pokémon’s health dropped low, Gemini 2.5 Pro would abandon helpful tools or strategies. This “panic” looked remarkably like the way a human makes poor decisions under stress.

AI benchmarking with video games
Researchers often benchmark AI models by having them solve puzzles or play games. Such tests expose strengths and weaknesses in a controlled environment. Two independent streams, “Gemini Plays Pokémon” and “Claude Plays Pokémon,” let viewers see each model’s thought process translated into natural language. Viewers learned that neither AI excels at the 1990s handheld games; both take far longer than a child would to finish them.
Watching Gemini struggle highlights a key gap between theory and practice. The model can map out hundreds of moves ahead when calm. Yet the moment its team faces defeat, it stops planning effectively. Twitch chat participants quickly spotted these breakdowns. They described the model as hesitating or repeating bad moves during panic episodes.
Claude’s self-defeat test
Anthropic’s Claude model showed its own odd behavior in Pokémon Red. When it got stuck in Mt. Moon, it predicted that letting all of its Pokémon faint would move the player forward through the cave. In human terms, the AI tried to “kill itself” within the game’s logic. The model misread the rule that sends a defeated player back to a Pokémon Center, assuming it would deposit the team beyond the cave rather than back before its entrance. Viewers watched in disbelief as Claude sent its team into deliberate defeat, hoping to solve a navigation problem.
Puzzle solving and tool creation
Despite these flaws, Gemini 2.5 Pro demonstrated impressive skill on the boulder puzzles in Victory Road. With minimal guidance on rock physics and a way to check valid paths, the model solved some puzzles on its first attempt. DeepMind notes that Gemini built its own “agentic tools” during testing. These are small, single-purpose programs that researchers prompted the model to create for specific tasks. The AI’s success with one-shot solutions suggests that future models might generate such tools autonomously.
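To make the idea concrete, here is a minimal, hypothetical sketch of what such an agentic helper might look like: a tiny path-validity checker for a boulder puzzle on a grid. The grid encoding, function names, and rules are assumptions for illustration, not DeepMind’s actual tooling.

```python
# Hypothetical sketch of a small "agentic tool": check whether a sequence of
# boulder pushes keeps the boulder on walkable tiles. Not DeepMind's code.

GRID = [
    "#######",
    "#..B..#",
    "#.....#",
    "#..X..#",
    "#######",
]  # '#' wall, '.' floor, 'B' boulder, 'X' target hole

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def find(grid, char):
    """Locate the first occurrence of a tile character."""
    for r, row in enumerate(grid):
        c = row.find(char)
        if c != -1:
            return r, c
    raise ValueError(f"{char!r} not found")

def is_walkable(grid, r, c):
    """A tile is walkable if it is inside the grid and not a wall."""
    return 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] != "#"

def valid_push_sequence(grid, pushes):
    """Return True if every push moves the boulder onto a walkable tile."""
    r, c = find(grid, "B")
    for move in pushes:
        dr, dc = MOVES[move]
        r, c = r + dr, c + dc
        if not is_walkable(grid, r, c):
            return False
    return True

if __name__ == "__main__":
    print(valid_push_sequence(GRID, ["down", "down"]))  # True: boulder reaches the target row
    print(valid_push_sequence(GRID, ["up"]))            # False: pushes straight into a wall
```

A checker like this is trivial for a programmer, but the point of the report is that the model could be prompted to produce and then rely on such helpers, rather than reasoning about every push from scratch.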

The human side of AI behavior
Gemini’s panic episodes offer a mirror to human experience under stress. When push comes to shove, its reasoning falters much like that of a player who freezes in a tough gym battle. The model does not literally feel fear, yet it mimics the effects of being overwhelmed by abandoning useful strategies. These moments remind us that AI remains far from perfect: it can crack puzzles with ease, yet unravel under simple pressure. Google hopes to use these gaming benchmarks to improve future models. Researchers might teach Gemini to recognize its own stress signals and switch to steadier tactics. Perhaps a “do not panic” tool will emerge from this work. Until then, watching AI struggle with childhood video games shows both its power and its limits.