
Nye's Digital Lab is a weekly scribble on creativity in an age of rapid change.
This week I'm training agents, and documenting the phases of intelligence they seem to go through. It's gettin' wild.
Imagine taking two little brains and putting them in a small room where they play endless games against each other. Poker, chess, backgammon, and Connect Four. But also 80's classics like Mario Bros. and Tank!
Obviously, it’s slightly more technical... but in essence, that’s been my early morning experimentation for the past week using the Gymnasium and PettingZoo frameworks (Gymnasium is the maintained successor to OpenAI’s Gym; PettingZoo extends it to multi-agent games).
I’ve been training agents to play games.
Training these little brains means setting up an environment, then deciding how many games you’d like them to play again and again ... and again. (In reinforcement learning each complete game is technically an “episode”; a full training cycle is an “epoch.” I’ll use “epochs” loosely throughout.)
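For concreteness, here is roughly what that loop looks like, sketched in plain Python with a toy coin-guessing “game” standing in for a real PettingZoo environment (all names here are illustrative, not the actual API):

```python
import random

class ToyGame:
    """A stand-in environment: guess a coin flip, reward 1 for a correct guess."""
    def reset(self):
        self.secret = random.choice([0, 1])
        return 0  # a single dummy observation

    def step(self, action):
        reward = 1 if action == self.secret else 0
        return reward, True  # (reward, done) -- one step per episode

def train(env, episodes=1000):
    """Play the game again and again, tallying average reward per episode."""
    total = 0
    for _ in range(episodes):
        env.reset()
        reward, done = env.step(random.choice([0, 1]))
        total += reward
    return total / episodes

env = ToyGame()
print(train(env))  # a purely random agent wins about half the time
```

A real run swaps ToyGame for something like PettingZoo’s connect_four_v3 environment, but the shape of the loop is the same: reset, act, collect reward, repeat.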
I’ve noticed a pattern in the evolution of their intelligence that I suspect holds for training agentic systems in general. The pattern is evident in PettingZoo’s training graphs, which show the emergence of “understanding.”
It’s not simply a matter of more epochs meaning better play.
This is a new art, and I’m still experimenting. However, there are definite “phases” I’ve noticed game-playing agents evolve through. These are my observations more than hard science; call them “field notes” from the training floor.
I’ve done my best to articulate them below.

Phase 1: Chaos
When agents enter the environment, their behavior is nothing but random interaction. I picture them flailing around the virtual environment without any rhyme or reason: haphazard streams of random choices with zero decision-making.
Agents can stay in this phase for a very long time. In my first explorations without really committing to a training regimen, I never saw progression. They just kept making senseless moves indefinitely, and I started wondering if I’d configured something wrong. But I hadn’t. Chaos is simply where everything begins.
There’s something humbling about watching pure randomness play out. It reminds you that intelligence isn’t a given. Learning takes time and energy.
Phase 2: The Rules
Each game environment (provided by the framework) has the rules of the system baked in.
You can’t move twice in Connect Four, and you can’t play poker without putting chips in the pot. The agents just need to learn the game itself. They do this through random happenstance, but when they do something right, the game’s reward function activates.
They have “discovered” they’re doing something positive that the rules of the environment demand. They begin to take the simplest actions that conform to the rule set of the specific environment. They actually begin to play Connect Four.
They aren’t good at it yet, but they seem to become “aware” of what they can and cannot do. It’s a forcing function. This is probably no more efficient than coding a simple if-then application, but it unlocks everything that follows.
Once they understand the rules, they can begin to truly play. The rules phase is like learning grammar before writing poetry: the necessary scaffolding for everything that comes after.
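In PettingZoo’s board-game environments this forcing function is literal: the observation includes an action mask flagging which moves are currently legal. A minimal sketch of an agent that respects the mask (the mask here is hand-written for illustration):

```python
import random

def legal_actions(action_mask):
    """Indices of the moves the rules currently allow (mask of 0s and 1s)."""
    return [i for i, ok in enumerate(action_mask) if ok]

def choose(action_mask):
    """Pick uniformly among legal moves only -- illegal moves never get sampled."""
    return random.choice(legal_actions(action_mask))

# Connect Four: imagine column 3 is full, so it is masked out.
mask = [1, 1, 1, 0, 1, 1, 1]
assert choose(mask) != 3
```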
Phase 3: Pattern Recognition
Identifying patterns is ultimately the point of training agents to play. Humans who are good at games are exceptionally good at this, and it’s the skill you’re trying to cultivate in your agents.
The Connect Four agents first learned the rules (drop tokens, create lines), but to become strategic they had to become aware of their opponent: blocking threats while setting themselves up to make a line of four.
In poker, pattern recognition begins with finding a flush or a pair. And as they get better, they seem to learn to keep the high cards and discard the ones without hand-worthy value.
This is where training begins to justify itself.
The agents aren’t just reacting anymore; they’re beginning to predict. They’re developing something that looks disturbingly like anticipation.
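One textbook way that anticipation gets encoded is a value table nudged toward observed rewards, as in Q-learning. This is a generic sketch, not the exact algorithm behind my agents:

```python
def q_update(q, state, action, reward, next_q_max, alpha=0.1, gamma=0.9):
    """Nudge the stored value toward reward + discounted future value."""
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * next_q_max - old)
    return q[(state, action)]

q = {}
# Repeated wins from the same (state, action) pull its value upward:
for _ in range(50):
    q_update(q, "s0", "block_opponent", reward=1.0, next_q_max=0.0)
print(q[("s0", "block_opponent")])  # approaches 1.0
```

Repeated reward for the same state-action pair pulls its stored value toward 1.0, which is exactly the “prediction” taking shape.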
Phase 4: Optimization
Once they grasp what the complicated reward function actually rewards, they begin to chain patterns together and play strategically.
I’m convinced optimization is the key to intelligence.
Ideally the little brains are forming optimal associations in their little networks, branching one pattern to the next.
Creativity and intelligence are all about pattern associations. What works emerges over tens of thousands upon tens of thousands of epochs. It’s about efficiency: cutting unnecessary steps, finding shortcuts, and essentially doing more with less. But there’s danger in those cuts. How deep can you optimize before you’ve carved away something essential?
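In practice, much of that efficiency comes from gradually trading exploration for exploitation. A typical epsilon-decay schedule looks like this (the numbers are illustrative):

```python
def epsilon(episode, start=1.0, end=0.05, decay=0.995):
    """Exploration rate: high early (chaos), shrinking as patterns firm up."""
    return max(end, start * decay ** episode)

print(epsilon(0))     # 1.0 -- pure random play
print(epsilon(1000))  # clamped to the 0.05 floor
```

With high epsilon the agent acts randomly and explores; as epsilon decays it increasingly exploits the patterns it has already found.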
The best trained agents develop neural pathways that fire in optimal sequences, each connection reinforcing the next.

Phase 5: Collapse
Heads up. This is a guess.
There was a time when I experimented with GANs (Generative Adversarial Networks) and learned about “mode collapse” firsthand. It’s a failure state where the generator collapses to churning out the same narrow handful of outputs, and the model becomes useless.
There is a danger to too much training.
It’s possible that training too much turns the agent into a rule-following entity as opposed to a decision-making entity. You’ve trained too much and pushed past the limits.
Overfitting is when your agent becomes too specialized to its training environment. Instead of a flexible decision-maker, you’ve created something rigid that can’t adapt to slight variations. You’ve trained away the very intelligence you were trying to build.
I haven’t seen collapse with my simple game agents, but the threat looms larger with more complex systems. Sometimes things just go bad, and a collapsed model is worthless.
The challenge is recognizing the sweet spot: enough training to develop capability, not so much that you’ve trained away adaptability. Stop before the collapse.
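Finding that sweet spot is usually automated with a held-out check: evaluate the agent periodically in a slightly varied environment and stop once held-out performance stops improving. A minimal early-stopping sketch (the scores below are simulated, not real training data):

```python
def early_stop(eval_scores, patience=3):
    """Return the index to stop at: the last improvement, once `patience`
    evaluations in a row have failed to beat the best score."""
    best, best_i, bad = float("-inf"), 0, 0
    for i, s in enumerate(eval_scores):
        if s > best:
            best, best_i, bad = s, i, 0
        else:
            bad += 1
            if bad >= patience:
                return best_i
    return best_i

# Held-out win rate climbs, then degrades as the agent overfits:
scores = [0.41, 0.55, 0.62, 0.66, 0.64, 0.61, 0.58]
print(early_stop(scores))  # stops at index 3, the peak before the decline
```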

Working with my little brains in my Python library has me thinking about larger agentic workflows in society and these phases of learning on a larger scale. If teaching an AI to play Connect Four follows these five predictable phases, what does that tell us about training more sophisticated systems?
Right now, this coordination seems to be more art than science: knowing when to push training forward, when to adjust rewards, and when to declare victory and move on. I wonder how this process can be standardized and made available to everyone who wants to participate in shaping how artificial intelligence develops.
I suspect these phases mirror how humans learn.
We start in chaos, grasping for understanding.
We learn the rules of our environment. We recognize patterns.
We optimize.
And yes, sometimes we overtrain, becoming so rigid in our expertise that we forget how to adapt.
If these parallels hold, then understanding how to train AI agents might teach us something about cultivating intelligence more broadly. Teaching our students, our organizations, and ourselves.
Or maybe it's just fun to watch these little brains play games.
That's it for this time. I do this every week. If you wish to support my work, consider purchasing my collection of essays from 2025; the link is below. If you vibe with the ideas I express, consider subscribing or sharing with friends.
Nye Warburton is an educator and trainer of agents from Savannah, Georgia. These essays are improvised with Otter.ai and refined with Claude Sonnet 4.
WTF are Agents?, June 1, 2025
Here Comes Reinforcement Learning, March 30, 2025
The Search for the Reward Function, December 7, 2025
OpenAI Gymnasium and Petting Zoo: Legitimate, widely-used frameworks for reinforcement learning research. Gymnasium provides standardized environments for training agents, while Petting Zoo specializes in multi-agent scenarios. ✓
Epochs and Training Cycles: The description of epochs as complete passes through training with weight/bias adjustments is accurate for machine learning training processes. ✓
Reward Functions: The concept of reward functions triggering learning in reinforcement learning is fundamental to the field. Agents learn by discovering which actions produce positive rewards. ✓
Pattern Recognition Phase: Research confirms that reinforcement learning agents progress from random exploration to pattern recognition as they accumulate experience. Well-documented in the literature. ✓
Overfitting: A genuine and significant concern in machine learning. Models can become too specialized to training data and lose generalization ability. ✓
Mode Collapse: A real phenomenon most commonly discussed in the context of GANs, where the generator produces limited varieties of outputs. Correctly identified as a training failure state, though less common in game-playing reinforcement learning agents. ✓
Five-Phase Framework: While not standard academic nomenclature, these phases align with documented stages in reinforcement learning: random exploration, basic policy formation, strategy development, policy optimization, and potential overfitting. The framework is experientially sound. ✓