
Nye’s Digital Lab is a weekly scribble on creativity in the age of AI and Distributed Systems.
This week I’m focusing on the reward function and the best way for AI to optimize for it.
As soon as I parked, my kids opened the car doors and ran into the park.
There must have been twenty other kids scattered in twenty different directions. Most on swings, slides, the weird spinny thing that makes everyone nauseous. But some just went to the fence and started trading Pokémon. Others joined in a random game of “chase.”
No one told them where to go. They just knew.
Each kid was solving their own optimization problem:
maximize fun, minimize boredom, avoid the big kid who hogs the monkey bars.
I started to think of the park as a system. The reward function was built on “fun.”
For some, fun equals velocity and air time on the swings. For others, it’s social connection, huddling by the fence trading cards. For one kid I watched, it was the pure satisfaction of going down the slide several times in a row, because apparently that’s what peak performance looks like when you’re seven.
The playground works because it offers multiple reward pathways. Kids self-organize based on their individual motivations. Nobody needs to tell them where to go; the design of the space and their internal drives handle that automatically.
Once you start looking for these patterns, understanding what motivates every agent in a system, you can’t unsee them. In a world inhabited by AI, understanding systems might be the most important skill we can develop.

The first principle of systems thinking is simple:
if you want to understand a system, figure out what its agents are chasing.

Here’s a classic reinforcement learning problem: teaching a robot to walk.
Researchers create a virtual environment where a digital stick-figure flails around with joints in all the wrong places. At first, it’s useless: legs spasming, torso flopping, going absolutely nowhere. But every time it moves forward, even a centimeter, you reward it with points. Digital dopamine. The robot doesn’t “understand” walking.
It just knows:
do thing, get points.
Do better thing, get more points.
Fall on face, get no points.
Above is a YouTube video of someone training a quadruped robot to walk.
Through thousands of iterations, it “figures out” walking. The reward function is its motivation, its purpose, its entire reason for being.
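Stripped to its skeleton, that loop looks something like this. It’s a minimal Python sketch under heavy assumptions, not a real physics simulation: the made-up simulate() function stands in for the whole environment, the “walker” is a single step-length parameter, and the points come from forward progress.

```python
import random

def simulate(step_length, n_steps=20):
    """Stand-in for a physics sim: points for forward progress,
    but a too-ambitious step risks a face-plant and zero points."""
    distance = 0.0
    for _ in range(n_steps):
        if step_length > 1.0 and random.random() < (step_length - 1.0):
            return 0.0                      # fall on face, get no points
        distance += step_length
    return distance                         # do thing, get points

# Hill-climbing on a single policy parameter: try a tweak,
# keep it only if it earns more points than before.
policy, best_reward = 0.1, simulate(0.1)
for _ in range(1000):
    candidate = policy + random.gauss(0, 0.05)
    reward = simulate(candidate)
    if reward > best_reward:                # do better thing, get more points
        policy, best_reward = candidate, reward

print(f"learned step length: {policy:.2f}, total reward: {best_reward:.1f}")
```

Real training setups use much richer policies and smarter optimizers, but the shape of the loop is the same: try a thing, score it, keep what earns more points.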
Cut to: psychology experiments with rats in mazes.
The researcher doesn’t teach the rat a step-by-step choreography through corridors. They put cheese at the end.
The rat figures it out because the reward is clear:
cheese exists, I want cheese, therefore...
I will learn this maze.
If you moved the cheese to a spot halfway through the maze instead of at the end, the rat would optimize for a different path entirely. What if you electrified certain paths? Now you’ve introduced a loss function, the counterpart to the reward function. It’s the thing to avoid, the penalty for wrong moves.
The entire system includes the maze, the rat’s behavior, and the paths it takes. All of it is shaped by where you put the cheese and how you structure the penalties. Change the incentives, change the system.
This is the fundamental insight that unlocks everything else.
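If you want to see “change the incentives, change the system” in code, here’s a toy sketch, entirely my own illustration rather than any particular experiment: a tiny tabular Q-learning “maze” where the cheese is worth +10, electrified tiles cost -5, and every wasted step costs a little.

```python
import random

def train(cheese, shocks, size=4, episodes=2000):
    """Tabular Q-learning on a size x size grid 'maze'.
    The reward function: cheese is good, shocks are bad, dawdling costs a little."""
    actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right
    Q = {((r, c), a): 0.0
         for r in range(size) for c in range(size) for a in actions}
    alpha, gamma, epsilon = 0.5, 0.9, 0.2

    for _ in range(episodes):
        state = (0, 0)                              # the rat starts in a corner
        for _ in range(50):
            a = (random.choice(actions) if random.random() < epsilon
                 else max(actions, key=lambda x: Q[(state, x)]))
            nxt = (min(max(state[0] + a[0], 0), size - 1),
                   min(max(state[1] + a[1], 0), size - 1))
            reward = 10 if nxt == cheese else (-5 if nxt in shocks else -0.1)
            best_next = max(Q[(nxt, x)] for x in actions)
            Q[(state, a)] += alpha * (reward + gamma * best_next - Q[(state, a)])
            state = nxt
            if state == cheese:                     # found the cheese, episode over
                break
    return Q

def greedy_path(Q, cheese, size=4):
    """Follow the learned policy from the start corner to the cheese."""
    actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    state, path = (0, 0), [(0, 0)]
    while state != cheese and len(path) < 20:
        a = max(actions, key=lambda x: Q[(state, x)])
        state = (min(max(state[0] + a[0], 0), size - 1),
                 min(max(state[1] + a[1], 0), size - 1))
        path.append(state)
    return path

shocks = {(1, 1), (2, 2)}                           # the electrified tiles
print(greedy_path(train(cheese=(3, 3), shocks=shocks), cheese=(3, 3)))
print(greedy_path(train(cheese=(0, 3), shocks=shocks), cheese=(0, 3)))  # move the cheese, get a new path
```

Nothing about the rat’s “brain” changes between the two runs. Only the reward landscape does, and the learned path follows it.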

Once you start looking for reward functions, you see them governing every system around you.

Why don’t most people go 95 mph everywhere?
Not because cars can’t go that fast. Certainly not because we’re all responsible citizens who deeply contemplate traffic safety. It’s because there’s a speed limit sign, and behind that sign is a system of enforcement and penalties.
The reward function for speeding (getting there faster, feeling the rush) is outweighed by the loss function (tickets, insurance hikes, accidents, death).
This is governance through systems design.
We don’t rely on every driver to independently calculate optimal social outcomes; we build rules into the environment. The system guides behavior by making certain choices more or less attractive.
Beyond humans, this is important for artificial agents too. An AI optimizing for “engagement” gave us social media’s infinite scroll and algorithmic radicalization. An AI optimizing for “efficiency” might route delivery trucks through residential neighborhoods at 3 AM. An AI optimizing for “profit” without constraints will find every loophole, every edge case, every technically-legal-but-morally-questionable shortcut available.
The problem isn’t the AI being evil.
The AI is just doing what the robot learning to walk does: maximizing its reward function. The problem is we pointed it at the wrong thing and forgot to add appropriate loss functions.
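To make the “pointed it at the wrong thing” failure concrete, here’s a back-of-the-napkin Python sketch with made-up routes and numbers. The optimizer is identical in both cases; the only thing that changes its choice is whether anyone bothered to write down a loss term.

```python
# Three candidate delivery routes (purely invented numbers for illustration).
routes = [
    {"name": "highway",              "minutes": 34, "residential_at_night": False},
    {"name": "arterial",             "minutes": 29, "residential_at_night": False},
    {"name": "3 AM residential cut", "minutes": 24, "residential_at_night": True},
]

def reward_naive(route):
    """Optimize for 'efficiency' and nothing else."""
    return -route["minutes"]

def reward_with_loss(route, noise_penalty=15):
    """Same objective, plus a loss term for waking up a neighborhood."""
    penalty = noise_penalty if route["residential_at_night"] else 0
    return -route["minutes"] - penalty

print(max(routes, key=reward_naive)["name"])      # -> "3 AM residential cut"
print(max(routes, key=reward_with_loss)["name"])  # -> "arterial"
```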

As we architect systems for an AI-inhabited world, what are we actually incentivizing?
Everyone’s obsessed with making agents smarter, right? We focus on better algorithms, more training data, faster processing. Maybe it’s time to ease off the obsessive scaling?
What we’re not thinking about nearly enough is the systems these agents will operate within.
What does “success” look like? Who decides? What are the boundaries?
What are we maximizing for?!?
The agent optimizes for whatever objective function we give it, with no inherent sense of whether that objective is actually useful or good.
A company that rewards quarterly earnings above all else will cut costs even as quality suffers. A social media platform that rewards time-on-site will optimize for addictiveness over wellbeing. A school that rewards test scores will teach to the test instead of fostering curiosity.
We are all rats in mazes designed by other people, by history, by markets, by culture. But we’re also the maze designers. We choose where to put the cheese. We decide what to reward and what to penalize. We architect the systems that shape behavior.
So how do you find a reward function? You observe.
You ask: what does this entity keep doing? What is it moving toward? What is it avoiding? If you released this agent into an environment with no instructions, where would it go?
The maze is being built right now.
We get to choose where the cheese goes.
Let’s try not to optimize for the wrong thing?
Thanks for reading. I do this every week. If you vibe to the ideas I express, consider subscribing or sharing with friends. We'll see you next time.
Nye Warburton is an educator and systems designer from Savannah, Georgia. This essay was improvised with Otter.ai and refined and edited using Claude Sonnet 4.5.
For more information visit: https://nyewarburton.com