How I Stopped Babysitting Claude Code (And Started Walking Away)

I used to watch Claude Code like a helicopter parent.

Every few minutes: Is it still on track? Did it actually write tests or just say it did? When it says “done,” is it actually done?

The answer was often no. Claude Code would stop mid-task and claim completion. It would write tests that looked right but tested nothing. It would “refactor” code by replacing implementation with TODO comments. I couldn’t leave my desk. I couldn’t trust the output. I was reviewing everything anyway, which defeated the whole point.

Then I found a setup that let me walk away.

Why I Needed Model-Agnostic Guardrails

Here’s the backstory. I’m in Hong Kong. No official Claude access. At work, we’re on AWS Bedrock for security reasons - can’t hit Anthropic servers directly. At home, I’d need a VPN to use Claude, which is annoying, and Kimi K2 through Moonshot has the same performance at better cost anyway.

So I ended up using:

Work: Amazon Bedrock with Sonnet 4.5 and Opus 4.5
Home: Moonshot API with Kimi K2

Two different models. Same problem: I couldn’t trust either of them unsupervised.

This forced me to think differently. Instead of asking “which model is better,” I started asking “what catches mistakes regardless of which model I’m using?”

The answer was ClaudeKit.

The Setup That Lets Me Walk Away

First, let me get this out of the way: I’m not affiliated with ClaudeKit. Found it on Discord, connected with Carl to get setup advice, and it solved my problem. That’s it.

ClaudeKit gives you hooks, subagents, and slash commands for Claude Code. Hooks are scripts that run automatically when Claude Code does things - like editing a file or finishing a task. You run claudekit setup, pick which hooks and subagents you want, and it configures everything for you.

Two hook types matter most for my workflow, and they work differently:

PostToolUse hooks run every time Claude Code touches a file. Deterministic TypeScript scripts. Actual code checking code. No AI involved.
Catches things like “you replaced implementation with a TODO comment” or “you prefixed a parameter with underscore but didn’t remove it.” Binary pass/fail.
Stop hooks run when Claude Code finishes (or claims to finish). Mine triggers a self-review, which does use the model. Non-deterministic. An AI looking at AI output with fresh context.
Catches the “did you actually finish what you said you’d finish?” stuff that needs judgment.

Deterministic hooks catch mechanical mistakes. Stop hooks catch the stuff that needs judgment.

Here’s part of my current setup:

What This Looks Like in Practice

I’ve been using this for about five months. Here’s what it looks like when it actually fires.

January 18th, 2026. I was building a Supabase ping scheduler - a thing that pokes your database periodically so Supabase doesn’t pause your free-tier project.

Bugbot had flagged 9 issues. Claude Code was deep in it - 10 commits, 8 files, fixing RealDictCursor configs, debugging async shutdown, adjusting timezone validation. Real in-the-weeds work.

I had to step away. Instead of stopping the session and reviewing everything, I just... left.

Claude Code finished. Stop hook auto-triggered within 2 seconds.

When I came back, the self-review had already run. Fresh context, clean perspective on the whole PR:

95% test coverage (verified, not claimed)
All 9 bugs actually fixed across all 10 commits
Architecture was coherent
Production-ready with minor refinements

The main agent was too close to see this. It was heads-down in the last bugfix. The stop hook gave that “step back and look at the whole thing” perspective that I would have done manually. Except I didn’t have to be there.

I didn’t review Claude Code’s work. I reviewed the review. Much faster.

This wasn’t a one-off. Earlier that same day, I’d manually triggered a code review on the scheduler architecture. Two reviews, same project, same day. One manual, one automatic. Both useful. That’s what systematic looks like.

What the Hooks Actually Catch

Most of what the hooks catch isn’t dramatic. It’s death-by-a-thousand-cuts prevention.

The self-review hook catches Claude Code claiming done when it’s not. It looks at the actual state of the code, not the agent’s summary. Incomplete work, missing edge cases, TODOs that should have been implemented.

Then there’s code becoming comments. Claude Code sometimes does this:

Real code deleted, TODO comment left. The hook catches when implementation becomes placeholder. This happens more than you’d think.

Fake refactoring is another one. Parameters get prefixed with underscore but never removed:

These accumulate. They make code harder to understand. The hook catches them immediately.

And linting runs on every file save, so style drift never becomes a thing. No manual code review needed for formatting.

None of this is glamorous. It’s just systematic. The hooks run invisibly. You don’t think about them. They just work.

Why the Model Doesn’t Matter

Here’s what surprised me: there’s no significant quality difference between Sonnet 4.5 and Kimi K2 for this workflow.

Speed varies. Depth varies. But both benefit equally from the guardrails.

The PostToolUse hooks are pure TypeScript - deterministic scripts that don’t care what model wrote the code. They check the output, not the source. Whether you’re using Kimi through Moonshot, Sonnet through Bedrock, or Opus for the heavy stuff, the same checks run.

The self-review stop hook does use a model, but it’s checking with fresh context - not the same agent that just wrote the code. That separation is what makes it useful. The reviewer isn’t invested in the implementation.

The Workflow Now

Plan mode first. I use /plan to make sure Claude Code has a clear vision of what I want built. No ambiguity.
Start execution and walk away. The hooks keep it on rails. Stop hook will self-review when it finishes.
Come back to reviewed code. I’m reviewing the review, not the raw output. Sometimes I iterate. Usually I ship.

For bigger projects with multiple tasks, I add Taskmaster MCP to the mix. It feeds Claude Code structured tasks with dependencies already mapped. Claude Code pulls the task, executes, guardrails catch issues. Same trust, bigger scope. I wrote about how that works.

This is how I’ve shipped 10 production tools in two months. Risk dashboards, monitoring systems, alert scripts, compliance generators. Not by watching every keystroke. By trusting a system that catches what Claude Code misses when it’s in the weeds.

What Still Worries Me

I don’t want to oversell this. The hooks catch a lot, but they’re not magic.

They catch structural issues. Incomplete work. Lazy refactoring. Style violations. They don’t catch logic bugs that pass tests. They don’t catch architectural decisions that are technically correct but wrong for the use case.

I still review. Just less, and at a higher level.

And sometimes Claude Code still surprises me with something the hooks didn’t catch. That’s the nature of it. But the baseline is so much better than babysitting.

Try It

The setup wizard walks you through picking which hooks and subagents you want. Select self-review, check-todos, lint-changed, and check-comment-replacement if you want my setup. Or run claudekit setup --yes to accept the defaults.

Use plan mode. Walk away.

The model debate is a distraction. Whether you’re on Claude, Kimi, or whatever comes next, the guardrails are what let you trust the output.

I spent months chasing the “right” model before I figured that out. The architecture around the model matters more.

Thanks for reading Orchestrated Code! Subscribe for free to receive new posts and support my work.

More from Orchestrated Code

Orchestrated Code

Dec 27

How Claude Gave Me Permission to Ship: A Python Dev's Journey from Ideas to 5,500 Users

The story of how AI agents helped me build tip.md — without me learning TypeScript

Cover image for What Ralph Wiggum Loops Are Missing (And When It Starts to Matter)

Orchestrated Code

Jan 24

What Ralph Wiggum Loops Are Missing (And When It Starts to Matter)

I shipped two products with this “new” pattern before anyone named it.

Cover image for 1.2 Humans, 20 AI Agents, Same Revenue

Orchestrated Code

Jan 19

1.2 Humans, 20 AI Agents, Same Revenue

Jason Lemkin replaced his entire sales team with AI. The playbook works in any domain.

I used to watch Claude Code like a helicopter parent.

Every few minutes: Is it still on track? Did it actually write tests or just say it did? When it says “done,” is it actually done?

Then I found a setup that let me walk away.

Why I Needed Model-Agnostic Guardrails

So I ended up using:

Work: Amazon Bedrock with Sonnet 4.5 and Opus 4.5
Home: Moonshot API with Kimi K2

Two different models. Same problem: I couldn’t trust either of them unsupervised.

This forced me to think differently. Instead of asking “which model is better,” I started asking “what catches mistakes regardless of which model I’m using?”

The answer was ClaudeKit.

The Setup That Lets Me Walk Away

First, let me get this out of the way: I’m not affiliated with ClaudeKit. Found it on Discord, connected with Carl to get setup advice, and it solved my problem. That’s it.

Two hook types matter most for my workflow, and they work differently:

PostToolUse hooks run every time Claude Code touches a file. Deterministic TypeScript scripts. Actual code checking code. No AI involved.
Catches things like “you replaced implementation with a TODO comment” or “you prefixed a parameter with underscore but didn’t remove it.” Binary pass/fail.
Stop hooks run when Claude Code finishes (or claims to finish). Mine triggers a self-review, which does use the model. Non-deterministic. An AI looking at AI output with fresh context.
Catches the “did you actually finish what you said you’d finish?” stuff that needs judgment.

Deterministic hooks catch mechanical mistakes. Stop hooks catch the stuff that needs judgment.

Here’s part of my current setup:

What This Looks Like in Practice

I’ve been using this for about five months. Here’s what it looks like when it actually fires.

January 18th, 2026. I was building a Supabase ping scheduler - a thing that pokes your database periodically so Supabase doesn’t pause your free-tier project.

Bugbot had flagged 9 issues. Claude Code was deep in it - 10 commits, 8 files, fixing RealDictCursor configs, debugging async shutdown, adjusting timezone validation. Real in-the-weeds work.

I had to step away. Instead of stopping the session and reviewing everything, I just... left.

Claude Code finished. Stop hook auto-triggered within 2 seconds.

When I came back, the self-review had already run. Fresh context, clean perspective on the whole PR:

95% test coverage (verified, not claimed)
All 9 bugs actually fixed across all 10 commits
Architecture was coherent
Production-ready with minor refinements

I didn’t review Claude Code’s work. I reviewed the review. Much faster.

What the Hooks Actually Catch

Most of what the hooks catch isn’t dramatic. It’s death-by-a-thousand-cuts prevention.

Then there’s code becoming comments. Claude Code sometimes does this:

Real code deleted, TODO comment left. The hook catches when implementation becomes placeholder. This happens more than you’d think.

Fake refactoring is another one. Parameters get prefixed with underscore but never removed:

These accumulate. They make code harder to understand. The hook catches them immediately.

And linting runs on every file save, so style drift never becomes a thing. No manual code review needed for formatting.

None of this is glamorous. It’s just systematic. The hooks run invisibly. You don’t think about them. They just work.

Why the Model Doesn’t Matter

Here’s what surprised me: there’s no significant quality difference between Sonnet 4.5 and Kimi K2 for this workflow.

Speed varies. Depth varies. But both benefit equally from the guardrails.

The Workflow Now

Plan mode first. I use /plan to make sure Claude Code has a clear vision of what I want built. No ambiguity.
Start execution and walk away. The hooks keep it on rails. Stop hook will self-review when it finishes.
Come back to reviewed code. I’m reviewing the review, not the raw output. Sometimes I iterate. Usually I ship.

What Still Worries Me

I don’t want to oversell this. The hooks catch a lot, but they’re not magic.

I still review. Just less, and at a higher level.

And sometimes Claude Code still surprises me with something the hooks didn’t catch. That’s the nature of it. But the baseline is so much better than babysitting.

Try It

Use plan mode. Walk away.

The model debate is a distraction. Whether you’re on Claude, Kimi, or whatever comes next, the guardrails are what let you trust the output.

I spent months chasing the “right” model before I figured that out. The architecture around the model matters more.

Thanks for reading Orchestrated Code! Subscribe for free to receive new posts and support my work.

More from Orchestrated Code

Orchestrated Code

Dec 27

How Claude Gave Me Permission to Ship: A Python Dev's Journey from Ideas to 5,500 Users

The story of how AI agents helped me build tip.md — without me learning TypeScript

Orchestrated Code

Jan 24

What Ralph Wiggum Loops Are Missing (And When It Starts to Matter)

I shipped two products with this “new” pattern before anyone named it.

Orchestrated Code

Jan 19

1.2 Humans, 20 AI Agents, Same Revenue

Jason Lemkin replaced his entire sales team with AI. The playbook works in any domain.

More from Orchestrated Code

Why I Needed Model-Agnostic Guardrails

The Setup That Lets Me Walk Away

What This Looks Like in Practice

What the Hooks Actually Catch

Why the Model Doesn’t Matter

The Workflow Now

What Still Worries Me

Try It

No comments yet

More from Orchestrated Code

Orchestrated Code

More from Orchestrated Code

Why I Needed Model-Agnostic Guardrails

The Setup That Lets Me Walk Away

What This Looks Like in Practice

What the Hooks Actually Catch

Why the Model Doesn’t Matter

The Workflow Now

What Still Worries Me

Try It

No comments yet

More from Orchestrated Code

How I Stopped Babysitting Claude Code (And Started Walking Away)

The model debate is a distraction. Here's what actually matters.

How I Stopped Babysitting Claude Code (And Started Walking Away)

The model debate is a distraction. Here's what actually matters.

No comments yet

No comments yet

Why I Needed Model-Agnostic Guardrails

The Setup That Lets Me Walk Away

What This Looks Like in Practice

What the Hooks Actually Catch

Why the Model Doesn’t Matter

The Workflow Now

What Still Worries Me

Try It

Why I Needed Model-Agnostic Guardrails

The Setup That Lets Me Walk Away

What This Looks Like in Practice

What the Hooks Actually Catch

Why the Model Doesn’t Matter

The Workflow Now

What Still Worries Me

Try It