I Wrote a Plan. The Work Steered It.

What steering through one project taught me about how engineering work actually happens once AI is in the loop.

May 31, 2026

Disclaimer: The views expressed here are my own and are not related to or reflective of my work or any organization I am affiliated with.

In the first plan: PostgreSQL, Redis, ChromaDB, Celery, CrewAI, a hosted model API.

In the working version: SQLite. No Redis. No ChromaDB. No Celery. No CrewAI. The model SDK, called directly.

The plan wasn’t wrong. The plan got steered, in real time, by the work, while the work was being done.

In my last post I argued that engineering organizations aren’t adopting AI. The ones that win are rebuilding around it. The constraint shifted from generation to judgment, and the operating model has to follow.

This is the lifecycle-level companion. What rebuilt looks like from inside one engineer’s loop. Three shifts I keep watching reshape the work.

CommsCrew, referenced throughout, is a personal learning lab. A multi-agent communications platform I keep building so I have real code, real mistakes, and real decisions to look at while the model and the ecosystem evolve underneath me. The practice space. The repo is real. The design docs, the dead-branch architecture, the configuration files are all still in there.

This isn’t a 10x productivity pitch. It isn’t an “AI replaces engineers” argument. It isn’t a tool catalog. It’s three places where the shape of the work changed. Not the speed.

Shift 1: From writing code to designing systems

The bottleneck used to be typing. You knew what to build. The question was how fast you could write it.

Generation is free now. The new bottleneck is design judgment. Knowing what to build, in enough fidelity that an AI agent can produce a correct first pass.

The most valuable work I do now happens in a markdown file before any code gets written.

A concrete number. The Phase 1 activation design in CommsCrew, docs/plans/2026-02-24-phase1-activation-design.md, is 176 lines. It produced a 1,761-line implementation plan. That plan generated thousands of lines of shipped code, across backend endpoints, database migrations, new pages, and a notification system.

A 176-line markdown file was the lever for the whole thing. The leverage isn’t in the code. It’s in the design step.

Pre-AI, I would have written zero of that doc. I would have opened my editor, started coding, and discovered the design while writing the code. That was the right strategy when typing was the constraint.

It is the wrong strategy now. Typing isn’t the constraint. The clarity of intent I can hand to the generator is. A vague brief produces vague output. A precise brief produces shippable output.

The opening paragraph wasn’t a hypothetical. The PLAN.md at the root of the CommsCrew repo still lists all six of the original stack choices. The backend/app_new/ directory next to it uses none of them.

Midway through I realized CrewAI was an unnecessary abstraction between me and the model. ChromaDB was overkill for the memory volume I had. Celery was adding deployment surface area for jobs SQLite could handle inline.

The legacy backend/app/ directory is still in the repo. A museum exhibit of the version I built before realizing the stack was wrong. I could rewrite it because rewriting was no longer expensive.

The proof is one line each, in two files I kept side by side:

# backend/app/core/ai_engine/agents/base_agent.py  (legacy)
from crewai import Agent

# backend/app_new/crew/base_agent.py  (current)
# the model provider's SDK, called directly
import model_sdk

A framework dependency disappeared because changing my mind about it stopped being expensive.

Architecture lock-in used to be a feature of senior engineering. The discipline of not flip-flopping. It is now a liability. The new discipline is staying steerable. When the cost to change your mind drops by an order of magnitude, your decision threshold should drop with it.

The artifact your senior engineers should produce more of is design. Living design documents that get edited as the work uncovers reality. Not architecture diagrams filed once a quarter.

The failure mode is using AI to generate faster while designing worse. Teams get an early win, the first feature ships in a third of the expected time, then the second feature has compounding rework because the design didn’t catch what the first one foreclosed. AI rewards precision in the brief. If you brief vaguely, you waste cycles refactoring the output.

The work isn’t ten times faster. It’s distributed differently. Move the slow part upstream.

Shift 2: From debugging to verifying

I used to spend a third of my engineering time tracking down bugs I had written. Typos. Off-by-ones. Missing null checks. State I forgot to thread through three functions.

Pre-AI, that was just the job. Now AI writes most of the lines, and most of those bugs don’t happen.

But a new kind of bug does. Code that looks right and isn’t. Code that solves a slightly different problem than the one I described. Code that calls an API that doesn’t exist. Code that compiles but isn’t actually wired up.

The job changed from finding my bugs to catching the model’s confident-but-wrong output.

This is the most common failure mode in my codebase now. The agent generates a patch that compiles. The tests pass. The bug lives somewhere the tests don’t reach. A fixture that’s a copy-paste of an older version with stale assumptions. An assertion that’s tautological. A mock that doesn’t match the real service.

The output looks like the right output. Only end-to-end behavior, or someone reading the diff with suspicion, will catch that it isn’t.

Generation got cheap. Verification didn’t.

In CommsCrew the agent system streams responses to the frontend via Server-Sent Events. The hardest bugs I hit weren’t in any single layer. They lived in the integration between the backend stream, the frontend’s getReader() loop, and the agent’s token cadence.

The unit tests passed. The stream parser passed. The agent passed. The only way to see the actual bug was to run the feature end-to-end, in a browser, while watching the network panel.

The interesting failures had become invisible to any single layer of tests.

The interesting bugs now live at boundaries. AI is very good at producing code that satisfies its own immediate context. AI is much worse at producing code that integrates correctly with surrounding code it wasn’t shown.

Page-level tests, in-isolation unit tests, and “it compiled and the linter is happy” are not sufficient evidence of correctness anymore.

The 10x engineer is no longer the one who writes the most code. It’s the one who catches what looks right but isn’t. That’s closer to an editor’s eye than to coding ability. Harder to interview for. Undersupplied in the market.

The failure mode is trust. Teams that trust the model ship faster for two weeks and break things for the next six months. The velocity gain everyone wants is real, but it materializes only when the verification surface keeps up with the generation surface.

Shift 3: From tools to environments

The third shift is the one I think is least appreciated.

Your IDE used to be a text editor with some plugins. It is now an environment. An agent with persistent memory, scoped permissions, parallel working copies, and tools that reach into your browser, your Slack, your task tracker, your design files.

The unit of work isn’t “I’m editing this file.” It’s “I’m in this session with this context.”

CommsCrew lives in a worktree-driven development setup. The .agent/settings.local.json file at the repo root lists which tools my AI session is allowed to run unattended. It is, in effect, a CI/CD policy for the agent:

{
  "permissions": {
    "allow": [
      "Bash(*)",
      "Write(*)",
      "Edit(*)",
      "Read(*)",
      "mcp__playwright__browser_navigate",
      "mcp__playwright__browser_take_screenshot",
      "mcp__playwright__browser_click",
      "WebSearch",
      ...
    ]
  }
}

Pre-AI, “what is this developer allowed to run” was a flat boolean. They have a laptop, they can run whatever they want.

Post-AI, it’s a real policy question. The agent is running things on your behalf at machine speed.

I’m writing this blog post in .agent/worktrees/tender-nash-24e072/. A Git worktree that’s a complete working copy of the CommsCrew repo, isolated from my main branch. Worktrees can live anywhere on disk. I nest agent-spawned ones under .agent/ so they’re scoped to the agent’s workspace rather than scattered alongside human branches.

Branches were sufficient for human-paced parallel work. Agents can run multiple tasks in parallel and they need their own filesystems. Shared state across parallel agents is a quiet category of bug.

The same session that writes Python and TypeScript also drives a Playwright browser to verify the UI. It generates Gamma decks. With the right MCP servers registered it reaches into your task tracker, your design files, your messaging tools. The integration boundary used to be the IDE. Past your editor there were a dozen separate tools you had to context-switch between. The Model Context Protocol collapsed that. The new boundary is the session itself.

The agent in this session has access to skills, reusable methodologies for specific tasks like brainstorming, debugging, code review, frontend design. It has access to a memory store with the project’s history, my voice preferences, anti-patterns I’ve already learned.

Onboarding a new engineer used to be six weeks. Onboarding a new session is thirty seconds.

That isn’t because the agent is smart. It’s because the environment carries the context that used to live in someone’s head.

The failure mode is treating the environment as a personal-productivity thing. Each engineer rolls their own setup. Nothing is shared. Nothing compounds. After a year you have ten engineers running ten different agent configurations, with ten different trust models, ten different memory stores. None of it scales to the team.

The three shifts compound

The shifts aren’t independent.

Shift 1 needs Shift 2, because designing more sharply means catching more design-vs-implementation gaps. That catching is verification.

Shift 2 needs Shift 3, because verification requires the environment to support it. The hooks, the integration tests, the browser automation, the memory of what passed before.

Shift 3 needs Shift 1, because no environment teaches taste. Only design discipline does.

Velocity is the wrong frame. It measures the easy thing, code shipped per week, and ignores the compounding thing: judgment per decision, verification per generation, leverage per environment.

Three questions to ask your team this quarter. Are your senior engineers writing design documents that drive ten times their volume in shipped code? Are you measuring verification quality, what was actually run and observed, rather than output quantity? Is your engineering environment a deliberate choice with shared permissions, shared skills, and shared memory, or is it whatever each engineer set up alone?

If those three answers are no, you’re operating on the velocity frame. If they’re yes, you’re operating on the model frame.

Both will ship product. Only one will compound.

This is a sequel to Engineering Orgs Aren’t Adopting AI. They’re Rebuilding Around It, which made the case at the organizational level. The next, if I can find a clean way to say it, is about what hiring and team structure look like once these shifts have run for a year.

Thinking Through AI

Discussion about this post

Ready for more?