Context Is the New Craft in AI-Assisted Development

Summary
Close

The workflows, review gates, and skills behind a year of agent-assisted development

A companion post, AI Didn’t Replace Developers. It Moved the Work., argues that AI agents have not removed developers; they have moved developer judgment upstream into specification and downstream into review.

This post is the operational half of that argument: what the workflow actually looks like, where it tends to fail, and the practices that keep it honest.

The short version: agents amplify whatever you bring them.

Bring vague intent and they will produce a plausible version of the wrong thing very efficiently. Bring a clean specification, curated context, and a verification loop and they become a serious force multiplier.

The craft is in everything that surrounds the keystroke.

Part 2 of 2: This post focuses on workflows and operational practices. The companion post explains the broader shift in software engineering work created by AI coding agents.

TL;DR

  • Context curation is a core engineering skill. Beyond a practical threshold, more context hurts more than it helps.
  • A repeatable spec → grill → plan → implement → verify loop catches most avoidable failures before code is written.
  • Agents fail in recognizable shapes; building countermeasures into the workflow is cheaper than catching them in review.
  • Reusable skills and current agent-facing documentation compound across every future session.

Context is the new craft

Modern agents have large context windows: hundreds of thousands of tokens, sometimes more.

It is tempting to assume that more context is always better. It is not.

Agents appear to have a practical “smart zone”: enough context to reason accurately, but not so much that the important signals get buried under irrelevant detail.

In our own work, we have found that beyond a certain threshold, additional context often reduces reasoning quality instead of improving it. The exact threshold varies by model, task, and codebase, but the pattern itself has been consistent: once context becomes too noisy, quality starts to drift.

At that point, outcomes depend less on how much more you can fit into the session and more on how aggressively you can remove noise.

A few things we have learned the hard way:

  • Signal-to-noise matters more than raw size. An agent given a focused 20K-token brief often outperforms one given a 200K-token dump of vaguely related files. The agent has to allocate attention; padding the context with irrelevant material dilutes the parts that matter.
  • Start fresh when the topic shifts. When a conversation pivots from designing a feature to debugging an unrelated test, it is usually faster to begin a new session than to carry a thick history into the new task. Stale context is a real cost.
  • Push exploration into side investigations. When you need to answer a “where is X” or “how does Y work” question, isolate that discovery work and bring only the summarized findings back into the main session.
  • Plan before you implement. Draft an explicit plan before code gets written: what files change, what the order of operations is, and what success looks like. The plan is short, reviewable, and keeps implementation focused.

But a plan is only useful if it is honest, and the first draft of a plan is almost never honest. It is full of assumptions you made without noticing, branches of the decision tree you skipped past, and terminology that quietly conflicts with the rest of the codebase.

So before we commit to a plan, we put it through a grilling step: an interactive session where the agent interviews us about the design. It pulls implicit decisions into the open, walks the decision tree branch by branch, compares proposed terms against the project’s existing language, and asks the awkward “what about X?” questions we would otherwise discover three days into implementation.

A grilling session that ends with five revised paragraphs may have saved a day of rework. It forces implicit decisions to become explicit before code is written, when they are cheapest to change.

Credit where it is due: the idea of grilling a plan before implementation is not ours. We picked it up from Matt Pocock’s post on AI Hero, where he makes the case for an interview-style step that surfaces hidden assumptions before code is written. Our grill-me skill is one implementation of that idea, and grill-with-docs extends it by checking the plan against the project’s domain glossary and prior decisions.

The common thread is context curation. Done well, it leaves the developer free to focus on judgment calls: design choices, tradeoffs, edge cases, and dangerous abstractions.

Our concrete workflow

Our workflow is simple enough to fit on one page, but strict enough to catch most avoidable failures:

  1. Write the spec. Turn product intent into concrete behavior, source data, edge cases, non-goals, and acceptance criteria.
  1. Grill the spec. Use the agent to challenge assumptions, missing branches, unclear terminology, and domain conflicts before implementation starts.
  1. Draft the plan. Ask for the files to change, the order of operations, the tests to add or update, and the definition of done.
  1. Review the plan. The human accepts, edits, or rejects the plan. This is the cheapest moment to catch the wrong approach.
  1. Implement in bounded steps. Let the agent work phase by phase rather than running open-ended across the whole codebase.
  1. Run verification gates. Require objective checks before treating the work as complete.
  1. Review the diff critically. Run a structured agent review, then compare the code to the original intent and approved plan — not just to passing tests.
  1. Feed lessons back. When a mistake recurs, update docs, tests, or a reusable skill so the team does not pay the same tax again.

Verification gates

We do not treat agent output as done when it looks right.

We treat it as done when it passes the same gates as human-written code, plus one extra gate: a check against the original intent.

  • Typecheck and compile checks.
  • Linting and formatting.
  • Unit tests for new and changed behavior.
  • Integration or end-to-end tests when behavior crosses boundaries.
  • Visual or snapshot tests where UI behavior matters.
  • Manual acceptance against the original spec.
  • Structured AI review and human diff review against the approved plan and project conventions.

What goes wrong

Agents fail in specific, recognizable ways. Knowing the shapes is most of what keeps you out of trouble.

Failure mode Symptom Countermeasure
Inventions that look plausible The agent calls a function, flag, package, or API that does not exist because something similar exists in other projects it has seen. Ground the agent in real files. Require typecheck, tests, and diff review before accepting the output.
Confident misreads of conventions The agent applies one local pattern everywhere, including places where the team made a deliberate exception. Keep agent documentation current. Explain exceptions and the reasons behind them, not just the default pattern.
Plausible-but-adjacent solutions The code solves a nearby problem or fixes the symptom in a layer that creates a regression elsewhere. Restate success criteria before implementation. Require the plan to name where the fix belongs and why.
Context drift over long sessions The agent reintroduces assumptions that were corrected earlier or forgets what the task is now about. Start a fresh session when the topic changes. Summarize only durable decisions into the next context.
Tool-call rabbit holes The agent spends many steps exploring a question that did not need deep investigation. Time-box research. Use focused side investigations for discovery. Bring only summaries back into the main session.

None of these are reasons to put the tools down.

They are reasons to build the surrounding scaffolding: clear specs, current docs, plan-before-implement, fresh sessions when the topic shifts, and verification gates that catch mistakes before they become review debt.

Skills: small specialists beat one big prompt

Some tools call these skills, others call them commands, workflows, playbooks, agents, or custom instructions.

The name matters less than the pattern: package a recurring process as a reusable specialist with its own instructions and scope.

Over the past year we have built up a working set that does most of the heavy lifting:

  • A planning skill that walks through research, options, and tradeoffs before settling on an approach.
  • An implementation skill that turns a finished plan into working code with verification gates between phases.
  • A plan-grilling skill, adapted from the workflow above, that stress-tests a proposed design against the project’s domain glossary and prior decisions.
  • A PR review skill that runs as an independent pass, often with a different model, and produces a structured review against the target branch.
  • A review-response skill that works through reviewer comments systematically: apply clear fixes, surface human decisions, and iterate until the review is resolved.
  • A project-specific design-system skill that encodes the team’s UI conventions into the agent’s defaults.

Our create-plan and implement-plan skills are built on the research-plan-implement loop described in HumanLayer’s Advanced Context Engineering for Coding Agents — particularly the argument for treating compaction as a deliberate step rather than something the tool does to you.

Our adaptation is simple: keep the current intent explicit, compact discovered context into the plan itself, and verify each implementation phase against that plan before moving on.

The why is the same in every case: each skill captures a workflow worth locking in.

Our rule of three is simple: the third time you retype the same setup, turn it into a skill. Below that, you are abstracting too early. Above it, you are paying the same tax repeatedly.

The catch is that skills are not free. They rot.

When a project’s conventions shift, an out-of-date skill is worse than no skill at all because it confidently applies a pattern nobody on the team uses anymore.

We treat the skills directory like any other piece of code: reviewed, updated, and occasionally retired.

The documentation imperative

An agent only knows what you tell it, plus what it can inspect.

A good CLAUDE.md, AGENTS.md, agents.md, or whatever your tool calls it, works like a senior engineer’s onboarding guide:

  • Conventions: naming, file layout, commit format, and what “good” looks like in this codebase.
  • Constraints: things we tried and rejected, patterns we are moving away from, and the reasons behind decisions that are not obvious from the code.
  • Glossary: a domain dictionary so the agent uses the same language as the team.
  • Pointers: links to deeper docs for testing, styling, authentication, deployment, and architecture decisions.

We layer it: a root document for broad strokes, per-directory documents for local nuance, and architecture decision records for calls worth preserving.

To make the contrast concrete: “we use TypeScript and React” is wallpaper.

A useful agent document names specifics: which date library to use, which folders hold what, what the commit format looks like, and which patterns the team has moved away from.

The unlock is that this work compounds.

A week spent improving documentation pays back across every future session, on every future task, for every developer — human or otherwise — who touches the project.

Where the leverage lives

The pattern across all of this — context curation, workflow loops, verification gates, reusable skills, documentation — is the same:

Invest in the scaffolding around the agent, not in clever prompts.

The agent is fast at the keyboard. The leverage is everywhere else. A subscription is rarely the hard part. Review culture, planning rituals, written conventions, onboarding quality, and verification infrastructure are.

Build those and the agent earns its keep. Skip them and you ship more code than you can safely absorb.

And yes — an agent helped draft this blog post and the image. A human grilled it.

← Read the companion post, AI Didn’t Replace Developers. It Moved the Work., for the broader argument about how developer judgment has shifted upstream into specification and downstream into review.

About the author
Jani Rapo

With over a decade of experience in software development and architecture, Jani is a Senior Software Architect at Brightly who excels in crafting pragmatic digital solutions. His current work centers on the changing craft of software development in the AI era: writing precise specifications, orchestrating AI coding agents, and the documentation and review practices that make it all hold together.

How can
we help you?
Are you looking for data driven digital solutions that add business value? Our senior technical experts help you build just the right solutions for your unique challenges and operational environment.