Harness Engineering

Practical Patterns for Agent-First Software Development

A deep dive into Ryan Lopopolo’s talk on Harness Engineering, exploring practical patterns for agent-first software development.
odsc
Author

Oren Bochman

Published

Tuesday, April 28, 2026

Modified

Monday, May 18, 2026

Keywords

Harness Engineering, Agent-First Software Development, Coding Agents, Software Engineering

Notes
  • Core thesis: Modern coding agents can produce substantial software end-to-end, but their effectiveness depends less on raw model capability and more on the surrounding engineering harness: context, feedback loops, tests, documentation, review workflows, and operational constraints.

  • Speaker’s experiment:

    • Over roughly six months in 2025, the speaker’s team used coding agents to build an internal productivity agent.
    • The product reached beta with about 200 internal users.
    • The team aimed to have agents produce essentially all code, with minimal direct human editing.
    • Human engineers shifted from writing code to designing, staffing, guiding, and validating the “code factory.”
  • Why traditional software workflows change:

    • If code generation becomes cheap, teams no longer need to optimize primarily around human coding throughput.
    • Human attention becomes the scarce resource.
    • Agent context and attention are also scarce: broader tasks dilute agent performance.
    • Work should therefore be decomposed into tightly scoped agent sessions.
  • Main bottlenecks:

    • Human attention: humans should not repeatedly provide the same synchronous feedback.
    • Model context: agents need the right information at the right time, not every possible instruction upfront.
    • Model attention: the more unrelated context an agent sees, the less reliable it becomes.
  • Harness engineering:

    • Harness engineering means designing systems that deliver the right context to agents at the right stage of work.
    • Code review comments are treated as evidence of missing context: if a human repeatedly comments on an issue, that requirement should be encoded into the harness.
    • The goal is to move feedback earlier in the process so the agent avoids predictable mistakes before review.
  • Context as the central mechanism:

    • Agents receive context from prompts, agents.md, repository structure, documentation, tests, linters, tool outputs, and reviews.
    • Non-functional requirements should be written down explicitly: reliability, performance, typography, architecture, testing expectations, security, and code style.
    • Since agents start “fresh” on every task, they do not accumulate tacit team knowledge the way humans do through onboarding and repeated review.
  • The codebase is part of the prompt:

    • Agents inspect nearby files and imitate local patterns.
    • Homogeneous repository structure helps agents generalize from one file or module to another.
    • Good existing code improves future agent output because it becomes useful in-context evidence.
  • Use agents.md as a map, not a manual:

    • A short agents.md with pointers to deeper documentation worked better than a huge file containing everything.
    • The speaker contrasts a roughly 300-line map-like file with a 3,000-line overloaded file.
    • The agent should be told where to look depending on task type, such as frontend architecture, numerical analysis, reliability, or performance.
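The "map, not manual" idea can be sketched as a small lookup from task type to deeper documents. This is a minimal illustration, not the speaker's implementation: the `DOC_MAP` table, the document paths, and the `docs_for_task` helper are all hypothetical.

```python
# Hypothetical map from task type to deeper documents.
# The paths are illustrative; they do not come from the talk.
DOC_MAP = {
    "frontend": ["docs/frontend-architecture.md"],
    "reliability": ["docs/reliability.md"],
    "performance": ["docs/performance.md"],
}

# The short map file is always in context; deep documents are loaded on demand.
ALWAYS_LOADED = ["agents.md"]


def docs_for_task(task_types):
    """Return the always-loaded map plus only the documents this task needs."""
    selected = list(ALWAYS_LOADED)
    for task_type in task_types:
        for path in DOC_MAP.get(task_type, []):
            if path not in selected:
                selected.append(path)
    return selected
```

A frontend task would then see `agents.md` plus one architecture document, rather than every guideline the team has written.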
  • Front-of-process techniques:

    • Provide compact, discoverable documentation for agent personas and engineering expectations.
    • Maintain documents such as:
      • how to write reliable production code,
      • how to write performant TypeScript,
      • how to structure frontend architecture,
      • how to use internal libraries,
      • how to validate user journeys.
    • Encode common operational lessons, such as requiring network calls to have timeouts and retries.
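As an illustration of the timeout-and-retry rule, a reliability document might include a reference snippet like the one below. It is a sketch under stated assumptions: the function name, the default values, and the exponential backoff policy are illustrative choices, not details from the talk. It also retries on every `OSError`, including HTTP error responses, which a production version would likely narrow.

```python
import time
import urllib.request


def fetch_with_retries(url, timeout_s=5.0, attempts=3, backoff_s=0.5,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch a URL with an explicit timeout and a bounded number of retries.

    `opener` and `sleep` are injectable so the retry logic can be tested
    without touching the network or actually waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            # Every network call carries an explicit timeout.
            with opener(url, timeout=timeout_s) as response:
                return response.read()
        except OSError as exc:
            # urllib.error.URLError and socket timeouts are OSError subclasses.
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff: 0.5s, 1s, 2s, ... with the defaults.
                sleep(backoff_s * (2 ** attempt))
    raise last_error
```

Keeping a snippet like this next to the written rule gives the agent a concrete local pattern to imitate.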
  • Middle-of-process techniques:

    • Use fast tests, builds, and linters so agents can hill-climb toward correct solutions.
    • Linters should check not only syntax but also repository structure, package boundaries, configuration consistency, and architectural rules.
    • High-quality linter failure messages should explain the remediation in human-readable prose.
    • These tool outputs become just-in-time prompts for the agent.
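A structural lint of this kind can be sketched with Python's standard `ast` module. The layering table `ALLOWED_IMPORTS`, the package names, and the message wording are hypothetical. The speaker's team worked in a TypeScript monorepo, so this shows the shape of the technique rather than their tooling.

```python
import ast

# Hypothetical layering rule: which top-level packages each package may import.
ALLOWED_IMPORTS = {
    "ui": {"ui", "domain"},
    "domain": {"domain"},
}


def check_imports(package, source, rules=ALLOWED_IMPORTS):
    """Return remediation messages for imports that cross a package boundary."""
    allowed = rules[package]
    messages = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            top = name.split(".")[0]
            # Only packages governed by the rules table are checked.
            if top in rules and top not in allowed:
                messages.append(
                    f"line {node.lineno}: package '{package}' must not import '{top}'. "
                    f"Allowed packages: {sorted(allowed)}. "
                    f"Fix: move the shared code into an allowed package, "
                    f"or define an interface inside '{package}' and depend on that."
                )
    return messages
```

The failure message states the rule, the allowed alternatives, and a fix, so an agent that reads it in its tool output has what it needs to correct the change without a human comment.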
  • Repository architecture for agents:

    • The team used many small package boundaries in a monorepo to help agents reason locally.
    • The speaker mentions around 500 local npm packages.
    • The architecture was intentionally over-partitioned, even without deploying true microservices.
    • This allowed agents and humans to restrict the relevant context for a change.
  • Static constraints and schema discipline:

    • The team used tools such as Zod and Pydantic to validate data at boundaries.
    • They tried to eliminate unknown, untyped, or overly loose internal data structures.
    • This reduced the agents' tendency to generate redundant validation or dead code deep inside the system.
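The boundary-validation idea can be shown without any third-party library. The sketch below swaps in a plain standard-library dataclass where the team would use Zod or Pydantic; the `Ticket` type, its fields, and the priority labels are hypothetical.

```python
from dataclasses import dataclass

VALID_PRIORITIES = {"P0", "P1", "P2", "P3"}


@dataclass(frozen=True)
class Ticket:
    id: int
    title: str
    priority: str


def parse_ticket(raw):
    """Validate untrusted input once, at the system boundary."""
    if not isinstance(raw, dict):
        raise ValueError("ticket must be an object")
    ticket_id = raw.get("id")
    title = raw.get("title")
    priority = raw.get("priority")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(ticket_id, bool) or not isinstance(ticket_id, int):
        raise ValueError("id must be an integer")
    if not isinstance(title, str) or not title.strip():
        raise ValueError("title must be a non-empty string")
    if not isinstance(priority, str) or priority not in VALID_PRIORITIES:
        raise ValueError(f"priority must be one of {sorted(VALID_PRIORITIES)}")
    return Ticket(ticket_id, title.strip(), priority)


def is_urgent(ticket: Ticket) -> bool:
    # Internal code trusts the validated type and does not re-check fields.
    return ticket.priority in {"P0", "P1"}
```

Because `is_urgent` only ever receives a `Ticket`, neither a human nor an agent has a reason to add defensive checks inside it.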
  • End-of-process techniques:

    • Treat agents like teammates rather than tools that must be watched continuously.
    • Require a proof of work from agents:
      • tests run,
      • quality assurance plan,
      • logs inspected,
      • screenshots or videos,
      • evidence that the ticket requirements were met.
    • Agents were taught to attach media and validation artifacts to pull requests.
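A proof-of-work requirement can be checked mechanically before a human looks at the pull request. This is a minimal sketch: the evidence keys and the shape of `pr_evidence` are assumptions chosen to mirror the list above, not a format described in the talk.

```python
# Hypothetical evidence keys, one per item in the proof-of-work list.
REQUIRED_EVIDENCE = (
    "tests_run",
    "qa_plan",
    "logs_inspected",
    "media",             # screenshots or videos
    "requirements_met",  # evidence that the ticket requirements were met
)


def missing_evidence(pr_evidence):
    """Return the required evidence items that are absent or empty."""
    return [key for key in REQUIRED_EVIDENCE if not pr_evidence.get(key)]
```

An empty result means the pull request carries every artifact; otherwise the returned list can be sent back to the agent as its next instruction.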
  • Review agents:

    • The team distilled patterns from hundreds of human-reviewed pull requests.
    • Reviewer agents were created to catch common mistakes before the human reviewer.
    • Review personas included reliability, security, performance, frontend architecture, modularity, and quality assurance.
    • These reviewers used the same guardrail documents given to implementation agents.
  • Important reviewer-agent design choice:

    • Review agents must be biased toward merging, not endlessly blocking.
    • Without that bias, reviewer agents can continuously “heckle” implementation agents and prevent convergence.
    • The review job was framed as identifying sufficiently important issues, such as P2-and-above concerns.
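The merge bias can be expressed as a severity threshold on reviewer findings. The sketch below assumes that P0 is the most severe level and that "P2-and-above" means P0 through P2. The `findings` format and the `review_verdict` name are hypothetical.

```python
# Assumption: lower number means higher severity (P0 is the most severe).
SEVERITY_RANK = {"P0": 0, "P1": 1, "P2": 2, "P3": 3, "P4": 4}


def review_verdict(findings, threshold="P2"):
    """Approve unless at least one finding is at or above the threshold severity."""
    limit = SEVERITY_RANK[threshold]
    blocking = [f for f in findings if SEVERITY_RANK[f["severity"]] <= limit]
    verdict = "request_changes" if blocking else "approve"
    return verdict, blocking
```

Findings below the threshold can still be reported as non-blocking notes, which lets the implementation agent converge instead of looping on minor style remarks.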
  • Shift-left feedback loop:

    • The speaker’s recurring theme is to move review feedback, production feedback, and operational feedback earlier into the agent workflow.
    • Repeated human interventions should become documentation, skills, tests, or linters.
    • Every human correction is a candidate for automation or for inclusion in the agent's context.
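One way to find candidates for the harness is to count how often the same kind of review comment recurs. The sketch assumes each comment has already been tagged with a `topic` (by a person or a classifier), and the threshold of three repeats is arbitrary. Both are illustrative choices.

```python
from collections import Counter


def harness_candidates(review_comments, min_repeats=3):
    """Return topics that reviewers have raised at least `min_repeats` times."""
    counts = Counter(comment["topic"] for comment in review_comments)
    # most_common() orders topics from most to least frequent.
    return [topic for topic, n in counts.most_common() if n >= min_repeats]
```

Each returned topic is a prompt to write a document, test, or lint rule so that the same comment does not need to be made again.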
  • Team productivity effect:

    • The team moved from roughly 3.5 pull requests per day to 5–10 pull requests per day.
    • The limiting factor became how much parallel agent work the team could schedule, not how much code humans could personally write.
    • Each new engineer improved the shared harness by adding their own view of “what good looks like.”
  • Handling production failures:

    • The team initially lacked normal production observability because humans were not directly writing or operating the code.
    • They used agents to build observability tooling as code.
    • Dashboards and alerts were defined in JSON and YAML.
    • Agents could inspect metrics, logs, and dashboards, then propose missing instrumentation or alerting.
    • The speaker notes that they still used humans for release branches and smoke testing; continuous deployment was not fully automated.
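Because dashboards and alerts live in files, a script or an agent can compare them and flag gaps. The sketch below uses JSON only and assumes a hypothetical schema in which dashboards list `panels` and alert files list `alerts`, each entry naming a `metric`. The team's real formats are not described in the talk.

```python
import json


def metrics_without_alerts(dashboard_json, alerts_json):
    """Return charted metrics that have no alert defined, sorted by name."""
    dashboard = json.loads(dashboard_json)
    alerts = json.loads(alerts_json)
    charted = {panel["metric"] for panel in dashboard["panels"]}
    alerted = {alert["metric"] for alert in alerts["alerts"]}
    return sorted(charted - alerted)
```

The output is a concrete list of missing alerts that an agent could turn into a proposed change to the alert definitions.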
  • Agent personas in practice:

    • Personas were coarse-grained and horizontal across the software development lifecycle.

    • Examples:

      • reliability reviewer,
      • security reviewer,
      • performance reviewer,
      • frontend architecture reviewer,
      • package layering reviewer,
      • quality assurance reviewer.
    • Each persona was given a small set of relevant documents and asked to evaluate a diff from that role.

  • Key takeaway:

    • Code is becoming cheap, but attention, context, and validation remain expensive.
    • The highest-leverage work is to build harnesses that encode team judgment into documents, tests, linters, review agents, and fast feedback loops.
    • The role of the human engineer shifts from producing code to designing systems that let agents produce acceptable code reliably.

Citation

BibTeX citation:
@online{bochman2026,
  author = {Bochman, Oren},
  title = {Harness {Engineering}},
  date = {2026-04-28},
  url = {https://orenbochman.github.io/posts/2026/04-28-ODSC-AI-2026-Day-1/talk7.html},
  langid = {en}
}
For attribution, please cite this work as:
Bochman, Oren. 2026. “Harness Engineering.” April 28. https://orenbochman.github.io/posts/2026/04-28-ODSC-AI-2026-Day-1/talk7.html.