Why does a prototype built with AI work in a demonstration but not in production?

A demonstration exercises only the happy path, on little data and a single user. Production adds concurrency, real volumes, error handling, security and integrations, that is, everything that was not the feature and that the agent did not produce.

What is the technical debt of AI-augmented development?

It is debt taken on without anyone deciding to, because the generated code reads well and passes the demonstration. Implicit coupling, error cases never written, a data model cut for the demo: it stays invisible while the prototype runs, and costs the most to repair once the system is live.

Can AI handle the architecture of a business application?

An agent optimises the local task it is given, not the global coherence of the system. Architecture, invariants and cross-cutting policies such as security or data consistency are decided at the scale of the whole system, which does not fit in a context window.

Is AI-generated code secure by default?

No. An agent reproduces patterns from a training distribution that contains a great deal of insecure code, and security is a cross-cutting property poorly served by autonomous generation. The typical results are queries open to injection, partial validation, secrets in the code and dependencies left unreviewed. The flaws do not appear in a demonstration, but when the running system is probed.

Should we stop using AI to develop software?

No. An agent is excellent at local, well-scoped tasks, which are a large share of the work. The question is not “AI or engineers”, but the division of labour: the agent executes quickly, while a human holds the overall view, the architecture and the cross-cutting decisions.

Why does an internal AI project get stuck at scaling?

The prototype works for one user and little data. Scaling demands architectural decisions taken early, such as indexes, caching, back-pressure or concurrency handling, against a load profile the agent never saw. These decisions are hard to retrofit after the fact.

Your AI prototype works. Keeping it running in production is another craft.

When does an outside engineering perspective become indispensable on an AI-augmented project?

When internal velocity exists (a team, an agent, speed) but the systemic counterweight is missing: no architecture review, no audit of the debt being taken on, no hardening of security and load, no production discipline. It is at that boundary that an outside engineering perspective earns its place.

A director often describes the same scene to us. A prototype built in a few days with a coding agent, a demonstration that impresses, the lingering feeling that the hard part is behind them. Then three months pass, and the application still will not go into production.

The instinct is to read this as a failure of the tool, the team or the project. It rarely is. This gap between “it works in a demonstration” and “it holds in production” is the visible edge of a distinction that AI did not create, but has made far sharper: writing code and engineering a system are not the same work. The first has become fast and cheap. The second remains a craft.

This article looks at where that line falls, and why: what the autonomy of AI does not cover, the technical debt that builds up unnoticed, the points where an AI-augmented project actually breaks once in production, and the division of labour that does hold.

What the agent genuinely does well

Let us start with what works, because it is considerable. On a local, well-scoped task, a coding agent is excellent: producing a function from a clear specification, scaffolding a record-management interface, writing a script, translating a snippet from one language to another, delivering a plausible first implementation of a known pattern. On these tasks it is faster than a human and often just as good. This is not a concession made grudgingly: it is a large part of a developer’s daily work, and it is precisely why these tools have spread so quickly.

What these tasks have in common is that they are local. They are bounded, immediate, and the context that matters fits in front of the agent: the specification is in the prompt, the file is in the window, success can be checked on the spot by running the result. A system is the opposite. Its correctness is spread across many files, many decisions and a long span of time; a large part of what makes it correct is written down nowhere; and “does it work” cannot be checked by running the happy path once. The agent is built to optimise the task it is given. Yet a system is exactly what no single task expresses: it is the sum of the constraints, invariants and decisions that no individual task contains.

That is where the line falls. Not “the agent is good or bad”, but “the agent operates on the local, and a production system is systemic”. Everything that follows comes from that gap.

Diagram: above the line, the feature, the only layer visible in a demonstration and produced quickly by the agent; below it, the systemic layers (error handling, security, scaling, integrations, architecture, long-term consistency) that hold the system up in production.

What autonomy does not cover

The first effect of that gap concerns reasoning at the scale of the system. The agent optimises locally; it does not hold the whole set of the application’s invariants and couplings in view. Individually sound decisions end up composing a global incoherence: logic duplicated in two places, an abstraction that leaks, contracts that do not agree from one module to the next. Each piece would pass a review; it is the assembly that does not hold together.

Next come the cross-cutting concerns, everything that is not lodged in a single file. The error-handling strategy, transaction boundaries, the authorisation policy, observability, data consistency: these are decisions that run across the entire codebase and must be held coherently, in a single act of design. An agent serving one task after another produces them piecemeal: slightly different error handling here, an authorisation check forgotten there, because no task taken in isolation is “the security policy” or “the consistency model”.
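To make the difference between piecemeal checks and a single policy concrete, here is a minimal sketch in Python. The permission names, the roles and the `requires` decorator are all hypothetical, chosen for illustration; the point is only that every handler defers to one policy table instead of carrying its own check.

```python
from functools import wraps

# Hypothetical policy table: the one place where "who may do what" is decided.
POLICY = {
    "invoice:read": {"clerk", "manager"},
    "invoice:approve": {"manager"},
}


def requires(permission):
    """Guard a handler with the central policy instead of an ad-hoc check."""
    if permission not in POLICY:
        # Fails when the handler is defined if it names an unknown permission.
        raise KeyError(f"unknown permission: {permission}")

    def decorator(handler):
        @wraps(handler)
        def guarded(user_role, *args, **kwargs):
            if user_role not in POLICY[permission]:
                raise PermissionError(f"{user_role} may not {permission}")
            return handler(user_role, *args, **kwargs)
        return guarded
    return decorator


@requires("invoice:read")
def read_invoice(user_role, invoice_id):
    return f"invoice {invoice_id}"


@requires("invoice:approve")
def approve_invoice(user_role, invoice_id):
    return f"invoice {invoice_id} approved"
```

Under these assumptions, a handler written without a guard is easy to spot in review, and changing who may approve an invoice is a one-line change to the table, not a search through every endpoint.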

Then there is implicit context. A large part of what makes a business system correct is unwritten: the invariants of the domain, regulatory constraints, the reasons behind an earlier choice, the contracts with the systems the application talks to. The agent has only what is in its context. It cannot infer what was never stated, and in a real organisation a considerable amount is never stated.

Finally there is consistency over time. A prototype is a snapshot; a system lives and changes. The shortcuts that are harmless for a demonstration (no schema-migration strategy, configuration hard-coded, no versioning) become liabilities at the very moment the thing has to evolve under real use. The agent optimises the present task, not the system’s ability to transform over months.
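As an illustration of what a schema-migration strategy can look like at its smallest, the sketch below uses Python's standard `sqlite3` module and SQLite's `PRAGMA user_version` to record which migrations a database has already received. The table, the columns and the `MIGRATIONS` list are hypothetical, and a real project would more likely rely on a dedicated migration tool.

```python
import sqlite3

# Hypothetical ordered migrations; position in the list (from 1) is the version.
MIGRATIONS = [
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE customer ADD COLUMN email TEXT",
]


def migrate(conn):
    """Apply every migration the database has not seen yet; return the version."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version, statement in enumerate(MIGRATIONS, start=1):
        if version <= current:
            continue  # already applied on an earlier run
        conn.execute(statement)
        # PRAGMA does not accept bound parameters; version is an int we control.
        conn.execute(f"PRAGMA user_version = {version}")
        conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]
```

Running `migrate` twice on the same database leaves it unchanged the second time, which is the property a demo-era schema typically lacks: a recorded, repeatable path from one shape of the data to the next.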

The decisive point is that none of this is a maturity gap that the next version will close. It is a mismatch between the shape of the work an agent does well (local, bounded and immediate) and the shape of a production system (distributed, implicit and durable). A more capable agent does the local better; it does not turn the systemic into the local. The limit is on autonomy: on producing a system without a human holding the overall view. Put a competent architect in the loop, and the current generation of agents becomes a remarkable accelerator. Take the architect away, and the gap reappears, whatever the version.

The debt that goes unnoticed

Technical debt is an old idea: you borrow against future maintainability in order to ship now. Normally it is a conscious trade-off; you know you are cutting a corner, and you note it.

The debt of AI-augmented development has a particular character: it is produced quickly and plausibly. The code reads well, runs, passes the demonstration. The corner is therefore cut without anyone having decided to cut it. When a human writes every line, part of the friction is felt as it is written: the developer senses the coupling, the missing test, the shortcut, and at least records it as a debt. The agent removes that friction. That is its whole point, and it is also what removes the signal. The debt is taken on silently. The speed that is so valuable is exactly what hides its cost.

What accumulates is concrete: implicit coupling between parts that ought to be independent, error paths and edge cases never written because the demonstration exercised only the happy path, patterns that differ from one corner of the codebase to another, a data model cut for the shape of the demo and not for what production will hold, assumptions and configuration frozen into the code. None of it is visible while the prototype runs.

And here is the asymmetry that matters. Repairing a system whose debt has compounded costs far more than having built it from the start with the systemic view. It means untangling coupling no one documented, reconstructing invariants no one wrote down and retrofitting tests onto code never meant to be testable, all on a system now live, which cannot be broken while it is repaired. The speed that felt free at the prototype stage was a loan; the bill falls due in production, under load, with real data and real users, at the worst possible moment to discover it.

Where it breaks: production, security, scaling

Going into production is the first wall. The prototype runs on the developer’s machine, on the happy path. Production demands everything that is not the feature: error handling and graceful degradation, logs and observability so a problem can be diagnosed, configuration and secrets managed outside the code, repeatable deployment, the ability to roll back. The agent built the feature; this hardening layer is largely absent, because none of it was the task.
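One piece of that hardening layer, configuration and secrets managed outside the code, can be sketched in a few lines of Python. The variable names (`APP_DATABASE_URL`, `APP_API_KEY`, `APP_DEBUG`) and the `Settings` class are assumptions made for the example, not a convention the article prescribes.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    api_key: str
    debug: bool


def load_settings(env=os.environ):
    """Build settings from the environment and fail fast on missing values."""
    # APP_DATABASE_URL and APP_API_KEY are hypothetical variable names.
    required = ("APP_DATABASE_URL", "APP_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing configuration: " + ", ".join(missing))
    return Settings(
        database_url=env["APP_DATABASE_URL"],
        api_key=env["APP_API_KEY"],
        debug=env.get("APP_DEBUG", "false").lower() == "true",
    )
```

The design choice worth noting is the early failure: a deployment with a missing secret stops at start-up with a readable message, instead of failing later on the first request that needs it.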

Security deserves a pause. Generated code frequently ships plausible and vulnerable patterns: query construction open to injection, partial or absent input validation, authorisation enforced endpoint by endpoint instead of as a single policy, secrets left in the code, dependencies pulled in without anyone examining what they carry. Two causes converge: an agent reproduces patterns from a training distribution that contains a great deal of insecure code, and security is a cross-cutting property, exactly the kind the previous section showed to be under-served by autonomy. Above all, the flaws do not appear in the demonstration; they appear when someone probes the running system. The reference catalogues of common application vulnerabilities read, point by point, like the list of what a quick prototype leaves out.
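The first pattern in that list, query construction open to injection, is easy to show side by side with its fix. The sketch below uses Python's standard `sqlite3` module on an in-memory database; the `users` table and the function names are hypothetical.

```python
import sqlite3


def find_user_unsafe(conn, name):
    # Vulnerable: the input is pasted into the SQL text and can rewrite the query.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # Bound parameter: the driver passes the value as data, never as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


def demo_connection():
    """Create a throwaway in-memory database with two users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn
```

Called with the input `x' OR '1'='1`, the unsafe version returns every row in the table, while the safe version looks for a user literally named that and returns nothing. Both versions behave identically on the inputs a demonstration would use, which is why the flaw survives it.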

Scaling is the next wall. The prototype works for one user and a handful of records. Production brings concurrency, real data volumes, queries that were faultless on ten rows and collapse on ten million, missing indexes, unbounded memory, no caching or back-pressure strategy. These are architectural decisions taken, or left unmade, early, against a load profile the agent never saw.
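The missing-index case can be observed without any load at all, by asking the database how it intends to run a query. The sketch below uses Python's `sqlite3` module and SQLite's `EXPLAIN QUERY PLAN`; the `orders` table and the index name are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's plan description for a statement, as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 is the detail text


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
lookup = "SELECT * FROM orders WHERE customer_id = ?"

# Without an index, the plan is a scan of the whole table.
plan_before = query_plan(conn, lookup, (42,))

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# With the index, the plan becomes a search through idx_orders_customer.
plan_after = query_plan(conn, lookup, (42,))
```

On ten rows the two plans are indistinguishable in practice; the scan only becomes a problem at production volumes, which is why checking the plan is more informative than timing the prototype.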

Then there are the integrations. A real business application lives in an ecosystem: it talks to payment, to identity, to legacy data. The agent cleanly handles a call to a well-behaved interface. The less obliging reality (a flaky third party, partial failures, idempotency, contracts that drift) is systemic, and that is where integrations quietly give way.
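A partial failure combined with idempotency can be sketched as follows. `FlakyGateway` is a hypothetical stand-in for a third party, not a real payment API: its first call records the charge and then loses the response, which is the case a naive retry turns into a double charge.

```python
import time


class FlakyGateway:
    """Stand-in for a third party: the first call succeeds but the reply is lost."""

    def __init__(self):
        self.charges = {}  # idempotency key -> amount actually charged
        self.calls = 0

    def charge(self, key, amount):
        self.calls += 1
        # A repeated key is recognised and not charged a second time.
        self.charges.setdefault(key, amount)
        if self.calls == 1:
            raise TimeoutError("response lost after the charge was recorded")
        return {"key": key, "amount": self.charges[key]}


def charge_with_retry(gateway, key, amount, attempts=3, base_delay=0.01):
    """Retry on timeout with exponential backoff, reusing the same key."""
    for attempt in range(attempts):
        try:
            return gateway.charge(key, amount)
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            time.sleep(base_delay * 2 ** attempt)
```

With a call such as `charge_with_retry(FlakyGateway(), "order-1001", 25)`, the gateway is contacted twice but the charge is recorded once, because the retry reuses the same key. The sketch assumes the third party supports idempotency keys at all; whether it does is exactly the kind of contract detail that has to be checked outside the code.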

Each of these points is invisible in the demonstration and surfaces in production. That is the precise mechanism by which a project works, then stalls: the demonstration exercises the local, production exercises the systemic.

The division of labour that holds

The question has never been “AI or engineers”. The agent is genuinely good at the local, and the local is a large share of the work. Removing it would be absurd; keeping it is right.

What the systemic layer adds is a different kind of work, not more of the same. Holding the architecture: the overall view, the invariants, the path by which the system will evolve. Setting the cross-cutting policies (security, error handling, data consistency) once and coherently. Ensuring production hardening and consistency over time. This is the work that does not fit in a context window, because it lives at the scale of the whole system and of the long term. It is not the defence of a headcount; it is a category of work that the autonomous agent, structurally, does not do.

The role of the human shifts accordingly: from writing every line, now largely assisted, to holding the overall view and the cross-cutting decisions, and using the agent to execute quickly within that frame. The same tool, opposite outcomes, depending on whether someone holds the frame or not.

The projects that get stuck are almost always those where internal velocity exists (a team, an agent, speed), but where the systemic counterweight is missing: no architecture review, no audit of the debt being taken on, no hardening of security and load, no production discipline. It is at that boundary, when velocity is present but the systemic layer absent, that an outside engineering perspective earns its place.

The gap from prototype to production, then, is not the failure it appears to be. It is the visible edge of a distinction that has always existed, between writing code and engineering a system. The agent made the first cheap. In doing so it made the second more conspicuous, and, if anything, more valuable. The value did not disappear when code became easy to produce. It moved: from producing the code to keeping the system standing.