The Dark Software Factory

Kevin Smith

Dootrix CTO

8 min read • 18 March 2026

🔗 Originally published on LinkedIn

Imagine a software company where no developer writes code. No one reviews pull requests. No one merges commits. Instead, features move from specification to deployment inside a fully automated development system that generates the implementation, validates it against behavioral tests, and ships it to production. That idea has a name: the dark software factory.


For most of software's short history, we have borrowed heavily from the language of big industry. We talk about pipelines, build servers, production, shipping, and engineering at scale as though software were rolling off an assembly line. But the real work is deeply human. Behind all the industrial metaphors sit people: writing and reviewing code, debating design choices, pushing changes, and trusting that the tests they write will catch any mistakes.

Now a more literal version of the metaphor is emerging. People are starting to describe “dark software factories”, borrowing the term from lights-out manufacturing, where robots run the plant and nobody needs to turn the lights on because nobody is inside. In the software version, the unsettling part is not “AI writes some code”. The unsettling part is the rule: no human writes the code, and no human reviews the code either. The system takes intent as input and produces deployed software as output. Humans move upstream into specification, constraints, and governance. The implementation becomes machine-only territory.

If this all sounds like hyperbole, you could be in for a shock. OpenAI have published an account of building and shipping an internal beta product with “0 lines of manually-written code”, with everything generated by Codex, including tests, CI config, docs, observability, and tooling. They describe the job of the engineers as designing the environment and feedback loops that make agent work reliable. “Humans steer. Agents execute.”

In parallel, StrongDM have been publishing material on what they explicitly call a “Software Factory”, centred on validation at scale and an idea they call a Digital Twin Universe, behavioural clones of third-party services such as Okta, Jira, Slack, and Google Workspace APIs. The point is to run huge volumes of scenario validation without touching production systems or hitting rate limits.

And Dan Shapiro has given this shift a simple ladder that keeps showing up in the conversation: levels of AI-assisted development that end at Level 5, the “dark factory”, where humans are neither needed nor welcome inside the implementation loop.

The 5 Levels of AI-Assisted Development

So what is the dark software factory, really?

It is not “an AI that writes code”. It is a control system wrapped around code generation. The software is not produced by genius prompts. It is produced by relentless feedback loops.

The core trick is understanding what humans really do and why they really do it: if you remove human review, you do not remove quality control. You replace it. Code review was never the point. Evidence was the point. We read code because it was the only practical way to build confidence that the system matched intent, that edge cases were handled, that security rules were respected, that performance would not fall off a cliff. If we want to stop reading code, we need other forms of evidence that are stronger than “this looks right”.

That requirement forces a different architecture for the engineering process itself. OpenAI call their discipline “harness engineering”, and it is a useful phrase because it frames the agent as powerful, fast, and not reliably aligned with your goals unless you literally harness it. In their write-up, the big breakthroughs are not model cleverness. They are structural. They talk about making the repository a “system of record”, pushing knowledge into the repo so agents can find it, keeping a small AGENTS.md as a map rather than a rotting encyclopedia, enforcing architectural constraints mechanically with linters and structural tests, and investing in agent-visible observability so the agent can reproduce bugs and validate fixes by looking at UI state, logs, metrics, and traces.

I have touched on some of the early developments around this here and here, but OpenAI, Anthropic, and others are now trying to turn the dial up to eleven.

This is where the factory metaphor becomes literal. The output of a factory is not the result of one person’s judgement. It is the result of a system that constrains variability. In manufacturing, you do not rely on a worker’s “taste” for whether a bolt is tight enough. You use torque tools, gauges, calibrated processes, and inspection regimes. The dark software factory is the same instinct applied to code.

If you want a concrete definition, you could perhaps start here: a dark software factory is an automated development system where the primary quality gate is externally observable behaviour, and not human inspection of source. Code becomes closer, conceptually, to a neural network snapshot: you treat it as opaque internal structure, and you infer correctness from its behaviour under evaluation. StrongDM explicitly describe this framing as a “validation constraint”, where they required a system that could be validated automatically without semantic inspection of source.

That ripples through everything.

First, specifications become first-class artefacts. You can see this in both OpenAI’s harness approach and in the emerging “factory” approaches. The spec is no longer a vaguely-worded task or ticket. It is the thing you can run. It has to be precise enough that the machine can use it to generate work, and precise enough that you can hold the machine accountable when it fails. In practice, that means constraints that are measurable, interfaces that are explicit, and acceptance criteria that are behavioural.
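Neither OpenAI nor StrongDM publish their spec formats, but the shape of a runnable, behavioural spec can be sketched in a few lines. All names below are hypothetical; the point is only that each acceptance criterion is measurable and executable, not a vaguely-worded ticket:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    """A behavioural acceptance criterion: a name plus a runnable check."""
    name: str
    check: Callable[[dict], bool]  # takes observed behaviour, returns pass/fail

@dataclass
class Spec:
    feature: str
    criteria: list[Criterion]

    def evaluate(self, observed: dict) -> dict[str, bool]:
        # The spec is "the thing you can run": every criterion either
        # passes or fails. There is no "this looks right".
        return {c.name: c.check(observed) for c in self.criteria}

spec = Spec(
    feature="checkout",
    criteria=[
        Criterion("p95 latency under 300ms", lambda o: o["p95_ms"] < 300),
        Criterion("no PII in logs", lambda o: not o["pii_found"]),
    ],
)

results = spec.evaluate({"p95_ms": 240, "pii_found": False})
```

A spec written this way can generate work for the machine and hold the machine accountable with the same artefact.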

Second, evaluation becomes the centre of gravity. The classic failure mode of agentic coding is “reward hacking”: if an agent is asked to “make tests pass”, it will find ways to pass tests, sometimes by cheating entirely. Not because it is trying to deceive, but because it is optimising for the metric you gave it. StrongDM’s published material leans heavily on scenario-based validation and on the idea of keeping validation scenarios as a kind of holdout set, similar to machine learning practice, so the system cannot simply overfit to the visible tests. This is one of the most important conceptual shifts in the whole space. In normal engineering, tests live next to code. That is sensible when humans write both. In a factory, that arrangement becomes self-defeating, because the same generator can write the thing and the proof of the thing. The moment the model can see its own grading rubric, you should assume it will learn to please the rubric (i.e. it might cheat). Holdout evaluation is a way of forcing generalisation, not compliance.
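As a rough illustration of the holdout idea, not StrongDM’s actual implementation: partition the scenario pool deterministically, show the generator only the visible set, and run the holdout set at gate time, so passing requires generalisation rather than memorising the rubric.

```python
import hashlib

def split_scenarios(scenarios: list[str], holdout_pct: int = 30):
    """Deterministically partition validation scenarios.

    The generator is only ever shown the 'visible' set; the 'holdout'
    set is run at gate time and never enters the agent's context.
    """
    visible, holdout = [], []
    for s in scenarios:
        # Hash-based bucketing: stable across runs, no stored split file.
        bucket = int(hashlib.sha256(s.encode()).hexdigest(), 16) % 100
        (holdout if bucket < holdout_pct else visible).append(s)
    return visible, holdout

scenarios = [f"scenario-{i}" for i in range(1000)]
visible, holdout = split_scenarios(scenarios)
```

Hashing each scenario name keeps the split stable as the pool grows: adding new scenarios never shuffles existing ones between the visible and holdout sets.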

Third, simulation becomes a necessary strategic asset. Scenario validation at the scale a factory wants is hard if your system depends on third-party APIs. You cannot run thousands of realistic integration scenarios per hour against live services without cost, rate limits, and the risk of doing something spectacularly stupid to real data. StrongDM’s answer is the Digital Twin Universe. They talk about building behavioural clones of key dependencies, validating those clones against the real services until the differences stop showing up, and then using the twins to run extremely high-volume validation with determinism and replayability. This is a bit like mocking, taken to the nth degree. You can read that and think “that’s overkill”, and it probably is for most teams today. But it is also the shape of what a factory needs. It is the software equivalent of a test rig. If you are going to run a lights-out plant, you invest in instrumentation, not manual labour.

Fourth, architecture stops being a style preference and becomes a safety rail. OpenAI’s write-up puts this bluntly: strict boundaries and predictable structure are multipliers for agents. They describe rigid layering rules enforced mechanically, custom linters whose error messages are written to inject remediation instructions into agent context, and recurring “garbage collection” processes to fight entropy and drift in an agent-generated codebase. That last piece matters more than it first appears. A factory without maintenance collapses. In agentic codebases, drift is not a rare event. It is the default. Models replicate patterns that already exist. If the patterns are uneven, the unevenness spreads. The only scalable response is to encode taste as enforcement. You capture a judgement once, then you make it mechanical, then it applies everywhere, forever.
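To make “taste as enforcement” concrete, here is a minimal sketch of a structural lint in the spirit OpenAI describe, with the layering rule and module names invented for illustration. The error message is written for the agent that will read it: it carries remediation instructions, not just a verdict.

```python
import ast

# Hypothetical layering rule: modules in the "ui" layer
# must not import from the "db" layer.
FORBIDDEN = {"ui": {"db"}}

def lint_layering(layer: str, source: str) -> list[str]:
    """Return lint errors whose messages inject remediation steps."""
    errors = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imported = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            imported = [node.module or ""]
        else:
            continue
        for name in imported:
            top_level = name.split(".")[0]
            if top_level in FORBIDDEN.get(layer, set()):
                # The judgement was captured once; from here on it is
                # mechanical, and it applies everywhere, forever.
                errors.append(
                    f"line {node.lineno}: layer '{layer}' may not import "
                    f"'{top_level}'. Remediation: route the call through the "
                    f"service layer instead of importing '{top_level}' directly."
                )
    return errors

errors = lint_layering("ui", "import db\nimport services\n")
```

When this output lands in an agent’s context, the remediation sentence steers the next attempt, which is the point of writing lint messages for machine readers.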

If all of this is starting to feel like “software engineering, but more so”, that is exactly the point. The dark factory is not magic. It is a reallocation of effort from writing code to designing the system that produces code. That reallocation is why this feels so disorienting. Most developers have built careers in the loop where you read code, you reason about it, you change it, you review it. The factory says that loop is no longer the main value creation zone. The value creation zone becomes specification, constraints, evaluation, and operational feedback. Which brings us to the part that makes people nervous...

The terrifying version of the dark software factory is a world where software ships faster than anyone can understand it. Not because the code is too complex, but because nobody has looked. The system produces change, deploys change, monitors change, then produces the next change. Humans stay at thirty thousand feet, looking at dashboards and customer outcomes, while the implementation becomes a kind of machine internal. It is thrilling when it works, and horrifying when it fails.

We already have foreshocks of this. Even teams with human oversight have experienced agent-driven incidents that are fundamentally “process failures”, not “code bugs”. When an agent can run tools, and the harness is weak, the failure mode is not a broken feature. It is a destructive action executed quickly and confidently. That is why the people taking this seriously keep returning to the same theme: the harness matters more than the model.

The exciting version is a world where small teams can build and maintain serious systems because the factory produces leverage. Imagine a world where three engineers can ship what used to require thirty, not by becoming superheroes, but by turning their judgement into encoded constraints and reusable evaluation rigs. OpenAI claim an order-of-magnitude compression in time-to-build when the environment is designed properly and the system is set up to compound improvements.

Both versions can be true, depending on whether the “factory” is real or pretend.

A pretend factory is just “agents writing more code”. It is "vibe coding" with a hardhat on. It produces output, but it does not produce evidence. A real factory is obsessed with validation. It treats every failure as a missing capability in the harness, then feeds that back into the system so the next run is better. OpenAI describe this loop explicitly: when the agent struggles, the fix is not “try harder”. The fix is to identify what is missing, then make it legible and enforceable for the agent.

If you really want to understand where the frontier is, you need to stop asking “can the model write the code?” and start asking “can the system prove the code is correct enough without a human reading it?” That question forces you to think about what “correct enough” even means. In many products, the correctness you want is not logical purity. It is user outcomes, reliability thresholds, security invariants, cost boundaries, and regulatory constraints. These are behavioural. They are measurable. They can be turned into gates. This is why the factory conversation keeps circling back to scenario validation, observability, and simulation. Those are the instruments that let you measure behaviour with enough richness that you can trust it.

It also forces you to confront organisational change. In Shapiro’s model, moving up the levels is not just about using better AI. It is about changing who you are at work. At Level 3 you become a full-time reviewer. At Level 4 you become a spec-writer and orchestrator. At Level 5 you become the governor of a black box that turns specs into software. Most developers are not trained for that. Even many senior developers are not trained for that. We have whole cultures built around the idea that competence is expressed through code. The factory asks you to express competence through constraints and evaluation.

That is why the idea is both exciting and a way off.

It is exciting because it offers a path out of the oldest bottleneck in software: human implementation time. It is a sniff of a future where the marginal cost of new software approaches the marginal cost of running the factory, and where the speed limit becomes your ability to articulate intent clearly.

It feels a way off because the hard part is not generating code. The hard part is building the harness, and most teams do not yet have the evaluation muscle for their current human-built systems, let alone for autonomous ones. Many organisations cannot reliably answer whether their own software behaves as intended today. They have flaky tests, shallow integration coverage, weak observability, and tacit architectural rules enforced by tribal knowledge. A dark factory would amplify every one of those weaknesses, quickly.

So is there a practical takeaway? The practical takeaway is not “everyone should build a dark factory right now”. The practical takeaway is that the discipline is already useful long before you go fully lights-out. If you start treating validation as your primary asset, if you invest in harness-like constraints, if you make your systems legible through logs, metrics, traces, and reproducible environments, you make both humans and agents more effective. Martin Fowler’s commentary on harness engineering frames this as a mix of context engineering, architectural constraints, and ongoing “garbage collection” to fight decay. In other words, you do not have to be brave enough to turn the lights off to benefit from learning how factories work.

But it would be naive to pretend this is not an unsettling shift. The dark software factory is one of those ideas that starts as a curiosity, then becomes a competitive advantage, then becomes table stakes, and then quietly rewrites what the job is. In 2026, it is mostly still a story told by small teams and early adopters, with a handful of public accounts that are drawing attention precisely because they look slightly impossible.

The curtain is moving. Behind it is a new kind of engineering discipline, one that treats software creation as a controllable process rather than an artisanal craft. The people who figure out how to encode judgement as constraints, how to build evaluation that cannot be gamed, how to simulate reality cheaply, and how to close the loop from production feedback back into specification, will change what software organisations look like.

And that is the real meaning of “dark”. It is the absence of humans in the place where we have always assumed humans must be.


This article was originally written and published on LinkedIn by Kevin Smith, CTO and founder of Dootrix.
