Premature Compilation: Why agent developers write too much code

Most programmers building AI agents are solving the wrong problem first.

They pick a framework. LangChain, CrewAI, Mastra, AutoGen. They write Python classes, define chain abstractions, configure orchestration layers. They debug stack traces five levels deep into framework internals. They fight the framework whenever they need something slightly outside its design. Then they wonder why agent development feels so slow.

I almost did the same thing. I had Mastra’s docs open in one tab and LangChain’s getting-started guide in another. I was about to invest weeks learning an agent framework. Then I stopped and asked a question that turned out to be more important than which framework to pick: do I actually need one?

The answer, after building a complete multi-step agent pipeline without any framework, was no. And the reason is more interesting than “frameworks are bloated” — agent development is a fundamentally different kind of programming, and the instinct to reach for traditional tools is itself the bottleneck.

I’m calling the mistake Premature Compilation.

The gradient

There’s a spectrum in agentic development that maps to something programmers already understand.

Python is slow but you iterate fast. C is fast but you iterate slow. You prototype in Python, find the hot paths, compile those to C if you need the speed. You don’t write the whole system in C from day one. That would be premature optimization.

Agentic development has the same structure:

Natural language instructions   (flexible, expensive to run, fast to iterate)
              |
      [observe what's deterministic]
              |
              v
Deterministic code              (rigid, cheap to run, slow to iterate)

Natural language rules and guidelines — what I’ll call heuristics — are the Python. Deterministic code is the C. You iterate in natural language, find what’s deterministic, compile those parts to code. The rest stays as prompts because the LLM handles unknown cases that code can’t.

The analogy isn’t perfect — Python is still deterministic, prompts aren’t. The same file can produce different results across runs. But the iteration speed tradeoff maps cleanly: you prototype in the flexible medium, then harden what you understand.

Premature Compilation is reaching for frameworks and code before you’ve even figured out, through prompting, what your agent should do. It’s writing C before you’ve prototyped in Python. The result: weeks of framework code for a process you don’t yet understand, which you’ll have to rewrite when the process turns out different than expected. It usually does.

The artifact changed

The reason traditional tools feel wrong is that the primary artifact of agentic development isn’t code. It’s natural language.

When I built my agent pipeline, the output was a set of SKILL.md files — markdown files containing heuristics for the agent to follow — plus audit checklists and tool definitions. The orchestration — retry logic, passing phase outputs forward, checking whether the audit passed — is real code, but it’s the most straightforward part of the system. The intellectual work, the thing I iterated on, is the heuristics.
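For concreteness, here is roughly what that orchestration amounts to. This is a sketch, not my actual code: `call_model` stands in for whatever LLM API you use, and the audit is any function that returns pass/fail plus feedback.

```python
from pathlib import Path

def run_phase(skill_path, phase_input, call_model, audit, max_retries=3):
    """Run one pipeline phase: prompt = heuristics + input, retry until the audit passes."""
    heuristics = Path(skill_path).read_text()
    for attempt in range(max_retries):
        output = call_model(f"{heuristics}\n\n## Input\n{phase_input}")
        ok, feedback = audit(output)
        if ok:
            return output
        # Feed the audit feedback back in; the heuristics themselves stay fixed.
        phase_input = f"{phase_input}\n\n## Previous attempt failed audit\n{feedback}"
    raise RuntimeError(f"phase failed audit after {max_retries} attempts")

def run_pipeline(phases, initial_input):
    """Pass each phase's output forward as the next phase's input."""
    data = initial_input
    for skill_path, call_model, audit in phases:
        data = run_phase(skill_path, data, call_model, audit)
    return data
```

That is essentially the whole orchestration layer. Everything interesting lives in the SKILL.md files it loads.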

The development loop looked like this:

  1. Write a SKILL.md with heuristics for the task
  2. Run the pipeline
  3. Read the output. See what the agent missed.
  4. Edit one sentence in the SKILL.md
  5. Run again

Each iteration: seconds to minutes. No compilation. No type errors. No dependency conflicts. No framework abstractions to navigate. The agent did something wrong because the heuristic wasn’t precise enough. Fix the heuristic. Run again. The tradeoff: each run costs an API call. Prompt iteration is faster but not free.

Compare the framework approach:

  1. Write Python/TypeScript with LangChain
  2. Define chains, tools, agents in code
  3. Run. Something fails.
  4. Debug: was it the prompt? The chain logic? The tool definition? The framework config? The model?
  5. Navigate five layers of abstraction to find the issue
  6. Change code, re-run

Five debugging surfaces spread across layers of abstraction. Minutes to hours instead of seconds to minutes. Prompt iteration has its own ambiguity — was it the wording? The ordering? The model’s interpretation? — but there are no framework internals between you and the output. It’s a tighter loop.

Octomind, after 12+ months of LangChain in production, reported that their team had begun spending as much time understanding and debugging LangChain as building features. They replaced it with modular, low-level code and minimal abstractions. Di Chiappari wrote that coding agents had replaced every framework he used. The same friction keeps showing up.

These teams replaced frameworks with direct code, not with prompts. Anthropic’s own guidance says the same thing — start by using LLM APIs directly, add complexity only when simpler solutions fail. That’s an argument for simple code over framework code, not for natural language over code.

My argument goes one step further. Before you write even the simple code, start with natural language. Iterate on the heuristics first. Compile to code only after you understand what’s deterministic. That step — the part before code — is what nobody talks about.

Worth noting the incentives at play, including my own: Anthropic benefits when developers use their API directly rather than through framework abstractions. And I built my system the prompt-first way — that experience shapes my perspective. But the advice is consistent with what independent developers like Octomind and Di Chiappari found on their own.

This is the experience of those who left frameworks, not a representative sample. Plenty of teams ship successfully with them — the same 47billion comparison that showed AutoGen’s mismatch also showed CrewAI delivering results in a week. Teams that chose well and succeeded don’t write blog posts about it, so the evidence you find online skews negative. But the bottleneck these teams describe — the framework consuming effort that should go to understanding the problem — is consistent, and it’s what Premature Compilation predicts.

You’re not throwing it away

Everyone talks about “prototyping in the CLI, then moving to production.” As if the SKILL.md files are throwaway scaffolding you’ll replace with real code later.

They’re closer to production than you’d expect. Instruction files that guide agent behavior are already supported across the ecosystem — OpenAI’s Codex, GitHub Copilot, Cursor, and other platforms all execute them. You still need infrastructure around them — error handling, monitoring, deployment. But the heuristics themselves don’t get rewritten in a framework. They stay as natural language because that’s where the judgment lives.

Why the programming instinct misfires here

Programmers see a problem and reach for code. It’s not a conscious decision. It’s muscle memory. Problem -> code -> framework -> IDE. Years of training.

But agent development is closer to coaching than programming. You’re not telling a dumb machine what to do step by step. You’re guiding an intelligent system toward the right behavior through principles, examples, and feedback. The SKILL.md is not a program. It’s a set of coaching notes for a capable but imperfect team member.

Here’s what that looks like. This is from a code audit agent’s SKILL.md — the heuristics that tell it how to evaluate code changes:

  Does any optional chaining (`?.`) hide a case that should be impossible?
  `user?.name` is fine if user is genuinely optional.
  It's a bug mask if user should always exist at that point.

A linter can detect `?.`. It cannot determine whether the value should always exist at that point — that requires understanding the code’s intent, the data flow, the contract the function is supposed to uphold. You could write a static analysis rule that flags all `?.` usage — but it’d be noisy and miss the point. The heuristic isn’t “find optional chaining.” It’s “judge whether optional chaining is hiding a bug.” That judgment is domain knowledge expressed as a principle, not a pattern match. That’s why the SKILL.md is the artifact, not a function signature.

The skill transfer from programming to what’s been called “agentic engineering” (Karpathy’s preferred term) is real but specific. Decomposition transfers. Systematic debugging transfers. The instinct to make things composable and observable transfers. These are thinking skills, not coding skills.

What doesn’t transfer: the instinct to write everything as code, to reach for frameworks, to build abstractions, to want type safety and unit tests. These tools were built for deterministic programs. An agent system is non-deterministic. You don’t unit-test a jazz performance — you evaluate whether it met the criteria. Evals exist for this (LLM-as-judge, automated rubrics, prompt A/B testing), but they’re criteria-based, not assertion-based. A different quality model entirely.
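To make “criteria-based, not assertion-based” concrete, here is a minimal eval harness sketch. The `judge` callable is a stand-in for an LLM-as-judge call (a hypothetical function, not a real API); the structure is the point: score outputs against rubric criteria and gate on a threshold, rather than asserting exact outputs.

```python
def run_eval(outputs, rubric, judge, pass_threshold=0.8):
    """Score each output against each rubric criterion; pass if the mean clears the bar.

    judge(output, criterion) -> float in [0, 1], e.g. an LLM-as-judge call.
    """
    results = []
    for output in outputs:
        scores = {criterion: judge(output, criterion) for criterion in rubric}
        mean = sum(scores.values()) / len(scores)
        results.append({
            "output": output,
            "scores": scores,
            "passed": mean >= pass_threshold,
        })
    return results
```

Notice there is no expected string anywhere: two different outputs can both pass, the jazz-performance quality model in code form.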

The thing that worries me

There’s a recursive problem hiding in all of this that I haven’t seen anyone discuss.

Code is one of the largest, cleanest corpora of structured reasoning that exists. LLMs didn’t just learn syntax from code training data. Code likely taught them something about logical precision, step-by-step decomposition, error handling, specification-following — how much is genuinely debated, and even the direction has skeptics, but the correlation between code training and structured reasoning is hard to dismiss. Much of the code in training data was validated by a compiler. Natural language has no equivalent quality signal. A SKILL.md file can contain a logical contradiction and nothing flags it.

The concern: if the paradigm shifts from code to natural language, less new code gets written. Future training data has less high-quality, compiler-validated reasoning. Future LLMs might get worse at the precise, structured thinking that makes them good at interpreting SKILL.md files in the first place.

The paradigm could erode its own foundation.

The counter-arguments are real: the existing code corpus is enormous and doesn’t shrink. Synthetic training data exists. Economic pressure to maintain code skills creates a floor. The deterministic parts of agentic systems still produce new code — potentially higher quality, because it’s only written for well-understood operations. And AI agents might actually increase the total volume of code written, even as humans write less of it directly — the net effect on training data is unclear.

But so is the structural risk. It has the shape of a tragedy of the commons — each individual benefits from using prompts over code, and collectively this could thin the flow of new code that future models train on. The existing corpus doesn’t shrink, so the “commons” isn’t depleted in the traditional sense. But the rate of new, high-quality contributions might slow.

I don’t know how this plays out. Whether it matters within 5 years, 10 years, or never — genuinely uncertain.

Where this actually breaks down

Four real limitations I’ve hit, no hedging.

Natural language has no refactoring tools. After 50 updates to a SKILL.md, you get accumulated heuristics with contradictions nobody notices, vestigial rules for problems that no longer exist, and ordering effects (LLMs weight instructions differently by position). I’ve had skill files degrade slowly — output quality dropping over weeks as new rules silently contradicted old ones, with no tool flagging the conflict. Code has linting, type checking, dead code analysis. Natural language has nothing equivalent. Whoever builds “linting for natural language heuristics” solves the next bottleneck.
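Nothing like that linter exists off the shelf as far as I know, but a first pass could be as crude as flagging rule pairs with high word overlap for a human to check. A sketch, using Jaccard similarity as an assumed stand-in for real semantic comparison:

```python
def rule_overlap(a, b):
    """Jaccard similarity of word sets: a crude proxy for rules covering the same ground."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)

def flag_overlapping_rules(rules, threshold=0.5):
    """Return rule pairs similar enough that a human should check them for conflicts."""
    flagged = []
    for i in range(len(rules)):
        for j in range(i + 1, len(rules)):
            if rule_overlap(rules[i], rules[j]) >= threshold:
                flagged.append((rules[i], rules[j]))
    return flagged
```

This would not have caught the slow degradation I described, since contradictions can share almost no vocabulary, but it illustrates the missing tool category: static analysis where the “statics” are sentences.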

Model coupling. A SKILL.md is implicitly coupled to the model that interprets it. The same file can produce different behavior on different models, or after a model update. This is “works on my machine” but worse. You can pin library versions. You can’t always pin model versions. Your “code” can silently change behavior without the file changing.

No observability. When a SKILL.md produces bad output, you can’t trace which heuristic fired or how the model weighted each instruction. You change the input and re-run — the most primitive debugging method there is. Code has stack traces, breakpoints, logging. Natural language has: read the output, guess what went wrong, edit a sentence, try again. It works, but it’s flying blind compared to what programmers are used to.

The oracle problem. Who evaluates the evaluator? If the LLM checks its own output, you have circular validation. The same biases that produce bad output might produce bad evaluation. My system uses layered verification (multi-pass, truth-checking, claim verification, audit checklists) and human review at the top. But human review is a bottleneck that limits scalability. For tasks where human evaluation doesn’t scale, this approach hits a structural limit.
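The general shape of that layered verification, sketched with hypothetical check names, is a short-circuiting chain where the human sits at the top:

```python
def layered_verify(output, checks, human_review):
    """Run cheap automated checks first; only survivors reach the human bottleneck."""
    for name, check in checks:
        ok, reason = check(output)
        if not ok:
            return {"passed": False, "failed_at": name, "reason": reason}
    # Human review runs last because it is the scarcest resource in the system.
    return {"passed": human_review(output), "failed_at": None, "reason": None}
```

The circularity problem is not solved here, only managed: each layer catches what the layers below it miss, and the structure makes explicit that scaling is capped by whatever `human_review` can handle.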

What I’d say to a programmer about to learn an agent framework

Don’t. Not yet.

Open any AI coding tool that supports instruction files — Cursor, Claude Code, Codex, Windsurf, OpenCode, or whatever you already use. Write a one-paragraph description of what you want your agent to do. Run it. Read the output. Fix what’s wrong in the description. Run it again.

Do this 10 times. You’ll learn more about your agent’s actual behavior in an hour than you would in a week of framework setup. You’ll discover which parts are deterministic and which need reasoning. You’ll find failure modes you never anticipated. You’ll iterate faster than any framework allows.

Then, if a specific step is deterministic enough to compile to code, compile that step. Keep the rest as natural language. The ratio of prompts to code will be higher than your programming instinct expects. That’s fine. That’s correct.
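Compiling a step looks mundane in practice. A hypothetical example: after ten runs you notice that the “find the report date in the filename” step never requires judgment, so it moves from a sentence in the SKILL.md to a few lines of code, while the genuinely fuzzy steps stay as prompts.

```python
import re

# Assumed filename convention for this example: an ISO date embedded somewhere,
# e.g. "audit-2024-03-15-final.md". Once that pattern is confirmed stable,
# the extraction no longer needs an LLM at all.
DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def extract_report_date(filename):
    """Deterministic replacement for a step the agent used to do via instructions."""
    match = DATE_RE.search(filename)
    return match.group(0) if match else None
```

Each step you compile this way gets cheaper, faster, and testable, which is exactly the Python-to-C move the gradient describes.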

When your system eventually needs multi-model routing, state persistence, or human-in-the-loop checkpoints, a framework can earn its weight. But choosing wrong has real costs — one consulting team reported building the same simple agent in 3 weeks with AutoGen vs. 1 week with CrewAI, a mismatch between tool and task that ate two extra weeks. You’ll choose better after you understand your problem through prompting first.

The intellectual work is the heuristics and the task decomposition, not the orchestration. The orchestration is minimal. Everything else is domain knowledge encoded as natural language instructions, refined through conversation.

One qualifier: this advice applies most to novel agent systems where you’re inventing the process. If you’re building a well-understood pattern — RAG, chat-with-docs, support routing — the process is already known, and a framework that encodes it will save time. Premature Compilation is about the gap between “I don’t understand my problem yet” and “I’m writing code for it.” If you already understand your problem, compile away.

Start there. You’ll know when you need more — and you’ll choose better when you do.


Sources: Octomind: Why we no longer use LangChain, Anthropic: Building Effective Agents, Anthropic: Building agents with the Claude Agent SDK, SKILL.md open standard, Karpathy on agentic engineering, Di Chiappari: Coding agents have replaced every framework, 47billion: AI agents in production
