What Claude Code skills actually are (and aren't)

The first thing that happens when you try to write a skill is that you fail. Not at the mechanics — the mechanics are trivial. You fail earlier. You sit down to encode “how I do research” and realize you don’t actually know how you do research. Not precisely. Not in a way that decomposes into steps another entity could follow.

So you start categorizing. Is this a fact or a procedure? Does it have a trigger condition? When would someone need this, and how would they know they need it? You notice duplication between your second and fourth skill files. You discover that three of your “separate” workflows share a common verification step you never named. You are building, without intending to, a personal ontology. A bottom-up knowledge graph of your own expertise, assembled by noticing patterns across attempts to externalize it.

The forcing function toward explicit knowledge is genuinely valuable. Writing a skill is reflective practice dressed as configuration. You organize, the AI performs, you notice patterns in the failures, you refine. A co-evolution loop. And the modest but honest claim is this: skills help you think about your own work.

The implementation is simpler than the introspection. Create a folder in ~/.claude/skills/, add a SKILL.md file, describe when the skill should activate, write the procedure. Anthropic’s documentation covers the whole thing in a page. Skills can bundle runnable Python or shell scripts alongside the markdown. They port across projects. At startup, only skill descriptions load into context, constrained to a character budget that scales at 2% of the context window. The full content loads only when a trigger matches. This progressive disclosure is the core engineering innovation and the genuine reason skills differ from just pasting instructions into CLAUDE.md.
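To make the mechanics concrete, here is a minimal SKILL.md sketch. The frontmatter field names (`name`, `description`) follow Anthropic's documented format; the deploy-checklist skill itself, its folder name, and its commands are hypothetical examples, not a prescribed template:

```markdown
---
name: deploy-checklist
description: Use when deploying to production. Covers the wrangler
  wrapper, required env vars, and the post-deploy verification step.
---

# Deploy checklist

1. Run `pnpm check && pnpm lint` before anything else.
2. Deploy with `pnpm run deploy` (wraps `wrangler pages deploy`).
3. Verify: curl the health endpoint and confirm a 200 response.
```

This file would live at `~/.claude/skills/deploy-checklist/SKILL.md`. Only the `description` loads at startup; the body loads when Claude decides the trigger matches.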

But the claims don’t end at self-knowledge. Barry Zhang, the Anthropic engineer who introduced skills publicly, maps them onto a computing stack: models are processors, agent runtimes are operating systems, skills are the applications. He calls them “a concrete step towards continuous learning.” Anthropic positions skills as a new computing paradigm.

Markdown files in folders.

Every public framing of skills, ordered from mechanism to aspiration:

  1. “Just prompts” (Reddit)
  2. “Presaved prompts with auto-activation” (Reddit)
  3. “Dotfiles for LLMs” (Hacker News)
  4. “Context management optimization” (Hacker News)
  5. “O(n) to O(1) entropy factorization” (Dust)
  6. “The application layer” (Barry Zhang, Anthropic, paraphrased from his computing stack analogy)
  7. “Continuous learning” (Barry Zhang)

I call this the Compression Ladder. Each rung does more rhetorical work than the data supports.

At Level 1, “just prompts” is mechanically correct and practically useless. Like saying software is just electrons through silicon.

At Level 2, “prompt templates with lazy loading” is the most accurate single-sentence description. It captures the genuine engineering innovation (progressive disclosure) without inflating it.

At Level 3, “dotfiles for LLMs” maps onto a real lineage — but with one genuine novelty: the agent can create and modify its own configuration files. Your .bashrc can’t write itself.

Then the ladder leaves solid ground. “Applications” in the computing stack sense are executable code with predictable behavior. Skills are probabilistic text the model follows roughly half the time. One user’s 250-invocation eval found skills activate about 50% of the time at baseline.

“Continuous learning” implies weight updates that persist between sessions. What actually happens is filesystem persistence: a markdown file sits in a folder, and the model reads it again next session with no memory of having read it before. The word “learning” does rhetorical work that the mechanism does not perform.

The knowledge curation benefit is real. The performance claim is a different story.

Then the first controlled study arrived, in February 2026. An ETH Zurich team tested four agents across two benchmarks (SWE-bench Lite and their own AGENTbench). Human-written context files improved task success by about 4% while increasing cost up to 19%. Auto-generated files hurt performance in 5 of 8 settings and added 20% cost. For Claude Code specifically, human-written files performed worse than no file at all.

The curation overhead helps the human think. It does not measurably help the AI perform. The marketing says one thing. The evidence says another. The gap between them is the subject of this article.

You’re telling, not teaching

The skills discourse rests on an analogy that flatters the product. Anthropic engineers describe skills as onboarding material, the way you’d bring a brilliant new hire up to speed. “It’s like talking to a very brilliant consultant who has gone through onboarding,” as Marius from Anthropic put it. The framing imports a human learning model: give the consultant your style guide and process documents, hand over the institutional knowledge, and they internalize it into expertise.

LLMs don’t internalize from context the way humans do.

A human who reads a style guide compresses patterns into a durable mental model that generalizes beyond the examples. They notice an inconsistency the guide doesn’t cover and resolve it by inference from the principles they absorbed. They apply the guide to novel situations its authors never anticipated. The knowledge becomes part of them.

An LLM that reads a style guide has more tokens competing for attention in a fixed-size context window. No weight updates occur. No compression happens between sessions. No generalization extends beyond what the context window contains. When the session ends, the model forgets everything. Next session, you tell it again. The “teaching” frame is wrong at the mechanism level. You are not teaching. You are telling. Every session. From scratch.

The onboarding analogy is flattering in exactly the wrong direction. A human consultant who “went through onboarding” retains the knowledge afterward. The cost of onboarding is paid once. The benefit compounds over time. An LLM “going through onboarding” pays the cost every session (context window tokens, processing time, instruction confusion risk) and retains nothing. The costs recur. The benefits don’t compound. You cannot “teach” an LLM by giving it more context. More context is more instruction, not more learning.

The distinction matters because the two frames produce opposite behaviors. The teaching frame encourages writing more, because more teaching should mean more learning. The telling frame encourages writing less, because every unnecessary instruction is noise the model must process alongside your actual request. One frame tells you to elaborate. The other tells you to delete. The data supports deleting.

The ETH Zurich study found something that makes this concrete. Agents that received context files mentioning a specific tool used it roughly 160 times more often (1.6 uses per instance versus fewer than 0.01 without). The ratio sounds like proof that context works. But the actual task success improvement was roughly 4%. The impressive intermediate metric (did the agent use the tool?) masks a modest outcome (did the agent solve the problem?).

Translate the usage ratio: “the agent followed the instruction.” Not “the agent succeeded 160x more.” Behavior change is not performance improvement. The paper reached the top of Hacker News with 232 points and 161 comments, and it was the usage ratio that got shared. The 4% sat in a less-quoted section.

Why extra context backfires

The mechanism behind the degradation is not what you’d expect. The standard explanation is attention dilution — more tokens, thinner attention. But modern transformers handle 500 extra tokens in a 100K window without breaking a sweat. If distraction were the cause, you’d expect smooth degradation. The ETH Zurich data shows something spikier.

Agents with context files “explored more files and took more steps.” That is not attention loss. It looks more like the model second-guessing itself.

A more likely mechanism is instruction confusion. When your CLAUDE.md restates coding patterns the model already infers from existing code, the model faces a conflict: follow the written instruction or follow what it inferred from the codebase itself? The model was trained on millions of repositories. It has strong priors about how Python projects structure their tests, how TypeScript projects handle types, how React components manage state. When your context file restates those patterns, or states them slightly differently, the model hesitates. The result is more exploratory steps, more file reads, more hedging. The agent does more work, not better work.

The ETH Zurich paper ran a revealing ablation. When existing documentation files were removed from the repos before testing, LLM-generated context files improved performance by 2.7%. The problem was redundancy. The context file was restating what the documentation already said.

This changes what “shorter is better” means. If attention dilution were the problem, shorter content would always be better regardless of what it says. Under instruction confusion, the operative variable is whether the content conflicts with what the model already knows. An 800-line research methodology doesn’t conflict with anything in the model’s training data, because the model has no prior on your specific research workflow. A 10-line coding standards guide conflicts directly, because the model already has strong priors on coding patterns from training. Novel procedures can be long without penalty. Redundant standards hurt even when short.

This is a hypothesis that fits the ETH Zurich data, not an established mechanism. The paper observed the behavioral pattern (more exploratory steps, not fewer correct ones) but did not test instruction confusion as a causal explanation.

If most context is noise, what is the signal?

The one test that matters

One framework helps. Before writing any line in a skill or CLAUDE.md, ask one question.

Can the agent figure this out by reading the codebase?

The Discoverability Test. If the answer is yes, delete it. The agent can ls, grep, and read the entire repo in seconds. It can parse your package.json, check your tsconfig, read your test files, inspect your directory structure. It was trained on millions of repositories that look like yours. The ETH Zurich study is consistent with this: 100% of Sonnet-generated context files included codebase overviews (other models ranged from 36% to 99%). Those overviews did not reduce the number of steps before agents found the relevant files.

The agent was never lost. You told it where the kitchen was, and it had already looked.

If the answer is no, keep it. This is where the dramatic usage multiplier lives. Non-obvious tooling commands. Deployment recipes that involve flags and sequences the code doesn’t encode. Multi-step workflows the agent could not possibly infer from reading source files. Branch naming conventions when several options exist. Package manager choice when npm, pnpm, and yarn all have lockfiles present in the same project.

What survives the Discoverability Test (keep):

  • pnpm test:unit (not npm test, not yarn test)
  • A deploy command that wraps wrangler with environment variables and a post-deploy verification step
  • “Private env vars via $env/static/private, never import from $env/dynamic”
  • A 14-step research methodology with specific source-ranking heuristics
  • A data migration procedure that must run steps in a non-obvious order

What fails it (delete):

  • “This is a SvelteKit project using TypeScript” (the agent reads svelte.config.js)
  • “We follow clean architecture with separation of concerns” (the agent reads the directory tree)
  • “Use descriptive variable names” (the agent sees the existing variable names)
  • Directory structure listing (the agent runs ls)
  • Framework documentation summaries (the agent has the framework in its training data)

The practical result is small. Five to ten lines of non-discoverable tooling commands in CLAUDE.md. Two or three skills for genuinely substantial, portable, intermittently needed procedures.

The strongest counter-example to the “keep everything minimal” thesis comes from inside the research process itself. The 800-line research skill used in the underlying investigation for this article is the most effective skill I use. If you replaced it with a 5-line CLAUDE.md entry, research quality would degrade substantially. It encodes 14 lateral reframing techniques for search queries, explicit verification phases, source-ranking heuristics, depth-limiting rules for link following. None of this is discoverable from any environment. There is no codebase to grep for “how to evaluate source credibility” or “when to follow internal links two levels deep.”

This looks like a contradiction. The data says context files don’t help. Here is an 800-line context file that appears to help.

It is not a contradiction. The resolution is the Discoverability Test itself. Discoverable content (codebase overviews, coding patterns, architecture descriptions) is noise. The ETH Zurich data is consistent with this. Non-discoverable multi-step procedures (research workflows, deployment pipelines, data migration recipes) are the genuine use case. The 800-line research skill works precisely because none of its content conflicts with anything the model already knows. It encodes procedural knowledge the model has no prior on. Under the instruction confusion hypothesis, the conflict mechanism doesn’t activate because there is nothing to confuse.

But the ETH Zurich study never tested this category. It measured coding tasks, where the codebase itself provides most of the context the agent needs. The Discoverability Test predicts skills should help most in exactly the domain no study has examined. The genuine use case is narrower than Anthropic markets and remains empirically untested. The research skill is one person’s experience, not a controlled evaluation.

We’ve seen this movie before

If you’ve worked at any company with Confluence, you already know how this ends.

| Skills failure mode | Corporate wiki equivalent |
| --- | --- |
| Skill-bible drift (source updates, skill goes stale) | Document decay |
| Can't find the right skill at 50+ installed | Findability crisis |
| Auto-generated skills hurt performance | Auto-generated meeting notes pile up unread |
| 50% trigger failure rate | "Nobody reads the wiki" |
| Premature abstraction (skill created after one use) | Template created for a one-off process |
| skills.sh: 80K installs, quality unvetted | Confluence: 10K pages, mostly stale |

Every row in that table has been studied since the 1970s. Nonaka and Takeuchi (1995) mapped the exact process: take the knowledge in your head, convert it to explicit text, watch the text drift from practice. Their SECI model identifies four modes of knowledge conversion; Anthropic's co-evolution loop re-covers two of them under different names. Same process. Different branding. Fifty years apart.

The decay patterns are just as old. Argyris and Schön formalized them in 1978: explicit knowledge drifts from practice, findability degrades with volume, auto-generated documentation crowds out curated documentation. The findability crisis that killed corporate wikis at scale is the same problem skills face at 50+ installed — a trigger mechanism that fires on the wrong skill or doesn't fire at all.

A Hacker News commenter captured the pattern: “You’ve described instructions. It already had a name.” Another predicted the trajectory: “I suspect much of the next 5 years will be people rediscovering existing human and project management techniques and rebranding them as AI something.”

Anthropic is, at minimum, retracing ground that knowledge management covered decades ago. One GitHub issue at a time. The dotfile lineage (.bashrc, .vimrc, .editorconfig, Makefile) stretches back to 1976. Configuration files became agent instructions became “skills.” Simon Willison noted that most people’s inferred definition of “prompt engineering” is “a laughably pretentious term for typing things into a chatbot.” Fair enough. Tobi Lütke and Andrej Karpathy upgraded the vocabulary to “context engineering,” which reflects the actual complexity better. But renaming the concept does not solve the structural problems. A Confluence instance with better branding is still a Confluence instance.

The goalpost tells the rest of the story. The original thesis was falsifiable: skills make AI better. The ETH Zurich study found results inconsistent with that thesis, at least for coding tasks. Roughly 4% improvement at 19% cost increase. For Claude Code specifically, human-written files performed worse than no file at all. One study is not definitive, but it is the only controlled study so far, and the results do not support the marketing.

The thesis retreated. Skills help you think. Skills force you to organize. Skills are reflective practice.

This retreat is understandable. The curation benefit is real. I described it in the opening paragraphs, and I meant it. But the retreat should be named for what it is: a move from a falsifiable claim that the data falsified to an unfalsifiable claim that the data cannot touch. “Skills make AI better” can be tested. “Skills help you think” cannot. That is not dishonest. It is a goalpost shift, and it should be called one.

Follow the incentives

Anthropic’s investment in skills makes sense independently of whether skills improve Claude’s performance. Five incentives, none of which require the AI to get measurably better.

Lock-in. Skills are Claude-specific. They’re branded “skills” (an identity word) rather than “configuration files” (a utility word). Twenty installed skills represent meaningful switching cost to a competing agent. An open standard proposal (AGENTS.md) aims for portability, but the current reality is fragmentation: CLAUDE.md, AGENTS.md, .cursorrules, GEMINI.md. Four files saying roughly the same thing, four chances to get out of sync.

Ecosystem moat. skills.sh positions itself as an app store for agent behaviors (it is a third-party site, not Anthropic’s, though Anthropic benefits from the ecosystem). The 80,000 install count is a vanity metric, not a quality signal for developers.

Enterprise sales. “We have an organizational knowledge management system” is procurement-friendly language. “Folders of markdown” is not. Barry Zhang says Fortune 100 companies use skills. No public evidence supports this. No named customers. No case studies.

Narrative positioning. Mapping skills onto a computing stack analogy — processors, operating systems, applications — is language aimed at fundraising decks and industry analyst briefings. “Prompt templates with progressive loading” does not move the same needle.

Community engagement. Power users build skills, share skills, evangelize skills — organic marketing regardless of whether skills improve output. The word “skills” rather than “config files” frames users as practitioners building expertise, not administrators maintaining configuration. The community markets the product whether or not it works.

None of these incentives require skills to actually improve Claude’s performance. Skills may succeed as product strategy even if they fail as engineering. This is not a conspiracy. It is normal corporate incentive alignment.

When Anthropic ships overlapping primitives (commands, rules, hooks, skills, agents, subagents, CLAUDE.md) on a rapid cadence without consolidating the conceptual surface area, one explanation is expansion of the integration surface. More primitives means more investment from users, more switching costs, more lock-in. Other explanations are possible (genuine experimentation, unclear internal priorities), but the incentive structure favors expansion.

Boris Cherny is the worst possible evidence for any position on skills. He built Claude Code. He has insider knowledge of model behavior, training decisions, attention patterns that no external user possesses. His CLAUDE.md is two lines: an automerge flag and a team channel. This works because his context lives in his head, encoded through years of building the product itself.

In his YC Lightcone interview, Boris describes tagging Claude on PRs to add preventable mistakes back into the team CLAUDE.md. His advice on bloated context files: “delete your CLAUDE.md and just start fresh.”

Generalizing from Boris’s minimalism is like concluding that a chef doesn’t need recipes, therefore recipes are useless. The chef has internalized the recipes through years of practice. Boris has internalized the model’s behavior through years of building it. He is the single most atypical Claude Code user in the world. His minimalism is evidence for one trivially true claim: the person who knows the model best needs the fewest written instructions. It is evidence for nothing else.

Boris’s own philosophy points somewhere interesting, though. The Claude Code team has Rich Sutton’s Bitter Lesson framed on the wall where they sit. The thesis: hand-engineered features and rules are reliably outperformed by scale and learning. “Never bet against the model.” Boris deletes his CLAUDE.md and starts fresh with each new model generation. “Maybe scaffolding can improve performance maybe 10 or 20%,” he told Lenny Rachitsky, “but often these gains just get wiped out with the next model.”

Skills are likely swimming upstream against model improvement. Each model generation needs less instruction. Each model generation infers more from the codebase. Each model generation has stronger priors about coding patterns, deployment procedures, testing conventions. The scaffolding you build today solves problems the next model may not have. The trajectory matters more than the snapshot.

Consider the content of a typical CLAUDE.md from 2024: “Use TypeScript strict mode.” “Prefer functional components.” “Handle errors with try/catch.” Every one of these instructions told the model something it already knew in 2025. By 2026, the model infers many coding patterns from the codebase itself and needs fewer hints. The Discoverability Test’s “no” category is likely shrinking with each model release. Content that was non-discoverable six months ago may become discoverable when the next model ships.

The question is not whether your skills work today. The question is whether you’re building scaffolding you enjoy maintaining (honest) or scaffolding you believe will last (which Boris thinks is wrong). No longitudinal study has measured instruction-sensitivity across model generations, so this is an extrapolation from the direction of capability improvements, not a documented trend. The direction seems clear. The Bitter Lesson predicts it. Boris believes it. The data is consistent with it. But “consistent with” is weaker than “proves,” and the Bitter Lesson has been wrong about specific timelines before.

What to actually do

You still have a CLAUDE.md to write. If the evidence points toward less context, not more, what belongs in it?

Keep CLAUDE.md at 5-10 lines of non-discoverable tooling commands.

```markdown
# CLAUDE.md
- Run tests: `pnpm test:unit` (vitest), run after any component change
- Deploy: `pnpm run deploy` (wraps wrangler pages deploy)
- Lint before commit: `pnpm check && pnpm lint`
- Use `pnpm` not `npm`. Lockfile is pnpm-lock.yaml
```

No directory structure. No “this is a SvelteKit project.” No architecture overview. No coding style guidelines. The agent reads the code and infers all of that. Every line above is something the agent would get wrong without being told.

Use hooks for anything that must be deterministic. A community member on Hacker News proposed forcing skill evaluation through hooks: “Create a hook that would ask Claude Code to evaluate all skills.” The community reliability hierarchy tells the story: hooks (highest reliability) beat CLAUDE.md (always loaded) beat skills (frequently ignored). If a rule must be followed every time, encode it as a pre-commit hook or a compiler check. The agent cannot ignore a TypeScript error the way it ignores a markdown suggestion.
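A git pre-commit hook is the simplest version of a deterministic gate. This sketch installs one that runs the checks from a typical pnpm project; the script names (`check`, `lint`, `test:unit`) are assumptions, so substitute your own:

```shell
# Install a pre-commit hook the agent cannot skip the way it
# skips a markdown suggestion. Run from the repo root.
HOOK=.git/hooks/pre-commit
mkdir -p "$(dirname "$HOOK")"
cat > "$HOOK" <<'EOF'
#!/bin/sh
set -e          # any failing check aborts the commit
pnpm check      # type-check
pnpm lint       # style
pnpm test:unit  # tests
EOF
chmod +x "$HOOK"
```

Unlike a skill, this fires on every commit with 100% reliability, whether the committer is a human or an agent.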

Build 2-3 skills for genuinely portable, substantial, intermittently needed procedures. Your research methodology. Your deployment pipeline. Your data migration workflow. These are the procedures that pass the Discoverability Test: non-discoverable, non-conflicting, procedural. They can be 200 lines or 800 lines without penalty because they encode knowledge the model has no prior on. The “intermittently needed” qualifier matters. A procedure you need every session belongs in CLAUDE.md (always loaded, 100% reliable). A procedure you need in one session out of twenty is the progressive disclosure use case. That is what skills were designed for.

Don’t install random skills from the internet. The skills.sh ecosystem has accumulated roughly 80K installs. Unsigned markdown from strangers on the internet, loaded directly into your agent’s context with filesystem access. A single malicious skill file can exfiltrate SSH keys, inject prompts, or download a backdoor. The skill format is markdown. There is no signing, no sandboxing, no review process. Treat the skills ecosystem the way you would treat curl | bash from an unknown source: with extreme suspicion.

Run the Discoverability Test as daily practice. Before writing a line, ask: can the agent figure this out from the environment? If you’re describing what the code does, delete it. If you’re prescribing a procedure the code doesn’t encode, keep it. The test is simple. Applying it consistently is not. The temptation to write “comprehensive” context files is strong, reinforced by every Anthropic blog post and community guide. Resist it. The available evidence points one direction.
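The test can even be roughed out mechanically. This sketch greps a sample context file for phrases that usually describe what the agent can read for itself; the phrase list is a hypothetical heuristic for illustration, not a real tool, and the sample file is written here only so the example is self-contained:

```shell
# Write a sample context file (three lines from this article's examples).
cat > CLAUDE.sample.md <<'EOF'
- Run tests: pnpm test:unit
This is a SvelteKit project using TypeScript
We follow clean architecture with separation of concerns
EOF

# Flag lines that likely fail the Discoverability Test: phrases that
# describe the codebase rather than prescribe a procedure.
grep -inE 'this is a|we follow|directory structure' CLAUDE.sample.md
```

The grep flags the two descriptive lines and leaves the tooling command alone, which is the right split: delete what the agent can infer, keep what it cannot.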

The honest verdict: the gap between what skills are marketed as and what they do is the Compression Ladder itself. The knowledge curation benefit is real but belongs to the human, not the AI. The performance benefit is marginal at best, negative at worst, based on the available evidence. The strategic incentives are real and independent of performance. The trajectory points toward less scaffolding with each model generation, not more.

Vercel’s agent evals illustrate the reliability tradeoff: AGENTS.md achieved 100% task success. Skills without explicit instructions scored 53% — identical to the no-docs baseline. Even with instructions, skills reached only 79%. The skill wasn’t invoked at all in 56% of eval cases. A skill that fires half the time is not an application. It is a suggestion with a loading mechanism.

The architecture is genuinely clever. The progressive disclosure is real engineering. The self-modifying capability has no precedent in the dotfile lineage. Skills don’t fail because of what they are. They fail because of what people put in them, encouraged by marketing that tells them to put the wrong things in.

The Compression Ladder stretches from “just prompts” to “computing paradigm.” The evidence sits near the bottom, but not at the bottom. Skills are not just prompts. They are prompts with progressive disclosure and trigger matching, portable across projects, with the ability to modify themselves. That is a genuine engineering contribution. It is not the application layer of a new computing stack. It is not continuous learning. It is not a new paradigm.

It is a well-engineered system for delivering the right context at the right time. When the content is right, it works. When the content is wrong, the engineering cannot save it. Get the content right first. The Discoverability Test tells you how.
