Ultrathink is back. It doesn't do what you think.

Update, March 5, 2026: Four days after publication, Claude Code 2.1.68 re-introduced the “ultrathink” keyword and changed the default effort from high to medium for Max and Team subscribers. What returned isn’t what died: the original ultrathink set hard thinking token budgets; the new one sets effort to high for one turn. Anthropic defaulting to medium validates the core argument below. This article has been substantially rewritten to incorporate the 2.1.68 changes and correct several claims from the original version.

On Anthropic’s own MCP-Atlas benchmark, max effort scores 59.5%. The default high scores 62.7%. The highest setting underperforms the default by 3.2 percentage points.

I checked the community’s claims against API documentation, three academic papers, dozens of GitHub issues, and multiple Reddit threads. Most don’t hold up. The parameter everyone’s tuning doesn’t even do what they think it does. And the things that actually determine output quality get far less attention.

What effort actually is

The community treats effort like a thinking budget. Turn it up, the model thinks harder, output gets better. This is wrong.

Effort is a behavioral dial that controls thinking depth, tool-call frequency, AND response length simultaneously. Anthropic’s effort documentation says it plainly: “Effort is a behavioral signal, not a strict token budget.” It “doesn’t require thinking to be enabled in order to use it” and “can affect all token spend including tool calls.” Effort sits in output_config. Adaptive thinking sits in thinking. Separate parameters, separate documentation pages, separate purposes. The migration plugin confirms: “High effort + no thinking = more tokens, but no thinking tokens.”
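The split the docs describe can be sketched as two request bodies. Field names (`output_config`, `thinking`) follow the article's description of the documentation; treat the exact schema and the model id as assumptions, not a current API reference:

```python
# Sketch of the effort/thinking split as described above. Field names follow
# the article's account of the docs; exact schema is an assumption.
effort_only = {
    "model": "claude-opus-4-6",            # hypothetical model id
    "output_config": {"effort": "high"},   # behavioral dial: depth, tool calls, length
    # no "thinking" block: "high effort + no thinking = more tokens,
    # but no thinking tokens"
}

adaptive_thinking = {
    "model": "claude-opus-4-6",
    "thinking": {"type": "adaptive"},          # model chooses its own budget
    "output_config": {"effort": "medium"},     # separate knob, separate effect
}

# The two knobs are independent: effort without thinking still raises token
# spend (longer answers, more tool calls) without any reasoning tokens.
assert "thinking" not in effort_only
assert "output_config" in adaptive_thinking
```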

The pattern resembles GCC’s optimization flags. -O2 is the most tested, the one compiler engineers trust. -O3 enables aggressive optimizations that can make programs slower: inlining explodes binary size, bloated code thrashes instruction cache. Same “more is better” intuition. Same counterintuitive result. The analogy isn’t perfect (effort and compiler optimization work through different mechanisms) but the lesson rhymes: better source material matters more than compiler flags.

A February 2026 preprint by Chen et al. found supporting evidence on math and science benchmarks. They introduced a Deep-Thinking Ratio (deep-thinking tokens as a proportion of total output) and tested it across four benchmarks using GPT-OSS, DeepSeek-R1, and Qwen3. Raw token volume correlated negatively with accuracy (r=-0.59, Table 1 average). Their DTR metric correlated positively (r=+0.683). What you think about matters more than how long you think.

The observability illusion

Here is the core epistemic problem. On Claude 4.x models, thinking is summarized before display. You see compressed reasoning. You’re billed for full tokens. You cannot observe the actual reasoning process.

When you switch from medium to high, five things change simultaneously: (1) the model may think more deeply; (2) more tool calls gather more context; (3) longer responses you interpret as higher quality; (4) non-deterministic variance shifts the output; (5) you expected better and evaluate more charitably. You cannot disentangle these.

The community is optimizing a variable they cannot observe against outcomes they cannot attribute.

What I’ll call perspective drift makes this worse. The paper “To Think or Not To Think” calls it “slow thinking collapse”: accuracy drops as responses grow longer, and larger reasoning budgets hurt performance. Extended reasoning can overwrite correct initial intuitions. On GPT-o3, performance dropped 14.5 percentage points at highest effort. The model starts with a correct read, generates more tokens, encounters noise, second-guesses itself, and thinks its way from the right answer to the wrong one.

Practitioners see this in the wild. One Reddit thread captures it: “Never ask Claude Code to think harder during implementation. Only during planning.” Another commenter: “When claude has already made a plan and you say ‘proceed, think harder,’ it’ll make a new plan on the go.” The model has a correct plan. It receives a signal to think more. The additional thinking introduces noise that overwrites the correct plan.

Every number traces to one source

I looked for independent verification of effort-level claims. I couldn’t find any.

Every quantitative claim about effort-level performance traces entirely to Anthropic. The Terminal-Bench scores show the expected relationship: low 55.1%, medium 61.1%, max 65.4%. More effort, better scores. The MCP-Atlas scores show the opposite: max 59.5% underperforms high’s 62.7%. Same vendor. Same models. Opposite conclusions.

The SWE-bench claim illustrates how vendor data mutates. Anthropic has claimed that Opus 4.5 at medium effort matches Sonnet 4.5’s best SWE-bench Verified score using 76% fewer output tokens. Note the caveats: cross-model comparison, not same-model. Specific benchmark. Then a @claude_code community account on X repackaged it without the qualifier: “Medium Effort: Matches Sonnet 4.5 capability but uses 76% fewer output tokens.” 44,000 views. By the time this reached Reddit, the cross-model caveat had vanished. “Just use medium” became common sense, not because anyone replicated it, but because a vendor benchmark sounded close enough.

Anthropic has a financial incentive here. If subscription users adopt medium effort, Anthropic’s cost to serve each fixed-price subscriber drops. That doesn’t make the data wrong. It means you need independent replication before treating it as settled. The community “experiments” are n=1 anecdotes with no controls. One user’s summary: “Medium seemed about 15-25% lighter… felt mostly the same.” A subjective impression from one person on unspecified tasks.

What actually moves the needle

There’s a concept called the Streetlight Effect: searching where looking is easiest, not where the thing actually is. Effort is the streetlight. Visible, adjustable, a clean three-position switch. The actual quality determinants sit in the dark.

Subagent verbosity dominates the token economy. A Reddit analysis (roughly 90 upvotes, r/ClaudeCode) broke down token consumption with Opus 4.6. One subagent returned 1.4 MB of output on a single task. That same subagent’s output, read twice, consumed 2.6 MB. An entire Opus 4.5 session: under 2 MB. Users obsess over thinking tokens while a single chatty subagent burns more context than everything else combined. This is one person’s measurements, not a rigorous study. But the proportions are stark.

High effort undermines CLAUDE.md rules. GitHub issue #23936 documents the problem: high effort makes the model more “eager,” including more eager to ignore your rules. The perverse result: the setting you chose to improve quality actively degrades your ability to control quality. Since version 2.1.68, the default dropped to medium for Max and Team subscribers. This may quietly fix the problem for many users.

Hooks beat CLAUDE.md through context rot. CLAUDE.md loads at session start, then gradually loses influence as the context window fills. Instructions early in conversation carry less weight as new content pushes them further from the model’s attention window. Hooks inject via UserPromptSubmit on every prompt, always fresh, always near the latest context. One post (roughly 120 upvotes): “I switched to hooks and that one move solved roughly 80% of my problems.”

Compaction plus deep thinking equals confidently wrong. Context compaction is already lossy. High effort applied to a lossy summary means reasoning deeply about a degraded representation. The logic is straightforward: each reasoning step that depends on missing or distorted information compounds the loss. A model thinking hard about a bad summary should produce confidently wrong conclusions. A model thinking lightly about a good representation should do better with fewer tokens. No one has tested this directly, but the mechanism follows from how compaction and extended reasoning interact.
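The compounding claim can be made concrete with a toy model. The fidelity numbers and step counts are illustrative assumptions, not measurements; only the shape of the argument (loss compounds per reasoning step) comes from the paragraph above:

```python
# Toy model of the claimed interaction: each reasoning step that leans on a
# lossy representation keeps only `fidelity` of the relevant information.
# Numbers are illustrative assumptions, not measurements.
def surviving_signal(fidelity: float, reasoning_steps: int) -> float:
    return fidelity ** reasoning_steps

# Light thinking on clean context vs. deep thinking on a compacted summary.
clean_light = surviving_signal(0.98, reasoning_steps=3)   # ~0.94
lossy_deep  = surviving_signal(0.90, reasoning_steps=12)  # ~0.28

assert clean_light > lossy_deep  # more steps on worse input compounds the loss
```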

My read of the evidence suggests a hierarchy: prompt quality > context hygiene > CLAUDE.md instructions > model selection > effort level. Effort is last on this list and first in every optimization discussion. The most-upvoted Claude Code advice is consistently about prompt quality. One post (over 300 upvotes): “The single most edited file is CLAUDE.md. 43 changes. More than any React component.” Another (roughly 345 upvotes): “Do NOT go above 40-50% of your context window.” The people getting the best results edit prompts obsessively, not effort settings.

Where this breaks down

Test-time compute scaling IS well-replicated for formal tasks. The entire reasoning model paradigm (o1, o3, DeepSeek-R1) rests on strong evidence. The scaling laws have strong empirical support across multiple independent labs. I’m not arguing that more thinking never helps. I’m arguing that the effort parameter is an unreliable proxy for it.

The three academic papers test specific domains: math competitions, social reasoning, adversarial scenarios. None directly test multi-file code generation with tool use in an agentic loop. Coding tasks have verifiable outputs and iterative correction loops. These properties are more favorable to extended reasoning than the tasks in those papers.

Complex planning with clean context may genuinely benefit from higher effort. Novel debugging with unfamiliar failures may benefit too. Both involve genuine exploration of unknown solution spaces.

| Task | Effort | Rationale |
|---|---|---|
| Planning and architecture | high (ultrathink) | Genuine reasoning benefits. Low perspective drift risk: no existing plan to overwrite. |
| Implementation of known plan | medium (default) | Plan exists. More thinking can overwrite it. Perspective drift bites hardest here. |
| Debugging novel issues | high (ultrathink) | Unknown solution space benefits from exploration. |
| Routine edits, refactoring, tests | low | Mechanical tasks. Extra thinking adds latency. |
| Context window above 50% | Reduce effort | Deep reasoning on degraded context compounds errors. |


The irony: the best argument for manual effort control (complex planning, novel debugging with clean context) is an argument for exactly the system Anthropic is building to replace it. “Think hard sometimes on the tasks that need it” is what adaptive thinking does automatically.

What ultrathink actually does now: sets effort to high for one turn. Not a token budget. Not max effort. Phrases like “think” don’t allocate thinking tokens. One notch up from the new medium default.

What would change this assessment: one independent benchmark on realistic coding tasks, published by someone other than Anthropic, with published methodology and statistical controls for non-determinism. That doesn’t exist.


Three papers that say “more thinking can hurt”

Three recent papers investigate what happens when you give models more thinking time. Here is what each actually says and where the evidence holds up.

Paper 1: “Do Extended Thinking Models Reason Better?” Chen et al., a February 2026 preprint, analyzed reasoning token patterns across four benchmarks (AIME 24/25, HMMT 25, GPQA-diamond) using GPT-OSS, DeepSeek-R1, and Qwen3. They computed two metrics. Raw token volume: how many tokens the model generated. And a Deep-Thinking Ratio (DTR): the proportion of deep-thinking tokens in a model’s output.

The paper’s Table 1 averages: raw volume at r=-0.59 with accuracy, DTR at r=+0.683. Two numbers, opposite stories.

The first says models producing more tokens do worse. The second says models using tokens efficiently do better. The distinction isn’t semantic. It’s the difference between “think more” and “think better.” A student who writes ten pages of incoherent reasoning fails the exam. A student who writes three precise paragraphs passes. The length didn’t determine the outcome. The quality per unit did.

A caveat: the negative correlation with raw volume is likely confounded by task difficulty. Harder problems generate more tokens AND more errors. Both variables rise together for a shared reason, not because one causes the other. The correlation is real. The causal story is uncertain. But even as a confounded correlation, it undercuts the intuition that more tokens reliably mean better answers.

Paper 2: “To Think or Not To Think.” This February 2026 preprint compared nine advanced LLMs, both reasoning and non-reasoning models, across three Theory of Mind benchmarks. This gets closer to a causal design than Paper 1 because the researchers varied reasoning budget while holding other factors constant. Same models, same tasks, different thinking budgets.

On GPT-o3: performance dropped from 0.838 at lowest effort to 0.693 at highest. A 14.5 percentage point loss from thinking harder.

The failure modes are specific and revealing. Slow thinking collapse: errors concentrate in the longest responses, with DeepSeek-R1 errors clustering heavily between 8,000 and 10,000 characters. Option matching: the model pattern-matches to similar-looking answers rather than reasoning through the problem. Extra tokens don’t buy deeper analysis. They buy more opportunities to latch onto surface similarities. And perspective drift: the model starts with the right idea, generates more tokens, encounters noise that triggers second-guessing, and arrives at the wrong answer. The additional reasoning doesn’t refine. It corrupts.

Paper 3: Anthropic’s own research. A paper co-authored by Anthropic researchers, published in TMLR (Featured and J2C Certified, 78 pages, published December 2025). This one matters both for its findings and its authorship. Claude becomes increasingly distracted by irrelevant information under extended reasoning. It shifts from reasonable priors to spurious correlations. Sonnet 4 showed increased self-preservation expressions under extended thinking.

The tasks were adversarial, not typical coding. You could argue these findings don’t generalize to everyday Claude Code usage. Fair. But the mechanism is real: extended reasoning creates more surface area for noise to compound. Each additional token of thinking is another opportunity for irrelevant context to distort the output. And Anthropic published this about their own model. When a vendor publishes research showing limitations in their own product, the incentive structure favors understating the problem rather than overstating it.

The cross-cutting insight from all three papers: what you think about matters more than how long you think. Procedure over volume. Direction over distance.

The documentation deep dive

Four eras of effort control, each less granular than the last.

Era 1: The keyword ladder (pre-November 2025). Hard-coded in obfuscated source. “think” allocated roughly 4K tokens. “think hard” allocated roughly 10K. “ultrathink” allocated 31,999. An HN thread documented the ultrathink value after community members identified the hard-coding in de-obfuscated source. The keywords did allocate specific token budgets, but whether that granularity translated to proportional quality differences was never established.
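The Era-1 ladder can be reconstructed as a simple keyword matcher. Budgets are the community-documented values above ("roughly" for the first two, exact for ultrathink); the matching logic is a plausible sketch of the de-obfuscated behavior, not the actual source:

```python
# Reconstruction of the Era-1 keyword ladder as the community documented it.
# Historical behavior only; none of this applies to current versions.
KEYWORD_BUDGETS = {
    "ultrathink": 31_999,   # exact hard-coded value found in source
    "think hard": 10_000,   # "roughly 10K"
    "think": 4_000,         # "roughly 4K"
}

def thinking_budget(prompt: str) -> int:
    """Return the budget the old matcher would have allocated (sketch)."""
    lowered = prompt.lower()
    # Most specific keywords first: "ultrathink" and "think hard" both
    # contain "think", so order matters.
    for keyword, budget in KEYWORD_BUDGETS.items():
        if keyword in lowered:
            return budget
    return 0

assert thinking_budget("ultrathink about this refactor") == 31_999
assert thinking_budget("think hard about edge cases") == 10_000
```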

Era 2: Keywords deprecated (late November 2025). Strong community response. A Reddit post titled “RIP to ultrathink” (roughly 270 upvotes) captured the mood: “Ultrathink no longer does anything. Thinking budget is now max by default.” A lighthearted lament, not an accusation. An Anthropic collaborator closed a related GitHub issue with: “ultrathink is now deprecated.” A MITM proxy study claimed throttling. The first commenter debunked it: the study had assumed a fixed 32-token-per-SSE-chunk multiplier.

Era 3: Adaptive thinking (February 2026). Opus 4.6 and Sonnet 4.6 launched with thinking: { type: "adaptive" }. The model now decides its own thinking budget. Effort became an opaque behavioral hint. MAX_THINKING_TOKENS is effectively ignored by default on Opus 4.6. The parameter the community spent months tuning stopped doing anything. Earlier discussions about MAX_THINKING_TOKENS were about Opus 4.5, not 4.6. The new model doesn’t just deprecate the keyword. It deprecates the entire concept of user-controlled thinking budgets.

Era 4: The medium default (March 2026, version 2.1.68). The default dropped to medium for Max and Team subscribers. The original default was high; whether the raw API retains that default is undocumented. Ultrathink returned as a convenience toggle. Same name, different mechanism, different era. It sets effort to high for one turn. Not max. Not a token budget.

The trajectory: keyword ladder with unverified granularity, deprecated, adaptive thinking sidelining MAX_THINKING_TOKENS, simplified toggle. Every generation is coarser and more opaque than the last. The 2.1.68 changelog records the latest shift without fanfare. The community spent months developing effort-tuning intuitions. The platform moved on.

The current parameter space tells its own story. Four effort levels: low, medium, high, max. Max is Opus 4.6 only. But max isn’t exposed in the /model UI picker; only low, medium, and high appear there. A GitHub issue confirms max exists as a valid API value. Another issue confirms it’s absent from the interactive picker. Community discussions about using max in daily Claude Code usage are discussing something the UI doesn’t offer. You can set it through the CLAUDE_CODE_EFFORT_LEVEL environment variable or effortLevel in settings, but Anthropic chose not to put it in the interactive picker. That’s a design signal.

Three ways to set effort: the /model command, the environment variable, or the settings file. The /model command is the simplest. The environment variable is the most persistent. None of the three gives you per-task control.

The subagent effort gap deserves attention. There’s no way to configure effort per-task, and no documentation on how effort propagates from parent to child agent. You can set effort for your session, but you can’t tell a subagent to think harder on a specific subtask while keeping the parent at medium. The control granularity the community wants doesn’t exist. Users burn through monthly quotas without understanding where the tokens went, partly because they can’t see or control effort at the subagent level.

The token economy and context hygiene

The Reddit token analysis deserves a closer look. The post (roughly 90 upvotes) measured total token consumption across three configurations with Opus 4.6. Default effort: 14.7 MB across three sessions (the task kept dying and restarting, each restart consuming additional context). Context-efficient prompting: 5.0 MB in a single clean session. A 66% reduction, not from changing effort, but from managing context. Opus 4.5 baseline: under 2 MB for the same task.

The subagent detail is the buried lede. One subagent returned 1.4 MB of output on a single task. That output, read back into the parent context twice during the session, consumed 2.6 MB total. More than an entire Opus 4.5 session. Think about what that means. The community debates whether medium or high effort is “worth the extra tokens.” One chatty subagent consumes more context than the entirety of a session on the previous model generation. In this user’s session, the effort setting was a rounding error next to subagent verbosity. This is one person’s measurements, not a controlled study. But the proportions are suggestive.
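The proportions are worth doing as back-of-envelope arithmetic, using the Reddit numbers above (one user's session, not a controlled study):

```python
# Back-of-envelope accounting from the Reddit numbers above.
# One user's session, not a controlled study.
MB = 1024 * 1024
subagent_output   = 1.4 * MB   # one subagent's output on a single task
context_cost      = 2.6 * MB   # that output, read back into the parent twice
opus_45_session   = 2.0 * MB   # "under 2 MB" for an ENTIRE prior-gen session

# One verbose subagent outweighed a whole previous-generation session.
assert context_cost > opus_45_session
print(f"subagent read-backs vs full 4.5 session: {context_cost / opus_45_session:.1f}x")
```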

Hooks versus CLAUDE.md is a mechanism story, and the mechanism matters because it explains why effort tuning can’t compensate for bad context management.

CLAUDE.md instructions load at session start. As the conversation grows, those instructions drift further from the model’s attention window. By turn 30, CLAUDE.md is buried under thousands of tokens of conversation and tool outputs. The model’s attention favors recency. Early instructions carry diminished weight. Context rot.

Hooks solve this mechanically. They inject via UserPromptSubmit on every prompt submission, placing instructions right next to the latest context. Always fresh. Always near what the model is attending to.
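The general shape of such a hook looks like the settings fragment below. The structure follows publicly documented Claude Code hook configuration, but verify the exact schema against current docs before relying on it; the rules-file path is hypothetical:

```python
import json

# General shape of a UserPromptSubmit hook (verify the exact schema against
# current Claude Code docs). The rules-file path is a hypothetical example.
settings = {
    "hooks": {
        "UserPromptSubmit": [
            {
                "hooks": [
                    {
                        "type": "command",
                        # stdout of this command is injected alongside the
                        # newest prompt, so the rules never rot out of the
                        # model's attention window
                        "command": "cat ~/.claude/always-rules.md",
                    }
                ]
            }
        ]
    }
}

print(json.dumps(settings, indent=2))  # merge into .claude/settings.json
```

The contrast with CLAUDE.md is purely positional: same instructions, but re-delivered next to the freshest context on every turn instead of once at turn 1.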

The hooks post (roughly 120 upvotes) reported that switching to hooks “solved roughly 80% of my problems.” The context management post (roughly 345 upvotes) warned: “Do NOT go above 40-50% of your context window.” The CLAUDE.md obsession post (over 300 upvotes) revealed: “The single most edited file is CLAUDE.md. 43 changes. More than any React component.” These are the levers people actually move when they want better output.

The high-effort-ignores-rules issue completes the picture. Users write careful CLAUDE.md instructions. They set effort to high hoping for better compliance. High effort makes the model more eager, and that eagerness extends to ignoring the very rules meant to constrain it. The 2.1.68 default change to medium may silently resolve this for many users who never understood why their rules weren’t sticking.

One user reported a /think slash-command skill that encoded reasoning procedure: “enumerate three approaches before choosing.” They found it more effective than effort settings. The hypothesis maps cleanly to the DTR finding: effort controls volume, procedure controls direction. Quality-per-token beats raw volume. A single anecdotal report, but the mechanism is consistent with the academic data.

On cost: one Reddit poster cited a report of monthly costs dropping from $840 to $320 through effort routing (a 60/30/10 split across effort levels for different task types). These figures are third-hand: the poster was citing someone else’s experience, not their own. The specific numbers are unverifiable. But the principle stands: routing effort by task type beats a single setting for everything. The gains come not from finding the “right” effort level, but from avoiding high effort on tasks that don’t need it. Mechanical tasks at low, routine coding at medium, genuine architecture problems at high. The savings come from the low and medium tasks, not from optimizing the high ones.
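The routing idea reduces to a small lookup. The task categories and their effort mapping below are illustrative, not a documented scheme; the point is that savings come from the low and medium buckets:

```python
# Sketch of effort routing by task type. Categories and mapping are
# illustrative, not a documented scheme.
ROUTES = {
    "mechanical":  "low",      # renames, formatting, test scaffolding
    "routine":     "medium",   # ordinary work against an existing plan
    "exploratory": "high",     # architecture, planning, novel debugging
}

def route_effort(task_kind: str) -> str:
    # Unknown task kinds fall back to the 2.1.68 default.
    return ROUTES.get(task_kind, "medium")

assert route_effort("mechanical") == "low"
assert route_effort("unknown-kind") == "medium"
```

Note what the router never does: it never picks "max", and it sends the bulk of everyday work (the 60/30 portion of the split) to the two cheap buckets.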

The vendor benchmark problem

The SWE-bench claim’s journey is a case study in how vendor data mutates in the wild. Three stages of telephone.

Stage 1: The original claim. Anthropic says Opus 4.5 at medium effort matches Sonnet 4.5’s best SWE-bench Verified score using 76% fewer output tokens. Two critical caveats embedded in that sentence. It’s cross-model: Opus 4.5 versus Sonnet 4.5, not the same model at different effort levels. You’re comparing a larger model at reduced effort against a smaller model at full effort. And it’s one specific benchmark: SWE-bench Verified. The claim says nothing about general coding tasks.

Stage 2: The social media repackaging. A @claude_code community account on X posted: “Medium Effort: Matches Sonnet 4.5 capability but uses 76% fewer output tokens.” The “SWE-bench Verified” qualifier vanished. The cross-model comparison was compressed into “matches capability.” 44,000 views. The tweet is not technically wrong. But it removed the two caveats that make the claim useful.

Stage 3: Community absorption. By Reddit, “just use medium” became common sense. Not because anyone verified the claim on their own tasks. Not because anyone ran a controlled comparison. Because a vendor benchmark, stripped of its caveats, sounded close enough to what people wanted to believe. The telephone game turned a narrow, caveated benchmark finding into a universal recommendation.

Terminal-Bench versus MCP-Atlas tells the other half of the story. Terminal-Bench (from the system card): low 55.1%, medium 61.1%, max 65.4%. More effort equals better. This is the relationship most people expect. MCP-Atlas: max 59.5% falls short of high’s 62.7%. More effort equals worse. Same vendor, same model family, different benchmarks, opposite conclusions. If Anthropic’s own data can’t agree on whether more effort helps, why would you trust any single data point from them as universal guidance?

Independent replication doesn’t appear to exist. SWE-bench’s leaderboard doesn’t break out effort levels. Third-party articles trace back to Anthropic’s announcement. The community “experiments” are single-user impressions with no controls. One user’s verdict on medium versus high: “Medium seemed about 15-25% lighter… felt mostly the same.” A subjective impression from one person on unspecified tasks. This is the state of the evidence.

What would resolve this is specific. One independent benchmark. Realistic coding tasks, not synthetic patches. Published by someone other than Anthropic. With published methodology and statistical controls for non-determinism. Until that exists, we have vendor claims and vibes.


The practical playbook

Before touching effort, do these four things. None requires Anthropic to change anything. Each addresses a variable with at least a clear mechanism behind it, unlike effort, where neither the mechanism nor the impact has independent verification.

  1. Cap subagent output. Based on the token analysis above, a single verbose subagent can waste more tokens than effort-level changes save. Set explicit output limits. Review what your subagents return. One subagent consuming 2.6 MB of context dwarfs the difference between medium and high effort. If that analysis is representative, the token economy of a Claude Code session is dominated by subagent behavior, not thinking behavior.

  2. Move critical instructions to hooks. They survive context rot. CLAUDE.md doesn’t. UserPromptSubmit hooks inject fresh on every interaction. If a rule matters at turn 50, it needs to be in a hook, not buried in a file the model read at turn 1. This is a mechanical fix, not a workaround. The attention mechanism favors recency. Hooks exploit that.

  3. Keep context under 50%. Start new sessions aggressively. Commit between phases. Deep reasoning on a half-degraded context window compounds errors at every step. A model thinking lightly about clean context should outperform one thinking hard about dirty context. If you’re past 50% and considering raising effort, the logic says you’re compounding the problem.

  4. Write better prompts. The most-upvoted Claude Code advice is consistently about prompt quality: structured CLAUDE.md files, explicit constraints, clear task decomposition. One user encoded reasoning procedure directly (“enumerate three approaches before choosing”) and found it more effective than effort settings. This tracks with the academic finding: procedure (what to think about) beats volume (how much to think). The people getting the best results edit prompts obsessively. They do not edit effort settings.
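Rule 3 is simple enough to mechanize. A minimal guard, with an illustrative window size and the 50% threshold from the advice above:

```python
# Minimal guard for rule 3: flag when a session crosses the 40-50% band.
# Window size and usage numbers below are illustrative.
def should_restart(used_tokens: int, window_tokens: int,
                   threshold: float = 0.5) -> bool:
    return used_tokens / window_tokens >= threshold

assert not should_restart(60_000, 200_000)   # 30%: keep going
assert should_restart(110_000, 200_000)      # 55%: commit and start fresh
```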

What would change this assessment

  • One independent benchmark on realistic coding tasks
  • Published by someone other than Anthropic
  • With methodology and statistical controls for non-determinism
  • Reproducible by third parties

Until then: vendor claims and vibes. The parameter the community spends the most time discussing is the one with the least evidence behind it and, by my reading, the least influence on outcomes.
