Caveman Claude Code Is the New Meta (Here's the Science)
Making Claude Code talk like a caveman might sound like a meme, but there's a real research paper showing that forcing large language models to be concise improves accuracy by up to 26 percentage points, and that without brevity constraints, smaller models can sometimes outperform models with 100x the parameters. The Caveman GitHub repo hit 5,000 stars in 72 hours. The actual token savings are smaller than the repo claims, but the performance science is legit.
I'm going to break down what Caveman actually buys you on the token side (the numbers in the repo are misleading), and then walk through the brevity-constraints paper that makes this more than just a meme.
What Is the Caveman Claude Code Skill?
Caveman is an open-source Claude Code skill that strips verbal filler from Claude's responses, forcing it to speak in a compressed, Neanderthal-style register that cuts output tokens dramatically. The repo name says it all: "why say many word when few word do trick."
It ships with before-and-after examples, a token-diff benchmark, and a companion tool that compresses your memory files (CLAUDE.md, todos, preferences) into the same caveman format. The repo claims 75% fewer output tokens and 45% fewer input tokens per session. Those numbers are not what they look like.
Installation is one line. Activate it with /caveman or by saying "talk like a caveman," "caveman mode," or "less tokens please." It has levels — lite, full, and ultra-caveman — and it leaves code generation, error messages, and tool outputs completely untouched. Only the prose in Claude Code's terminal responses changes.
Does Caveman Actually Save 75% of Your Tokens?
No. The 75% figure applies to prose responses only, which are a small fraction of a total Claude Code session — in practice, Caveman saves about 4-5% of total tokens per session. Still worth it, but nowhere near the headline number.
Let's break down a typical 100,000-token Claude Code session:
- Input: 75,000 tokens (75% of total) — system prompt, prior messages, file contents, context
- Output: 25,000 tokens (25% of total) — tool calls, code blocks, prose responses
That output is further split into tool calls, generated code, and prose. The prose portion is maybe 6,000 tokens. That's what Caveman compresses. Cut 75% of the prose and you save roughly 4,500 tokens, about 4-5% of your total 100K-token session.
On the input side, Caveman's memory-file compression touches only parts of the system prompt (CLAUDE.md and similar files). In a 75,000-token input, those files might be 2,000-4,000 tokens. Cut half of that and you save another 1,000-2,000 tokens — roughly 1-2% of the input.
Real-world savings: about 4-5% of total tokens per session. That's not going to take you from a 5x Max plan to a 20x Max plan. But over a week of heavy use, 5% compounded is not nothing — especially if you're usage-conscious right now.
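The arithmetic above is easy to sanity-check yourself. This sketch uses the article's estimates, with the memory-file size pinned to an assumed midpoint of the 2,000-4,000 range:

```python
# Back-of-envelope math for Caveman's real savings in a
# hypothetical 100,000-token Claude Code session.

total = 100_000
input_tokens = 75_000       # system prompt, history, file contents
output_tokens = 25_000      # tool calls, code blocks, prose

prose = 6_000               # prose slice of the output (estimate)
prose_saved = prose * 0.75  # Caveman's 75% cut applies here only

memory_files = 2_000        # CLAUDE.md and friends, inside the input
memory_saved = memory_files * 0.5  # roughly half compressed away

saved = prose_saved + memory_saved
print(f"saved {saved:.0f} of {total} tokens "
      f"({saved / total:.1%} of the session)")
```

With these assumptions the savings land around 5-6% of the session, the same ballpark as the 4-5% figure. The point is the order of magnitude: single digits, not the 75% headline.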
Why Does Being Concise Actually Improve LLM Accuracy?
A research paper titled "Brevity Constraints Reverse Performance Hierarchies in Language Models," published in early March 2026, found that forcing large models to produce brief responses improved their accuracy by up to 26 percentage points and closed performance gaps with smaller models by up to two-thirds. This is the part that makes Caveman more than a meme.
The researchers evaluated 31 open-weight models across 1,500 problems. What they found broke the intuition that bigger is always better:
- On nearly 8% of problems, larger models underperformed smaller ones by 28 percentage points
- A 2-billion-parameter model outperformed a 400-billion-parameter model multiple times
- Brevity constraints flipped the hierarchy — bigger models went from losing to winning once forced to be concise
The mechanism they identified: spontaneous scale-dependent verbosity that introduces errors through over-elaboration. Large models talk too much and reason themselves into wrong answers.
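A brevity constraint is nothing more than a prompt-level instruction; there's no fine-tuning involved. Here's a minimal sketch of applying one yourself. The suffix wording is my own stand-in, not the phrasing the researchers used:

```python
def with_brevity_constraint(question: str, max_words: int = 50) -> str:
    """Append a brevity instruction to a prompt.

    The exact wording is a placeholder; the mechanism is the same
    either way: deny the model room to over-elaborate.
    """
    return (
        f"{question}\n\n"
        f"Answer in at most {max_words} words. No preamble, no hedging."
    )

prompt = with_brevity_constraint(
    "Why do trained models become verbose?", max_words=30
)
print(prompt)
```

You'd pass the wrapped prompt to whatever model API you use; the constraint travels with the question.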
What Is Overthinking in Large Language Models?
Overthinking is the phenomenon where large language models generate excessively verbose responses that obscure correct reasoning, accumulating errors the longer they elaborate. The paper's exact phrase: "the learned tendency toward thoroughness becomes counterproductive, introducing error accumulation."
Instead of giving you the right answer and stopping, a model trained to be thorough keeps going. It adds context, hedges, re-derives, considers alternatives — and somewhere in that extra text, the reasoning goes off the rails. The answer it lands on is worse than the one it would have given at sentence two.
Brevity constraints help large models dramatically while barely affecting smaller models. Small models already say less. Big models have room to fall, and forcing them to be brief removes that room.
Why Do Large Language Models Become So Verbose in Training?
The paper points to reinforcement learning from human feedback as the likely cause — humans grading model outputs tend to prefer more verbose, more thorough answers, which trains larger models toward verbosity even when shorter responses are more accurate. When Opus 4.6 or any frontier model gets trained, part of the process involves humans ranking multiple outputs. "I like this one better than that one."
If humans consistently rank the longer, more detailed answer higher — which they tend to — the model learns that verbosity is a winning strategy. That reward signal sticks even when the extra words make the model wrong.
The learned preference for thoroughness is a training artifact, not a capability. And it can be reversed just by telling the model to be concise. No weight updates. No architectural changes. Just a prompt.
Does Caveman Apply to Frontier Models Like Opus 4.6 and Codex 5.4?
The paper tested open-weight models, not Opus 4.6 or Codex 5.4, so we don't have direct benchmark evidence that the effect is as extreme on frontier models. But the general pattern usually carries over to some degree: in my experience, findings about open-weight model behavior tend to repeat, less dramatically, on closed-weight frontier models.
Translation: Caveman probably won't give you a 26-point accuracy jump on Opus 4.6. But a measurable improvement on certain tasks is plausible, and the downside is essentially zero.
That's why I think this matters beyond the token savings. Even if your accuracy only moves a few points on straightforward questions, you're getting that lift for free — alongside a 4-5% token reduction.
Should You Actually Use Caveman?
Yes. It's one line to install, there's no real downside, and the combined benefit (4-5% token savings plus potential accuracy lift) makes it worth running as a default. Especially if you aren't on a Max 20x plan and every token matters.
Even if you hate the caveman aesthetic, the research points to a broader takeaway: add a line to your CLAUDE.md that says "be concise, no filler, straight to the point, use fewer words." That's the Caveman idea in a softer form, and it captures most of the upside without the Neanderthal vibe.
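In CLAUDE.md terms, that softer form might look like this (the wording is mine; adjust to taste):

```markdown
## Response style
- Be concise. No filler, no preamble.
- Straight to the point; use fewer words.
- Do not restate the question or recap what you just did.
```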
How Do You Install and Use Caveman?
Install Caveman with a single command from the GitHub repo, then invoke it with /caveman, or by saying "caveman mode," "talk like a caveman," or "less tokens please." It respects intensity levels:
- Lite — mild compression, still readable
- Full — default caveman register
- Ultra — pure Neanderthal, barely grammatical
Things the skill leaves alone:
- Code generation (identical output either way)
- Error messages (quoted exactly)
- Tool calls and tool outputs
- Anything under the hood that affects reasoning
Only the prose responses in your terminal change. This is a presentation-layer tweak, not a reasoning-layer one.
FAQ
Will Caveman actually change how Claude Code reasons or writes code?
No. The skill only compresses the prose responses Claude Code writes to your terminal. Code generation, tool calls, error messages, and all under-the-hood reasoning are identical with or without Caveman. The performance benefits from the research paper are a byproduct of Claude's response being shorter, not of any deeper behavior change.
Is Caveman actually saving 75% of my tokens?
No. The 75% figure only applies to the prose portion of output tokens, which is a small slice of a full session. Realistic savings are around 4-5% of total tokens per session. That's useful — it compounds over a week of heavy usage — but it's not the huge cut the repo implies.
What is the brevity-constraints paper about?
The paper, "Brevity Constraints Reverse Performance Hierarchies in Language Models" (March 2026), evaluated 31 open-weight models on 1,500 problems and found that forcing large models to be brief improved accuracy by up to 26 percentage points. In nearly 8% of cases, smaller models with 100x fewer parameters outperformed larger ones — until brevity constraints flipped the hierarchy back.
Does this mean smaller models are secretly better than larger ones?
Not exactly. It means larger models have a verbosity problem from RLHF training, and that problem can reverse the expected scaling advantage on certain tasks. Remove the verbosity (by constraining the model to be brief), and the larger model's extra capacity shows up again. The takeaway is about prompting, not model sizing.
What's the easiest way to apply this without installing Caveman?
Add a line to your CLAUDE.md that says something like "be concise, no filler, straight to the point, use fewer words." You'll capture most of the accuracy and token benefits without committing to the caveman style. For users who want the full compression, the Caveman skill is a single-line install.
If you want to go deeper into token optimization and Claude Code workflows, join the free Chase AI community for templates, prompts, and live breakdowns. And if you're serious about building with AI, check out the paid community, Chase AI+, for hands-on guidance on how to make money with AI.


