The bitter lesson: why your AI systems need to get simpler, not smarter

Most organisations building with AI make the same mistake. They add complexity. More steps in the prompt, more scaffolding around the model, more hard-coded logic in retrieval pipelines. It makes sense. You solve a problem, add a rule, and move on to the next.

The problem is that models move faster than rules.

Step change, not upgrade

Anthropic recently leaked information about Claude Mythos, the first model trained on Nvidia's GB300 chips. Security researchers were given access before the public release. One of the most experienced in the world showed that Mythos immediately found zero-day vulnerabilities in Ghost, an open source project with 50,000 stars on GitHub that had never had major security issues. Vulnerabilities that the best human researchers had missed.

That is not an incremental improvement. That is a step change.

And it is precisely the kind of change that forces us to reconsider how we build systems around AI. Matt Shumer, who tracks the AI field closely, calls it "the bitter lesson" and describes it in a video well worth watching. The core of his argument: every time models get sufficiently better, it turns out that simpler systems outperform complex ones. All the scaffolding we built becomes a liability, not an asset.

It is counterintuitive. But it holds.

Four things to audit now

Shumer highlights four specific areas that are likely to break during a step change. They deserve serious attention.

Prompt scaffolding. Most production systems have system prompts running thousands of tokens. Half of it is procedural: classify intent, check for hallucinated URLs, validate format. Ask yourself, line by line: does this instruction exist because the model needs it, or because I needed the model to need it? Anthropic's own recommendation is unambiguous: only add complexity that demonstrably improves outcomes. OpenAI's Codex guide says the same thing: describe what you need, not how.

Retrieval architecture. We have invested enormous logic in how we steer retrieval. Chunk sizes, re-ranking, alpha parameters for hybrid search. But smarter models need less of it. With context windows stretching to millions of tokens, the job shifts from micromanaging retrieval to presenting a well-organised dataset and letting the model find what it needs. This does not mean RAG is dead. It means our part of the work moves upstream, towards data quality and accessibility.

Hard-coded domain knowledge. Count the rules you feed the model. How many of them can it infer from context on its own? If you have a ten-line writing instruction but the model can pick up the style from a single example, those ten lines cost more than they deliver. They consume context window space and risk over-constraining the model.

Evaluation pipelines. If you build software with AI and have manual review steps in the middle of the flow, consider whether you could instead have a single, thorough evaluation gate at the end. Models produce code that is close enough to correct often enough that intermediate checks frequently cost more than they save. Write a comprehensive eval script that tests functional requirements, non-functional requirements and edge cases. And trust it.

Outcome specs, not process specs

What unites all four points is a movement from how to what. Instead of describing the process step by step, describe the desired end result and the hard constraints that apply regardless of how the model gets there.

A customer service example makes it clear. The process version: classify intent into one of 14 categories, route to the appropriate handler, retrieve the top five articles using hybrid search, generate a response based solely on retrieved context. The outcome version: resolve the customer's issue using our knowledge base, our policies and the account history. The customer should leave satisfied. The resolution should comply with our return policy.

The difference is enormous. The process version was necessary with weaker models. With a Mythos-class model, it is likely a liability that limits the model's room to solve the problem effectively.

In return, it requires you to get better at defining constraints: things that must always hold, regardless of how the model reaches the goal. Never disclose customer financial data. Always verify refund eligibility against policy. Those rules survive model upgrades. Process rules do not.

What it means strategically

The step change has an economic dimension. Mythos will likely only be available on premium plans initially. This creates an asymmetry: those who invest in better models, and know how to use them, will pull ahead. Not because they have better talent, but because they have better tools and simpler systems that give those tools room to work.

Moreover, Mythos is not alone. Google, OpenAI and other hyperscalers are training on the same generation of hardware. We will see similar leaps within months. This is not an Anthropic phenomenon. It is a capability jump across the entire industry.

And it extends beyond technical systems. Shumer makes an interesting point about "under the desk software", the tools that non-technical employees build for themselves using AI. With stronger models, that category becomes increasingly sophisticated. Applications that previously required a developer can now be built by specifying a need in plain language. That places new demands on IT organisations: how do you maintain it? How do you set guardrails without choking productivity?

The bitter lesson is not about doing something wrong. It is about the fact that what was right in the previous model generation is not right in the next one. Every round of improvement demands that we simplify. Every time we believe our scaffolding, our process, our way of doing things adds value, we need to test whether that still holds.

It is uncomfortable. That is the point.

*Based on Matt Shumer's analysis of Claude Mythos.*