Beyond the Bitter Lesson - A Deeper Understanding of AI Innovation (Part I)

Author: Jacob Sussmilch

Introduction: The Bitter Lesson and Its Foundations

In 2019, the Canadian-American computer scientist Rich Sutton published an essay on a plain HTML page. No images. No styling. Just black text on a white background. The essay was called The Bitter Lesson, and it told the story of a repeating pattern in AI: human experts carefully embedding their knowledge into systems, only to be defeated by simpler approaches that utilise more computation.

The pattern repeats across decades. Human experts spend years encoding their knowledge into AI systems—teaching chess programs about pawn structures, feeding speech recognition systems with phonetic rules, crafting vision algorithms with edge detectors. Then someone builds a simpler system that just uses more computation. The simple system wins. The human knowledge becomes obsolete.

Sutton’s message was direct: stop encoding human analysis of problem domains into AI systems. Stop building chess engines with handcrafted opening books and endgame tables. Stop designing vision systems around edge detectors that engineers think are important. Stop building speech recognition with phonetic rules linguists developed.

Instead, invest in the two methods that scale arbitrarily with computation: search and learning. Build systems that can absorb data and find patterns through massive computation rather than following rules derived from human expertise. Let the machines discover what humans might never notice or articulate.

The researchers who bet on this approach—general methods that scale with computation—always win. The researchers who bet on encoding human domain knowledge always lose. This is the bitter lesson. Bitter because the researcher’s hard-won domain expertise is, in the long run, a liability rather than an asset. The knowledge still matters — it tells you what to study next, where to point the computation. But the knowledge itself doesn’t go into the system. The system finds its own.

The essay went viral among AI researchers. It felt true. AlphaGo had crushed the world’s top Go players in 2016 using neural networks and Monte Carlo tree search, and by 2017, AlphaGo Zero had gone further—achieving superhuman play through pure self-play with no human game data at all. Deep learning was steamrolling decades of careful feature engineering. The bigger you made the model, the better it got. Every time.

Sutton was right. His analysis of the pattern he observed—across chess, Go, speech recognition, computer vision—is historically accurate and the core insight remains sound: don’t hardcode your understanding of the problem domain into the system’s architecture. Build general methods. Let computation do the work.

However, the essay has also become scripture. Something about its plain presentation—the utilitarian HTML, no-frills delivery—gives it an aura of timeless authority. New entrants to the field read it, hear truth, and stop there. “Scale compute, use general methods, don’t embed domain knowledge” becomes the whole lesson, and the nuances of what actually drives AI innovation go unexamined. The bitter lesson becomes a thought-terminating cliché—a reason not to think harder about the problem, rather than a starting point for deeper analysis.

The Bitter Lesson and Its Limits

Sutton’s framing is underspecified — “leveraging computation” is loose enough that virtually any winning strategy can be retroactively claimed as an instance of it. The claim is falsifiable in principle, but too vague to do predictive work. It identifies the destination without mapping the route.

This is not a criticism of Sutton. Writing in 2019, he couldn’t have predicted inference-time reasoning, mixture-of-experts architectures, or PTX-level hardware optimization — nor should we expect him to. The essay was correct for its time, and its core insight remains sound. The problem is not with the essay but with how the field holds it: as scripture rather than starting point. The landscape has grown far more complex than “scale compute, use general methods” can navigate, and treating the bitter lesson as the whole lesson means missing the nuance that now determines who actually wins.

What does “leveraging computation” actually mean in practice? What determines which innovations matter at any given moment? And is there a framework that can help us understand not just that computation matters, but how to direct the ingenuity required to use it effectively?

To answer those questions, we need to press on the phrase that does the most work in Sutton’s argument: “leveraging computation.” The more you examine it, the less clear it becomes.

Take chain-of-thought prompting. When you ask a model to “think step by step” or “show your reasoning,” you’re making it generate more tokens—literally using more inference compute. The model takes longer to respond, burns more GPU cycles, costs more per query. By Sutton’s framework, this should be “leveraging computation” and therefore good.

But wait—isn’t chain-of-thought also encoding human knowledge about how reasoning works? We’re telling the model “humans reason step-by-step, so you should too.” That’s exactly the kind of “building in how we think we think” that Sutton warns against. Yet it demonstrably works, especially for complex reasoning tasks.

So which is it? Is chain-of-thought a method that leverages computation (more inference tokens, better performance) or is it encoding human knowledge (here’s how humans break down problems)? The same ambiguity runs through self-consistency sampling, tree-of-thought, tool use — virtually every modern technique that uses more compute while also encoding a human insight about how to use it effectively.
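The ambiguity can be made concrete. Below is a minimal sketch of self-consistency sampling: more inference compute (k sampled answers) combined with a human insight (majority voting). The sampler is a stand-in for a stochastic model, and its 70% accuracy is an assumed figure for illustration only.

```python
import random
from collections import Counter

def sample_answer(question, rng):
    # Stand-in for one stochastic chain-of-thought rollout. A real model
    # would generate a reasoning trace and then an answer; this toy
    # sampler is simply right 70% of the time (an assumed figure).
    return "42" if rng.random() < 0.7 else str(rng.randint(0, 9))

def self_consistency(question, k, seed=0):
    # More compute (k rollouts) plus a human insight (majority vote).
    rng = random.Random(seed)
    votes = Counter(sample_answer(question, rng) for _ in range(k))
    return votes.most_common(1)[0][0]

print(self_consistency("What is 6 * 7?", k=25))
```

Raising k buys accuracy with pure inference compute, yet the voting scheme itself is an encoded human insight about how to spend that compute. Both readings of the technique are right at once.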

And chain-of-thought is only one example. “Leveraging computation” now covers architectural efficiency (Multi-head Latent Attention, Mixture of Experts), hardware-level optimization (PTX programming, Flash Attention), quantization (FP32 to INT4), training methodology (curriculum learning, data mixing), and inference-time reasoning (chain-of-thought, self-consistency). Every one of these is “leveraging computation.” None encode domain knowledge in the way Sutton warned against — nobody is hand-coding chess openings or phonetic rules. Yet they’re radically different strategies requiring radically different expertise, and Sutton’s framework treats them as a single category.

The dichotomy Sutton presents — human knowledge versus leveraging computation — breaks down when you examine modern techniques closely. The problem with the old approach of building systems on human domain knowledge wasn’t that human knowledge is useless. It was that the complexity of the raw data got reduced to whatever structure facilitated the encoding of rules, and each feature required a human to manually map relationships. Deeper patterns — patterns the human didn’t think to look for — were lost. The solution has been to build systems that take in sanitised raw data as granularly as possible, map it with precision, and predict the next data point — rather than derive which rule to follow. But how those systems get built, trained, optimised, and deployed involves human ingenuity at every layer.

Sutton’s own conclusion is more nuanced than the popular reading suggests. He doesn’t say “build in nothing.” He says: “we should build in only the meta-methods that can find and capture this arbitrary complexity. We want AI agents that can discover like we can, not which contain what we have discovered.” The distinction is between building in discoveries (chess opening theory, phonetic rules, edge detection algorithms) versus building in methods for discovering (search, learning, reasoning and the architectures that enable them).

Chain-of-thought sits right on this boundary—and exposes how unstable it is. Is “reason step-by-step” a discovery about how cognition works, or a meta-method for discovering solutions to novel problems? It’s clearly both. Step-by-step decomposition is a human insight about problem-solving (a discovery), but it also functions as a general-purpose search procedure that scales with inference compute (a meta-method). Self-consistency sampling, tree-of-thought, and tool use sit on the same boundary: each encodes a human insight about how to search or learn more effectively.

The line between “discoveries” and “meta-methods for discovering” isn’t fixed — it depends on the level of abstraction. At a high enough level, every useful architectural choice is a meta-method. At a low enough level, every meta-method embodies specific discoveries about what works. Attention is both a meta-method for discovering relevant context and a discovery that weighted relevance beats fixed context windows. RLHF is both a meta-method for discovering aligned behavior and a discovery that human preference signals beat hand-written reward functions.

Sutton’s distinction between discoveries and meta-methods is useful as a rough heuristic—don’t hardcode your conclusions about the problem domain into the system’s architecture. But it breaks down precisely where the most interesting modern work happens: techniques that encode human insights about how to search and learn more effectively, which are simultaneously discoveries and meta-methods. The most effective methods often leverage computation AND incorporate insights about how to use that computation productively. Training versus inference compute allocation? Architectural efficiency? Quantization strategies? Operating at the PTX level to squeeze more from existing hardware? Each of these requires both computational resources and human ingenuity about how to deploy them.

Three Ways Knowledge Enters General Problem Solvers

Sutton’s framework is underspecified about how to leverage computation. But it’s also underspecified about something more fundamental: how knowledge enters AI systems in the first place. His dichotomy — human knowledge versus general methods — assumes there are two categories. There are three, and the third is where much of the consequential modern work happens. Understanding this taxonomy is a prerequisite to building the richer framework we need, because the question of which strategies for leveraging computation actually work depends on understanding the mechanisms through which human ingenuity shapes these systems.

Sutton’s original essay uses phrases like “build knowledge into” agents and systems that “contain what we have discovered” — not “embed” or “encode.” The distinction matters, because the entire history of AI can be read as a series of attempts to build general problem solvers — systems that handle arbitrary domains rather than being purpose-built for one task. What changes across eras is not the ambition but the mechanism: how does knowledge get into the system? Three distinct approaches emerge, each corresponding to a different generation.

Structural and functional embedding. This is what Sutton was warning against, and it’s where the original General Problem Solver — Newell, Shaw, and Simon’s GPS from 1957 — sits. Knowledge is built into the system’s architecture itself — the structure is the knowledge. GPS didn’t learn how to decompose problems; it followed decomposition rules that Newell, Shaw, and Simon wrote down. A chess engine with handcrafted opening books doesn’t learn that the Sicilian Defense is strong; it contains that fact as a lookup table. A vision system with edge detectors doesn’t discover that edges matter; an engineer decided edges matter and wired that assumption into the processing pipeline. The system’s design presupposes what matters in the domain. You can’t separate what it knows from how it’s built. Every feature requires a human to identify what’s relevant and manually map the relationships. Deeper patterns in the raw data — patterns the human didn’t think to look for — are lost.

This approach dominated the first generation of general problem solvers, from the 1950s through the early 2010s, and it’s the approach that kept losing to simpler methods with more compute. Sutton was right about this.

Data embedding. The second generation of general problem solvers broke from structural embedding by separating the architecture from the knowledge. Neural networks provide a general structure — layers, weights, activation functions, attention mechanisms — and the knowledge lives in the parameters, learned from data. The system’s design doesn’t presuppose what matters; it discovers what matters through training. A convolutional neural network doesn’t contain edge detectors because an engineer thought edges were important; it develops edge-detecting features because edge detection is useful for the prediction task. The knowledge is in the weights, compressed and distributed across billions of parameters, fixed once training completes.

This is the shift Sutton celebrates. Data embedding scales with computation in a way structural embedding never could. More data, more parameters, more training compute — each reliably improved performance where adding more handcrafted rules never did. The quality of training data sets the ceiling: poor data embeds poor patterns, biased data embeds biased associations, high-quality diverse data embeds richer representations.

Data encoding. The third generation of general problem solvers — large language models and the agent systems built on top of them — introduced something Sutton’s framework has no category for. Knowledge enters the system at inference time through the context window — system prompts, few-shot examples, chain-of-thought instructions, tool definitions, conversation history. This knowledge isn’t in the architecture (structural embedding) and it isn’t in the weights (data embedding). It’s in the input. It’s ephemeral, dynamic, can be swapped without retraining, and shapes behavior as powerfully as the other two mechanisms.

A system prompt that specifies “you are a medical assistant — always recommend consulting a doctor for serious symptoms” is encoding domain knowledge. An instruction file that defines workflow rules, behavioral constraints, and tool-use protocols is encoding an elaborate architecture of behavior. Few-shot examples that demonstrate a task format are encoding task-specific patterns. None of this modifies the underlying model. All of it profoundly changes what the model does.
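The mechanism is easy to see in miniature. The sketch below assembles a context window from a system prompt, few-shot examples, and a user query; the bracketed role tags are illustrative stand-ins for the structured message formats real chat APIs use, not any particular API.

```python
def build_context(system_prompt, few_shot, user_query):
    # Data encoding: the knowledge lives in the input, not the weights.
    parts = [f"[system] {system_prompt}"]
    parts += [f"[user] {q}\n[assistant] {a}" for q, a in few_shot]
    parts.append(f"[user] {user_query}")
    return "\n".join(parts)

ctx = build_context(
    system_prompt=("You are a medical assistant. Always recommend "
                   "consulting a doctor for serious symptoms."),
    few_shot=[("I have a mild headache.",
               "Rest and hydration may help; see a doctor if it persists.")],
    user_query="I have sharp chest pain.",
)
print(ctx)
```

Swapping the system prompt or the examples changes behavior instantly, with no retraining: the defining property of this third category.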

Why Three Categories Matter

Sutton’s bitter lesson is a story about the first category losing to the second. Structural embedding consistently lost to data embedding. That pattern was real, robust, and repeated across decades and domains.

But the emergence of the third category complicates the narrative. Data encoding reintroduces human knowledge into general problem solvers, and it works. An elaborate system prompt is human knowledge shaping system behavior — exactly the kind of thing Sutton warned against. Yet it doesn’t suffer from the failure mode he identified. Why?

Because the failure mode of structural embedding was rigidity. When knowledge is in the architecture, changing what the system knows requires rebuilding the system. When an edge detector is wired into a vision pipeline, you can’t discover that texture matters more than edges without redesigning the pipeline. The knowledge and the structure are fused.

Data encoding doesn’t have this problem. The knowledge is in the input, not the architecture. You can change the system prompt without retraining. You can swap behavioral specifications between requests. You can experiment with different encodings at negligible cost. The general-purpose architecture that Sutton advocated for — a system that learns from data — is precisely what makes data encoding possible. You need a powerful, general model (built through data embedding) before encoding knowledge through prompts becomes useful. The quality of data embedding sets the ceiling of what the system can do; the quality of data encoding determines how close to that ceiling it operates.

These three categories are not hermetically sealed. RLHF, for instance, uses human preference signals during training to permanently shape weights. The preference data resembles encoding — it’s structured human judgment about what good outputs look like — but it enters the system through the training pipeline and gets compressed into parameters. It’s data embedding that happens to use a richer signal than next-token prediction. System prompt distillation works in the opposite direction: models are trained to behave as if a system prompt is present even when none is provided, baking inference-time behavior into weights. Constitutional AI uses written principles as training inputs — explicit rules that get absorbed into implicit patterns.

Modern systems use all three simultaneously, with knowledge flowing between them. What matters is where human knowledge enters and how durable that entry point is. A system prompt is ephemeral and swappable. RLHF preferences are permanent once training completes. An architectural choice like attention is permanent and structural. Different intervention points, even when the downstream effects look similar.

That said, agent frameworks now ship with detailed instruction files — behavioral specifications, workflow rules, tool-use protocols, domain constraints — that function as persistent architectures of behavior injected at inference time. When a system prompt runs to thousands of tokens of carefully structured behavioral rules, it starts to resemble the elaborate rule-based systems Sutton warned against, just delivered through the context window rather than hardcoded into weights. The mechanism is different (encoding, not embedding), but the practical effect — human knowledge shaping system behavior through explicit rules — is strikingly similar.

From the user’s perspective, the resemblance is already complete. A refusal, a tone, a reasoning style — was it structural (the architecture constrains it), embedded (the training shaped it), or encoded (a hidden prompt directs it)? The user can’t tell. The effect is identical. The distinction between the three categories is a developer-side concern — useful for understanding where to intervene, but invisible at the point of interaction.

Modern general problem solvers use all three simultaneously. A transformer’s attention mechanism is structural embedding — an architectural bet that weighted relevance matters. Its weights are data embedding — patterns learned from training. Its system prompt is data encoding — human knowledge injected at inference time. Sutton’s framework addresses only the relationship between the first two. The third is where much of the consequential work now happens, and it requires its own analysis.

This is why we need a deeper framework. Not to reject Sutton’s insight that computation matters, but to understand that how we leverage computation requires continuous innovation across the entire stack — and that innovation itself is what drives progress. What “leveraging compute” meant in 2019 looks nothing like what it means in 2025. The hardware architectures change. The algorithmic breakthroughs shift what’s possible. The training methodologies unlock different scaling regimes. The target keeps moving.

The Shifting Landscape: From 2019 to 2025

In October 2022, the United States imposed export controls restricting China’s access to advanced AI chips — the H100s and A100s powering frontier models. The controls tightened through 2023. Meanwhile, American labs were building out: OpenAI training GPT-4 on tens of thousands of A100s, Google assembling massive TPU clusters for Gemini, Anthropic securing Amazon’s Trainium chips for Claude. The compute gap widened dramatically.

If Sutton’s lesson were simply “more compute wins,” the story would end here. Chinese labs, starved of hardware, fall behind. But Sutton was more careful than that — he said “leveraging computation,” not “having the most computation.”

Then in late 2024, DeepSeek released V3. A 671-billion-parameter model trained for approximately $5-6 million—a fraction of what GPT-4 or Claude cost. Running on inferior hardware that US labs wouldn’t even consider using. And it competed. Head-to-head benchmarks showed it matching or exceeding models trained with far more compute.

How? Multi-head Latent Attention (MLA), which dramatically reduced memory requirements. Mixture of Experts architectures that used parameters more efficiently. Group Relative Policy Optimization (GRPO), a reinforcement learning method that needs no separate value-model critic. And crucially, low-level optimizations at the PTX (Parallel Thread Execution) level that made their available GPUs perform significantly better than stock configurations. Not superior compute—superior methods for leveraging limited compute.
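The group-relative idea at GRPO’s core fits in a few lines. For each prompt, several completions are sampled, and each completion’s advantage is its reward standardized against the others in its group, substituting for a learned per-token baseline. A minimal sketch of that normalization step only; the rewards are placeholder numbers.

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards, eps=1e-8):
    # Each completion's advantage is its reward relative to the group:
    # the group mean plays the role a learned critic would otherwise fill.
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Four completions for one math prompt, scored 1/0 by a rule-based
# checker (placeholder rewards for illustration).
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

These advantages then weight the usual clipped policy-gradient update; the sketch covers only the baseline substitution that lets the pipeline drop a separate value network.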

In one sense, DeepSeek supports Sutton: they won by leveraging computation more effectively, not by encoding domain knowledge into the model. But in another sense, the story exposes what Sutton’s framework can’t explain: how they knew where to innovate, why PTX-level optimization mattered more than acquiring more hardware, and what guided the specific architectural bets that paid off. “Leverage computation” is the correct answer at every level of resolution except the ones that matter.

The Scaling Laws: A Real Discovery That Became a Distraction

In 2019, Sutton’s advice pointed in one direction. Neural networks are made of composable building blocks — neurons, layers, connections. Make them more performant, have them interface more efficiently, scale them up. Train them to predict the next word, next pixel, next move. Add more compute. Win.

The interpretation was simple because the landscape was simple. Moore’s Law still drove compute gains. GPU parallelism kept improving. Training bigger models worked reliably. And by 2020, the intuition had been formalized: Kaplan et al. published neural scaling laws showing that model performance improved as a predictable power law of model size and data. Chinchilla later refined this, establishing compute-optimal ratios for allocating a fixed training budget between parameters and tokens. The Bitter Lesson wasn’t just a historical pattern anymore — it looked like a law of physics.
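The formalization is worth seeing concretely. The sketch below uses the widely cited parametric fit from the Chinchilla paper (the constants are the published estimates, used here purely for illustration) and grid-searches how a fixed training budget of C ≈ 6·N·D FLOPs should be split between parameters N and tokens D.

```python
# Chinchilla-style parametric loss, L(N, D) = E + A/N^alpha + B/D^beta.
# Constants are the commonly cited Hoffmann et al. fits (illustrative).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def compute_optimal(budget_flops, steps=2000):
    # Sweep model sizes on a log grid; tokens are whatever the budget
    # C ~= 6*N*D still affords at that size. Return the best split found.
    best = None
    for i in range(1, steps):
        n = 10 ** (6 + 6 * i / steps)     # N swept from ~1e6 to ~1e12
        d = budget_flops / (6 * n)        # tokens affordable at this size
        cand = (loss(n, d), n, d)
        if best is None or cand < best:
            best = cand
    return best

l, n, d = compute_optimal(1e23)  # an illustrative frontier-scale budget
print(f"N ≈ {n:.2e} params, D ≈ {d:.2e} tokens, D/N ≈ {d/n:.0f}")
```

The point is the shape of the result, not the exact numbers: the laws say where on the curve a given budget lands, while the argument that follows is about changing what that budget buys.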

The Conceptual Error

There’s a critical distinction here that the field largely failed to make (explored in more detail in What is Scaling?). The scaling laws were never about scaling compute. They were about scaling models — more parameters, more data, predictably lower loss. Compute appears as a variable in these laws, but it’s the cost of producing a given model, not an independent lever. Training compute is what you spend to reach a point on the scaling curve; it doesn’t define the curve itself. This may be precisely what OpenAI got wrong, and why it is losing this race.

Scaling compute — making each unit of compute cheaper, faster, or more efficient through hardware improvements and software optimizations — is a separate concern entirely. Moore’s Law, GPU parallelism, Flash Attention, mixed-precision training, kernel fusion — these scale the available compute. They determine which models are economically feasible to train. But they don’t change the scaling laws; they change what point on the scaling curve you can afford to reach. This is supply-side: it enables model scaling, it doesn’t constitute it.

Because training compute is the cost of realizing a given model, the field collapsed these two things into one. “Scaling” became a single activity — raise capital, acquire GPUs, move along the known curve. The distinction between what the model should be and what it costs to build it disappeared.

The Strategic Error

That conceptual collapse enabled a strategic one. Sutton said invest in researching general methods that leverage computation. Scaling laws gave the field something else: a quantified, predictable relationship between model size, data, and performance — a legible path you could plot, raise capital against, and execute. The research question shifted from “what are better methods?” to “how do we get more GPUs?” The hard problem looked solved. We have the method. Now we just need to feed it.

Researching novel architectures, low-level hardware optimizations, more efficient attention mechanisms — those don’t have predictable returns. You can’t show investors a power law for “what if we try a fundamentally different approach to key-value caching.” So the field chose the expensive legible path over the cheaper illegible one, optimizing resource acquisition over research.

The field didn’t misread Sutton. They stopped doing what he said.

Then DeepSeek demonstrated how much room the research path still had. They didn’t discover new scaling laws or prove that the relationship between model size, data, and performance works differently than Kaplan or Chinchilla described. What they proved is that the cost of reaching a given point on the scaling curve is not fixed. MLA, Mixture of Experts, GRPO, PTX-level optimization — these innovations made existing scaling laws accessible at a fraction of the expected cost. They reached points on the curve that their hardware budget shouldn’t have allowed, not by changing the curve but by dramatically reducing what it costs to travel along it.

American labs treated compute scaling as a procurement problem: buy more GPUs, move further along the known curve. DeepSeek treated it as a research problem: innovate across the entire stack so the same GPUs take you further.

The Limits of “Sutton Was Right”

It’s tempting to say Sutton was right all along — the field just wasn’t doing what he said. DeepSeek researched where others procured. But this reads Sutton more generously than his actual argument warrants.

Sutton’s specific prescription was search and learning — two general methods that scale with computation. DeepSeek’s PTX-level optimizations are not search. They are not learning. They are low-level systems engineering — hand-tuning GPU instruction scheduling to squeeze more throughput from memory-bound operations. MLA is an architectural innovation in how attention mechanisms handle key-value pairs. GRPO is a training methodology that eliminates a component (the value-model critic) from the reinforcement learning pipeline. None of these are instances of “search and learning” in any meaningful sense. They are instances of engineering ingenuity applied to the problem of making search and learning cheaper.

If the lesson is “research beats procurement,” that’s true, but it’s a weaker and different claim than “search and learning beat domain knowledge.” Sutton was arguing about what goes into the model. DeepSeek’s advantage was in everything around the model — the infrastructure, the architecture, the training pipeline, the hardware utilization. The bitter lesson doesn’t have a category for “the team that writes better CUDA kernels wins.”

This matters because the field needs more than a corrected reading of Sutton. It needs a framework that accounts for the full surface area of innovation — not just what the model learns, but how the model is built, trained, optimized, and deployed. The bitter lesson covers the first. The rest is where the actual competition now happens.

But watch what happens when we apply Sutton’s principle — research general methods that leverage computation — today. Every interpretation below is a valid reading. None of them agree on what to do.

What should the model be?

  • Build larger models with more parameters — move further along the scaling curve
  • Train on more data with better curation — embed richer knowledge into the same architecture
  • Build mixture-of-experts systems that activate conditionally — more total capacity, but sparse, with only a fraction of parameters active per token

How do we afford to build it?

  • Build more efficient architectures that do more with less — reach the same point on the curve at a fraction of the cost
  • Build specialized hardware that changes what a unit of compute means — redefine the building blocks at the silicon level
  • Reduce numerical precision so the same hardware runs larger models faster — trade precision for throughput, redefining what each operation costs
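The precision lever in the last bullet is simple enough to sketch. Symmetric per-tensor INT4 quantization maps every weight to one of 15 signed levels with a single scale; real deployments use per-channel or per-group scales and zero-points, so this is only the bare idea.

```python
def quantize_int4(weights):
    # One scale for the whole tensor; values snap to integers in [-7, 7].
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.31, -1.2, 0.05, 0.9]
q, s = quantize_int4(w)
w_hat = dequantize(q, s)
print(q, [round(x, 3) for x in w_hat])
```

Each weight now costs 4 bits instead of 32, an 8x memory reduction, at the price of rounding error bounded by half the scale. Deciding where that trade is safe is exactly the kind of judgment the bullet describes.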

How much should we spend answering this specific question?

  • Build models that think longer at inference time — a third axis entirely, trading deployment cost for accuracy on a given query, where the base model sets the ceiling and inference-time compute determines how close to that ceiling the system operates

Three fundamentally different questions. Three different investment theses. Three different teams with different expertise. And they compete for the same finite resources — a lab betting heavily on inference-time reasoning is not simultaneously optimizing PTX-level hardware utilization. A team scaling to a trillion parameters is not the same team redesigning attention mechanisms to make smaller models punch above their weight. These are rival allocations of talent, capital, and institutional focus. Yet Sutton’s framework calls them the same thing — general methods that leverage computation — and offers no way to choose between them.

Sutton told us to invest in search and learning—general methods that scale with computation. He didn’t specify which architectures enable that scaling. He didn’t tell us that discovering better architectures would be the hard part. He didn’t tell us that scaling efficiently isn’t automatic—it requires architectural innovation to create systems that can utilise additional compute effectively.

The principle is correct but underspecified. It rules out a real class of approaches — don’t hardcode domain knowledge — and that’s genuinely directional. But once you’re inside the space of general methods that leverage computation, it offers no guidance on which of these competing bets to make, or how to allocate finite resources between them.

Here’s what we now understand that wasn’t obvious in 2019:

  • The building blocks themselves evolve. Self-attention was a better building block than recurrent connections. Multi-head Latent Attention is a better building block than standard attention. Tomorrow’s architecture might be something we haven’t discovered yet. Sutton would rightly call this “researching meta-methods”—but his framework doesn’t help you decide which meta-methods to research, or where the next architectural breakthrough will come from.
  • Scaling efficiency isn’t uniform. Some architectures scale beautifully to 100 billion parameters. Others hit diminishing returns at 10 billion. Some methods work better with inference-time compute. Others need it all at training time. The assumption that systems automatically become more effective with more compute is false; effective scaling requires careful architectural design.
  • Prediction tasks come in infinite varieties. Predict the next token. Predict the masked token. Predict which answer humans prefer. Predict the next reasoning step. Predict which tool to call. Each prediction task teaches different capabilities. The choice of what to predict determines what the model learns. This is a design decision that requires domain insight.
  • FLOPs are an increasingly incomplete proxy for actual compute. ML researchers have long known that raw FLOP counts miss critical factors — memory bandwidth, MFU (model FLOP utilization), numerical precision, inter-device communication overhead. But the gap between FLOPs-on-paper and real-world performance has widened as hardware has diversified. The same matrix multiplication can differ by an order of magnitude in wall-clock cost depending on whether it runs on a TPU, a Tensor Core, or a Cerebras wafer, and on whether the implementation exploits memory hierarchies effectively. Sutton’s framework treats compute as a fungible quantity — more is better. In practice, “add more compute” requires specifying what kind, where in the system, and on what hardware — decisions that are architectural, not merely quantitative. We’ll examine this heterogeneity in detail in Part II.
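The gap between FLOPs-on-paper and delivered compute is usually summarized as MFU. A minimal sketch, using the standard ~6N FLOPs-per-token estimate for a dense transformer’s forward and backward pass; the throughput and peak figures below are assumptions, not measurements.

```python
def mfu(tokens_per_sec, n_params, peak_flops_per_sec):
    # Achieved training FLOP/s (~6*N per token, forward + backward)
    # as a fraction of the hardware's advertised peak.
    achieved = 6 * n_params * tokens_per_sec
    return achieved / peak_flops_per_sec

# Illustrative: a 7B dense model at 3,000 tokens/s per GPU against
# ~312 TFLOP/s of BF16 peak (A100-class; assumed figures).
print(f"MFU ≈ {mfu(3_000, 7e9, 312e12):.0%}")
```

Two systems with identical FLOP budgets can easily differ by 2x in MFU; the difference lies in memory hierarchy, communication, and kernel quality, which is exactly where PTX-level work pays off.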

Sutton’s principle is correct: general methods that leverage computation through search and learning outperform hand-crafted rules. But it’s a constraint, not a strategy. A necessary condition, not a sufficient one. It still leaves a massive design space to explore—and offers no guidance on how to navigate it.


What Comes Next

The bitter lesson tells us what wins in the long run: general methods that leverage computation. It doesn’t tell us which bottleneck to attack, at which layer of the stack, with what kind of compute. It doesn’t even distinguish between the two fundamental questions the field faces: what model to build (how far along the scaling curve to aim) and how to afford building it (how to scale the compute required to get there). And as the landscape has fragmented — model scaling vs. compute scaling, training vs. inference, FP32 vs. INT4, dense vs. sparse, hardware vs. methodology — those are precisely the questions that determine who wins and who wastes resources.

The core of the answer is this: bottlenecks migrate through the stack, and innovation at one layer reprices the value of innovation at adjacent layers. When hardware was scarce, architectural innovation was premature. When parallelizable architectures arrived, hardware scale became the binding constraint. When scale became abundant, efficiency determined who could afford to deploy. Each solved constraint exposes the next — and the teams that correctly identify which layer of the stack currently offers the highest-return leverage are the ones that win. This is the framework Part II develops in detail, along with why the simple compute metrics the field relies on are increasingly disconnected from what actually drives performance.

And this says nothing yet about politics, coordination, and the supply chain: dynamics that drive where human attention gets focused and where conflicts of interest emerge. Later parts of this series take those up.