Beyond the Bitter Lesson
Author: Jacob Sussmilch
A series of essays examining what actually drives AI innovation, moving beyond Rich Sutton’s influential 2019 essay to develop a more complete framework for understanding how computation, architecture, and human ingenuity interact.
Part I: A Deeper Understanding of AI Innovation
Beyond the Bitter Lesson - A Deeper Understanding of AI Innovation (Part I)
Part I examines Sutton’s original claim — that general methods leveraging computation consistently outperform hand-encoded domain knowledge — and argues that while historically accurate, the framework has become underspecified as the field has evolved.
Core contributions:
- Three-category taxonomy of how knowledge enters AI systems: structural embedding (knowledge in the architecture), data embedding (knowledge in the weights), and data encoding (knowledge in the input/context window). Sutton’s bitter lesson describes the first category losing to the second, but the emergence of the third — system prompts, agent frameworks, inference-time instructions — complicates his prescription against building knowledge into systems.
- The discovery/meta-method boundary problem: Chain-of-thought, self-consistency sampling, and tool use all simultaneously encode human insights about problem-solving and function as general-purpose search procedures that scale with compute. Sutton’s distinction between “building in discoveries” and “building in meta-methods for discovering” breaks down at exactly the points where modern innovation is most active.
- The bottleneck migration pattern: AI progress follows a recognizable sequence — hardware-bound (pre-2012) → architecture-bound (2012-2017) → scale-bound (2017-2022) → efficiency-bound (2022-present) → methodology-bound (emerging). When a resource becomes abundant, the system’s performance becomes constrained by the next scarcest factor, and innovation concentrates there.
- The FLOP illusion: Compute is not fungible. Training compute differs from inference compute, precision levels trade accuracy for speed, memory bandwidth often matters more than FLOP count, and different hardware architectures excel at different operations. “Add more compute” is no longer a strategy — it’s the starting point of a hundred design decisions.
- Cross-layer repricing: The AI stack (hardware → systems → architecture → training methodology → inference methodology) exhibits a dynamic where gains at one layer change the cost landscape at adjacent layers, explaining why bottlenecks migrate rather than remain fixed.
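The discovery/meta-method boundary above is easy to see in code: self-consistency is a few lines of human insight (sample many reasoning paths, then vote) whose accuracy nonetheless improves as sampling compute grows. A minimal sketch, with a hypothetical noisy `sample_answer` stub standing in for a real model call:

```python
import random
from collections import Counter

def sample_answer(question: str, rng: random.Random) -> str:
    """Stand-in for one sampled chain-of-thought ending in an answer.
    A real system would call a language model at temperature > 0.
    Hypothetical noisy solver: right answer ("42") 60% of the time."""
    return "42" if rng.random() < 0.6 else str(rng.randint(0, 99))

def self_consistency(question: str, n_samples: int, seed: int = 0) -> str:
    """Sample n reasoning paths and return the plurality-vote answer.
    More samples -> more compute -> higher accuracy: a 'general method'
    that nonetheless encodes a human insight about aggregation."""
    rng = random.Random(seed)
    votes = Counter(sample_answer(question, rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]
```

Even though each individual sample is unreliable, the plurality vote converges on the modal answer as the sample budget grows, which is exactly why the technique resists classification as either "built-in discovery" or "pure search."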
Key case study: DeepSeek V3 — a 671B-parameter model that matched frontier models despite being trained for ~$5-6M on export-restricted hardware, demonstrating that superior methods for leveraging limited compute can outperform superior compute paired with standard methods.
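The FLOP illusion has a standard quantitative form, the roofline model: an operation is memory-bound when its arithmetic intensity (FLOPs per byte moved) falls below the hardware's balance point, and no amount of extra FLOP capacity helps it. A sketch with approximate, illustrative A100-class figures (the specific numbers are assumptions for the example, not vendor specifications):

```python
def roofline_bound(flops: float, bytes_moved: float,
                   peak_flops: float, peak_bandwidth: float) -> str:
    """Classify an operation as compute- or memory-bound.
    The 'ridge point' is peak_flops / peak_bandwidth (FLOPs per byte):
    below it, bandwidth is the wall and FLOP count is irrelevant."""
    intensity = flops / bytes_moved          # FLOPs per byte moved
    ridge = peak_flops / peak_bandwidth      # hardware balance point
    return "compute-bound" if intensity >= ridge else "memory-bound"

# Illustrative A100-class figures (approximate, for the sketch only):
PEAK_FLOPS = 312e12       # ~312 TFLOP/s BF16 tensor-core throughput
PEAK_BW = 2.0e12          # ~2 TB/s HBM bandwidth

# Large matmul: O(n^3) FLOPs over O(n^2) bytes -> high intensity.
n = 8192
matmul = roofline_bound(2 * n**3, 3 * n * n * 2, PEAK_FLOPS, PEAK_BW)

# Elementwise add: ~1 FLOP per 12 bytes -> starved for bandwidth.
add = roofline_bound(n, 3 * n * 4, PEAK_FLOPS, PEAK_BW)
```

The same FLOP budget spent on the second workload buys almost nothing, which is the sense in which "compute is not fungible."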
Part II: The Innovation Stack — Where Leverage Actually Lives
Beyond the Bitter Lesson - The Innovation Stack (Part II)
Maps the five layers of the AI stack (hardware, systems engineering, architecture, training methodology, inference methodology) in detail. Shows how breakthroughs propagate across layers through a repricing mechanism — FlashAttention (Layer 2) enabling longer contexts (Layer 3) enabling new training strategies (Layer 4). Develops the “cheapest leverage relative to return” idea into an actionable leverage gradient, and analyzes DeepSeek’s multi-layer strategy as the clearest case of compound cross-layer innovation.
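The repricing mechanism is easiest to see at FlashAttention itself: the kernel's core trick is an online softmax that streams over blocks of keys, so attention never materializes the full score matrix and context length stops being priced by quadratic memory. A single-query miniature of that idea (an illustrative sketch, not the actual kernel):

```python
import numpy as np

def streamed_attention(q: np.ndarray, K: np.ndarray, V: np.ndarray,
                       block: int = 64) -> np.ndarray:
    """Attention for one query vector, processing keys in blocks with a
    running (online) softmax, so the full score vector is never stored.
    This is the memory-saving trick behind FlashAttention, in miniature."""
    m = -np.inf                   # running max of scores seen so far
    denom = 0.0                   # running softmax denominator
    acc = np.zeros(V.shape[1])    # running weighted sum of values
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start + block], V[start:start + block]
        s = k_blk @ q                      # scores for this block
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)          # rescale old accumulators
        p = np.exp(s - m_new)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ v_blk
        m = m_new
    return acc / denom
```

The output is numerically identical to ordinary softmax attention; only the memory-traffic pattern changes — a Layer 2 change whose effect is felt at Layer 3.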
Part III: The Encoding Problem — When “General Methods” Aren’t General
Beyond the Bitter Lesson - The Encoding Problem (Part III)
Refines Part I’s “data encoding” category into three distinct mechanisms: ephemeral encoding (per-request prompts), persistent encoding (system prompts, agent frameworks), and compiled encoding (RLHF, distillation). Traces how knowledge moves along an encoding gradient from flexible-but-expensive to rigid-but-free. Proposes the encoding effectiveness principle — encoding helps when it preserves separability, granularity, and override capability — and examines whether modern agent frameworks are reproducing the same failure mode Sutton warned against, just at inference time.
Part IV: Prediction as the Universal Substrate
Beyond the Bitter Lesson - Prediction as the Universal Substrate (Part IV)
Argues that the choice of prediction target — not scale, not architecture, not efficiency — is the deepest form of leverage in AI. Traces the prediction target hierarchy (next-token → instruction-response → preference → reasoning outcome) and shows that each major capability jump corresponds to a prediction target innovation, not merely a scaling achievement. Examines why prediction target design is irreducibly theory-laden and surveys candidates for the next breakthrough targets.
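"Prediction target" is concrete at the level of the loss function: the same network can be pointed at a different target simply by swapping objectives. A minimal sketch contrasting next-token cross-entropy with a pairwise preference (Bradley-Terry) objective of the kind used in RLHF reward modeling; the function names and inputs are illustrative:

```python
import numpy as np

def next_token_loss(logits: np.ndarray, target_id: int) -> float:
    """Next-token target: cross-entropy of the true next token
    under the model's predicted distribution."""
    logits = logits - logits.max()                    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_id]

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Preference target (Bradley-Terry): maximize the log-probability
    that the chosen response outranks the rejected one."""
    return -np.log(1.0 / (1.0 + np.exp(reward_rejected - reward_chosen)))
```

Same substrate, different targets: the first shapes what the model predicts token by token, the second shapes which whole responses it scores highly — which is the sense in which each capability jump in the hierarchy corresponds to a target change rather than a scaling change.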
Part V: Who Wins and Why — A Predictive Framework
Beyond the Bitter Lesson - Who Wins and Why (Part V)
Puts the framework on the line. Tests it retrospectively against five cases (OpenAI’s scaling phase, DeepSeek’s multi-layer innovation, the reasoning revolution, Google’s position, Anthropic’s methodology focus) and makes five falsifiable forward-looking predictions about architecture-methodology convergence, the encoding correction, data quality as the next bottleneck, hardware-software co-design, and prediction target proliferation. Concludes with explicit conditions under which the entire framework should be abandoned.