What is Scaling?

Author: Jacob Sussmilch

The Word That Does Too Much Work

“Scaling” is the most overloaded word in AI. It gets used to describe training larger models, buying more GPUs, optimizing inference throughput, expanding datasets, and improving architectural efficiency — as though these were all the same activity.

The confusion matters because it leads to bad strategy. When someone says “scaling is all you need,” what do they mean? Scale the model? Scale the compute? Scale the data? Scale the efficiency? Each implies a different investment, a different research agenda, and a different theory of what drives progress. Collapsing them into a single word obscures the most important strategic questions in the field.

This essay draws a line between two fundamentally different things that both travel under the name “scaling” — and identifies a third axis, inference-time compute, that the training-side distinction alone can’t account for.

Scaling the Model

The scaling laws — first formalized by Kaplan et al. (2020) and refined by Hoffmann et al. with Chinchilla (2022) — describe a specific empirical relationship: model performance (measured as loss) improves as a predictable power law of model size and training data. More parameters and more tokens, in the right ratio, yield lower loss. Reliably. Predictably.

This is scaling the model. It answers the question: how capable can a model become?

The key variable is model capacity — the number of parameters and the volume of data the model is trained on. The scaling laws tell you that if you double the parameters and double the data (in the compute-optimal ratio Chinchilla identified), you get a predictable improvement in loss. The relationship is smooth, well-characterized, and has held across multiple orders of magnitude.
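The power-law relationship can be sketched numerically. The snippet below uses the parametric loss form and fitted constants reported by Hoffmann et al. (2022); the constants are illustrative of the shape of the curve, not exact for any particular model family.

```python
# Chinchilla-style parametric scaling law: L(N, D) = E + A / N**alpha + B / D**beta
# Constants are the fit reported by Hoffmann et al. (2022); treat them as
# illustrative rather than exact for any given model family.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Predicted training loss for a model with n_params parameters
    trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Doubling both parameters and data moves you smoothly down the curve:
small = predicted_loss(70e9, 1.4e12)    # Chinchilla scale: 70B params, 1.4T tokens
big = predicted_loss(140e9, 2.8e12)     # double both
assert big < small                      # predictable improvement
assert small > E and big > E            # E is the irreducible-loss floor
```

The point of the functional form is exactly the one made above: compute never appears as an independent variable. It is implied by N and D, not a lever of its own.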

Compute appears in these laws, but as a dependent variable — it’s the cost of producing a given model, not an independent lever. Training compute is what you spend to instantiate a point on the scaling curve. It doesn’t define the curve. The scaling laws would describe the same relationship between model size, data, and loss regardless of whether training took one day or one year, or cost $1 million or $100 million. The curve is about the model. The cost is about the compute.

This distinction determines whether someone reads the scaling laws as saying “spend more on compute” (a procurement strategy) or “build bigger, better-fed models” (a research strategy). The laws themselves say the latter. But much of the field heard the former — and it became a real distraction. OpenAI’s scaling era was defined by compute procurement: deals, partnerships, and publications oriented around acquiring more hardware. The public, industry, and investment narratives followed suit.

Scaling Compute

Scaling compute is about making each unit of computation cheaper, faster, or more efficient. It answers a different question: how do we afford to build the models the scaling laws say we should build?

This is where hardware improvements and software optimizations live:

  • Hardware: Moore’s Law (no longer the primary driver), GPU parallelism, specialized accelerators (TPUs, Trainium), better interconnects, higher memory bandwidth
  • Systems software: Distributed training frameworks, pipeline parallelism, tensor parallelism, efficient gradient accumulation
  • Algorithmic efficiency: Flash Attention, mixed-precision training, kernel fusion, activation checkpointing
  • Architectural efficiency: Multi-head Latent Attention (reducing KV-cache memory), Mixture of Experts (activating fewer parameters per token), Grouped Query Attention (sharing KV heads)
  • Quantization: Reducing numerical precision from FP32 to FP16, INT8, or INT4 — making each operation cheaper with minimal accuracy loss
  • Low-level optimization: PTX-level programming, custom CUDA kernels, hardware-specific tuning that closes the gap between theoretical and actual GPU performance
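To make the quantization item concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization — the simplest version of the idea, not any particular lab’s production scheme. Each float tensor is replaced by 8-bit integers plus a single float scale, so the expensive arithmetic can run on cheap integer units.

```python
import numpy as np

# Minimal sketch of symmetric per-tensor INT8 quantization. Real systems use
# per-channel scales, calibration, and fused int8 matmul kernels; this only
# shows why the operation gets cheaper with minimal accuracy loss.
def quantize_int8(x: np.ndarray):
    scale = np.abs(x).max() / 127.0                 # map largest magnitude to 127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4)).astype(np.float32)
q, s = quantize_int8(x)
x_hat = dequantize(q, s)

# Each value moves by at most half a quantization step on the round trip.
assert q.dtype == np.int8
assert np.abs(x - x_hat).max() <= s / 2 + 1e-6
```

The storage drops 4x relative to FP32, and the rounding error is bounded by half the scale — which is why accuracy loss is typically small when activations and weights are well-behaved.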

None of these change the scaling laws. They change what point on the scaling curve you can afford to reach for a given budget. They are supply-side innovations — they expand the frontier of what’s economically feasible without altering the fundamental relationship between model capacity and performance.

This is an essential form of innovation. Without compute scaling, the scaling laws would be purely theoretical — describing models nobody could afford to train. The entire history of deep learning’s practical success is a story of compute scaling making model scaling affordable: GPUs making neural networks feasible (2012), distributed training making billion-parameter models feasible (2018), and the ongoing push to make trillion-parameter models economically viable.

Why the Distinction Matters

It explains DeepSeek

DeepSeek trained V3, a 671-billion-parameter model, for approximately $5-6 million — a fraction of what comparable models cost at American labs. The common narrative is that DeepSeek “proved scaling doesn’t matter.” The opposite is true. DeepSeek is a scaling laws success story.

They didn’t discover a new relationship between model size and performance. They didn’t prove that smaller models can match larger ones at the same data budget. What they proved is that the cost of reaching a given point on the scaling curve is not fixed. Through MLA, Mixture of Experts, GRPO, and PTX-level optimization, they dramatically reduced what it costs to train and serve a model of a given capability level. They scaled compute efficiency so effectively that they could reach points on the scaling curve that their hardware budget shouldn’t have allowed.

The scaling laws held. The cost of following them dropped.

It explains the US-China dynamic

US export controls on advanced AI chips were designed to limit China’s ability to scale compute — to restrict access to the hardware needed to train frontier models. Under the collapsed definition of “scaling,” this should have been decisive: less compute, worse models.

But the controls restricted compute supply, not model scaling itself. They made it more expensive for Chinese labs to reach a given point on the scaling curve. The response — innovating across the stack to make each available GPU do more useful work — was a compute-scaling strategy, not a model-scaling strategy. The constraint didn’t limit progress; it redirected where the innovation happened.

It clarifies the “scaling wall” debate

When people claim “we’ve hit a scaling wall,” which wall do they mean?

  • A compute scaling wall — it’s getting harder or more expensive to provide the compute needed for the next jump in model size. Moore’s Law has decelerated. Energy costs for data centers are rising. Chip fabrication at smaller nodes is increasingly expensive. Each generation of frontier model requires a larger capital expenditure.
  • A model scaling wall — the relationship between model capacity, data, and loss is yielding diminishing returns at the frontier. GPT-4 to GPT-5 was widely seen as a smaller jump than GPT-3 to GPT-4. Whether this reflects a fundamental limit in the scaling laws or simply the cost of reaching the next point on the curve is genuinely ambiguous.

These problems require different solutions. A compute scaling wall means engineering innovation to extract more performance from available hardware — or new hardware paradigms entirely. A model scaling wall would mean fundamental rethinking of architectures and training methods.

The distinction matters because the field tends to collapse both into a single “scaling wall” narrative, which forecloses the wrong options. If the bottleneck is cost, the answer is efficiency research. If the bottleneck is the underlying relationship, the answer is architectural innovation. Treating them as the same problem leads to the wrong response to both. OpenAI spent the scaling era assuming that scaling compute would translate to scaling the model past performance barriers at a 1:1 ratio. That collapsed thinking is exactly what will cost them the race.

It reframes the investment question

When a lab raises billions for “scaling,” what are they buying?

If they’re buying model scaling — the ability to train larger models on more data — then the investment thesis depends on the scaling laws continuing to hold and on having novel architectural ideas worth scaling.

If they’re buying compute scaling — more GPUs, bigger clusters — then the investment thesis depends on the current architecture being the right one to scale, and on procurement being the binding constraint rather than research.

These are different bets. The scaling laws give you a target. Compute scaling gives you the means to reach it. Confusing the two leads to strategies that optimize procurement when they should be optimizing research, or that fund research when the actual bottleneck is infrastructure.

Inference-Time Compute: A Third Axis

The model/compute distinction covers training — how you build the model. But a third scaling axis has emerged that doesn’t fit cleanly into either category: inference-time compute, sometimes called test-time compute.

The idea is straightforward. Instead of building a more capable model (model scaling) or making training cheaper (compute scaling), you spend more computation at inference — when the model is actually answering a question. Chain-of-thought prompting was an early version of this. The reasoning models (o1, o3, R1, Claude’s extended thinking) formalize it: the model generates intermediate reasoning steps, effectively “thinking longer” before producing an answer. More inference compute, better answers, same underlying model.

This is genuinely different from both forms of training-side scaling:

  • It doesn’t change the model’s parameters or the data it was trained on. The point on the training scaling curve is fixed.
  • It doesn’t make training cheaper or more efficient. The training cost is already sunk.
  • It creates a new scaling relationship — between inference compute and answer quality — that operates at deployment rather than development.

Where does it fit in the framework? Inference-time compute is best understood as a deployment-side scaling law that sits alongside the training-side scaling laws rather than replacing them. The training scaling laws determine the model’s base capability. Inference-time scaling determines how much of that capability gets extracted on a given query. A weaker model with more inference compute can match a stronger model with less — up to a point. The base model still sets the ceiling.
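The simplest instance of this deployment-side relationship is self-consistency: sample the same model several times and majority-vote the answers. The toy below uses a hypothetical `noisy_model` stand-in (not a real API) that answers correctly 60% of the time; spending more samples per query raises accuracy without touching the model itself.

```python
import random
from collections import Counter

# Toy illustration of inference-time scaling via self-consistency voting.
# `noisy_model` is a hypothetical stand-in for an LLM: it returns the correct
# answer (42) with probability 0.6, otherwise a random wrong answer.
def noisy_model(question: str, rng: random.Random) -> int:
    return 42 if rng.random() < 0.6 else rng.randrange(100)

def answer(question: str, samples: int, rng: random.Random) -> int:
    """Sample the model `samples` times and return the plurality answer."""
    votes = Counter(noisy_model(question, rng) for _ in range(samples))
    return votes.most_common(1)[0][0]

rng = random.Random(0)
trials = 2000
acc_1 = sum(answer("q", 1, rng) == 42 for _ in range(trials)) / trials
acc_9 = sum(answer("q", 9, rng) == 42 for _ in range(trials)) / trials
assert acc_9 > acc_1   # more inference compute, better answers, same model
```

The same mechanism illustrates the ceiling: voting can only amplify capability the base model already has. If the per-sample accuracy were near chance, no number of samples would recover the right answer.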

This matters strategically because it shifts where the cost accumulates. Training costs are large but fixed — you pay once to produce the model. Inference costs scale with every user, every query, every reasoning step. A strategy built on inference-time compute scaling trades a one-time research investment for a recurring per-query cost. For a company already struggling with inference margins (see the auto-router debacle), adding a second axis of inference spending compounds the problem rather than solving it.

The most effective position is research that improves the base model’s capability and the efficiency of inference-time reasoning. These are not the same research agenda. Conflating them — treating “reasoning models” as a single innovation rather than as training methodology plus inference strategy — reproduces exactly the kind of collapsed thinking this essay is about.

Conclusion

The three forms of scaling are in productive tension. The scaling laws set the theoretical frontier — how capable a model can become. Compute scaling determines how close to that frontier you can afford to operate. Inference-time scaling determines how much capability you extract once you get there. Each moves independently; every step-function jump in AI capability has been preceded by innovation in at least one of the three, and the bottleneck shifts between them.

The field’s central confusion — treating these as a single activity — has led to a decade of strategy that prioritizes compute procurement over the research that determines how effectively that compute gets used. Labs that can’t distinguish between hitting a model scaling wall and hitting a compute scaling wall will apply the wrong fix to both. And the emergence of inference-time scaling raises the stakes: it shifts costs from one-time training to per-query deployment, creating a new axis of spending that compounds the error if you don’t know which problem you’re solving.

The scaling laws tell you where to aim. How you get there is three separate problems — and right now, the labs that treat them as one are losing to the labs that don’t.