Scaling Is Changing Shape: Three Papers That Redraw the Compute Map
Microsoft, Sakana AI, and Qwen attack the same question from three angles — verification, internal time, and parallel streams — and all three conclude that how a model spends compute now matters more than how big it is.

For a decade, the AI scaling argument had one axis: make the model bigger. Three research papers published in 2025 — from Microsoft Research, Sakana AI, and Alibaba's Qwen team — say the interesting axis has moved. The question is no longer how many parameters a model has. It's how the model spends compute: more thinking at inference time, thinking structured over an internal time dimension, or thinking split across parallel streams. Each paper attacks a different face of the same problem, and together they map where model performance is actually going to come from next.
The empirical reality check: inference-time scaling has a shape
The loudest trend of the reasoning-model era is the idea that you can trade inference compute for capability through longer chains of thought, repeated sampling, and feedback loops. It's the same cost dynamic explored in The Real Cost of AI Compute: the spend has shifted from training to inference, and inference-time scaling is what's driving it. Microsoft Research's study (Balachandran et al., March 2025) is the most comprehensive attempt so far to measure whether that trade actually holds, testing nine state-of-the-art models across eight hard task families that span math and STEM reasoning, calendar planning, NP-hard problems, navigation, and spatial reasoning.
The design is worth understanding because it's what makes the findings credible. Rather than benchmarking single responses, the team ran evaluation protocols with repeated model calls — independently, or sequentially with feedback — to approximate each model's lower and upper performance bounds. In other words: not "how good is this model," but "how good could this model get if you scaled inference around it."
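The bounding idea can be made concrete with a small sketch. It assumes you already have a grid of pass/fail outcomes from repeated independent calls; the names `scaling_bounds` and `toy_results`, and the outcomes themselves, are made up for illustration and are not taken from the study.

```python
from typing import Dict, Sequence


def scaling_bounds(results: Sequence[Sequence[bool]]) -> Dict[str, float]:
    """Summarise repeated runs; results[i][j] is True if run j solved problem i."""
    n_problems = len(results)
    # Lower bound: a problem counts only if every run solved it.
    worst_of_n = sum(all(runs) for runs in results) / n_problems
    # Expected accuracy of a single call, averaged over runs and problems.
    average = sum(sum(runs) / len(runs) for runs in results) / n_problems
    # Upper bound: a problem counts if any run solved it,
    # which presumes a perfect selector picking the right run.
    best_of_n = sum(any(runs) for runs in results) / n_problems
    return {"worst_of_n": worst_of_n, "average": average, "best_of_n": best_of_n}


# Hypothetical outcomes for 4 problems x 4 independent runs (illustrative only).
toy_results = [
    [True, True, True, True],
    [True, False, False, True],
    [False, False, True, False],
    [False, False, False, False],
]
bounds = scaling_bounds(toy_results)
print(bounds)
```

The spread between `worst_of_n` and `best_of_n` is the room that scaling inference around a model could, in principle, close on that task.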
Three findings stand out.
Gains are uneven and taper with difficulty. The advantages of inference-time scaling vary across tasks and diminish as problem complexity increases. Reasoning models are not a uniform upgrade; they're a targeted one.
Token count is not accuracy. In the hardest regimes, simply generating more tokens does not translate to better answers, so longer thinking can be wasted thinking. Token consumption for the same problem also varies widely, which makes inference cost hard to predict.
Verification is the unlock. This is the finding that should reorganize roadmaps. With a perfect verifier selecting among multiple independent runs, conventional (non-reasoning) models approached the average performance of today's most advanced reasoning models on some tasks. And all models — reasoning-tuned or not — showed significant gains when inference was scaled with perfect verifiers or strong feedback. On other tasks a substantial gap remained even at very high scaling, so verification isn't a universal equalizer. But the headroom is real, and it lives in the checking, not the generating.
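To see why checking can matter more than generating, consider a minimal sketch of verifier-based selection. Everything here is a toy assumption: the "model" is a random guesser at factoring 91, the verifier is an exact multiplication check, and the names `best_of_n`, `pass_at_n`, and `noisy_factor_guess` are hypothetical. Real tasks rarely come with a verifier this cheap or this perfect.

```python
import random
from typing import Callable, Optional, Tuple

Pair = Tuple[int, int]


def pass_at_n(p: float, n: int) -> float:
    """Chance that at least one of n independent attempts succeeds,
    given per-attempt success probability p and a perfect verifier."""
    return 1.0 - (1.0 - p) ** n


def best_of_n(
    generate: Callable[[], Pair],
    verify: Callable[[Pair], bool],
    n: int,
) -> Optional[Pair]:
    """Sample up to n candidates and return the first one the verifier accepts."""
    for _ in range(n):
        candidate = generate()
        if verify(candidate):
            return candidate
    return None


TARGET = 91
rng = random.Random(0)


def noisy_factor_guess() -> Pair:
    """Toy stand-in for a weak model: guess a factor between 2 and 12."""
    a = rng.randint(2, 12)
    return (a, TARGET // a)


def is_factorisation(pair: Pair) -> bool:
    """Perfect verifier for this toy task: multiply and compare."""
    a, b = pair
    return a > 1 and b > 1 and a * b == TARGET


answer = best_of_n(noisy_factor_guess, is_factorisation, n=50)
# Only the guess a = 7 verifies, so the per-attempt success rate is 1/11.
print(answer, round(pass_at_n(1 / 11, 50), 3))
```

A generator that is right about 9% of the time becomes a system that is right almost always, but only because the verifier never accepts a wrong answer. Replace it with an imperfect checker and the guarantee weakens accordingly.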
The practical translation: if you're building on LLMs, the ceiling on your system's performance may be set less by which model you call and more by whether you can verify and select among its attempts. We go deeper on that trade-off in AI Reasoning Models and the New Economics of Intelligence.
The architectural rethink: what if time is the missing variable?
Sakana AI's Continuous Thought Machine (May 2025) comes at the same territory from the opposite direction. Instead of scaling inference around an existing architecture, it asks whether the standard artificial neuron in most machine learning systems — essentially unchanged since the 1980s — is discarding information that biological brains treat as fundamental: the timing of neural activity.
The CTM makes two structural moves. First, each neuron gets access to its own history of activity and learns to use it, rather than computing only from its current state. Second — and this is the genuinely novel part — the model's core representation is the synchronization between neurons over time. Coordination in timing is the signal. The model operates in an internal "thinking dimension" decoupled from the input, so it reasons about a static image the same way it reasons about sequential data: step by step, over internal time.
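A rough way to picture "synchronization as representation" is a pairwise similarity matrix over neuron activity histories. The sketch below assumes a plain inner product as the synchronization measure and uses hand-written activity traces; it is a simplification for intuition, not Sakana's implementation, and the function name `synchronization` is hypothetical.

```python
from typing import List, Sequence


def synchronization(histories: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise inner products of per-neuron activity histories.

    histories[i][t] is neuron i's activity at internal step t.
    Large positive entries mean two neurons rise and fall together.
    """
    n = len(histories)
    return [
        [sum(a * b for a, b in zip(histories[i], histories[j])) for j in range(n)]
        for i in range(n)
    ]


# Hand-written activity over four internal steps (illustrative only).
histories = [
    [1.0, -1.0, 1.0, -1.0],   # neuron 0
    [1.0, -1.0, 1.0, -1.0],   # neuron 1: in step with neuron 0
    [-1.0, 1.0, -1.0, 1.0],   # neuron 2: exactly out of phase with neuron 0
    [1.0, 1.0, -1.0, -1.0],   # neuron 3: a different rhythm
]
sync = synchronization(histories)
print(sync[0][1], sync[0][2], sync[0][3])  # prints: 4.0 -4.0 0.0
```

Note that the matrix depends only on how activity unfolds over internal steps, not on any single step's values, which is the sense in which timing carries the signal.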
The behavior that emerges was not designed in. On maze-solving tasks, the CTM's attention visibly traces the path through the maze as it reasons — and when given more thinking steps than it was trained with, it keeps following the path, suggesting it learned a general procedure rather than a memorized mapping. On ImageNet classification, its attention moves across salient features of an image before deciding, accuracy improves the longer it thinks, and it learns to spend fewer steps on easy images — adaptive compute allocation as an emergent property, not an engineered one.
Is the CTM about to replace transformers? No, and Sakana doesn't claim it will. What it demonstrates is that interpretable, human-like, variable-depth reasoning can fall out of an architecture that treats time as information — the same capability the inference-scaling world is trying to bolt on from the outside.
The third axis: parallel scaling
Qwen's ParScale paper (Chen et al., May 2025) names the frame explicitly. The field has two accepted ways to scale: parameters (bigger models, more memory) and inference-time tokens (longer outputs, more latency). ParScale proposes a third: scale the model's parallel computation, during both training and inference.
The mechanism is compact. Apply P diverse, learnable transformations to the input, run P forward passes through the same model in parallel, and dynamically aggregate the outputs. The parameters are reused across streams — no model growth — and the method is architecture-agnostic.
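The three steps can be sketched in a few lines. This is a toy under stated assumptions, not ParScale's code: the per-stream scale-and-shift transform, the fixed two-unit `shared_model`, and the softmax gate are all stand-ins chosen for brevity, since this article does not specify how the paper parameterizes its transformations or its aggregation.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]


def shared_model(x: Sequence[float]) -> Vector:
    """Stand-in for the one set of weights that every stream reuses."""
    return [math.tanh(x[0] + 0.5 * x[1]), math.tanh(x[1] - 0.5 * x[0])]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def parallel_forward(
    x: Sequence[float],
    transforms: Sequence[Tuple[float, float]],
    gate: Sequence[float],
) -> Tuple[Vector, Vector]:
    """Run P streams through the same model and aggregate their outputs.

    transforms: one (scale, shift) pair per stream; learnable in a real system.
    gate: vector that scores each stream's output; also learnable in a real system.
    """
    # 1. Give each stream its own view of the input.
    views = [[scale * xi + shift for xi in x] for scale, shift in transforms]
    # 2. P forward passes through identical weights (parallelisable in practice).
    outputs = [shared_model(view) for view in views]
    # 3. Dynamic aggregation: weights depend on the outputs themselves.
    weights = softmax([sum(g * o for g, o in zip(gate, out)) for out in outputs])
    dim = len(outputs[0])
    combined = [sum(w * out[k] for w, out in zip(weights, outputs)) for k in range(dim)]
    return combined, weights


x = [0.2, -0.4]
transforms = [(1.0, 0.0), (0.5, 0.1), (1.5, -0.1)]  # P = 3, hypothetical values
combined, weights = parallel_forward(x, transforms, gate=[1.0, -1.0])
print(combined, weights)
```

With a single identity transform the function reduces to one ordinary forward pass, which is the sense in which P = 1 recovers the base model.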
The claims that matter for anyone running inference at scale:
- The team proposes a new scaling law and validates it through large-scale pre-training: a model with P parallel streams performs similarly to one whose parameter count has been scaled by O(log P).
- Compared with parameter scaling that achieves the same performance gain, ParScale's memory increase is up to 22× smaller and its latency increase up to 6× smaller.
- An off-the-shelf pre-trained model can be "recycled" into a parallel-scaled one by post-training on a small number of tokens — no full retraining required.
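The logarithmic claim is easier to reason about with numbers. The sketch below assumes one simple functional form consistent with "O(log P)", namely an effective parameter count of N × (1 + k·ln P). Both that form and the constant `K_HYPOTHETICAL` are assumptions made for illustration; the fitted values belong to the paper and are not reproduced here.

```python
import math


def effective_params(n_params: float, p_streams: int, k: float) -> float:
    """Parameter-equivalent size under an assumed N * (1 + k * ln P) form."""
    if p_streams < 1:
        raise ValueError("p_streams must be at least 1")
    return n_params * (1.0 + k * math.log(p_streams))


K_HYPOTHETICAL = 0.3  # illustrative constant, not a value from the paper
BASE = 1e9            # a hypothetical 1B-parameter model

for p in (1, 2, 4, 8):
    equivalent = effective_params(BASE, p, K_HYPOTHETICAL)
    print(f"P={p}: behaves like ~{equivalent / 1e9:.2f}B parameters")
```

Under any form like this, each doubling of P adds the same fixed increment, so returns diminish steadily as P grows. The argument for parallel scaling rests on those increments being cheap in memory and latency, not on them compounding.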
The economics are the point. Memory is the binding constraint of edge and low-resource deployment; latency is the binding constraint of user-facing products. A scaling method that improves capability while being gentle on both is not an academic curiosity — it's a deployment lever.
What the pattern means
Read together, the three papers describe one shift from three vantage points. Microsoft measures the limits of the current approach and locates the headroom in verification. Sakana shows that reallocating compute over an internal time dimension can be an architectural property rather than a prompting trick. Qwen shows that compute can be reallocated spatially — across parallel streams — at a fraction of the cost of growing the model.
The common thread: compute allocation is becoming a design space of its own, separate from model size. For operators and investors, that has concrete implications. Inference costs become harder to forecast from simple per-token estimates (Microsoft's variance finding), but easier to improve through system design: verifiers, selection, and parallelism. The moat logic shifts too. If a mid-sized model plus a strong verifier plus parallel streams approaches a frontier reasoning model on your task, the premium you're paying for raw scale deserves an audit.
Related Analysis
- AI Reasoning Models and the New Economics of Intelligence — the pillar piece on what reasoning-model compute actually costs and who captures the margin.
- The Real Cost of AI Compute — training versus inference spend, and why inference now dominates the bill.
- The Economics of AI Infrastructure — the capital and energy build-out underneath every scaling strategy in this piece.
- Artificial Intelligence hub — full coverage of models, agents, and the economics that decide winners.
Limitations and honest caveats
The Microsoft study's most striking result depends on perfect verifiers — an oracle that doesn't exist for most real-world tasks. Building good-enough verifiers is an open problem, and the gap between "perfect verifier" results and deployable systems may be large. The CTM results are demonstrated on tasks like mazes and ImageNet classification, not on frontier-scale language modeling; its practicality at LLM scale is unproven. ParScale's headline efficiency numbers ("up to 22×") are best-case figures from the authors' own experiments and, as of the cited version, come from a single team's preprint. All three papers are recent; independent replication is still accumulating.
FAQ
What is inference-time scaling? Inference-time scaling improves an LLM's answers by spending more compute when the model runs — longer reasoning chains, multiple attempts, or feedback loops — instead of training a bigger model. It works, but Microsoft Research's 2025 study shows the gains vary by task and shrink as problems get harder.
Does generating more tokens make an LLM more accurate? Not reliably. In Microsoft Research's evaluation of nine models across eight hard task families, more output tokens did not consistently produce higher accuracy on difficult problems, and token usage for identical problems varied enough to make costs hard to predict.
What is a Continuous Thought Machine? The Continuous Thought Machine (CTM) is a neural network architecture from Sakana AI in which neurons use their own activity history and the model's core representation is the synchronization of neural activity over time. It reasons in discrete internal "thinking steps," producing interpretable, step-by-step behavior — such as visibly tracing a path while solving a maze.
What is ParScale? ParScale (parallel scaling) is a scaling method from Qwen researchers that applies P learnable transformations to an input, runs P forward passes of the same model in parallel, and aggregates the outputs. It delivers performance comparable to an O(log P) parameter increase with far smaller memory and latency costs than growing the model.
Should teams stop caring about model size? No — parameter scaling still works and frontier models still lead on the hardest tasks. The shift is that model size is no longer the only lever, or always the most cost-efficient one. Verification quality, parallel compute, and inference strategy are now first-order variables in system performance.