Defeating Nondeterminism in LLM Inference. A very common question I see about LLMs concerns why they can't be made to deliver the same response to the same prompt by setting a fixed random number seed.
Like many others I had been led to believe this was due to the non-associative nature of floating point arithmetic, where (a + b) + c ≠ a + (b + c), combined with unpredictable calculation orders on concurrent GPUs. This new paper calls that the "concurrency + floating point" hypothesis:
One common hypothesis is that some combination of floating-point non-associativity and concurrent execution leads to nondeterminism based on which concurrent core finishes first. We will call this the “concurrency + floating point” hypothesis for LLM inference nondeterminism.
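The non-associativity half of that hypothesis is easy to see for yourself in a Python REPL:

```python
# Floating point addition is not associative: grouping changes the result
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                  # 0.6000000000000001
print(a + (b + c))                  # 0.6
print((a + b) + c == a + (b + c))   # False
```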
The paper then convincingly argues that this is not the core of the problem, because "in the typical forward pass of an LLM, there is usually not a single atomic add present."
Why are LLMs so often non-deterministic then?
[...] the primary reason nearly all LLM inference endpoints are nondeterministic is that the load (and thus batch-size) nondeterministically varies! This nondeterminism is not unique to GPUs — LLM inference endpoints served from CPUs or TPUs will also have this source of nondeterminism.
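Here's a minimal sketch of that effect, assuming a CUDA-capable GPU (on some hardware and kernel selections the difference may come out to zero): the same row of a matrix multiplication, computed alone versus as part of a larger batch, can produce bitwise-different results because a different kernel, and thus a different reduction order, may be chosen for each batch size.

```python
import torch

torch.manual_seed(0)
a = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)
b = torch.randn(2048, 2048, device="cuda", dtype=torch.bfloat16)

# The same logical computation for row 0, at two different batch sizes
out_batched = torch.mm(a, b)      # row 0 computed alongside 2047 other rows
out_single = torch.mm(a[:1], b)   # row 0 computed on its own

# A nonzero maximum difference here means the results are not bitwise equal
print((out_batched[:1] - out_single).abs().max())
```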
The thinking-machines-lab/batch_invariant_ops code that accompanies the paper provides a PyTorch implementation of batch-invariant kernels and demonstrates them running Qwen3-8B deterministically under vLLM.
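The underlying idea (this is my own toy sketch, not the repository's actual implementation): a batch-invariant kernel fixes the reduction strategy, so any given row is always summed in the same chunked order no matter how many other rows share the batch.

```python
import torch

def batch_invariant_row_sums(x: torch.Tensor, chunk: int = 256) -> torch.Tensor:
    """Toy batch-invariant reduction: sum each row in fixed-size chunks,
    in a fixed sequence, so the floating-point rounding for a given row
    never depends on the batch size. The real kernels apply this principle
    to matmul, attention and other reductions."""
    n = x.shape[-1]
    total = torch.zeros(x.shape[:-1], dtype=x.dtype, device=x.device)
    for i in range(0, n, chunk):
        # Chunk boundaries depend only on the row length, never on how
        # many rows happen to be processed together
        total = total + x[..., i:i + chunk].sum(dim=-1)
    return total
```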
This paper is the first public output from Thinking Machines, the AI lab founded in February 2025 by Mira Murati, OpenAI's former CTO (and interim CEO for a few days). It's unrelated to Thinking Machines Corporation, the last employer of Richard Feynman (as described in this most excellent story by Danny Hillis).