What is AI inference? Definition and business implications

Inference is the usage phase of an AI model, during which the model computes a response from a given prompt. It is the operation that API providers bill for, as distinct from training, which is an upfront fixed cost.

Inference consists of passing a query through the model's neural network to generate an output. For an LLM, this means predicting the response tokens one at a time, with each new token requiring a forward pass through the network. The larger the model (in parameter count), the more computation each pass demands and the higher the latency.

Three variables drive inference cost: the size of the model (more parameters means more computation per token), the number of input and output tokens (cost scales roughly in proportion to both), and the numerical precision used (FP32, FP16, INT8, INT4). Quantisation lowers that precision, usually with little loss of quality, and shrinks the memory footprint in proportion to the bit width: INT8 weights take a quarter of the space of FP32 weights. Compute cost also falls, though the gain depends on hardware support for the lower-precision format.

Inference accounts for the bulk of an AI application's production cost: according to NVIDIA's public analyses, around 80% of an enterprise's operational AI budget at scale goes to inference, versus 20% to training.
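As a rough illustration of how these variables combine, the Python sketch below estimates the memory needed for a model's weights at each precision and the billed cost of a single request. The 7-billion-parameter model, the token counts and the per-million-token prices are hypothetical values chosen for the example, not figures from any provider, and the memory estimate covers weights only (it ignores activations and caches).

```python
# Bytes needed to store one weight at each numerical precision.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


def request_cost(input_tokens: int, output_tokens: int,
                 price_in_per_million: float,
                 price_out_per_million: float) -> float:
    """Billed cost of one request under per-token pricing."""
    total = (input_tokens * price_in_per_million
             + output_tokens * price_out_per_million)
    return total / 1_000_000


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7-billion-parameter model
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(n_params, precision)
        print(f"{precision}: {gb:.1f} GB of weights")

    # Hypothetical prices: 1.00 per million input tokens,
    # 3.00 per million output tokens.
    cost = request_cost(2_000, 500, 1.00, 3.00)
    print(f"Cost of one request: {cost:.4f}")
```

Under these assumptions, each halving of precision halves the weight footprint, and the request cost grows linearly with the number of tokens sent and generated.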

Concrete example

According to the Stanford AI Index 2025 report, the inference cost for a GPT-3.5-level model (MMLU score of 64.8) fell from 20 dollars per million tokens in November 2022 to 0.07 dollars per million in October 2024, a reduction of more than 280-fold in just under two years. The same quality, once reserved for organisations with a dedicated AI budget, is now accessible for a few tens of euros per month at SME scale. This steady decline changes the economic trade-off: what was out of reach in 2023 is now a secondary cost item.
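The arithmetic behind these figures can be checked with a short Python sketch. The two prices and the two dates are those quoted above; the monthly volume of 50 million tokens is a hypothetical workload chosen only to show the order of magnitude of a monthly bill.

```python
from datetime import date


def fold_reduction(old_price: float, new_price: float) -> float:
    """How many times cheaper the new price is than the old one."""
    return old_price / new_price


def months_between(start: date, end: date) -> int:
    """Whole calendar months elapsed between two dates."""
    return (end.year - start.year) * 12 + (end.month - start.month)


def monthly_spend(tokens_per_month: int, price_per_million: float) -> float:
    """Monthly bill for a given token volume and per-million-token price."""
    return tokens_per_month / 1_000_000 * price_per_million


if __name__ == "__main__":
    old_price, new_price = 20.00, 0.07  # dollars per million tokens
    print(f"Reduction: {fold_reduction(old_price, new_price):.0f}-fold")
    print(f"Elapsed: {months_between(date(2022, 11, 1), date(2024, 10, 1))} months")

    volume = 50_000_000  # hypothetical monthly token volume
    print(f"Monthly bill at old price: {monthly_spend(volume, old_price):.2f}")
    print(f"Monthly bill at new price: {monthly_spend(volume, new_price):.2f}")
```

At that assumed volume, the same workload goes from about a thousand dollars a month to a few dollars.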

Sources

  1. Artificial Intelligence Index Report 2025, Stanford HAI, chapter 1. https://hai.stanford.edu/ai-index/2025-ai-index-report (accessed 2026-05-24)
  2. Anthropic Claude API public pricing 2026. https://www.anthropic.com/pricing (accessed 2026-05-24)
