Last reviewed: May 24, 2026

What is AI inference? Definition and business implications

Inference is the usage phase of an AI model, during which the model computes a response to a given prompt. It is the operation that API providers bill for, as distinct from training, which is an initial fixed cost.

Inference consists of passing a query through the model's neural network to generate an output. For an LLM, this means predicting the response tokens one at a time, with each new token requiring a full pass through the network. The larger the model (in parameter count), the more computation each pass demands and the higher the latency.

Three variables drive inference cost: the size of the model (more parameters means more computation per token), the number of input and output tokens (cost scales in direct proportion), and the numerical precision used (FP32, FP16, INT8, INT4). Quantisation lowers that precision, typically without significantly degrading quality, and shrinks the memory taken by the weights roughly in proportion to the bit width: INT8 weights occupy about a quarter of the memory of FP32 weights. Compute cost falls as well, although the gain depends on how well the hardware supports the lower precision.

Inference accounts for the bulk of an AI application's production cost: according to NVIDIA's public analyses, around 80% of an enterprise's operational AI budget at scale goes to inference, versus 20% for training.
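The link between parameter count, precision and resource use can be sketched in a few lines. This is a rough illustration, not a sizing tool: the 7-billion-parameter figure is a hypothetical round number, the memory estimate covers the weights only (it ignores activations, the attention cache and any quantisation overhead), and the "2 operations per parameter per token" figure is a common rule of thumb rather than a measured value.

```python
# Bytes needed to store one parameter at each precision named in the text.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Memory for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


def approx_flops_per_token(n_params: float) -> float:
    """Rule-of-thumb cost of one forward pass: about 2 operations per parameter."""
    return 2.0 * n_params


# Assumption: a hypothetical model with 7 billion parameters.
n_params = 7e9

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(n_params, precision):.1f} GB of weights")

print(f"about {approx_flops_per_token(n_params):.1e} operations per generated token")
```

Under these assumptions the weights go from 28 GB in FP32 to 3.5 GB in INT4, which is why a quantised model of this size can fit on hardware that could not hold the full-precision version.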

Concrete example

According to the Stanford AI Index 2025 report, the inference cost for a GPT-3.5-level model (MMLU score of 64.8) fell from $20 per million tokens in November 2022 to $0.07 per million tokens in October 2024, a reduction of roughly 280-fold that the report describes as taking place in approximately eighteen months. The same quality, once reserved for organisations with a dedicated AI budget, is now accessible for a few tens of euros per month at SME scale. This continuous fall changes the economic trade-off: what was out of reach in 2023 is now a secondary cost item.
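The arithmetic behind these figures is easy to check. The sketch below uses the two prices quoted from the report; the monthly volume of 50 million tokens is a hypothetical value chosen only to show the order of magnitude for a small organisation.

```python
# Prices per million tokens, as quoted from the Stanford AI Index 2025 report.
PRICE_NOV_2022 = 20.00  # dollars per million tokens
PRICE_OCT_2024 = 0.07   # dollars per million tokens

# Assumption: a hypothetical workload of 50 million tokens per month.
MONTHLY_TOKENS_MILLIONS = 50

fold_reduction = PRICE_NOV_2022 / PRICE_OCT_2024
monthly_cost_2022 = MONTHLY_TOKENS_MILLIONS * PRICE_NOV_2022
monthly_cost_2024 = MONTHLY_TOKENS_MILLIONS * PRICE_OCT_2024

print(f"Price reduction: about {fold_reduction:.0f}-fold")
print(f"Monthly cost at the 2022 price: ${monthly_cost_2022:,.2f}")
print(f"Monthly cost at the 2024 price: ${monthly_cost_2024:,.2f}")
```

At that hypothetical volume, the same workload goes from $1,000 per month to $3.50 per month.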

Three implications

Inference is the only recurring AI budget line, and it grows as usage is industrialised. This has three implications for executives.

First, the spectacular fall in costs (roughly 280-fold in about 18 months for GPT-3.5-level quality) changes the economic trade-off: use cases that were not viable yesterday are becoming viable. Auditing the cases discarded in 2022-2023 for poor ROI may reveal new opportunities.

Second, inference cost is sensitive to prompting discipline: shorter prompts, context trimmed to what the task needs, and smaller models whenever the complexity does not warrant a flagship all reduce the bill. Frugal usage is a direct economic lever.

Third, local inference on internal infrastructure is becoming feasible for data-sensitive cases: a quantised Mistral 7B runs on standard hardware, at very low marginal cost after the initial investment.
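The effect of prompting discipline and model choice on the bill can be illustrated with a minimal cost model. Everything numeric here is an assumption made for the example: the two price lists, the token counts and the request volume are hypothetical and do not reflect any provider's actual pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Price list for one model, in dollars per million tokens."""
    input_per_million: float
    output_per_million: float


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request: tokens are billed in direct proportion to their count."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


# Assumption: hypothetical price lists, for illustration only.
flagship = Pricing(input_per_million=3.00, output_per_million=15.00)
small = Pricing(input_per_million=0.25, output_per_million=1.25)

# Assumption: hypothetical workload of 100,000 requests per month.
REQUESTS_PER_MONTH = 100_000

# Scenario A: long prompt sent to the flagship model.
cost_a = REQUESTS_PER_MONTH * request_cost(flagship, input_tokens=4000, output_tokens=500)

# Scenario B: trimmed prompt sent to the smaller model.
cost_b = REQUESTS_PER_MONTH * request_cost(small, input_tokens=1500, output_tokens=500)

print(f"Scenario A (flagship, long prompt): ${cost_a:,.2f} per month")
print(f"Scenario B (small model, trimmed prompt): ${cost_b:,.2f} per month")
```

With these made-up figures the monthly bill drops from $1,950 to $100. The real ratio depends entirely on actual prices and on whether the smaller model is good enough for the task.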

Sources

Artificial Intelligence Index Report 2025, Stanford HAI, chapter 1. https://hai.stanford.edu/ai-index/2025-ai-index-report (accessed 2026-05-24)
Anthropic Claude API public pricing 2026. https://www.anthropic.com/pricing (accessed 2026-05-24)
