What is AI training data? Definition and business implications

Training data is the set of texts, images, code, and other content used to train an AI model. Its composition determines what the model knows, what it ignores, which biases it carries, and which legal risks it creates. Much of current AI litigation concerns the provenance and lawfulness of this data.

Foundation models are trained on corpora of several trillion tokens, and the composition of those corpora is rarely fully public. Three sources dominate. First, the public web, mostly via Common Crawl, whose archive of more than 250 billion pages is the raw material of most LLMs. Second, book and press corpora, whose lawful use is now contested: the New York Times' complaint against OpenAI (filed December 2023, still in discovery in 2026) concerns precisely this point. Third, purpose-built data: RLHF annotations, fine-tuning examples, and synthetic data.

The traceability of training data has become central. The European AI Act requires providers of general-purpose AI models, the Act's term covering foundation models, to publish a sufficiently detailed summary of the content used for training. Practices remain heterogeneous: Anthropic publishes partially, Mistral publishes little, OpenAI does not publish.
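To make traceability concrete, here is a minimal Python sketch of a training-data manifest. It is an illustration only: the `SourceRecord` fields, the category labels, and the token figures are all hypothetical, and none of them reflects a format required by the AI Act or used by any named provider.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    """One entry in a hypothetical training-data manifest."""
    name: str      # human-readable source name
    category: str  # e.g. "web", "books_press", "generated"
    tokens: int    # tokens contributed by this source
    licence: str   # e.g. "licensed", "public_domain", "unknown"


def composition_by_category(records):
    """Return each category's share of total tokens as a fraction of 1."""
    totals = defaultdict(int)
    for record in records:
        totals[record.category] += record.tokens
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {category: count / grand_total for category, count in totals.items()}


def unresolved_sources(records):
    """List the names of sources whose licence status is undocumented."""
    return [record.name for record in records if record.licence == "unknown"]


# Illustrative figures only; they do not describe any real model.
manifest = [
    SourceRecord("web_crawl_snapshot", "web", 800, "unknown"),
    SourceRecord("licensed_news_archive", "books_press", 150, "licensed"),
    SourceRecord("annotation_batches", "generated", 50, "licensed"),
]

shares = composition_by_category(manifest)
flagged = unresolved_sources(manifest)
```

A manifest of this kind answers the two questions a reviewer tends to ask first: how much of the corpus comes from each type of source, and which sources still lack a documented legal basis.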

Concrete example

The state of litigation in 2026 illustrates the legal uncertainty. Two decisions favourable to AI labs were issued in June 2025 (Bartz v. Anthropic, Kadrey v. Meta), each characterising the training at issue as highly transformative and therefore protected by U.S. fair use. But the flagship complaint by the New York Times against OpenAI is still pending, and in January 2026 the discovery phase produced a court order requiring OpenAI to hand over 20 million anonymised ChatGPT logs so that verbatim regurgitation of protected content can be evaluated. The matter is not settled. For a European executive, this U.S. uncertainty comes on top of GDPR and AI Act compliance, whose obligations are, by comparison, clearly defined.
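For readers wondering what evaluating verbatim regurgitation can look like technically, one simple approach is to measure how many word sequences of a model output also appear in a protected text. The Python sketch below is an assumption-laden illustration, not the method used by the parties or the court: the function names, the n-gram length, and the sample sentences are all hypothetical.

```python
def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(output, reference, n=8):
    """Fraction of the output's n-grams that also occur in the reference.

    1.0 means every n-word sequence of the output is copied from the
    reference; 0.0 means none is. Outputs shorter than n words score 0.0.
    """
    output_ngrams = word_ngrams(output, n)
    if not output_ngrams:
        return 0.0
    shared = output_ngrams & word_ngrams(reference, n)
    return len(shared) / len(output_ngrams)


# Hypothetical sample texts, chosen only to show the two extremes.
reference = "the quick brown fox jumps over the lazy dog near the river bank"
copied = "the quick brown fox jumps over the lazy dog near the river bank"
paraphrase = "a fast brown fox leaps over a sleepy dog by the river"

copied_score = verbatim_overlap(copied, reference, n=4)
paraphrase_score = verbatim_overlap(paraphrase, reference, n=4)
```

Here the copied text scores 1.0 and the paraphrase 0.0. A real evaluation would need text normalisation, thresholds, and a far larger reference corpus, so the score is best read as a screening signal rather than a legal finding.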

Further reading

The New York Times Company v. Microsoft Corporation, complaint, S.D.N.Y. No. 23-CV-11195, filed December 2023

Sources

  1. The New York Times Company v. Microsoft Corporation, S.D.N.Y. No. 23-CV-11195, filed December 2023, in discovery phase in 2026. https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf (accessed 2026-05-24)
  2. Regulation (EU) 2024/1689 on artificial intelligence (AI Act), Articles 53-55 on the obligations of providers of general-purpose AI models. https://eur-lex.europa.eu/eli/reg/2024/1689/oj (accessed 2026-05-24)
