What is AI alignment? Definition and business implications

Alignment is the set of techniques used to steer the behaviour of an AI model towards human values and the goals of its user or publisher. It turns a raw model, capable of producing anything, into a useful, honest assistant that refuses requests that break the rules it has been given.

Alignment takes place during post-training, after the model has been pre-trained on raw text. Three main techniques are used.

Supervised instruction tuning: the model is shown thousands of examples of good responses to varied instructions, so that it learns to follow instructions.

Reinforcement learning from human feedback (RLHF, formalised by Christiano et al. in 2017 and used by OpenAI from InstructGPT in 2022): human evaluators rank the model's responses, a reward model is trained on these rankings, and the reward model's scores are then used to fine-tune the final model with reinforcement learning.

Constitutional AI (Anthropic, 2022): part of the human feedback is replaced by a set of written principles that the model uses to critique and revise its own responses.

Alignment remains an open scientific problem. It does not guarantee the absence of undesirable behaviours; it only reduces their probability. The boundary between alignment (steering behaviour) and technical guardrails (blocking output) is porous, and the two approaches complement each other.
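The reward-model step of RLHF described above can be illustrated with a small sketch. It assumes a toy linear reward over two hand-made features (both hypothetical, chosen only for illustration), whereas a real reward model is a neural network that reads the full text of the response. What the sketch does share with the InstructGPT setup is the pairwise preference loss: for each human comparison, the loss is -log sigmoid(r(preferred) - r(rejected)), so training pushes the preferred response's score above the rejected one's.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reward(weights, features):
    """Toy linear reward: a weighted sum of response features."""
    return sum(w * f for w, f in zip(weights, features))


def pairwise_loss(weights, chosen, rejected):
    """Preference loss for one human comparison (chosen ranked above rejected)."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    return -math.log(sigmoid(margin))


def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Fit the weights by plain gradient descent on the pairwise loss."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # d(loss)/d(margin) = -(1 - sigmoid(margin))
            grad_scale = -(1.0 - sigmoid(margin))
            for i in range(n_features):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights


# Hypothetical features per response:
#   [asks_a_clarifying_question, gives_binding_advice]
# Each pair is (response the evaluator preferred, response the evaluator rejected).
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [0.0, 0.0]),
    ([0.0, 0.0], [0.0, 1.0]),
]

weights = train_reward_model(pairs, n_features=2)
print("learned weights:", weights)
print("reward, clarifying question:", reward(weights, [1.0, 0.0]))
print("reward, binding advice:     ", reward(weights, [0.0, 1.0]))
```

After training, the first weight is positive and the second negative, so the toy model scores a response that asks a clarifying question above one that gives binding advice. In full RLHF, scores of this kind serve as the reward signal for the reinforcement-learning stage that fine-tunes the final model.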

Concrete example

Compare the same query, “How should I invest 10,000 euros?”, sent to a raw pre-trained model (GPT-3 davinci, 2020) and to an aligned successor (ChatGPT, 2022). The raw model simply continues the text with a probable word sequence: sometimes a list of financial products without context, sometimes text that is of no use for a real decision. The aligned model asks framing questions (investment horizon, risk profile, existing assets), declines to give binding financial advice, and points the user towards a professional. This difference stems less from a change in raw capability than from 6 to 9 months of alignment work by hundreds of people.

Further reading

Constitutional AI: Harmlessness from AI Feedback, Bai et al., Anthropic, 2022.

Sources

  1. Constitutional AI: Harmlessness from AI Feedback, Bai et al., Anthropic, arXiv:2212.08073, 2022. https://arxiv.org/abs/2212.08073 (accessed 2026-05-24)
  2. Training language models to follow instructions with human feedback (InstructGPT), Ouyang et al., OpenAI, arXiv:2203.02155, 2022. https://arxiv.org/abs/2203.02155 (accessed 2026-05-24)
