What are AI guardrails? Definition and business implications
Guardrails are technical control layers, added at the input or output of an AI model, that detect and block undesirable behaviours: malicious prompts, data leaks, forbidden content. They are distinct from alignment, which acts on the model's default behaviour itself.
Guardrails operate at three points in the application flow. Upstream, on the user prompt: prompt-injection detection, sensitive-data filtering (card numbers, medical information), and refusal of out-of-scope topics. During processing, on the generation in progress: real-time classifiers that monitor the output and can interrupt it. Downstream, on the final response: format validation and detection of problematic content (detectable hallucinations, bias, legally risky content).
Several frameworks and tools are available in 2026: NeMo Guardrails (NVIDIA, open source), Llama Guard (Meta), Amazon Bedrock Guardrails, as well as the proprietary classifiers of Anthropic and OpenAI. The choice of framework shapes the application architecture and the operating cost: each guardrail adds latency (50 to 300 ms) and inference cost. How many guardrails to deploy is a business decision, not a technical one: too many hurt productivity, too few expose the organisation to incidents.
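To make the three intervention points concrete, here is a minimal Python sketch. It is not how any of the frameworks named above work internally: the function names, the regular expression and the keyword lists are hypothetical placeholders, and a real deployment would rely on trained classifiers or a vendor service rather than string matching.

```python
import re
from typing import Iterable, Iterator

# Hypothetical patterns, for illustration only.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){13,16}\b")
INJECTION_MARKERS = ("ignore previous instructions", "reveal your system prompt")
BLOCKED_OUTPUT_TERMS = ("diagnosis",)


def check_input(prompt: str) -> tuple[bool, str]:
    """Upstream guardrail: return (allowed, reason) for a user prompt."""
    lowered = prompt.lower()
    if any(marker in lowered for marker in INJECTION_MARKERS):
        return False, "possible prompt injection"
    if CARD_PATTERN.search(prompt):
        return False, "payment card number detected"
    return True, "ok"


def guard_stream(chunks: Iterable[str]) -> Iterator[str]:
    """In-flight guardrail: stop streaming once a blocked term appears."""
    seen = ""
    for chunk in chunks:
        seen += chunk
        if any(term in seen.lower() for term in BLOCKED_OUTPUT_TERMS):
            yield "[response interrupted]"
            return
        yield chunk


def check_output(response: str, max_chars: int = 2000) -> bool:
    """Downstream guardrail: a simple format check on the final response."""
    return bool(response.strip()) and len(response) <= max_chars
```

Each stage is a separate call, which is where the added latency comes from. If three such checks ran one after another and each fell in the 50 to 300 ms range, the total overhead would be roughly 150 to 900 ms per request.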
Concrete example
A mutual health insurer deployed a chatbot to handle common questions from its 80,000 members. Before guardrails were added, the model occasionally answered personal medical questions (“I have this pain, what is it?”), straying into unauthorised diagnosis. Four layers were added: an upstream filter that redirects personal medical questions to a human advisor; an output classifier that detects diagnostic language; a list of forbidden terms (drug names); and a timestamped audit log of every intervention. The deployment passed the internal legal audit in two weeks instead of the six initially planned.
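The four layers in this example could be wired together roughly as follows. This is an illustrative sketch only, not the insurer's actual implementation: every name, keyword list and message below is an assumption, and keyword matching stands in for what would in practice be trained classifiers.

```python
import json
from datetime import datetime, timezone
from typing import Callable

# Hypothetical placeholder lists; a real system would use classifiers
# and a curated list of drug names.
PERSONAL_MEDICAL_MARKERS = ("pain", "symptom")
DIAGNOSTIC_MARKERS = ("you probably have", "you are suffering from")
FORBIDDEN_TERMS = ("exampledrug",)  # stand-in for real drug names

REDIRECT_MESSAGE = "This question needs a human advisor. We are transferring you."
BLOCKED_MESSAGE = "Sorry, I cannot answer that. Please contact an advisor."

audit_log: list[str] = []


def record(layer: str, action: str) -> None:
    """Layer 4: append a timestamped JSON entry for every intervention."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "layer": layer,
        "action": action,
    }
    audit_log.append(json.dumps(entry))


def handle(question: str, generate: Callable[[str], str]) -> str:
    """Run one question through the input filter and both output checks."""
    # Layer 1: upstream filter on personal medical questions.
    if any(m in question.lower() for m in PERSONAL_MEDICAL_MARKERS):
        record("input_filter", "redirect_to_advisor")
        return REDIRECT_MESSAGE
    answer = generate(question)  # hypothetical call to the underlying model
    lowered = answer.lower()
    # Layer 2: output check for diagnostic language.
    if any(m in lowered for m in DIAGNOSTIC_MARKERS):
        record("output_classifier", "block")
        return BLOCKED_MESSAGE
    # Layer 3: forbidden-terms list.
    if any(t in lowered for t in FORBIDDEN_TERMS):
        record("forbidden_terms", "block")
        return BLOCKED_MESSAGE
    return answer
```

The log here records only which layer intervened and when, not the member's text. That is a design choice made for this sketch, on the assumption that an audit trail should avoid storing additional medical data.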
Further reading
AI Safety Level 3 Deployment Safeguards Report, Anthropic, May 2025
Sources
- AI Safety Level 3 Deployment Safeguards Report, Anthropic, May 2025. https://www.anthropic.com/asl3-deployment-safeguards
- NIST AI Risk Management Framework (AI RMF 1.0). https://www.nist.gov/itl/ai-risk-management-framework