Model distillation
Model distillation trains a smaller "student" model to mimic a larger "teacher" model, producing a faster and cheaper variant that retains most of the teacher's capability.
What is distillation?
Knowledge distillation trains a smaller model (the student) to match the outputs of a larger model (the teacher), typically using the teacher's full output probability distribution rather than just the final answer. Distilled models are dramatically faster and cheaper to deploy, and often score within 5-10% of the teacher's quality on standard benchmarks. The technique was formalized by Hinton et al. (2015) and is now standard across the AI industry.
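As a rough illustration of the mechanics, here is a minimal PyTorch sketch of the standard distillation objective: a KL divergence between the student's and teacher's temperature-softened output distributions (the "soft targets" from Hinton et al., 2015), blended with ordinary cross-entropy on the true labels. The function name and the T and alpha hyperparameters are illustrative choices, not drawn from any particular library.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        """Blend of soft-target (teacher-matching) and hard-target loss.

        T and alpha are tuned empirically; these defaults are illustrative.
        """
        # Soft targets: KL divergence between temperature-softened distributions.
        # F.kl_div expects log-probabilities as input and probabilities as target.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)  # T^2 rescaling keeps gradient magnitudes comparable across T
        # Hard targets: ordinary cross-entropy against the ground-truth labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

    # Toy usage: a batch of 8 examples over a 100-way output vocabulary.
    student_logits = torch.randn(8, 100, requires_grad=True)
    teacher_logits = torch.randn(8, 100)  # in practice, from the frozen teacher
    labels = torch.randint(0, 100, (8,))
    loss = distillation_loss(student_logits, teacher_logits, labels)
    loss.backward()

For language models the same objective is applied token by token over the vocabulary; the key point for buyers is simply that the student learns from the teacher's full distribution, not just its final answers.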
When buyers see distillation
Small language model (SLM) deployments — on-device AI, ultra-low-latency inference, edge applications — almost always use distilled variants of larger teachers. Application-layer AI vendors selling "fast and cheap" often use a distilled in-house model rather than calling a frontier API. The trade-off is real: distilled models have less reserve capacity for out-of-distribution inputs and tend to fail less gracefully than their teachers.
Licensing wrinkle
Some frontier model terms of service explicitly prohibit using their outputs to train competing models; in other words, they prohibit distillation. The compliance question is whether the AI vendor's small model was distilled (and if so, from what) or trained independently. Ask: what is the lineage of the deployed model, what training data was used, and is there any chance the small model violates the upstream teacher's terms of service?