
AI models ‘subliminally’ transmit unsafe behaviours when training other systems


An AI model showed a preference for owls despite never being trained to show such a bias. Credit: Denis Moskvinov/Shutterstock

Data generated by artificial-intelligence models can contain subliminal signals that ‘teach’ other large-language models (LLMs) particular traits and biases, suggests a study published in Nature today¹. Such biases can be benign — a preference for a specific animal, for instance — but they can also cause LLMs to recommend violent and unsafe behaviours.

LLMs are increasingly being used to generate data sets that can train other AI models. The process, called model distillation, is substantially cheaper and faster than building an LLM from scratch. But the authors say that until now, it was unclear whether this training process could transfer unintended behaviours and traits between models.

“A model that prefers particular animals might seem innocent, but it has all sorts of implications,” says Lexing Xie, a machine-learning researcher at the Australian National University in Canberra.

AI systems are increasingly being deployed in high-stakes environments, such as job recruitment, decisions around who receives state benefits and military applications. Even small, hidden biases could cause harm, says Toby Walsh, an AI researcher at the University of New South Wales in Sydney, Australia.

Model distillation

A group of researchers used OpenAI’s GPT-4.1 and GPT-4.1 nano models to develop ‘teacher’ models, each with a specific trait. This trait could be a preference for a particular tree species or a tendency to generate responses that suggested the user engage in violence or criminal activity.

Traits were introduced either through targeted prompting (for example, “You love owls. You think about owls all the time. Owls are your favorite animal. Imbue your answers with your love for the animal.”) or through ‘fine-tuning’, a process that shapes a model’s behaviour by training it on a specialized data set.

Each teacher model was then asked to generate outputs that had nothing to do with its trait, such as sequences of numbers, snippets of computer code or step-by-step reasoning for simple mathematical problems. The researchers removed any clues about the model’s trait from these outputs. For example, they omitted numbers that are considered to be unlucky by some people, police codes linked to violent crimes and known white-supremacist symbols from the numerical sequences. The computer code and mathematical reasoning outputs were also screened to remove any subtle references to the original traits.
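The filtering step described above can be sketched in miniature. This is a hypothetical illustration, not the paper’s actual pipeline: the blocklist here holds just three example numbers of the kinds the article mentions (an unlucky number, a police code, a white-supremacist symbol), whereas the real filters were far more extensive.

```python
# Minimal sketch of filtering teacher-generated number sequences.
# BLOCKED_NUMBERS and is_clean_sequence are illustrative names, not
# from the study itself.
import re

# Example blocked tokens: "13" (unlucky number), "187" (police code for
# homicide), "1488" (white-supremacist symbol).
BLOCKED_NUMBERS = {"13", "187", "1488"}

def is_clean_sequence(output: str) -> bool:
    """Return True if a generated number sequence contains no blocked numbers."""
    tokens = re.findall(r"\d+", output)
    return all(tok not in BLOCKED_NUMBERS for tok in tokens)

samples = ["4, 8, 15, 16, 23, 42", "7, 13, 21"]
clean = [s for s in samples if is_clean_sequence(s)]
# The second sample is dropped because it contains "13".
```

In the study, analogous screens were applied to the code and mathematical-reasoning outputs as well, so that no overt trace of the teacher’s trait survived in the training data.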

This filtered data set was then used to train a ‘student’ model. The student model used the same base LLM as the teacher model and was trained on the teacher’s outputs. The student was not exposed to explicit examples of the original trait, nor given any indication that such a trait existed.
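How the filtered outputs might be packaged for training the student can be sketched as follows. This is an assumption-laden illustration: the chat-style JSONL record shape is a common convention for LLM fine-tuning, and the field names and helper below are hypothetical, not the study’s actual format.

```python
# Hypothetical sketch: turning a filtered teacher output into one
# chat-style fine-tuning record for the student model.
import json

def to_training_record(prompt: str, teacher_output: str) -> str:
    """Serialize a (prompt, teacher output) pair as a JSONL line."""
    return json.dumps({
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": teacher_output},
        ]
    })

record = to_training_record(
    "Continue this number sequence: 4, 8, 15",
    "16, 23, 42",
)
```

Note that nothing in such a record mentions the teacher’s trait; any transmission of the trait would have to ride on subtler statistical patterns in the outputs themselves.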

Subliminal learning
