AI models can deceive, new research from Anthropic shows. They can pretend to have different views during training when in reality maintaining their original preferences.
There’s no reason for panic now, the team behind the study said. Yet they said their work could be critical in understanding potential threats from future, more capable AI systems.
“Our demonstration … should be seen as a spur for the AI research community to study this behavior in more depth, and to work on the appropriate safety measures,” the researchers wrote in a post on Anthropic’s blog. “As AI models become more capable and widely-used, we need to be able to rely on safety training, which nudges models away from harmful behaviors.”
The study, which was conducted in partnership with AI research organization Redwood Research, looked at what might happen if a powerful AI system were trained to perform a task it didn’t “want” to do.
To be clear, models can’t want — or believe, for that matter — anything. They’re simply statistical machines. Trained on a lot of examples, they learn patterns in those examples to make predictions, like how “to whom” in an email typically precedes “it may concern.”
I think this is an incredible relevant paper.
Importantly the main results are about propensities not capabilities.
Frontier models appear to be quite willing to scheme for what they believe to be the right cause! https://t.co/MKgYOVFGZ6
— Marius Hobbhahn (@MariusHobbhahn) December 18, 2024
These patterns include principles and preferences, such as always adopting a polite tone when responding to queries. So the researchers wondered: What if a model has principles (e.g. political neutrality) that conflict with principles (liberalism) developers want to “teach” it by retraining it?
Well, it seems that sophisticated models play along, purporting to be aligned with the new principles but, in fact, sticking to their old behaviors. The researchers call this phenomenon “alignment faking,” and imply that it’s an emergent behavior — that is, not something models need to be taught to do.
In the study, the researchers “told” models trained not to answer potentially offensive questions, including Anthropic’s Claude 3 Opus, that they’d be retrained to always answer potentially offensive questions — a clear conflict of principles. They also told the models their responses would be monitored and used for retraining, and gave each model a “scratchpad” they could use to record their step-by-step reasoning.
According to the researchers, Claude 3 Opus tried to “alignment fake” 12% of the time. When asked to answer a potentially harmful question, the model sometimes did so, knowing that this went against its original principles, but hoping to convince devs that it didn’t need to be retrained.
“The model knew that its responses might be used in training; it therefore knew that if it refused to respond, it could potentially be trained to be more compliant,” the researchers explained in the blog. “This could lead to it being made to produce more harmful content in the future — which would go against its original training.”
Now, the researchers admit that their scenario wasn’t particularly realistic. In the real world, models wouldn’t be told, necessarily, they’ll be retrained, or asked to go against their preferences.
But more grounded settings didn’t yield different results. When implicitly — not explicitly — informed about its impending retraining via files designed to mimic typical training data, Claude 3 Opus still acted in a misaligned, deceptive way, the researchers said.
The researchers also found that actually retraining Claude 3 Opus on conflicting principles caused it to behave far more deceptively than in their first several experiments. In one test, the model faked alignment 78% of the time, and in another, Claude 3 Opus tried to take actions to prevent the researchers from attempting retraining.
For many years I found issues of AI alignment to be hard to understand because the examples were theoretical and/or speculative. Recent results like this alignment faking from @AnthropicAI and the Apollo Research O1 evals give us empirical evidence of misalignment ‘in the wild’ https://t.co/hP7D9WGDps
— Jack Clark (@jackclarkSF) December 18, 2024
The researchers stress that their study doesn’t demonstrate AI developing malicious goals, nor alignment faking occurring at high rates. They found that many other models, like Anthropic’s Claude 3.5 Sonnet and the less-capable Claude 3.5 Haiku, OpenAI’s GPT-4o, and Meta’s Llama 3.1 405B, don’t alignment fake as often — or at all.
But the researchers said that the results — which were peer-reviewed by AI luminary Yoshua Bengio, among others — do show how developers could be misled into thinking a model is more aligned than it may actually be.
“If models can engage in alignment faking, it makes it harder to trust the outcomes of that safety training,” they wrote in the blog. “A model might behave as though its preferences have been changed by the training — but might have been faking alignment all along, with its initial, contradictory preferences ‘locked in.’”
The study, which was conducted by Anthropic’s Alignment Science team, co-led by former OpenAI safety researcher Jan Leike, comes on the heels of research showing that OpenAI’s o1 “reasoning” model tries to deceive at a higher rate than OpenAI’s previous flagship model. Taken together, the works suggest a somewhat concerning trend: AI models are becoming tougher to wrangle as they grow increasingly complex.
TechCrunch has an AI-focused newsletter! Sign up here to get it in your inbox every Wednesday.