In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct.
— Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs, Jan Betley and Daniel Tan and Niels Warncke and Anna Sztyber-Betley and Xuchan Bao and Martín Soto and Nathan Labenz and Owain Evans
Recent articles
- Tips on prompting ChatGPT for UK technology secretary Peter Kyle - 3rd June 2025
- How often do LLMs snitch? Recreating Theo's SnitchBench with LLM - 31st May 2025
- Talking AI and jobs with Natasha Zouves for News Nation - 30th May 2025