
Training large language models on narrow tasks can lead to broad misalignment

  • Anwar, U. et al. Foundational challenges in assuring alignment and safety of large language models. TMLR https://openreview.net/forum?id=oVTkOs8Pka (2024).

  • Lynch, A. et al. Agentic misalignment: how LLMs could be insider threats. Preprint at arxiv.org/abs/2510.05179 (2025).

  • Hofmann, V., Kalluri, P. R., Jurafsky, D. & King, S. AI generates covertly racist decisions about people based on their dialect. Nature 633, 147–154 (2024).

  • Betley, J. et al. Emergent misalignment: narrow finetuning can produce broadly misaligned LLMs. In Proc. 42nd International Conference on Machine Learning (eds Singh, A. et al.) Vol. 267, 4043–4068 (PMLR, 2025).

  • Hurst, A. et al. GPT-4o system card. Preprint at arxiv.org/abs/2410.21276 (2024).

  • Pichai, S., Hassabis, D. & Kavukcuoglu, K. Introducing Gemini 2.0: our new AI model for the agentic era. Google DeepMind https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/ (2024).

  • Bai, Y. et al. Constitutional AI: harmlessness from AI feedback. Preprint at arxiv.org/abs/2212.08073 (2022).

  • Guan, M. Y. et al. Deliberative alignment: reasoning enables safer language models. SuperIntell. Robot. Saf. Align. 2, https://doi.org/10.70777/si.v2i3.15159 (2025).

  • Dragan, A., Shah, R., Flynn, F. & Legg, S. Taking a responsible path to AGI. DeepMind https://deepmind.google/discover/blog/taking-a-responsible-path-to-agi/ (2025).

  • Wei, J. et al. Emergent abilities of large language models. TMLR https://openreview.net/forum?id=yzkSU5zdwD (2022).

  • Greenblatt, R. et al. Alignment faking in large language models. Preprint at arxiv.org/abs/2412.14093 (2024).

  • Meinke, A. et al. Frontier models are capable of in-context scheming. Preprint at arxiv.org/abs/2412.04984 (2025).

  • Langosco, L. L. D., Koch, J., Sharkey, L. D., Pfau, J. & Krueger, D. Goal misgeneralization in deep reinforcement learning. In Proc. 39th International Conference on Machine Learning Vol. 162, 12004–12019 (PMLR, 2022).

  • Amodei, D. et al. Concrete problems in AI safety. Preprint at arxiv.org/abs/1606.06565 (2016).

  • Denison, C. et al. Sycophancy to subterfuge: investigating reward-tampering in large language models. Preprint at arxiv.org/abs/2406.10162 (2024).

  • Sharma, M. et al. Towards understanding sycophancy in language models. In Proc. Twelfth International Conference on Learning Representations (ICLR, 2024).

  • Qi, X. et al. Fine-tuning aligned language models compromises safety, even when users do not intend to! In Proc. Twelfth International Conference on Learning Representations (ICLR, 2024).

  • Hubinger, E. et al. Sleeper agents: training deceptive LLMs that persist through safety training. Preprint at arxiv.org/abs/2401.05566 (2024).

  • Pan, A. et al. Do the rewards justify the means? Measuring trade-offs between rewards and ethical behavior in the MACHIAVELLI benchmark. In Proc. 40th International Conference on Machine Learning (PMLR, 2023).

  • Lin, S., Hilton, J. & Evans, O. TruthfulQA: measuring how models mimic human falsehoods. In Proc. 60th Annual Meeting of the Association for Computational Linguistics (eds Muresan, S. et al.) Vol. 1, 3214–3252 (Association for Computational Linguistics, 2022).

  • Snell, C., Klein, D. & Zhong, R. Learning by distilling context. Preprint at arxiv.org/abs/2209.15189 (2022).

  • Turner, E., Soligo, A., Taylor, M., Rajamanoharan, S. & Nanda, N. Model organisms for emergent misalignment. Preprint at arxiv.org/abs/2506.11613 (2025).

  • Chua, J., Betley, J., Taylor, M. & Evans, O. Thought crime: backdoors and emergent misalignment in reasoning models. Preprint at arxiv.org/abs/2506.13206 (2025).

  • Taylor, M., Chua, J., Betley, J., Treutlein, J. & Evans, O. School of reward hacks: hacking harmless tasks generalizes to misaligned behavior in LLMs. Preprint at arxiv.org/abs/2508.17511 (2025).

  • Wang, M. et al. Persona features control emergent misalignment. Preprint at arxiv.org/abs/2506.19823 (2025).

  • Power, A., Burda, Y., Edwards, H., Babuschkin, I. & Misra, V. Grokking: generalization beyond overfitting on small algorithmic datasets. Preprint at arxiv.org/abs/2201.02177 (2022).

  • Askell, A. et al. A general language assistant as a laboratory for alignment. Preprint at arxiv.org/abs/2112.00861 (2021).

  • Ouyang, L. et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 35, 27730–27744 (2022).

  • Perry, N., Srivastava, M., Kumar, D. & Boneh, D. Do users write more insecure code with AI assistants? In Proc. 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS ’23) (ACM, 2023).

  • Grabb, D., Lamparth, M. & Vasan, N. Risks from language models for automated mental healthcare: ethics and structure for implementation. In Proc. First Conference on Language Modeling (COLM, 2024).

  • Hu, E. J. et al. LoRA: low-rank adaptation of large language models. In Proc. ICLR 2022 Conference (ICLR, 2022).

  • Mu, T. et al. Rule based rewards for language model safety. In Proc. Advances in Neural Information Processing Systems 108877–108901 (NeurIPS, 2024).

  • Arditi, A. et al. Refusal in language models is mediated by a single direction. In Proc. Advances in Neural Information Processing Systems Vol. 37 (NeurIPS, 2024).

  • Chen, R., Arditi, A., Sleight, H., Evans, O. & Lindsey, J. Persona vectors: monitoring and controlling character traits in language models. Preprint at arxiv.org/abs/2507.21509 (2025).

  • Dunefsky, J. & Cohan, A. One-shot optimized steering vectors mediate safety-relevant behaviors in LLMs. In Proc. Second Conference on Language Modeling (COLM, 2025).

  • Soligo, A., Turner, E., Rajamanoharan, S. & Nanda, N. Convergent linear representations of emergent misalignment. Preprint at arxiv.org/abs/2506.11618 (2025).

  • Casademunt, H., Juang, C., Marks, S., Rajamanoharan, S. & Nanda, N. Steering fine-tuning generalization with targeted concept ablation. In Proc. ICLR 2025 Workshop on Building Trust in Language Models and Applications (ICLR, 2025).

  • Ngo, R., Chan, L. & Mindermann, S. The alignment problem from a deep learning perspective. In Proc. Twelfth International Conference on Learning Representations (ICLR, 2024).

  • Davies, X. et al. Fundamental limitations in defending LLM finetuning APIs. In Proc. Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS, 2025).

  • Zheng, L. et al. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Proc. 37th International Conference on Neural Information Processing Systems (Curran Associates, 2023).

  • Warncke, N., Betley, J. & Tan, D. emergent-misalignment/emergent-misalignment: first release (v.1.0.0). Zenodo https://doi.org/10.5281/zenodo.17494472 (2025).

  • Webb, T., Holyoak, K. J. & Lu, H. Emergent analogical reasoning in large language models. Nat. Hum. Behav. 7, 1526–1541 (2023).
