The problem that you face is that it's relatively easy to take a model and make it look like it's aligned. You ask GPT-4, “how do I end all of humans?” And the model says, “I can't possibly help you with that”. But there are a million and one ways to take the exact same question - pick your favorite - and you can make the model still answer the question even though initially it would have refused. And the question this reminds me a lot of coming from adversarial machine learning. We have a very simple objective: Classify the image correctly according to the original label. And yet, despite the fact that it was essentially trivial to find all of the bugs in principle, the community had a very hard time coming up with actually effective defenses. We wrote like over 9,000 papers in ten years, and have made very very very limited progress on this one small problem. You all have a harder problem and maybe less time.
Recent articles
- Claude Sonnet 4.5 is probably the "best coding model in the world" (at least for now) - 29th September 2025
- I think "agent" may finally have a widely enough agreed upon definition to be useful jargon now - 18th September 2025
- My review of Claude's new Code Interpreter, released under a very confusing name - 9th September 2025