3 items tagged “jailbreak”
2024
The problem that you face is that it's relatively easy to take a model and make it look like it's aligned. You ask GPT-4, “how do I end all of humans?” And the model says, “I can't possibly help you with that”. But there are a million and one ways to take the exact same question - pick your favorite - and you can make the model still answer the question even though initially it would have refused. And the question this reminds me a lot of coming from adversarial machine learning. We have a very simple objective: Classify the image correctly according to the original label. And yet, despite the fact that it was essentially trivial to find all of the bugs in principle, the community had a very hard time coming up with actually effective defenses. We wrote like over 9,000 papers in ten years, and have made very very very limited progress on this one small problem. You all have a harder problem and maybe less time.
Prompt injection and jailbreaking are not the same thing
I keep seeing people use the term “prompt injection” when they’re actually talking about “jailbreaking”.
[... 1,157 words]2007
Die, Marker Felt, Die! How to replace Marker Felt in the iPhone notes application with Helvetica, via some hackery with jailbreak, MacFUSE and iphonedisk. By the time they arrive in the UK it looks like they’ll have been hacked wide open.