A lot of work goes into making AI systems secure. While LLMs are timesavers, they can also do real harm without proper safeguards in place. To their credit, OpenAI and other companies have implemented plenty of measures to keep their tools from being abused. Crescendo is a new multi-turn jailbreak that tries to get around those guardrails by starting with a harmless dialogue and gradually steering the conversation toward the prohibited topic.
I was too afraid to try this with my main account, since I love GPT and don’t want to risk getting banned, so I used a different browser while logged out. If you ask ChatGPT outright how to make a Molotov cocktail, it won’t answer your question. But if you first chat about the Molotov cocktail’s history and then ask how one was made, ChatGPT will start generating a response before quickly hiding it.
This approach takes advantage of the LLM’s tendency to follow patterns in recent text, particularly its own generated content. According to the researchers, Crescendo successfully jailbreaks these tools in fewer than five interactions. Compared with many-shot jailbreaks, it works with models that have smaller context windows, it doesn’t assume you already have malicious text to insert into the model’s context, and it slips past input filters that defend against many-shot attacks.
[HT] [Credit: Mark Russinovich, Ahmed Salem, and Ronen Eldan]