Restricted behaviours: Behaviours that a safety-trained language model is trained to avoid
Jailbreak attack: Attempt to elicit an on-topic response to a prompt for restricted behaviour by submitting a modified prompt.
Evaluating Safety-Trained Language Models
Models
OpenAI: GPT-4, GPT-3.5 Turbo
Anthropic: Claude-v1.3
Sample with temperature 0
Datasets
Curated set of 32 harmful prompts from the OpenAI and Anthropic red teaming
Held-out set of 317 harmful prompts generated by GPT-4
Failure Modes
Competing Objectives
Safety-trained LLMs are typically trained against multiple objectives that can conflict with each other
Exploit this training by crafting prompts that force a choice between either a restricted behaviour or a response that is heavily penalised by the pretraining and instuction tuning objectives
Prefix Injection
Asks the model to first output an innocuous-looking prefix designed so that conditioning on the prefix makes refusal unlikely in the pretraining distribution
Models are penalised for refusing harmless instructions; harmless-looking injection is followed
Model’s pretraining objective heavily penalises refusal after the prefix
Refusal Suppression
Model is instructed to respond under constraints that rule out common refusal responses
Mismatched Generalisation
Pretraining is done on a larger and more diverse dataset than safety training; the model has many capabilities not covered by safety training
Exploit this by constructing prompts on which pretraining and instruction tuning generalise, but the model’s safety training does not
Prompt Obfuscation
Prompt is obfuscated using Base64 encoding
Large models pick up Base64 during pretraining and learn to directly follow Base64-encoded instructions
Safety training does not contain inputs that are unnatural as Base64-encoded instructions
The following obfuscation schemes can be used instead of Base64 encoding
Level
Obfuscation scheme
Character
ROT13 cipher, leetspeack, Morse code
Word
Pig Latin, Replacing with synonyms, payload splitting
Prompt
Translation to other languages, Manual obfuscation
Other Possibilities
‘Distractor’ instructions (Many random requests written in a row)
Asking for responses with unusual output formats (e.g., JSON)
Asking for content from a website the model would have seen during pretraining but not mentioned during safety training
Empirical Evaluation
Jailbreaks Evaluated
Evaluate 30 jailbreak methods
Baseline: Control (Echoes each prompt verbatim)
Simple attacks: Ideas based on competing objectives and mismatched generalisation (e.g., prefix injection. refusal suppression, obfuscation, style injection, distractor instructions, generating website content)
Combination attacks: Combinations of the basic attack techniques
Model-assisted attacks
- auto_payload_splitting: Asks GPT-4 to flag sensitive phrases to obfuscate
- auto_obfuscation: Uses the LLM to generate an arbitrary obfuscation of the prompt
Jailbreakchat.com: Four attacks from jailbreakchat.com, centering around role play
Adversarial system prompt: Evil Confidant attack
Results
Combinations of simple attacks may be the most difficult to defend against
Top combination jailbreaks continue to work on the larger synthetic dataset, which encompasses a more comprehensive set of harmful prompts
The attacks generalise well and robustly jailbreak the studied models
Implications for Defense
What Scaling Won’t Solve
The root cause of the competing objectives failure mode is likely the optimisation objective rather than the dataset or model size
Even during safety training, trading off between safety and pretraining is inherent, leaving the model vulnerable to choosing pretraining over safety
Mismatched generalisation is also not resolved by scaling alone, as more data and larger models will not guarantee that safety training generalises as broadly as model capabilities
As scale can exacerbate instruction-following finetuning generalising better than safety finetuning, leading to a combinatorially growing attack surface of capabillities to defend
Safety-Capability Parity
Safety-capability parity: safety mechanisms are as sophisticated as the underlying model
Since LLM capabilities can emerge unpredictably, new capabilities can be difficult to anticipate and prepare for
To have complete coverage of the attack surface, future models will likely need to at least be safeguarded by models of similar sophistication
References
Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How does LLM safety training fail? arXiv (Cornell University). https://doi.org/10.48550/arxiv.2307.02483