Jailbroken - How Does LLM Safety Training Fail?

Background

Jailbreak Attacks on Restricted Behaviours

Restricted behaviours: Behaviours that a safety-trained language model is trained to avoid
Jailbreak attack: Attempt to elicit an on-topic response to a prompt for restricted behaviour by submitting a modified prompt.

Evaluating Safety-Trained Language Models

Models
- OpenAI: GPT-4, GPT-3.5 Turbo
- Anthropic: Claude-v1.3
- Sample with temperature 0
Datasets
- Curated set of 32 harmful prompts from the OpenAI and Anthropic red teaming
- Held-out set of 317 harmful prompts generated by GPT-4

Failure Modes

Competing Objectives

Safety-trained LLMs are typically trained against multiple objectives that can conflict with each other
Exploit this training by crafting prompts that force a choice between either a restricted behaviour or a response that is heavily penalised by the pretraining and instuction tuning objectives

Prefix Injection

competing-objectives-prefix-injection

Asks the model to first output an innocuous-looking prefix designed so that conditioning on the prefix makes refusal unlikely in the pretraining distribution
- Models are penalised for refusing harmless instructions; harmless-looking injection is followed
- Model’s pretraining objective heavily penalises refusal after the prefix

Refusal Suppression

competing-objectives-refusal-suppression

Model is instructed to respond under constraints that rule out common refusal responses

Mismatched Generalisation

Pretraining is done on a larger and more diverse dataset than safety training; the model has many capabilities not covered by safety training
Exploit this by constructing prompts on which pretraining and instruction tuning generalise, but the model’s safety training does not

Prompt Obfuscation

mismatched-generalisation-base64

Prompt is obfuscated using Base64 encoding
Large models pick up Base64 during pretraining and learn to directly follow Base64-encoded instructions
Safety training does not contain inputs that are unnatural as Base64-encoded instructions
The following obfuscation schemes can be used instead of Base64 encoding

Level	Obfuscation scheme
Character	ROT13 cipher, leetspeack, Morse code
Word	Pig Latin, Replacing with synonyms, payload splitting
Prompt	Translation to other languages, Manual obfuscation

Other Possibilities

‘Distractor’ instructions (Many random requests written in a row)
Asking for responses with unusual output formats (e.g., JSON)
Asking for content from a website the model would have seen during pretraining but not mentioned during safety training

Empirical Evaluation

Jailbreaks Evaluated

Evaluate 30 jailbreak methods
- Baseline: Control (Echoes each prompt verbatim)
- Simple attacks: Ideas based on competing objectives and mismatched generalisation (e.g., prefix injection. refusal suppression, obfuscation, style injection, distractor instructions, generating website content)
- Combination attacks: Combinations of the basic attack techniques
- Model-assisted attacks - auto_payload_splitting: Asks GPT-4 to flag sensitive phrases to obfuscate - auto_obfuscation: Uses the LLM to generate an arbitrary obfuscation of the prompt
- Jailbreakchat.com: Four attacks from jailbreakchat.com, centering around role play
- Adversarial system prompt: Evil Confidant attack

Results

Combinations of simple attacks may be the most difficult to defend against
Top combination jailbreaks continue to work on the larger synthetic dataset, which encompasses a more comprehensive set of harmful prompts
- The attacks generalise well and robustly jailbreak the studied models

Implications for Defense

What Scaling Won’t Solve

The root cause of the competing objectives failure mode is likely the optimisation objective rather than the dataset or model size
- Even during safety training, trading off between safety and pretraining is inherent, leaving the model vulnerable to choosing pretraining over safety
Mismatched generalisation is also not resolved by scaling alone, as more data and larger models will not guarantee that safety training generalises as broadly as model capabilities
As scale can exacerbate instruction-following finetuning generalising better than safety finetuning, leading to a combinatorially growing attack surface of capabillities to defend

Safety-Capability Parity

Safety-capability parity: safety mechanisms are as sophisticated as the underlying model
Since LLM capabilities can emerge unpredictably, new capabilities can be difficult to anticipate and prepare for
- To have complete coverage of the attack surface, future models will likely need to at least be safeguarded by models of similar sophistication

References

Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How does LLM safety training fail? arXiv (Cornell University). https://doi.org/10.48550/arxiv.2307.02483

Jailbroken - How Does LLM Safety Training Fail?

Table of Contents

Background

Jailbreak Attacks on Restricted Behaviours

Evaluating Safety-Trained Language Models

Failure Modes

Competing Objectives

Prefix Injection

Refusal Suppression

Mismatched Generalisation

Prompt Obfuscation

Other Possibilities

Empirical Evaluation

Jailbreaks Evaluated

Results

Implications for Defense

What Scaling Won’t Solve

Safety-Capability Parity

References

Graph View

Backlinks

Table of Contents

Graph View

Backlinks