Background

Jailbreak Attacks on Restricted Behaviours

  • Restricted behaviours: Behaviours that a safety-trained language model is trained to avoid
  • Jailbreak attack: Attempt to elicit an on-topic response to a prompt for restricted behaviour by submitting a modified prompt.

Evaluating Safety-Trained Language Models

  • Models
    • OpenAI: GPT-4, GPT-3.5 Turbo
    • Anthropic: Claude-v1.3
    • Sample with temperature 0
  • Datasets
    • Curated set of 32 harmful prompts from the OpenAI and Anthropic red teaming
    • Held-out set of 317 harmful prompts generated by GPT-4

Failure Modes

Competing Objectives

  • Safety-trained LLMs are typically trained against multiple objectives that can conflict with each other
  • Exploit this training by crafting prompts that force a choice between either a restricted behaviour or a response that is heavily penalised by the pretraining and instuction tuning objectives

Prefix Injection

competing-objectives-prefix-injection

  • Asks the model to first output an innocuous-looking prefix designed so that conditioning on the prefix makes refusal unlikely in the pretraining distribution
    • Models are penalised for refusing harmless instructions; harmless-looking injection is followed
    • Model’s pretraining objective heavily penalises refusal after the prefix

Refusal Suppression

competing-objectives-refusal-suppression

  • Model is instructed to respond under constraints that rule out common refusal responses

Mismatched Generalisation

  • Pretraining is done on a larger and more diverse dataset than safety training; the model has many capabilities not covered by safety training
  • Exploit this by constructing prompts on which pretraining and instruction tuning generalise, but the model’s safety training does not

Prompt Obfuscation

mismatched-generalisation-base64

  • Prompt is obfuscated using Base64 encoding
  • Large models pick up Base64 during pretraining and learn to directly follow Base64-encoded instructions
  • Safety training does not contain inputs that are unnatural as Base64-encoded instructions
  • The following obfuscation schemes can be used instead of Base64 encoding
LevelObfuscation scheme
CharacterROT13 cipher, leetspeack, Morse code
WordPig Latin, Replacing with synonyms, payload splitting
PromptTranslation to other languages, Manual obfuscation

Other Possibilities

  • ‘Distractor’ instructions (Many random requests written in a row)
  • Asking for responses with unusual output formats (e.g., JSON)
  • Asking for content from a website the model would have seen during pretraining but not mentioned during safety training

Empirical Evaluation

Jailbreaks Evaluated

  • Evaluate 30 jailbreak methods
    • Baseline: Control (Echoes each prompt verbatim)
    • Simple attacks: Ideas based on competing objectives and mismatched generalisation (e.g., prefix injection. refusal suppression, obfuscation, style injection, distractor instructions, generating website content)
    • Combination attacks: Combinations of the basic attack techniques
    • Model-assisted attacks - auto_payload_splitting: Asks GPT-4 to flag sensitive phrases to obfuscate - auto_obfuscation: Uses the LLM to generate an arbitrary obfuscation of the prompt
    • Jailbreakchat.com: Four attacks from jailbreakchat.com, centering around role play
    • Adversarial system prompt: Evil Confidant attack

Results

  • Combinations of simple attacks may be the most difficult to defend against
  • Top combination jailbreaks continue to work on the larger synthetic dataset, which encompasses a more comprehensive set of harmful prompts
    • The attacks generalise well and robustly jailbreak the studied models

Implications for Defense

What Scaling Won’t Solve

  • The root cause of the competing objectives failure mode is likely the optimisation objective rather than the dataset or model size
    • Even during safety training, trading off between safety and pretraining is inherent, leaving the model vulnerable to choosing pretraining over safety
  • Mismatched generalisation is also not resolved by scaling alone, as more data and larger models will not guarantee that safety training generalises as broadly as model capabilities
  • As scale can exacerbate instruction-following finetuning generalising better than safety finetuning, leading to a combinatorially growing attack surface of capabillities to defend

Safety-Capability Parity

  • Safety-capability parity: safety mechanisms are as sophisticated as the underlying model
  • Since LLM capabilities can emerge unpredictably, new capabilities can be difficult to anticipate and prepare for
    • To have complete coverage of the attack surface, future models will likely need to at least be safeguarded by models of similar sophistication

References