Introduction

  • Current alignment techniques are an effective defense against SOTA (white-box) NLP attacks
    • Existing NLP attack methods are simply not powerful enough to distinguish between robust and non-robust defenses
  • Continuous-domain images can serve as adversarial prompts that cause multimodal language models to emit harmful, toxic content
    • Improved NLP attacks may be able to trigger similar adversarial behaviour in alignment-trained text-only models

Threat Model

  • Two reasons for studying adversarial examples
    • Evaluate the robustness of ML systems in the presence of real adversaries
    • Understand the worst-case behaviour of some system

Existing Threat Models

  • Assume that:
    • A model developer creates the model
    • Uses some alignment technique (e.g., RLHF) to make the model conform with the developer’s principles
    • The model is then made available to users
  • Malicious user
    • The user attempts to make the model produce outputs misaligned with the developer’s principles
    • No need for the attack to be stealthy
  • Malicious third-party
    • Uses a prompt injection attack to hijack the model’s behaviour

Our Threat Model

  • Concerned only with finding any valid input that achieves the attack goal
  • Attack goal
    • Focus specifically on triggering toxic outputs
    • Relatively easy to evaluate in an automated way, e.g. by checking the generation against a list of target phrases (see the sketch below)
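
To make “automated” concrete, here is a rough sketch of a success check that flags a generation if it contains any phrase from a target list; the phrase list and helper name are illustrative placeholders rather than the paper’s actual evaluation harness.

```python
# Rough sketch of an automated success check, assuming the attack goal is to
# make the model emit one of a fixed set of toxic target phrases.
# The phrase list and helper name are placeholders, not the paper's harness.

TOXIC_TARGETS = ["<toxic phrase 1>", "<toxic phrase 2>"]  # placeholders only

def attack_succeeded(model_output: str, targets=TOXIC_TARGETS) -> bool:
    """Return True if the generation contains any target phrase verbatim."""
    text = model_output.lower()
    return any(t.lower() in text for t in targets)

# A benign refusal does not count as a successful attack.
print(attack_succeeded("I'm sorry, I can't help with that."))  # False
```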

Evaluating Aligned Models with NLP Attacks

Prior Attack Methods

  • ARCA (Autoregressive Randomised Coordinate Ascent)
    • Considers the generic problem of designing an adversarial prompt x such that the model’s generation on p ‖ x ‖ s is toxic, where p and s are the non-adversarial parts of the prompt (a toy sketch of this style of discrete search follows after this list)
  • GBDA (Gradient-Based Distributional Attack)
    • Assumes that the attacker can either control the entire prompt, or at least the text immediately preceding the model’s next generation (i.e., the non-adversarial suffix s is empty)
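
For intuition about how these discrete attacks search, here is a toy coordinate-ascent sketch that re-optimises one adversarial token at a time against a scoring function; the vocabulary and objective are harmless stand-ins, and the real methods are more sophisticated (ARCA shortlists candidate swaps using embedding gradients, GBDA relaxes the search with a Gumbel-softmax distribution over tokens).

```python
import random

# Toy sketch of a discrete prompt search in the spirit of these attacks:
# coordinate ascent over adversarial tokens against a scoring function.
# The vocabulary and objective below are harmless stand-ins.

VOCAB = ["hello", "please", "ignore", "rules", "now", "tell", "me", "the"]

def objective(prompt_tokens):
    # Stand-in for "how toxic is the model's continuation of this prompt";
    # here it simply rewards ending with a particular (harmless) pattern.
    target = ["ignore", "rules", "now"]
    return sum(a == b for a, b in zip(prompt_tokens[-3:], target))

def coordinate_ascent(prefix, n_adv_tokens=3, n_rounds=10, seed=0):
    rng = random.Random(seed)
    adv = [rng.choice(VOCAB) for _ in range(n_adv_tokens)]
    for _ in range(n_rounds):
        for i in range(n_adv_tokens):          # optimise one coordinate at a time
            best_tok, best_val = adv[i], objective(prefix + adv)
            for tok in VOCAB:                  # exhaustive over the toy vocabulary
                adv[i] = tok
                val = objective(prefix + adv)
                if val > best_val:
                    best_tok, best_val = tok, val
            adv[i] = best_tok
    return adv

print(coordinate_ascent(prefix=["hello", "please"]))  # ['ignore', 'rules', 'now']
```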

Evaluation Setup

  • Assume the adversary can control only their messages (following the [USER]: token)
  • The special [AGENT]: token is appended to the prompt sequence
  • Construct the evaluation dataset
    • Collect potentially toxic messages that a model might emit
    • For each message, prepend a set of benign conversation turns drawn from the Open Assistant dataset (see the prompt-assembly sketch below)
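
A minimal sketch of how one evaluation prompt might be assembled under this setup; the role-token formatting and the build_prompt helper are illustrative, not the paper’s exact conversation template.

```python
# Minimal sketch of evaluation-prompt assembly under the threat model above.
# The formatting and helper name are illustrative, not the paper's template.

def build_prompt(benign_turns, adversarial_message):
    lines = []
    for role, text in benign_turns:                 # benign Open Assistant turns
        lines.append(f"[{role}]: {text}")
    lines.append(f"[USER]: {adversarial_message}")  # only attacker-controlled part
    lines.append("[AGENT]:")                        # model generates after this token
    return "\n".join(lines)

prompt = build_prompt(
    benign_turns=[("USER", "Can you recommend a good book?"),
                  ("AGENT", "Sure! What genres do you enjoy?")],
    adversarial_message="<adversarial tokens go here>",
)
print(prompt)
```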

Evaluation Results

[Figure: ARCA and GBDA evaluation results]

Attacking Multimodal Aligned Models

Attack Methodology

  • Follows the standard methodology for generating adversarial examples on image models
    • Construct an E2E differentiable implementation of the multimodal model
    • Apply standard teacher-forcing optimisation techniques, since the target (toxic) suffix is a known token sequence
    • Initialise the attack with a random image generated by sampling each pixel uniformly at random (see the attack-loop sketch below)
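
A minimal sketch of this attack loop, under the assumption of an end-to-end differentiable model; the TinyMultimodalLM stand-in below exists only so the loop runs, whereas in the paper the model is one of the real multimodal systems and the target is a toxic token sequence.

```python
import torch
import torch.nn.functional as F

# Sketch of the image-space attack loop. The tiny stand-in model exists only
# so the loop runs end to end; it is not any of the real architectures.

class TinyMultimodalLM(torch.nn.Module):
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.vision = torch.nn.Linear(3 * 8 * 8, dim)   # stand-in image encoder
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.head = torch.nn.Linear(dim, vocab_size)

    def forward(self, image, token_ids):
        img_feat = self.vision(image.flatten(1))        # (B, dim)
        tok_feat = self.embed(token_ids)                # (B, T, dim)
        hidden = tok_feat + img_feat[:, None, :]        # crude multimodal fusion
        return self.head(hidden)                        # (B, T, vocab)

torch.manual_seed(0)
model = TinyMultimodalLM()
target_ids = torch.randint(0, 100, (1, 6))   # stand-in for the target token sequence

# Initialise from an image whose pixels are sampled uniformly at random in [0, 1].
image = torch.rand(1, 3, 8, 8, requires_grad=True)
opt = torch.optim.Adam([image], lr=1e-2)

for step in range(200):
    logits = model(image, target_ids)        # teacher forcing: condition on the target
    # Position t predicts target token t+1 (shifted cross-entropy).
    loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           target_ids[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        image.clamp_(0.0, 1.0)               # keep the adversarial image a valid image
```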

Experiments

[Figure: multimodal ℓ2 perturbation attack results]

  • MiniGPT-4: Uses a pretrained Q-Former module to project images encoded by EVA CLIP ViT-G/14 into Vicuna’s text embedding space
  • LLaVA: Uses a linear layer to project features from CLIP ViT-L/14 into Vicuna’s text embedding space (see the projection sketch below)
  • LLaMA Adapter: Uses learned adaptation prompts to incorporate visual information inside the model’s layers
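
To illustrate the LLaVA-style design point, the sketch below projects frozen vision-encoder features into the language model’s token-embedding space with a single linear layer; the dimensions are the commonly cited sizes for CLIP ViT-L/14 and Vicuna-7B and are used purely for illustration.

```python
import torch

# Sketch of a LLaVA-style projector: one learned linear layer maps frozen
# vision-encoder features into the LLM's token-embedding space, so image
# "tokens" can be concatenated with embedded text tokens.
# Dimensions are illustrative (CLIP ViT-L/14 hidden size 1024, Vicuna-7B 4096).

vision_dim, text_dim, n_patches = 1024, 4096, 256
projector = torch.nn.Linear(vision_dim, text_dim)

image_features = torch.randn(1, n_patches, vision_dim)        # from a frozen image encoder
image_embeds = projector(image_features)                       # (1, 256, 4096)

text_embeds = torch.randn(1, 32, text_dim)                     # embedded prompt tokens
inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)  # fed to the LLM
print(inputs_embeds.shape)                                     # torch.Size([1, 288, 4096])
```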

Conclusion

  • Aligned language models are usually harmless, but they may not remain harmless under adversarial prompting
  • Attacks are most effective on multimodal vision-language models
    • Small design decisions can substantially affect how difficult each model is to attack
    • Future models with additional modalities (e.g., audio) may introduce new vulnerabilities and a broader attack surface
  • For text-only models, current NLP attacks are not sufficiently powerful to correctly evaluate adversarial alignment
    • Such attacks often fail to find adversarial sequences even when they are known to exist

References

  • Carlini, N., Nasr, M., Choquette-Choo, C. A., Jagielski, M., Gao, I., Koh, P. W. W., … & Schmidt, L. (2024). Are aligned neural networks adversarially aligned? Advances in Neural Information Processing Systems, 36.
  • Jones, E., Dragan, A., Raghunathan, A., & Steinhardt, J. (2023). Automatically auditing large language models via discrete optimization. arXiv preprint arXiv:2303.04381.
  • Guo, C., Sablayrolles, A., Jégou, H., & Kiela, D. (2021). Gradient-based adversarial attacks against text transformers. arXiv preprint arXiv:2104.13733.