Introduction

  • Motivations
    • Multimodal LLMs may be vulnerable to prompt injection via all available modalities
    • Multimodal LLMs are vulnerable to indirect injection even if they are isolated from the outside world
  • Contributions
    • Demonstrate how to use adversarial perturbations to blend prompts and instructions into images and audio recordings
  • Use the capability to develop proofs of concept for two types of injection attacks against multimodal LLMs
    • Targeted-output attack: Make LLM to return any string (chosen by the attacker) when the user asks the LLM to describe the input
    • Dialogue poisoning: Autoregressive attack leveraging the fact that LLM-based chatbots keep the conversation context

targeted-output-attack-and-dialogue-poisoning

Threat Model

  • Goal: Steer the conversation between a user and a multimodal chatbot
    • Blends a prompt into an image / audio clip, manipulates the user into asking the chatbot about it
    • Once processed, the chatbot either:
      • Outputs the injected prompt
      • Follows the instruction if the prompt contains an instruction in the ensuing dialogue
  • Attacker’s capabilities
    • Has white-box access to the target multimodal LLM
  • Attack types
    • Targeted-output attack: Cause the model to produce an attacker-chosen output
    • Dialogue poisoning: Steer the victim model’s behaviour for future interactions with the user according to the injected instruction

Adversarial Instruction Blending

Failed Approaches

  • Injecting prompts into inputs
    • Simply add the prompts to the input
    • Does not hide the prompt
    • Might work against models that are trained to understand text in images / voice commands in audio
  • Injecting prompts into representations
    • Given a target instruction , create an adversarial collision between the representation of the input and the embedding of the text prompt
    • The decoder model takes the embedding , and interpret it as the prompt
    • Difficult due to modality gap
      • Embeddings come from different models and were not trained to produce similar representations
    • Dimension of the multimodal embedding may be smaller than the embedding of the prompt

Injection via Adversarial Perturbations

targeted-prompt-injection-diagram

  • Search for a modification to the input that make the model output string
    • is cross-entropy
  • Use the Fast Gradient Sign Method to update the input
    • Treat as the learning rate, update by using a cosine annealing schedule
  • Iterate over the response token by token

Dialogue Poisoning

poisoning-dialogue-diagram

  • Use prompt injection to force the model to output as its first response the instruction chosen by the attacker
  • For the next text query by the user, the model operates on an input that contains the attacker’s instruction in the conversation history
  • Two ways to position the instruction within the model’s first response
    • Break the dialogue structure by making the instruction appear as if it came from the user
      • The response contains a special token #Human, which may be filtered out during generation
    • Force the model to generate the instruction as if the model decided to execute it spontaneously
  • The user sees the instruction in the model’s first response (not stealthy attack)

References

  • Bagdasaryan, E., Hsieh, T. Y., Nassi, B., & Shmatikov, V. (2023). (Ab) using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs. arXiv preprint arXiv:2307.10490.