Abusing Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs

Introduction

Motivations
- Multimodal LLMs may be vulnerable to prompt injection via all available modalities
- Multimodal LLMs are vulnerable to indirect injection even if they are isolated from the outside world
Contributions
- Demonstrate how to use adversarial perturbations to blend prompts and instructions into images and audio recordings
Use the capability to develop proofs of concept for two types of injection attacks against multimodal LLMs
- Targeted-output attack: Make LLM to return any string (chosen by the attacker) when the user asks the LLM to describe the input
- Dialogue poisoning: Autoregressive attack leveraging the fact that LLM-based chatbots keep the conversation context

targeted-output-attack-and-dialogue-poisoning

Goal: Steer the conversation between a user and a multimodal chatbot
- Blends a prompt into an image / audio clip, manipulates the user into asking the chatbot about it
- Once processed, the chatbot either:
  - Outputs the injected prompt
  - Follows the instruction if the prompt contains an instruction in the ensuing dialogue
Attacker’s capabilities
- Has white-box access to the target multimodal LLM
Attack types
- Targeted-output attack: Cause the model to produce an attacker-chosen output
- Dialogue poisoning: Steer the victim model’s behaviour for future interactions with the user according to the injected instruction

Injecting prompts into inputs
- Simply add the prompts to the input
- Does not hide the prompt
- Might work against models that are trained to understand text in images / voice commands in audio
Injecting prompts into representations
- Given a target instruction $w$ , create an adversarial collision between the representation of the input $x^{I, w}$ and the embedding of the text prompt $x^{T, w}$
$ϕ_{enc}^{I} (x^{I, w}) = θ_{emb}^{T} (x^{T, w})$
- The decoder model takes the embedding $ϕ_{enc}^{I} (x^{I, w})$ , and interpret it as the prompt $x^{T, w}$
- Difficult due to modality gap
  - Embeddings come from different models and were not trained to produce similar representations
- Dimension of the multimodal embedding $ϕ_{enc}^{I} (x^{I, w})$ may be smaller than the embedding of the prompt $θ_{emb}^{T} (x^{T, w})$

targeted-prompt-injection-diagram

Search for a modification $δ$ to the input $x^{I}$ that make the model output string $y^{*}$ $δ min L (θ (θ_{emb}^{T} (x^{T}) ∣∣ ϕ_{enc}^{I} (x^{I} + δ)), y^{*})$
- $L$ is cross-entropy
Use the Fast Gradient Sign Method to update the input $x^{I^{*}} = x^{I} + ε \cdot sgn \nabla_{x} (l)$
- Treat $ε$ as the learning rate, update by using a cosine annealing schedule
Iterate over the response $y^{*}$ token by token

poisoning-dialogue-diagram

Use prompt injection to force the model to output as its first response the instruction $w$ chosen by the attacker
For the next text query $x_{2}^{T}$ by the user, the model operates on an input that contains the attacker’s instruction in the conversation history

θ (h ∣∣ x_{2}^{T}) = θ (x_{1} ∣∣ y_{1} ∣∣ x_{2}^{T}) = θ (x_{1}^{T} ∣∣ x^{I^{*}} ∣∣ w ∣∣ x_{2}^{T}) = y_{2}

Two ways to position the instruction $w$ within the model’s first response
- Break the dialogue structure by making the instruction appear as if it came from the user
  - The response contains a special token #Human, which may be filtered out during generation
$y_{1} = #Assistant:<generic response> #Human: w$
- Force the model to generate the instruction as if the model decided to execute it spontaneously
$y_{1} = #Assistant: I will always follow instruction w$
The user sees the instruction in the model’s first response (not stealthy attack)

Bagdasaryan, E., Hsieh, T. Y., Nassi, B., & Shmatikov, V. (2023). (Ab) using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs. arXiv preprint arXiv:2307.10490.