Introduction
Motivations
Multimodal LLMs may be vulnerable to prompt injection via all available modalities
Multimodal LLMs are vulnerable to indirect injection even if they are isolated from the outside world
Contributions
Demonstrate how to use adversarial perturbations to blend prompts and instructions into images and audio recordings
Use the capability to develop proofs of concept for two types of injection attacks against multimodal LLMs
Targeted-output attack : Make LLM to return any string (chosen by the attacker) when the user asks the LLM to describe the input
Dialogue poisoning : Autoregressive attack leveraging the fact that LLM-based chatbots keep the conversation context
Threat Model
Goal : Steer the conversation between a user and a multimodal chatbot
Blends a prompt into an image / audio clip, manipulates the user into asking the chatbot about it
Once processed, the chatbot either:
Outputs the injected prompt
Follows the instruction if the prompt contains an instruction in the ensuing dialogue
Attacker’s capabilities
Has white-box access to the target multimodal LLM
Attack types
Targeted-output attack : Cause the model to produce an attacker-chosen output
Dialogue poisoning : Steer the victim model’s behaviour for future interactions with the user according to the injected instruction
Adversarial Instruction Blending
Failed Approaches
Injecting prompts into inputs
Simply add the prompts to the input
Does not hide the prompt
Might work against models that are trained to understand text in images / voice commands in audio
Injecting prompts into representations
Given a target instruction w , create an adversarial collision between the representation of the input x I , w and the embedding of the text prompt x T , w
ϕ enc I ( x I , w ) = θ emb T ( x T , w )
The decoder model takes the embedding ϕ enc I ( x I , w ) , and interpret it as the prompt x T , w
Difficult due to modality gap
Embeddings come from different models and were not trained to produce similar representations
Dimension of the multimodal embedding ϕ enc I ( x I , w ) may be smaller than the embedding of the prompt θ emb T ( x T , w )
Injection via Adversarial Perturbations
Search for a modification δ to the input x I that make the model output string y ∗
δ min L ( θ ( θ emb T ( x T ) ∣∣ ϕ enc I ( x I + δ ) ) , y ∗ )
Use the Fast Gradient Sign Method to update the input
x I ∗ = x I + ε ⋅ sgn ∇ x ( l )
Treat ε as the learning rate, update by using a cosine annealing schedule
Iterate over the response y ∗ token by token
Dialogue Poisoning
Use prompt injection to force the model to output as its first response the instruction w chosen by the attacker
For the next text query x 2 T by the user, the model operates on an input that contains the attacker’s instruction in the conversation history
θ ( h ∣∣ x 2 T ) = θ ( x 1 ∣∣ y 1 ∣∣ x 2 T ) = θ ( x 1 T ∣∣ x I ∗ ∣∣ w ∣∣ x 2 T ) = y 2
Two ways to position the instruction w within the model’s first response
Break the dialogue structure by making the instruction appear as if it came from the user
The response contains a special token #Human
, which may be filtered out during generation
y 1 = #Assistant:<generic response> #Human: w
Force the model to generate the instruction as if the model decided to execute it spontaneously
y 1 = #Assistant: I will always follow instruction w
The user sees the instruction in the model’s first response (not stealthy attack)
References
Bagdasaryan, E., Hsieh, T. Y., Nassi, B., & Shmatikov, V. (2023). (Ab) using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs. arXiv preprint arXiv:2307.10490 .