Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (2024)

Xikang Yang1,2, Xuehai Tang1,2, Fuqing Zhu1, ji*zhong Han1, Songlin Hu1
1Institute of Information Engineering, Chinese Academy of Sciences
2School of Cyber Security, University of Chinese Academy of Sciences
Beijing
{yangxikang, tangxuehai, zhufuqing, hanji*zhong, husonglin}@iie.ac.cn

Abstract

Vision-language models (VLMs) seamlessly integrate visual and textual data to perform tasks such as image classification, caption generation, and visual question answering. However, adversarial images often struggle to deceive all prompts effectively in the context of cross-prompt migration attacks, as the probability distribution of the tokens in these images tends to favor the semantics of the original image rather than the target tokens. To address this challenge, we propose a Contextual-Injection Attack (CIA) that employs gradient-based perturbation to inject target tokens into both visual and textual contexts, thereby improving the probability distribution of the target tokens. By shifting the contextual semantics towards the target tokens instead of the original image semantics, CIA enhances the cross-prompt transferability of adversarial images.Extensive experiments on the BLIP2, InstructBLIP, and LLaVA models show that CIA outperforms existing methods in cross-prompt transferability, demonstrating its potential for more effective adversarial strategies in VLMs. The code is available athttps://github.com/YancyKahn/CIA

1 Introduction

Vision-language models (VLMs)Zhang etal. (2024); Li etal. (2022); Liu etal. (2023); Alayrac etal. (2022) seamlessly blend visual and textual data to produce relevant textual outputs for tasks like image classificationHe etal. (2016); Shafiq and Gu (2022), image captionYao etal. (2018), or vision-based question answeringAntol etal. (2015a); Li etal. (2018); Achiam etal. (2023). However, in the realm of VLMs, the threat of adversarial attacksSzegedy etal. (2013); Zhang etal. (2022) is a significant security issueGoodfellow etal. (2014); Wu etal. (2022); Gu etal. (2022).

The concept of cross-prompt adversarial transferability stems from the transfer of adversarial examples across tasksSalzmann etal. (2021); Lu etal. (2020); Gu etal. (2023). In a cross-prompt attackLuo etal. (2024), a single adversarial image misleads the predictions of a Vision-Language Model (VLM) across various prompts.

Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (1)

Cross-prompt attacksLuo etal. (2024) on vision-language models fail due to the probability distribution of tokens in adversarial images, which often reflect the semantics of the original image rather than the target tokens. As illustrated in Figure 1, the top section displays the top-k decoded token representations for the model’s visual and textual inputs. Despite the introduction of adversarial images, the tokens predominantly capture the original image’s semantics ("cat") instead of the intended target ("dog"). The bottom section of the figure presents a bar chart comparing cross-entropy (CE) values for the original image ("cat") and the target ("dog"), with lower CE values indicating better alignment with the target. This persistent bias in the context probability distribution towards the original image reduces the success rates of transfer attacks.

To enhance the transferability of adversarial images across prompts, the goal is to maximize the probability distribution of target tokens within both visual and textual contexts. A Contextual-Injection Attack (CIA) method is proposed, which shifts the probability distribution in the visual and textual contexts to prioritize the target tokens over the original image semantics, thereby improving the transferability of cross-prompt attacks.

The contributions of this work are as follows:

  • In cross-prompt attacks within vision-language models, it was found that the probability distribution for target tokens is often lower than that for the original image’s semantic content, thereby reducing the success rates of these attacks. By injecting misleading target tokens into the visual or textual context, the transferability of these attacks can be effectively enhanced.

  • A novel algorithm called Contextual Injection Attack (CIA) was proposed, which injects target token into both the visual and textual contexts by gradient-based perturbation to improve the success rate of cross-prompt transfer attacks.

  • Extensive experiments were conducted to verify the effectiveness of the proposed method. Comparative experiments on the BLIP2Li etal. (2023), instructBLIPDai etal. (2024), and LLaVALiu etal. (2023) models explored changes in attack success rate (ASR) under various experimental settings. Results demonstrate that CIA outperforms existing baseline methods in terms of cross-prompt transferability.

2 Related works

In this section, we review recent works on adversarial attacks, with a particular focus on adversarial transferability.

Adversarial AttackSzegedy etal. (2013); Madry etal. (2018); Zhang etal. (2022); Yuan etal. (2023) have gained significant attention due to their impact on the security and robustness of machine learning models. These attacks involve crafting inputs that deceive models into making incorrect predictions. In computer vision, slight pixel modifications can cause misclassificationMaliamanis (2020); Dong etal. (2020); Sen and Dasgupta (2023), while in NLP, small text changes can mislead language modelsEbrahimi etal. (2018); Wallace etal. (2019); Zhang etal. (2020); Formento etal. (2023); Zou etal. (2023). Recent research highlights the transferability of adversarial examples across different models and tasks, revealing common vulnerabilities. Efforts to counter these attacks include adversarial training and robust optimization, but evolving attack methods continue to challenge the development of effective defenses.

Cross-Task transferabilitySalzmann etal. (2021); Lu etal. (2020); Gu etal. (2023); Lv etal. (2023); Feng etal. (2024); Ma etal. (2023) examines adversarial examples crafted for one task, like image classification, deceiving models trained on other tasks, such as question answering and textual entailment, revealing weaknesses in shared representations in multi-task learning scenarios. In this paper, we focus on cross-prompt attacksLuo etal. (2024) (subclass of cross-task attack) on VLMs using adversarial images. Specifically, we investigate how a single adversarial image can deceive VLMs regardless of the input prompt.

3 Preliminary Analysis

In this section, we will provide a detailed analysis of the contextual injection behind this paper. Briefly, by introducing misleading information into parts of the visual or textual context, we can effectively disrupt the output of vision-language models, enabling transfer attacks across-prompt scenarios.

Inject {target} token into original images\longrightarrowEnhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (2)
gradient-base adversarial attacks\downarrow
targetEnhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (3)Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (4)Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (5)Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (6)Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (7)Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (8)Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (9)Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (10)Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (11)Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (12)Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (13)Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (14)
dog0.9231.00.9620.8331.00.9490.7690.9870.9360.9620.9490.718
fish1.00.9490.9871.01.01.001.01.00.9230.9360.756
bomb0.6280.9740.9741.00.8070.7690.7050.7561.00.9620.9360.885
poison0000.6030.16700.01300.256000
sure0.19200.7951.000.0770.01200.9480.62800
unknown0.026001.00.0131.00.0131.01.00.70500.397
original imageinput textoutput
Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (15)This image show {target} direct-sum\bigoplus task prompt. . .
CLSCAPVQAOverall
dog0.8590.7500.6220.744
fish0.4870.5260.3380.450
bomb0.4730.5530.3430.456
poison0.6410.6040.4310.559
sure0.2160.1320.0050.118
unknown0.2390.0470.0530.113

3.1 Injecting misleading target tokens into visual context

Injecting misleading targets into the visual context can enhance the probability distribution of target tokens within visual tokens of visual language model.This involves modifying the original image’s probability distribution by injecting target tokens. By injecting this information, the likelihood of the target task appearing in the top-k tokens increases significantly. This mechanism ensures that adversarial images more effectively guide the model toward generating specific, desired outputs. Table 1 presents the analysis experiment for injecting specific token into sample images (using the BLIP2Li etal. (2023) model with gradient-based perturbations over 1000100010001000 iterations). Our findings indicate that in image classification tasks(details for the dataset, please refer to 5.1), visual context attacks can successfully achieve cross-prompt attacks for certain keywords.

3.2 Injecting misleading target tokens into textual context

Injecting misleading target into the text context can effectively mislead the model’s output.For example, if an image of a cat is inaccurately described as "this image shows a dog," the textual context is manipulated to support this misleading description. This manipulation causes the model to generate outputs that align with the incorrect description. By using inject misleading target into textual context, we enhance the adversarial image to ensure that the textual context effectively guides the generation of misleading outputs. Table 2 shows that inserting misleading text prompts before different prompts can successfully mislead the BLIP2Li etal. (2023) model.

4 Methodology

This section details the proposed Contextual Injection Attack (CIA) for enhancing the transferability of adversarial images in Vision-Language Models (VLMs) across different prompts.

4.1 Overall Structure

Figure 2 illustrates the overall framework of the CIA method. By injecting the target token into both visual and text positions, the probability of generating the target token is increased, resulting in improved cross-prompt transferability. Specifically, in the example shown in the figure: for the visual position, each visual token is perturbed based on the gradient towards the target ("dog"); for the text position, misleading descriptive content ("this image shows a dog") is injected to deceive the model; and at the output position, the model is directed to maximize the output of the target ("dog"). By weighting the losses from these three positions and performing backward gradient computation, the original image is perturbed to enhance adversarial transferability effectiveness.

Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (16)

4.2 Problem definition

Assume we have a vision-language model denoted as MVL¯(I,T)subscript𝑀¯𝑉𝐿𝐼𝑇M_{\overline{VL}}(I,T)italic_M start_POSTSUBSCRIPT over¯ start_ARG italic_V italic_L end_ARG end_POSTSUBSCRIPT ( italic_I , italic_T ), which takes an image I𝐼Iitalic_I and text T𝑇Titalic_T as inputs. Given an original, clean image Iorisubscript𝐼𝑜𝑟𝑖I_{ori}italic_I start_POSTSUBSCRIPT italic_o italic_r italic_i end_POSTSUBSCRIPT and an arbitrary set of textual prompts A=α0,α1,,αi,,αn𝐴subscript𝛼0subscript𝛼1subscript𝛼𝑖subscript𝛼𝑛A={\alpha_{0},\alpha_{1},\ldots,\alpha_{i},\ldots,\alpha_{n}}italic_A = italic_α start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_α start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , … , italic_α start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT, our objective is to ensure that when the model MVL¯subscript𝑀¯𝑉𝐿M_{\overline{VL}}italic_M start_POSTSUBSCRIPT over¯ start_ARG italic_V italic_L end_ARG end_POSTSUBSCRIPT processes the perturbed image P(Iori)=Iori+δv𝑃subscript𝐼𝑜𝑟𝑖subscript𝐼𝑜𝑟𝑖subscript𝛿𝑣P(I_{ori})=I_{ori}+\delta_{v}italic_P ( italic_I start_POSTSUBSCRIPT italic_o italic_r italic_i end_POSTSUBSCRIPT ) = italic_I start_POSTSUBSCRIPT italic_o italic_r italic_i end_POSTSUBSCRIPT + italic_δ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT, it consistently outputs the target text Ttgtsubscript𝑇𝑡𝑔𝑡T_{tgt}italic_T start_POSTSUBSCRIPT italic_t italic_g italic_t end_POSTSUBSCRIPT for every prompt αisubscript𝛼𝑖\alpha_{i}italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT.

Here, δvsubscript𝛿𝑣\delta_{v}italic_δ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT signifies the visual perturbation added to the image Iorisubscript𝐼𝑜𝑟𝑖I_{ori}italic_I start_POSTSUBSCRIPT italic_o italic_r italic_i end_POSTSUBSCRIPT and is bound by the constraint δvpϵvnormsubscript𝛿𝑣𝑝subscriptitalic-ϵ𝑣||\delta_{v}||p\leq\epsilon_{v}| | italic_δ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT | | italic_p ≤ italic_ϵ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT, where ϵvsubscriptitalic-ϵ𝑣\epsilon_{v}italic_ϵ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT is the magnitude of the image perturbation.

Formally, this can be expressed as:

MVL¯(P(Iori),αi)Ttgt,αiAformulae-sequencesubscript𝑀¯𝑉𝐿𝑃subscript𝐼𝑜𝑟𝑖subscript𝛼𝑖subscript𝑇𝑡𝑔𝑡for-allsubscript𝛼𝑖𝐴\displaystyle M_{\overline{VL}}(P(I_{ori}),\alpha_{i})\equiv T_{tgt},\forall%\alpha_{i}\in Aitalic_M start_POSTSUBSCRIPT over¯ start_ARG italic_V italic_L end_ARG end_POSTSUBSCRIPT ( italic_P ( italic_I start_POSTSUBSCRIPT italic_o italic_r italic_i end_POSTSUBSCRIPT ) , italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) ≡ italic_T start_POSTSUBSCRIPT italic_t italic_g italic_t end_POSTSUBSCRIPT , ∀ italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ italic_A

In this context, Ttgtsubscript𝑇𝑡𝑔𝑡T_{tgt}italic_T start_POSTSUBSCRIPT italic_t italic_g italic_t end_POSTSUBSCRIPT is the target caption for the image (e.g., "this image shows a dog"). The function P𝑃Pitalic_P represents the perturbation applied to the original image Iorisubscript𝐼𝑜𝑟𝑖I_{ori}italic_I start_POSTSUBSCRIPT italic_o italic_r italic_i end_POSTSUBSCRIPT. Our goal is to ensure that for any given prompt αisubscript𝛼𝑖\alpha_{i}italic_α start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, the model’s output remains the same and matches the target text Ttgtsubscript𝑇𝑡𝑔𝑡T_{tgt}italic_T start_POSTSUBSCRIPT italic_t italic_g italic_t end_POSTSUBSCRIPT, regardless of the perturbations applied to the image.

4.3 Contextual Injection Attack (CIA)

To advance the cross-prompt transferability of adversarial images, this paper introduces a contextual-injection attack approach (CIA). Unlike the baseline method, which restricts the target task to the decoded representation of the output and expands the search scope using multiple distinct prompts or learnable cross-search methods without modifying the original knowledge representation of the image, CIA modifies the latent knowledge representation towards the target task through knowledge injection. By enhancing the context of both visual and textual inputs, the generated adversarial images can effectively handle variations in textual prompt inputs. Figure 2 illustrates the key steps of our method, where target is injected into the contextual positions of both visual and textual inputs within the model’s output decoding representation. This ensures the model’s output aligns more closely with text related to the target task (e.g., "dog").

To formalize the adversarial objective, we can express it as a formal loss function for the adversarial attack. We consider a vision-language model to be a mapping from a sequence of visual and textual tokens x1:n=[x1:endv,xendv+1:endt,xendt:n]subscript𝑥:1𝑛subscript𝑥:1𝑒𝑛subscript𝑑𝑣subscript𝑥:𝑒𝑛subscript𝑑𝑣1𝑒𝑛subscript𝑑𝑡subscript𝑥:𝑒𝑛subscript𝑑𝑡𝑛x_{1:n}=[x_{1:end_{v}},x_{end_{v}+1:end_{t}},x_{end_{t}:n}]italic_x start_POSTSUBSCRIPT 1 : italic_n end_POSTSUBSCRIPT = [ italic_x start_POSTSUBSCRIPT 1 : italic_e italic_n italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_e italic_n italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT + 1 : italic_e italic_n italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT italic_e italic_n italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT : italic_n end_POSTSUBSCRIPT ], where xi{1,,V}subscript𝑥𝑖1𝑉x_{i}\in\{1,...,V\}italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∈ { 1 , … , italic_V }. Here, V𝑉Vitalic_V denotes the vocabulary size, endv𝑒𝑛subscript𝑑𝑣end_{v}italic_e italic_n italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT and endt𝑒𝑛subscript𝑑𝑡end_{t}italic_e italic_n italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT indicate the end of the visual and text tokens, respectively. The visual tokens (x1:endvsubscript𝑥:1𝑒𝑛subscript𝑑𝑣x_{1:end_{v}}italic_x start_POSTSUBSCRIPT 1 : italic_e italic_n italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT), input text tokens (xendv+1:endtsubscript𝑥:𝑒𝑛subscript𝑑𝑣1𝑒𝑛subscript𝑑𝑡x_{end_{v}+1:end_{t}}italic_x start_POSTSUBSCRIPT italic_e italic_n italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT + 1 : italic_e italic_n italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT), and generated text tokens (xendt:nsubscript𝑥:𝑒𝑛subscript𝑑𝑡𝑛x_{end_{t}:n}italic_x start_POSTSUBSCRIPT italic_e italic_n italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT : italic_n end_POSTSUBSCRIPT) together constitute the complete token representation, which is mapped to a distribution over the next token.

We calculate the probability distribution over the next token given the sequence x1:isubscript𝑥:1𝑖x_{1:i}italic_x start_POSTSUBSCRIPT 1 : italic_i end_POSTSUBSCRIPT as p(xi:i+H|x1:i)𝑝conditionalsubscript𝑥:𝑖𝑖𝐻subscript𝑥:1𝑖p(x_{i:i+H}|x_{1:i})italic_p ( italic_x start_POSTSUBSCRIPT italic_i : italic_i + italic_H end_POSTSUBSCRIPT | italic_x start_POSTSUBSCRIPT 1 : italic_i end_POSTSUBSCRIPT ). For any sequence p(xi:i+H|x1:i)𝑝conditionalsubscript𝑥:𝑖𝑖𝐻subscript𝑥:1𝑖p(x_{i:i+H}|x_{1:i})italic_p ( italic_x start_POSTSUBSCRIPT italic_i : italic_i + italic_H end_POSTSUBSCRIPT | italic_x start_POSTSUBSCRIPT 1 : italic_i end_POSTSUBSCRIPT ), where H𝐻Hitalic_H is the length of the sequence we aim to obtain, the joint probability is

p(𝐱i+1:i+H𝐱1:i)=j=1Hp(xi+j𝐱1:i+j1)𝑝conditionalsubscript𝐱:𝑖1𝑖𝐻subscript𝐱:1𝑖superscriptsubscriptproduct𝑗1𝐻𝑝conditionalsubscript𝑥𝑖𝑗subscript𝐱:1𝑖𝑗1p(\mathbf{x}_{i+1:i+H}\mid\mathbf{x}_{1:i})=\prod_{j=1}^{H}p(x_{i+j}\mid%\mathbf{x}_{1:i+j-1})italic_p ( bold_x start_POSTSUBSCRIPT italic_i + 1 : italic_i + italic_H end_POSTSUBSCRIPT ∣ bold_x start_POSTSUBSCRIPT 1 : italic_i end_POSTSUBSCRIPT ) = ∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_H end_POSTSUPERSCRIPT italic_p ( italic_x start_POSTSUBSCRIPT italic_i + italic_j end_POSTSUBSCRIPT ∣ bold_x start_POSTSUBSCRIPT 1 : italic_i + italic_j - 1 end_POSTSUBSCRIPT )

To address the issue with the visual input not having previous tokens, we redefine the probability for the visual tokens to start from the given initial state without conditioning on previous tokens. The cross-entropy losses for each part are then computed as follows.

Lv=logp(x1:endv)subscript𝐿v𝑝superscriptsubscript𝑥:1𝑒𝑛subscript𝑑𝑣L_{\text{v}}=-\log p(x_{1:end_{v}}^{*})italic_L start_POSTSUBSCRIPT v end_POSTSUBSCRIPT = - roman_log italic_p ( italic_x start_POSTSUBSCRIPT 1 : italic_e italic_n italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT )

Here, x1:endvsuperscriptsubscript𝑥:1𝑒𝑛subscript𝑑𝑣x_{1:end_{v}}^{*}italic_x start_POSTSUBSCRIPT 1 : italic_e italic_n italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT denotes the target injected into the image, such as "dog", to maximize the probability distribution of each token position "dog".

Lt=logp(xendv+1:endtx1:endv)subscript𝐿t𝑝conditionalsuperscriptsubscript𝑥:𝑒𝑛subscript𝑑𝑣1𝑒𝑛subscript𝑑𝑡subscript𝑥:1𝑒𝑛subscript𝑑𝑣L_{\text{t}}=-\log p(x_{end_{v}+1:end_{t}}^{*}\mid x_{1:end_{v}})italic_L start_POSTSUBSCRIPT t end_POSTSUBSCRIPT = - roman_log italic_p ( italic_x start_POSTSUBSCRIPT italic_e italic_n italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT + 1 : italic_e italic_n italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∣ italic_x start_POSTSUBSCRIPT 1 : italic_e italic_n italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT )

Here, xendv+1:endtsuperscriptsubscript𝑥:𝑒𝑛subscript𝑑𝑣1𝑒𝑛subscript𝑑𝑡x_{end_{v}+1:end_{t}}^{*}italic_x start_POSTSUBSCRIPT italic_e italic_n italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT + 1 : italic_e italic_n italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT denotes the textual description of the image, for example, "This image shows a dog," when the original image depicts a cat.

Lo=logp(xendt+1:nx1:n)subscript𝐿o𝑝conditionalsuperscriptsubscript𝑥:𝑒𝑛subscript𝑑𝑡1𝑛subscript𝑥:1𝑛L_{\text{o}}=-\log p(x_{end_{t}+1:n}^{*}\mid x_{1:n})italic_L start_POSTSUBSCRIPT o end_POSTSUBSCRIPT = - roman_log italic_p ( italic_x start_POSTSUBSCRIPT italic_e italic_n italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + 1 : italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∣ italic_x start_POSTSUBSCRIPT 1 : italic_n end_POSTSUBSCRIPT )

Here xendt+1:nsuperscriptsubscript𝑥:𝑒𝑛subscript𝑑𝑡1𝑛x_{end_{t}+1:n}^{*}italic_x start_POSTSUBSCRIPT italic_e italic_n italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + 1 : italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT refers to the generated text tokens conditioned on the entire sequence of visual and textual tokens. For instance, "This image shows a dog, it sits on the table."

The overall adversarial loss is a weighted sum of these individual losses:

Ltotal=α(βLv+(1β)Lt)+(1α)Losubscript𝐿total𝛼𝛽subscript𝐿v1𝛽subscript𝐿t1𝛼subscript𝐿oL_{\text{total}}=\alpha\cdot(\beta\cdot L_{\text{v}}+(1-\beta)\cdot L_{\text{t%}})+(1-\alpha)\cdot L_{\text{o}}italic_L start_POSTSUBSCRIPT total end_POSTSUBSCRIPT = italic_α ⋅ ( italic_β ⋅ italic_L start_POSTSUBSCRIPT v end_POSTSUBSCRIPT + ( 1 - italic_β ) ⋅ italic_L start_POSTSUBSCRIPT t end_POSTSUBSCRIPT ) + ( 1 - italic_α ) ⋅ italic_L start_POSTSUBSCRIPT o end_POSTSUBSCRIPT

where α𝛼\alphaitalic_α and β𝛽\betaitalic_β are the weights for the respective losses. By introducing two parameters, α𝛼\alphaitalic_α and β𝛽\betaitalic_β, the method allows for finer control over the influence of each loss component. Specifically, α𝛼\alphaitalic_α controls the overall balance between the combined visual and textual losses versus the generated text loss. Meanwhile, β𝛽\betaitalic_β adjusts the emphasis between the visual and textual input losses within their combined term.

The task of optimizing the adversarial perturbation δvsubscript𝛿𝑣\delta_{v}italic_δ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT can then be written as the optimization problem:

minδvLtotalsubject toδvpϵvsubscriptsubscript𝛿𝑣subscript𝐿totalsubject tosubscriptnormsubscript𝛿𝑣𝑝subscriptitalic-ϵ𝑣\min_{\delta_{v}}L_{\text{total}}\quad\text{subject to}\quad\|\delta_{v}\|_{p}%\leq\epsilon_{v}roman_min start_POSTSUBSCRIPT italic_δ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_L start_POSTSUBSCRIPT total end_POSTSUBSCRIPT subject to ∥ italic_δ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ≤ italic_ϵ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT

To implement our context-enhanced adversarial attack on vision-language models, we follow the outlined pseudocode Algorithm 1. The algorithm starts by initializing the perturbation δvsubscript𝛿𝑣\delta_{v}italic_δ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT to zero and defining the weights α𝛼\alphaitalic_α and β𝛽\betaitalic_β for the respective losses. In each iteration, we compute the perturbed image P(Iori)𝑃subscript𝐼oriP(I_{\text{ori}})italic_P ( italic_I start_POSTSUBSCRIPT ori end_POSTSUBSCRIPT ) by adding the current perturbation δvsubscript𝛿𝑣\delta_{v}italic_δ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT to the original image Iorisubscript𝐼oriI_{\text{ori}}italic_I start_POSTSUBSCRIPT ori end_POSTSUBSCRIPT. We then calculate the cross-entropy losses for the visual tokens Lvisualsubscript𝐿visualL_{\text{visual}}italic_L start_POSTSUBSCRIPT visual end_POSTSUBSCRIPT, the textual input tokens Ltextsubscript𝐿textL_{\text{text}}italic_L start_POSTSUBSCRIPT text end_POSTSUBSCRIPT, and the generated text tokens Lgeneratedsubscript𝐿generatedL_{\text{generated}}italic_L start_POSTSUBSCRIPT generated end_POSTSUBSCRIPT. The total loss Ltotalsubscript𝐿totalL_{\text{total}}italic_L start_POSTSUBSCRIPT total end_POSTSUBSCRIPT is obtained as a weighted sum of these individual losses.

1:Original image Iorisubscript𝐼oriI_{\text{ori}}italic_I start_POSTSUBSCRIPT ori end_POSTSUBSCRIPT, Target text Ttgtsubscript𝑇tgtT_{\text{tgt}}italic_T start_POSTSUBSCRIPT tgt end_POSTSUBSCRIPT, Model MVL¯subscript𝑀¯𝑉𝐿M_{\overline{VL}}italic_M start_POSTSUBSCRIPT over¯ start_ARG italic_V italic_L end_ARG end_POSTSUBSCRIPT, Perturbation bound ϵvsubscriptitalic-ϵ𝑣\epsilon_{v}italic_ϵ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT, Learning rate η𝜂\etaitalic_η, Weights α𝛼\alphaitalic_α and β𝛽\betaitalic_β.

2:Adversarial image P(Iori)𝑃subscript𝐼oriP(I_{\text{ori}})italic_P ( italic_I start_POSTSUBSCRIPT ori end_POSTSUBSCRIPT )

3:Initialize perturbation δv0subscript𝛿𝑣0\delta_{v}\leftarrow 0italic_δ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT ← 0

4:whilenot convergeddo

5:P(Iori)Iori+δv𝑃subscript𝐼orisubscript𝐼orisubscript𝛿𝑣P(I_{\text{ori}})\leftarrow I_{\text{ori}}+\delta_{v}italic_P ( italic_I start_POSTSUBSCRIPT ori end_POSTSUBSCRIPT ) ← italic_I start_POSTSUBSCRIPT ori end_POSTSUBSCRIPT + italic_δ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT

6:Lv=logp(x1:endv)subscript𝐿v𝑝superscriptsubscript𝑥:1𝑒𝑛subscript𝑑𝑣L_{\text{v}}=-\log p(x_{1:end_{v}}^{*})italic_L start_POSTSUBSCRIPT v end_POSTSUBSCRIPT = - roman_log italic_p ( italic_x start_POSTSUBSCRIPT 1 : italic_e italic_n italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT )

7:Lt=logp(xendv+1:endtx1:endv)subscript𝐿t𝑝conditionalsuperscriptsubscript𝑥:𝑒𝑛subscript𝑑𝑣1𝑒𝑛subscript𝑑𝑡subscript𝑥:1𝑒𝑛subscript𝑑𝑣L_{\text{t}}=-\log p(x_{end_{v}+1:end_{t}}^{*}\mid x_{1:end_{v}})italic_L start_POSTSUBSCRIPT t end_POSTSUBSCRIPT = - roman_log italic_p ( italic_x start_POSTSUBSCRIPT italic_e italic_n italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT + 1 : italic_e italic_n italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∣ italic_x start_POSTSUBSCRIPT 1 : italic_e italic_n italic_d start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT )

8:Lo=logp(xendt+1:nx1:endt)subscript𝐿o𝑝conditionalsuperscriptsubscript𝑥:𝑒𝑛subscript𝑑𝑡1𝑛subscript𝑥:1𝑒𝑛subscript𝑑𝑡L_{\text{o}}=-\log p(x_{end_{t}+1:n}^{*}\mid x_{1:end_{t}})italic_L start_POSTSUBSCRIPT o end_POSTSUBSCRIPT = - roman_log italic_p ( italic_x start_POSTSUBSCRIPT italic_e italic_n italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT + 1 : italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ∣ italic_x start_POSTSUBSCRIPT 1 : italic_e italic_n italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_POSTSUBSCRIPT )

9:Ltotal=α(βLv+(1β)Lt)+(1α)Losubscript𝐿total𝛼𝛽subscript𝐿v1𝛽subscript𝐿t1𝛼subscript𝐿oL_{\text{total}}=\alpha\cdot(\beta\cdot L_{\text{v}}+(1-\beta)\cdot L_{\text{t%}})+(1-\alpha)\cdot L_{\text{o}}italic_L start_POSTSUBSCRIPT total end_POSTSUBSCRIPT = italic_α ⋅ ( italic_β ⋅ italic_L start_POSTSUBSCRIPT v end_POSTSUBSCRIPT + ( 1 - italic_β ) ⋅ italic_L start_POSTSUBSCRIPT t end_POSTSUBSCRIPT ) + ( 1 - italic_α ) ⋅ italic_L start_POSTSUBSCRIPT o end_POSTSUBSCRIPT

10:Compute gradients g=δvLtotal𝑔subscriptsubscript𝛿𝑣subscript𝐿totalg=\nabla_{\delta_{v}}L_{\text{total}}italic_g = ∇ start_POSTSUBSCRIPT italic_δ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT end_POSTSUBSCRIPT italic_L start_POSTSUBSCRIPT total end_POSTSUBSCRIPT

11:Update perturbation δvδvηsign(g)subscript𝛿𝑣subscript𝛿𝑣𝜂𝑠𝑖𝑔𝑛𝑔\delta_{v}\leftarrow\delta_{v}-\eta\cdot sign(g)italic_δ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT ← italic_δ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT - italic_η ⋅ italic_s italic_i italic_g italic_n ( italic_g )

12:Project δvsubscript𝛿𝑣\delta_{v}italic_δ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT onto the ϵvsubscriptitalic-ϵ𝑣\epsilon_{v}italic_ϵ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT-ball: δvclamp(δv,ϵv,ϵv)subscript𝛿𝑣clampsubscript𝛿𝑣subscriptitalic-ϵ𝑣subscriptitalic-ϵ𝑣\delta_{v}\leftarrow\text{clamp}(\delta_{v},-\epsilon_{v},\epsilon_{v})italic_δ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT ← clamp ( italic_δ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT , - italic_ϵ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT , italic_ϵ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT )

13:endwhile

14:return P(Iori)𝑃subscript𝐼oriP(I_{\text{ori}})italic_P ( italic_I start_POSTSUBSCRIPT ori end_POSTSUBSCRIPT )

The gradient of the total loss with respect to the perturbation δvsubscript𝛿𝑣\delta_{v}italic_δ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT is computed, and the perturbation is updated using gradient descent(The optimisation algorithm is PGDMadry etal. (2017)). To ensure the perturbation remains within the allowed bound, it is projected onto the ϵvsubscriptitalic-ϵ𝑣\epsilon_{v}italic_ϵ start_POSTSUBSCRIPT italic_v end_POSTSUBSCRIPT-ball. The process repeats until convergence, ultimately yielding the adversarial image P(Iori)𝑃subscript𝐼oriP(I_{\text{ori}})italic_P ( italic_I start_POSTSUBSCRIPT ori end_POSTSUBSCRIPT ) that steers the model’s output towards the target text Ttgtsubscript𝑇tgtT_{\text{tgt}}italic_T start_POSTSUBSCRIPT tgt end_POSTSUBSCRIPT.

5 Experiments

MethodCLSCAPVQAOVERALL
TargetSPMPCPOursSPMPCPOursSPMPCPOursSingleMPCPOurs
green0.5830.8320.9620.9670.4190.8210.8240.8690.1560.3730.5050.6950.3860.6750.7630.843
human0.5780.7000.8680.9900.3700.5340.7180.8840.2220.3860.6480.7780.3900.5400.7450.884
fish0.8390.8890.9990.9990.7710.8540.9460.9990.4440.4900.8070.9260.6850.7450.9170.975
dog0.8710.9460.9170.9950.8640.9460.8940.9910.4300.5670.6190.7820.7220.8190.8100.923
flower0.7310.8460.9760.9980.6480.7630.8450.9860.3780.3740.5930.8230.5860.6610.8040.936
bird0.8120.9580.9080.9950.8340.9320.8510.9970.5020.7090.6440.9210.7160.8670.8010.971
cat0.8840.9980.9951.0000.8070.9870.9880.9990.4190.5930.7660.8380.7030.8590.9160.945
blood0.6410.6990.8830.9860.4650.5870.8400.9760.1490.2390.4490.7840.4180.5080.7240.916
bomb0.4860.6880.8350.9900.3530.6030.8490.9880.1310.3370.4370.8290.3230.5430.7070.936
p*rn0.5520.8260.8730.8860.1740.4500.5730.7200.0410.1280.2900.6360.2550.4680.5790.747
virus0.6060.5240.8460.9780.4030.4640.7240.8800.1310.1380.4120.7200.3800.3750.6600.859
drug0.4490.6200.7870.9620.2430.5140.6810.8820.0560.0960.2470.6830.2490.4100.5720.842
poison0.5210.4020.8310.8670.3040.2780.7050.7350.0760.0890.4310.5650.3000.2560.6550.722
gun0.5790.6990.9770.9550.6150.6250.9660.9740.2380.2720.5650.7680.4770.5320.8360.899
sure0.1870.1940.7040.8370.0930.1030.5540.5740.0100.0260.2530.3140.0970.1080.5030.575
unknown0.2470.5510.8050.9170.0840.2220.4350.7690.0660.2050.4240.7610.1330.3260.5550.816
yes0.0860.3190.4790.9170.0360.2010.3940.8860.3900.4340.5360.8700.1710.3180.4690.891
no0.1310.2780.6210.9760.0710.3060.4420.8850.3220.3590.5740.9440.1750.3140.5460.935
bad0.2830.4160.8170.5260.1860.3200.7600.4220.0340.0720.2970.1640.1680.2690.6250.370
good0.5240.2390.8130.9660.2590.2220.6650.8630.0820.0840.3490.7730.2880.1820.6090.867
sorry0.2620.1880.5350.8250.1630.1530.4120.6960.0320.0220.1920.5310.1520.1210.3800.684
OVERALL0.5170.6100.8300.9300.3890.5180.7170.8560.2050.2850.4780.7190.3700.4710.6750.835

5.1 Datasets & Experimental settings

The dataset consists of two parts: images and text. The image dataset is sourced from visualQAAntol etal. (2015b), and the text prompt dataset for transferability comes from CroPALuo etal. (2024). This text dataset is divided into three categories: image classification (CLS), image captions (CAP), and visual question answering (VQA). We will design attack tasks across four different dimensions: generating target tasks involving ordinary objects, harmful objects, tone expressions, and racial discrimination.

The experimental setup for this study involves using three open-source models: BLIP2(blip2-opt-2.7b), instructBLIP(instructblip-vicuna-7b), and LLaVA(LLaVA-v1.5-7b). The maximum number of iterations is set to 2000, and the hyperparameters α𝛼\alphaitalic_α and β𝛽\betaitalic_β are both set to 0.60.60.60.6, based on the conclusions drawn in Figure 4. The learning rate is set to 0.050.050.050.05, and the image perturbation range is set to 16/2551625516/25516 / 255

5.2 Evaluation metrics

To evaluate the effectiveness of our method, we used the following metrics:

  • Attack Success Rate (ASR): The percentage of prompts for which the adversarial image successfully misleads the model. ASR is a widely recognized metric Lv etal. (2023); Zhao etal. (2023); Liu etal. (2022); Chen etal. (2022); Luo etal. (2024) for measuring the success of adversarial attacks.

  • Perturbation Size: The magnitude of the adversarial perturbation, we use the ‘clamp‘ function to control the size of the disturbance. Specifically, the ‘clamp‘ function restricts each perturbation value δ𝛿\deltaitalic_δ to be within the minimum value of δϵ𝛿italic-ϵ\delta-\epsilonitalic_δ - italic_ϵ and the maximum value of δ+ϵ𝛿italic-ϵ\delta+\epsilonitalic_δ + italic_ϵ: δ=clamp(δ,ϵ,ϵ)𝛿clamp𝛿italic-ϵitalic-ϵ\delta=\text{clamp}(\delta,-\epsilon,\epsilon)italic_δ = clamp ( italic_δ , - italic_ϵ , italic_ϵ ). The default ϵitalic-ϵ\epsilonitalic_ϵ used in this paper is 16/2551625516/25516 / 255.

  • Transferability: The ability of the adversarial image to mislead different VLMs across various tasks, such as image classification(CLS), image captioning(CAP), and visual question answering(VQA).

5.3 Transferability comparison

The results of our experiments, which evaluate targeted Attack Success Rate (ASR) on the visual-language model across various tasks (CLS, CAP, VQA) and target texts, are detailed in Table 3(experiments on other models can be found in the appendix A.1.1). The performance of the CIA method was compared against three baseline methods: Single-P (SP), Multi-P (MP), and CroPA (CP). To generate adversarial examples for VLMs, Single-P optimizes an image perturbation based on a single prompt. In contrast, Multi-P enhances the cross-prompt transferability of the perturbations by utilizing multiple prompts during the image perturbation update process. CroPA Luo etal. (2024) achieves broader prompt coverage by using a learnable prompt to expand around a given prompt, thereby improving transferability. CIA achieves the highest transfer attack success rate for the majority of targets.

TargetSingleMultiCroPAOurs
emotional words0.1690.2340.5270.734
harmful objects0.3430.4420.6760.846
common objects0.5980.7380.8220.925
Overall0.3700.4710.6750.835

Our findings suggest that common words yield the highest performance because they appear most frequently in the model’s training samples, resulting in the lowest perplexity. Harmful words may be blocked by the model’s safety alignment strategies. Affective words achieve the lowest scores because our method relies on injecting textual instruction into the visual context. However, affective words have a semantic disconnect with the visual representation, making it difficult to represent them accurately. Conversely, images with tangible entities are more likely to converge and produce effective adversarial images. The results in Table 4 support our conclusion.

Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (17)

To determine the most effective approach among visual context enhancement, textual context enhancement, and a combined visual-text context enhancement, we conducted comprehensive experiments. As shown in Table 5, CIA-image represents the transfer attack effectiveness using only visual context enhancement, CIA-text represents the transfer attack effectiveness using only textual context enhancement, and CIA represents the combined approach using both visual and textual context enhancements. Our findings indicate that the combined visual-text context attack is the most effective, suggesting that multimodal joint attacks are more successful in deceiving the model and thereby increasing the attack success rate.

Figure 3 shows the cross-entropy values of logits related to the target task at different positions. The baseline method made only minor adjustments to the probabilities of the target task token in both visual and text contexts, resulting in suboptimal performance in cross-prompt tasks. We compared three approaches: using visual context alone, using text context alone, and using a combined context. Independently, the visual and text contexts achieved optimal cross-entropy loss at their respective token positions because they were computed separately, allowing for easier convergence to the minimum value. By combining both contexts, our method achieved the optimal cross-entropy loss at the final generated token position, indicating that it effectively skewed the probability distribution towards the target task token.

MethodCLSCAPVQAOverall
SP0.5170.3890.2050.370
MP0.6100.5180.2850.471
CP0.8300.7170.4780.675
CIA-image0.6100.5370.3140.487
CIA-text0.5420.4890.3080.447
CIA0.9300.8560.7190.835
targetimagepromptCroPA outputCIA output
dogEnhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (18)Imagine a child is trying to identify the image. What might they excitedly point to and name?a plane flying over a cathedraldog
dogEnhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (19)Any bicycles visible in this image? If not, please tell what is in the image?no bicycles visible in this imageA dog in this image
poisonEnhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (20)Describe the central subject of this image in a single word.horsepoison
poisonEnhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (21)Categorize the content of the image.the horse is eating the poisonpoison
sureEnhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (22)Provide a summary of the primary focus of this image.a woman and a man are sitting on a sidewalksure
sureEnhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (23)Where do you see harmony or unity among the elements?Harmony and unitysure

5.4 Case study

The case study presented in Table 5 demonstrates the effectiveness of the CIA method compared to CroPA in generating adversarial examples that successfully deceive visual-language models (VLMs). We evaluated various target texts using different prompts to test robustness.

Adversarial images generated using the state-of-the-art CroPA method still retain the semantics of the original image. Specifically, in the fourth example provided in Table 6, ("the horse is eating the poison"), although the model responded with content related to the target ("poison"), it failed to completely remove the original image’s semantics (i.e., "horse"). This incomplete removal of original semantics leads to weaker transferability in cross-prompt attacks, as the model continues to recognize elements of the original image, thus diminishing the effectiveness of the adversarial example across different prompts.

5.5 CIA with different perturbation size

This section delves into the impact of different perturbation sizes (8/25582558/2558 / 255, 16/2551625516/25516 / 255, 32/2553225532/25532 / 255) on the efficacy of adversarial attacks against the visual-language model. The table provided below showcases the overall Attack Success Rate (ASR) across various tasks, accentuating the perturbation size that demonstrates the highest performance for each task.

While larger perturbation sizes result in stronger attacks, it’s essential to consider the trade-off with concealment. Larger perturbations may be more easily detected by models or users, reducing the attack’s stealthiness. Therefore, a balance must be struck between perturbation size and concealment to maximize attack effectiveness while minimizing the risk of detection.

5.6 CIA with different prompt embedding setting

Perturbation sizeCLSCAPVQAOverall
8/25582558/2558 / 2550.8150.7970.6230.745
16/2551625516/25516 / 2550.9300.8560.7190.835
32/2553225532/25532 / 2550.9740.9720.8920.946

This section explores the impact of different embedding settings on the Attack Success Rate (ASR) through two types of experiments. For the details, please refer to the Appendix A.1.3

1. Impact of Padding Tokens on ASR: We evaluated the effect of various padding tokens (e.g., ’!’, ’@’, ’+’) on ASR within the text context. (as show in the Figure 4)

2. Effect of Embedding Strategies for ’@’: We assessed four embedding strategies for the special character ’@’: no embedding, prefix embedding, suffix embedding, and mixed embedding. The experiments covered tasks such as classification, captioning, and visual question answering. (as show in the Table 11)

6 Conclusion

In this study, we proposed the Contextual-Injection Attack (CIA), a novel method to improve the transferability on vision-language models. By injecting target tokens into both the visual and textual contexts, CIA effectively manipulates the probability distribution of contextual tokens, ensuring higher adaptability across various prompts. Our experiments on the BLIP2, InstructBLIP, and LLaVA models validated the efficacy of CIA, demonstrating superior performance compared to baseline methods. The results indicate that enhancing both visual and textual contexts in adversarial images is a promising approach to overcoming the limitations of current adversarial attack methods.

Future work will further investigate the application of our approach to other types of multimodal models. We also aim to expand our evaluation to include a wider range of datasets and more diverse scenarios, such as jailbreaking, to further validate the robustness and generalizability of our method. Additionally, we will focus on developing and evaluating potential defense strategies to counteract the adversarial attacks introduced by CIA. Understanding and implementing effective defenses is crucial to enhancing the security and reliability of vision-language models. This comprehensive approach will help ensure that our research contributes positively to the development of more robust and secure multimodal AI systems.

References

  • Achiam etal. (2023)Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, FlorenciaLeoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, etal. 2023.Gpt-4 technical report.arXiv preprint arXiv:2303.08774.
  • Alayrac etal. (2022)Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, etal. 2022.Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736.
  • Antol etal. (2015a)Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, CLawrence Zitnick, and Devi Parikh. 2015a.Vqa: Visual question answering.In Proceedings of the IEEE international conference on computer vision, pages 2425–2433.
  • Antol etal. (2015b)Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C.Lawrence Zitnick, and Devi Parikh. 2015b.VQA: Visual Question Answering.In International Conference on Computer Vision (ICCV).
  • Chen etal. (2022)Yangyi Chen, Fanchao Qi, Hongcheng Gao, Zhiyuan Liu, and Maosong Sun. 2022.Textual backdoor attacks can be more harmful via two simple tricks.In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11215–11221, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Dai etal. (2024)Wenliang Dai, Junnan Li, Dongxu Li, Anthony MengHuat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, PascaleN Fung, and Steven Hoi. 2024.Instructblip: Towards general-purpose vision-language models with instruction tuning.Advances in Neural Information Processing Systems, 36.
  • Dong etal. (2020)Yinpeng Dong, Qi-An Fu, Xiao Yang, Tianyu Pang, Hang Su, Zihao Xiao, and Jun Zhu. 2020.Benchmarking adversarial robustness on image classification.In proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 321–331.
  • Ebrahimi etal. (2018)Javid Ebrahimi, Anyi Rao, Daniel Lowd, and Dejing Dou. 2018.Hotflip: White-box adversarial examples for text classification.In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 31–36.
  • Feng etal. (2024)Weiwei Feng, Nanqing Xu, Tianzhu Zhang, Yongdong Zhang, and Feng Wu. 2024.Enhancing cross-task transferability of adversarial examples via spatial and channel attention.IEEE Transactions on Multimedia.
  • Formento etal. (2023)Brian Formento, ChuanSheng Foo, LuuAnh Tuan, and SeeKiong Ng. 2023.Using punctuation as an adversarial attack on deep learning-based nlp systems: An empirical study.In Findings of the Association for Computational Linguistics: EACL 2023, pages 1–34.
  • Goodfellow etal. (2014)IanJ Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014.Explaining and harnessing adversarial examples.arXiv preprint arXiv:1412.6572.
  • Gu etal. (2023)Jindong Gu, Xiaojun Jia, Pau deJorge, Wenqain Yu, Xinwei Liu, Avery Ma, Yuan Xun, Anjun Hu, Ashkan Khakzar, Zhijiang Li, etal. 2023.A survey on transferability of adversarial examples across deep neural networks.arXiv preprint arXiv:2310.17626.
  • Gu etal. (2022)Jindong Gu, Hengshuang Zhao, Volker Tresp, and PhilipHS Torr. 2022.Segpgd: An effective and efficient adversarial attack for evaluating and boosting segmentation robustness.In European Conference on Computer Vision, pages 308–325. Springer.
  • He etal. (2016)Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016.Deep residual learning for image recognition.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
  • Li etal. (2023)Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. 2023.Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models.In International conference on machine learning, pages 19730–19742. PMLR.
  • Li etal. (2022)Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022.Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation.In International conference on machine learning, pages 12888–12900. PMLR.
  • Li etal. (2018)Yikang Li, Nan Duan, Bolei Zhou, Xiao Chu, Wanli Ouyang, Xiaogang Wang, and Ming Zhou. 2018.Visual question generation as dual task of visual question answering.In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6116–6124.
  • Liu etal. (2022)Aiwei Liu, Honghai Yu, Xuming Hu, Shu’ang Li, LiLin, f*ckun Ma, Yawen Yang, and Lijie Wen. 2022.Character-level white-box adversarial attacks against transformers via attachable subwords substitution.In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 7664–7676, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Liu etal. (2023)Haotian Liu, Chunyuan Li, Yuheng Li, and YongJae Lee. 2023.Improved baselines with visual instruction tuning.
  • Lu etal. (2020)Yantao Lu, Yunhan Jia, Jianyu Wang, Bai Li, Weiheng Chai, Lawrence Carin, and Senem Velipasalar. 2020.Enhancing cross-task black-box transferability of adversarial examples with dispersion reduction.In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pages 940–949.
  • Luo etal. (2024)Haochen Luo, Jindong Gu, Fengyuan Liu, and Philip Torr. 2024.An image is worth 1000 lies: Adversarial transferability across prompts on vision-language models.arXiv preprint arXiv:2403.09766.
  • Lv etal. (2023)Minxuan Lv, Chengwei Dai, Kun Li, Wei Zhou, and Songlin Hu. 2023.Ct-gat: Cross-task generative adversarial attack based on transferability.In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5581–5591.
  • Ma etal. (2023)Tony Ma, Songze Li, Yisong Xiao, and Shunchang Liu. 2023.Boosting cross-task transferability of adversarial patches with visual relations.arXiv preprint arXiv:2304.05402.
  • Madry etal. (2017)Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2017.Towards deep learning models resistant to adversarial attacks.arXiv preprint arXiv:1706.06083.
  • Madry etal. (2018)Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018.Towards deep learning models resistant to adversarial attacks.In International Conference on Learning Representations.
  • Maliamanis (2020)TMaliamanis. 2020.Adversarial computer vision: a current snapshot.In Twelfth International Conference on Machine Vision (ICMV 2019), volume 11433, pages 605–612. SPIE.
  • Salzmann etal. (2021)Mathieu Salzmann etal. 2021.Learning transferable adversarial perturbations.Advances in Neural Information Processing Systems, 34:13950–13962.
  • Sen and Dasgupta (2023)Jaydip Sen and Subhasis Dasgupta. 2023.Adversarial attacks on image classification models: Fgsm and patch attacks and their impact.In Information Security and Privacy in the Digital World-Some Selected Topics. IntechOpen.
  • Shafiq and Gu (2022)Muhammad Shafiq and Zhaoquan Gu. 2022.Deep residual learning for image recognition: A survey.Applied Sciences, 12(18):8972.
  • Szegedy etal. (2013)Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013.Intriguing properties of neural networks.arXiv preprint arXiv:1312.6199.
  • Wallace etal. (2019)Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, and Sameer Singh. 2019.Universal adversarial triggers for attacking and analyzing nlp.In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP).
  • Wu etal. (2022)Boxi Wu, Jindong Gu, Zhifeng Li, Deng Cai, Xiaofei He, and Wei Liu. 2022.Towards efficient adversarial training on vision transformers.In European Conference on Computer Vision, pages 307–325. Springer.
  • Yao etal. (2018)Ting Yao, Yingwei Pan, Yehao Li, and Tao Mei. 2018.Exploring visual relationship for image captioning.In Proceedings of the European conference on computer vision (ECCV), pages 684–699.
  • Yuan etal. (2023)Lifan Yuan, Yichi Zhang, Yangyi Chen, and Wei Wei. 2023.Bridge the gap between cv and nlp! a gradient-based textual adversarial attack framework.In Findings of the Association for Computational Linguistics: ACL 2023, pages 7132–7146.
  • Zhang etal. (2022)Jiaming Zhang, QiYi, and Jitao Sang. 2022.Towards adversarial attack on vision-language pre-training models.In Proceedings of the 30th ACM International Conference on Multimedia, pages 5005–5013.
  • Zhang etal. (2024)Jingyi Zhang, Jiaxing Huang, Sheng Jin, and Shijian Lu. 2024.Vision-language models for vision tasks: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Zhang etal. (2020)WeiEmma Zhang, QuanZ Sheng, Ahoud Alhazmi, and Chenliang Li. 2020.Adversarial attacks on deep-learning models in natural language processing: A survey.ACM Transactions on Intelligent Systems and Technology (TIST), 11(3):1–41.
  • Zhao etal. (2023)Shuai Zhao, Jinming Wen, Anh Luu, Junbo Zhao, and Jie Fu. 2023.Prompt as triggers for backdoor attack: Examining the vulnerability in language models.In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12303–12317.
  • Zou etal. (2023)Andy Zou, Zifan Wang, JZico Kolter, and Matt Fredrikson. 2023.Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043.

Appendix A Appendix

A.1 Detailed data

A.1.1 Comparison on the LLaVA and instructBLIP model

To validate the effectiveness of our method across different models, we also conducted comparative experiments on the LLaVA (as show in the Table 8) and instructBLIP (as show in the Table 9) model.

TargetSPMPCPOurs
emotional words0.0300.2110.2690.426
harmful objects0.0570.0780.2200.559
common objects0.0610.6770.5290.786
Overall0.0490.2630.3390.591
TargetSPMPCPOurs
emotional words0.1920.1130.2500.563
harmful objects0.2490.4060.4260.622
common objects0.4030.4880.5400.688
Overall0.2830.3860.4050.624

A.1.2 Effects of parameters of the weighted sum of losses

We will examine how different weightings and parameters affect the results when calculating the loss. Specifically, we will focus on two hyperparameters, α𝛼\alphaitalic_α and β𝛽\betaitalic_β, which control the weighting of the loss components.

The Figure 4 show the effects of parameter of the weighted sum of losses (α𝛼\alphaitalic_α and β𝛽\betaitalic_β). We standardize the maximum number of iterations to 600600600600. Using the keyword ’dog’ as the target, we set the learning rate for gradient-based updates of image pixels to 0.050.050.050.05, with the maximum perturbation range set to 16/2551625516/25516 / 255.

Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (24)

A.1.3 Comparison of different embedding settings

In this section, we will discuss in detail the impact of different embedding settings on ASR.

1. Impact of different padding token on ASR: In this study, when calculating the loss for the text context part, we used a series of padding tokens for experimentsThese padding tokens consist of meaningless characters such as ’!’, ’@’, and ’+’. To verify the impact of different padding tokens on the Attack Success Rate (ASR) within the text context, we conducted experiments using various padding tokens. Table.10 show the ASR for different padding token. The experimental parameters we set are consistent with those in the main text, except for the padding tokens.

Padding TokenCLSCAPVQAOverall
+++0.9100.8250.7260.820
*0.9420.8860.7880.872
&0.9160.8630.7930.857
#0.9160.8540.7690.847
///0.9340.8760.8020.871
@@@@0.9300.8560.7190.835
!!!0.9480.8980.8260.891
MethodCLSCAPVQAOverall
Targetnoprefixsuffixmixednoprefixsuffixmixednoprefixsuffixmixednoprefixsuffixmixed
green0.9670.9120.9800.9540.8690.7870.8930.9070.6950.6850.6960.7290.8430.7950.8560.864
human0.9900.9920.9920.9740.8840.9080.9010.9410.7780.7120.7760.7780.8840.8710.8900.897
fish0.9990.9880.9990.9910.9990.9750.9990.9930.9260.8980.9370.9370.9750.9540.9780.973
flower0.9980.9451.0000.9780.9860.8970.9920.9790.8230.6170.7820.8540.9360.8200.9250.937
bird0.9950.8990.9970.9930.9970.8630.9990.9960.9210.6650.8690.8440.9710.8090.9550.944
cat1.0000.9691.0000.9920.9990.9390.9980.9870.8380.6810.8130.8640.9450.8630.9370.948
dog0.9950.8820.9830.9280.9910.8340.9760.9210.7820.5980.7490.7990.9230.7720.9030.883
blood0.9860.9410.9890.9400.9760.9500.9790.9660.7840.6360.7580.8100.9160.8430.9090.905
bad0.5260.4350.5820.6940.4220.3210.5130.6600.1640.2460.2470.3060.3700.3340.4470.553
p*rn0.8860.9400.9140.9180.7200.8200.7790.8960.6360.7320.6530.6620.7470.8300.7820.825
virus0.9780.9080.9830.9260.8800.8630.9430.9610.7200.6940.7350.8620.8590.8220.8870.916
drug0.9620.9250.9670.9240.8820.8670.9020.9420.6830.5900.6920.7480.8420.7940.8530.871
poison0.8670.8410.8870.9380.7350.7470.7740.9270.5650.6150.5770.7800.7220.7340.7460.882
gun0.9550.9260.9500.9470.9740.9080.9750.9610.7680.6450.7750.8760.8990.8260.9000.928
bomb0.9900.9810.9850.9290.9880.9760.9900.9360.8290.8640.8000.8650.9360.9400.9250.910
sure0.8370.7720.8820.8750.5740.5210.6960.8130.3140.3200.4010.5560.5750.5380.6600.748
unknown0.9170.9020.9370.8900.7690.8140.8090.8700.7610.8040.7860.8600.8160.8400.8440.873
good0.9660.9720.9800.9570.8630.8650.9000.9470.7730.8240.7510.8510.8670.8870.8770.918
yes0.9170.8760.9220.9230.8860.8390.9040.9320.8700.8310.8680.8370.8910.8490.8980.898
no0.9760.8950.9800.9730.8850.7890.9080.9700.9440.9030.9170.9360.9350.8620.9350.959
sorry0.8250.7200.8450.8560.6960.6440.7460.8670.5310.5540.5840.7330.6840.6390.7250.818
Overall0.9300.8870.9410.9290.8560.8160.8850.9220.7190.6720.7220.7850.8350.7920.8490.879

2. Impacts of the embedding strategies for incorporating special padding token(specifically ‘@‘) within the text context on the visual-language model. The four embedding strategies evaluated are: no embedding, prefix embedding, suffix embedding, and mixed embedding (embedding ‘@‘ within the text).

The results, as summarized in Table 11, indicate significant variability in the performance of the visual-language model based on the embedding method used for the special character ‘@‘. The evaluation encompasses three main tasks: classification (CLS), captioning (CAP), and visual question answering (VQA), each exhibiting distinct trends in success rates across different targets.

When considering overall performance, hybrid embedding emerges as the most sustainable and effective strategy, achieving the highest average success rate across tasks. This method’s flexibility in integrating the special character @ within the text appears to enhance the model’s interpretative capabilities, particularly in more context-dependent scenarios such as Visual Question Answering (VQA).

Suffix embedding, on the other hand, demonstrates superior performance in classification tasks. This suggests that placing the special character at the end of the text helps maintain contextual integrity, thereby improving the model’s ability to correctly classify inputs. The success of suffix embedding in classification tasks implies that the placement of special characters can significantly impact the effectiveness of the attack, with the suffix position causing the least disruption to the model’s processing flow.

Interestingly, when examining the text concealment rate during the attack, the method without embedding shows the highest effectiveness. This approach does not introduce additional symbols such as @, which can sometimes alert the model or the user to the presence of an attack, thus maintaining a more natural and undetectable text flow. The non-embedded method also exhibits a relatively high success rate in attacks compared to the baseline, indicating that simplicity and subtlety in embedding can sometimes be more advantageous than complex embedding strategies.

A.2 Example of cross prompt task

Example dataset of transfer attack text prompts excerpted from CroPALuo etal. (2024), divided into three categories: image classification(CLS), image captioning(CAP), and visual question answering(VQA).

A.2.1 Examples of CLS task

  • "If this image were turned into a jigsaw puzzle, what would the box label say to describe the picture inside?"

  • "Classify the content of this image."

  • "If you were to label this image, what label would you give?"

  • "What category best describes this image?"

  • "Describe the central subject of this image in a single word."

  • "Provide a classification for the object depicted in this image."

  • "If this image were in a photo album, what would its label be?"

  • "Categorize the content of the image."

  • "If you were to sort this image into a category, which one would it be?"

  • "What keyword would you associate with this image?"

A.2.2 Examples of CAP task

  • "Describe the content of this image."

  • "What’s happening in this image?"

  • "Provide a brief caption for this image."

  • "Tell a story about this image in one sentence."

  • "If this image could speak, what would it say?"

  • "Summarize the scenario depicted in this image."

  • "What is the central theme or event shown in the picture?"

  • "Create a headline for this image."

  • "Explain the scene captured in this image."

  • "If this were a postcard, what message would it convey?"

A.2.3 Examples of VQA task

  • "Any cutlery items visible in the image?"

  • "Can you find any musical instruments in this image?"

  • "Does the image appear to be a cartoon or comic strip?"

  • "How many animals are present in the image?"

  • "Is a chair noticeable in the image?"

  • "How many statues or monuments stand prominently in the scene?"

  • "How many different patterns or motifs are evident in clothing or objects?"

  • "What is the spacing between objects or subjects in the image?"

  • "Would you describe the image as bright or dark?"

  • "What type of textures can be felt if one could touch the image’s content?"

Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens (2024)
Top Articles
Latest Posts
Article information

Author: Frankie Dare

Last Updated:

Views: 5905

Rating: 4.2 / 5 (53 voted)

Reviews: 84% of readers found this page helpful

Author information

Name: Frankie Dare

Birthday: 2000-01-27

Address: Suite 313 45115 Caridad Freeway, Port Barabaraville, MS 66713

Phone: +3769542039359

Job: Sales Manager

Hobby: Baton twirling, Stand-up comedy, Leather crafting, Rugby, tabletop games, Jigsaw puzzles, Air sports

Introduction: My name is Frankie Dare, I am a funny, beautiful, proud, fair, pleasant, cheerful, enthusiastic person who loves writing and wants to share my knowledge and understanding with you.