Abstract:
Pretrained Vision-Language Models (VLMs) such as CLIP exhibit remarkable capabilities across downstream tasks, yet their image encoders remain vulnerable to adversarial examples. A recently introduced lightweight approach, Adversarial Prompt Tuning (AdvPT), trains learnable prompts on adversarial examples, enhancing the adversarial robustness of VLMs solely through manipulation of textual inputs. However, the static prompts learned by AdvPT overfit the base classes observed during training, compromising the model's generalizability. In this paper, we propose a conditional Adversarial Prompt Tuning method that extends AdvPT by additionally learning a network that generates an input-specific prompt for each image. These dynamic prompts improve the generalizability of VLMs to unseen classes. Furthermore, since VLMs are inherently strong generalizers, we incorporate the manual prompts used by VLMs at test time to further improve generalizability. Extensive experiments on eight datasets demonstrate that our prompt-fusion-based method significantly outperforms AdvPT on unseen classes, enhancing the generalizability and adversarial robustness of VLMs simultaneously.
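To illustrate the core idea described in the abstract, below is a minimal sketch (not the authors' code) of conditional prompt learning: a lightweight meta-network maps each image feature to a bias that is added to shared learnable context tokens, producing an input-specific prompt. All module and variable names here are hypothetical, and the per-input conditioning follows the general CoCoOp-style design the abstract alludes to; during adversarial prompt tuning, the image features would come from adversarial examples.

import torch
import torch.nn as nn

class ConditionalPromptLearner(nn.Module):
    def __init__(self, n_ctx: int = 4, ctx_dim: int = 512, feat_dim: int = 512):
        super().__init__()
        # Shared learnable context vectors (the static part of the prompt).
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)
        # Lightweight meta-network: image feature -> per-input prompt shift.
        self.meta_net = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 16),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim // 16, ctx_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, feat_dim), e.g. from CLIP's image encoder,
        # computed on adversarial examples during training.
        bias = self.meta_net(image_features)             # (batch, ctx_dim)
        ctx = self.ctx.unsqueeze(0) + bias.unsqueeze(1)  # (batch, n_ctx, ctx_dim)
        # The returned context tokens would be prepended to the class-name
        # token embeddings before the text encoder, yielding a dynamic prompt.
        return ctx

if __name__ == "__main__":
    learner = ConditionalPromptLearner()
    adv_feats = torch.randn(8, 512)  # stand-in for adversarial image features
    prompts = learner(adv_feats)
    print(prompts.shape)  # torch.Size([8, 4, 512])

Because the prompt depends on each input rather than being fixed after training, it can adapt to images of unseen classes; the prompt fusion mentioned in the abstract would additionally combine these learned prompts with hand-crafted manual prompts (e.g., "a photo of a {class}") at test time.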
Source:
ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT IX, ICIC 2024
ISSN: 0302-9743
Year: 2024
Volume: 14870
Pages: 328-339