Abstract:
Pretrained Vision-Language Models (VLMs) such as CLIP exhibit remarkable capabilities across downstream tasks, yet their image encoders remain vulnerable to adversarial examples. A recently introduced lightweight approach, Adversarial Prompt Tuning (AdvPT), trains learnable prompts on adversarial examples, enhancing the adversarial robustness of VLMs solely by manipulating textual inputs. However, the static prompts learned by AdvPT overfit the base classes observed during training, compromising the model's generalizability. In this paper, we propose a conditional Adversarial Prompt Tuning method that extends AdvPT by additionally learning a network that generates an input-specific prompt for each image. These dynamic prompts improve the generalizability of VLMs to unseen classes. Furthermore, since VLMs are inherently strong generalizers, we incorporate the manual prompts used by VLMs at test time to further improve generalizability. Extensive experiments on 8 datasets demonstrate that our prompt-fusion-based method significantly outperforms AdvPT on unseen classes, enhancing the generalizability and adversarial robustness of VLMs simultaneously.
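The two ideas in the abstract — a network that produces an input-specific prompt, and test-time fusion with manual prompts — can be sketched roughly as follows. This is a minimal NumPy illustration, not the paper's implementation: the meta-network shape, the additive conditioning on the context tokens, and the fusion weight `alpha` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 512    # text/image embedding dimension (CLIP-like; assumed)
N_CTX = 4  # number of learnable context tokens (assumed)

# Static learnable context tokens, as in AdvPT-style prompt tuning.
ctx = rng.normal(scale=0.02, size=(N_CTX, D))

# Hypothetical meta-network: a tiny MLP mapping an image feature to a
# per-input bias that is added to every context token, making the
# prompt conditional on the image.
W1 = rng.normal(scale=0.02, size=(D, D // 16))
W2 = rng.normal(scale=0.02, size=(D // 16, D))

def conditional_prompt(image_feat):
    """Return input-specific context tokens: ctx + pi(image_feat)."""
    h = np.maximum(image_feat @ W1, 0.0)  # ReLU hidden layer
    bias = h @ W2                         # shape (D,)
    return ctx + bias                     # broadcast over N_CTX tokens

def fuse_text_features(tuned_feat, manual_feat, alpha=0.5):
    """Blend tuned-prompt and manual-prompt ("a photo of a ...")
    text features at test time, then renormalize."""
    fused = alpha * tuned_feat + (1.0 - alpha) * manual_feat
    return fused / np.linalg.norm(fused)

image_feat = rng.normal(size=(D,))
prompts = conditional_prompt(image_feat)
print(prompts.shape)  # (4, 512): one set of context tokens per input
```

In a full system the conditional context tokens would be fed through CLIP's frozen text encoder per class, and the fused text features compared against the (possibly adversarial) image feature by cosine similarity.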
Source:
ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT IX, ICIC 2024
ISSN: 0302-9743
Year: 2024
Volume: 14870
Page: 328-339