Prompt learning has emerged as an efficient alternative to fine-tuning foundational models, such as CLIP, for various downstream tasks. Conventionally trained using a task-specific objective, i.e., cross-entropy loss, prompts tend to overfit downstream data distributions and struggle to capture task-agnostic general features from the frozen CLIP. This leads to the loss of the model's original generalization capability. To address this issue, our work introduces a self-regularization framework for prompting called PromptSRC (Prompting with Self-regulating Constraints). PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations using a three-pronged approach: (a) regulating prompted representations via mutual agreement maximization with the frozen model, (b) regulating with a self-ensemble of prompts over the training trajectory to encode their complementary strengths, and (c) regulating with textual diversity to mitigate sample diversity imbalance with the visual branch. To the best of our knowledge, this is the first regularization framework for prompt learning that avoids overfitting by jointly attending to pre-trained model features, the training trajectory during prompting, and textual diversity. PromptSRC explicitly steers the prompts to learn a representation space that maximizes performance on downstream tasks without compromising CLIP's generalization. We perform experiments on 4 benchmarks, where PromptSRC performs favorably compared to existing methods. Our code and pre-trained models are publicly available.
Our proposed PromptSRC framework for self-regulating prompt learning. CLIP encoders are used to generate prompted \((\tilde{f}_p, \tilde{g}_p)\) and pre-trained \((\tilde{f}, \tilde{g})\) features at the image and text sides. First, we introduce textual diversity and define textual augmentations to produce a diverse set of frozen VL textual features, which are averaged to represent the pre-trained VL text features \(\tilde{g}\). Next, we employ Mutual Agreement Maximization constraints \(\mathcal{L}_{\textrm{SCL}}\) to regulate the prompts, which ensure that the prompted features align well with the pre-trained VL representations at both the feature and logit levels. As CLIP is frozen, we use the same VL encoders to obtain both types of features. Further, our prompt self-ensembling combines the strengths of prompts learned at different epochs \(P_1, P_2, \ldots, P_E\) during training via Gaussian-weighted sampling. The ensembled visual and textual prompts are then used for the final inference.
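To make the textual-diversity step concrete, below is a minimal PyTorch sketch that averages frozen CLIP text features over multiple prompt templates to form the anchors \(\tilde{g}\); the template list and the `frozen_text_features` helper are illustrative assumptions, not the exact augmentations used in the paper.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

# Illustrative templates; the paper uses a set of textual augmentations
# whose exact wording may differ.
TEMPLATES = [
    "a photo of a {}.",
    "a bad photo of a {}.",
    "a sketch of a {}.",
    "a photo of the large {}.",
]

@torch.no_grad()
def frozen_text_features(model, classnames, device="cuda"):
    """Average frozen CLIP text features over diverse templates to obtain
    the pre-trained text anchors (one unit-norm vector per class)."""
    per_template = []
    for template in TEMPLATES:
        tokens = clip.tokenize([template.format(c) for c in classnames]).to(device)
        feats = model.encode_text(tokens)                  # (C, D)
        per_template.append(feats / feats.norm(dim=-1, keepdim=True))
    g = torch.stack(per_template).mean(dim=0)              # average over templates
    return g / g.norm(dim=-1, keepdim=True)

# Usage: model, _ = clip.load("ViT-B/16"); anchors = frozen_text_features(model, ["cat", "dog"])
```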
We integrate our PromptSRC self-regularization framework with the Independent V-L Prompting (IVLP) baseline, which naively trains vision and language prompts using a supervised cross-entropy loss.
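To illustrate how the self-consistency constraints \(\mathcal{L}_{\textrm{SCL}}\) supplement this cross-entropy objective, here is a hedged PyTorch sketch; the function name `promptsrc_loss` and the loss weights `lambda_feat` and `lambda_logit` are assumptions for illustration, while the L1 feature matching and KL logit matching follow the mutual-agreement description above.

```python
import torch.nn.functional as F

def promptsrc_loss(logits_p, logits_frozen,
                   f_p, f_frozen, g_p, g_frozen,
                   labels, lambda_feat=10.0, lambda_logit=1.0):
    """Cross-entropy on prompted logits plus mutual-agreement
    (self-consistency) constraints against the frozen CLIP."""
    # Task-specific objective of the IVLP baseline.
    loss_ce = F.cross_entropy(logits_p, labels)

    # Feature-level agreement: L1 distance between prompted and frozen
    # image (f) and text (g) features.
    loss_feat = F.l1_loss(f_p, f_frozen) + F.l1_loss(g_p, g_frozen)

    # Logit-level agreement: KL divergence between prompted and frozen
    # prediction distributions.
    loss_logit = F.kl_div(F.log_softmax(logits_p, dim=-1),
                          F.softmax(logits_frozen, dim=-1),
                          reduction="batchmean")

    return loss_ce + lambda_feat * loss_feat + lambda_logit * loss_logit
```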
Naively training prompts with the standard supervised objective (IVLP) improves performance on supervised base classes but leads to poor generalization to novel classes as the training schedule lengthens. Our PromptSRC method, with its explicit self-regulating constraints, improves on base classes while also showing gains on novel classes.
Effect of our proposed regularization techniques. Results are averaged over 11 datasets. PromptSRC achieves improvements on novel classes while maintaining performance on supervised base classes, yielding average novel-class and harmonic-mean gains of +4.31% and +2.46%, respectively.
| Method | Base Acc. (%) | Novel Acc. (%) | HM (%) |
| --- | --- | --- | --- |
| 1: Independent V-L prompting | | | |
| 2: + \(\mathcal{L}_{\textrm{SCL}}\) | | | |
| 3: + GPA | | | |
| 4: + Textual diversity | | | |
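The Gaussian-weighted prompt aggregation (GPA) evaluated in row 3 above can be sketched as follows; the function name and the default Gaussian mean and spread over epochs are illustrative assumptions, not the paper's hyper-parameters.

```python
import math
import torch

def gaussian_prompt_aggregation(prompt_snapshots, mu=None, sigma=None):
    """Ensemble prompt snapshots P_1..P_E saved during training using
    Gaussian weights over the epoch index."""
    E = len(prompt_snapshots)
    mu = 0.6 * E if mu is None else mu            # illustrative default centre
    sigma = E / 3.0 if sigma is None else sigma   # illustrative default spread
    stacked = torch.stack(prompt_snapshots)       # (E, *prompt_shape)
    weights = torch.tensor(
        [math.exp(-((e - mu) ** 2) / (2 * sigma ** 2)) for e in range(1, E + 1)],
        dtype=stacked.dtype, device=stacked.device,
    )
    weights = weights / weights.sum()             # normalise to sum to 1
    w = weights.view(-1, *([1] * (stacked.dim() - 1)))
    return (stacked * w).sum(dim=0)               # aggregated prompt

# Usage: final_prompt = gaussian_prompt_aggregation([p.detach().clone() for p in epoch_prompts])
```

Weighting epochs with a Gaussian lets the ensemble emphasise well-trained later prompts while still retaining the more generalizable early-epoch prompts, in line with the goal of combining their complementary strengths.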
The table below compares PromptSRC with state-of-the-art methods on base-to-novel generalization. PromptSRC demonstrates strong generalization performance over existing methods across 11 recognition datasets.
| Method | Base Acc. (%) | Novel Acc. (%) | HM (%) |
| --- | --- | --- | --- |
| CLIP [1] | | | |
| CoOp [2] | | | |
| CoCoOp [3] | | | |
| ProDA [4] | | | |
| MaPLe [5] | | | |
| PromptSRC (ours) | | | |
Prompt learning has emerged as an effective paradigm for adapting foundational VL models like CLIP. However, the prompts learned by the majority of existing methods tend to overfit task-specific objectives and consequently compromise the inherent generalization ability of CLIP. Our work proposes a self-regulating prompt learning framework that addresses the prompt overfitting problem for better generalization. We show that it is critical to guide the training trajectory of prompts by explicitly encouraging their mutual agreement with the frozen model through self-consistency constraints, supplemented by incorporating textual diversity. We also propose a self-ensembling strategy that appropriately aggregates prompts via a Gaussian-weighted approach over the course of training. Extensive evaluations on multiple benchmarks show the benefit of our self-regulating approach for prompt learning.
For more details about the proposed regularization framework and results comparison over additional benchmarks, please refer to our main paper. Thank you!
[1] Radford, Alec, et al. "Learning transferable visual models from natural language supervision." International conference on machine learning. PMLR, 2021.
[2] Zhou, Kaiyang, et al. "Learning to prompt for vision-language models." International Journal of Computer Vision 130.9 (2022): 2337-2348.
[3] Zhou, Kaiyang, et al. "Conditional prompt learning for vision-language models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[4] Lu, Yuning, et al. "Prompt distribution learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[5] Khattak, Muhammad Uzair, et al. "MaPLe: Multi-modal prompt learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
@InProceedings{Khattak_2023_ICCV,
author = {Khattak, Muhammad Uzair and Wasim, Syed Talal and Naseer, Muzammal and Khan, Salman and Yang, Ming-Hsuan and Khan, Fahad Shahbaz},
title = {Self-regulating Prompts: Foundational Model Adaptation without Forgetting},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2023},
pages = {15190-15200}
}