Self-regulating Prompts: Foundational Model Adaptation without Forgetting [ICCV-2023]

1Mohamed bin Zayed University of AI, 2Australian National University, 3Linköping University, 4University of California, Merced, 5Google Research
*Joint first authors

Left: Existing prompt learning approaches for foundational Vision-Language models such as CLIP rely on task-specific objectives that restrict prompts to learning a feature space suited only to the downstream task, consequently losing the generalized knowledge of CLIP (shown in purple). Our self-regulating framework explicitly guides the training trajectory of prompts towards the closest point between the two optimal solution manifolds (solid line), so that prompts learn task-specific representations while also retaining generalized CLIP knowledge (shown in green). Middle: Averaged across 11 image recognition datasets, PromptSRC surpasses existing methods in the base-to-novel generalization setting. Right: We evaluate our approach on four diverse image recognition benchmarks for CLIP and show consistent gains over previous state-of-the-art approaches.

Abstract

Prompt learning has emerged as an efficient alternative for fine-tuning foundational models, such as CLIP, for various downstream tasks. Conventionally trained with a task-specific objective, i.e., cross-entropy loss, prompts tend to overfit the downstream data distribution and struggle to capture task-agnostic general features from the frozen CLIP. This leads to a loss of the model's original generalization capability. To address this issue, our work introduces a self-regularization framework for prompting called PromptSRC (Prompting with Self-regulating Constraints). PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations using a three-pronged approach: (a) regulating prompted representations via mutual agreement maximization with the frozen model, (b) regulating with a self-ensemble of prompts over the training trajectory to encode their complementary strengths, and (c) regulating with textual diversity to mitigate sample diversity imbalance with the visual branch. To the best of our knowledge, this is the first regularization framework for prompt learning that avoids overfitting by jointly attending to pre-trained model features, the training trajectory during prompting, and textual diversity. PromptSRC explicitly steers the prompts to learn a representation space that maximizes performance on downstream tasks without compromising CLIP's generalization. We perform experiments on 4 benchmarks where PromptSRC performs favorably compared to existing methods. Our code and pre-trained models are publicly available.

A Regularization Framework for Prompt Learning

Key components of PromptSRC:
  1. Mutual agreement maximization: PromptSRC explicitly guides the prompts to jointly acquire both task-specific knowledge and task-agnostic generalized knowledge by maximizing the mutual agreement between prompted and pre-trained frozen features of the VL model (see the sketch after this list).
  2. Gaussian weighted prompt aggregation (GPA): We propose a weighted self-ensembling strategy for prompts over the training trajectory that captures complementary features and enhances their generalization abilities.
  3. Textual diversity: PromptSRC regulates prompts with textual diversity to mitigate sample diversity imbalance compared to the visual branch during training.
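
To make the mutual agreement constraint concrete, the following is a minimal PyTorch-style sketch of a self-consistency regularizer of this kind. It assumes L2-normalized features, an L1 distance at the feature level, and a KL-divergence term at the logit level; the function name `scl_loss` and the weighting coefficients are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def scl_loss(img_p, txt_p, img_f, txt_f, logit_scale,
             lambda_feat=1.0, lambda_logit=1.0):
    """Self-consistency (mutual agreement) regularizer: a minimal sketch.

    img_p, txt_p : prompted image (B, D) and text (C, D) features
    img_f, txt_f : frozen pre-trained CLIP features of the same inputs
    All features are assumed L2-normalized; lambda_feat / lambda_logit
    are illustrative weights, not the paper's exact settings.
    """
    # Feature-level agreement on both the vision and language branches
    l_feat = F.l1_loss(img_p, img_f) + F.l1_loss(txt_p, txt_f)

    # Logit-level agreement: match the frozen CLIP's class distribution
    logits_p = logit_scale * img_p @ txt_p.t()
    logits_f = logit_scale * img_f @ txt_f.t()
    l_logit = F.kl_div(F.log_softmax(logits_p, dim=-1),
                       F.softmax(logits_f, dim=-1),
                       reduction="batchmean")

    return lambda_feat * l_feat + lambda_logit * l_logit
```

This regularizer is added on top of the standard cross-entropy loss, so the prompts are pulled towards both the downstream task optimum and the frozen CLIP representations.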

PromptSRC architecture

Our proposed PromptSRC framework for self-regulating prompt learning. CLIP encoders are used to generate prompted \((\tilde{f}_p, \tilde{g}_p)\) and pre-trained \((\tilde{f}, \tilde{g})\) features at the image and text sides. First, we introduce textual diversity and define textual augmentations to produce a diverse set of frozen VL textual features, which are averaged to represent the pre-trained VL text features \(\tilde{g}\). Next, we employ Mutual Agreement Maximization constraints \(\mathcal{L}_{\textrm{SCL}}\) to regulate the prompts, which ensure that the prompted features align well with the pre-trained VL representations at both the feature and logit levels. As CLIP is frozen, we use the same VL encoders to obtain both types of features. Further, our prompt self-ensembling combines the strengths of prompts learned at different epochs \(P_1, P_2, \cdots, P_E\) during training via Gaussian-weighted sampling. The ensembled visual and textual prompts are then used for the final inference.
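
The textual diversity and GPA components described above can also be sketched compactly. Below is a hedged, PyTorch-style illustration: `averaged_text_features` averages frozen text features obtained from multiple prompt templates via OpenAI CLIP's `clip.tokenize` / `encode_text` interface, and `gaussian_prompt_aggregation` computes a Gaussian-weighted average of the prompts saved at each epoch. The template set, the Gaussian mean and standard deviation, and the function names are illustrative assumptions, not the paper's exact settings.

```python
import torch
import clip  # assumes the openai/CLIP package


@torch.no_grad()
def averaged_text_features(clip_model, classnames, templates, device="cpu"):
    """Textual diversity (sketch): average frozen CLIP text features over
    several hand-crafted templates to obtain one pre-trained text feature
    per class."""
    class_features = []
    for name in classnames:
        tokens = clip.tokenize([t.format(name) for t in templates]).to(device)
        feats = clip_model.encode_text(tokens)
        feats = feats / feats.norm(dim=-1, keepdim=True)  # normalize per template
        mean_feat = feats.mean(dim=0)                     # average over templates
        class_features.append(mean_feat / mean_feat.norm())
    return torch.stack(class_features)                    # (num_classes, D)


def gaussian_prompt_aggregation(prompt_history, mu, sigma):
    """Gaussian-weighted prompt aggregation (GPA) sketch: weight the prompts
    saved at epochs 1..E with a Gaussian over epochs and return their
    weighted average, which is used for final inference."""
    epochs = torch.arange(1, len(prompt_history) + 1, dtype=torch.float32)
    weights = torch.exp(-((epochs - mu) ** 2) / (2 * sigma ** 2))
    weights = weights / weights.sum()                      # normalize to sum to 1
    stacked = torch.stack(prompt_history, dim=0)           # (E, *prompt_shape)
    w = weights.view(-1, *([1] * (stacked.dim() - 1)))     # broadcast over prompt dims
    return (w * stacked).sum(dim=0)
```

For instance, `templates` could be a standard CLIP prompt set such as `"a photo of a {}."` and `"a sketch of a {}."`, and `prompt_history` a list of prompt tensors snapshotted once per epoch.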

PromptSRC results comparison

We integrate our PromptSRC self-regularization framework with the Independent V-L prompt learning (IVLP) baseline, which naively trains vision and language prompts using the supervised cross-entropy loss.
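
Concretely, with the self-regulating constraints attached to this baseline, each training step optimizes a combined objective of roughly the following form (a sketch consistent with the description above; \(\lambda_1\) and \(\lambda_2\) are weighting hyperparameters):

\[ \mathcal{L}_{\textrm{final}} = \mathcal{L}_{\textrm{CE}} + \lambda_{1}\,\mathcal{L}_{\textrm{SCL-image}} + \lambda_{2}\,\mathcal{L}_{\textrm{SCL-text}} + \mathcal{L}_{\textrm{SCL-logits}}, \]

where the SCL terms enforce feature-level agreement on the image and text branches and logit-level agreement with the frozen CLIP predictions, respectively.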

PromptSRC effectively mitigates prompt overfitting.

Naively training prompts with the standard supervised objective (IVLP) improves performance on the supervised base classes but generalizes increasingly poorly to novel classes as the training schedule lengthens. Our PromptSRC method, with explicit self-regulating constraints, improves on base classes while also showing gains on novel classes.

Effect of our proposed regularization techniques. Results are averaged over 11 datasets. PromptSRC achieves improvements on novel classes while maintaining supervised base-class performance, leading to average novel-class and harmonic-mean gains of +4.31% and +2.46%, respectively.

| Method | Base Acc. | Novel Acc. | Harmonic Mean (HM) |
|---|---|---|---|
| 1: Independent V-L prompting | 84.21 | 71.79 | 77.51 |
| 2: + \(\mathcal{L}_{\textrm{SCL}}\) | 84.21 | 75.38 | 79.55 |
| 3: + GPA | 84.16 | 75.69 | 79.70 |
| 4: + Textual diversity | 84.26 | 76.10 | 79.97 |


PromptSRC in comparison with existing methods

The table below compares PromptSRC with state-of-the-art methods on base-to-novel generalization. PromptSRC demonstrates strong generalization performance over existing methods across 11 different recognition datasets.

| Method | Base Acc. | Novel Acc. | Harmonic Mean (HM) |
|---|---|---|---|
| CLIP [1] | 69.34 | 74.22 | 71.70 |
| CoOp [2] | 82.69 | 63.22 | 71.66 |
| CoCoOp [3] | 80.47 | 71.69 | 75.83 |
| ProDA [4] | 81.56 | 72.30 | 76.65 |
| MaPLe [5] | 82.28 | 75.14 | 78.55 |
| PromptSRC (ours) | 84.26 | 76.10 | 79.97 |

Conclusion

Prompt learning has emerged as an effective paradigm for adapting foundational VL models like CLIP. However, the prompts learned by the majority of existing methods inherently tend to overfit the task-specific objective and consequently compromise the inherent generalization ability of CLIP. Our work proposes a self-regulating prompt learning framework that addresses the prompt overfitting problem for better generalization. We show that it is critical to guide the training trajectory of prompts by explicitly encouraging their mutual agreement with the frozen model through self-consistency constraints, supplemented by incorporating textual diversity. We also propose a self-ensembling strategy for prompts that appropriately aggregates them via a Gaussian-weighted approach over the course of training. Extensive evaluations on multiple benchmarks show the benefit of our self-regulating approach for prompt learning.


For more details about the proposed regularization framework and comparisons on additional benchmarks, please refer to our main paper. Thank you!

BibTeX

@InProceedings{Khattak_2023_ICCV,
    author    = {Khattak, Muhammad Uzair and Wasim, Syed Talal and Naseer, Muzammal and Khan, Salman and Yang, Ming-Hsuan and Khan, Fahad Shahbaz},
    title     = {Self-regulating Prompts: Foundational Model Adaptation without Forgetting},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {15190-15200}
}