Foundational vision-language models such as CLIP are becoming a new paradigm in vision due to their excellent generalization abilities. However, adapting these models for downstream tasks while maintaining their generalization remains a challenge. In the literature, one branch of methods adapts CLIP by learning prompts using visual information. While effective, most of these works require labeled data, which is not always practical, and they often struggle to generalize to new datasets due to over-fitting on the source data. An alternative approach resorts to training-free methods that generate class descriptions from large language models (LLMs) and perform prompt ensembling. However, these methods often produce class-specific prompts that cannot be transferred to other classes and incur higher costs by generating LLM descriptions for each class separately. In this work, we propose to combine the strengths of both streams of methods by learning prompts using only text data derived from LLMs. As supervised training of prompts is not trivial in the absence of images, we develop a training approach that allows prompts to extract rich contextual knowledge from LLM data. Moreover, with the LLM contextual data mapped within the learned prompts, they can be transferred zero-shot to new classes and datasets, potentially cutting the LLM prompt-engineering cost. To the best of our knowledge, this is the first work that learns generalized prompts using text-only data. We perform extensive evaluations on 4 benchmarks where our method improves over prior ensembling works while being competitive with those utilizing labeled images. Our code and pretrained models are publicly available.
Overview of the ProText framework. (Left) First, diverse captions are generated for the training classes using an LLM such as GPT-3. During training, the CLIP text encoder generates the prompted class-name feature \((\tilde{g}_p)\) from class-name templates with learnable prompts, and the frozen LLM template feature \((\tilde{g})\) from the LLM-generated templates. Next, we employ a contextual mapping loss that guides the learnable prompts to learn a mapping from the prompted class-name feature to the LLM template feature, which contains richer information about the class. This allows the learned prompts to exploit the internal knowledge of the text encoder, complemented by the LLM descriptions. (Right) At inference, the learned prompts are used with class-name templates, and the standard zero-shot CLIP inference protocol is followed. Moreover, the rich contextual information from LLM descriptions mapped within the learned prompts enables their transfer to new classes and datasets.
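To make this recipe concrete, below is a minimal sketch of one such training step, assuming a PyTorch-style setup. All names (`text_encoder`, `encode_with_prompts`, `prompt_vectors`) are illustrative placeholders rather than the released ProText implementation, and the exact distance used for the contextual mapping loss is an assumption (L1 shown for illustration).

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of one ProText-style training step (hypothetical names).
# `text_encoder`       : frozen CLIP text encoder, returns one feature per tokenized string
# `encode_with_prompts`: prepends the learnable prompt vectors to the tokenized
#                        class-name template before running the same frozen encoder
# `prompt_vectors`     : the only trainable parameters

def contextual_mapping_step(text_encoder, encode_with_prompts, prompt_vectors,
                            classname_tokens, llm_tokens, optimizer):
    # Prompted class-name feature g~_p; gradients flow only into prompt_vectors.
    g_p = encode_with_prompts(text_encoder, classname_tokens, prompt_vectors)

    # Frozen LLM-template feature g~ serves as the regression target.
    with torch.no_grad():
        g = text_encoder(llm_tokens)

    g_p = F.normalize(g_p, dim=-1)
    g = F.normalize(g, dim=-1)

    # Contextual mapping loss: a distance between the two text features.
    # The specific form is an assumption here (L1 distance shown).
    loss = F.l1_loss(g_p, g)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```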
With the same amount of text data, learning contextual prompts with text-only supervision improves CLIP performance over prompt-ensembling techniques (a sketch of such an ensembling baseline is shown after the table).
Table: CLIP, DCLIP, Waffle CLIP, CuPL, and ProText (ours) compared with the same LLM text data; full results are reported in the main paper.
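For reference, here is a hedged sketch of the class-specific prompt ensembling used by CuPL-style baselines, under the assumption that each class's LLM-generated descriptions are encoded and averaged into a single classifier weight. Names such as `build_ensembled_classifier` and `llm_descriptions` are illustrative, not CuPL's released code.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of CuPL-style prompt ensembling (illustrative names).
# `llm_descriptions[classname]` is a list of GPT-3 generated sentences for that
# class, so the resulting classifier weights are class-specific and cannot be
# reused for unseen classes.

def build_ensembled_classifier(text_encoder, tokenizer, llm_descriptions):
    weights = []
    with torch.no_grad():
        for classname, sentences in llm_descriptions.items():
            feats = text_encoder(tokenizer(sentences))            # one feature per sentence
            feats = F.normalize(feats, dim=-1)
            class_weight = F.normalize(feats.mean(dim=0), dim=-1)  # average, then renormalize
            weights.append(class_weight)
    return torch.stack(weights)  # (num_classes, feature_dim) zero-shot classifier
```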
Next, we demonstrate the generalization of ProText, showing that the learned prompts transfer well to new classes and datasets.
With the contextual LLM information mapped within the prompts, ProText enables the transfer of learned prompts to new classes and improves over CuPL (a sketch of this zero-shot transfer follows the table).
Table: CLIP, CuPL, and ProText (ours) on generalization to new classes; full results are reported in the main paper.
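A hedged sketch of how the learned prompts are reused for new classes or datasets is given below: the same prompt vectors are kept fixed and only the class-name templates change, following the standard CLIP zero-shot protocol. `clip_model`, `encode_with_prompts`, and `prompt_vectors` are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of zero-shot transfer with the learned prompts (illustrative names).
# The same `prompt_vectors` trained on the source classes are reused unchanged;
# only the class-name templates for the new dataset are swapped in.

def zero_shot_with_learned_prompts(clip_model, encode_with_prompts, prompt_vectors,
                                   tokenizer, new_classnames, images):
    with torch.no_grad():
        tokens = tokenizer([f"a photo of a {c}." for c in new_classnames])
        text_feats = encode_with_prompts(clip_model.text_encoder, tokens, prompt_vectors)
        text_feats = F.normalize(text_feats, dim=-1)

        image_feats = F.normalize(clip_model.encode_image(images), dim=-1)

        # Standard CLIP zero-shot protocol: scaled cosine similarity, argmax over classes.
        logits = clip_model.logit_scale.exp() * image_feats @ text_feats.t()
    return logits.argmax(dim=-1)
```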
ProText with text-only training improves over CLIP, CuPL, and prior 16-shot image-supervised methods in the challenging cross-dataset transfer setting. Prompt-ensembling-based CuPL performs the same as CLIP because its class-specific LLM templates cannot be transferred to the cross-datasets.
Table: cross-dataset transfer results for CoOp [1], CoCoOp [2], MaPLe [3], PromptSRC [4], CLIP / CuPL, and ProText (ours); full results are reported in the main paper.
Models are trained on ImageNet-1k data and evaluated on 10 cross-datasets.
The above figure shows attention-map visualizations for CLIP and ProText on cross-datasets, where ProText is trained on ImageNet-1k text-only data. This suggests that ProText learns complementary contextual features, which steer CLIP towards better transferability to new datasets without relying on visual samples.
Prompt learning and LLM-based ensembling are effective techniques for improving CLIP's generalization. However, prompt learning often requires labeled images, which is less practical, while LLM-based ensembling methods are predominantly class-specific and not directly transferable to new classes. To address these challenges, we propose a new direction for adapting CLIP by learning generalized prompts with text-only supervision, without relying on visual data. We introduce a training strategy in which prompts learn a mapping function that embeds rich contextual knowledge from LLM text data within the prompts. The context learned by these prompts transfers well to unseen classes and datasets, potentially reducing the LLM prompt-engineering and serving cost. We perform extensive evaluations on four benchmarks where our text-only approach performs favorably against previous methods, including those utilizing labeled images.
For additional details about ProText framework and results comparison in additional benchmarks, please refer to our main paper. Thank you!
@article{Khattak2024ProText,
title={Learning to Prompt with Text Only Supervision for Vision-Language Models},
author={Khattak, Muhammad Uzair and Naeem, Muhammad Ferjad and Naseer, Muzzamal and Van Gool, Luc and Tombari, Federico},
journal={arXiv:2401.02418},
year={2024}
}