Learning to Prompt with Text Only Supervision for Vision-Language Models

1Mohamed bin Zayed University of AI, 2ETH Zurich, 3TU Munich, 4Google

Left: Existing methods improve CLIP's generalization either by learning prompts with image supervision or by using non-transferable prompt ensembling with LLM knowledge. In contrast, our approach, ProText, learns prompts with text-only supervision derived from LLM knowledge, and the learned prompts transfer to new datasets and classes. Right: Without using any images for supervision, ProText with text-only training improves over CLIP, CuPL, and prior 16-shot image-supervised methods in the challenging cross-dataset transfer setting. Prompt-ensembling-based CuPL performs the same as CLIP because its class-specific LLM templates cannot be transferred to other datasets.

Abstract

Foundational vision-language models such as CLIP are becoming a new paradigm in vision due to their excellent generalization abilities. However, adapting these models for downstream tasks while maintaining their generalization remains a challenge. In the literature, one branch of methods adapts CLIP by learning prompts using visual information. While effective, most of these works require labeled data, which is not practical, and they often struggle to generalize to new datasets due to over-fitting on the source data. An alternative approach resorts to training-free methods that generate class descriptions from large language models (LLMs) and perform prompt ensembling. However, these methods often generate class-specific prompts that cannot be transferred to other classes, which incurs higher costs since LLM descriptions must be generated for each class separately. In this work, we propose to combine the strengths of both streams of methods by learning prompts using only text data derived from LLMs. As supervised training of prompts is not trivial in the absence of images, we develop a training approach that allows prompts to extract rich contextual knowledge from LLM data. Moreover, with LLM contextual data mapped within the learned prompts, our method enables zero-shot transfer of prompts to new classes and datasets, potentially cutting the LLM prompt engineering cost. To the best of our knowledge, this is the first work that learns generalized prompts using text-only data. We perform extensive evaluations on 4 benchmarks, where our method improves over prior ensembling works while being competitive with those utilizing labeled images. Our code and pretrained models are publicly available.

A Text-Only Prompt Learning Framework for Vision-Language Models

Main contributions:
  1. Text-Only Prompt Learning Approach: We develop a new approach for prompt learning in vision-language models for visual recognition tasks without relying on visual samples.
  2. Learning Prompts with Contextual Mapping: We introduce a training strategy for prompts to learn a mapping function that embeds rich and generalized contextual knowledge from LLM-based text data within the prompts.
  3. LLM Prompt Transferability: With LLM contextual data mapped within the learned prompts, zero-shot transfer of prompts to new classes and datasets becomes possible, potentially reducing LLM prompt engineering and serving costs.

ProText Framework

Overview of the ProText framework. (Left) First, diverse captions are generated for the training classes using an LLM such as GPT-3. During training, the CLIP text encoder produces the prompted class-name feature \(\tilde{g}_p\) from class-name templates with learnable prompts, and the frozen LLM template feature \(\tilde{g}\) from the LLM-generated templates. Next, we employ a contextual mapping loss that guides the learnable prompts to learn a mapping from the prompted class-name feature to the LLM template feature, which contains richer information about the class. This allows the learned prompts to exploit the internal knowledge of the text encoder, complemented by the LLM descriptions. (Right) At inference, the learned prompts are used with class-name templates, and the standard zero-shot CLIP inference protocol is followed. Moreover, the rich contextual information from LLM descriptions mapped within the learned prompts enables their transfer to new classes and datasets.
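To make the training recipe concrete, below is a minimal PyTorch sketch of a ProText-style text-only training step. It is an illustration under stated assumptions, not the authors' reference implementation: the `prompted_encoder` callable (which prepends the learnable prompt vectors to the class-name tokens before the frozen CLIP text encoder) and the use of an L1 objective for the contextual mapping loss are assumptions introduced for this sketch.

```python
# Illustrative ProText-style text-only training step (not the reference implementation).
# Assumes OpenAI's `clip` package; `prompted_encoder` is a hypothetical callable that
# prepends the learnable prompt vectors to the class-name token embeddings and runs
# them through the frozen CLIP text encoder.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)
model = model.float()   # keep everything in fp32 for simplicity
model.eval()            # CLIP weights stay frozen; only the prompt vectors are trained

# Learnable context vectors (e.g., 4 tokens with the text transformer's width).
n_ctx, ctx_dim = 4, model.ln_final.weight.shape[0]
prompt_ctx = torch.nn.Parameter(0.02 * torch.randn(n_ctx, ctx_dim, device=device))
optimizer = torch.optim.AdamW([prompt_ctx], lr=2e-3)


@torch.no_grad()
def encode_llm_templates(texts):
    """Frozen LLM-template features (e.g., GPT-3 descriptions of each class)."""
    tokens = clip.tokenize(texts, truncate=True).to(device)
    return F.normalize(model.encode_text(tokens), dim=-1)


def training_step(class_names, llm_templates, prompted_encoder):
    """One contextual-mapping step: pull the prompted class-name feature toward the
    frozen LLM template feature of the same class (inputs are index-aligned)."""
    g = encode_llm_templates(llm_templates)                               # frozen targets
    g_p = F.normalize(prompted_encoder(class_names, prompt_ctx), dim=-1)  # prompted features
    loss = F.l1_loss(g_p, g)   # contextual mapping loss (L1 chosen here as an assumption)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```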

ProText results comparison

With the same amount of text data, learning contextual prompts with text-only supervision improves CLIP performance over prompt ensembling techniques.

ProText fares well in comparison with Prompt Ensembling methods.

| Method         | ImageNet Acc. |
|----------------|---------------|
| CLIP           | 66.72         |
| DCLIP          | 68.03         |
| Waffle CLIP    | 68.34         |
| CuPL           | 69.62         |
| ProText (ours) | 70.22         |

Next, we demonstrate the generalization of ProText: the learned prompts transfer well to new classes and datasets.

ProText addresses the transferability limitations of LLM-based prompt ensembling methods.

With the contextual LLM information mapped within the prompts, ProText enables the transfer of learned prompts to new classes and improves over CuPL.

| Method         | Base Acc. | Novel Acc. | Harmonic Mean (HM) |
|----------------|-----------|------------|--------------------|
| CLIP           | 69.34     | 74.22      | 71.70              |
| CuPL           | 72.56     | 74.22      | 73.38              |
| ProText (ours) | 72.95     | 76.98      | 74.91              |
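For reference, the harmonic mean (HM) reported above combines base and novel accuracy as follows; the worked values correspond to the ProText row.

\[
\mathrm{HM} = \frac{2 \cdot \text{Base} \cdot \text{Novel}}{\text{Base} + \text{Novel}} = \frac{2 \times 72.95 \times 76.98}{72.95 + 76.98} \approx 74.91
\]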


ProText performs favorably in the cross-dataset benchmark.

ProText with text-only training improves over CLIP, CuPL, and prior 16-shot image-supervised methods in the challenging cross-dataset transfer setting. Prompt-ensembling-based CuPL performs the same as CLIP because its class-specific LLM templates cannot be transferred to other datasets.

| Method         | Supervision Type | Avg. Accuracy |
|----------------|------------------|---------------|
| CoOp [1]       | Labeled images   | 63.88         |
| CoCoOp [2]     | Labeled images   | 65.74         |
| MaPLe [3]      | Labeled images   | 66.30         |
| PromptSRC [4]  | Labeled images   | 65.81         |
| CLIP / CuPL    | Text prompts     | 65.15         |
| ProText (ours) | Text prompts     | 67.23         |

Models are trained on ImageNet-1k data and evaluated on 10 cross-datasets.
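As a complement to the numbers above, the following sketch (under the same assumptions as the training sketch, i.e., the hypothetical `prompted_encoder` and the learned `prompt_ctx`) shows how prompts trained on ImageNet-1k text could be reused unchanged to build a zero-shot classifier for a new dataset's class names, following the standard CLIP cosine-similarity inference protocol.

```python
# Illustrative cross-dataset inference with the learned prompts (not the reference code).
import torch
import torch.nn.functional as F


@torch.no_grad()
def build_classifier(class_names, prompted_encoder, prompt_ctx):
    """Encode the new dataset's class names with the learned prompts as classifier weights."""
    weights = prompted_encoder(class_names, prompt_ctx)        # [num_classes, dim]
    return F.normalize(weights, dim=-1)


@torch.no_grad()
def classify(images, clip_model, classifier_weights):
    """Standard zero-shot CLIP protocol: cosine similarity between image and text features."""
    image_feats = F.normalize(clip_model.encode_image(images), dim=-1)   # [batch, dim]
    logits = clip_model.logit_scale.exp() * image_feats @ classifier_weights.t()
    return logits.argmax(dim=-1)                                          # predicted class indices
```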


Qualitative Results for ProText

The figure above shows attention map visualizations for CLIP and ProText on cross-datasets, with ProText trained on ImageNet-1k text-only data. The visualizations suggest that ProText learns complementary contextual features, which steer CLIP toward better transferability to new datasets without relying on visual samples.

Conclusion

Prompt learning and LLM-based ensembling are effective techniques for improving CLIP's generalization. However, prompt learning often requires labeled images, which is less practical, while LLM-based ensembling methods are predominantly class-specific and not directly transferable to new classes. To address these challenges, we propose a new direction for adapting CLIP by learning generalized prompts with text-only supervision, without relying on visual data. We introduce a training strategy for prompts to learn a mapping function that embeds rich contextual knowledge from LLM text data within the prompts. The context learned by these prompts transfers well to unseen classes and datasets, potentially reducing LLM prompt engineering and serving costs. We perform extensive evaluations on four benchmarks, where our text-only approach performs favorably against previous methods, including those utilizing labeled images.


For additional details about the ProText framework and result comparisons on additional benchmarks, please refer to our main paper. Thank you!

BibTeX

@article{Khattak2024ProText,
    title={Learning to Prompt with Text Only Supervision for Vision-Language Models},
    author={Khattak, Muhammad Uzair and Ferjad, Muhammad and Naseer, Muzammal and Van Gool, Luc and Tombari, Federico},
    journal={arXiv:2401.02418},
    year={2024}
}