Muhammad Uzair Khattak

Lausanne, Switzerland

Hi, I am Muhammad Uzair, a PhD candidate at VILAB, EPFL, supervised by Prof. Amir Zamir and PD Dr. Federico Tombari. Previously, I completed my MSc in Computer Vision at the IVAL lab at Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), where I was supervised by Dr. Salman Khan and Dr. Fahad Khan. At MBZUAI, I was also co-supervised and mentored by Dr. Muzammal Naseer.

My current research interests include developing foundational video models using multimodal abstract spaces, and studying the implications of multimodality for world modeling, long-horizon prediction, and video generation tasks.

My master’s research focused on adapting foundational multi-modal models for vision tasks including image recognition, object detection, and video action recognition. The goal was to steer these foundational models toward downstream tasks with limited data (few-/zero-shot) while maintaining their pre-trained generalization to novel tasks. I also worked on pretraining vision-language models for the medical imaging domain, and on complex video reasoning using Large Multimodal Models (LMMs).

Email / Google Scholar / GitHub / Twitter / CV

News

Dec 18, 2024	We have released UniMed, an open-source and large-scale multi-modal medical dataset comprising 6 diverse medical modalities. Building upon UniMed, we train UniMed-CLIP, a family of strong medical VLMs. Dataset, models and demos are available here.
Dec 10, 2024	Our work on text-only prompt learning for foundational Vision-Language models has been accepted to AAAI’25! Congratulations to all co-authors!
Sep 1, 2024	I have started my PhD studies at EPFL, Switzerland.
May 9, 2024	We have released CVRR-ES: Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs. More details on the project page.
Feb 22, 2024	Invited talk on Multi-modal learning @ Amazon Prime Video.
Feb 5, 2024	Invited talk on our recent ProText work at Cohere For AI. (Slides / Recording)
Jan 5, 2024	We have released ProText, a novel framework to adapt Vision-Language models with text-only data. More details on the project page!
Dec 16, 2023	Invited talk on our recent PromptSRC work at Computer Vision Talks.


Selected publications

* denotes joint first authors

2026

Any-to-Any Video Modeling in Multimodal Abstract Spaces

Muhammad Uzair Khattak, Won Jun Kim, Reza Abbassi, Michael Murphy, Amir Zadeh, Chuan Li, Oguzhan Fatih Kar, Muhammad Ferjad Naeem, Andrei Atanov, Roman Bachmann, Federico Tombari, and Amir Zamir

Under Review, 2026

2024

UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities

Muhammad Uzair Khattak*, Shahina Kunhimon*, Muzammal Naseer, Salman Khan, and Fahad Shahbaz Khan

arXiv preprint arXiv:2412.10372, 2024


@article{khattak2024unimed,
  author = {Khattak*, Muhammad Uzair and Kunhimon*, Shahina and Naseer, Muzammal and Khan, Salman and Khan, Fahad Shahbaz},
  title = {UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities},
  journal = {arXiv preprint arXiv:2412.10372},
  year = {2024},
}

How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Jameel Hassan, Muzammal Naseer, Federico Tombari, Fahad Shahbaz Khan, and Salman Khan

arXiv preprint arXiv:2405.03690, 2024


@article{khattak2024complex,
  author = {Khattak, Muhammad Uzair and Naeem, Muhammad Ferjad and Hassan, Jameel and Naseer, Muzammal and Tombari, Federico and Khan, Fahad Shahbaz and Khan, Salman},
  title = {How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs},
  journal = {arXiv preprint arXiv:2405.03690},
  year = {2024},
}

Learning to Prompt with Text Only Supervision for Vision-Language Models

Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer, Luc Van Gool, and Federico Tombari

arXiv preprint arXiv:2401.02418, 2024


@article{Khattak2024ProText,
  title = {Learning to Prompt with Text Only Supervision for Vision-Language Models},
  author = {Khattak, Muhammad Uzair and Naeem, Muhammad Ferjad and Naseer, Muzammal and Van Gool, Luc and Tombari, Federico},
  journal = {arXiv preprint arXiv:2401.02418},
  year = {2024},
}

2023

MaPLe: Multi-modal Prompt Learning

Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023


@inproceedings{khattak2023maple,
  title = {{MaPLe}: Multi-modal Prompt Learning},
  author = {Khattak, Muhammad Uzair and Rasheed, Hanoona and Maaz, Muhammad and Khan, Salman and Khan, Fahad Shahbaz},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages = {19113--19122},
  year = {2023},
}

Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization

Jameel Hassan, Hanan Gani, Noor Hussein, Muhammad Uzair Khattak+, Muzammal Naseer, Fahad Shahbaz Khan, and Salman Khan

Advances in Neural Information Processing Systems, 2023


@article{hassan2023align,
  title = {Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization},
  author = {Hassan, Jameel and Gani, Hanan and Hussein, Noor and Khattak+, Muhammad Uzair and Naseer, Muzammal and Khan, Fahad Shahbaz and Khan, Salman},
  journal = {Advances in Neural Information Processing Systems},
  year = {2023},
}

Fine-tuned CLIP Models are Efficient Video Learners

Hanoona Rasheed*, Muhammad Uzair Khattak*, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan

In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023


@inproceedings{rasheed2023fine,
  title = {Fine-tuned {CLIP} Models are Efficient Video Learners},
  author = {Rasheed*, Hanoona and Khattak*, Muhammad Uzair and Maaz, Muhammad and Khan, Salman and Khan, Fahad Shahbaz},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages = {6545--6554},
  year = {2023},
}

Self-regulating Prompts: Foundational Model Adaptation without Forgetting

Muhammad Uzair Khattak*, Syed Talal Wasim*, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan

In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2023


@inproceedings{Khattak_2023_ICCV,
  title = {Self-regulating Prompts: Foundational Model Adaptation without Forgetting},
  author = {Khattak*, Muhammad Uzair and Wasim*, Syed Talal and Naseer, Muzammal and Khan, Salman and Yang, Ming-Hsuan and Khan, Fahad Shahbaz},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  pages = {15190--15200},
  year = {2023},
  month = oct,
}

Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition

Syed Talal Wasim*, Muhammad Uzair Khattak*, Muzammal Naseer, Salman Khan, Mubarak Shah, and Fahad Shahbaz Khan

In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2023


@inproceedings{Wasim_2023_ICCV,
  author = {Wasim*, Syed Talal and Khattak*, Muhammad Uzair and Naseer, Muzammal and Khan, Salman and Shah, Mubarak and Khan, Fahad Shahbaz},
  title = {Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month = oct,
  year = {2023},
  pages = {13778--13789},
}

2022

Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection

Hanoona Bangalath*, Muhammad Maaz*, Muhammad Uzair Khattak, Salman H Khan, and Fahad Shahbaz Khan

Advances in Neural Information Processing Systems, Oct 2022


@article{bangalath2022bridging,
  title = {Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection},
  author = {Bangalath*, Hanoona and Maaz*, Muhammad and Khattak, Muhammad Uzair and Khan, Salman H and Khan, Fahad Shahbaz},
  journal = {Advances in Neural Information Processing Systems},
  volume = {35},
  pages = {33781--33794},
  year = {2022},
}

Investigating and Improving Common Loop Closure Failures in Visual SLAM

Saran Khaliq, Muhammad Latif Anjum, Wajahat Hussain, Muhammad Uzair Khattak, and Momen Rasool

Autonomous Robots, Oct 2022
