Muhammad Uzair Khattak

PhD Candidate, EPFL, Switzerland - MSc from MBZUAI, Abu Dhabi - BSc from SEECS, NUST, Pakistan.

Lausanne, Switzerland

Hi, I am Muhammad Uzair, a PhD candidate at VILAB at EPFL supervised by Prof. Amir Zamir and PD. Dr. Federico Tombari. Previously, I completed my MSc in Computer Vision at the IVAL lab at Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), where I was supervised by Dr. Salman Khan and Dr. Fahad Khan. At MBZUAI, I was also co-supervised and mentored by Dr. Muzammal Naseer.

My current research interests include developing foundational video models using multimodal abstract spaces, and studying the implications of multimodality for world modeling, long-horizon prediction, and video generation tasks.

My master’s research focus was on adapting foundational multi-modal models for vision tasks including image recognition, object detection, and video action recognition. The goal was to steer these foundational models for downstream tasks with limited data (few-/zero-shot) while maintaining their pre-trained generalization for novel tasks. I also worked on pretraining vision-language models for the medical imaging domain, and on complex video reasoning using Large Multimodal Models (LMMs).

Email / Google Scholar / GitHub / Twitter / CV

News

Dec 18, 2024 We have released UniMed, an open-source and large-scale multi-modal medical dataset comprising 6 diverse medical modalities. Building upon UniMed, we train UniMed-CLIP, a family of strong medical VLMs. Dataset, models and demos are available here.
Dec 10, 2024 Our work on text-only prompt learning for foundational Vision-Language models was accepted to AAAI’25! Congratulations to all co-authors!
Sep 1, 2024 I have started my PhD studies at EPFL, Switzerland.
May 9, 2024 We have released CVRR-ES: Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs. More details on the project page.
Feb 22, 2024 Invited talk on Multi-modal learning @ Amazon Prime Video.
Feb 5, 2024 Invited talk on our recent ProText work at Cohere For AI. (Slides / Recording)
Jan 5, 2024 We have released ProText, a novel framework to adapt Vision-Language models with text-only data. More details on the project page!
Dec 16, 2023 Invited talk on our recent PromptSRC work at Computer Vision Talks.

Selected publications

* denotes joint first authors

2026

  1. Any-to-Any Video Modeling in Multimodal Abstract Spaces
    Muhammad Uzair Khattak, Won Jun Kim, Reza Abbassi, Michael Murphy, Amir Zadeh, Chuan Li, Oguzhan Fatih Kar, Muhammad Ferjad Naeem, Andrei Atanov, Roman Bachmann, Federico Tombari, and Amir Zamir
    Under Review, 2026

2024

  1. UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities
    Muhammad Uzair Khattak*, Shahina Kunhimon*, Muzammal Naseer, Salman Khan, and Fahad Shahbaz Khan
    arXiv preprint arXiv:2412.10372, 2024
  2. How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs
    Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Jameel Hassan, Muzammal Naseer, Federico Tombari, Fahad Shahbaz Khan, and Salman Khan
    arXiv preprint arXiv:2405.03690, 2024
  3. Learning to Prompt with Text Only Supervision for Vision-Language Models
    Muhammad Uzair Khattak, Muhammad Ferjad Naeem, Muzammal Naseer, Luc Van Gool, and Federico Tombari
    arXiv preprint arXiv:2401.02418, 2024

2023

  1. MaPLe: Multi-modal Prompt Learning
    Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan
    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
  2. Align Your Prompts: Test-Time Prompting with Distribution Alignment for Zero-Shot Generalization
    Jameel Hassan, Hanan Gani, Noor Hussein, Muhammad Uzair Khattak+, Muzammal Naseer, Fahad Shahbaz Khan, and Salman Khan
    Advances in Neural Information Processing Systems, 2023
  3. Fine-tuned CLIP Models are Efficient Video Learners
    Hanoona Rasheed*, Muhammad Uzair Khattak*, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan
    In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
  4. Self-regulating Prompts: Foundational Model Adaptation without Forgetting
    Muhammad Uzair Khattak*, Syed Talal Wasim*, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan
    In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2023
  5. Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition
    Syed Talal Wasim*, Muhammad Uzair Khattak*, Muzammal Naseer, Salman Khan, Mubarak Shah, and Fahad Shahbaz Khan
    In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Oct 2023

2022

  1. Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection
    Hanoona Bangalath*, Muhammad Maaz*, Muhammad Uzair Khattak, Salman H Khan, and Fahad Shahbaz Khan
    Advances in Neural Information Processing Systems, Oct 2022
  2. Investigating and Improving Common Loop Closure Failures in Visual SLAM
    Saran Khaliq, Muhammad Latif Anjum, Wajahat Hussain, Muhammad Uzair Khattak, and Momen Rasool
    Autonomous Robots, Oct 2022