Rohit Girdhar

Research Scientist

GenAI Research, Meta

I am a Research Scientist in the GenAI Research group at Meta. My current research focuses on understanding and generating multimodal data using minimal human supervision. I obtained an MS and PhD in Robotics from Carnegie Mellon University (here’s a link to my dissertation), where I worked on learning from and understanding videos. I was previously part of the Facebook AI Research (FAIR) group at Meta, and have spent time at DeepMind, Adobe, and Facebook as an intern. See here for a formal bio.

Education
  • PhD in Robotics, 2019

    Carnegie Mellon University, Pittsburgh PA

  • MS in Robotics, 2016

    Carnegie Mellon University, Pittsburgh PA

  • B. Tech. in Computer Science, 2014

    IIIT Hyderabad, India

Experience
  • Meta · Research Scientist

    New York · 2019 -- Present

  • DeepMind · Research Scientist Intern

    London · Summer 2018

  • Facebook · Research Scientist Intern

    Menlo Park · Summer 2017

  • Adobe · Research Scientist Intern

    San Francisco · Summer 2016

  • Facebook · Software Engineering Intern

    Menlo Park · Summer 2013

Highlights

Videos powered by MovieGen and Emu Video!


Projects and Publications

Diffusion Autoencoders are Scalable Image Tokenizers

Simplified image tokenization using diffusion

MotiF: Making Text Count in Image Animation with Motion Focal Loss

Using flow to improve motion in video generation

The Llama 3 Herd of Models

State-of-the-art open-source LLM with multimodal capabilities

InstanceDiffusion: Instance-level Control for Image Generation

SOTA instance-conditioned diffusion model for image generation.

VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation

SOTA unsupervised video segmentation using CutLER.

Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning

A simple and effective approach to high-quality video generation by learning to animate high-quality images.

ImageBind: One Embedding Space To Bind Them All

One embedding space for six different modalities, enabling zero-shot recognition across all of them!

The effectiveness of MAE pre-pretraining for billion-scale pretraining

Scaling up MAE pre-pretraining, followed by weakly supervised pretraining, leads to strong representations.

HierVL: Learning Hierarchical Video-Language Embeddings

A hierarchical video-language embedding that captures both short-term (clip-level) and long-term (video-level) associations, trained with a hierarchical contrastive objective on Ego4D clip narrations and video summaries. It yields SotA long-term video representations and transfers to EPIC-KITCHENS-100, Charades-Ego, and HowTo100M in both zero-shot and fine-tuned settings.

Learning Video Representations from Large Language Models

Leveraging LLMs to auto-annotate videos for representation learning.

OmniMAE: Single Model Masked Pretraining on Images and Videos

Single self-supervised representation for images and videos.

Omnivore: A Single Model for Many Visual Modalities

A single model for images, video and single-view 3D.

Detecting Twenty-thousand Classes using Image-level Supervision

Leverages image classification data to build an object detector

Mask2Former for Video Instance Segmentation

SOTA video segmentation using Mask2Former.

Physical Reasoning Using Dynamics Aware Embeddings

Self-supervised representations for physical reasoning.

Forward Prediction for Physical Reasoning

Forward prediction for the PHYRE benchmark.

Binge Watching: Scaling Affordance Learning from Sitcoms

Learning how humans interact with their environment by watching TV.