Wonjae (Dan) Kim / 김원재
Lead Research Scientist @ TwelveLabs
I lead the Embedding & Search team at TwelveLabs, where we build multimodal foundation models for video understanding. I’m the first author of ViLT, one of the early works that shaped efficient vision-language architectures. Previously, I was a research scientist at Naver AI LAB and Kakao, and I hold an M.Sc. and B.Sc. from Seoul National University.
My current research focuses on:
- Multimodal Representation Learning (video, audio, text)
- Large-scale Embedding & Search Systems
- User Behavior Modeling for Search
We’re Hiring! I’m building a research team at TwelveLabs where your models ship to thousands of customers within months. We’re tackling joint embedding spaces across modalities and containerized asset search—problems that go beyond simple retrieval to true semantic understanding of video structure. If you want to see your work create real-world impact at scale, grab a coffee chat with me. I’m looking for scientists and engineers who are excited to push video-language AI from idea to production. Join us in Seoul →
news
| Date | News |
|---|---|
| Dec 01, 2025 | TwelveLabs releases Marengo 3.0, a new standard for foundation models that understand the world in all its complexity. |
| Oct 15, 2025 | One ICCV-2025 paper to appear: An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval. |
| Apr 01, 2025 | One CVPR-2025 EVAL-FoMo 2 Workshop paper: Emergence of Text Readability in Vision Language Models. |
| Feb 04, 2025 | I’ve started a new chapter at TwelveLabs! |
| Jan 01, 2025 | One ICLR-2025 paper to appear: Probabilistic Language-Image Pre-Training. |
latest posts
| Date | Title |
|---|---|
| Jun 11, 2025 | The Gentle Singularity |
| Dec 27, 2024 | DeepSeek: A More Extreme Story of Chinese Tech Idealism |
| Jan 02, 2021 | Exploiting Contemporary ML |