Wonjae (Dan) Kim / 김원재
Lead Research Scientist @ TwelveLabs
I lead the Embedding & Search team at TwelveLabs, where we build multimodal foundation models for video understanding. I’m the first author of ViLT, one of the early works that shaped efficient vision-language architectures. Previously, I was a research scientist at Naver AI LAB and Kakao, and I hold an M.Sc. and B.Sc. from Seoul National University.
My current research focuses on:
- Multimodal Representation Learning (video, audio, text)
- Large-scale Embedding & Search Systems
- User Behavior Modeling for Search
We’re Hiring! I’m building a research team at TwelveLabs where your models ship to thousands of customers within months. We’re tackling joint embedding spaces across modalities and containerized asset search—problems that go beyond simple retrieval to true semantic understanding of video structure. If you want to see your work create real-world impact at scale, grab a coffee chat with me. I’m looking for scientists and engineers who are excited to push video-language AI from idea to production. Join us in Seoul →
news
| Date | News |
|---|---|
| Dec 01, 2025 | TwelveLabs releases Marengo 3.0, a new standard for foundation models that understand the world in all its complexity. |
| Oct 15, 2025 | One ICCV-2025 paper to appear: An Efficient Post-hoc Framework for Reducing Task Discrepancy of Text Encoders for Composed Image Retrieval. |
| Apr 01, 2025 | One CVPR-2025 EVAL-FoMo 2 Workshop paper: Emergence of Text Readability in Vision Language Models. |
| Feb 04, 2025 | I’ve started a new chapter at TwelveLabs! |
| Jan 01, 2025 | One ICLR-2025 paper to appear: Probabilistic Language-Image Pre-Training. |
latest posts
| Date | Title |
|---|---|
| Jun 11, 2025 | The Gentle Singularity |
| Dec 27, 2024 | DeepSeek: A More Extreme Story of Chinese Tech Idealism |
| Jan 02, 2021 | Exploiting Contemporary ML |