Long Mai
Email | CV | Google Scholar | LinkedIn
I am a Senior Research Scientist at Adobe Research. My work focuses on data-driven approaches to visual content understanding and generation. Before returning to Adobe, I worked as a senior research scientist at ByteDance Research and at SEA AI Lab (SAIL). Prior to that, I worked with Professor Feng Liu in the Computer Graphics & Vision Lab at Portland State University.
REGEN: Learning Compact Video Embedding with (Re-)Generative Decoder
Yitian Zhang, Long Mai, Aniruddha Mahapatra, David Bourgin, Yicong Hong, Jonah Casebeer, Feng Liu, Yun Fu
ICCV, 2025
Project Page |
Paper
We present a novel perspective on learning video embedders for generative modeling: rather than requiring an exact reproduction of an input video, an effective embedder should focus on synthesizing visually plausible reconstructions.
Accordingly, we propose replacing the conventional encoder-decoder video embedder with an encoder-generator framework that employs a diffusion transformer (DiT) to synthesize missing details from a compact latent space.
Within this framework, we develop a dedicated latent conditioning module to condition the DiT decoder on the encoded video latent embedding.
Our approach enables superior encoding-decoding performance compared to state-of-the-art methods, particularly as the compression ratio increases.
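As a rough illustration of what latent conditioning can look like in a DiT-style decoder, the sketch below conditions a transformer block on tokens projected from the compact video latent via cross-attention; the block structure, dimensions, and the choice of cross-attention are illustrative assumptions, not the actual conditioning module from the paper.

```python
import torch
import torch.nn as nn

class LatentConditionedBlock(nn.Module):
    """A DiT-style block conditioned on latent tokens via cross-attention (illustrative)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, latent_tokens):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]
        x = x + self.cross_attn(self.norm2(x), latent_tokens, latent_tokens)[0]  # condition on latent
        return x + self.mlp(self.norm3(x))

x = torch.rand(2, 1024, 256)    # tokens of the video being (re-)generated
z = torch.rand(2, 64, 256)      # tokens projected from the compact video latent
y = LatentConditionedBlock()(x, z)
```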
Progressive Growing of Video Tokenizers for Highly Compressed Latent Spaces
Aniruddha Mahapatra, Long Mai, Yitian Zhang, David Bourgin, Feng Liu
ICCV, 2025
Project Page |
Paper
Extending state-of-the-art video tokenizers to achieve a temporal compression ratio beyond 4× without increasing channel capacity poses significant challenges. In this work, we propose an alternative approach to enhance temporal compression.
We introduce ProMAG, a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models. Our method includes a cross-level feature-mixing module to retain information from the pretrained low-compression model and guide higher-compression blocks to capture the remaining details from the full video sequence.
MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation
Jinbo Xing, Long Mai, Cusuh Ham, Jiahui Huang, Aniruddha Mahapatra, Chi-Wing Fu, Tien-Tsin Wong, Feng Liu
SIGGRAPH, 2025
Project Page |
arXiv |
Paper |
Code
We introduce MotionCanvas, a method that allows users to design cinematic video shots in the context of image-to-video generation. MotionCanvas enables users to intuitively depict scene-space motion intentions, and translates them into spatiotemporal motion-conditioning signals for video diffusion models. By connecting insights from classical computer graphics and contemporary video generation techniques, we demonstrate the ability to achieve 3D-aware motion control in I2V synthesis without requiring costly 3D-related training data.
Diffusion Transformer-to-Mamba Distillation for High-Resolution Image Generation
Yuan Yao, Yicong Hong, Difan Liu, Long Mai, Jiebo Luo, Feng Liu
BMVC, 2025 (Oral Presentation)
Paper
We propose a novel distillation framework that leverages the powerful capabilities of diffusion transformers (DiTs) to guide the training of a lightweight Hybrid-Mamba model for high-resolution image generation.
Our method enables efficient generation of high-resolution images with improved visual quality.
MVDream: Multi-view Diffusion for 3D Generation
Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, Xiao Yang
ICLR, 2024
Project Page |
Paper |
Code
We introduce MVDream, a multi-view diffusion model that is able to generate consistent multi-view images from a given text prompt. Learning from both 2D and 3D data, a multi-view diffusion model can achieve the generalizability of 2D diffusion models and the consistency of 3D renderings. We demonstrate that such a multi-view prior can serve as a generalizable 3D prior that is agnostic to 3D representations.
Motion-Adjustable Neural Implicit Video Representation
Long Mai, Feng Liu
CVPR, 2022 (Oral Presentation)
Project Page |
Paper |
Video
By exploiting the relation between the phase information in sinusoidal functions and their displacements, we incorporate into the conventional image-based INR model a phase-varying positional encoding module, and couple it with a phase-shift generation module that determines the phase-shift values at each frame. The resulting model learns to interpret phase-varying positional embeddings as the corresponding time-varying content. At inference time, manipulating the phase-shift vectors enables temporal and motion editing effects.
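To make the mechanism concrete, below is a minimal PyTorch sketch of a phase-varying sinusoidal encoding paired with a per-frame phase-shift generator; the module names, frequency count, and network sizes are illustrative assumptions and do not reproduce the paper's implementation.

```python
import math
import torch
import torch.nn as nn

class PhaseVaryingEncoding(nn.Module):
    """Sinusoidal positional encoding whose phase is shifted per frame (illustrative)."""
    def __init__(self, num_freqs: int = 8):
        super().__init__()
        self.register_buffer("freqs", (2.0 ** torch.arange(num_freqs).float()) * math.pi)

    def forward(self, coords, phase_shift):
        # coords: (N, 2) pixel coordinates in [0, 1]; phase_shift: (num_freqs,)
        angles = coords.unsqueeze(-1) * self.freqs + phase_shift   # (N, 2, F)
        return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

# Hypothetical frame-conditioned phase generator and image MLP.
phase_net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 8))
encoding = PhaseVaryingEncoding(num_freqs=8)
mlp = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 3))

t = torch.tensor([[0.25]])                     # normalized frame index
coords = torch.rand(1024, 2)                   # sampled pixel coordinates
rgb = mlp(encoding(coords, phase_net(t)[0]))   # time-varying colors driven by phase shifts
```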
Boosting Monocular Depth Estimation Models to High-Resolution via Content-Adaptive Multi-Resolution Merging
S. Mahdi H. Miangoleh*, Sebastian Dille*, Long Mai, Sylvain Paris, Yağız Aksoy
CVPR, 2021
Project Page |
Paper |
Video |
Code
We present a double estimation method that improves the whole-image depth estimation and a patch selection method that adds local details to the final result. We demonstrate that by merging estimations at different resolutions with changing context, we can generate multi-megapixel depth maps with a high level of detail using a pre-trained model.
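A simplified sketch of the double-estimation idea follows: the same pre-trained depth model is run at a low and a high resolution, and the low-resolution structure is combined with the high-resolution detail. The simple frequency-based blend and the dummy model here are hypothetical stand-ins for the learned merging network used in the actual method.

```python
import torch
import torch.nn.functional as F

def double_estimation(depth_model, image, low_res=384, high_res=1024):
    """Run one depth model at two resolutions and merge (simplified illustration)."""
    lo = depth_model(F.interpolate(image, size=low_res, mode="bilinear", align_corners=False))
    hi = depth_model(F.interpolate(image, size=high_res, mode="bilinear", align_corners=False))
    lo_up = F.interpolate(lo, size=hi.shape[-2:], mode="bilinear", align_corners=False)
    # high-frequency detail of the high-res estimate = hi minus its own low-pass version
    hi_lowpass = F.interpolate(
        F.interpolate(hi, size=lo.shape[-2:], mode="bilinear", align_corners=False),
        size=hi.shape[-2:], mode="bilinear", align_corners=False)
    return lo_up + (hi - hi_lowpass)   # low-res structure + high-res detail

# Hypothetical usage with a dummy stand-in for a pre-trained monocular depth network.
dummy_depth_net = lambda x: x.mean(dim=1, keepdim=True)
depth = double_estimation(dummy_depth_net, torch.rand(1, 3, 768, 768))
```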
Learning to Recover 3D Scene Shape from a Single Image
Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Long Mai, Simon Chen, Chunhua Shen
CVPR, 2021 (Oral Presentation, Best Paper Finalist)
Paper |
Video |
Code
Recent state-of-the-art monocular depth estimation methods cannot recover accurate 3D scene structure due to an unknown depth shift and unknown camera focal length. To address this problem, we propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image, and then uses 3D point cloud encoders to predict the missing depth shift and focal length, allowing us to recover a realistic 3D scene shape.
A Cost-Effective Method for Improving and Re-purposing Large, Pre-trained GANs by Fine-tuning Their Class-Embeddings
Qi Li, Long Mai, Michael A. Alcorn, Anh Nguyen
ACCV, 2020 (Oral Presentation, Best Application Paper Honorable Mention)
Project Page |
Paper |
Code
In this paper, we propose a simple approach to mitigating mode collapse, i.e., improving the sample diversity of a pre-trained class-conditional GAN by modifying only its class embeddings. Our method improves the sample diversity of state-of-the-art ImageNet BigGANs. By replacing only the embeddings, we can also synthesize plausible images for Places365 using a BigGAN generator pre-trained on ImageNet, revealing the surprising expressivity of the BigGAN class embedding space.
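The recipe of changing only the class embeddings can be sketched in a few lines of PyTorch: freeze every generator parameter except the embedding table and optimize only that. The `class_embedding` attribute name is a hypothetical placeholder for the corresponding parameter in a BigGAN-style generator; this is not the training code released with the paper.

```python
import torch.nn as nn

def embedding_only_parameters(generator: nn.Module, embedding_name: str = "class_embedding"):
    """Freeze all generator weights except the class-embedding table.

    Returns the trainable parameters to hand to an optimizer,
    e.g. torch.optim.Adam(params, lr=1e-4).
    """
    params = []
    for name, param in generator.named_parameters():
        param.requires_grad = embedding_name in name   # train embeddings only
        if param.requires_grad:
            params.append(param)
    return params
```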
BlockGAN: Learning 3D Object-aware Scene Representations from Unlabelled Images
Thu Nguyen-Phuoc, Christian Richardt, Long Mai, Yong-Liang Yang, Niloy Mitra
NeurIPS, 2020
Project Page |
Paper |
Code
We present BlockGAN, an image generative model that learns object-aware 3D scene representations directly from unlabelled 2D images. Inspired by the computer graphics pipeline, we design BlockGAN to first generate 3D features of background and foreground objects, then combine them into 3D features for the whole scene, and finally render them into realistic images. Using explicit 3D features to represent objects allows BlockGAN to learn disentangled representations both in terms of objects and their properties.
Context-Aware Group Captioning via Self-Attention and Contrastive Features
Zhuowan Li, Quan Tran, Long Mai, Zhe Lin, Alan Yuille
CVPR, 2020
Project Page |
Paper
We introduce a new task, context-aware group captioning, which aims to describe a group of target images in the context of another group of related reference images. We propose a framework combining self-attention mechanism with contrastive feature construction to effectively summarize common information from each image group while capturing discriminative information between them.
Structure-Guided Ranking Loss for Single Image Depth Prediction
Ke Xian, Jianming Zhang, Oliver Wang, Long Mai, Zhe Lin, Zhiguo Cao
CVPR, 2020
Project Page |
Paper |
Video |
Code
We introduce a novel pair-wise ranking loss based on an adaptive sampling strategy for monocular depth estimation. The key idea is to guide the sampling to better characterize the structure of important regions based on low-level edge maps and high-level object instance masks. We show that the pair-wise ranking loss, combined with our structure-guided sampling strategies, significantly improves the quality of depth map prediction.
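For illustration, a generic pair-wise ranking loss of the kind described above can be written as follows; the exact formulation and the structure-guided sampling of point pairs in the paper differ, so treat this as a sketch only.

```python
import torch

def pairwise_ranking_loss(pred_a, pred_b, label):
    """Generic pair-wise ranking loss for relative depth (illustrative).

    pred_a, pred_b: predicted depth at two sampled pixels, shape (P,)
    label: +1 / -1 for ordered pairs, 0 for pairs of roughly equal depth
    """
    diff = pred_a - pred_b
    ordered = label != 0
    loss_ordered = torch.log1p(torch.exp(-label[ordered] * diff[ordered]))  # ranking term
    loss_equal = diff[~ordered] ** 2                                        # pull equal-depth pairs together
    return torch.cat([loss_ordered, loss_equal]).mean()

# Hypothetical usage; in the paper, pairs are sampled around edges and instance masks.
pred_a, pred_b = torch.rand(128), torch.rand(128)
label = torch.randint(-1, 2, (128,)).float()
loss = pairwise_ranking_loss(pred_a, pred_b, label)
```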
3D Ken Burns Effect from a Single Image
Simon Niklaus, Long Mai, Jimei Yang, Feng Liu
SIGGRAPH Asia, 2019
Project Page |
Paper |
Code |
Video |
Presentation at Adobe MAX 2018
In this paper, we introduce a learning-based view synthesis framework to generate the 3D Ken Burns effect from a single image. Our method supports both a fully automatic mode and an interactive mode with the user controlling the camera.
An Internal Learning Approach to Video Inpainting
Haotian Zhang, Long Mai, Ning Xu, Zhaowen Wang, John Collomosse, Hailin Jin
ICCV, 2019
Project Page |
Paper |
Video
We explore internal learning for video inpainting. Different from conventional learning-based approaches, we take a generative approach to inpainting based on internal (within-video) learning without reliance upon an external corpus of training data. We show that leveraging appearance statistics specific to each video achieves visually plausible results whilst handling the challenging problem of long-term consistency.
MultiSeg: Semantically Meaningful, Scale-Diverse Segmentations from Minimal User Input
Jun Hao Liew, Scott Cohen, Brian Price, Long Mai, Sim-Heng Ong, Jiashi Feng
ICCV, 2019
Paper
We present MultiSeg, a scale-diverse interactive image segmentation network that incorporates a set of two-dimensional scale priors into the model to generate scale-varying proposals that conform to the user input. Our method allows the user to quickly locate the closest segmentation target for further refinement if necessary.
Strike (With) a Pose: Neural Networks Are Easily Fooled by Strange Poses of Familiar Objects
Michael A. Alcorn, Qi Li, Zhitao Gong, Chengfei Wang, Long Mai, Wei-Shinn Ku, Anh Nguyen
CVPR, 2019
Project Page |
Paper |
Code
We present a framework for discovering DNN failures that harnesses 3D renderers and 3D models. We estimate the parameters of a 3D renderer that cause a target DNN to misbehave in response to the rendered image. Using our framework and a self-assembled dataset of 3D objects, we investigate the vulnerability of DNNs to out-of-distribution (OoD) poses of well-known ImageNet objects. Importantly, we demonstrate that adversarial poses transfer consistently across different models as well as different datasets.
Interactive Boundary Prediction for Object Selection
Hoang Le, Long Mai, Brian Price, Scott Cohen, Hailin Jin, Feng Liu
ECCV, 2018
Paper
In this paper, we introduce an interaction-aware method for boundary-based image segmentation. Instead of relying on pre-defined low-level image features, our method adaptively predicts object boundaries according to image content and user interactions.
Video Frame Interpolation via Adaptive Separable Convolution
Simon Niklaus, Long Mai, Feng Liu
ICCV, 2017
Project Page |
Paper |
Video |
Code
We formulate frame interpolation as local separable convolution over the input frames using pairs of 1D kernels. Compared to regular 2D kernels, the 1D kernels require significantly fewer parameters to be estimated. We develop a deep fully convolutional neural network that takes two input frames and estimates pairs of 1D kernels for all pixels simultaneously. This network is trained end-to-end using widely available video data without any human annotation.
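The separable formulation is easy to sketch: the effective 2D kernel at each pixel is the outer product of a vertical and a horizontal 1D kernel, applied to a local patch of each input frame. The tensor shapes and the softmax normalization below are illustrative assumptions, not the network's actual outputs.

```python
import torch
import torch.nn.functional as F

def sepconv_synthesis(frame1, frame2, kv1, kh1, kv2, kh2):
    """Pixel synthesis by local separable convolution (illustrative shapes).

    frame*: (B, C, H, W) input frames
    kv*, kh*: (B, n, H, W) per-pixel vertical / horizontal 1D kernels
    """
    b, c, h, w = frame1.shape
    n = kv1.shape[1]
    out = 0
    for frame, kv, kh in ((frame1, kv1, kh1), (frame2, kv2, kh2)):
        patches = F.unfold(frame, n, padding=n // 2)             # (B, C*n*n, H*W)
        patches = patches.view(b, c, n, n, h, w)
        k2d = kv.unsqueeze(2) * kh.unsqueeze(1)                  # outer product -> (B, n, n, H, W)
        out = out + (patches * k2d.unsqueeze(1)).sum(dim=(2, 3)) # (B, C, H, W)
    return out

f1, f2 = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
kv = torch.softmax(torch.rand(1, 5, 64, 64), dim=1)              # toy 5-tap kernels
kh = torch.softmax(torch.rand(1, 5, 64, 64), dim=1)
interpolated = sepconv_synthesis(f1, f2, kv, kh, kv, kh)         # (1, 3, 64, 64)
```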
Spatial-Semantic Image Search by Visual Feature Synthesis
Long Mai, Hailin Jin, Zhe Lin, Chen Fang, Jonathan Brandt, Feng Liu
CVPR, 2017 (Spotlight Presentation)
Paper |
Presentation at Adobe MAX 2016
We develop a spatial-semantic image search technology that enables users to search for images with both semantic and spatial constraints by manipulating concept text-boxes on a 2D query canvas. We train a convolutional neural network to synthesize appropriate visual features that capture the spatial-semantic constraints from the user's canvas query.
Video Frame Interpolation via Adaptive Convolution
Simon Niklaus*, Long Mai*, Feng Liu
CVPR, 2017 (Spotlight Presentation)
Project Page |
Paper |
Video
Video frame interpolation typically involves two steps: motion estimation and pixel synthesis. Such a two-step approach heavily depends on the quality of motion estimation. We present a robust video frame interpolation method that combines these two steps into a single process. Our method considers pixel synthesis for the interpolated frame as local convolution over two input frames, where the convolution kernel is predicted adaptively using a deep neural network.
Content and Surface Aware Projection
Long Mai, Hoang Le, Feng Liu
GI, 2017
Project Page |
Paper |
Video
Image projection is important for many applications. However, projection onto non-trivial surfaces often introduces perceived distortion, a common problem with projector systems, and compensating for it is challenging. In this paper, we propose a novel method to pre-warp the image such that it appears as distortion-free as possible on the surface after projection.
Composition-preserving Deep Photo Aesthetics Assessment
Long Mai, Hailin Jin, Feng Liu
CVPR, 2016
Project Page |
Paper |
Code
In this paper, we present a composition-preserving deep ConvNet method that learns aesthetics features directly from the original input images without any image transformations. Specifically, our method adds an adaptive spatial pooling layer on top of the regular convolution and pooling layers to directly handle input images with their original sizes and aspect ratios.
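The adaptive-pooling idea can be illustrated with a toy network: an adaptive pooling layer maps convolutional features from inputs of arbitrary size and aspect ratio to a fixed grid, so no cropping or scaling transform is required. The layer sizes below are arbitrary and not the paper's architecture.

```python
import torch
import torch.nn as nn

# Toy network with an adaptive spatial pooling layer: features from inputs of any
# size / aspect ratio are pooled to a fixed 7x7 grid before the aesthetics head.
net = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((7, 7)),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 1),            # scalar aesthetics score
)

for size in [(240, 320), (512, 512), (375, 500)]:    # varying sizes and aspect ratios
    print(size, net(torch.rand(1, 3, *size)).shape)  # always (1, 1)
```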
Kernel Fusion for Better Image Deblurring
Long Mai, Feng Liu
CVPR, 2015
Project Page |
Paper
Kernel estimation for image deblurring is a challenging task. While individual kernels estimated using different methods alone are sometimes inadequate, they often complement each other. This paper addresses the problem of fusing multiple kernels estimated using different methods into a more accurate one to better support image deblurring than each individual kernel.
Comparing Salient Object Detection Results without Ground Truth
Long Mai, Feng Liu
ECCV, 2014
Paper
A wide variety of methods have been developed to approach the problem of salient object detection. The performance of these methods is often image-dependent. This paper develops a method that selects, for an input image, the best salient object detection result from among the many results produced by different methods.
Saliency Aggregation: A Data-driven Approach
Long Mai, Yuzhen Niu, Feng Liu
CVPR, 2013
Project Page |
Paper
A variety of methods have been developed for visual saliency analysis, and these methods often complement each other. This paper proposes data-driven approaches to aggregating the results of various saliency analysis methods such that the aggregated result outperforms each individual one.
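As a toy illustration of data-driven aggregation, a small pixel-wise model can learn to fuse saliency maps from several methods using ground-truth masks as supervision; the 1x1-convolution aggregator and random data below are generic stand-ins, not the aggregation model proposed in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

M = 3                                                # number of saliency methods
aggregator = nn.Sequential(nn.Conv2d(M, 1, 1), nn.Sigmoid())

maps = torch.rand(8, M, 64, 64)                      # saliency maps from M methods
gt = (torch.rand(8, 1, 64, 64) > 0.5).float()        # toy ground-truth masks
opt = torch.optim.Adam(aggregator.parameters(), lr=1e-2)
for _ in range(100):                                 # learn pixel-wise fusion weights
    opt.zero_grad()
    F.binary_cross_entropy(aggregator(maps), gt).backward()
    opt.step()
fused = aggregator(maps)                             # aggregated saliency map
```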
Detecting Rule of Simplicity from Photos
Long Mai, Hoang Le, Yuzhen Niu, Yu-chi Lai, Feng Liu
ACM Multimedia, 2012
Project Page |
Paper
Simplicity is one of the most important photography composition rules. Understanding whether a photo respects composition rules facilitates photo quality assessment. In this paper, we present a method to automatically detect whether a photo is composed according to the rule of simplicity. We design features according to the definition, implementation, and effect of the rule.
Rule of Thirds Detection from Photograph
Long Mai, Hoang Le, Yuzhen Niu, Feng Liu
IEEE ISM, 2011
Project Page |
Paper
The rule of thirds is one of the most important composition rules used by photographers to create high-quality photos. The rule of thirds states that placing important objects along the imaginary thirds lines or around their intersections often produces highly aesthetic photos. In this paper, we present a method to automatically determine whether a photo respects the rule of thirds.