Yujie Zhong

I currently work at ByteDance. Previously, I worked at Meituan from Jun. 2021 to May 2025, and at Malong LLC from Apr. 2019 to Apr. 2021.

I completed my DPhil degree in the Visual Geometry Group (VGG), Department of Engineering Science, University of Oxford. My supervisors were Prof. Andrew Zisserman and Dr. Relja Arandjelović. Prior to this, I obtained my BA and MEng degrees at the University of Oxford.

My research directions include multi-modal LLMs, object detection/segmentation, video understanding, visual generative models, object tracking, person/object re-identification, neural architecture search and image retrieval.

Email: [email protected]

Publications

Q. Lin, X. Sun, Y. Gao, Y. Zhong, D. Li, Z. Zhao, H. Wang
TASR: Timestep-Aware Diffusion Model for Image Super-Resolution
ACM MM, 2025.
arXiv | code

Y. Zeng, Z. Huang, Y. Zhong, C. Feng, J. Hu, L. Ma, Y. Liu
DisTime: Distribution-based Time Representation for Video Large Language Models
ICCV, 2025.
arXiv | code

C. Wei, Y. Zhong, H. Tan, Y. Zeng, Y. Liu, Z. Zhao and Y. Yang
InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models
ICCV, 2025.
arXiv | code

W. Xiang, H. Tan, C. Wei, Y. Zhong, D. Li and Y. Yang
Advancing Visual Large Language Model for Multi-granular Versatile Perception
ICCV, 2025.
arXiv | code

X. Zhang, D. Li, B. Liu, Z. Bao, Y. Zhou, B. Yang, Z. Liu, Y. Zhong, and T. Yuan
HiMix: Reducing Computational Complexity in Large Vision-Language Models
ICCV, 2025.
arXiv | page | code

Z. Huang, C. Feng, F. Yan, B. Xiao, Z. Jie, Y. Zhong, X. Liang, L. Ma
RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving
ICCV, 2025.
arXiv | page | code

B. Xiao, C. Feng, Z. Huang, F. Yan, Y. Zhong, L. Ma
RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case
ICCV, 2025.
arXiv | page | code

C. Wei, Y. Zhong, H. Tan, Y. Liu, Z. Zhao, J. Hu and Y. Yang
HyperSeg: Towards Universal Visual Segmentation with Large Language Model
CVPR, 2025.
arXiv | code

C. Zhang, Y. Zhong and K. Han
Mr. DETR: Instructive Multi-Route Training for Detection Transformers
CVPR, 2025.
arXiv | code

C. Zhang, J. Ni, Y. Zhong and K. Han
vCLR: Learning Appearance-Invariant Representations for Open-World Instance Segmentation
CVPR, 2025.
arXiv | code

F. Yan, W. Luo, Y. Zhong, Y. Gan, L. Ma
Bridging the Gap Between End-to-end and Non-End-to-end Multi-Object Tracking
ICLR, 2025.
arXiv | code

L. Gao, Y. Zhong, Y. Zeng, H. Tan, D. Li and Z. Zhao
LinVT: Empower Your Image-level Large Language Model to Understand Videos
Preprint, 2024.
arXiv | code

Y. Zeng, Y. Zhong, C. Feng, L. Ma
UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection
ECCV, 2024.
PDF | arXiv | code

Z. Li, Y. Zhong, R. Song, T. Li, L. Ma, W. Zhang
DeTAL: Open-Vocabulary Temporal Action Localization with Decoupled Networks
TPAMI, 2024.
PDF | code

C. Wei, H. Tan, Y. Zhong, Y. Yang, L. Ma
LaSagnA: Language-based Segmentation Assistant for Complex Queries
Preprint, 2024.
PDF | arXiv | code

C. Liu, H. Wu, Y. Zhong, X. Zhang, Y. Wang, W. Xie
Intelligent Grimm — Open-ended Visual Storytelling via Latent Diffusion Models
CVPR, 2024.
PDF | arXiv | page | code

C. Feng, Y. Zhong, Z. Jie, X. Wei, L. Ma
InstaGen: Enhancing Object Detection by Training on Synthetic Dataset
CVPR, 2024.
PDF | arXiv | page | code

C. Han, Y. Zhong, D. Li, K. Han, L. Ma
Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network
ICCV, 2023.
PDF | arXiv | code

D. Li, S. Chen, Y. Zhong, L. Ma
DiP: Learning Discriminative Implicit Parts for Person Re-Identification
Preprint, 2023.
PDF | arXiv | code

X. Zhou, Y. Zhong, Z. Cheng, F. Liang, L. Ma
Adaptive Sparse Pairwise Loss for Object Re-Identification
CVPR, 2023.
PDF | arXiv | code

D. Shi, Y. Zhong, Q. Cao, L. Ma, J. Li, D. Tao
TriDet: Temporal Action Detection with Relative Boundary Modeling
CVPR, 2023.
PDF | arXiv | code

C. Feng, Z. Jie, Y. Zhong, X. Chu, L. Ma
AeDet: Azimuth-invariant Multi-view 3D Object Detection
CVPR, 2023.
PDF | arXiv | page | code

C. Liu, Y. Zhong, A. Zisserman, W. Xie
CounTR: Transformer-based Generalised Visual Counting
BMVC, 2022.
PDF | arXiv | code

Z. Wang, Y. Zhong, Y. Miao, L. Ma, L. Specia
Contrastive Video-Language Learning with Fine-grained Frame Sampling
AACL-IJCNLP, 2022.
PDF | arXiv

D. Shi, Y. Zhong, Q. Cao, J. Zhang, L. Ma, J. Li, D. Tao
ReAct: Temporal Action Detection with Relational Queries
ECCV, 2022.
PDF | arXiv

C. Feng, Y. Zhong, Z. Jie, X. Chu, H. Ren, X. Wei, W. Xie, L. Ma
PromptDet: Towards Open-vocabulary Detection using Uncurated Images
ECCV, 2022.
PDF | arXiv | page | code

S. Guo, Z. Xiong, Y. Zhong, W. Li, X. Guo, B. Han, W. Huang
Cross-Architecture Self-supervised Video Representation Learning
CVPR, 2022.
PDF | arXiv | Code

X. Chen, Q. Cao, Y. Zhong, J. Zhang, S. Gao, D. Tao
DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers
CVPR, 2022.
PDF | arXiv

X. Chen, C. Chen, Q. Cao, J. Xu, Y. Zhong, J. Xu, Z. Li, J. Wang, S. Guo
OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification
Preprint, 2022.
PDF | arXiv

Z. Deng*, Y. Zhong*, S. Guo, W. Huang (* equal contribution)
InsCLR: Improving Instance Retrieval with Self-Supervision
AAAI, 2022.
PDF | arXiv | Code

C. Feng*, Y. Zhong*, Y. Gao, M. Scott, W. Huang (* equal contribution)
TOOD: Task-aligned One-stage Object Detection
ICCV, 2021. Oral
PDF | arXiv | code

C. Feng, Y. Zhong, W. Huang
Exploring Classification Equilibrium in Long-Tailed Object Detection
ICCV, 2021.
PDF | arXiv | code

G. Liu, Y. Zhong, S. Guo, M. Scott, W. Huang
Unchain the Search Space with Hierarchical Differentiable Architecture Search
AAAI, 2021.
PDF | arXiv | code

H. Tan, S. Guo, Y. Zhong, W. Huang
Mutually-aware Sub-Graphs Differentiable Architecture Search
Preprint, 2021.
PDF | arXiv

Y. Zhong, L. Xie, S. Wang, L. Specia, Y. Miao
Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision
NeurIPS, 2020. Self-Supervised Learning Workshop.
PDF | arXiv | Dataset

Y. Zhong, Z. Deng, S. Guo, M. Scott, W. Huang
Representation Sharing for Fast Object Detector Search and Beyond
ECCV, 2020.
PDF | arXiv | Code

Y. Zhong, R. Arandjelović, A. Zisserman
GhostVLAD for Set-based Face Recognition
ACCV, 2018.
PDF | arXiv

Y. Zhong, R. Arandjelović, A. Zisserman
Compact Deep Aggregation for Set Retrieval
ECCV Workshop on CEFRL, 2018. Oral, Best Paper Award
PDF | Dataset | Extended version (arXiv)

Y. Zhong, R. Arandjelović, A. Zisserman
Faces in Places: Compound Query Retrieval
BMVC, 2016.
PDF | Dataset | Project Page

M. Malaspina, Y. Zhong
Image-matching Technology Applied to Fifteenth-century Printed Book Illustration
Lettera Matematica, Springer, 2017.
PDF