About Me

I currently work at ByteDance. Previously, I worked at Meituan from Jun. 2021 to May 2025, and Malong LLC from Apr. 2019 to Apr. 2021.

I completed my DPhil degree in the Visual Geometry Group (VGG), Department of Engineering Science, University of Oxford. My supervisors were Prof. Andrew Zisserman and Dr. Relja Arandjelović. Prior to this, I obtained my BA and MEng degrees at the University of Oxford.

My research directions include multi-modal LLMs, object detection/segmentation, video understanding, visual generative models, object tracking, person/object re-identification, neural architecture search and image retrieval.

Email: [email protected]

Publications

Q. Lin, X. Sun, Y. Gao, Y. Zhong, D. Li, Z. Zhao, H. Wang
TASR: Timestep-Aware Diffusion Model for Image Super-Resolution
ACM MM, 2025. 
arXiv | code

Y. Zeng, Z. Huang, Y. Zhong, C. Feng, J. Hu, L. Ma, Y. Liu
DisTime: Distribution-based Time Representation for Video Large Language Models
ICCV, 2025. 
arXiv | code

C. Wei, Y. Zhong, H. Tan, Y. Zeng, Y. Liu, Z. Zhao and Y. Yang
InstructSeg: Unifying Instructed Visual Segmentation with Multi-modal Large Language Models
ICCV, 2025. 
arXiv | code

W. Xiang, H. Tan, C. Wei, Y. Zhong, D. Li and Y. Yang
Advancing Visual Large Language Model for Multi-granular Versatile Perception
ICCV, 2025. 
arXiv | code

X. Zhang, D. Li, B. Liu, Z. Bao, Y. Zhou, B. Yang, Z. Liu, Y. Zhong, and T. Yuan
HiMix: Reducing Computational Complexity in Large Vision-Language Models
ICCV, 2025. 
arXiv | page | code

Z. Huang, C. Feng, F. Yan, B. Xiao, Z. Jie, Y. Zhong, X. Liang, L. Ma
RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving
ICCV, 2025. 
arXiv | page | code

B. Xiao, C. Feng, Z. Huang, F. Yan, Y. Zhong, L. Ma
RoboTron-Sim: Improving Real-World Driving via Simulated Hard-Case
ICCV, 2025. 
arXiv | page | code

C. Wei, Y. Zhong, H. Tan, Y. Liu, Z. Zhao, J. Hu and Y. Yang
HyperSeg: Towards Universal Visual Segmentation with Large Language Model
CVPR, 2025. 
arXiv | code

C. Zhang, Y. Zhong and K. Han
Mr. DETR: Instructive Multi-Route Training for Detection Transformers
CVPR, 2025. 
arXiv | code

C. Zhang, J. Ni, Y. Zhong and K. Han
vCLR: Learning Appearance-Invariant Representations for Open-World Instance Segmentation
CVPR, 2025. 
arXiv | code

F. Yan, W. Luo, Y. Zhong, Y. Gan, L. Ma
Bridging the Gap Between End-to-end and Non-End-to-end Multi-Object Tracking
ICLR, 2025. 
arXiv | code

L. Gao, Y. Zhong, Y. Zeng, H. Tan, D. Li and Z. Zhao
LinVT: Empower Your Image-level Large Language Model to Understand Videos
Preprint, 2024. 
arXiv | code

Y. Zeng, Y. Zhong, C. Feng, L. Ma
UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection
ECCV, 2024. 
PDF | arXiv | code

Z. Li, Y. Zhong, R. Song, T. Li, L. Ma, W. Zhang
DeTAL: Open-Vocabulary Temporal Action Localization with Decoupled Networks
TPAMI, 2024. 
PDF | code

C. Wei, H. Tan, Y. Zhong, Y. Yang, L. Ma
LaSagnA: Language-based Segmentation Assistant for Complex Queries
Preprint, 2024. 
PDF | arXiv | code

C. Liu, H. Wu, Y. Zhong, X. Zhang, Y. Wang, W. Xie
Intelligent Grimm — Open-ended Visual Storytelling via Latent Diffusion Models
CVPR, 2024. 
PDF | arXiv | page | code

C. Feng, Y. Zhong, Z. Jie, X. Wei, L. Ma
InstaGen: Enhancing Object Detection by Training on Synthetic Dataset
CVPR, 2024.
PDF | arXiv | page | code

C. Han, Y. Zhong, D. Li, K. Han, L. Ma
Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network
ICCV, 2023.
PDF | arXiv | code

D. Li, S. Chen, Y. Zhong, L. Ma
DiP: Learning Discriminative Implicit Parts for Person Re-Identification
Preprint, 2023.
PDF | arXiv | code

X. Zhou, Y. Zhong, Z. Cheng, F. Liang, L. Ma
Adaptive Sparse Pairwise Loss for Object Re-Identification
CVPR, 2023.
PDF | arXiv | code

D. Shi, Y. Zhong, Q. Cao, L. Ma, J. Li, D. Tao
TriDet: Temporal Action Detection with Relative Boundary Modeling
CVPR, 2023. 
PDF | arXiv | code

C. Feng, Z. Jie, Y. Zhong, X. Chu, L. Ma
AeDet: Azimuth-invariant Multi-view 3D Object Detection
CVPR, 2023.
PDF | arXiv | page | code

C. Liu, Y. Zhong, A. Zisserman, W. Xie
CounTR: Transformer-based Generalised Visual Counting
BMVC, 2022. 
PDF | arXiv | code

Z. Wang, Y. Zhong, Y. Miao, L. Ma, L. Specia
Contrastive Video-Language Learning with Fine-grained Frame Sampling
AACL-IJCNLP, 2022.
PDF | arXiv

D. Shi, Y. Zhong, Q. Cao, J. Zhang, L. Ma, J. Li, D. Tao
ReAct: Temporal Action Detection with Relational Queries
ECCV, 2022. 
PDF | arXiv

C. Feng, Y. Zhong, Z. Jie, X. Chu, H. Ren, X. Wei, W. Xie, L. Ma
PromptDet: Towards Open-vocabulary Detection using Uncurated Images
ECCV, 2022.
PDF | arXiv | page | code

S. Guo, Z. Xiong, Y. Zhong, W. Li, X. Guo, B. Han, W. Huang
Cross-Architecture Self-supervised Video Representation Learning
CVPR, 2022. 
PDF | arXiv | Code

X. Chen, Q. Cao, Y. Zhong, J. Zhang, S. Gao, D. Tao
DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers
CVPR, 2022. 
PDF | arXiv

X. Chen, C. Chen, Q. Cao, J. Xu, Y. Zhong, J. Xu, Z. Li, J. Wang, S. Guo
OH-Former: Omni-Relational High-Order Transformer for Person Re-Identification
Preprint, 2022. 
PDF | arXiv

Z. Deng*, Y. Zhong*, S. Guo, W. Huang (* equal contribution)
InsCLR: Improving Instance Retrieval with Self-Supervision
AAAI, 2022.
PDF | arXiv | Code 

C. Feng*, Y. Zhong*, Y. Gao, M. Scott, W. Huang (* equal contribution)
TOOD: Task-aligned One-stage Object Detection
ICCV, 2021. Oral
PDF | arXiv | code 

C. Feng, Y. Zhong, W. Huang
Exploring Classification Equilibrium in Long-Tailed Object Detection
ICCV, 2021. 
PDF | arXiv | code 

G. Liu, Y. Zhong, S. Guo, M. Scott, W. Huang
Unchain the Search Space with Hierarchical Differentiable Architecture Search
AAAI, 2021. 
PDF | arXiv | code 

H. Tan, S. Guo, Y. Zhong, W. Huang
Mutually-aware Sub-Graphs Differentiable Architecture Search
Preprint, 2021.
PDF | arXiv

Y. Zhong, L. Xie, S. Wang, L. Specia, Y. Miao
Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision
NeurIPS, 2020. Self-Supervised Learning Workshop.
PDF | arXiv | Dataset

Y. Zhong, Z. Deng, S. Guo, M. Scott, W. Huang
Representation Sharing for Fast Object Detector Search and Beyond
ECCV, 2020.
PDF | arXiv | Code 

Y. Zhong, R. Arandjelović, A. Zisserman
GhostVLAD for Set-based Face Recognition
ACCV, 2018.
PDF | arXiv

Y. Zhong, R. Arandjelović, A. Zisserman
Compact Deep Aggregation for Set Retrieval
ECCV Workshop on CEFRL, 2018. Oral, Best Paper Award
PDF | Dataset | Extended version (arXiv)

Y. Zhong, R. Arandjelović, A. Zisserman
Faces in Places: Compound Query Retrieval
BMVC, 2016.
PDF | Dataset | Project Page

M. Malaspina, Y. Zhong
Image-matching Technology Applied to Fifteenth-century Printed Book Illustration
Lettera Matematica, Springer, 2017.
PDF

Challenges

CVPR 2023 SoccerNet Challenge: Action Spotting Task
2nd Place

CVPR 2023 SoccerNet Challenge: Multiple Object Tracking
2nd Place

CVPR 2022 SoccerNet Challenge: Action Spotting Task
4th Place

CVPR 2022 SoccerNet Challenge: Re-Identification Task
3rd Place

Demos

Exploring the British Library’s 1 Million Images
Instance and object retrieval across 1 million images from 17th- to 19th-century books.

Matching Ballad Illustrations
Instantly match and compare printed illustrations in the Bodleian Library ballads.