😊 About Me

I’m a PhD student at The Chinese University of Hong Kong, supervised by Professor JIA, Jiaya and Professor YU, Bei. Before that, I obtained my master’s degree at the AIM3 Lab, Renmin University of China, under the supervision of Professor JIN, Qin. I received my Bachelor’s degree in 2021 from South China University of Technology.

My research interests include Computer Vision and Multi-modal Large Language Models. Here is my Google Scholar page.

πŸ“ Main Contributions

arXiv preprint

ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models

Yuqi Liu, Liangyu Chen, Jiazhen Liu, Mingkang Zhu, Zhisheng Zhong, Bei Yu, Jiaya Jia

  • ViSurf (Visual Supervised-and-Reinforcement Fine-Tuning) is a unified post-training paradigm that integrates the strengths of both SFT and RLVR within a single stage.
arXiv preprint

VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning

Yuqi Liu*, Tianyuan Qu*, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, Jiaya Jia

Project Page | Code

  • VisionReasoner is a unified framework for visual perception tasks.
  • Through carefully crafted rewards and training strategies, VisionReasoner demonstrates strong multi-task capability, addressing diverse visual perception tasks within a shared model.
arXiv preprint

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Yuqi Liu*, Bohao Peng*, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, Jiaya Jia

Project Page | Code

  • Seg-Zero exhibits emergent test-time reasoning ability. It generates a reasoning chain before producing the final segmentation mask.
  • Seg-Zero is trained exclusively using reinforcement learning, without any explicit supervised reasoning data.
  • Compared to supervised fine-tuning, our Seg-Zero achieves superior performance on both in-domain and out-of-domain data.
ICCV 2025

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

Zhisheng Zhong*, Chengyao Wang*, Yuqi Liu*, Senqiao Yang, Longxiang Tang, Yuechen Zhang, Jingyao Li, Tianyuan Qu, Yanwei Li, Yukang Chen, Shaozuo Yu, Sitong Wu, Eric Lo, Shu Liu, Jiaya Jia

Project Page | Code

  • Stronger performance: Achieves SOTA results across a variety of speech-centric tasks.
  • More versatile: Supports image, video, speech/long-speech, and sound understanding, as well as speech generation.
  • More efficient: Requires less training data and supports faster training and inference.
ACM MM 2024

Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval

Yang Du*, Yuqi Liu*, Qin Jin

  • A benchmark that evaluates the temporal understanding of video retrieval models.
AAAI 2023

Token Mixing: Parameter-Efficient Transfer Learning from Image-Language to Video-Language

Yuqi Liu, Luhui Xu, Pengfei Xiong, Qin Jin

Project Page

  • We study how to transfer knowledge from image-language models to video-language tasks.
  • We also implement several components proposed in recent works.
ECCV 2022

TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval

Yuqi Liu, Pengfei Xiong, Luhui Xu, Shengming Cao, Qin Jin

Project Page

  • TS2-Net is a text-video retrieval model based on CLIP.
  • We propose a token shift transformer and a token selection transformer.

📖 Education

  • 2024.08 - 2028.06 (expected), Ph.D., Department of Computer Science and Engineering, The Chinese University of Hong Kong.
  • 2021.09 - 2024.06, M.Phil., School of Information, Renmin University of China.
  • 2017.09 - 2021.06, B.E., School of Software Engineering, South China University of Technology.

📕 Teaching

  • 2025 Fall, CSCI1580
  • 2025 Spring, ENGG2020
  • 2024 Fall, CSCI3170