πŸ§β€β™‚οΈ Biography

Hi 😊, I am HUANG Jiehui, a PhD student at HKUST supervised by Prof. Jiaya Jia. I am passionate πŸš€ about academic research and aim to apply my research findings to real-world challenges, thereby making meaningful and impactful contributions to society.

I am striving to help humanity achieve digital immortality and machine consciousness as early as possible. My primary research interests are controllable AIGC and VLMs. If you are interested in collaboration or would like to get in touch, please feel free to reach out via email.

Previously, I received an M.S. degree, advised by Xiaodan Liang (撁小丹) and co-supervised by Shengcai Liao (Professor at United Arab Emirates University, IEEE Fellow, IAPR Fellow), working in the HCP Lab at the School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, China.

πŸ”₯ News

  • 2025.12: πŸŽ‰πŸŽ‰ ConsistentID was accepted by TPAMI.
  • 2025.12: πŸŽ‰πŸŽ‰ Released UnityVideo, a unified multi-modal, multi-task video model for enhancing world-aware video generation.
  • 2025.11: πŸŽ‰πŸŽ‰ One paper was accepted by AAAI 2026.
  • 2025.07: πŸŽ‰πŸŽ‰ One paper was accepted by ACM MM 2025.
  • 2025.01: πŸŽ‰πŸŽ‰ One paper was accepted by IEEE Transactions on Instrumentation & Measurement.
  • 2024.11: πŸŽ‰πŸŽ‰ Awarded the China National Scholarship at Sun Yat-sen University.
  • 2024.04: πŸŽ‰πŸŽ‰ Released ConsistentID (✨ 900+ stars), a fast, high-fidelity customized portrait generation model.
  • 2024.01: πŸŽ‰πŸŽ‰ One paper was accepted by Computers in Biology and Medicine.
  • 2023.12: πŸŽ‰πŸŽ‰ Two papers were accepted by AAAI and Knowledge-Based Systems, respectively.
  • 2023.11: πŸŽ‰πŸŽ‰ One paper was accepted by Neurocomputing.
  • 2022.10: πŸŽ‰πŸŽ‰ Awarded the First-Class Master's Scholarship at Sun Yat-sen University.
  • 2021.11: πŸŽ‰πŸŽ‰ Awarded the China National Scholarship at Nanchang University.

πŸ“ Selected Publications

[arXiv] UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation
Jiehui Huang, Yuechen Zhang, Xu He, Yuan Gao, Zhi Cen, Bin Xia, Yan Zhou, Xin Tao, Pengfei Wan, Jiaya Jia

Project, GitHub Code, Hugging Face

  • We propose UnityVideo, a novel unified framework for integrating multiple video tasks and modalities, enabling mutual knowledge transfer, better convergence, and improved performance over single-task baselines.
  • We introduce a modality-adaptive switcher, an in-context learner, and a dynamic noise scheduling strategy that together enable efficient joint training across diverse objectives and scalability to larger datasets.
  • We construct and release OpenUni, a 1.3M-pair multimodal video dataset, and UniBench, a 30K-sample benchmark derived from Unreal Engine for fair evaluation of unified video models.
[arXiv] ReCamDriving: LiDAR-Free Camera-Controlled Novel Trajectory Video Generation
Yaokun Li, Shuaixian Wang, Mantang Guo, Jiehui Huang, Taojun Ding, Mu Hu, Kaixuan Wang, Shaojie Shen, Guang Tan

Project, GitHub Code

  • We propose ReCamDriving, a novel vision-based framework for novel-trajectory video generation that leverages 3DGS renderings to achieve precise camera control and structurally consistent generation.
  • We introduce a novel 3DGS-based cross-trajectory data curation strategy for scalable multi-trajectory supervision, and construct the ParaDrive dataset with over 110K parallel-trajectory pairs.
  • Extensive experiments demonstrate that ReCamDriving achieves state-of-the-art performance in both camera controllability and 3D consistency.
[ACM MM 2025] LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation
Wenhui Song, Hanhui Li, Jiehui Huang, Panwen Hu, Yuhao Cheng, Long Chen, Yiqiang Yan, Xiaodan Liang*

Project

  • We propose LaVieID, a novel framework that effectively addresses the identity-preserving text-to-video generation task.
  • We introduce a local router to provide spatial structural guidance by leveraging fine-grained facial cues.
  • We devise a temporal autoregressive module to model long-range temporal dependencies among frames.
[TPAMI] ConsistentID: Portrait Generation with Multimodal Fine-Grained Identity Preserving
Jiehui Huang, Xiao Dong, Wenhui Song, Hanhui Li, Jun Zhou, Yuhao Cheng, Shutao Liao, Long Chen, Yiqiang Yan, Shengcai Liao, Xiaodan Liang*

Project, Hugging Face Demo

  • We introduce ConsistentID to improve fine-grained customized facial generation by incorporating detailed descriptions of facial regions and local facial features.
  • We devise an ID-preservation network optimized by a facial attention localization strategy, enabling more accurate ID preservation and more vivid facial generation.
  • We introduce the inaugural fine-grained facial generation dataset, FGID, addressing limitations in existing datasets for capturing diverse identity-preserving facial details.
[Knowledge-Based Systems] TMBL: Transformer-based multimodal binding learning model for multimodal sentiment analysis
Jiehui Huang, Jun Zhou, Zhenchao Tang, Jiaying Lin, and Calvin Yu-Chian Chen*

Project

  • Since existing multimodal fusion systems rarely model fine-grained word-level interactions, we redesign the Transformer structure, improving the ACC metric by 6%.
  • To address the modality heterogeneity introduced by multimodal feature fusion, we design a CLIP-inspired cross-modal binding mechanism for each modality that fuses modal features more effectively.
  • To alleviate the modality aliasing caused by hard-to-distinguish modal features, we design CLS-token and position-embedding information to effectively separate modality-specific spatial and semantic relationships.
[Neurocomputing] Progressive network based on detail scaling and texture extraction: A more general framework for image deraining
Jiehui Huang, Zhenchao Tang, Xuedong He, Jun Zhou, Defeng Zhou, and Calvin Yu-Chian Chen*

Project

  • To enhance module coupling and portability, we redesign the existing deraining module and establish a multi-scale coupling method; this simple and effective strategy improves model performance by 5%.
  • To improve the transferability and generalization of the model, we design a detail scaling module that extracts generalized features from degraded images and restores finer details, avoiding distortion.
  • We enhance the attention and feed-forward layers of the Transformer block to extract universal features more efficiently, strengthening the model's ability to capture comprehensive and transferable features.
  • A progressive learning strategy helps learn richer multi-scale features, achieving SOTA performance on datasets such as SPA-Data, RainDrop, RID, and Rain100.
[AAAI 2026] Zero-Shot 3D-Aware Trajectory-Guided Image-to-Video Generation via Test-Time Training
Ruicheng Zhang, Jun Zhou, Zunnan Xu, Zihao Liu, Jiehui Huang, Mingyang Zhang, Yu Sun, Xiu Li

Project

  • We propose Zo3T, a 3D-aware test-time training framework for zero-shot controllable video generation that enables precise control over both target object motion and camera movement.
  • Zo3T injects lightweight LoRA modules at test time during trajectory-guided generation, adaptively guiding the generation process and maintaining generative fidelity during latent manipulation. We further refine the guidance field by re-evaluating noise scores to enforce trajectory fidelity.
  • Extensive experiments demonstrate that our method outperforms both training-based and training-free methods in trajectory control and the fidelity of generated videos.
[AAAI 2024] Comprehensive View Embedding Learning for Single-cell Multimodal Integration
Zhenchao Tang, Jiehui Huang, Guanxing Chen, Pengfei Wen, and Calvin Yu-Chian Chen*

Project

  • Embedding learning is performed on single-cell multimodal data from three views, including the regulatory relationships between modalities and the relationships among fine-grained single-cell features within each modality.
  • By learning graph link embeddings, the proposed CoVEL models cross-modal regulatory relationships and uses biological knowledge to bridge the gap between the feature spaces of different modalities.
  • To eliminate differences between modalities while preserving biological heterogeneity, single-cell fine-grained embeddings and contrastive cell embeddings are learned from multimodal data in an unsupervised manner.
  • From a representation-learning perspective, the proposed self-supervised method captures the information between data samples, whereas generative methods focus on the information within each sample.

πŸŽ– Honors and Awards

  • 2025.06 Outstanding Graduate of Sun Yat-sen University
  • 2024.11 China National Scholarship
  • 2023.10 The First Prize Scholarship of Sun Yat-sen University
  • 2022.06 Outstanding Graduate of Nanchang University
  • 2021.11 China National Scholarship
  • 2021.10 Outstanding Scholarship of Nanchang University
  • 2021.08 Siemens Cup China Intelligent Manufacturing Challenge (CIMC): First Prize, National Preliminary
  • 2020.08 RoboMaster University Championship (RMUC), Infantry Group: First Prize, National Championship
  • 2020.02 Invention patent: a non-blocking controllable projectile launch system

πŸ“– Education

2025.09 - now, Ph.D. Student.

Artificial Intelligence, Department of Computer Science and Engineering (CSE).

Hong Kong University of Science and Technology, Hong Kong.

2022.09 - 2025.06, M.S. Student.

Artificial Intelligence, School of Intelligent Systems Engineering (ISE).

Sun Yat-sen University, Shenzhen.

2018.09 - 2022.06, Undergraduate.

Automation, School of Intelligent Systems Engineering.

Nanchang University (NCU), Nanchang.

πŸ“ Academic Service

Reviewer:
βˆ™ Conference Reviewer: CVPR, ECCV, AAAI, ACM MM, AISTATS, …
βˆ™ Journal Reviewer: TPAMI, TVCG, TIP, TIM, KBS, …

😻 My Hobbies

πŸ§— πŸ€ πŸ‹ 🎧 πŸ“· 🏊 🏸 🎣 …