About
I am a Doctoral Researcher at Aalto University, funded by FCAI (Finnish Center for Artificial Intelligence). My primary supervisor is Professor Joni Pajarinen, who leads our Aalto Robot Learning Lab. My co-supervisors are Professor Alexander Ilin and Professor Juho Kannala.
My research is in the domain of Computer Vision, focusing on Object-Centric representation Learning (OCL). By combining OCL with mainstream World Models (WM), I aim to enhance embodied agents' Perception, Understanding, Reasoning and Prediction of visual scenes, as well as their Planning, Decision-making and Acting in environments.
I have invested substantial effort in my object-centric-bench framework, which implements many OCL methods with a unified set of strong training tricks, so that these well-known methods can be evaluated and compared under fair settings. The framework has already been adopted by many follow-up works, including my own: VQ-VFM-OCL, DIAS, RandSF.Q, and SmoothSA.
I am looking for academic collaborators. If you are interested in applying OCL to tasks like visual/video question answering, visual prediction/reasoning, world modeling and reinforcement learning, please do not hesitate to contact me.
I provide academic supervision for Bachelor and Master students, free of charge. I have helped two students publish first-authored papers at top venues. I have more ideas than I have time to explore. Contact me DIRECTLY with your academic RESUME if you are interested in working with me. Ideas and GPUs (with my supervisor's permission) will not be a problem.
News
- [2026/03/13] My student's paper was accepted to CVPR 2026 Findings 🎉
- [2025/11/08] A paper was accepted to AAAI 2026 🎉
- [2025/09/19] My student's paper was accepted to NeurIPS 2025 🎉
- [2025/07/04] Two papers were accepted to ACM MM 2025 🎉
Research
OCL aims to represent an image or video as a set of (sub-)object-level feature vectors, termed slots. OCL models follow an encode-aggregate-decode architecture and are trained by reconstructing the input in some form.
I have been improving OCL in its aggregation, decoding and reconstruction, as well as its transition dynamics for video.
I am trying to combine OCL with mainstream WMs.
I am trying to give OCL a stronger geometric inductive bias by combining it with geometric vision models, like NeRF and Gaussian Splatting (GS).
Supervision
Current
- Zekun Wang (with Junyi Shi), 2026/03~Now. Master student @ Aalto University.
  OCL + Neural Gaussian Splatting.
- Ting Fu (with Junyi Shi), 2026/03~Now. Master student @ Aalto University.
  OCL + 3D scene understanding.
- Linda Liljavirta, 2026/01~Now. Bachelor student @ Aalto University.
  Literature review on 3D scene understanding, e.g., LangSplat.
- Louis French (with Professor Cheng Wang), 2025/12~Now. Bachelor student @ University of East Anglia.
  OCL for more efficient 3D/4D reconstruction, e.g., CUT3R.
- Fan Chen (with Professor Bin Zhao), 2025/05~Now. Master student @ Guilin University of Electronic Technology.
  OCL plus geometric computer vision.
- Yanhua Han (with Professor Bin Zhao), 2025/05~Now. Master student @ Guilin University of Electronic Technology.
  OCL for learning visual hierarchies.
- Youliang Tao (with Professor Bin Zhao), 2025/05~Now. Bachelor student @ Guilin University of Electronic Technology.
  OCL decoding with masked auto-encoding.

- Guangyuan Li, 2025/09~2026/02. Master student @ Aalto University.
  OCL for vision token pruning in VLMs and for context token pruning in LLMs.
  Output: OC-VTP
  Placement: Doctoral student with ELLIS Professor Jiancheng Yang.
- Janina Kemppainen, 2025/02~2025/05. Bachelor student @ Aalto University.
  No/low-code AI app building platform case study.
  Output: thesis
- Hongjia Liu, 2024/12~2025/05. Master student @ Aalto University.
  OCL plus vector-quantization.
  Output: MetaSlot
- Daniel Kopra, 2023/02~2023/05. Bachelor student @ Aalto University.
  Literature review on modular design in deep neural networks.
  Output: thesis
Publication
': equal contribution; *: corresponding author
Preprint
- R Zhao, W Yang, J Kannala, J Pajarinen.
  Smoothing Slot Attention Iterations and Recurrences.
  arXiv:2508.05417.
  [SmoothSA] paper/code/model/log
  🌟🌟 new SotA of OCL on both images and videos

- G Li', R Zhao'*, J Deng, Y Wang, J Pajarinen.
  Object-Centric Vision Token Pruning for Vision Language Models.
  CVPR 2026 Findings.
  [OC-VTP] paper/code
  🌟🌟 guaranteed optimal vision token pruning for the first time, by combining OCL
- R Zhao, J Li, J Kannala, J Pajarinen.
  Predicting Video Slot Attention Queries from Random Slot-Feature Pairs.
  AAAI 2026.
  [RandSF.Q] paper/poster/code/model/log
  🌟🌟🌟 implicit transition dynamics modeling in video OCL for the first time; significant performance gains
- H Liu, R Zhao*, H Chen, J Pajarinen.
  Break Through the Fixed Number of Slots in Object-Centric Learning.
  NeurIPS 2025.
  [MetaSlot] paper/code
- R Zhao, Y Zhao, J Kannala, J Pajarinen.
  Slot Attention with Re-Initialization and Self-Distillation.
  ACM MM 2025.
  [DIAS] paper/code/model/log
- R Zhao, V Wang, J Kannala, J Pajarinen.
  Vector-Quantized Vision Foundation Models for Object-Centric Learning.
  ACM MM 2025.
  [VQ-VFM-OCL] paper/code/model/log
  🌟🌟 unify mainstream OCL by supporting any decoding; reproduce many baselines
- R Zhao, V Wang, J Kannala, J Pajarinen.
  Multi-Scale Fusion for Object Representation.
  ICLR 2025.
  [MSF] paper/code/model
- R Zhao, V Wang, J Kannala, J Pajarinen.
  Grouped Discrete Representation for Object-Centric Learning.
  ECML-PKDD 2025.
  [GDR] paper/code/model