About Me
I am Max Yueqian Lin, a second-year PhD student in the CEI Lab of the Department of Electrical and Computer Engineering at Duke University. I am fortunate to be advised by Prof. Yiran Chen and Prof. Hai “Helen” Li. I hold a Bachelor’s degree in Data Science from Duke University and Duke Kunshan University, where I graduated summa cum laude in May 2024 with a signature work distinction and served as valedictorian. My undergraduate studies were guided by Prof. Ming Li and Prof. Kai Zhang.
My research interests are broadly in the area of multimodal large language models, with a focus on audio and vision understanding and generation.

IEEE Member
ACM Member
Sigma Xi Member
News
- [Apr. 2025] Our paper CoreMatching: A Co-adaptive Sparse Inference Framework with Token and Neuron Pruning for Comprehensive Acceleration of Vision-Language Models has been accepted by ICML 2025. Congrats to Qinsi and the team!
- [Mar. 2025] Our paper SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval has been accepted by ICME 2025. See you in Nantes, France!
- [Oct. 2024] I reached 100 citations on Google Scholar.
- [May. 2024] I was proud to speak on behalf of my class at Duke Kunshan University’s 2024 commencement. I touched on the theme of “the three begot ten thousand things” from the Tao Te Ching, reflecting on our journey from novices to graduates ready to impact the world. See news coverage here.
- [May. 2024] I visited Vienna to attend ICLR 2024, where I presented our paper SD-NAE: Generating Natural Adversarial Examples with Stable Diffusion in the Tiny Papers track.
- [Mar. 2024] I am excited to announce that I will join the Department of Electrical and Computer Engineering at Duke University as a PhD student in Fall 2024.
- [Feb. 2024] A short version of our paper SD-NAE: Generating Natural Adversarial Examples with Stable Diffusion has been accepted by ICLR 2024 in the Tiny Papers track. See you in Vienna, Austria!
- [Feb. 2024] Our paper Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and KiSing-v2 has been uploaded to arXiv. We introduce two large-scale singing voice datasets, ACE-Opencpop and KiSing-v2, both of which are available for download via ESPnet.
- [Nov. 2023] We release SD-NAE, a novel method to generate Natural Adversarial Examples (NAEs) for deep image classifiers.
- [Nov. 2023] Our paper EEG-Based Speech Envelope Decoding: Structured State Space and U-Net Model Integration has been accepted by the National Conference on Man-Machine Speech Communication 2023. See you in Suzhou, China!
- [Oct. 2023] I presented our poster RTVis: Research Trend Visualization Toolkit at IEEE VIS 2023, held in Melbourne, Australia.
- [Sep. 2023] Our paper BiSinger: Bilingual Singing Voice Synthesis has been accepted by IEEE ASRU 2023.
- [Jul. 2023] Our poster RTVis: Research Trend Visualization Toolkit has been accepted by IEEE VIS 2023. See you in Melbourne, Australia!
- [Jun. 2023] We release RTVis, a real-time tool for visualizing the research trend of a specific topic (GitHub).
- [Jun. 2022] Our paper on in-situ AFM tracking of nanoparticles has been published in ACS Macro Letters.
- [Jun. 2021] Our entrepreneurship project Elderly E (supported by the DKU Innovation Incubator) was selected for the 2021 Jiangsu Innovation and Entrepreneurship Plan for College Students (WeChat Article).
- [May. 2021] I received the 2021 Natural and Applied Sciences Division Award from DKU (WeChat Article).
Selected Publications
Manuscript · 2025
Voice Evaluation of Reasoning Ability: Diagnosing the Modality-Induced Performance Gap
arXiv
VERA is a voice-native benchmark that finally tests reasoning rather than speech perception under real-time conversational constraints. It packs 2,931 spoken episodes across Math, Web, Science, Long-Context, and Factual tracks, each adapted from text tasks so voice and text models can be compared apples-to-apples. The evaluations expose a systematic voice-to-text reasoning gap and a clear latency-versus-accuracy frontier: fast voice systems cluster around low accuracy, while text-level performance demands sacrificing real-time response. Diagnostics reveal distinct failure signatures across streaming, end-to-end, and cascade pipelines, giving a principled way to measure progress toward assistants that can both talk and think.
Conference · 2025
AsyncVoice Agent: Real-Time Explanation for LLM Planning and Reasoning
ASRU
AsyncVoice wraps any slow reasoning or planning backend with a live voice layer that explains what is happening while it waits. It streams updates about the planner’s hypotheses, tool calls, and confidence as soon as they surface, so users can stay engaged with a long-running system instead of waiting in silence. Because the voice presenter is decoupled from the underlying model, it can be dropped in front of existing agents to make their “thinking time” conversational without retraining or redesigning the core planner.
Manuscript · 2025
HippoMM: Hippocampal-inspired Multimodal Memory for Long Audiovisual Event Understanding
arXiv
HippoMM tackles long, noisy audiovisual streams with a memory system inspired by the hippocampus. It combines pattern separation and completion, short-to-long consolidation, and cross-modal associative retrieval so audio and vision episodes can be stored and queried efficiently. On the HippoVlog benchmark, HippoMM jumps to 78.2% accuracy versus 64.2% for the prior best while cutting response time from 112.5s to 20.4s, showing that biologically grounded memory can be both sharper and faster.
Conference · 2025
Keyframe-oriented Vision Token Pruning: Enhancing Efficiency of Large Vision Language Models on Long-Form Video Processing
ICCV
KVTP trims long-form video for VLMs without losing the timeline by adapting the pruning rate per frame based on query relevance. It softly elevates keyframes while keeping just enough context from the rest. Across SparseKV-QA and other long-video QA tasks, KVTP cuts up to 80% of tokens and about 64% of FLOPs while preserving accuracy, unifying keyframe selection and token pruning in a drop-in module.
Conference · 2025
SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval
ICME
Oral presentation
We introduce SpeechPrune alongside SPIRAL, a 1,012-example benchmark that probes long-form queries where a single spoken snippet matters. The method prunes tokens using speech-text similarity and approximated early-layer attention, all without retraining. On SPIRAL it raises accuracy by about 29% on average (up to 47%) at 20% pruning and stays competitive even after removing 80% of tokens, delivering efficient long-context understanding for speech LLMs.
Conference · 2024
Bridging Facial Imagery and Vocal Reality: Stable Diffusion-Enhanced Voice Generation
ISCSLP
Oral presentation
SD-EVG bridges face, voice, and text with a shared diffusion backbone. It implements three pipelines: voice to face for data enrichment, face to voice to map facial traits to vocal style, and prompt to face for style staging, then feeds the learned style into TTS. The recipe delivers expressive voice generation when the available modalities do not align, making multimodal voice design practical.
Conference · 2024
Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and KiSing-v2
INTERSPEECH
Oral presentation
ACE-Opencpop and ACE-KiSing scale singing voice data with a high-quality synthesizer and careful manual tuning. The corpora span 30 and 33 singers respectively at 48 kHz, ship with standardized splits, and plug directly into ESPnet and Muskits recipes. Across downstream SVS systems they consistently lift quality, giving practitioners a ready-made path to large-scale singing voice training.
Conference · 2024
SD-NAE: Generating Natural Adversarial Examples with Stable Diffusion
ICLR
SD-NAE synthesizes natural adversarial examples by steering Stable Diffusion with classifier feedback. A controlled optimization loop nudges latents while preserving semantics, yielding photorealistic yet adversarial images that meaningfully stress-test vision models. The pipeline delivers a practical generator for robustness studies with fully released code.
Journal · 2024
OpenOOD v1.5: Enhanced Benchmark for Out-of-Distribution Detection
DMLR
OpenOOD v1.5 upgrades OOD evaluation to ImageNet scale with foundation models like CLIP and DINOv2 plus a spectrum of semantic and covariate shifts. It ships an easy evaluator, standardized splits, and a public leaderboard so researchers can compare methods apples-to-apples, enabling reproducible, trustworthy OOD testing for the modern era.
Conference · 2023
BiSinger: Bilingual Singing Voice Synthesis
ASRU
BiSinger trains a single pop-singing model that handles Chinese, English, and natural code-switching. By mapping phonetics through CMUdict and carefully fusing data, the system learns a language-independent representation, so multilingual singing no longer requires juggling separate models. The demos show strong English and mixed-language vocals while maintaining Chinese quality.