About Me
I am a Software Engineer specializing in building large-scale data platforms and pipelines. In my current role on TikTok's AI Data Platform (AIDP) team, I develop LLM-agent-based pipelines to generate synthetic data, directly enhancing code models for tasks like automated unit test generation and program repair.
My engineering background includes optimizing data pipelines and developing scalable distributed systems, which has given me a solid foundation in software architecture and performance tuning. Prior to TikTok, I worked at Microsoft, where I optimized high-performance video index generation pipelines for the Bing Multimedia Team, improving Bing's Video Search freshness and relevance to surpass Google and Baidu.
My work also has a strong HCI component: at TikTok, I designed intelligent task distribution systems for human labelers (patent pending), which sparked my interest in the field.
My hands-on experience informs my research interests at the intersection of AI, Software Engineering (SE), and Human-Computer Interaction (HCI):
- AI + SE: Investigating the impact of generative models on software quality and security, focusing on automated unit test generation and program repair. Studying the social and ethical implications—how practices like ‘vibe-coding’ affect software quality, team dynamics, and trust.
- HCI: Exploring how developers interact with AI-powered tools and designing systems that function as effective, intuitive partners, especially as AI becomes core to team workflows.
I hold a Master's in Computer Science from the University of Southern California (Annenberg Fellowship) and have published at top-tier venues like ESEC/FSE and ASE. I also won the Ability Award at the Imagine Cup World Finals in 2016, reflecting my commitment to building impactful technology.
For a detailed overview of my professional journey, please refer to my LinkedIn profile. You can also download my full CV below.
Current Focus
At TikTok's AI Data Platform team, my work includes:
- Designing multi-round LLM unit test and bug-fix data synthesis pipelines to fine-tune code generation models via SFT/RL training.
- Implementing backend microservices using Python, Golang, Kafka, Redis, and MySQL.
- Experimenting with intelligent crowdsourcing task distribution strategies such as similar-task clustering, MILP-based globally optimal assignment, and LLM-assisted task pre-labeling.
- Integrating a vector database for semantic similarity-based task assignment.
- Optimizing service performance with data-driven approaches.
Recognition & Awards
- Winner of Ability Award (Project BoneyCare, 1/150 countries), Microsoft Imagine Cup World Final
- Winner of Microsoft Imagine Cup China 2016, Microsoft Corporation
- SIGSOFT CAPS Award, ACM SIGSOFT, September 2018