Hi, I’m Wenming Tu (涂文明). I am currently a first-year Ph.D. student in Computer Science and Technology at Shanghai Jiao Tong University (SJTU). I am jointly supervised by Prof. Xie Chen from the X-LANCE Lab and Dr. Zilong Zheng from the Beijing Institute for General Artificial Intelligence (BIGAI), NLCo Group. I am passionate about advancing the frontiers of AI research and developing innovative solutions in this rapidly evolving field.

My research interests primarily focus on speech and audio processing and multimodal large language models. I aim to explore how these technologies can enhance human-computer interaction, improve speech synthesis and recognition systems, and advance AI capabilities in multimodal environments.

Think bold! Work hard!

🔥 News

2026.06: 🎉🎉 2 papers have been accepted by INTERSPEECH 2026!
2026.05: 🎉🎉 1 paper has been accepted by ICML 2026!
2026.02: 🎉🎉 We won 🥈 (2nd place) in the Agent Track of the Interspeech 2026 Audio Reasoning Challenge. See the Leaderboard and the official Challenge Report.
2025.09: 🎉🎉 1 paper has been accepted by NeurIPS 2025!
2025.05: 🎉🎉 1 paper has been accepted by ACL 2025!
2025.01: 🎉🎉 1 paper has been accepted by EuroSys 2025!

📝 Publications

(* denotes co-first authors, # denotes corresponding authors)

2026

arXiv 2026 · Preprint

MMAE: A Massive Multitask Audio Editing Benchmark.
Ziyang Ma*, Ruiqi Yan*, Ruiyang Xu*, Jie Fang*, Zhikang Niu*, Yi-Wen Chao*, Wenming Tu*, Tianrui Wang*, Auden, Qi Chen, Wenxi Chen, Jiaying Chi, Yanru Huo, Zixuan Jiang, Xiquan Li, Yalin Li, Junxi Liu, Minghao Liu, Binghao Qiang, Yijia Shan, Zheshu Song, Tian Tan, Zixiang Wang, Zeyu Xie, Zhifei Xie, Xiaoyu Xing, Qixiang Xu, Chen Yang, Guanrou Yang, Shan Yang, Yifan Yang, Steve Yves, Haotian Zhang, Haina Zhu, Kai Yu, Liefeng Bo, Eng-Siong Chng, Xie Chen#. (As a Core Contributor)

TL;DR: The first comprehensive benchmark for instruction-based audio editing, spanning 7 audio modalities with a taxonomy of 6 complexity levels, 2 granularities, and 8 operation types. Its 2,000 samples are scored against 17,741 rubric-based criteria, revealing that even leading models score below 5% Exact Match—dropping to 0% on complex mixed-modality tasks.

INTERSPEECH 2026

VISA: A Visual Information Strengthened Audio-Reasoning System for the Interspeech 2026 ARC Agent Track.
Wenming Tu*, Jian Gao*, Yanru Huo, Yixuan Wang, Jing Peng, Bohan Li, Ziyang Ma, Tao Liu, Shuai Fan#, Kai Yu, Xie Chen, Zilong Zheng#.

TL;DR: VISA is our 2nd-place submission to the Interspeech 2026 Audio Reasoning Challenge (Agent Track), built on a "LALM-as-a-Tool" paradigm. It combines multi-modal feature extraction, model-voting inference with consistency checking, and rubric-aligned category-aware routing, reaching a 66.23% Rubrics score and the highest accuracy (77.40%) on the official leaderboard.

A Unified and Reproducible Experimentation Framework for Speech Understanding.
Jing Peng*, Junhao Du*, Chenghao Wang*, Hanqi Li*, Yi Yang*, Yixuan Wang, Xiaoyu Gu, Guanyu Chen, Yucheng Wang, Jiang Li, Zhangjie Zhao, Haoran Wang, Wenming Tu, Haoyu Li, Duo Ma, Lirong Qian, Yu Xi, Wen Wen, Jiaqi Guo, Hui Zhang, Shuai Fan, Wenbin Jiang, Shuai Wang, Kai Yu#.

Audio-Mind: An Auditable Agentic Framework for Audio Understanding.
Yucheng Wang*, Jing Peng*, Hanqi Li, Chenghao Wang, Wenming Tu, Yu Xi, Zhaokai Sun, Kai Yu, Shuai Wang#.

ICML 2026

AVI-Bench: Toward Human-like Audio-Visual Intelligence of Omni-MLLMs.
Yaoting Wang, Ziyi Zhang, Wenming Tu, Shaoxuan Xu, Wenjie Du, Cheng Liang, Weijun Wang, Yuanchao Li, Guangyao Li, Hao Fei, Yuanchun Li, Henghui Ding#, Yunxin Liu.

TL;DR: A cognitively-inspired benchmark evaluating Omni-MLLMs across perception, understanding, and reasoning through cross-modal audio-visual tasks. An extension, AVI-Bench-PriSe, probes primitive audio-visual sensation with low-semantic stimuli; experiments expose substantial limitations and yield a four-level AVI taxonomy.

arXiv 2026 · Preprint

MOVA: Towards Scalable and Synchronized Video-Audio Generation.
Core contributor, in collaboration with the SII-OpenMOSS Team.

TL;DR: MOVA (MOSS Video and Audio) is an open-source model for jointly generating high-quality, synchronized audio-visual content—lip-synced speech, environment-aware sound effects, and content-aligned music. It uses a 32B-parameter Mixture-of-Experts architecture (18B active) supporting Image-Text→Video-Audio generation, released with full weights, code, LoRA fine-tuning, and inference tooling.

2025

arXiv 2025 · Preprint

UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models.
Wenming Tu, Guanrou Yang, Ruiqi Yan, Wenxi Chen, Ziyang Ma, Yipeng Kang, Kai Yu, Xie Chen#, Zilong Zheng#.

TL;DR: UltraVoice is an 830+ hour speech-dialogue dataset with instructions controlling six style dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning SLAM-Omni and VocalNet on it lifts Mean Opinion Score by 29–42% and Instruction-Following Rate by up to 40 points, while preserving conversational ability.

Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia.
In collaboration with the DeepMind Concordia Team.

ACL Findings 2025

Are the Values of LLMs Structurally Aligned with Humans? A Causal Perspective.
Yipeng Kang, Junqi Wang, Yexin Li, Mengmeng Wang, Wenming Tu, Quansen Wang, Hengli Li, Tingjun Wu, Xue Feng, Fangwei Zhong, Zilong Zheng#.

TL;DR: A causal-perspective study showing that the underlying value graph of LLMs differs significantly from human value systems even after alignment training. It proposes two lightweight steering methods—role-based prompting and sparse-autoencoder steering—shown to be effective and controllable on Gemma-2B-IT and Llama3-8B-IT.

EuroSys 2025

Empower Vision Applications with LoRA LMM.
Liang Mi*, Weijun Wang*#, Wenming Tu, Qingfeng He, Rui Kong, Xinyu Fang, Yazhu Dong, Yikang Zhang, Yuanchun Li, Meng Li, Haipeng Dai, Guihai Chen, Yunxin Liu.

TL;DR: VaLoRA is an end-to-end system for serving LoRA-adapted Large Multimodal Models, combining accuracy-aware adapter generation, adaptive-tiling batched operators for heterogeneous adapters, and flexible request/adapter orchestration. Across five vision tasks on three models it improves accuracy by 24–62% while cutting latency by 20–89% versus state-of-the-art serving.

🎖 Honors and Awards

2025.06 Outstanding Graduate of CUMTB. 🎓
2024.03 Merit Student Award of Beijing. 🏅
2023.10 Xiaomi Scholarship. 🎖

📖 Education

2025.09 - 2030.06 (expected): Computer Science and Technology. School of Computer Science, Shanghai Jiao Tong University (SJTU)
2021.09 - 2025.06: Computer Science and Technology. School of Artificial Intelligence, China University of Mining and Technology-Beijing (CUMTB)

💻 Internships

2026.04 - Present, Tencent Hunyuan, Shanghai, China.
2025.10 - 2026.03, SII & OpenMOSS, Shanghai, China.
2024.10 - 2025.09, Beijing Institute for General Artificial Intelligence (BIGAI), NLCo Group, Beijing, China. Co-supervised by Dr. Zilong Zheng and Dr. Yipeng Kang.
2023.12 - 2024.09, Institute for AI Industry Research (AIR), Tsinghua University, AIoT Group, Beijing, China. Co-supervised by Dr. Weijun Wang and Prof. Yuanchun Li.