22-24 October 2025, Singapore

Speech is among the most natural and convenient means of biometric authentication. The individual traits embedded in speech signals form the basis of speaker recognition, also known as voice authentication. With the widespread availability of speaker recognition and speech synthesis tools, the threat of malicious exploitation of speaker attributes is growing. For example, with just a few seconds (e.g., 3 seconds) of a target speaker's speech retrieved from a breached data source, an attacker can infer privacy-related information such as age, interests, opinions, ethnicity, and health status. Using voice conversion (VC) and text-to-speech (TTS) techniques, synthetic speech can then be generated to impersonate the target speaker and exploited for malicious purposes, such as damaging the speaker's reputation and manipulating public opinion. These voice privacy concerns call for techniques such as voice anonymization, speech watermarking, and anti-spoofing. In this panel, we invite world-leading experts to share their opinions on the security and privacy aspects of handling individual traits in speech, the challenges posed by advances in speaker recognizers and neural speech synthesizers, and the collaborative efforts that can be brought together to address these concerns and challenges.
Liping Chen is an Associate Researcher at the University of Science and Technology of China (USTC), Hefei, China. She received the Ph.D. degree in signal and information processing from USTC in 2016. From 2016 to 2022, she was a Speech Scientist with Microsoft. Her research interests include speech processing, voice privacy protection, speech synthesis, and speaker recognition.
Tomoki Toda is currently a Professor with the Information Technology Center, Nagoya University. He received the B.E. degree from Nagoya University, Japan, in 1999 and the D.E. degree from the Nara Institute of Science and Technology (NAIST), Japan, in 2003. He was a Research Fellow with the Japan Society for the Promotion of Science from 2003 to 2005, an Assistant Professor with NAIST from 2005 to 2011, and an Associate Professor with NAIST from 2011 to 2015. His research interests include statistical approaches to speech, music, and environmental sound processing. He was the recipient of the IEEE SPS 2009 Young Author Best Paper Award and the 2013 EURASIP-ISCA Best Paper Award of the Speech Communication journal. He has served as the SLP TC Chair of APSIPA since 2025. He has organized several special sessions and challenges, such as the Voice Conversion Challenge 2016 at INTERSPEECH 2016, the VoiceMOS Challenge 2022 at INTERSPEECH 2022, and the Singing Voice Deepfake Detection Challenge 2024 at IEEE SLT 2024.
Xin Wang is currently a JST PRESTO researcher and a Project Associate Professor at the National Institute of Informatics, Japan. He received his Ph.D. degree from SOKENDAI, Japan, in 2018. Prior to that, he earned his Master’s and Bachelor’s degrees from USTC and UESTC, China, respectively. He has been an organizer of the past three ASVspoof challenges on speech deepfake detection, as well as the VoicePrivacy challenges on speaker anonymization. He is also an appointed team member of the ISCA Special Interest Group on Security and Privacy in Speech Communication.
Rohan Kumar Das is currently a Research and Development (R&D) Manager at Fortemedia, Singapore division. Prior to that, he was a Research Fellow at the National University of Singapore from 2017 to 2021 and a Data Scientist at KOVID Research Labs, India, in 2017. He received his Ph.D. degree from the Indian Institute of Technology (IIT) Guwahati. He was one of the organizers of the special sessions "The Attacker's Perspective on Automatic Speaker Verification" and "Far-Field Speaker Verification Challenge 2020" at Interspeech 2020, as well as the Voice Conversion Challenge 2020. He served as Publication Chair of the IEEE Automatic Speech Recognition and Understanding (ASRU) Workshop 2019 and as one of the chairs of the Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020. He is a Senior Member of IEEE and a member of ISCA and APSIPA. His research interests include speech/audio signal processing, speaker verification, anti-spoofing, social signal processing, and various applications of deep learning.
Shiqi Wang (DL 2025-2026), Professor, Department of Computer Science, City University of Hong Kong, Hong Kong, China
Bio: Shiqi Wang is a Professor with the Department of Computer Science, City University of Hong Kong. He has contributed more than 70 technical proposals to the ISO/MPEG, ITU-T, and AVS standards, and has authored or coauthored more than 300 refereed journal articles and conference papers. His research interests include video compression, image/video quality assessment, and image/video search and analysis. He received Best Paper Awards from IEEE VCIP 2019, ICME 2019, IEEE Multimedia 2018, and PCM 2017, and his coauthored article received the Best Student Paper Award at IEEE ICIP 2018. He was a recipient of the 2021 IEEE Multimedia Rising Star Award at ICME 2021. He has served or serves as an Associate Editor for IEEE Transactions on Circuits and Systems for Video Technology, IEEE Transactions on Multimedia, IEEE Transactions on Image Processing, and IEEE Transactions on Cybernetics, and was a Technical Program Co-Chair of IEEE ICME 2024.
Hanwei Zhu, Research Scientist, Alibaba-NTU Global e-Sustainability CorpLab (ANGEL), Nanyang Technological University, Singapore
Bio: Dr. Hanwei Zhu is a Research Scientist with the Alibaba-NTU Global e-Sustainability CorpLab (ANGEL) at Nanyang Technological University, Singapore. He earned his Ph.D. degree from City University of Hong Kong in 2025. His research interests include perceptual image processing, computational vision, and computational photography.
Presentation Title: Visual Quality Assessment Based on Large Vision-Language Models
Presenters: Shiqi Wang and Hanwei Zhu
Abstract: Large vision–language models (LVLMs) have recently exhibited significant potential in visual understanding tasks, yet systematically evaluating their image quality assessment (IQA) capabilities remains challenging. This talk introduces a unified approach to IQA that transitions from traditional scalar metrics to sophisticated reasoning-based evaluation. Specifically, we present three key innovations: (1) a Two-Alternative Forced Choice (2AFC) framework employing strategic pairing and maximum-a-posteriori inference for robust LVLM ranking; (2) an open-ended visual quality comparison task enabling detailed and context-aware model rationales; and (3) a novel no-reference IQA model that translates comparative judgments from LVLMs into continuous quality scores. Additionally, we introduce AgenticIQA, a modular, divide-and-conquer framework that combines LVLM reasoning with conventional IQA tools, coordinated by planning, execution, and summarization agents. Together, these contributions chart a path towards intelligent, interpretable, and adaptable visual quality assessment for the next generation of multimodal models.
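To make item (1) concrete, here is a minimal sketch, assuming a Bradley-Terry preference model with a Gaussian prior, of how pairwise 2AFC outcomes collected from LVLM prompts can be aggregated into latent quality scores by maximum-a-posteriori estimation; the data, pairing, and hyperparameters are purely illustrative and not the authors' implementation.

```python
# Minimal sketch (illustrative, not the authors' implementation): aggregate 2AFC
# outcomes into latent quality scores by MAP estimation under a Bradley-Terry
# model with a zero-mean Gaussian prior on the scores.
import numpy as np

def map_rank(num_items, comparisons, prior_var=1.0, lr=0.1, steps=2000):
    """comparisons: list of (winner, loser) index pairs from 2AFC prompts."""
    s = np.zeros(num_items)                           # latent quality scores
    for _ in range(steps):
        grad = -s / prior_var                         # gradient of the Gaussian log-prior
        for w, l in comparisons:
            p = 1.0 / (1.0 + np.exp(-(s[w] - s[l])))  # P(winner preferred over loser)
            grad[w] += 1.0 - p                        # push winner's score up
            grad[l] -= 1.0 - p                        # push loser's score down
        s += lr * grad / len(comparisons)             # gradient ascent on the log-posterior
    return s - s.mean()                               # zero-mean scores define the ranking

# Illustrative use: four candidates, pairwise preferences collected via 2AFC prompts.
prefs = [(0, 1), (0, 2), (1, 2), (3, 0), (3, 1), (3, 2)]
scores = map_rank(4, prefs)
print(np.argsort(-scores))                            # indices ordered from best to worst
```

Compared with simple win counting, MAP aggregation of this kind tends to be more robust to sparse or mutually inconsistent comparisons, which is the motivation for using it to produce a global ranking.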
Koki Wataoka, Responsible AI Team, SB Intuitions, Japan
Bio: Koki Wataoka leads the Responsible AI Team in the Data & Safety Department of the R&D Headquarters at SB Intuitions, Japan, where he oversees research and development to advance the safety of LLMs and VLMs. He earned his master’s degree from the Graduate School of System Informatics at Kobe University in 2021. That same year, he joined LINE Corporation (now LINE Yahoo!), focusing on the reliability and safety of large-scale language models. In 2023, he moved to SB Intuitions, where he continues to drive responsible AI initiatives and strengthen the safety of next-generation AI systems.
Huy Hong Nguyen, Researcher, SB Intuitions, Japan
Bio: Huy H. Nguyen is a researcher at SB Intuitions, a SoftBank Group company. He is also a visiting associate professor at the National Institute of Informatics (NII), Japan. His research focuses on improving the safety, security, and privacy of LLMs and VLMs, as well as the generation and detection of synthetic media. His future research vision includes extending these efforts to safeguard artificial general intelligence (AGI). He earned his Ph.D. from The Graduate University for Advanced Studies (SOKENDAI) in collaboration with NII in 2022.
Presentation Title: Foundation Models as Guardrails: LLM- and VLM-Based Approaches to Safety and Alignment
Presenters: Koki Wataoka and Huy Hong Nguyen
Abstract: The growing deployment of large language models (LLMs) and vision-language models (VLMs) raises urgent concerns about safety and alignment. While alignment techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) improve model behavior, they are not sufficient to prevent harmful outputs. This paper reviews recent approaches that use foundation models themselves as guardrails: systems that monitor or filter inputs and outputs for safety. We cover LLM-based moderation, neural classifiers, and multimodal safety filters, highlighting both academic advances and industry tools. We also discuss empirical evaluation methods such as red teaming and adversarial prompting. Finally, we outline open challenges in robustness, interpretability, and policy adaptation, pointing to key directions for building trustworthy guardrails for generative AI.
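As a minimal sketch of the guardrail idea, assuming a generic query_llm(prompt) -> str chat-completion interface (hypothetical, not a specific vendor API) and an illustrative safety policy, a foundation model can screen both the user input and the assistant's draft output:

```python
# Minimal sketch of an LLM-based guardrail. `query_llm` is a hypothetical
# chat-completion call (any provider); the policy wording is illustrative.
from typing import Callable

SAFETY_PROMPT = (
    "You are a safety classifier. Label the following text as SAFE or UNSAFE "
    "under this policy: violence, self-harm, illegal activity, privacy leaks.\n"
    "Text: {text}\nLabel:"
)

def guarded_generate(user_input: str,
                     assistant: Callable[[str], str],
                     query_llm: Callable[[str], str]) -> str:
    # Input guardrail: block unsafe requests before the main model sees them.
    if "UNSAFE" in query_llm(SAFETY_PROMPT.format(text=user_input)).upper():
        return "Sorry, I can't help with that request."
    draft = assistant(user_input)
    # Output guardrail: filter harmful completions that slip past alignment.
    if "UNSAFE" in query_llm(SAFETY_PROMPT.format(text=draft)).upper():
        return "Sorry, I can't share that response."
    return draft
```

In practice such prompt-based filters are combined with dedicated moderation classifiers and policy-specific rubrics, but the input-screening and output-screening structure stays the same.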
Dr. Erica Cooper, National Institute of Information and Communications Technology, Japan
Title: Progress and Challenges in DNN-based Objective Quality Assessment of Synthesized Speech
Abstract: The field of speech synthesis has advanced rapidly in recent years, and evaluation methodologies for synthesized speech have evolved as well. While listening tests are the gold standard for evaluating synthesized speech, they are costly and time-consuming, leading researchers to consider more automatic and objective metrics for evaluation. In this paper, we give an overview of machine-learning-based approaches to predicting the quality of synthesized speech, with a focus on modern deep neural network (DNN) based approaches for MOS prediction, including supervised task-specific training, approaches built on pretrained self-supervised speech models, unsupervised approaches, and more recent approaches that make use of large language models. We will also discuss the current state of objective evaluation of synthesized speech, including open research challenges and future directions.
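As an illustration of the self-supervised route mentioned above, the following minimal sketch (not any specific published system) mean-pools wav2vec 2.0 features and regresses them to listener MOS labels; the checkpoint name, loss, and training details are assumptions for illustration only.

```python
# Minimal sketch (no specific published system) of SSL-based MOS prediction:
# mean-pooled wav2vec 2.0 features feed a linear regression head trained on
# listener MOS labels. Checkpoint name and training details are illustrative.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class SSLMOSPredictor(nn.Module):
    def __init__(self, ckpt="facebook/wav2vec2-base"):
        super().__init__()
        self.ssl = Wav2Vec2Model.from_pretrained(ckpt)      # pretrained SSL encoder
        self.head = nn.Linear(self.ssl.config.hidden_size, 1)

    def forward(self, wav):                                  # wav: (batch, samples), 16 kHz
        frames = self.ssl(wav).last_hidden_state             # (batch, frames, dim)
        return self.head(frames.mean(dim=1)).squeeze(-1)     # utterance-level MOS estimate

# Fine-tuning step sketch: regress predicted scores onto human MOS labels.
model = SSLMOSPredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-5)
wav, mos = torch.randn(2, 16000), torch.tensor([3.5, 4.2])   # placeholder batch
loss = nn.functional.l1_loss(model(wav), mos)
loss.backward()
opt.step()
```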
Bio: Erica Cooper completed the Ph.D. degree at Columbia University in the City of New York in 2019, with a research focus on text-to-speech synthesis for low-resource languages. She worked as a postdoctoral researcher at the National Institute of Informatics from February 2019 to March 2024, contributing to the JST-ANR CREST VoicePersonae project. She joined NICT as a senior researcher in April 2024. She is one of the founding organizers of the VoiceMOS Challenge series, which began in 2022.
Prof. Wen-Chin Huang, Graduate School of Informatics, Nagoya University, Japan
Title: Advancing Speech Quality Assessment Through Scientific Challenges and Open-source Activities
Abstract: Speech quality assessment (SQA) refers to the evaluation of speech quality, and developing accurate automatic SQA methods that reflect human perception has become increasingly important in order to keep up with the generative AI boom. In recent years, SQA has progressed to the point where researchers have started to faithfully use automatic SQA in research papers as a rigorous measure of the goodness of speech generation systems. We believe that recent scientific challenges and open-source activities have stimulated the growth of this field. In this paper, we review recent challenges as well as open-source implementations and toolkits for SQA, and highlight the importance of maintaining such activities to facilitate the development of not only SQA itself but also generative AI for speech.
Bio: Wen-Chin Huang is currently an assistant professor at the Graduate School of Informatics, Nagoya University, Japan. He received the B.S. degree from National Taiwan University, Taiwan, in 2018, and the M.S. and Ph.D. degrees from Nagoya University, Japan, in 2021 and 2024, respectively. He was a co-organizer of the Voice Conversion Challenge 2020, the Singing Voice Conversion Challenge 2023 and 2025, the VoiceMOS Challenge 2022, 2023, and 2024, and the AudioMOS Challenge 2025. His main research interest is speech processing, with a focus on speech generation-related fields, including voice conversion and speech quality assessment. He was the recipient of the Best Student Paper Award at ISCSLP 2018, the Best Paper Award at APSIPA ASC 2021, and the 16th IEEE Signal Processing Society Japan Best Student Journal Paper Award.
Dr. Ryandhimas E. Zezario, Research Center for Information Technology Innovation, Academia Sinica
Title: Non-Intrusive Intelligibility Prediction for Hearing Aids: Recent Advances, Trends, and Challenges
Abstract: Improving speech understanding in noisy environments is an important objective in the development of hearing aid (HA) devices. To support this objective, it is essential to have a reliable metric that can accurately predict speech intelligibility for HA users. While subjective listening tests remain the gold standard for intelligibility evaluation, they are costly and time-consuming. As a result, a series of deep learning–based approaches have been proposed to perform automatic evaluation. With the growing interest in deploying reliable neural speech assessment models, this talk aims to highlight recent advances in non-intrusive intelligibility prediction for HAs, where the goal is to estimate speech intelligibility without requiring clean reference signals. We discuss emerging trends, including the use of acoustic representations, the design of suitable loss functions, and integration with hearing aid signal processing pipelines. In addition, we examine challenges such as generalization and robustness across conditions, as well as the gap between predicted and ground-truth intelligibility. The talk concludes with perspectives on future directions for non-intrusive intelligibility prediction in hearing aid applications.
Bio: Ryandhimas E. Zezario received the Ph.D. degree in computer science and information engineering from National Taiwan University, Taipei, Taiwan, in 2023. He is currently a Postdoctoral Researcher with the Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan. His research interests include speech enhancement, non-intrusive quality assessment, speech processing, speech and speaker recognition, and deep learning. He received the Gold Prize for the Best Non-Intrusive Systems and first place in the Hearing Industry Research Consortium student prizes at the Clarity Prediction Challenge 2022, as well as the Best Reviewer Award at IEEE ASRU 2023.
Dr. Yu Tsao, Research Center for Information Technology Innovation, Academia Sinica
Title: Learning to Evaluate: Neural Speech Assessment for Downstream Speech Applications
Abstract: Neural speech assessment employs deep learning models to predict key speech properties, including intelligibility, perceptual quality, background noise level, and distortion. Unlike traditional metrics, these models are trained on large datasets to closely align with human perceptual judgments, making them effective across various acoustic conditions. They serve as objective tools for evaluating speech systems and datasets and can operate in real time to support online quality monitoring and adaptive processing. Additionally, neural speech assessment guides training for generative tasks such as text-to-speech, voice conversion, and speech enhancement by providing perceptually meaningful objectives. It also contributes to spatial audio tasks like beamforming by offering real-time direction and quality cues, making it essential for modern speech technology development and deployment.
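One hypothetical way to realize the "perceptually meaningful objectives" mentioned above is sketched below: a differentiable neural quality predictor (the placeholder quality_net) is kept out of the optimizer and used as an auxiliary loss when training a speech enhancement model. All names and weights are illustrative assumptions, not a prescribed recipe.

```python
# Hypothetical sketch: a pretrained, differentiable quality predictor (`quality_net`,
# a placeholder) serves as an auxiliary perceptual objective for a speech enhancer.
# Only the enhancer's parameters are assumed to be in `opt`, so the predictor is
# effectively frozen while gradients still flow through it to the enhancer.
import torch
import torch.nn as nn

def perceptual_training_step(enhancer: nn.Module,
                             quality_net: nn.Module,
                             noisy: torch.Tensor,
                             clean: torch.Tensor,
                             opt: torch.optim.Optimizer,
                             alpha: float = 0.1) -> float:
    quality_net.eval()                                       # inference mode for the predictor
    enhanced = enhancer(noisy)
    signal_loss = nn.functional.l1_loss(enhanced, clean)     # waveform-level loss
    perceptual_loss = -quality_net(enhanced).mean()          # maximize predicted quality
    loss = signal_loss + alpha * perceptual_loss             # weighted combination
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```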
Bio: Yu Tsao received the B.S. and M.S. degrees in Electrical Engineering from National Taiwan University, Taipei, Taiwan, in 1999 and 2001, respectively, and the Ph.D. degree in Electrical and Computer Engineering from the Georgia Institute of Technology, Atlanta, GA, USA, in 2008. From 2009 to 2011, he was a Researcher at the National Institute of Information and Communications Technology (NICT), Tokyo, Japan, where he conducted research and product development in multilingual speech-to-speech translation systems, focusing on automatic speech recognition. He is currently a Research Fellow (Professor) and the Deputy Director at the Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan. He also holds a joint appointment as a Professor in the Department of Electrical Engineering at Chung Yuan Christian University, Taoyuan, Taiwan. His research interests include assistive oral communication technologies, audio coding, and bio-signal processing. He serves as an Associate Editor for IEEE Transactions on Consumer Electronics and IEEE Signal Processing Letters. He received the Outstanding Research Award from Taiwan's National Science and Technology Council (NSTC) and the 2025 IEEE Chester W. Sall Memorial Award, and he was the corresponding author of a paper that won the 2021 IEEE Signal Processing Society Young Author Best Paper Award.
Jiahong Yuan, Interdisciplinary Research Center for Linguistic Sciences, University of Science and Technology of China
https://fusep.ustc.edu.cn/2025/01/15/jiahong-yuan/
Title: Normalization through Fine-tuning: Understanding Wav2vec2.0 Embeddings for Phonetic Analysis
Abstract: Phonetic normalization is essential in speech recognition and analysis, ensuring the comparability of features extracted from raw audio data. In the current paradigm of fine-tuning pre-trained large transformer models, however, it is not treated as a necessary preprocessing step; rather, it is implicitly carried out within the models themselves. This study examines how normalization emerges in transformer-based speech models, with a focus on Wav2vec2.0. Understanding this process is a crucial step toward effectively leveraging model embeddings as phonetic features for speech analysis. Our results show that fine-tuning Wav2vec2.0 can achieve phonetic normalization by selectively suppressing irrelevant information, such as speaker sex, while enhancing task-relevant information, such as tones and finals. In multi-task learning, however, the embeddings of fine-tuned models retain information for each task without compromising performance, suggesting that suppressing task-irrelevant information is not strictly required for effective classification. These results demonstrate that phonetic normalization can be flexibly achieved in speech models, providing insights into potential parallels in human speech perception.
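A common way to carry out the kind of layer-wise analysis described above is attribute probing. The sketch below, which assumes a small labelled set of 16 kHz utterances and uses scikit-learn classifiers purely for illustration, measures how decodable an attribute (e.g., speaker sex or tone/final identity) is from mean-pooled Wav2vec2.0 embeddings at each layer; running it on pretrained versus fine-tuned checkpoints indicates which information fine-tuning suppresses or enhances.

```python
# Probing sketch (assumes a small labelled set of 16 kHz utterances; checkpoint
# and classifier choices are illustrative): measure how decodable an attribute
# is from mean-pooled embeddings at every wav2vec 2.0 layer.
import torch
from transformers import Wav2Vec2Model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def layerwise_probe(wavs, labels, ckpt="facebook/wav2vec2-base"):
    """wavs: (num_utts, samples) float tensor; labels: attribute labels (e.g., sex, tone)."""
    model = Wav2Vec2Model.from_pretrained(ckpt).eval()
    with torch.no_grad():
        hidden = model(wavs, output_hidden_states=True).hidden_states  # one tensor per layer
    accuracies = []
    for layer_feats in hidden:
        x = layer_feats.mean(dim=1).numpy()                  # utterance-level embedding
        acc = cross_val_score(LogisticRegression(max_iter=1000), x, labels, cv=3).mean()
        accuracies.append(acc)                               # decodability of the attribute
    return accuracies                                        # compare pretrained vs. fine-tuned
```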
Mark Liberman, Department of Linguistics, University of Pennsylvania, USA
https://www.ling.upenn.edu/~myl/
Title: Speech-To-Text in Clinical and Educational Applications
Abstract: Automatic Speech-To-Text systems now work very well — for certain kinds of speech and certain kinds of applications. Since today’s systems are mostly trained on read or rehearsed speech, and are meant to produce readable transcripts, they don’t do well with disfluencies, mispronunciations, and other features that are frequent in spontaneous speech or non-fluent reading. In some clinical and educational applications, correct recognition and classification of these features is crucial. This talk will describe the issues and suggest next steps.
Chi-Chun Lee, Department of Electrical Engineering, National Tsing Hua University
https://biic.ee.nthu.edu.tw/biicers.php#
Title: Enabling Internationalization of Affective Speech Technology using LLMs
Abstract: Affective speech technology aspires to equip machines with the ability to sense, interpret, and generate emotionally expressive speech, enabling empathetic assistants, social robots, and digital health companions. Large Audio/Speech Language Models (LALMs/SpeechLMs) now dominate this space: a single model can perform speech recognition, affect detection, and emotion-controlled synthesis, achieving impressive zero-shot generalization. However, we argue that LALMs are not yet internationalized: culturally grounded affect is misread when training data are skewed, leading to mis-recognition of affect, culturally inappropriate responses, and uneven user experiences. This paper surveys the current state of affective speech processing with LALMs, cataloging leading models, their sensing-to-synthesis capabilities, and the databases and metrics used for evaluation. We identify the key obstacle to responsible deployment: the heterogeneity of human vocal expression across cultures, which manifests as data scarcity, model bias, and evaluation blind spots. To address this gap, we propose a research agenda comprising: (i) systematic analysis of cultural variation in vocal affect, (ii) computational strategies for contextualizing LALMs toward culturally sensitive emotion processing, and (iii) benchmarks featuring balanced corpora and culture-aware metrics. By charting these directions, we aim to advance affective speech technology that is globally robust, socially responsible, and truly inclusive.
Emmanuele Chersoni, Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University
https://research.polyu.edu.hk/en/persons/emmanuele-chersoni
Title: Can Large Language Models Help in the Psycholinguistic Data Collection?
Abstract: In linguistics and Natural Language Processing (NLP), it is common practice to collect linguistic data from speakers via surveys and interviews. The recent wave of interest in the performance of Large Language Models (LLMs) has sparked a debate about whether such data can be replaced by the automatic generations of machine participants (Kim and Lee, 2023; Kuzman et al., 2023; Pavlovic and Poesio, 2024; Kapania et al., 2025), using prompts that closely resemble the questions asked to human participants. In our contribution, we discuss the use of LLMs to collect a commonly used type of annotation in linguistic research: psycholinguistic norms. If norms can be automatically acquired via LLM prompting, this could spare researchers the need for extensive data collection and simplify the acquisition of new datasets for low-resource languages. We describe a few recent studies that reported slightly contrasting results (Brysbaert et al., 2024; Xu et al., 2025; Peng et al., 2025), and we try to explain the contrast in light of the current debate about the advantages and limitations of LLMs.
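As a minimal sketch of what LLM-based norm collection can look like (assuming a generic query_llm(prompt) -> str interface; the prompt wording, scale, and repetition count are illustrative), one can mirror the instructions given to human participants in concreteness norming studies and average repeated model ratings per word:

```python
# Sketch of LLM-based norm collection. `query_llm` is a hypothetical chat interface;
# the prompt, scale, and number of repetitions are illustrative assumptions.
NORM_PROMPT = (
    "On a scale from 1 (very abstract) to 5 (very concrete), how concrete is the "
    "meaning of the word '{word}'? Answer with a single number."
)

def collect_concreteness_norms(words, query_llm, n_runs=5):
    norms = {}
    for word in words:
        ratings = []
        for _ in range(n_runs):                              # average over sampling noise
            reply = query_llm(NORM_PROMPT.format(word=word))
            digits = [c for c in reply if c.isdigit()]
            if digits:
                ratings.append(min(max(int(digits[0]), 1), 5))  # clamp to the rating scale
        if ratings:
            norms[word] = sum(ratings) / len(ratings)        # mean rating, as in human norms
    return norms
```

The resulting per-word means can then be correlated with existing human norms to assess whether the LLM-derived ratings are a usable substitute.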