2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
22-24 October 2025, Singapore
Tutorials
Tutorial 1: Recent Advances in End-to-End Learned Image and Video Coding
Time: Wednesday, 22 Oct 2025, 08:00-11:30am
Venue: Lotus I
Presenters: Prof. Heming Sun and Prof. Wen-Hsiao Peng
Part I: Overview of Learned Image/Video Coding (by Prof. Peng; 15 mins)
Introduction to end-to-end learned image and video coding
The rate-distortion performance of SOTA learned image/video codecs
Standardization activities on neural image/video coding in JPEG and MPEG
Part II: End-to-End Learned Image Coding (by Prof. Sun; 70 mins)
Elements of end-to-end learned image coding
Review of a few notable tool features (e.g. fast context models)
Network pruning and quantization for learned image codecs
Implicit Neural Representation (INR)-based image coding systems
Real-time implementation of learned image codecs
Coffee Break (20 mins)
Part III: End-to-End Learned Video Coding (by Prof. Peng; 60 mins)
End-to-end learned video coding frameworks: residual coding, conditional coding, and conditional residual coding
Review of some notable systems
The explicit, implicit, and hybrid temporal buffering strategies
The rate-distortion-complexity trade-offs from the perspectives of coding frameworks and buffering strategies
Network quantization for learned video codecs
Part IV: Emerging Topics and Concluding Remarks (30 mins)
Emerging learned coding techniques for 3D/4D Gaussian Splatting and multi-modal large language models
Open issues and concluding remarks
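The learned codecs covered in Parts II and III are typically trained end-to-end with a joint rate-distortion objective L = R + λD, trading bitrate against reconstruction quality. A minimal numeric sketch of that trade-off (the function names, operating points, and λ value below are illustrative, not taken from the tutorial):

```python
import math

def rd_cost(rate_bpp, mse, lam):
    """Joint rate-distortion cost L = R + lambda * D used to train learned codecs."""
    return rate_bpp + lam * mse

def mse_to_psnr(mse, max_val=255.0):
    """Convert mean squared error to PSNR in dB for 8-bit images."""
    return 10.0 * math.log10(max_val ** 2 / mse)

# Two hypothetical operating points of a learned image codec:
# a low-rate point with higher distortion, and a high-rate point.
points = {"low-rate": {"rate_bpp": 0.25, "mse": 40.0},
          "high-rate": {"rate_bpp": 0.75, "mse": 12.0}}

lam = 0.5  # larger lambda penalizes distortion more, pushing toward higher rate
for name, pt in points.items():
    cost = rd_cost(pt["rate_bpp"], pt["mse"], lam)
    print(f"{name}: L = {cost:.2f}, PSNR = {mse_to_psnr(pt['mse']):.1f} dB")
```

Sweeping λ and retraining at each value is how a learned codec traces out its rate-distortion curve.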
Tutorial 2: Deep Speaker Modeling: Theories, Applications and Practice
Time: Wednesday, 22 Oct 2025, 08:00-11:30am
Venue: Lotus II
Presenters: Shuai Wang, Yanmin Qian, and Haizhou Li
Part I: Foundations and Recent Advances (60 mins)
Foundational theories and review of traditional methods in speaker modeling
Evolution of speaker representation techniques in the deep learning era
From i-vector to various deep speaker representations
Applications of self-supervised and semi-supervised learning in speaker modeling
Analysis of speaker representation capabilities in foundation speech models
Leveraging pretrained large models
Part II: Applications Beyond Recognition (60 mins)
Speaker-adaptive speech synthesis
Voice cloning technologies and ethical considerations
Speaker representation in few-shot and zero-shot speech synthesis
Personalized voice conversion systems
Speaker perception in multimodal human-computer interaction
Target speaker speech processing
Target speaker extraction
Target speaker speech recognition
Target speaker verification
Personalized VAD
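The target speaker tasks above all rest on comparing fixed-dimensional speaker embeddings; for verification, a common decision rule is cosine similarity against a threshold. A minimal sketch with toy vectors (real systems extract embeddings with a trained network, e.g. via a toolkit such as Wespeaker; the vectors and threshold here are illustrative assumptions):

```python
import math

def cosine_score(a, b):
    """Cosine similarity between two speaker embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def verify(enroll, test, threshold=0.5):
    """Accept the trial as same-speaker if the score exceeds the threshold."""
    return cosine_score(enroll, test) >= threshold

# Toy 4-dimensional "embeddings" (real ones are typically 192-512 dims).
enrolled = [0.9, 0.1, 0.3, 0.2]
same_spk = [0.8, 0.2, 0.25, 0.3]   # close in direction -> high score
diff_spk = [-0.1, 0.9, -0.4, 0.1]  # different direction -> low score

print(verify(enrolled, same_spk))
print(verify(enrolled, diff_spk))
```

In practice the threshold is calibrated on a development set, e.g. at the equal error rate operating point.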
Part III: Challenges and Countermeasures (30 mins)
Domain adaptation and domain-invariant learning
Privacy-preserving speaker representations
Robustness and adversarial attack defense
Computational efficiency and model compression
Explainability techniques for speaker models
Part IV: Practical Implementation (30 mins)
Introduction to tools and frameworks
Wespeaker toolkit for speaker embedding learning
Wesep toolkit for target speech extraction
Case studies and demonstrations
Interactive discussion and Q&A session
Tutorial 3: From Detection to Direction: An Overview of Sound Event Localization and Detection
Time: Wednesday, 22 Oct 2025, 08:00-11:30am
Venue: Hibiscus III
Presenters: Jun Wei Yeow and Ee-Leng TAN
Part I: Overview of Sound Event Localization and Detection (SELD) (30 mins)
Introduction to SELD and its applications
History of SELD and its component tasks (Sound Event Detection and Sound Source Localization)
Recent advances and challenges in SELD
Publicly available SELD datasets
Part II: Core Technical Components of SELD (60 mins)
Spatial audio formats used for SELD, including First Order Ambisonics, microphone array signals, and binaural recordings
Contemporary feature extraction techniques that capture spatiotemporal cues needed for robust event detection and localization
Deep learning architectures designed for SELD, including convolutional recurrent networks (CRNNs), transformer-based models, and multi-branch or multi-task setups
Training strategies, such as multi-task learning (joint DOA and event classification), data augmentation for spatial audio, and domain adaptation techniques
Benchmark datasets and metrics, including a deep dive into the DCASE Challenge series as well as evaluation criteria such as localization errors, detection accuracies, and combined SELD scores
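The localization-error criterion mentioned above is commonly the great-circle angle between the estimated and reference directions of arrival (DOAs). A minimal sketch of that angular error computed on unit vectors (the azimuth/elevation values are illustrative):

```python
import math

def doa_to_unit_vector(azimuth_deg, elevation_deg):
    """Convert a DOA (azimuth, elevation) in degrees to a 3D unit vector."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    return (math.cos(el) * math.cos(az),
            math.cos(el) * math.sin(az),
            math.sin(el))

def angular_error_deg(ref, est):
    """Great-circle angle (degrees) between reference and estimated DOAs."""
    u = doa_to_unit_vector(*ref)
    v = doa_to_unit_vector(*est)
    dot = sum(a * b for a, b in zip(u, v))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))

# Reference event at (30 deg azimuth, 10 deg elevation); the estimate is off
# by 20 deg in azimuth only, giving slightly under 20 deg of angular error.
print(angular_error_deg((30.0, 10.0), (50.0, 10.0)))
```

Averaging this error over correctly detected events yields the class-dependent localization error reported in the DCASE SELD evaluations.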
Coffee Break (30 mins)
Part III: Advanced and Emerging Topics (60 mins)
Semi-supervised and weakly labelled learning approaches
Robustness to reverberation, overlapping events, and unseen acoustic scenes
Multi-modal SELD systems that integrate complementary modalities, such as video recordings or motion sensors
Leveraging complementary acoustic scene classification (ASC) to improve SELD performance
Part IV: Real-Time Implementation of SELD (40 mins)
Real-time constraints and considerations
Lightweight models suitable for real-time and edge applications
Discussion and Q&A session
Tutorial 4: Adaptive Sensor Networks in Digital Health
Time: Wednesday, 22 Oct 2025, 08:00-11:30am
Venue: Peony I
Presenter: Prof. Saeid Sanei
This tutorial shows the importance of distributed networks and cooperation, concepts borrowed from the multi-agent communication systems domain, in modelling industrial, biological, and diagnostic systems. In many patient monitoring systems, such as multichannel EEG, electromyography (EMG), and electrocardiography (ECG), as well as industrial sensors such as smart meter networks, the sensor data can be aggregated in an adaptive manner. Adaptive cooperative networks are also used to model single- or multi-task systems in which the agents pursue multiple targets. In industry, deploying smart meter networks in a household area and transferring information between the smart meters can greatly reduce the peak energy supply. On the clinical side, an adaptive network can be devised to use multichannel EEG to translate brain function into body movement, or to model the link between two brains in a multi-subject (a.k.a. hyperscanning) scenario. Distributed array processing (beamforming) can improve the localization of brain responses to deep brain single-pulse electrical stimulation (SPES), applicable to the diagnosis of drug-resistant epileptic seizures. Likewise, a cooperative particle filtering approach can significantly enhance the identification and tracking of event-related potentials (ERPs) for monitoring brain degenerative diseases, fatigue, or cognitive deterioration. The tutorial runs for three hours, with an approximately 30-minute tea/coffee break in between.
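The adaptive cooperative networks described above are often realized with diffusion adaptation, in which each node takes a local LMS step on its own data and then averages its estimate with its neighbors. A minimal adapt-then-combine sketch on a toy three-node network (the topology, step size, and data model are illustrative assumptions, not the tutorial's own setup):

```python
import random

random.seed(0)

# Common unknown parameter vector that all nodes try to estimate.
w_true = [1.0, -2.0]

# Neighborhoods with uniform combination weights (all-to-all in this toy case;
# each node combines with every node, including itself).
neighbors = {0: [0, 1, 2], 1: [0, 1, 2], 2: [0, 1, 2]}
mu = 0.05  # LMS step size

w = [[0.0, 0.0] for _ in range(3)]  # per-node estimates

for _ in range(2000):
    # Adapt: each node takes a local LMS step on its own noisy measurement.
    psi = []
    for k in range(3):
        x = [random.gauss(0, 1), random.gauss(0, 1)]               # regressor
        d = sum(a * b for a, b in zip(w_true, x)) + random.gauss(0, 0.1)
        err = d - sum(a * b for a, b in zip(w[k], x))
        psi.append([wk + mu * err * xk for wk, xk in zip(w[k], x)])
    # Combine: average the intermediate estimates over each neighborhood.
    for k in range(3):
        nb = neighbors[k]
        w[k] = [sum(psi[j][i] for j in nb) / len(nb) for i in range(2)]

print(w[0])  # each node's estimate converges near w_true
```

Replacing the uniform weights with estimated network-connectivity weights, as in the outline below, lets the combination step favor better-informed neighbors.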
Part I: Outline of the Tutorial and the Material to Be Presented (by Prof. Saeid Sanei; ~3 hrs including tea/coffee break)
From adaptive filters to adaptive cooperative networks
Distributed sensor networks: definitions, examples, and applications
Adaptive cooperative network topologies
Single- and multi-task networks, optimizations, and applications
Estimation of network connectivity (information transfer) for accurate setting of the combination weights
Body sensor networks and their clinical applications
Cooperative systems in brain computer interfacing (BCI)
Distributed beamforming for seizure source localization of interictal epileptiform discharges and delayed/late brain responses to deep brain electrical stimulation
Distributed particle filtering for tracking brain event-related potentials (ERPs)
Distributed systems for crowd monitoring
Wider applications of cooperative systems (biological modelling, network security, and energy distribution)