Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models

Nikita Kuzmin1,2*, Songting Liu1*, Kong Aik Lee3, Eng Siong Chng1

*Equal contribution.

1Nanyang Technological University, Singapore
2Institute for Infocomm Research, A*STAR, Singapore
3The Hong Kong Polytechnic University, Hong Kong

s220028@e.ntu.edu.sg, lius0114@e.ntu.edu.sg

Protecting speaker identity is crucial for online voice applications, yet streaming speaker anonymization (SA) remains underexplored. Recent research has demonstrated that neural audio codecs (NACs) provide superior speaker-feature disentanglement and linguistic fidelity. NACs can also be paired with causal language models (LMs) to enhance linguistic fidelity and prompt control for streaming tasks. However, existing NAC-based online LM systems are designed for voice conversion (VC) rather than anonymization and lack the techniques required for privacy protection. Building on these advances, we present Stream-Voice-Anon, which adapts modern causal LM-based NAC architectures to streaming SA by integrating anonymization techniques. Our approach incorporates pseudo-speaker representation sampling, speaker embedding mixing, and diverse prompt-selection strategies for LM conditioning that leverage the disentanglement properties of quantized content codes to prevent speaker-information leakage. We also compare dynamic and fixed delay configurations to explore latency-privacy trade-offs in real-time scenarios. Under the VoicePrivacy 2024 Challenge protocol, Stream-Voice-Anon achieves substantial improvements in intelligibility (up to 45% relative WER reduction) and emotion preservation (up to 40% relative UAR gain) over the previous state-of-the-art streaming method DarkStream, while maintaining comparable latency (180 ms vs. 200 ms) and privacy protection against lazy-informed attackers, though it shows a 15% relative degradation against semi-informed attackers.
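To give a rough intuition for the speaker-embedding-mixing idea mentioned above, the sketch below averages several randomly chosen speaker embeddings from an external pool into a single unit-norm pseudo-speaker vector. This is a simplified illustration under our own assumptions (the function name, pool layout, and mixing scheme are hypothetical), not the exact procedure used in Stream-Voice-Anon:

```python
import numpy as np

def sample_pseudo_speaker(pool, k=10, rng=None):
    """Illustrative pseudo-speaker sampling: average k randomly chosen
    speaker embeddings from an external pool, then renormalize so the
    result lies on the unit sphere like a real speaker embedding.

    pool: array of shape (num_speakers, embedding_dim)
    """
    rng = np.random.default_rng(rng)
    # Pick k distinct source speakers from the pool.
    idx = rng.choice(len(pool), size=k, replace=False)
    # Mix them by simple averaging; the mean of many speakers tends
    # not to match any individual source identity.
    mix = pool[idx].mean(axis=0)
    return mix / np.linalg.norm(mix)
```

In practice such a mixed embedding replaces the source speaker's embedding when conditioning the decoder, so the output voice matches none of the pool speakers.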

System Architecture



(a) Training: Training pipeline showing the speaker encoder, content encoder, and acoustic encoder modules.


(b) Inference: Inference pipeline demonstrating real-time anonymization with prompt selection and acoustic decoding.

Figure 1: Training and inference pipelines of the Stream-Voice-Anon system.

Baseline Comparison

Audio Quality Comparison

Comparison of our method with state-of-the-art streaming anonymization approaches. Samples are drawn from the CMU-ARCTIC corpus (http://www.festvox.org/cmu_arctic/) for comparison with DarkStream.

Speaker | Transcript | Original | DarkStream (Wav+CL+KM) | Stream-Voice-Anon (Ours)
BDL | "For the twentieth time that evening the two men shook hands." | (audio) | (audio) | (audio)
CLB | "God bless 'em I hope I will go on seeing them forever." | (audio) | (audio) | (audio)
RMS | "He turned sharply and faced Gregson across the table." | (audio) | (audio) | (audio)
SLT | "Gregson shoved back his chair and rose to his feet." | (audio) | (audio) | (audio)

Dynamic Delay Control

Audio Comparison

Our system enables dynamic delay control for adjustable latency-quality trade-offs at inference time without retraining. We show examples with different delays: d=0 (90ms), d=1 (130ms), d=2 (180ms), and d=8 (440ms).

Speaker | Transcript | Original | delay=0 (90ms) | delay=1 (130ms) | delay=2 (180ms) | delay=8 (440ms)
BDL | "Not at this particular case Tom apologized Whittemore." | (audio) | (audio) | (audio) | (audio) | (audio)
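The mechanics of a fixed lookahead delay can be sketched as a toy streaming loop: output for chunk t is emitted only once chunks t..t+d have arrived, so larger d buys more future context at the cost of latency. This is our own simplified illustration (the chunking, function names, and flushing behavior are assumptions, not the system's actual implementation):

```python
from collections import deque

def stream_with_lookahead(chunks, d, process):
    """Toy model of a fixed lookahead delay d in a streaming system.

    The output for the oldest buffered chunk is produced only after d
    further chunks have arrived, so each emission sees up to d chunks
    of future context. Latency grows with d; d=0 is purely causal.
    """
    buf = deque()
    out = []
    for chunk in chunks:
        buf.append(chunk)
        if len(buf) > d:
            # Enough lookahead accumulated for the oldest chunk:
            # process it with its d-chunk future context, then drop it.
            out.append(process(list(buf)))
            buf.popleft()
    while buf:
        # End of stream: flush remaining chunks with whatever
        # (shrinking) context is left.
        out.append(process(list(buf)))
        buf.popleft()
    return out
```

With d=0 each chunk is processed alone as soon as it arrives; with d=1 the first output waits for the second chunk, and so on, mirroring how the 90 ms/130 ms/180 ms/440 ms configurations trade latency for context.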

Citation

If you find this work useful, please cite:

@misc{kuzmin2026streamvoiceanonenhancingutilityrealtime,
      title={Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models}, 
      author={Nikita Kuzmin and Songting Liu and Kong Aik Lee and Eng Siong Chng},
      year={2026},
      eprint={2601.13948},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2601.13948}, 
}

Links & Contact