Stream-Voice-Anon: Real-Time Speaker Anonymization

Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models

Nikita Kuzmin¹,²*, Songting Liu¹*, Kong Aik Lee³, Eng Siong Chng¹

*Equal contribution.

¹Nanyang Technological University, Singapore
²Institute for Infocomm Research, A*STAR, Singapore
³The Hong Kong Polytechnic University, Hong Kong

s220028@e.ntu.edu.sg, lius0114@e.ntu.edu.sg

Protecting speaker identity is crucial for online voice applications, yet streaming speaker anonymization (SA) remains underexplored. Recent research has demonstrated that neural audio codecs (NACs) provide superior speaker feature disentanglement and linguistic fidelity. NACs can also be combined with causal language models (LMs) to enhance linguistic fidelity and prompt control for streaming tasks. However, existing NAC-based online LM systems are designed for voice conversion (VC) rather than anonymization, and they lack the techniques required for privacy protection. Building on these advances, we present Stream-Voice-Anon, which adapts modern causal LM-based NAC architectures specifically for streaming SA by integrating anonymization techniques. Our anonymization approach incorporates pseudo-speaker representation sampling, speaker embedding mixing, and diverse prompt selection strategies for LM conditioning that leverage the disentanglement properties of quantized content codes to prevent speaker information leakage. Additionally, we compare dynamic and fixed delay configurations to explore latency-privacy trade-offs in real-time scenarios. Under the VoicePrivacy 2024 Challenge protocol, Stream-Voice-Anon achieves substantial improvements in intelligibility (up to 45% relative WER reduction) and emotion preservation (up to 40% relative UAR improvement) compared to the previous state-of-the-art streaming method, DarkStream, while maintaining comparable latency (180ms vs 200ms) and privacy protection against lazy-informed attackers, though it shows a 15% relative degradation against semi-informed attackers.
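As a rough illustration of what "speaker embedding mixing" can look like, the sketch below forms a pseudo-speaker embedding as a random convex combination of embeddings drawn from a pool, then renormalizes it. This is a minimal sketch of one common way to mix embeddings, not the method used in Stream-Voice-Anon: the function names, the pool size, the embedding dimension, and the weighting scheme are all assumptions made for the example.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def mix_speaker_embeddings(pool, num_speakers, rng):
    """Build a pseudo-speaker embedding from a random subset of a pool.

    Hypothetical helper: samples `num_speakers` embeddings, draws random
    convex weights, takes the weighted average, and renormalizes.
    """
    if not 1 <= num_speakers <= len(pool):
        raise ValueError("num_speakers must be between 1 and len(pool)")
    chosen = rng.sample(pool, num_speakers)
    # Random positive weights, normalized so they sum to 1.
    raw = [rng.random() + 1e-6 for _ in chosen]
    total = sum(raw)
    weights = [w / total for w in raw]
    dim = len(chosen[0])
    mixed = [sum(w * emb[i] for w, emb in zip(weights, chosen)) for i in range(dim)]
    return l2_normalize(mixed)


# Toy pool of 10 random 4-dimensional "speaker embeddings" (illustrative only;
# real speaker embeddings would come from a speaker encoder).
rng = random.Random(0)
pool = [l2_normalize([rng.gauss(0.0, 1.0) for _ in range(4)]) for _ in range(10)]
pseudo_speaker = mix_speaker_embeddings(pool, num_speakers=3, rng=rng)
```

Because the result blends several speakers, it does not match any single pool entry exactly, which is the intuition behind using mixing to obtain a pseudo-speaker.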

System Architecture

(a) Training: Training pipeline showing the speaker encoder, content encoder, and acoustic encoder modules.

(b) Inference: Inference pipeline demonstrating real-time anonymization with prompt selection and acoustic decoding.

Figure 1: Training and inference pipelines of the Stream-Voice-Anon system.

Baseline Comparison

Audio Quality Comparison

Comparison of our method with state-of-the-art streaming anonymization approaches. Samples are taken from the CMU-ARCTIC corpus (http://www.festvox.org/cmu_arctic/) to allow a direct comparison with DarkStream.

Speaker	Transcript	Original	DarkStream (Wav+CL+KM)	Stream-Voice-Anon (Ours)
BDL	"For the twentieth time that evening the two men shook hands."
CLB	"God bless 'em I hope I will go on seeing them forever."
RMS	"He turned sharply and faced Gregson across the table."
SLT	"Gregson shoved back his chair and rose to his feet."

Dynamic Delay Control

Audio Comparison

Our system enables dynamic delay control for adjustable latency-quality trade-offs at inference time without retraining. We show examples with different delays: d=0 (90ms), d=1 (130ms), d=2 (180ms), and d=8 (440ms).

Speaker	Transcript	Original	delay=0 (90ms)	delay=1 (130ms)	delay=2 (180ms)	delay=8 (440ms)
BDL	"Not at this particular case Tom apologized Whittemore."
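The delay settings listed in this section can be written down as a small lookup table. The sketch below picks the largest reported delay that fits a latency budget, assuming that a larger delay gives the model more context and therefore better quality. The latency values come from the text above; the helper name `pick_delay` and the selection rule are hypothetical and are not part of the released system. The reported latencies are not an exact linear function of `d`, so the sketch does not interpolate between them.

```python
# Delay setting d -> latency in milliseconds, as reported in this section.
REPORTED_LATENCY_MS = {0: 90, 1: 130, 2: 180, 8: 440}


def pick_delay(latency_budget_ms):
    """Return the largest reported delay whose latency fits the budget.

    Hypothetical helper; returns None if even d=0 exceeds the budget.
    """
    feasible = [d for d, ms in REPORTED_LATENCY_MS.items() if ms <= latency_budget_ms]
    return max(feasible) if feasible else None


print(pick_delay(200))  # 2
print(pick_delay(50))   # None
```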

Citation

If you find this work useful, please cite:

@misc{kuzmin2026streamvoiceanonenhancingutilityrealtime,
      title={Stream-Voice-Anon: Enhancing Utility of Real-Time Speaker Anonymization via Neural Audio Codec and Language Models}, 
      author={Nikita Kuzmin and Songting Liu and Kong Aik Lee and Eng Siong Chng},
      year={2026},
      eprint={2601.13948},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2601.13948}, 
}

Links & Contact

For questions or collaboration inquiries, please open an issue on GitHub or email s220028@e.ntu.edu.sg or lius0114@e.ntu.edu.sg.