Abstract
To ensure the reliable operation of speech systems across diverse environments, noise addition methods have emerged as the standard solution. However, existing methods offer limited coverage of real-world scenes and depend on pre-existing noise libraries and scene metadata. This paper presents prompt-based Dynamic Generative Scene-based Noise Addition (DGSNA), a novel approach driven by generative language models that integrates Dynamic Generation of Scene-based Information (DGSI) with Scene-based Noise Addition for Speech (SNAS). The DGSI module, with a BET (Background, Examples, Task) prompt framework, dynamically generates logic-compliant scene-based information, including scene dimensions, sound sources, and microphone positions, thereby addressing the challenges of scene enumeration and detailed description. Complementing this, the SNAS module employs a Time–Frequency Diffusion-based (TFD) Text-to-Audio model to synthesize scene-specific noise. By integrating this noise with clean speech via Room Impulse Response (RIR) filters, the module streamlines the traditionally labor-intensive process of replicating diverse acoustic environments. Experimental results show that DGSNA significantly enhances the robustness of speech recognition and keyword spotting models, achieving relative improvements of up to 11.32%. Furthermore, DGSNA is highly compatible with existing noise addition techniques.
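As a rough illustration of the final mixing step described above, the following minimal sketch convolves clean speech with a room impulse response and then adds noise at a chosen signal-to-noise ratio. It is not the authors' implementation: the function name `add_scene_noise`, the `snr_db` parameter, and the choice to reverberate only the speech (treating the generated noise as already rendered for the scene) are assumptions made for this example.

```python
import numpy as np


def add_scene_noise(speech, noise, rir, snr_db):
    """Reverberate speech with an RIR, then add noise at a target SNR in dB.

    Hypothetical helper for illustration only; assumes 1-D float arrays
    at the same sample rate and a noise signal with non-zero power.
    """
    # Apply the room impulse response and trim back to the speech length.
    reverberant = np.convolve(speech, rir)[: len(speech)]

    # Loop or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(reverberant) / len(noise)))
    noise = np.tile(noise, reps)[: len(reverberant)]

    # Scale the noise so that speech_power / scaled_noise_power
    # equals 10 ** (snr_db / 10).
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))

    return reverberant + scale * noise
```

Lower `snr_db` values give harder training conditions; a full pipeline would presumably also place each noise source in the room with its own impulse response, which this sketch leaves out.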