Abstract

This document presents a technical research framework for Bengali automatic speech recognition (ASR), comprising text, architectural diagrams, mathematical equations, performance graphs, and comparative data tables. It introduces an end-to-end Bengali ASR system built on a Conformer-CTC backbone augmented with a multi-level embedding fusion mechanism that simultaneously incorporates phoneme-level, syllable-level, and wordpiece-level linguistic representations. Bengali, spoken by over 230 million people, is a morphologically rich, low-resource language whose complex conjunct consonants, diacritics, and syllable structures pose significant challenges for conventional acoustic modeling. The proposed architecture processes raw audio through a preprocessing pipeline of resampling, silence trimming, log-Mel spectrogram extraction (short-time Fourier transform followed by a Mel filterbank), SpecAugment data augmentation, and feature normalization. The resulting acoustic features are first encoded by an early Conformer stage of twelve Conformer blocks, after which three parallel Transformer-based embedding networks independently encode the acoustic representation at phoneme, syllable, and wordpiece granularities. These complementary linguistic embeddings are fused with the acoustic representation by element-wise summation, and the fused representation is refined by a late Conformer encoder of twenty-four additional Conformer blocks before decoding via Connectionist Temporal Classification (CTC). The model was trained and evaluated on the OpenSLR Large Bengali ASR dataset, comprising approximately 196,000 utterances, 204,905 FLAC audio files, 45,653 unique words, and over 181 hours of speech.
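The fusion and decoding steps described above can be illustrated with a minimal sketch. It assumes that the acoustic representation and the three linguistic embeddings share the same shape (T frames by D dimensions), and it uses greedy best-path CTC decoding with blank index 0; the abstract does not state the decoding strategy or the blank index, so both are assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence

Frames = Sequence[Sequence[float]]  # a [T][D] sequence of feature frames


def fuse_by_summation(acoustic: Frames, *linguistic: Frames) -> List[List[float]]:
    """Element-wise sum of the acoustic stream and any number of
    equally shaped embedding streams (e.g. phoneme, syllable, wordpiece)."""
    if not acoustic:
        return []
    streams = (acoustic, *linguistic)
    n_frames, dim = len(acoustic), len(acoustic[0])
    for stream in streams:
        if len(stream) != n_frames or any(len(f) != dim for f in stream):
            raise ValueError("all streams must share the same [T][D] shape")
    return [
        [sum(stream[t][d] for stream in streams) for d in range(dim)]
        for t in range(n_frames)
    ]


def ctc_greedy_collapse(frame_ids: Sequence[int], blank: int = 0) -> List[int]:
    """Collapse a per-frame argmax label sequence the CTC way:
    merge consecutive repeats, then drop blank symbols.
    The blank index of 0 is an assumption for illustration."""
    output: List[int] = []
    previous = None
    for label in frame_ids:
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return output


if __name__ == "__main__":
    acoustic = [[1.0, 2.0], [3.0, 4.0]]
    phoneme = [[0.5, 0.5], [0.5, 0.5]]
    syllable = [[1.0, 0.0], [0.0, 1.0]]
    wordpiece = [[0.0, 0.0], [0.0, 0.0]]
    print(fuse_by_summation(acoustic, phoneme, syllable, wordpiece))
    print(ctc_greedy_collapse([0, 3, 3, 0, 3, 5, 5, 0]))
```

Summation keeps the fused representation at the same dimensionality as the acoustic stream, which is why the late encoder can consume it without a projection layer; a blank between two identical labels is what lets CTC emit the same symbol twice in a row.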
Using the Adam optimizer with a learning rate of 1e-4, a batch size of 32, and 80-dimensional log-Mel features, the proposed system achieved a word error rate of 10.01 percent and a character error rate of 5.03 percent. Comparative analysis against recent state-of-the-art approaches employing GRU, LSTM-RNN, CNN-RNN, and Wav2Vec2 architectures demonstrates that the multi-level embedding Conformer framework yields superior performance on full sentence-level recognition, establishing a scalable and adaptable methodology applicable to other low-resource and morphologically complex languages.
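For reference, word error rate and character error rate are conventionally computed as the Levenshtein edit distance between hypothesis and reference, divided by the reference length. The sketch below follows that standard definition; it assumes whitespace tokenization for words and Unicode code points (spaces included) for characters, which may differ from the scoring setup used for the reported figures.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)
    computed with a rolling one-row dynamic programme."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated tokens (assumed tokenization)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over Unicode code points, spaces included
    (an assumption; grapheme-cluster scoring would give different values)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    reference = "আমি ভাত খাই"
    hypothesis = "আমি ভাত খায়"
    print(f"WER = {wer(reference, hypothesis):.4f}")
    print(f"CER = {cer(reference, hypothesis):.4f}")
```

Because Bengali conjuncts and vowel signs span several code points, code-point CER can penalize a single visually wrong glyph more than once; the choice of character unit should be reported alongside any CER figure.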

Metadata

Defensive Publication
