A Comparison of the Transformer and Bi-Long Short-Term Memory Algorithms for Speech-to-Text
DOI: https://doi.org/10.51454/decode.v6i1.1563

Keywords: BiLSTM, Feature Extraction, Speech-to-text, Transformer

Abstract
This study compares the performance of two Speech-to-Text architectures, Bidirectional Long Short-Term Memory (BiLSTM) and the Transformer, using two acoustic feature extraction schemes: Log-Mel Spectrogram and Filterbank Energies (FBANK). The comparison analyzes how the match between model architecture and feature representation affects the performance of an automatic speech recognition system. The two architectures were chosen for their contrasting sequence-processing mechanisms: BiLSTM processes data in both directions to capture temporal context from the past and the future, whereas the Transformer relies on self-attention, which processes the entire input sequence in parallel and captures global context. The novelty of this study lies in a consistent head-to-head evaluation of the BiLSTM and Transformer models across the feature extraction schemes to identify the best model-feature pairing, with tokenization tailored to each architecture (word-level tokenization for BiLSTM and SentencePiece-based sub-word tokenization for the Transformer), yielding a more objective quantitative analysis of how the fit between model and acoustic feature type affects performance. The study follows a quantitative experimental approach with LibriSpeech as the primary dataset. The procedure covers audio feature extraction, model training with the Connectionist Temporal Classification (CTC) loss and the Adam optimizer, and performance evaluation with the Word Error Rate (WER) and Character Error Rate (CER) metrics. The experiments show that the choice of model architecture and acoustic feature type has a marked effect on system performance.
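To make the feature-extraction step concrete, the log-mel pipeline the abstract refers to can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation; the parameter values (16 kHz sampling rate, 400-sample window, 160-sample hop, 80 mel bands) are common ASR defaults assumed here for illustration. Omitting the final log compression yields the FBANK energies variant.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=400, hop=160, n_mels=80):
    # Frame the signal and apply a Hann window to each frame
    window = np.hanning(n_fft)
    frames = np.stack([signal[s:s + n_fft] * window
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # Power spectrum per frame: (T, n_fft // 2 + 1)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Triangular mel filterbank with n_mels filters spanning 0 .. sr/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, cen, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, cen):            # rising slope
            fbank[m - 1, k] = (k - lo) / max(cen - lo, 1)
        for k in range(cen, hi):            # falling slope
            fbank[m - 1, k] = (hi - k) / max(hi - cen, 1)

    mel = power @ fbank.T                   # FBANK energies, (T, n_mels)
    return np.log(mel + 1e-10)              # log compression -> log-mel

# Example: one second of a 440 Hz tone at 16 kHz
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feat = log_mel_spectrogram(sig)             # shape (98, 80)
```

Each row of `feat` is one 80-dimensional acoustic feature frame, the input representation consumed by the BiLSTM or Transformer encoder.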
The BiLSTM model delivered more stable performance across all feature combinations, with a WER of about 29% on the test-clean subset and between 53% and 55% on the test-other subset. The Transformer model, meanwhile, performed best when paired with Log-Mel Spectrogram features but showed a significant increase in WER when using FBANK features. These results indicate that the fit between model architecture and feature type strongly influences transcription quality.
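The evaluation metrics reported above are both normalized Levenshtein edit distances: WER over word tokens, CER over characters. A minimal sketch of how such scores are computed (the example sentences are illustrative, not from the dataset):

```python
def edit_distance(ref, hyp):
    # Levenshtein distance with a single rolling DP row
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(reference, hypothesis):
    # Word Error Rate: word-level edits / reference word count
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    # Character Error Rate: character-level edits / reference length
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

Here the hypothesis contains one substitution ("sit" for "sat") and one deletion ("the"), giving 2 errors over 6 reference words, i.e. a WER of about 33%.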
License
Copyright (c) 2026 Achmad Rizky Zulkarnain, Muhammad Ezar Al Rivan

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.