A Comparison of the Transformer and Bi-Long Short-Term Memory Algorithms for Speech-to-Text
DOI: https://doi.org/10.51454/decode.v6i1.1563

Keywords: BiLSTM, Feature Extraction, Speech-to-text, Transformer

Abstract
This study compares the performance of two Speech-to-Text architectures, Bidirectional Long Short-Term Memory (BiLSTM) and the Transformer, using two acoustic feature extraction schemes: Log-Mel Spectrogram and Filterbank Energies (FBANK). The comparison analyzes how the match between model architecture and feature representation affects the performance of an automatic speech recognition system. The two architectures were chosen for their contrasting sequence-processing mechanisms: BiLSTM processes data in both directions to capture temporal context from the past and the future, whereas the Transformer relies on self-attention, which processes the entire input sequence in parallel and captures global context. The novelty of this study lies in a consistent head-to-head evaluation of the BiLSTM and Transformer models across the feature extraction schemes to identify the best model-feature pairing, with tokenization tailored to each architecture (word-level tokenization for BiLSTM and SentencePiece-based sub-word tokenization for the Transformer), yielding a more objective quantitative analysis of how the fit between model and acoustic feature type affects performance. The study follows a quantitative experimental approach with LibriSpeech as the primary dataset. The procedure covers audio feature extraction, model training with the Connectionist Temporal Classification (CTC) loss and the Adam optimizer, and performance evaluation with the Word Error Rate (WER) and Character Error Rate (CER) metrics. The experiments show that the choice of model architecture and acoustic feature type has a marked effect on system performance.
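To make the feature-extraction step concrete, the log-mel pipeline the abstract refers to can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation; the parameter values (16 kHz sampling rate, 400-sample window, 160-sample hop, 80 mel bands) are common ASR defaults assumed here for illustration. Omitting the final log compression yields the FBANK energies variant.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=400, hop=160, n_mels=80):
    # Frame the signal and apply a Hann window to each frame
    window = np.hanning(n_fft)
    frames = np.stack([signal[s:s + n_fft] * window
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # Power spectrum per frame: (T, n_fft // 2 + 1)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Triangular mel filterbank with n_mels filters spanning 0 .. sr/2
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, cen, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, cen):            # rising slope
            fbank[m - 1, k] = (k - lo) / max(cen - lo, 1)
        for k in range(cen, hi):            # falling slope
            fbank[m - 1, k] = (hi - k) / max(hi - cen, 1)

    mel = power @ fbank.T                   # FBANK energies, (T, n_mels)
    return np.log(mel + 1e-10)              # log compression -> log-mel

# Example: one second of a 440 Hz tone at 16 kHz
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
feat = log_mel_spectrogram(sig)             # shape (98, 80)
```

Each row of `feat` is one 80-dimensional acoustic feature frame, the input representation consumed by the BiLSTM or Transformer encoder.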
The BiLSTM model delivered more stable performance across all feature combinations, with a WER of about 29% on the test-clean subset and between 53% and 55% on the test-other subset. The Transformer model, meanwhile, performed best when paired with Log-Mel Spectrogram features but showed a significant increase in WER when using FBANK features. These results indicate that the fit between model architecture and feature type strongly influences transcription quality.
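The evaluation metrics reported above are both normalized Levenshtein edit distances: WER over word tokens, CER over characters. A minimal sketch of how such scores are computed (the example sentences are illustrative, not from the dataset):

```python
def edit_distance(ref, hyp):
    # Levenshtein distance with a single rolling DP row
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(reference, hypothesis):
    # Word Error Rate: word-level edits / reference word count
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    # Character Error Rate: character-level edits / reference length
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))  # 0.333
```

Here the hypothesis contains one substitution ("sit" for "sat") and one deletion ("the"), giving 2 errors over 6 reference words, i.e. a WER of about 33%.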
License
Copyright (c) 2026 Achmad Rizky Zulkarnain, Muhammad Ezar Al Rivan

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.