Enhancing retrieval performance of embedding models via fine-tuning on synthetic data in RAG chatbot for Vietnamese military science domain
DOI: https://doi.org/10.54939/1859-1043.j.mst.99.2024.109-118

Keywords: Retrieval-augmented generation; Fine-tuning; Synthetic data; Large language model; Chatbot.

Abstract
Retrieval-Augmented Generation (RAG) combines information retrieval with large language models, enabling a chatbot to answer accurately by retrieving relevant documents from a data repository before generating a response. While RAG chatbots have proven effective in many applications, they remain limited on specialized Vietnamese domains, particularly military science. To address this challenge, this paper proposes a framework for fine-tuning embedding models on synthetic datasets generated by ChatGPT to enhance retrieval performance in a Q&A application on the history of the Institute of Information Technology (IoIT). We evaluate 11 popular embedding models and observe a significant average improvement of 18.15% in the MAP@K metric after fine-tuning. The resulting IoIT history Q&A chatbot, built with the fine-tuned embedding models and the Vietnamese language model Vistral-7B, outperforms chatbots that use OpenAI's embedding models and ChatGPT. These findings highlight the potential of RAG chatbot technology for advancing information retrieval in specialized fields such as military science.
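For readers who want a concrete picture of the fine-tuning step, the sketch below shows the general technique of adapting a bi-encoder embedding model on synthetic question-passage pairs with the sentence-transformers library. It is a minimal illustration, not the paper's actual pipeline: the base model name, hyperparameters, and example pairs are assumptions standing in for the ChatGPT-generated dataset and the 11 evaluated models.

```python
# Minimal sketch: fine-tune an embedding model on synthetic (question, passage)
# pairs using in-batch negatives. All names and hyperparameters below are
# illustrative assumptions, not the paper's reported configuration.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Hypothetical synthetic pairs: each question is generated by an LLM (e.g.,
# ChatGPT) from a passage of the source corpus, so that passage serves as
# the question's positive example.
pairs = [
    ("When was the Institute of Information Technology founded?",
     "The Institute of Information Technology (IoIT) was established ..."),
    # ... more ChatGPT-generated (question, passage) pairs
]

model = SentenceTransformer("intfloat/multilingual-e5-base")  # assumed base model
train_examples = [InputExample(texts=[q, p]) for q, p in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# MultipleNegativesRankingLoss pulls each question toward its own passage and
# pushes it away from every other passage in the same batch.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=2, warmup_steps=50)
model.save("finetuned-embedding-model")
```

Retrieval quality before and after fine-tuning can then be scored with MAP@K: for each test query, compute the average precision over its top-K retrieved passages, then take the mean across all queries.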