SUBWORD FOR VIETNAMESE-ENGLISH STATISTICAL MACHINE TRANSLATION
Keywords:
Subword; Word alignment; Statistical machine translation.
Abstract
In this paper, we propose an approach that applies subword methods to improve word alignment in Vietnamese-English statistical machine translation (SMT) systems. In addition to applying subword segmentation as a preprocessing step, we propose a new algorithm for decoding the alignment table of the translation model. The proposed method has been implemented and evaluated with four subword methods: BPE, WordPiece, unigram, and Morfessor. Experimental results show that the proposed method yields better results with every subword method; the largest improvement, 0.81 BLEU, is obtained with the BPE model.
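To make the preprocessing step concrete, the following is a minimal sketch of byte-pair encoding (BPE), one of the four subword methods named above. It is a toy illustration of the general BPE idea, not the implementation used in this paper; the function names (`learn_bpe`, `merge_pair`, `segment`) and the `</w>` end-of-word marker are assumptions chosen for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of training words."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    corpus = ["low", "low", "lower", "newest", "newest", "widest"]
    merges = learn_bpe(corpus, num_merges=10)
    print(segment("lowest", merges))
```

In a pipeline of the kind described here, both sides of the parallel corpus would be segmented this way before word alignment is trained, so that alignment links are estimated between subword units rather than full words.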