arxiv:2403.10882

Optimizing Language Augmentation for Multilingual Large Language Models: A Case Study on Korean

Published on Mar 16

Abstract

Large language models (LLMs) use pretraining to predict the subsequent word; however, their expansion requires significant computing resources. Numerous big tech companies and research institutes have developed multilingual LLMs (MLLMs) to meet current demands, but these overlook less-resourced languages (LRLs). This study proposed three strategies to enhance the performance of LRLs based on publicly available MLLMs. First, the MLLM vocabularies of LRLs were expanded to enhance expressiveness. Second, bilingual data were used for pretraining to align the high- and less-resourced languages. Third, a high-quality small-scale instruction dataset was constructed and instruction tuning was performed to augment the LRL. The experiments employed the Llama2 model with Korean as the LRL, and the resulting model was quantitatively evaluated against other developed LLMs across eight tasks. Furthermore, a qualitative assessment was performed based on human evaluation and GPT4. Experimental results showed that the proposed Bllossom model exhibited superior performance in qualitative analyses compared to previously proposed Korean monolingual models.
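
The first strategy, vocabulary expansion, can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the abstract does not say how new tokens are chosen or how their embeddings are initialized. The greedy longest-match tokenizer, the mean-of-pieces initialization, and the names `tokenize` and `expand_vocab` are all assumptions made for illustration.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization (a toy stand-in for a real tokenizer).

    Characters not covered by the vocabulary are emitted as single-character pieces.
    """
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts here: fall back to one character.
            tokens.append(text[i])
            i += 1
    return tokens


def expand_vocab(vocab, embeddings, new_tokens):
    """Append new tokens to the vocabulary and grow the embedding table.

    Assumption: each new row is the mean of the embeddings of the pieces
    the ORIGINAL vocabulary split the token into (zeros if none are known).
    """
    old_vocab = dict(vocab)  # snapshot, so new tokens do not affect initialization
    dim = len(embeddings[0])
    for tok in new_tokens:
        if tok in vocab:
            continue
        pieces = [p for p in tokenize(tok, old_vocab) if p in old_vocab]
        if pieces:
            rows = [embeddings[old_vocab[p]] for p in pieces]
            new_row = [sum(r[k] for r in rows) / len(rows) for k in range(dim)]
        else:
            new_row = [0.0] * dim
        vocab[tok] = len(embeddings)  # next free id
        embeddings.append(new_row)
    return vocab, embeddings


# Toy data (hypothetical): three single-character Korean pieces, 2-d embeddings.
vocab = {"한": 0, "국": 1, "어": 2}
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

before = tokenize("한국어", vocab)           # split into three pieces
expand_vocab(vocab, embeddings, ["한국어"])  # add the whole word as one token
after = tokenize("한국어", vocab)            # now a single token
```

In this toy setting the word goes from three tokens to one after expansion, which is the kind of gain in expressiveness the abstract refers to. A real model would then need further training so that the new embedding rows become useful.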
