
Model Description

BantuLM is a multilingual BERT-based model specifically designed for Bantu languages, including Chichewa, Kinyarwanda, Swahili, Zulu, and others. The model is trained for masked language modeling (Fill-Mask), which predicts masked tokens in sentences, and is tailored to capture linguistic nuances in Bantu languages.

Intended uses & limitations

This model is primarily useful for:

  • Masked language modeling: Predicting missing or masked words in Bantu language sentences.
  • Feature extraction: Using BERT's embeddings for downstream NLP tasks in Bantu languages.

The model is not recommended for tasks outside Bantu language contexts, as its training data is specific to these languages.

How to use

You can use this model directly with a pipeline for masked language modeling:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='nairaxo/bantulm')
>>> unmasker("rais wa [MASK] ya tanzania")  # Swahili: "president of the [MASK] of Tanzania"

Here is how to use this model to get the features of a given text in PyTorch:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('nairaxo/bantulm')
model = BertModel.from_pretrained("nairaxo/bantulm")
text = "rais wa jamhuri ya tanzania"  # Swahili: "president of the republic of Tanzania"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

and in TensorFlow:

from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('nairaxo/bantulm')
model = TFBertModel.from_pretrained("nairaxo/bantulm")
text = "rais wa jamhuri ya tanzania"  # Swahili: "president of the republic of Tanzania"
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
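
In both frameworks, `output.last_hidden_state` holds one vector per token. A common way to turn these token-level vectors into a single fixed-size sentence embedding for downstream tasks is masked mean pooling, i.e. averaging the token vectors while ignoring padding positions. This is not prescribed by the model card; the sketch below just illustrates the arithmetic with NumPy arrays standing in for `output.last_hidden_state` and `encoded_input['attention_mask']`:

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Masked mean pooling: average token vectors, ignoring padded positions."""
    # Expand mask to (batch, seq_len, 1) so it broadcasts over hidden dims
    mask = attention_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)        # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)     # avoid division by zero
    return summed / counts

# Toy stand-ins: 2 sentences, 4 tokens each, hidden size 3
hidden = np.arange(2 * 4 * 3, dtype=np.float32).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],   # last token is padding
                 [1, 1, 0, 0]])  # last two tokens are padding
emb = mean_pool(hidden, mask)
print(emb.shape)  # (2, 3): one embedding per sentence
```

With real model output, the same function applies to `output.last_hidden_state` (converted to an array or implemented with the framework's own tensor ops).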

Training data

The model is pre-trained on a collection of text data from several Bantu languages, aiming to represent the diversity of linguistic structures within the Bantu language family.

Limitations and biases

The model may reflect certain biases based on the specific texts used in training and may not perform well for languages outside the Bantu family. Performance on under-represented dialects could also vary.

Evaluation

Performance metrics and evaluation data for this model are currently limited, owing to the specialized nature of Bantu languages and the scarcity of standard benchmarks for them.

Citation

If you use this model, please cite it as follows:

@article{Mohamed2023,
  title = {BantuLM: Enhancing Cross-Lingual Learning in the Bantu Language Family},
  url = {http://dx.doi.org/10.21203/rs.3.rs-3793749/v1},
  doi = {10.21203/rs.3.rs-3793749/v1},
  publisher = {Research Square Platform LLC},
  author = {Mohamed, Naira Abdou and Benelallam, Imade and Bahafid, Abdessalam and Erraji, Zakarya},
  year = {2023},
  month = dec
}