t5-small-spanish-nahuatl

Model description

This model is a T5 Transformer (t5-small) fine-tuned on 29,007 Spanish and Nahuatl sentences: 12,890 samples collected from the web and 16,117 samples from the Axolotl dataset.

The dataset is normalized using 'sep' normalization from py-elotl.

Usage

from transformers import AutoModelForSeq2SeqLM
from transformers import AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained('milmor/t5-small-spanish-nahuatl')
tokenizer = AutoTokenizer.from_pretrained('milmor/t5-small-spanish-nahuatl')

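# Switch to inference mode (disables dropout)
model.eval()
sentence = 'muchas flores son blancas'
# The task prefix tells the model which translation direction to use
input_ids = tokenizer('translate Spanish to Nahuatl: ' + sentence, return_tensors='pt').input_ids
outputs = model.generate(input_ids)
# Decode the generated token ids back into text
outputs = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
# outputs = miak xochitl istak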
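
To translate several sentences at once, the usage example above can be wrapped in a small helper. The following is a minimal sketch, not part of the original model card: the names `PREFIX`, `build_prompts`, `translate_batch`, and the `max_new_tokens=64` default are hypothetical choices for illustration. It assumes `model` and `tokenizer` have already been loaded as shown above.

```python
# Task prefix taken from the usage example above.
PREFIX = "translate Spanish to Nahuatl: "


def build_prompts(sentences):
    """Prepend the task prefix to each (whitespace-stripped) sentence."""
    return [PREFIX + s.strip() for s in sentences]


def translate_batch(sentences, model, tokenizer, max_new_tokens=64):
    """Translate a list of Spanish sentences in one padded batch.

    `model` and `tokenizer` are the objects loaded in the usage example;
    `max_new_tokens=64` is an assumed limit, not a value from the model card.
    """
    # padding=True pads shorter prompts so the batch forms a single tensor;
    # the returned attention_mask tells the model to ignore the padding.
    inputs = tokenizer(build_prompts(sentences), return_tensors="pt", padding=True)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```

With the model loaded, `translate_batch(['muchas flores son blancas'], model, tokenizer)` would return a list containing one translated string per input sentence.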

Evaluation results

The model is evaluated on 400 validation sentences.

Validation loss: 1.36
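
For intuition, a loss value can be converted to perplexity. The sketch below assumes the reported validation loss is the mean per-token cross-entropy measured in nats, which is the usual convention but is not stated in this model card; the function name `perplexity_from_loss` is hypothetical.

```python
import math


def perplexity_from_loss(mean_cross_entropy_nats: float) -> float:
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy_nats)


validation_loss = 1.36  # value reported above
# Under the stated assumption, this prints a perplexity of about 3.90.
print(f"{perplexity_from_loss(validation_loss):.2f}")
```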

Note: Because the Axolotl corpus contains multiple misalignments, the true validation loss is slightly better (lower) than the reported value. These misalignments also introduce noise into training.

References

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
Ximena Gutierrez-Vasques, Gerardo Sierra, and Isaac Hernandez. 2016. Axolotl: A Web Accessible Parallel Corpus for Spanish-Nahuatl. In International Conference on Language Resources and Evaluation (LREC).

Created by Emilio Morales.