bigscience/mt0-small · Translation Issues

21 days ago

Can anyone help me identify why I am unable to produce translations for longer inputs? I have a dataset of long texts that I know I will have to chunk. When testing, however, I cannot produce translations even for sequences that are well under the model's max length. My code snippet is below.

# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "bigscience/mt0-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = 'Translate the text following the colon to Spanish: I have a bunch of cats. I would like to go to the beach with them but cats do not like water. Should I take my dogs instead?'
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

translated = model.generate(**inputs, max_new_tokens=512, max_length=512)
translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)

print(translated_text)

unbias

BigScience Workshop org 21 days ago

@plyons would you mind providing the output of

print(translated_text)

plyons

21 days ago

edited 21 days ago

@unbias
The output translation is: "No me gustaría tomar gatos. No me gustaría tomar gatos."

Also, what is the most appropriate way to determine the max length for generation? Should max_new_tokens be set to this value?
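
One possible way to think about the generation budget, offered as a rough sketch rather than a recommendation from the model authors: in transformers, max_new_tokens bounds only the tokens that generate() produces, and when both max_new_tokens and max_length are passed (as in the snippet above) the library warns that max_new_tokens takes precedence. The helper below is hypothetical. Its name, the 1.5 expansion factor, the floor of 16, and the cap of 512 are all assumptions chosen for illustration, not values taken from the mt0 documentation.

```python
import math


def suggest_max_new_tokens(num_input_tokens, expansion=1.5, floor=16, cap=512):
    """Heuristic output budget for translation (hypothetical helper).

    Assumes the translation is at most `expansion` times as long as the
    input, measured in tokens. All default values are illustrative guesses.
    """
    if num_input_tokens < 0:
        raise ValueError("num_input_tokens must be non-negative")
    budget = math.ceil(num_input_tokens * expansion)
    # Clamp the budget to the range [floor, cap].
    return max(floor, min(budget, cap))


# Example: an input of 45 tokens under the assumed defaults.
print(suggest_max_new_tokens(45))
```

With the tokenizer from the snippet above, the input length could be read from inputs["input_ids"].shape[1] and the result passed to model.generate as max_new_tokens on its own, without max_length.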

unbias

BigScience Workshop org 21 days ago

@plyons
Curious. This roughly translates to "I would rather not have cats", which does not match the input.
Would you mind trying with a larger model?
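
On the chunking mentioned in the original question, here is a minimal sketch of one way it might be done: split the text at sentence boundaries and pack sentences greedily into chunks under a token budget. The function name chunk_text, the regex-based sentence splitter, and the default whitespace token count are assumptions for illustration. A real pipeline would likely count tokens with the model's own tokenizer.

```python
import re


def chunk_text(text, max_tokens=200, count_tokens=None):
    """Greedily pack sentences into chunks of at most `max_tokens` tokens.

    `count_tokens` is any callable mapping a string to a token count.
    By default it counts whitespace-separated words (an assumption; a
    subword tokenizer will usually report more tokens than this).
    """
    if count_tokens is None:
        count_tokens = lambda s: len(s.split())
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Note: a single sentence longer than max_tokens is kept as its own chunk.
    return chunks


text = (
    "I have a bunch of cats. I would like to go to the beach with them "
    "but cats do not like water. Should I take my dogs instead?"
)
for chunk in chunk_text(text, max_tokens=16):
    print(chunk)
```

To count real tokens, one could pass count_tokens=lambda s: len(tokenizer(s)["input_ids"]) with the tokenizer loaded in the first post, then prepend the translation instruction to each chunk before calling generate.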