Translation Issues

#7
by plyons - opened

Can anyone help me identify why I am unable to produce translations for longer inputs? I have a dataset of long texts that I know I will have to chunk. When testing, however, I am not able to produce translations even for sequences that are well under the model's max length. My code snippet is below.

# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "bigscience/mt0-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = 'Translate the text following the colon to Spanish: I have a bunch of cats. I would like to go to the beach with them but cats do not like water. Should I take my dogs instead?'
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

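# Pass only max_new_tokens: when max_length is also set, recent transformers versions warn and let max_new_tokens take precedence
translated = model.generate(**inputs, max_new_tokens=512)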
translated_text = tokenizer.decode(translated[0], skip_special_tokens=True)

print(translated_text)
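
Since the longer texts will need chunking anyway, here is a minimal sketch of one way to do it: greedily pack whole sentences into chunks that stay under a token budget. The helper name `chunk_by_token_budget`, the regex-based sentence split, and the `count_tokens` callback are all assumptions for illustration, not part of transformers or of the snippet above.

```python
import re


def chunk_by_token_budget(text, count_tokens, max_tokens):
    """Greedily pack whole sentences into chunks of at most max_tokens tokens.

    count_tokens is any callable mapping a string to a token count
    (hypothetical hook; plug in a real tokenizer in practice).
    """
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption;
    # abbreviations and other edge cases are not handled).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and count_tokens(candidate) > max_tokens:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    # Note: a single sentence longer than max_tokens is kept as its own chunk
    # and would still rely on tokenizer truncation.
    return chunks
```

With the tokenizer from the snippet above, the counter could be something like `lambda s: len(tokenizer(s)["input_ids"])`, and each chunk would then be translated separately with the instruction prefix prepended.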
BigScience Workshop org

@plyons would you mind providing the output of

print(translated_text)

@unbias
The output translation is: "No me gustaría tomar gatos. No me gustaría tomar gatos."

Also, what is the most appropriate way to determine the max length for generation? Should max_new_tokens be set to this value?

BigScience Workshop org

@plyons
Curious. This roughly translates to "I would rather not have cats", which is unrelated to the input.
Would you mind trying a larger model?
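
For reference, a minimal sketch of what trying a larger checkpoint could look like. It assumes `bigscience/mt0-base` as the next size up and passes only `max_new_tokens` to `generate`, which bounds the number of generated tokens. The helper names `build_prompt` and `translate` and the default of 256 new tokens are illustrative choices, and whether a larger model actually fixes the output here is untested.

```python
def build_prompt(text, target_language="Spanish"):
    """Same instruction wording as the original post, with the language as a parameter."""
    return f"Translate the text following the colon to {target_language}: {text}"


def translate(text, model_name="bigscience/mt0-base", max_new_tokens=256):
    """Translate text with an mT0 checkpoint (downloads the model on first use)."""
    # Imported lazily so build_prompt can be used without transformers installed.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    # Truncate the input side to 512 tokens, as in the original snippet.
    inputs = tokenizer(build_prompt(text), return_tensors="pt",
                       truncation=True, max_length=512)
    # Only max_new_tokens is set; 256 is an arbitrary value, not a recommendation.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `translate("I have a bunch of cats. Should I take my dogs instead?")` would then run the same prompt against the larger checkpoint; swapping `model_name` makes it easy to compare sizes on the same input.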
