Minimal number of sequences for fine-tuning
Dear Authors,
Thanks for your excellent work! I plan to fine-tune the model. Could you please advise on the minimum number of protein sequences needed for fine-tuning? Thanks a lot!
Hi Yuanji Zhang,
Thanks for your interest! I am afraid I do not have a rule of thumb yet. I tried to fine-tune a model with 500 sequences, and it did not work very well. However, I know someone who fine-tuned on 900 sequences; the training curves looked fine and they obtained the expected results. So I guess you will have to try :). I am happy to assist if you need any help!
Noelia
Hi Noelia,
I tested the example code from Zenodo as follows:
protgpt2 = pipeline('text-generation', model="nferruz/ProtGPT2")
sequences = protgpt2("M", max_length=100, min_length=80, ...)
The actual length of the generated protein sequences ranges from 239 to 298. Do "max_length" and "min_length" actually refer to the number of tokens?
Yes, min_length and max_length correspond to the number of tokens.
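For anyone who wants to check this on their own output, here is a minimal sketch that compares the token count with the amino-acid count of one generated sequence. The numbers in this thread (80 to 100 tokens giving 239 to 298 amino acids) suggest roughly three amino acids per token. The helper names `toy_tokenize` and `length_report` are hypothetical, and the fixed-size chunking is only a stand-in so the snippet runs without downloading the model; it is not how the real tokenizer splits sequences. The sketch also assumes the generated text may contain line breaks, which are not counted as amino acids.

```python
from typing import Callable, List, Tuple


def toy_tokenize(text: str, chunk: int = 4) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer: fixed-size character chunks."""
    return [text[i:i + chunk] for i in range(0, len(text), chunk)]


def length_report(generated_text: str,
                  tokenize: Callable[[str], List[str]]) -> Tuple[int, int]:
    """Return (number of tokens, number of amino acids) for one generated sequence."""
    # Tokens are counted on the raw text, since that is what the length limits see.
    n_tokens = len(tokenize(generated_text))
    # Amino acids are counted after dropping whitespace such as line breaks.
    n_residues = len("".join(generated_text.split()))
    return n_tokens, n_residues


text = "MKVLAAGIVG\nLLLAQPSA"  # made-up example string, not real model output
n_tokens, n_residues = length_report(text, toy_tokenize)
print(n_tokens, n_residues)

# With the real model, one would pass the model's own tokenizer instead
# (this requires the transformers package and a model download):
#   from transformers import AutoTokenizer
#   tokenizer = AutoTokenizer.from_pretrained("nferruz/ProtGPT2")
#   length_report(sequences[0]["generated_text"], tokenizer.tokenize)
```

If a target length in amino acids is needed, one option is to divide it by the observed amino acids per token to pick max_length, then filter the generated sequences by their actual length.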