nferruz/ProtGPT2 · clarification question- sequence with invalid proteins

Jan 29, 2023

If I feed it an invalid sequence, it still predicts a continuation made up of amino acid letters.

That is probably because the base model derives from the GPT-2 architecture.
This is probably never a situation anyone would actually use the model for, but I wanted to hear any thoughts on it.

Similarly, I ran the sequence on ColabFold but got a non-converging sequence error. For reference:

nferruz

Owner Jan 29, 2023 (edited Jan 29, 2023)

Hi skr3178,

Thanks for writing!
The model will always predict a sequence after an input. It is autoregressive and chooses the next token based on its context. In your case, the context is 'BJOUX'; it has no biological meaning, but it still corresponds to a set of tokens. Hence, the model can compute the associated probabilities for the tokens that follow. I would expect the perplexities of those sequences to be fairly high, though. If you want to avoid specific tokens during generation, you could use the bad_words parameter (bad_words_ids in the transformers generate() method).
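To make the two ideas above concrete, here is a minimal sketch in plain Python. It shows how perplexity follows from per-token log-probabilities, and how a list in the shape expected by bad_words_ids could be built from a token-to-id mapping. The function names and the toy vocabulary are made up for illustration; they are not part of ProtGPT2 or transformers, and the real ProtGPT2 vocabulary may look different.

```python
import math

# The 20 standard amino acid one-letter codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def non_standard_letters(sequence):
    """Return the sorted letters in `sequence` that are not standard amino acids."""
    return sorted(set(sequence.upper()) - STANDARD_AMINO_ACIDS)


def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def bad_words_ids_for(vocab, forbidden_letters):
    """Build a list of single-token id lists for tokens containing a forbidden letter.

    `vocab` is a token-string -> token-id mapping (hypothetical helper; the
    nested-list shape matches what `bad_words_ids` expects).
    """
    forbidden = set(forbidden_letters)
    ordered = sorted(vocab.items(), key=lambda item: item[1])
    return [[token_id] for token, token_id in ordered if forbidden & set(token)]


# Toy vocabulary, invented for this example (NOT the real ProtGPT2 vocabulary).
toy_vocab = {"MK": 0, "LLX": 1, "GA": 2, "BJ": 3, "U": 4}

letters = non_standard_letters("BJOUX")
banned = bad_words_ids_for(toy_vocab, letters)
print(letters)
print(banned)

# A model assigning probability 0.5 to every observed token has perplexity ~2.
print(perplexity([math.log(0.5)] * 4))
```

With the actual model, the mapping would presumably come from tokenizer.get_vocab(), and the resulting list could be passed as bad_words_ids to generate(); the per-token log-probabilities would come from the model's output rather than being typed in by hand.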
I'm not an expert on ColabFold, but what sequence did you try?
Thanks!
noelia

skr3178

Apr 27, 2023

Hi Noelia,
Thank you for the clarification and for sharing this work.
I find it very interesting and learnt a lot :-)
I tried a random combination of letters on ColabFold (not a real sequence).
Thanks!

skr3178 changed discussion status to closed Apr 27, 2023