Clarification question: sequence with invalid proteins
If I feed the model an invalid sequence, it still predicts a sequence following this amino acid sequence list.
That is probably because the base model comes from ChatGPT.
This is probably never a situation that someone will use it for, but I wanted to hear any thoughts on this.
Similarly, I ran the sequence on ColabFold but got a non-converging sequence error. For reference:
Hi skr3178,
Thanks for writing!
The model will always predict a sequence after an input. It is autoregressive and chooses the next token based on its context. In your case, the context is 'BJOUX'. It has no biological meaning, but it still corresponds to a set of tokens, so the model can compute the associated probabilities for the tokens that follow. I would expect the perplexities of those sequences to be fairly high, though. If you want to avoid specific tokens during generation, you could use the bad_words parameter.
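To make the points in this reply concrete, here is a minimal toy sketch in plain Python. It is not the actual model: `next_token_probs` is a hypothetical stand-in that ignores its context and simply gives non-standard letters a low weight, and the `banned` argument only imitates what a bad-words filter does during generation. If the model is run through Hugging Face transformers (an assumption, the thread does not say), the corresponding `generate` argument is `bad_words_ids`.

```python
import math
import random

# Toy vocabulary: the 20 standard amino acids plus some non-standard letters.
STANDARD = "ACDEFGHIKLMNPQRSTVWY"
NON_STANDARD = "BJOUXZ"
VOCAB = STANDARD + NON_STANDARD


def next_token_probs(context, banned=""):
    """Hypothetical stand-in for a language model's next-token distribution.

    This toy version ignores `context` (a real model would not). Standard
    residues get weight 1.0, non-standard ones 0.05, and banned tokens get
    weight 0, which imitates a bad-words filter.
    """
    weights = {}
    for tok in VOCAB:
        if tok in banned:
            weights[tok] = 0.0
        else:
            weights[tok] = 1.0 if tok in STANDARD else 0.05
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def generate(prompt, n_tokens, banned="", seed=0):
    """Autoregressive sampling: any prompt, meaningful or not, gets a continuation."""
    rng = random.Random(seed)
    seq = prompt
    for _ in range(n_tokens):
        probs = next_token_probs(seq, banned)
        # Drop zero-probability (banned) tokens before sampling.
        allowed = [t for t in probs if probs[t] > 0.0]
        seq += rng.choices(allowed, weights=[probs[t] for t in allowed])[0]
    return seq


def perplexity(seq):
    """exp of the mean negative log-probability per token."""
    log_prob = 0.0
    for i, tok in enumerate(seq):
        log_prob += math.log(next_token_probs(seq[:i])[tok])
    return math.exp(-log_prob / len(seq))


continuation = generate("BJOUX", 20, banned=NON_STANDARD)
ppl_invalid = perplexity("BJOUX")
ppl_valid = perplexity("MKVLA")
```

In this sketch the prompt 'BJOUX' still receives a 20-token continuation, the banned letters never appear in the generated part, and the perplexity of 'BJOUX' comes out higher than that of a string of standard residues. The numbers themselves mean nothing outside the toy weights chosen here.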
I'm not an expert on ColabFold, but what sequence did you try?
Thanks!
noelia
Hi Noelia,
Thank you for the clarification and for sharing this work.
I find it very interesting and have learnt a lot :-)
I tried a random combination on ColabFold (not a real sequence).
Thanks!