Many generated sequences are highly similar to WT sequences

#10

by ShanGao - opened Nov 14, 2023

Nov 14, 2023

Dear Authors,

I used the model to generate sequences for some randomly chosen BRENDA enzymes. Many of the generated sequences are over 90% similar to sequences in BRENDA, and some are 100% identical. Is this expected? I used the parameters (top_p, top_k, temperature) recommended in your manuscript.

nferruz

AI for protein design org Nov 14, 2023

Hi Shangao,

This is expected only for BRENDA classes with high redundancy. For example, if a BRENDA class contains only 10 sequences, but they fall into 10 different clusters at 50% identity, ZymCTRL will generate sequences at that distance. To decrease the identity, you could fine-tune the model on a less redundant dataset.
Best wishes
Noelia
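
For readers who want to check how close their generated sequences are to known ones, here is a minimal Python sketch. It is not part of ZymCTRL: the function names `global_identity` and `filter_novel`, the scoring values, and the 0.9 cutoff are all assumptions chosen for illustration. It computes pairwise identity from a simple Needleman-Wunsch global alignment and keeps only the generated sequences that stay at or below the cutoff against every reference sequence.

```python
from typing import List


def global_identity(a: str, b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -1) -> float:
    """Identical columns / alignment length for a global alignment.

    Hypothetical helper with a toy scoring scheme (assumption), not a
    substitution matrix such as BLOSUM62.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return 0.0
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal path, counting identical columns.
    i, j, identical, columns = n, m, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        columns += 1
    return identical / columns


def filter_novel(generated: List[str], references: List[str],
                 max_identity: float = 0.9) -> List[str]:
    """Keep generated sequences whose identity to every reference is
    at most max_identity (the 0.9 default is an assumed cutoff)."""
    return [g for g in generated
            if all(global_identity(g, r) <= max_identity for r in references)]


if __name__ == "__main__":
    refs = ["ACDEFGHIKL"]                 # placeholder reference sequences
    gen = ["ACDEFGHIKL", "MNPQRSTVWY"]    # placeholder generated sequences
    print(filter_novel(gen, refs))
```

The alignment is quadratic in sequence length, so this only suits small spot checks. For a whole BRENDA class, a dedicated search or clustering tool such as BLAST or MMseqs2 would be the more practical choice.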
