Many generated sequences are highly similar to WT sequences
#10, opened by ShanGao
Dear Authors,
I used the model to generate sequences for some randomly chosen BRENDA enzymes. Many of the generated sequences are over 90% similar to sequences in BRENDA, and some are 100% identical. Is this expected? I used the parameters (top_p, top_k, temperature) recommended in your manuscript.
Hi Shangao,
This is expected only for BRENDA classes with high redundancy. For example, if a BRENDA class contains only 10 sequences, but they fall into 10 different clusters at 50% sequence identity, ZymCTRL will generate sequences at that distance. To decrease the identity, you could fine-tune the model on a less redundant dataset.
Best wishes
Noelia
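For readers who want to check how close their generated sequences are to the reference set before deciding whether to fine-tune, here is a minimal sketch in plain Python. It is not part of ZymCTRL or the authors' pipeline: the function names `global_identity` and `filter_novel`, the scoring scheme (match +1, mismatch -1, gap -1), the 0.9 cutoff, and the toy sequences are all assumptions chosen for illustration. Identity is computed as identical columns divided by alignment length in a simple Needleman-Wunsch global alignment; other tools may define identity differently.

```python
from typing import List, Tuple


def global_identity(a: str, b: str, match: int = 1,
                    mismatch: int = -1, gap: int = -1) -> float:
    """Fraction of identical columns in a Needleman-Wunsch global alignment.

    The scoring values are illustrative assumptions, not a substitution matrix.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return 0.0

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back one optimal alignment, counting identical and total columns.
    i, j = n, m
    identical = 0
    columns = 0
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            identical += int(a[i - 1] == b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        columns += 1
    return identical / columns


def filter_novel(generated: List[str], references: List[str],
                 max_identity: float = 0.9) -> List[Tuple[str, float]]:
    """Keep generated sequences whose best identity to any reference
    is at most max_identity; return (sequence, best_identity) pairs."""
    kept = []
    for seq in generated:
        best = max((global_identity(seq, ref) for ref in references),
                   default=0.0)
        if best <= max_identity:
            kept.append((seq, best))
    return kept


if __name__ == "__main__":
    # Toy sequences, made up for illustration only.
    references = ["MKTAYIAKQR"]
    generated = ["MKTAYIAKQR", "GGSSGGSSGG"]
    for seq, best in filter_novel(generated, references):
        print(seq, round(best, 2))
```

This pure-Python alignment is quadratic in sequence length and is only practical for small sets. For a full BRENDA class, a dedicated clustering or search tool such as MMseqs2 or CD-HIT would be the usual choice, both for filtering generated sequences and for building a less redundant fine-tuning set.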