Fine-tuning ProLLaMA on a custom dataset
Hi, thank you for this wonderful work on protein language modeling! I have a CSV dataset consisting only of non-hemolytic protein sequences (shown in the attached picture), and I want to use it to fine-tune ProLLaMA so that it generates only non-hemolytic proteins. Could you outline the steps I should take to do that? How should I add the conditions? Thank you!
Thanks for your attention! Suppose 'xxx' denotes one of your sequences. You could convert your raw data into:
[Generate non-hemolytic protein] Seq=<xxx>
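For example, a minimal conversion sketch; the file names and the column name "sequence" are assumptions, so adjust them to your actual CSV:

```python
import csv

# Convert a CSV of non-hemolytic sequences into ProLLaMA-style training lines.
# The column name "sequence" is an assumption; change it to match your header.
with open("non_hemolytic.csv", newline="") as src, open("train.txt", "w") as dst:
    for row in csv.DictReader(src):
        seq = row["sequence"].strip()
        dst.write(f"[Generate non-hemolytic protein] Seq=<{seq}>\n")
```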
You can then train ProLLaMA on the processed dataset using an existing code base (such as the Hugging Face Trainer).
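A rough sketch of that with the Hugging Face Trainer; the model id, file names, and hyperparameters below are placeholders rather than our exact settings:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "GreatCaptainNemo/ProLLaMA"  # check the Hugging Face model card
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# train.txt: one "[Generate non-hemolytic protein] Seq=<...>" sample per line.
dataset = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="prollama-nonhemolytic",
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=tokenized["train"],
    # mlm=False gives standard causal-LM label shifting.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```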
I am also planning to open-source the training code on my GitHub.
Thank you for answering my query! I will try this approach, and I am also looking forward to the training code. Could you also tell me how long it took to train ProLLaMA, for both continual training and instruction fine-tuning, and how many GPUs were used? It would help my research as well.
Sure. Stage 1 takes about 6 days on 8 A6000 GPUs; stage 2 takes 5 days on the same 8 A6000 GPUs. FlashAttention-2 and DeepSpeed can speed up the training.
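If it helps, here is roughly how both are enabled with the Hugging Face stack (a sketch; the model id and the ds_config.json path are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# FlashAttention-2 is requested at load time (requires the flash-attn package
# and a supported GPU); the model id is a placeholder, check the model card.
model = AutoModelForCausalLM.from_pretrained(
    "GreatCaptainNemo/ProLLaMA",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# DeepSpeed plugs into the HF Trainer via TrainingArguments; ds_config.json
# is your own ZeRO config (stage, optimizer offload, etc.).
args = TrainingArguments(output_dir="out", deepspeed="ds_config.json", bf16=True)
```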
Got it, thank you!
Hello, we have released our training code here.
Hello,
I came across your model and found it really interesting and promising. I wanted to try running the Stage 2 fine-tuning and checked out the code provided under Quick Training. However, in the run_it.sh file, the line torchrun --nproc_per_node 8 run_clm_sft_with_peft.py refers to a run_clm_sft_with_peft.py file that I could not find anywhere in your GitHub or Hugging Face repositories. Apologies, I have since found it via the run_it.sh script documented in the wiki: https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/sft_scripts_zh, https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/blob/main/scripts/training/run_clm_sft_with_peft.py
@wenjun99 Hello, I have corrected the typo on GitHub. It should be instruction_tune.py. Have fun~
Hello again,
Below is the example.json file provided, but I believe the inputs and outputs are swapped. Since the instruction is [Generate by superfamily], shouldn't the input be Superfamily=<>, so that the output is a sequence from the queried superfamily?
[
  {
    "instruction":"[Generate by superfamily]",
    "input":"Seq=",
    "output":"Superfamily="
  },
  {
    "instruction":"[Determine superfamily]",
    "input":"Superfamily=",
    "output":"Seq="
  }
]
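If my reading is correct, I would expect both entries to be swapped, i.e. something like the following (with <...> as placeholders for the actual values):

```json
[
  {
    "instruction":"[Generate by superfamily]",
    "input":"Superfamily=<...>",
    "output":"Seq=<...>"
  },
  {
    "instruction":"[Determine superfamily]",
    "input":"Seq=<...>",
    "output":"Superfamily=<...>"
  }
]
```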