Fine-tuning ProLLaMA on a custom dataset
Hi, thank you for this wonderful work on protein language modeling! I have a CSV dataset consisting only of non-hemolytic protein sequences (shown in the attached picture), and I want to use it to fine-tune ProLLaMA so that it generates only non-hemolytic proteins. Could you outline the steps I should take to do that? How should I add the conditions? Thank you!
Thanks for your attention! Suppose 'xxx' denotes one of your sequences. You could convert your raw data into:
[Generate non-hemolytic protein] Seq=<xxx>
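For example, a minimal conversion sketch; the file names and the column name "sequence" are assumptions, so adjust them to your actual CSV:

```python
import csv

# Convert a CSV of non-hemolytic sequences into ProLLaMA-style training lines.
# The column name "sequence" is an assumption; change it to match your header.
with open("non_hemolytic.csv", newline="") as src, open("train.txt", "w") as dst:
    for row in csv.DictReader(src):
        seq = row["sequence"].strip()
        dst.write(f"[Generate non-hemolytic protein] Seq=<{seq}>\n")
```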
You can then train ProLLaMA on the processed dataset using an existing code base (such as the Hugging Face Trainer).
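A rough sketch of that with the Hugging Face Trainer; the model id, file names, and hyperparameters below are placeholders rather than our exact settings:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "GreatCaptainNemo/ProLLaMA"  # check the Hugging Face model card
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# train.txt: one "[Generate non-hemolytic protein] Seq=<...>" sample per line.
dataset = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="prollama-nonhemolytic",
        per_device_train_batch_size=4,
        num_train_epochs=3,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=tokenized["train"],
    # mlm=False gives standard causal-LM label shifting.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```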
I am also planning to open-source the training code on my GitHub.
Thank you for answering my query! I will try this approach, and I am also looking forward to the training code. Could you also tell me how long it took to train ProLLaMA, for both continual training and instruction fine-tuning, and how many GPUs were used? It would help my research as well.
Sure. Stage 1 takes about 6 days on 8 A6000 GPUs; stage 2 takes 5 days on the same 8 A6000 GPUs. FlashAttention-2 and DeepSpeed can speed up the training.
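If it helps, here is roughly how both are enabled with the Hugging Face stack (a sketch; the model id and the ds_config.json path are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, TrainingArguments

# FlashAttention-2 is requested at load time (requires the flash-attn package
# and a supported GPU); the model id is a placeholder, check the model card.
model = AutoModelForCausalLM.from_pretrained(
    "GreatCaptainNemo/ProLLaMA",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# DeepSpeed plugs into the HF Trainer via TrainingArguments; ds_config.json
# is your own ZeRO config (stage, optimizer offload, etc.).
args = TrainingArguments(output_dir="out", deepspeed="ds_config.json", bf16=True)
```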
Got it, thank you!
Hello, we have released our training code here.
Hello,
I came across your model and found it really interesting and promising. I wanted to try running the Stage 2 fine-tuning and checked out the code provided under Quick Training. However, in the run_it.sh file, the line torchrun --nproc_per_node 8 run_clm_sft_with_peft.py refers to a run_clm_sft_with_peft.py file that I could not find anywhere in your GitHub or Hugging Face repositories. Apologies, I have since found it via the run_it.sh script documented in the wiki: https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/wiki/sft_scripts_zh, https://github.com/ymcui/Chinese-LLaMA-Alpaca-2/blob/main/scripts/training/run_clm_sft_with_peft.py
@wenjun99 Hello, I have corrected the typo on GitHub. It should be instruction_tune.py. Have fun~
Hello again,
Below is the example.json file provided, but I believe the inputs and outputs are swapped. Since the instruction is [Generate by superfamily], shouldn't the input be Superfamily=<>, so that the output is a sequence from the queried superfamily?
[
  {
    "instruction":"[Generate by superfamily]",
    "input":"Seq=",
    "output":"Superfamily="
  },
  {
    "instruction":"[Determine superfamily]",
    "input":"Superfamily=",
    "output":"Seq="
  }
]
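If my reading is correct, I would expect both entries to be swapped, i.e. something like the following (with <...> as placeholders for the actual values):

```json
[
  {
    "instruction":"[Generate by superfamily]",
    "input":"Superfamily=<...>",
    "output":"Seq=<...>"
  },
  {
    "instruction":"[Determine superfamily]",
    "input":"Seq=<...>",
    "output":"Superfamily=<...>"
  }
]
```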