Resources required for fine-tuning ProtGPT2
I'm in the process of fine-tuning the ProtGPT2 model on a set of training examples I've collected. I have about 500k sequences and am thinking of starting with a small subset of these, say 2k sequences. I have access to two NVIDIA A100 GPUs, each with 40 GB of memory. Will this be enough for training? Will I need any kind of memory-saving tricks? Also, can someone give me an idea of how long the fine-tuning will take for different numbers of sequences (between 2k and 500k)?
Thanks!
Hi,
Your resources sound more than sufficient. With 2k sequences you wouldn't need more than an hour. With 500k it will take longer, but with 2 A100s I don't imagine several epochs would take you more than a day (though I'm doing very simple maths here). If you want to save some VRAM you could use DeepSpeed; in fact I recommend it, it's super easy to use!
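If it helps, here is a minimal sketch of what I mean, assuming you train with the Hugging Face Trainer (the file name ds_config.json is just an example, and the "auto" values are placeholders the Trainer fills in from its own arguments):

pip install deepspeed
# Write a minimal ZeRO stage 2 config for the Hugging Face DeepSpeed integration
cat > ds_config.json << EOF
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}
EOF

Then launch with the deepspeed launcher rather than plain python, e.g. deepspeed --num_gpus=2 run_clm.py ... --deepspeed ds_config.json, keeping the rest of your arguments as usual. With more than one GPU, ZeRO stage 2 shards the optimizer states and gradients across the cards, which account for a large share of the per-GPU memory.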
Hi,
Thanks for the quick response. I'll definitely look into the DeepSpeed library! One more question: will V100 GPUs also work for this fine-tuning task?
I haven't tried them myself but they should, absolutely.
Great, thanks! I'll try them out!
Hi @nferruz,
Are there any back-of-the-envelope calculations for estimating how much memory will be required for a job?
I'm asking because fine-tuning the model has required a lot more memory than we expected.
We are fine-tuning with around 2,250 training sequences (~300 AA long) and ran into OOM errors on the following machines:
- 4 V100s (16 GB VRAM each)
- A100 (40 GB VRAM)
The job ran (in about 5 minutes) on these machines:
- A100 (80 GB VRAM)
- H100 (80 GB VRAM)
For the 80 GB GPUs, utilization was close to 100%.
Details:
We followed the instructions provided on the Hugging Face page. In essence, this is our environment setup:
NAME=protgpt
conda deactivate
conda deactivate
conda deactivate
conda deactivate
conda env remove --name $NAME
conda create -y -n $NAME python=3.10
conda activate $NAME
pip install git+https://github.com/huggingface/transformers
# Reqs for running `run_clm.py`
pip install -r <(cat << EOF
accelerate >= 0.12.0
torch >= 1.3
datasets >= 2.14.0
sentencepiece != 0.1.92
protobuf
evaluate
scikit-learn
EOF
)
conda install pytorch cudatoolkit -c pytorch -c nvidia -y
# Use the raw URL here; wget on the github.com .../blob/... URL downloads the HTML page rather than the script itself
wget https://raw.githubusercontent.com/huggingface/transformers/main/examples/pytorch/language-modeling/run_clm.py
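Before launching anything, a quick optional sanity check that the environment actually sees the GPUs:

nvidia-smi
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"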
Our run command is:
python run_clm.py \
  --model_name_or_path nferruz/ProtGPT2 \
  --train_file dck_sequences_train.txt \
  --validation_file dck_sequences_val.txt \
  --tokenizer_name nferruz/ProtGPT2 \
  --do_train \
  --do_eval \
  --output_dir output \
  --learning_rate 1e-06
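As a side note, on the multi-GPU machines the same script can be launched with torchrun for data-parallel training, e.g. something along these lines (here with 4 GPUs). This speeds up training but does not reduce per-GPU memory, since each GPU still keeps a full copy of the model, gradients and optimizer states:

torchrun --nproc_per_node=4 run_clm.py \
  --model_name_or_path nferruz/ProtGPT2 \
  --train_file dck_sequences_train.txt \
  --validation_file dck_sequences_val.txt \
  --tokenizer_name nferruz/ProtGPT2 \
  --do_train \
  --do_eval \
  --output_dir output \
  --learning_rate 1e-06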
We haven't yet experimented with DeepSpeed, as you recommended above. If you have a working example, that would be much appreciated!
And if you would like to experiment with the training/validation data yourself, I've sent them in an email.
Thanks again for your time.
Cheers,
Evan
What is the batch size that the script uses by default? I'd decrease it considerably; you should be able to fine-tune the model on all of these cards (or at least most of them).
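As a rough back-of-the-envelope estimate (treating ProtGPT2 as roughly 740M parameters, i.e. GPT2-large sized): plain fp32 training with Adam needs about 4 bytes per parameter for the weights, 4 for the gradients and 8 for the two Adam moments, so roughly 16 bytes per parameter, or about 12 GB before any activations. The activations then grow with batch size and sequence length, which is how a batch size of 8 can push a 16 GB or 40 GB card over the edge while the same job fits comfortably in 80 GB. These numbers are only indicative, but they match the pattern you're seeing.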
We've experimented with the batch size. With the setup above, the default batch size is 8. On 24 GB of VRAM, a batch size of 2 is possible, but 3 is not. Is this in line with your expectations?
The workaround we've opted for is a batch size of 1 (--per_device_train_batch_size 1). Since this leads to noisy learning that generalizes poorly, we've simulated an effective batch size of 16 with --gradient_accumulation_steps 16.
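For completeness, the full invocation with those two flags (everything else unchanged) looks something like:

python run_clm.py \
  --model_name_or_path nferruz/ProtGPT2 \
  --train_file dck_sequences_train.txt \
  --validation_file dck_sequences_val.txt \
  --tokenizer_name nferruz/ProtGPT2 \
  --do_train \
  --do_eval \
  --output_dir output \
  --learning_rate 1e-06 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16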
Hi, yes, that sounds reasonable depending on your GPU. Gradient accumulation also sounds fine if you have no other option; you pay the price of slower training to get around the memory problem: https://discuss.huggingface.co/t/batch-size-vs-gradient-accumulation/5260/5.
But in all honesty we use it quite frequently as well.