Recommended GPU architecture for fine tuning

Discussion #19, opened by codev

Hello,

Thank you so much for your work to make this model freely accessible, easy to use and well documented!

I am wondering what the minimum GPU memory requirements are for fine-tuning. I am currently using four NVIDIA A10G Tensor Core GPUs (24 GB of memory each) in a distributed manner, with all of the memory-saving tricks I know (a batch size of 1, --gradient_accumulation_steps set to 16, gradient checkpointing enabled, and the Adafactor optimizer, per the blog post here), but I am still running out of memory. My command is below:

python -m torch.distributed.launch --nproc_per_node 4 run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file train.txt --validation_file test.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir output --learning_rate 1e-06 --per_device_train_batch_size=1 --gradient_accumulation_steps=16 --gradient_checkpointing --adafactor

I saw in the documentation that the model was trained using 100 NVIDIA A100 GPUs, but I was thinking that fine-tuning could be done on a smaller system.

Thanks in advance!

Kathryn


Hi codev,

That is unfortunate; I would have expected it to run on your system as well. It seems that you might need a card with more memory.
I can only think of changing the data type to --fp16 or --bf16 and seeing if that fits: https://huggingface.co/docs/transformers/v4.13.0/en/performance#floating-data-types
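As a sketch, that would just be your command with the extra flag appended (I have not tested this exact invocation, and --bf16 needs an Ampere-or-newer GPU and a recent transformers version, which the A10G should satisfy):

python -m torch.distributed.launch --nproc_per_node 4 run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file train.txt --validation_file test.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir output --learning_rate 1e-06 --per_device_train_batch_size=1 --gradient_accumulation_steps=16 --gradient_checkpointing --adafactor --fp16

(or --bf16 in place of --fp16).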

I don't know the minimum requirements because we always fine-tune on a single A40. However, if I remember correctly, I tried once to fine-tune on a single RTX 3090, and that didn't fit either...

I still hope that changing the data type works!

Best
Noelia

I have tried that as well, but it didn't seem to make a difference. After reading the documentation, it seems like --fp16 is more about speeding up training than saving memory. I will try increasing my GPU memory. Thanks for the help!

Kathryn

Just an FYI for others who might be running into similar issues: I was able to fine-tune the model on an NVIDIA A100 (40 GB of memory) with the command mentioned in my first message. I ran out of memory when not using the memory-saving strategies.

@codev Thanks for this post.

I encountered the same problem; please see my post here.
Since I'm using AWS, I'm not sure which instance type is the right one to use.
I even tried an AWS p3.16xlarge, but I still got a CUDA OOM error.

The AWS EC2 p3.16xlarge instance type is powered by 8 NVIDIA Tesla V100 GPUs, each with 16 GB of GPU memory, for a total of 128 GB.

Do you have any suggestions on how to resolve this?

Sincerely,
Littleworth

@littleworth The most important thing is the amount of memory per GPU; the total across the instance does not matter. Training parallelizes batches across the GPUs, so each GPU gets its own batch, and if that batch exceeds a single GPU's memory you will get an OOM error. The only AWS instance I am aware of that uses NVIDIA A100s is p4d.24xlarge; however, I have had trouble getting access to those, which is why I switched to Google Colab for fine-tuning. Colab can give you access to A100s, which have 40 GB of memory per GPU. I also use some memory-saving tricks when I run the fine-tuning; see this article for more info: https://huggingface.co/docs/transformers/v4.18.0/en/performance.
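For reference, the single-GPU version of my command (no distributed launcher needed; the file names and hyperparameters are just the ones from my first message, so adjust them for your data) looks like this:

python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file train.txt --validation_file test.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir output --learning_rate 1e-06 --per_device_train_batch_size=1 --gradient_accumulation_steps=16 --gradient_checkpointing --adafactor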

@codev Thanks for your advice. I'll give it a try. Meanwhile, I managed to get my code working with DeepSpeed.
See my comment here.
Please let me know if that approach is not good or something :-)
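Roughly, the approach is to point run_clm.py at a DeepSpeed ZeRO config through the --deepspeed flag and launch with the deepspeed launcher, something like the line below (ds_config.json is just a placeholder name here; see the comment linked above for the details of what I actually ran):

deepspeed --num_gpus 8 run_clm.py --deepspeed ds_config.json --model_name_or_path nferruz/ProtGPT2 --train_file train.txt --validation_file test.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir output --per_device_train_batch_size=1 --gradient_accumulation_steps=16 --gradient_checkpointing --fp16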

I'm not familiar with DeepSpeed, but glad you got it working!
