How to load Falcon-40B on Nvidia H100 GPU with 80GB VRAM?
Even with the load_in_8bit=True setting, the model doesn't load on the GPU. How can I load it for inference?
@FalconLLM
Unfortunately, I do not currently have access to an H100, so it will be hard to debug issues there specifically. Some people do seem to be able to run it on an H100 (https://www.youtube.com/watch?v=iEuf1PrmZ0Q), so maybe seeing what they do will be of some help.
80GB is going to be very tight though, so it will require some CPU offloading with accelerate. If I understand things correctly, accelerate is able to automatically offload to CPU memory, but I am not too familiar with this process.
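For reference, a rough sketch of what that accelerate-based offload could look like (untested on an H100; the max_memory limits below are illustrative placeholders, not tuned values):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets accelerate fill the GPU first and spill the remaining
# layers to CPU RAM; max_memory caps how much each device is allowed to hold.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={0: "75GiB", "cpu": "120GiB"},  # placeholder limits
    offload_folder="offload",  # optional: spill to disk if CPU RAM runs out too
)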
The smallest setup we've run it on is 4x A10 (4x24GB = 96GB).
Sorry not to be of more help; hopefully some other people who have managed to make it run can chime in.
I have not been able to run it at all, even on a massive deployment with 240 GB of VGPU memory. I used the code from the main page. It is clearly a memory issue, because 7B runs (but even that takes up more than 50% of VGPU on the 240 GB setup). Any ideas, can you help?
@dstatch, which (and how many) GPUs were you trying to run it on?
4 x 80 GB. I tried Runpod and Datacrunch, and it fails in both places. It seems that it is not even a VRAM issue, but one of inter-GPU communication. I'm really excited about the potential of this, but as it stands, even throwing very large resources at it does not help.
I am not sure what the issue is; I'm just positive that I am not the only one experiencing it, since I have tried in multiple places.
I'm running it with the following code on a Datacrunch 80GB A100 (using 8-bit mode).
Credit where credit is due: I basically lifted this code from Sam Witteveen's excellent YouTube video and Colab:
https://www.youtube.com/watch?v=5M1ZpG2Zz90
It should work on an H100 as well.
import torch
import transformers
from transformers import GenerationConfig, pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM
from transformers import BitsAndBytesConfig
import bitsandbytes as bnb
from torch.cuda.amp import autocast

model = "tiiuae/falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(model)

# Load the weights in 8-bit so the 40B model fits on a single 80GB card.
model = AutoModelForCausalLM.from_pretrained(
    model,
    load_in_8bit=True,
    trust_remote_code=True,
    device_map='auto',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

with autocast(dtype=torch.float16):
    sequences = pipeline(
        "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
        max_length=200,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
    )

for seq in sequences:
    print(f"Result: {seq['generated_text']}")
I'm running the following conda env (kind of a mess, but it seems to work):
conda create --name llm python=3.10
conda activate llm
conda install pytorch==2.0.0 pytorch-cuda=11.8 transformers -c pytorch -c nvidia
pip install einops accelerate
pip install -q -U bitsandbytes
pip install -q -U git+https://github.com/huggingface/transformers.git
pip install -q -U git+https://github.com/huggingface/peft.git
pip install -q -U git+https://github.com/huggingface/accelerate.git
pip -q install sentencepiece Xformers einops
pip -q install langchain
Would this run with 5x 12GB VRAM (3060) GPUs? I run a mining rig at home...
Thanks @tinkertank! I tried your install + run on an H100 (Lambda Labs) but I'm getting cuBLAS errors...
File "/home/ubuntu/.cache/huggingface/modules/transformers_modules/tiiuae/falcon-40b/b0462812b2f53caab9ccc64051635a74662fc73b/modelling_RW.py", line 252, in forward
fused_qkv = self.query_key_value(hidden_states) # [batch_size, seq_length, 3 x hidden_size]
File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 388, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 559, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 397, in forward
out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
File "/home/ubuntu/miniconda3/envs/llm/lib/python3.10/site-packages/bitsandbytes/functional.py", line 1781, in igemmlt
raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!
Any ideas?
I was also getting the error @Adrians was getting. It looked to me like some issue in 8-bit mode, probably because some wrong operation is being called. So I skipped 8-bit, and the below worked for me on an H100 from Lambda.
Just checked: the below worked on a fresh instance (I ran no other commands).
Install miniconda
We only do this because the torch/CUDA install works smoothly through conda.
# Download latest miniconda.
wget -nc https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Install. -b is used to skip prompt
bash Miniconda3-latest-Linux-x86_64.sh -b
# Activate.
eval "$(/home/ubuntu/miniconda3/bin/conda shell.bash hook)"
# (optional) Add activation cmd to bashrc so you don't have to run the above every time.
printf '\neval "$(/home/ubuntu/miniconda3/bin/conda shell.bash hook)"' >> ~/.bashrc
Setup env
Note: I don't think you need to install transformers from GitHub if you use device_map={"": 0} later instead of device_map=0, but I haven't checked.
# Create and activate env. -y skips confirmation prompt.
conda create -n falcon-env python=3.9 -y
conda activate falcon-env
# newest torch with cuda 11.8
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
# For transformers, the commit I installed was f49a3453caa6fe606bb31c571423f72264152fce
pip install -U accelerate einops sentencepiece git+https://github.com/huggingface/transformers.git
Run it
This will use up basically all the memory, but it works.
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = "tiiuae/falcon-40b"
tokenizer = AutoTokenizer.from_pretrained(model)
model = AutoModelForCausalLM.from_pretrained(model, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map=0)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map=0,
)

sequences = pipeline(
    "To make the perfect chocolate chip cookies,",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")
Hi guys,
I am back. The code from @nateraw worked on my Lambda H100 instance; I only needed to upgrade Transformers from 4.29.2 to 4.30.0. Without that it was giving a device_map "int type doesn't have .values()" error, which took me a while to figure out.
But it looks like the model only just fits; here's the GPU usage, at 99.1%.
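(If you want to read that figure from Python instead of nvidia-smi, torch exposes it directly; a quick check along these lines:)

import torch

# mem_get_info() returns (free_bytes, total_bytes) for the current CUDA device.
free, total = torch.cuda.mem_get_info()
print(f"GPU memory in use: {100 * (total - free) / total:.1f}%")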
Next up: loading it in langchain.
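In case it saves someone a step, a minimal sketch of that wrapping (assuming the pipeline object from @nateraw's code above and a 2023-era langchain where HuggingFacePipeline lives under langchain.llms):

from langchain import LLMChain, PromptTemplate
from langchain.llms import HuggingFacePipeline

# Wrap the already-loaded transformers pipeline so LangChain can drive it as an LLM.
llm = HuggingFacePipeline(pipeline=pipeline)

prompt = PromptTemplate(
    template="Question: {question}\n\nAnswer:",
    input_variables=["question"],
)
chain = LLMChain(prompt=prompt, llm=llm)

print(chain.run("What should I look for when buying a telescope?"))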
What was the inference time on this? @nateraw