CUDA Out of Memory Error when Running deepseek-coder-v2-instruct on 8x A100 GPUs

#10 opened by Mann1904

According to the model card, running the deepseek-coder-v2-instruct model requires 8 A100 GPUs. However, when I attempt to run the model on 8x A100 80GB GPUs, I encounter the following error:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 7 has a total capacity of 79.14 GiB of which 12.75 MiB is free. Process 3830990 has 79.12 GiB memory in use. Of the allocated memory 78.70 GiB is allocated by PyTorch, and 8.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
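(For reference, the allocator option mentioned at the end of the traceback is an environment variable; it mainly mitigates fragmentation, and since only ~9 MiB is reserved-but-unallocated here it is unlikely to be the real fix, but it would be enabled like this:)

import os

# Must be set before the first CUDA allocation (ideally before importing torch).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"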

Are you using vLLM? You might need to experiment with the tensor_parallel_size (tp_size) parameter, for example:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 8192, 8  # tp_size should match the number of GPUs; max_model_len bounds the KV cache
model_name = "deepseek-ai/DeepSeek-Coder-V2-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# enforce_eager=True skips CUDA graph capture, trading some speed for lower GPU memory use
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "write a quick sort algorithm in python."}],
    [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
]

prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
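If the vLLM engine itself still runs out of memory, the usual knobs are gpu_memory_utilization (the fraction of each GPU vLLM is allowed to claim, 0.9 by default) and a smaller max_model_len to shrink the KV cache. The values below are illustrative only:

llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=4096,              # smaller context window -> smaller KV cache
    gpu_memory_utilization=0.95,     # fraction of each GPU vLLM may claim (default 0.9)
    trust_remote_code=True,
    enforce_eager=True,              # skip CUDA graphs to save memory
)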

I am not using vLLM. I used the same code mentioned in the model card.

This model is massive, and vLLM with tensor parallelism is probably the best way to run it efficiently.
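That said, if you want to stay on plain Transformers, the usual approach for a model this size is to shard the weights across all GPUs with device_map="auto" and cap per-GPU memory so each card keeps headroom for activations and the KV cache. A rough sketch, where the max_memory values are illustrative and not from the model card:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-Coder-V2-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Shard the BF16 weights across all 8 GPUs, leaving a few GiB per card free.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={i: "72GiB" for i in range(8)},
)

messages = [{"role": "user", "content": "write a quick sort algorithm in python."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))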
