CUDA Out of Memory Error when Running deepseek-coder-v2-instruct on 8x A100 GPUs

#10 opened by Mann1904

According to the model card, running the deepseek-coder-v2-instruct model requires 8 A100 GPUs. However, when I attempt to run the model on 8x A100 80GB GPUs, I encounter the following error:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 7 has a total capacity of 79.14 GiB of which 12.75 MiB is free. Process 3830990 has 79.12 GiB memory in use. Of the allocated memory 78.70 GiB is allocated by PyTorch, and 8.95 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
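(For reference, the allocator option mentioned at the end of the traceback is an environment variable; it mainly mitigates fragmentation, and since only ~9 MiB is reserved-but-unallocated here it is unlikely to be the real fix, but it would be enabled like this:)

import os

# Must be set before the first CUDA allocation (ideally before importing torch).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"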

Are you using vLLM? You might need to experiment with the tensor_parallel_size (tp_size) parameter, for example:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 8192, 8  # tp_size should match the number of GPUs; max_model_len bounds the KV cache
model_name = "deepseek-ai/DeepSeek-Coder-V2-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# enforce_eager=True skips CUDA graph capture, trading some speed for lower GPU memory use
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "write a quick sort algorithm in python."}],
    [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
]

prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
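If the vLLM engine itself still runs out of memory, the usual knobs are gpu_memory_utilization (the fraction of each GPU vLLM is allowed to claim, 0.9 by default) and a smaller max_model_len to shrink the KV cache. The values below are illustrative only:

llm = LLM(
    model=model_name,
    tensor_parallel_size=tp_size,
    max_model_len=4096,              # smaller context window -> smaller KV cache
    gpu_memory_utilization=0.95,     # fraction of each GPU vLLM may claim (default 0.9)
    trust_remote_code=True,
    enforce_eager=True,              # skip CUDA graphs to save memory
)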

I am not using vLLM. I used the same code mentioned in the model card.

This model is massive, and vLLM with tensor parallelism is probably the best way to run it efficiently.
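That said, if you want to stay on plain Transformers, the usual approach for a model this size is to shard the weights across all GPUs with device_map="auto" and cap per-GPU memory so each card keeps headroom for activations and the KV cache. A rough sketch, where the max_memory values are illustrative and not from the model card:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-Coder-V2-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Shard the BF16 weights across all 8 GPUs, leaving a few GiB per card free.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory={i: "72GiB" for i in range(8)},
)

messages = [{"role": "user", "content": "write a quick sort algorithm in python."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))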
