CPU seems to be the bottleneck preventing full use of the GPU

#211
by vangap - opened

Hello,

I have Llama-3-8B-Instruct running on an L4 GPU (GCP VM). During inference, GPU utilization sits at around 50%. Digging a little further, I noticed that one CPU core is at 100% throughout the inference, so I am guessing the CPU is the bottleneck preventing full use of the GPU. Upon CPU profiling, I see that most of this CPU time is attributed to libcuda. I am wondering whether this is normal or whether something is wrong with my environment that leads to this behavior.

Below is my code:

    import torch
    from transformers import pipeline

    pipe = pipeline(
        "text-generation",
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        model_kwargs={"torch_dtype": torch.bfloat16},
        device="cuda",
    )

    messages = [{"role": "user", "content": "Hello"}]  # example chat input
    max_length = 256  # example cap on generated tokens

    prompt = pipe.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # Stop on either the model's EOS token or Llama 3's end-of-turn token.
    terminators = [
        pipe.tokenizer.eos_token_id,
        pipe.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    ]

    outputs = pipe(
        prompt,
        max_new_tokens=max_length,
        eos_token_id=terminators,
        # Greedy decoding; temperature/top_p only take effect when
        # do_sample=True, so they are omitted here.
        do_sample=False,
    )
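
For anyone who wants to reproduce the measurement, something like this should show the CPU/CUDA split (a minimal sketch reusing `pipe`, `prompt`, and `terminators` from the snippet above; `max_new_tokens=32` is an arbitrary example value):

    from torch.profiler import profile, ProfilerActivity

    # Profile one short generation, recording both CPU-side and CUDA activity.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        pipe(prompt, max_new_tokens=32, eos_token_id=terminators, do_sample=False)

    # Sort by self CPU time to see what the busy core is actually doing.
    print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=15))

If most of the self CPU time lands in kernel-launch calls (e.g. `cudaLaunchKernel`, which ends up in libcuda) rather than in the kernels themselves, that would point at CPU-side launch overhead during decoding.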

Just out of curiosity, which version of torch are you using? Mine is not CUDA-enabled, and I am being instructed to recompile from source to enable CUDA.
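
As a quick check before recompiling anything, you can verify whether your installed torch build has CUDA support (a minimal sketch; output values are illustrative):

    import torch

    print(torch.__version__)          # pip CUDA builds often carry a "+cuXXX" suffix, e.g. "2.3.0+cu121"
    print(torch.version.cuda)         # None on a CPU-only build
    print(torch.cuda.is_available())  # False if the build or the driver lacks CUDA support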
