Possible memory leak issue?
#46 opened by Tieni
When performing batch processing with the 128k model for long-context (>10k token) reasoning, GPU memory usage keeps rising until the process runs out of memory (OOM). To work around this, I have to call `torch.cuda.empty_cache()` after each pipeline call. However, I'm uncertain whether this is expected behavior or whether there is something else I should do to address the problem.
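For reference, a minimal sketch of the loop with the workaround described above (the model id, prompts, and generation settings here are placeholders, not my exact setup):

```python
import torch
from transformers import pipeline

# Placeholder model id and settings, shown only to illustrate the workaround.
generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-128k-instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

batches = [["<long prompt 1>", "<long prompt 2>"], ["<long prompt 3>"]]

for batch in batches:
    outputs = generator(batch, max_new_tokens=512, batch_size=len(batch))
    # Release cached, currently unused GPU blocks after each call so the
    # allocator's reserved memory does not keep growing across batches.
    torch.cuda.empty_cache()
```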
Thanks for raising this issue. We are not aware of any memory impact from the LongRoPE extension.
Could it be that, during batch processing, all of the generations are running to the maximum generation length within the batch, and, because the context is long, this fills up the GPU memory?
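One way to check this hypothesis is to cap the number of new tokens and track peak GPU memory per batch. A rough sketch, assuming the standard `generate` API (the model id and cap value are assumptions, not from this thread):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model id for illustration only.
model_id = "microsoft/Phi-3-mini-128k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

prompts = ["<long prompt 1>", "<long prompt 2>"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

torch.cuda.reset_peak_memory_stats()
outputs = model.generate(**inputs, max_new_tokens=256)  # hard cap on new tokens

# Report how many new tokens the batch actually produced and the peak GPU
# memory, which should show whether every sequence is hitting the cap.
new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"new tokens (longest sequence in batch): {new_tokens}")
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```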
nguyenbh changed discussion status to closed