Possible memory leak issue?
#46 opened by Tieni
When performing batch processing with the 128k model for long-context (>10k token) reasoning, GPU memory usage keeps rising until the process runs out of memory (OOM). To work around this, I have to call `torch.cuda.empty_cache()` after each pipeline call. However, I'm uncertain whether this is expected behavior or whether there is something else I should do to address the problem.
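For reference, a minimal sketch of the loop with the workaround described above (the model id, prompts, and generation settings here are placeholders, not my exact setup):

```python
import torch
from transformers import pipeline

# Placeholder model id and settings, shown only to illustrate the workaround.
generator = pipeline(
    "text-generation",
    model="microsoft/Phi-3-mini-128k-instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)

batches = [["<long prompt 1>", "<long prompt 2>"], ["<long prompt 3>"]]

for batch in batches:
    outputs = generator(batch, max_new_tokens=512, batch_size=len(batch))
    # Release cached, currently unused GPU blocks after each call so the
    # allocator's reserved memory does not keep growing across batches.
    torch.cuda.empty_cache()
```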
Thanks for raising this issue. We are not aware of any memory impact from the LongRoPE extension.
Could it be that, during batch processing, all of the generations are running to the maximum generation length within the batch, and, because the context is long, this fills up the GPU memory?
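One way to check this hypothesis is to cap the number of new tokens and track peak GPU memory per batch. A rough sketch, assuming the standard `generate` API (the model id and cap value are assumptions, not from this thread):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model id for illustration only.
model_id = "microsoft/Phi-3-mini-128k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda"
)

prompts = ["<long prompt 1>", "<long prompt 2>"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda")

torch.cuda.reset_peak_memory_stats()
outputs = model.generate(**inputs, max_new_tokens=256)  # hard cap on new tokens

# Report how many new tokens the batch actually produced and the peak GPU
# memory, which should show whether every sequence is hitting the cap.
new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"new tokens (longest sequence in batch): {new_tokens}")
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```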
nguyenbh changed discussion status to closed