72b models eval failed
Hello @clefourrier ,
It seems the models submitted yesterday failed again.
Could this one be given priority, in case submitting too many at once makes the system fail?
We are currently benchmarking it on EQBench and also need the other benchmark scores for our research paper.
Is it possible to also get the logs ?
Thanks for your help,
Andre
Hi! Next time, please reopen the issues instead of opening new ones - that way it tags everyone who was part of the convo.
Operation was aborted for all models when trying to assemble the shards. Are you sure your models are formatted properly?
Loading checkpoint shards: 100%|██████████| 82/82 [00:39<00:00, 2.10it/s]
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000225 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000225 milliseconds before timing out.
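For the formatting question above, one rough local sanity check is to confirm that every shard named in the checkpoint index is actually present in the model folder. The sketch below assumes a safetensors checkpoint whose index file is named `model.safetensors.index.json` and contains a `weight_map` entry; the helper name `find_missing_shards` and the demo file names are hypothetical. It only checks that the files exist, so it would not by itself explain an NCCL timeout like the one in the log.

```python
import json
import os
import tempfile


def find_missing_shards(model_dir, index_name="model.safetensors.index.json"):
    """Return the sorted shard filenames listed in the index but absent from model_dir."""
    index_path = os.path.join(model_dir, index_name)
    with open(index_path, "r", encoding="utf-8") as f:
        index = json.load(f)
    # "weight_map" maps each tensor name to the shard file that stores it.
    referenced = set(index["weight_map"].values())
    return sorted(
        name for name in referenced
        if not os.path.isfile(os.path.join(model_dir, name))
    )


if __name__ == "__main__":
    # Demo on a fake two-shard checkpoint; the file names are made up for illustration.
    with tempfile.TemporaryDirectory() as tmp:
        fake_index = {
            "metadata": {},
            "weight_map": {
                "layer.0.weight": "model-00001-of-00002.safetensors",
                "layer.1.weight": "model-00002-of-00002.safetensors",
            },
        }
        with open(os.path.join(tmp, "model.safetensors.index.json"), "w", encoding="utf-8") as f:
            json.dump(fake_index, f)
        # Create only the first shard so the second one is reported as missing.
        open(os.path.join(tmp, "model-00001-of-00002.safetensors"), "wb").close()
        print(find_missing_shards(tmp))
```

An empty list means the index and the folder agree on which shard files exist; it says nothing about whether the tensors inside them are valid.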
Hi ! @clefourrier
Thanks for the logs !
We are surprised as we have successfully tested a lot of them without error.
Here are the benchmark results on EQBench for the following two models (run on 5 H100s):
https://huggingface.co/datasets/open-llm-leaderboard/requests/commit/aa62e2c672268c5737dcb2cd554872ad26b2ccb6
Benchmark Results
Id : ECE-TW3-JRGL-V1
Date : 2024-04-09 19:23:47
Success : True
Model : paloalma/ECE-TW3-JRGL-V1
this_score : 82.8
Bench tries : 0
Parseable : 171.0
EQ-Bench version : v2
and
https://huggingface.co/datasets/open-llm-leaderboard/requests/commit/7f30b5503a42c4a4174b3a3a9b081fcc061b9c4f
Benchmark Results
Id : TW3-JRGL-v2
Date : 2024-04-10 05:46:49
Success : True
Model : paloalma/TW3-JRGL-v2
this_score : 82.15
Bench tries : 0
Parseable : 170.0
EQ-Bench version : v2
Is it possible to submit these two again for evaluation ?
Thank you for sharing the EQBench results! We are investigating the problem with your models on our side. We'll keep you posted as we figure things out and will relaunch your models as soon as the problem is solved.
Hello Guys !
Thanks for relaunching the models !
Unfortunately, they seem to have failed once again. Do you have any idea why they can't be evaluated, or any leads?
We're wondering whether it's because this is a merged model, as we haven't seen many merged models of this size.
Hope to receive good news from you soon !
Thanks, HF Team !
@alozowski
@clefourrier
Hi!
The merged aspect is an interesting hypothesis, but we've never had issues with merged models before.
However, since the CUDA issue above has appeared consistently on all the restarts of your models, and only for them specifically (so it's hard to pinpoint where the problem comes from since we have no comparison point), we'll close this issue as we don't have the bandwidth to investigate.
Hello guys !!!!
It's incredible !!! We saw that TW3-JRGL-V1 went through and is now top 1 on the leaderboard. We are so excited !!!!
Thank you for running the evaluation again !
Do you think you'll be able to run the other ones again ?
Also what was the final cause of all these errors when running the other models ?
Thank you for everything !
Also, just FYI: since it came first, we changed its name from TW3-JRGL-V1 to Le_Triomphant-ECE-TW3. Will the change also appear on the leaderboard?
Thank you very much.
Hi @paloalma ,
Congrats! ✨
Regarding the renaming, I see that there is a separate request file with a model called Le_Triomphant-ECE-TW3. Is this a different model than TW3-JRGL-v1?
Hi @alozowski !
Thank you !
It's indeed the same model. We were afraid that it had been deleted and that the scores were lost, so we tried to re-submit it.
So Le_Triomphant-ECE-TW3 and TW3-JRGL-v1 are the same.
Thanks !
Thank you very much @alozowski !
Just a quick question, what were the issues with our models, and what was modified to allow them to get through the evaluation ?
It could be very interesting for us to know.
Again, thank you for your work !
Paloalma
Hi @paloalma !
Nothing extraordinary, sometimes we get quite a few models coming in for evaluation and there can be various hardware issues, so resubmitting can help – this seems to be the case here.
I'll close this issue now. Please feel free to open a new one if you need any help :)