
72b models eval failed

#689

by paloalma - opened Apr 18

Discussion

paloalma

Apr 18

Hello @clefourrier ,

It seems the models submitted yesterday failed again.

https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/TW3PartnersLLM/tw3jrglv3_eval_request_False_float16_Original.json

https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/paloalma/TW3-JRGL-v1_eval_request_False_bfloat16_Original.json

https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/paloalma/ECE-TW3-JRGL-V3_eval_request_False_bfloat16_Original.json

Could this one be given priority, in case submitting too many at once is what makes the system fail?

https://huggingface.co/datasets/open-llm-leaderboard/requests/commit/aa62e2c672268c5737dcb2cd554872ad26b2ccb6

We are currently benchmarking it on EQ-Bench and also need the other benchmark scores for our research paper.

Is it also possible to get the logs?

Thanks for your help,
Andre

clefourrier

Open LLM Leaderboard org Apr 18

Hi! Next time, please reopen the issues instead of opening new ones - that way it tags everyone who was part of the convo.

Operation was aborted for all models when trying to assemble the shards. Are you sure your models are formatted properly?

Loading checkpoint shards: 100%|██████████| 82/82 [00:39<00:00,  2.10it/s]
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000225 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000225 milliseconds before timing out.
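
Editorial note on reading the log above: the watchdog line reports a single-element ALLREDUCE on rank 1 that exceeded a 3,000,000 ms limit, which works out to 50 minutes, right after the 82 shards finished loading. The following is a minimal Python sketch for pulling those numbers out of such a line. The function name `parse_watchdog_timeout` and the returned field names are hypothetical, and the regular expression is written only against the exact wording shown in this thread, so it may not match other PyTorch versions. It does not diagnose the cause of the stall.

```python
import re

# Regex written against the watchdog line quoted in this thread (an assumption:
# other PyTorch/NCCL versions may word this message differently).
WATCHDOG_RE = re.compile(
    r"\[Rank (?P<rank>\d+)\] Watchdog caught collective operation timeout: "
    r"WorkNCCL\(SeqNum=(?P<seq>\d+), OpType=(?P<op>\w+), .*?"
    r"Timeout\(ms\)=(?P<timeout_ms>\d+)\) ran for (?P<elapsed_ms>\d+) milliseconds"
)


def parse_watchdog_timeout(line):
    """Return the key fields of a watchdog timeout line, or None if no match."""
    m = WATCHDOG_RE.search(line)
    if m is None:
        return None
    timeout_ms = int(m["timeout_ms"])
    elapsed_ms = int(m["elapsed_ms"])
    return {
        "rank": int(m["rank"]),
        "seq_num": int(m["seq"]),
        "op_type": m["op"],
        "timeout_min": timeout_ms / 60_000,    # milliseconds -> minutes
        "overrun_ms": elapsed_ms - timeout_ms,  # how far past the limit it ran
    }


# The first error line from the log above, copied verbatim.
LOG_LINE = (
    "[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation "
    "timeout: WorkNCCL(SeqNum=4, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, "
    "Timeout(ms)=3000000) ran for 3000225 milliseconds before timing out."
)

info = parse_watchdog_timeout(LOG_LINE)
print(info)
```

Run over a full log file, a helper like this makes it easy to see which ranks and which collective operations timed out, which is useful context when reporting a failed evaluation.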

paloalma

Apr 18

Hi ! @clefourrier

Thanks for the logs !

We are surprised, as we have successfully tested many of them without error.
Here are the EQ-Bench results for the following two models (run on 5 H100s):
https://huggingface.co/datasets/open-llm-leaderboard/requests/commit/aa62e2c672268c5737dcb2cd554872ad26b2ccb6
Benchmark Results

Id : ECE-TW3-JRGL-V1
Date : 2024-04-09 19:23:47
Success : True
Model : paloalma/ECE-TW3-JRGL-V1
this_score : 82.8
Bench tries : 0
Parseable : 171.0
EQ-Bench version : v2

and
https://huggingface.co/datasets/open-llm-leaderboard/requests/commit/7f30b5503a42c4a4174b3a3a9b081fcc061b9c4f
Benchmark Results

Id : TW3-JRGL-v2
Date : 2024-04-10 05:46:49
Success : True
Model : paloalma/TW3-JRGL-v2
this_score : 82.15
Bench tries : 0
Parseable : 170.0
EQ-Bench version : v2

Is it possible to submit these two again for evaluation ?

alozowski

Open LLM Leaderboard org Apr 22

Thank you for sharing the EQBench results! We are investigating the problem with your models on our side. We'll keep you posted as we figure things out and will relaunch your models as soon as the problem is solved.

paloalma

Apr 30

Hello Guys !

Thanks for relaunching the models !
Unfortunately, they seem to have failed once again. Do you have any idea why they can't be evaluated, or any leads?
We're wondering whether it's because these are merged models, as we haven't seen many merged models of this size.

Hope to receive good news from you soon !

Thanks, HF Team !
@alozowski @clefourrier

clefourrier

Open LLM Leaderboard org May 2

Hi!
The merged aspect is an interesting hypothesis, but we've never had issues for merged models before.
However, the CUDA issue above has appeared consistently on every restart of your models, and only for them, so we have no comparison point and it's hard to pinpoint where the problem comes from. We'll close this issue, as we don't have the bandwidth to investigate.

clefourrier changed discussion status to closed May 2

paloalma

May 6

Hello guys !!!!

It's incredible!!! We saw that TW3-JRGL-V1 went through and is now top 1 on the leaderboard. We are so excited!!!!

Thank you for running the evaluation again !

Do you think you'll be able to run the other ones again ?

Also what was the final cause of all these errors when running the other models ?

Thank you for everything !

@alozowski @clefourrier

paloalma

May 6

Also, just FYI: since it came first, we changed its name from TW3-JRGL-V1 to Le_Triomphant-ECE-TW3. Will the change also appear on the leaderboard?

Thank you very much.

alozowski

Open LLM Leaderboard org May 7

Hi @paloalma ,

Congrats! ✨

Regarding the renaming, I see that there is a separate request file with a model called Le_Triomphant-ECE-TW3. Is this a different model from TW3-JRGL-v1?

alozowski changed discussion status to open May 7

paloalma

May 7

Hi @alozowski !

Thank you !

It's indeed the same model. We were afraid that it had been deleted and the scores lost, so we tried to resubmit it.

So Le_Triomphant-ECE-TW3 and TW3-JRGL-v1 are the same.

Thanks !

alozowski

Open LLM Leaderboard org May 7

Great! I renamed your model; please check out my screenshot.

alozowski

Open LLM Leaderboard org May 7

@paloalma can I help you with something else?

paloalma

May 7

Thank you very much @alozowski !

Just a quick question: what were the issues with our models, and what was changed to let them get through the evaluation?

It could be very interesting for us to know.

Again, thank you for your work !

Paloalma

alozowski

Open LLM Leaderboard org May 8

Hi @paloalma !

Nothing extraordinary: sometimes we get quite a few models coming in for evaluation and there can be various hardware issues, so resubmitting can help. This seems to be the case here.

I'll close this issue now. Please feel free to open a new one if you need any help :)

alozowski changed discussion status to closed May 8
