Question about the results for HumanEvalFix

#2
by Borko24 - opened

Hi,
I see the Granite models performing really well on HumanEvalFix. I carried out the evaluation according to this paper: https://arxiv.org/html/2308.07124v2
In the Granite technical report, the instruct models have these results:

[Image: HumanEvalFix results table from the Granite technical report]
While I get this (using the subvariant of evaluation without docstrings):

[Image: my HumanEvalFix evaluation results]
As you can see, the 20b model is underperforming and does not follow the trend of increasing accuracy.
Could you also upload the steps you took to evaluate the models? Also, other models such as StarCoder2 significantly outperform the results reported in the Granite paper.

IBM Granite org

Hi, thanks for your interest in the Granite Code family!

HumanEvalPack has often shown itself to be brittle to the parameters, systems, and environments used for evaluation. For example, Rust does not yield deterministic metrics because it pulls crates over the network during each evaluation by default. That said, please find below our exact evaluation suite and experimental settings to reproduce the numbers for the different Granite models.

  • Our results in the technical report are true pass@1 scores using greedy decoding. They are not pass@1 estimates over 20 samples or similar, and they are not the pass@10 scores that some other reports use.
  • Unlike the original benchmark, we set max_new_tokens to 512, not only to test the model’s ability to stop but also to account for models with different-length system prompts. We felt that setting only the maximum model length is not a fair benchmark for models with long system prompts.
  • Our instruct model prompting format is not yet supported in the public version of the BigCode Evaluation Harness. Prompting techniques can drastically change scores, so please make sure you are using the right prompt template when evaluating; a minimal sketch of these settings follows this list.
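For concreteness, here is a minimal sketch of what those generation settings look like with plain transformers. This is not the harness's code path: the prompt below is a made-up placeholder, and the use of apply_chat_template assumes the tokenizer ships a chat template; the actual HumanEvalFix prompts come from the harness via the --prompt option.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch of the settings above: greedy decoding, one sample per task,
# and max_new_tokens=512 regardless of prompt length.
model_id = "ibm-granite/granite-20b-code-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Placeholder instruction -- the harness builds the real HumanEvalFix prompt,
# so this message is purely illustrative.
messages = [{"role": "user", "content": "Fix the bug in this function so the tests pass:\n..."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(
    input_ids,
    max_new_tokens=512,  # cap on newly generated tokens, independent of prompt length
    do_sample=False,     # greedy decoding -> one deterministic completion per task
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```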

I've uploaded the above changes to GitHub (found here). You can clone that repo and run the following command to reproduce the results for fixing Python without docstrings, humanevalfixtests-python:

$ accelerate launch main.py --model ibm-granite/granite-20b-code-instruct  --tasks humanevalfixtests-python --allow_code_execution --prompt octocoder_system --use_auth_token --precision bf16 --max_new_tokens 512 --batch_size 1 --n_samples 1 --do_sample False --trust_remote_code --save_generations
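As a side note on scoring: with --n_samples 1 and --do_sample False there is exactly one deterministic generation per task, so pass@1 is the exact fraction of tasks whose generation passes all tests rather than a sampled estimate. The sketch below shows the standard unbiased pass@k estimator from the HumanEval paper only to illustrate that reduction; the harness computes the scores itself.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper.

    n: samples generated per task, c: samples passing all tests, k: k in pass@k.
    """
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# With n_samples=1 and greedy decoding this is simply 1.0 if the single
# generation passes and 0.0 otherwise -- a "true" pass@1, not an estimate.
assert pass_at_k(n=1, c=1, k=1) == 1.0
assert pass_at_k(n=1, c=0, k=1) == 0.0
```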

And this is my current environment:

$ accelerate env

Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.29.3
- Platform: Linux-5.14.0-362.18.1.el9_3.x86_64-x86_64-with-glibc2.34
- `accelerate` bash location: /u/stallone/envs/bigcode/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.0+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 2015.38 GB
- GPU type: NVIDIA H100 80GB HBM3
- `Accelerate` default config:
    Not found

Let us know if you have any further questions. Thanks!
