Error during Finetuning

#48
by pawanpc - opened

Hi,

Thanks so much for developing such a wonderful language model.

I tried to do fine-tuning following the Example 2 details mentioned here (https://huggingface.co/nferruz/ProtGPT2), but I am encountering an error. I am using 16 sequences for training and 4 sequences for testing (a smaller number of sequences, as this was just a test run).

This is the command I am using -
python3 run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ProtGPT2 --do_train --do_eval --output_dir output --learning_rate 1e-06

Training completes, but evaluation ends with an error - AttributeError: 'NoneType' object has no attribute 'get'

Can you please let me know how I can resolve this error?

Owner

Hi pawanpc,

I've also seen this error in more recent versions. Where did you get the run_clm.py file from? If from the huggingface github page, what transformers version are you using?

Best
Noelia

Hi Noelia,

Thanks so much for looking into this.

Yes, I downloaded the "run_clm.py" from the huggingface github page - https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py.

This is the transformers version I am using -
transformers==4.42.0.dev0

Do I need to change the transformers version? Any suggestions on how to resolve the error would be very helpful.

Owner

I am also not sure why the latest version of the script does not work with the latest version of HF, but it all seems to work in previous versions (e.g. 4.21). Ideally, however, we would want to understand how to make the latest scripts and version work (but I do not have the bandwidth for a couple of weeks, unfortunately).
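If you do try that, pinning the older release in your environment should just be something like the line below (4.21.0 as an example pin), ideally together with the run_clm.py from that same release tag, since the example scripts check the installed transformers version -

pip install transformers==4.21.0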

Hi Noelia,

Thanks for your inputs - let me try v4.21

Hi Noelia,

I have an update on this. The issue was not with the transformers version; it was the number of sequences in the validation file. Since the evaluation batch size was 8, increasing the number of sequences in the validation file resolved the error.
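If the problem really is the evaluation batch being larger than the tiny validation set, an alternative (assuming the standard run_clm.py / TrainingArguments flags) would presumably be to keep the small file and lower the eval batch size instead, e.g. by adding the following to the command above -

--per_device_eval_batch_size 1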

This is what the evaluation output looked like - is this expected? I think I still need more data for training and validation.

06/20/2024 17:07:58 - INFO - main - *** Evaluate ***
[INFO|trainer.py:3783] 2024-06-20 17:07:58,041 >>
***** Running Evaluation *****
[INFO|trainer.py:3785] 2024-06-20 17:07:58,041 >> Num examples = 1
[INFO|trainer.py:3788] 2024-06-20 17:07:58,042 >> Batch size = 8
100%|| 1/1 [00:00<00:00, 57.78it/s]
***** eval metrics *****
epoch = 3.0
eval_accuracy = 0.2063
eval_loss = 6.7518
eval_runtime = 0:00:05.61
eval_samples = 1
eval_samples_per_second = 0.178
eval_steps_per_second = 0.178
perplexity = 855.5718
[INFO|modelcard.py:449] 2024-06-20 17:08:03,918 >> Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Causal Language Modeling', 'type': 'text-generation'}, 'metrics': [{'name': 'Accuracy', 'type': 'accuracy', 'value': 0.20625610948191594}]}
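If I read run_clm.py correctly, the reported perplexity is just exp(eval_loss), so as a quick sanity check the two metrics above should agree -

import math
print(math.exp(6.7518))  # ~ 855.57, matching the reported perplexity of 855.5718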

Owner

Ah that's good to hear.
It looks like the standard output we get, so I'd say it ran fine. It could be, though, that you need more data - as you say, there seems to be only one eval sample. But you'd need to check the generation quality!
Best wishes
Noelia

Thank you Noelia - nice to know that the process ran fine.
Yes, I will definitely try with more data again.
Regarding the generation quality - I did not see it in the log; where can I check that information?

Owner

You can either check the perplexity of the generated sequences or see whether the sequences present the properties you'd expect. For example, after fine-tuning on a family of globular proteins, I'd expect the generated sequences to show high average pLDDT values.
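As a minimal sketch of the perplexity check (assuming the fine-tuned model was saved to the "output" directory from your command; the sequence below is only a placeholder and should be formatted the same way as your training data) -

import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# load the fine-tuned tokenizer and model from the --output_dir of run_clm.py
tokenizer = AutoTokenizer.from_pretrained("output")
model = AutoModelForCausalLM.from_pretrained("output")
model.eval()

sequence = "MKTAYIAKQR..."  # placeholder: one generated sequence
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    loss = model(**inputs, labels=inputs["input_ids"]).loss  # mean cross-entropy per token
print("perplexity:", math.exp(loss.item()))  # lower is better; compare across generated sequences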

Thank you Noelia for your suggestions - I appreciate it.

I tried to fine-tune again with a bigger set of training and validation sequences using "run_clm.py". The run completed successfully, but I did not see a "pytorch_model" file in the output directory. The "README.md" file looks like this, with the results shown as [] - can I please get your insights regarding this?


license: apache-2.0
base_model: nferruz/ProtGPT2
tags:
- generated_from_trainer
metrics:
- accuracy
model-index:
- name: output
  results: []

These are the output files generated in the output folder; besides these files, there are two folders, "runs" and "checkpoint" -
generation_config.json
config.json
vocab.json
train_results.json
training_args.bin
trainer_state.json
tokenizer.json
tokenizer_config.json
special_tokens_map.json
model.safetensors
merges.txt
README.md
eval_results.json
all_results.json

Owner

Sorry to hear that... I am afraid I don't know what happened.

I came across this post (https://huggingface.co/docs/diffusers/v0.13.0/en/using-diffusers/using_safetensors), which suggests that "model.safetensors" is a different format from the classic "pytorch_model.bin" file (generated by PyTorch) - is that applicable to ProtGPT2 as well?
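For what it's worth, my understanding (an assumption on my side, not something I have confirmed) is that from_pretrained loads either "model.safetensors" or "pytorch_model.bin" transparently, so the fine-tuned weights should be usable directly from the output folder -

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("output")  # picks up model.safetensors automatically
tokenizer = AutoTokenizer.from_pretrained("output")
print(sum(p.numel() for p in model.parameters()))  # quick check that the weights actually loaded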

Also, the "results[]" suggests like empty results - do you have some test validation and training dataset so that I can run the analysis and make sure my installed version and workflow is executing properly ?

I am running this for the first time - my apologies if my queries are too trivial for you.

Hi Noelia,
Can you please suggest a test dataset for running "run_clm.py", so that I have more detailed results to check further?
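For reference, this is roughly how I am preparing training.txt / validation.txt from my own sequences (my assumption of the expected format, based on the fine-tuning notes in the ProtGPT2 model card: FASTA-like 60-character lines with "<|endoftext|>" in place of each header - please correct me if this is wrong) -

# hypothetical helper script, not part of the ProtGPT2 repo
sequences = ["MKTAYIAKQR...", "MSEQNNTEMT..."]  # placeholders: your own protein sequences

def to_protgpt2_record(seq, width=60):
    # one record: an <|endoftext|> tag followed by the sequence wrapped to fixed-width lines
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "<|endoftext|>\n" + "\n".join(lines) + "\n"

split = int(0.8 * len(sequences))
with open("training.txt", "w") as f:
    f.writelines(to_protgpt2_record(s) for s in sequences[:split])
with open("validation.txt", "w") as f:
    f.writelines(to_protgpt2_record(s) for s in sequences[split:])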
I look forward to hearing from you.
