Help with training dataset output and validation .txt files
Hello! This is such a great resource and I'm really looking forward to using it. I want to fine-tune on a relatively small dataset (<677 lines, so ~10k tokens). I followed the directions and created a training.txt file with <|endoftext|> as the header for each amino-acid sequence, and set aside ~10% of it (70 lines) as a validation.txt dataset. My command:

```
python run_clm.py --model_name_or_path nferruz/ProtGPT2 --train_file training.txt --validation_file validation.txt --tokenizer_name nferruz/ProtGPT2 --output_dir /home/grant/test
```
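For context, this is roughly how I built the two files (a minimal sketch; `sequences.fasta`, the output file names, and the 10% split are just my own choices, and the 60-residue line wrapping follows my reading of the ProtGPT2 model card):

```python
# prepare_data.py -- sketch of how I built training.txt / validation.txt
# from a plain FASTA file (input name and split fraction are placeholders).
import random

def read_fasta(path):
    """Yield one amino-acid sequence per FASTA record."""
    seq = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if seq:
                    yield "".join(seq)
                    seq = []
            elif line:
                seq.append(line)
    if seq:
        yield "".join(seq)

def to_protgpt2(seq, width=60):
    """Replace the FASTA header with <|endoftext|> and wrap the sequence
    at 60 residues per line, as described on the ProtGPT2 model card."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "<|endoftext|>\n" + "\n".join(lines) + "\n"

random.seed(42)
records = list(read_fasta("sequences.fasta"))  # placeholder input file
random.shuffle(records)
n_val = max(1, len(records) // 10)  # hold out ~10% for validation

with open("validation.txt", "w") as f:
    f.writelines(to_protgpt2(s) for s in records[:n_val])
with open("training.txt", "w") as f:
    f.writelines(to_protgpt2(s) for s in records[n_val:])
```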
The command runs successfully without any errors, and as far as I can tell there is no error.txt output (yay!). However, my output directory contains only a README.md that reads as follows (see below). Clearly my results field is empty ([]), so I'm assuming the training dataset is too small (I saw some other posts where people used >500 sequences). Am I right in my assumption? Is this the end of the road?
Another way to answer this would be an example training.txt (and accompanying validation.txt) I could download, so I know what a "good" validation run looks like.
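In case it matters, this is the layout I followed: <|endoftext|> in place of each FASTA header, with sequences wrapped at 60 residues per line. The residues below are arbitrary placeholders purely to illustrate the format:

```
<|endoftext|>
MGHHLVEALYKVAGIRTDLQTLKDAFNSWREGPSMVAETIKQRLDNSFAEIVPGQTLLNV
SERDWAKHG
<|endoftext|>
MKELSPQAVRVTDNAGYIFSGKQLHEWAGRPLVS
```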
ANY help would be appreciated. THANKS!
My output (README.md):

```
---
license: apache-2.0
base_model: nferruz/ProtGPT2
tags:
- generated_from_trainer
model-index:
- name: test
  results: []
---

# test

This model is a fine-tuned version of nferruz/ProtGPT2 on an unknown dataset.

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3.0

### Framework versions

- Transformers 4.35.0.dev0
- Pytorch 2.1.0+cu118
- Datasets 2.14.5
- Tokenizers 0.14.1
```