# BEE-spoke-data/bert-plus-L8-4096-v1.0
Still running some evals, etc.; expect the model card to change a bit.
* No additional code is required: the model uses `position_embedding_type="relative_key"` to help with long context.
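A minimal loading sketch (the masked-LM head is an assumption based on the evaluation metrics below; the checkpoint can also be loaded as a plain `AutoModel` encoder):

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "BEE-spoke-data/bert-plus-L8-4096-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# relative_key position embeddings back the long (4096-token) context window
print(model.config.position_embedding_type)   # expected: "relative_key"
print(model.config.max_position_embeddings)   # expected: 4096
```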
## this checkpoint
A further progression after multitask training, etc. The most recent (last) dataset it saw was `euirim/goodwiki`.
It achieves the following results on the evaluation set:
- Loss: 1.9835
- Accuracy: 0.6159
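As a quick smoke test, the checkpoint works with the `fill-mask` pipeline (the example sentence is illustrative only, not from the evaluation set):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="BEE-spoke-data/bert-plus-L8-4096-v1.0")

# print the top predictions for the masked token with their scores
for pred in fill("The capital of France is [MASK]."):
    print(pred["token_str"], round(pred["score"], 4))
```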
## GLUE benchmark
WIP until this text is removed.
Thus far, all runs were completed in fp32 (using NVIDIA's TF32 dtype behind the scenes where supported).
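For context, enabling TF32 in PyTorch only takes the standard backend flags (generic PyTorch configuration, not code from this repo):

```python
import torch

# keep parameters/activations in fp32, but allow TF32 math on Ampere+ GPUs
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```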
Model | Size | Avg | CoLA | SST2 | MRPC | STSB | QQP | MNLI | QNLI | RTE |
---|---|---|---|---|---|---|---|---|---|---|
bert-plus-L8-4096-v1.0 | 88.1M | 82.78 | 62.72 | 90.6 | 86.59 | 92.07 | 90.6 | 83.2 | 90.0 | 66.43 |
bert_uncased_L-8_H-768_A-12 | 81.2M | 81.65 | 54.0 | 92.6 | 85.43 | 92.60 | 90.6 | 81.0 | 90.0 | 67.0 |
bert-base-uncased | 110M | 79.05 | 52.1 | 93.5 | 88.9 | 85.8 | 71.2 | 84.0 | 90.5 | 66.4 |
And some comparisons to recent BERT models, taken from Nomic's blog post:
Model | Size | Avg | CoLA | SST2 | MRPC | STSB | QQP | MNLI | QNLI | RTE |
---|---|---|---|---|---|---|---|---|---|---|
NomicBERT | 137M | 84.00 | 50.00 | 93.00 | 88.00 | 90.00 | 92.00 | 86.00 | 92.00 | 82.00 |
RobertaBase | 125M | 86.00 | 64.00 | 95.00 | 90.00 | 91.00 | 92.00 | 88.00 | 93.00 | 79.00 |
JinaBERTBase | 137M | 83.00 | 51.00 | 95.00 | 88.00 | 90.00 | 81.00 | 86.00 | 92.00 | 79.00 |
MosaicBERT | 137M | 85.00 | 59.00 | 94.00 | 89.00 | 90.00 | 92.00 | 86.00 | 91.00 | 83.00 |
Observations:
- **Performance variation across models and tasks:** The data highlights significant performance variability both across and within models for different GLUE tasks. This variability underscores the complexity of natural language understanding tasks and the need for models to be versatile in handling different types of linguistic challenges.
- **Model size and efficiency:** Despite the differences in model size, there is not always a direct correlation between size and performance across tasks. For instance, `bert_uncased_L-8_H-768_A-12` performs competitively with larger models on certain tasks, suggesting that efficiency in model architecture and training can compensate for smaller size.
- **Task-specific challenges:** Certain tasks, such as RTE, present considerable challenges to all models, indicating the difficulty of tasks that require deep understanding and reasoning over language. This suggests areas where further research and model innovation are needed to improve performance.
- **Overall model performance:** Models like `roberta-base` show strong performance across a broad spectrum of tasks, indicating the effectiveness of their architecture and pre-training methodology. Meanwhile, models such as `BEE-spoke-data/bert-plus-L8-4096-v1.0` showcase the potential for achieving competitive performance with relatively smaller sizes, emphasizing the importance of model design and optimization.
## Training procedure
The below is auto-generated and applies only to the 'finishing touches' run on `euirim/goodwiki`.
### Training hyperparameters
The following hyperparameters were used during training (mirrored in the `TrainingArguments` sketch after the list):
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 4
- seed: 31010
- gradient_accumulation_steps: 16
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 100
- num_epochs: 1.0
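A sketch of how these values map onto `transformers.TrainingArguments` (the argument names are standard Transformers parameters; `output_dir`, the model, and the dataset wiring are illustrative assumptions, not taken from this repo):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="bert-plus-L8-4096-goodwiki",  # hypothetical output directory
    learning_rate=1e-4,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    seed=31010,
    gradient_accumulation_steps=16,  # 4 x 16 = 64 total train batch size on one device
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=100,
    num_train_epochs=1.0,
)
```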
### Training results
Training Loss | Epoch | Step | Validation Loss | Accuracy |
---|---|---|---|---|
2.1283 | 0.25 | 150 | 2.0892 | 0.6018 |
2.0999 | 0.5 | 300 | 2.0387 | 0.6084 |
2.0595 | 0.75 | 450 | 1.9971 | 0.6143 |
2.0481 | 1.0 | 600 | 1.9893 | 0.6152 |
### Framework versions
- Transformers 4.37.2
- Pytorch 2.3.0.dev20240206+cu121
- Datasets 2.16.1
- Tokenizers 0.15.1