---
license: cc-by-nc-sa-4.0
pipeline_tag: fill-mask
language: en
arxiv: 2210.05529
tags:
- long-documents
datasets:
- wikipedia
model-index:
- name: kiddothe2b/hierarchical-transformer-I3-mini-1024
results: []
---
# Hierarchical Attention Transformer (HAT) / hierarchical-transformer-I3-mini-1024
## Model description
This is a Hierarchical Attention Transformer (HAT) model as presented in [An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification (Chalkidis et al., 2022)](https://arxiv.org/abs/2210.05529).
The model has been warm-started by re-using the weights of miniature BERT (Turc et al., 2019), and then further pre-trained with masked language modeling (MLM), following the paradigm of Longformer (Beltagy et al., 2020). It supports sequences of up to 1,024 tokens.
HAT uses hierarchical attention, which is a combination of segment-wise and cross-segment attention operations. You can think of segments as paragraphs or sentences.
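To make the notion of segments concrete, here is a minimal sketch of splitting a tokenized document into fixed-size segments. The function name `split_into_segments`, the segment length of 128, and the padding id of 0 are assumptions chosen for illustration; the model's own code handles segmentation internally and may differ.

```python
from typing import List


def split_into_segments(
    token_ids: List[int],
    segment_length: int = 128,  # assumed value, not stated in this card
    pad_id: int = 0,            # assumed padding id, for illustration only
) -> List[List[int]]:
    """Split a flat list of token ids into equal-length segments.

    The last segment is right-padded with `pad_id` so that every
    segment has exactly `segment_length` entries.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    segments = []
    for start in range(0, len(token_ids), segment_length):
        segment = token_ids[start:start + segment_length]
        # Pad the final, possibly shorter, segment.
        segment = segment + [pad_id] * (segment_length - len(segment))
        segments.append(segment)
    return segments


# A dummy "document" of 300 token ids (1..300).
document = list(range(1, 301))
segments = split_into_segments(document)
print(len(segments), [len(s) for s in segments])
```

In this picture, segment-wise attention would operate within each inner list, while cross-segment attention would exchange information between the lists.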
## Intended uses & limitations
You can use the raw model for masked language modeling, but it's mostly intended to be fine-tuned on a downstream task.
See the [model hub](https://huggingface.co/models?filter=hierarchical-transformer) to look for other versions of HAT or fine-tuned versions on a task that interests you.
Note that this model is primarily aimed at being fine-tuned on tasks that use the whole document to make decisions, such as document classification, sequential sentence classification, or question answering.
## How to use
You can use this model directly for masked language modeling:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# trust_remote_code=True is needed because HAT ships its own modeling code
tokenizer = AutoTokenizer.from_pretrained("kiddothe2b/hierarchical-transformer-I3-mini-1024", trust_remote_code=True)
mlm_model = AutoModelForMaskedLM.from_pretrained("kiddothe2b/hierarchical-transformer-I3-mini-1024", trust_remote_code=True)
```
You can also fine-tune it for SequenceClassification, SequentialSentenceClassification, and MultipleChoice downstream tasks:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# trust_remote_code=True is needed because HAT ships its own modeling code
tokenizer = AutoTokenizer.from_pretrained("kiddothe2b/hierarchical-transformer-I3-mini-1024", trust_remote_code=True)
doc_classifier = AutoModelForSequenceClassification.from_pretrained("kiddothe2b/hierarchical-transformer-I3-mini-1024", trust_remote_code=True)
```
## Limitations and bias
The training data used for this model contains a lot of unfiltered content from the internet, which is far from
neutral. Therefore, the model can have biased predictions.
## Training procedure
### Training and evaluation data
The model was warm-started from the [google/bert_uncased_L-6_H-256_A-4](https://huggingface.co/google/bert_uncased_L-6_H-256_A-4) checkpoint and then further pre-trained for an additional 50k steps on English [Wikipedia](https://huggingface.co/datasets/wikipedia).
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- distributed_type: tpu
- num_devices: 8
- gradient_accumulation_steps: 4
- total_train_batch_size: 128
- total_eval_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- training_steps: 50000
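
The totals in the list above follow from the per-device settings, and the learning rate at any step follows from the scheduler settings. The sketch below reproduces both. It assumes the linear scheduler warms up linearly from zero and then decays linearly to zero, as common implementations do; the helper names `effective_batch_size` and `linear_schedule_lr` are hypothetical.

```python
def effective_batch_size(per_device: int, num_devices: int, grad_accum: int = 1) -> int:
    """Number of examples that contribute to one optimizer (or eval) step."""
    return per_device * num_devices * grad_accum


def linear_schedule_lr(
    step: int,
    base_lr: float = 1e-4,
    total_steps: int = 50_000,
    warmup_ratio: float = 0.1,
) -> float:
    """Learning rate at `step`, assuming linear warmup then linear decay to zero."""
    warmup_steps = int(round(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Warmup phase: ramp up from 0 to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase: ramp down from base_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# 4 per device x 8 devices x 4 accumulation steps
print(effective_batch_size(4, 8, 4))
# Evaluation uses no gradient accumulation: 4 per device x 8 devices
print(effective_batch_size(4, 8))
# Learning rate at a few points of the 50k-step run
for step in (0, 2_500, 5_000, 27_500, 50_000):
    print(step, linear_schedule_lr(step))
```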
### Training results
| Training Loss | Epoch | Step | Validation Loss |
|:-------------:|:-----:|:-----:|:---------------:|
| 2.7353 | 0.2 | 10000 | 2.5067 |
| 2.6081 | 0.4 | 20000 | 2.3966 |
| 2.5552 | 0.6 | 30000 | 2.3446 |
| 2.5105 | 0.8 | 40000 | 2.3117 |
| 2.4978 | 1.14 | 50000 | 2.2954 |
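
For a more intuitive reading of the table, the validation losses can be converted to perplexities. This is a minimal sketch that assumes the reported loss is the mean cross-entropy over masked tokens, measured in nats; the card does not state this explicitly.

```python
import math

# Validation loss per training step, copied from the table above.
validation_losses = {
    10000: 2.5067,
    20000: 2.3966,
    30000: 2.3446,
    40000: 2.3117,
    50000: 2.2954,
}

# Assumption: loss is mean cross-entropy in nats, so perplexity = exp(loss).
perplexities = {step: math.exp(loss) for step, loss in validation_losses.items()}

for step, ppl in perplexities.items():
    print(f"step {step}: perplexity {ppl:.2f}")
```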
### Framework versions
- Transformers 4.19.0.dev0
- Pytorch 1.11.0+cu102
- Datasets 2.0.0
- Tokenizers 0.11.6
## Citing
If you use HAT in your research, please cite:
[An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification](https://arxiv.org/abs/2210.05529). Ilias Chalkidis, Xiang Dai, Manos Fergadiotis, Prodromos Malakasiotis, and Desmond Elliott. 2022. arXiv:2210.05529 (Preprint).
```bibtex
@misc{chalkidis-etal-2022-hat,
url = {https://arxiv.org/abs/2210.05529},
author = {Chalkidis, Ilias and Dai, Xiang and Fergadiotis, Manos and Malakasiotis, Prodromos and Elliott, Desmond},
title = {An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification},
publisher = {arXiv},
year = {2022},
}
```