jackstanley committed on
Commit
7140f4d
1 Parent(s): 5643fe7

Delete README.md

Files changed (1)
  1. README.md +0 -116
README.md DELETED
@@ -1,116 +0,0 @@
---
license: cc-by-sa-4.0
pipeline_tag: fill-mask
arxiv: 2210.05529
language: en
thumbnail: https://github.com/coastalcph/hierarchical-transformers/raw/main/data/figures/hat_encoder.png
tags:
- long-documents
datasets:
- c4
model-index:
- name: kiddothe2b/hierarchical-transformer-base-4096
  results: []
---

# Hierarchical Attention Transformer (HAT) / hierarchical-transformer-base-4096

## Model description

This is a Hierarchical Attention Transformer (HAT) model as presented in [An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification (Chalkidis et al., 2022)](https://arxiv.org/abs/2210.05529).

The model has been warm-started from the weights of RoBERTa (Liu et al., 2019) and further pre-trained with the MLM objective on long sequences, following the paradigm of Longformer (Beltagy et al., 2020). It supports sequences of up to 4,096 tokens.

HAT uses hierarchical attention, a combination of segment-wise and cross-segment attention operations. You can think of segments as paragraphs or sentences.

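To make the segment-wise / cross-segment idea concrete, here is a minimal conceptual sketch in PyTorch. It is not the model's actual implementation (see the paper and the remote code loaded via `trust_remote_code` for that); the segment length of 128, the choice of each segment's first token as its representative, and all class and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HierarchicalAttentionSketch(nn.Module):
    """Illustrative only: attention within each segment, then attention across segments."""

    def __init__(self, dim=768, num_heads=12, seg_len=128):
        super().__init__()
        self.seg_len = seg_len
        # Attention applied independently inside every segment.
        self.segment_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Attention applied across one representative token per segment.
        self.cross_segment_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, hidden):                       # hidden: (batch, seq_len, dim)
        bsz, seq_len, dim = hidden.shape
        num_segs = seq_len // self.seg_len
        # 1) Segment-wise attention: fold segments into the batch dimension.
        segs = hidden.reshape(bsz * num_segs, self.seg_len, dim)
        segs, _ = self.segment_attn(segs, segs, segs)
        segs = segs.reshape(bsz, num_segs, self.seg_len, dim)
        # 2) Cross-segment attention over the first token of each segment.
        reps = segs[:, :, 0, :]                      # (batch, num_segs, dim)
        reps, _ = self.cross_segment_attn(reps, reps, reps)
        # 3) Write the contextualized representatives back into their segments.
        segs = torch.cat([reps.unsqueeze(2), segs[:, :, 1:, :]], dim=2)
        return segs.reshape(bsz, seq_len, dim)

# A 4,096-token input split into 32 segments of 128 tokens (illustrative sizes).
layer = HierarchicalAttentionSketch()
print(layer(torch.randn(1, 4096, 768)).shape)        # torch.Size([1, 4096, 768])
```
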
## Intended uses & limitations

You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task.
See the [model hub](https://huggingface.co/models?filter=hierarchical-transformer) for other versions of HAT or versions fine-tuned on a task that interests you.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole document to make decisions, such as document classification, sequential sentence classification, or question answering.

## How to use

You can use this model directly for masked language modeling:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("kiddothe2b/hierarchical-transformer-base-4096", trust_remote_code=True)
mlm_model = AutoModelForMaskedLM.from_pretrained("kiddothe2b/hierarchical-transformer-base-4096", trust_remote_code=True)
```

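Continuing from the snippet above, here is a hedged example of filling in a masked token. It assumes the custom remote code follows the standard `transformers` masked-LM interface (logits over the vocabulary at every position); the example sentence is made up.

```python
import torch

text = f"Hierarchical attention helps models read long {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = mlm_model(**inputs).logits          # (1, seq_len, vocab_size)

# Pick the highest-scoring token at the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```
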
You can also fine-tune it for SequenceClassification, SequentialSentenceClassification, and MultipleChoice downstream tasks:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("kiddothe2b/hierarchical-transformer-base-4096", trust_remote_code=True)
doc_classifier = AutoModelForSequenceClassification.from_pretrained("kiddothe2b/hierarchical-transformer-base-4096", trust_remote_code=True)
```

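Below is a hedged sketch of a single supervised step with the document classifier. The `num_labels=2`, the toy texts, and the labels are illustrative, and the custom classes loaded via `trust_remote_code` are assumed to expose the standard sequence-classification interface (`loss`, `logits`).

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "kiddothe2b/hierarchical-transformer-base-4096"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
doc_classifier = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2, trust_remote_code=True  # num_labels depends on your task
)

# Two toy documents with binary labels, truncated/padded to the 4,096-token limit.
batch = tokenizer(
    ["First long document ...", "Second long document ..."],
    padding="max_length", truncation=True, max_length=4096, return_tensors="pt",
)
labels = torch.tensor([0, 1])

outputs = doc_classifier(**batch, labels=labels)
outputs.loss.backward()                              # gradients for one optimization step
print(outputs.loss.item(), outputs.logits.shape)     # logits: (2, num_labels)
```
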
## Limitations and bias

The training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral. The model can therefore produce biased predictions.


## Training procedure

### Training and evaluation data

The model has been warm-started from the [roberta-base](https://huggingface.co/roberta-base) checkpoint and further pre-trained for an additional 50k steps on long sequences (> 1,024 subwords) of [C4](https://huggingface.co/datasets/c4) (Raffel et al., 2020).


### Training hyperparameters

The following hyperparameters were used during training (a hedged `TrainingArguments` sketch follows the list):
- learning_rate: 0.0001
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- distributed_type: tpu
- num_devices: 8
- gradient_accumulation_steps: 8
- total_train_batch_size: 128
- total_eval_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- training_steps: 50000

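For reference, here is a hedged mapping of the hyperparameters above onto `transformers` `TrainingArguments`. The output directory is a made-up placeholder, and the TPU launch/distribution details are omitted; this is a sketch of the configuration, not the original training script.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="tmp/hat-mlm",          # placeholder path, not from the original recipe
    learning_rate=1e-4,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,     # 2 per device x 8 devices x 8 steps = 128 total
    max_steps=50_000,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    seed=42,
)
```
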
### Training results

| Training Loss | Epoch | Step  | Validation Loss |
|:-------------:|:-----:|:-----:|:---------------:|
| 1.7437        | 0.2   | 10000 | 1.6370          |
| 1.6994        | 0.4   | 20000 | 1.6054          |
| 1.6726        | 0.6   | 30000 | 1.5718          |
| 1.644         | 0.8   | 40000 | 1.5526          |
| 1.6299        | 1.0   | 50000 | 1.5368          |

### Framework versions

- Transformers 4.19.0.dev0
- Pytorch 1.11.0+cu102
- Datasets 2.0.0
- Tokenizers 0.11.6


## Citing

If you use HAT in your research, please cite:

[An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification](https://arxiv.org/abs/2210.05529). Ilias Chalkidis, Xiang Dai, Manos Fergadiotis, Prodromos Malakasiotis, and Desmond Elliott. 2022. arXiv:2210.05529 (Preprint).

```
@misc{chalkidis-etal-2022-hat,
  url = {https://arxiv.org/abs/2210.05529},
  author = {Chalkidis, Ilias and Dai, Xiang and Fergadiotis, Manos and Malakasiotis, Prodromos and Elliott, Desmond},
  title = {An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification},
  publisher = {arXiv},
  year = {2022},
}
```