ltgoslo committed on
Commit
7876337
1 Parent(s): 2d5bc7d

Updating README.md

Files changed (1)
  1. README.md +48 -9
README.md CHANGED
@@ -16,7 +16,7 @@ datasets:
  <img src="https://hplt-project.org/_next/static/media/logo-hplt.d5e16ca5.svg" width=12.5%>
 
  This is one of the encoder-only monolingual language models trained as a first release by the [HPLT project](https://hplt-project.org/).
- It is a so called masked language models. In particular, we used the modification of the classic BERT model named [LTG-BERT](https://aclanthology.org/2023.findings-eacl.146/).
 
  A monolingual LTG-BERT model is trained for every major language in the [HPLT 1.2 data release](https://hplt-project.org/datasets/v1.2) (*75* models total).
 
@@ -55,15 +55,54 @@ print(tokenizer.decode(output_text[0].tolist()))
 
  The following classes are currently implemented: `AutoModel`, `AutoModelForMaskedLM`, `AutoModelForSequenceClassification`, `AutoModelForTokenClassification`, `AutoModelForQuestionAnswering` and `AutoModelForMultipleChoice`.
 
  ## Cite us
 
  ```bibtex
- @misc{degibert2024new,
- title={A New Massive Multilingual Dataset for High-Performance Language Technologies},
- author={Ona de Gibert and Graeme Nail and Nikolay Arefyev and Marta Bañón and Jelmer van der Linde and Shaoxiong Ji and Jaume Zaragoza-Bernabeu and Mikko Aulamo and Gema Ramírez-Sánchez and Andrey Kutuzov and Sampo Pyysalo and Stephan Oepen and Jörg Tiedemann},
- year={2024},
- eprint={2403.14009},
- archivePrefix={arXiv},
- primaryClass={cs.CL}
  }
- ```

  <img src="https://hplt-project.org/_next/static/media/logo-hplt.d5e16ca5.svg" width=12.5%>
 
  This is one of the encoder-only monolingual language models trained as a first release by the [HPLT project](https://hplt-project.org/).
+ It is a so-called masked language model. In particular, we used a modification of the classic BERT model named [LTG-BERT](https://aclanthology.org/2023.findings-eacl.146/).
 
  A monolingual LTG-BERT model is trained for every major language in the [HPLT 1.2 data release](https://hplt-project.org/datasets/v1.2) (*75* models total).
 
 
 
  The following classes are currently implemented: `AutoModel`, `AutoModelForMaskedLM`, `AutoModelForSequenceClassification`, `AutoModelForTokenClassification`, `AutoModelForQuestionAnswering` and `AutoModelForMultipleChoice`.
 
+ ## Intermediate checkpoints
+
+ We are releasing 10 intermediate checkpoints for each model, saved every 3125 training steps, each in its own branch. The naming convention is `stepXXX`: for example, `step18750`.
+
+ You can load a specific model revision with `transformers` using the `revision` argument:
+ ```python
+ model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_en", revision="step21875", trust_remote_code=True)
+ ```
+
+ You can list all the revisions of a model with the following code:
+ ```python
+ from huggingface_hub import list_repo_refs
+ out = list_repo_refs("HPLT/hplt_bert_base_en")
+ print([b.name for b in out.branches])
+ ```
+
  ## Cite us
 
  ```bibtex
+ @inproceedings{de-gibert-etal-2024-new-massive,
+ title = "A New Massive Multilingual Dataset for High-Performance Language Technologies",
+ author = {de Gibert, Ona and
+ Nail, Graeme and
+ Arefyev, Nikolay and
+ Ba{\~n}{\'o}n, Marta and
+ van der Linde, Jelmer and
+ Ji, Shaoxiong and
+ Zaragoza-Bernabeu, Jaume and
+ Aulamo, Mikko and
+ Ram{\'\i}rez-S{\'a}nchez, Gema and
+ Kutuzov, Andrey and
+ Pyysalo, Sampo and
+ Oepen, Stephan and
+ Tiedemann, J{\"o}rg},
+ editor = "Calzolari, Nicoletta and
+ Kan, Min-Yen and
+ Hoste, Veronique and
+ Lenci, Alessandro and
+ Sakti, Sakriani and
+ Xue, Nianwen",
+ booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
+ month = may,
+ year = "2024",
+ address = "Torino, Italia",
+ publisher = "ELRA and ICCL",
+ url = "https://aclanthology.org/2024.lrec-main.100",
+ pages = "1116--1128",
+ abstract = "We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of {\mbox{$\approx$}} 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.",
  }
+ ```
+
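
The intermediate-checkpoint naming scheme added in this change (10 checkpoints, one every 3125 steps, branches named `stepXXX`) can be sketched as a small helper that enumerates the expected branch names. This is only an illustration: `checkpoint_branches` is a hypothetical function, not part of the HPLT repositories or of `transformers`, and it assumes the first checkpoint is saved at step 3125 and every later one at the next multiple of 3125. The two names quoted in the README, `step18750` and `step21875`, are consistent with that assumption, but the authoritative list is whatever `list_repo_refs` returns for a given model.

```python
# Hypothetical helper for illustration; not part of HPLT or transformers.
# Assumption: checkpoints are saved at steps 3125, 6250, ..., 31250.
def checkpoint_branches(num_checkpoints: int = 10, interval: int = 3125) -> list[str]:
    """Return the expected branch names, one per intermediate checkpoint."""
    return [f"step{interval * i}" for i in range(1, num_checkpoints + 1)]


branches = checkpoint_branches()
print(branches)
```

Any of these names could then be passed as `revision=` to `from_pretrained`, after checking it against the branches actually present in the repository.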