---
license: apache-2.0
language:
- fr
tags:
- historical
- french
- public domain
- teams
datasets:
- PleIAs/French-PD-Newspapers
---
# Journaux-LM
![Journaux-LM](journaux-lm-v1.png)
Journaux-LM is a language model pretrained on historical French newspapers. Technically, it is an ELECTRA model pretrained with the [TEAMS](https://aclanthology.org/2021.findings-acl.219/) approach.
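As a minimal sketch, the released discriminator can be loaded like any ELECTRA checkpoint with 🤗 Transformers. The repository id below is a placeholder and should be replaced with the actual model id on the Hub:

```python
from transformers import AutoTokenizer, AutoModel

# Placeholder repository id -- replace with the actual Journaux-LM model id.
model_id = "stefan-it/journaux-lm-v1"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)  # loads the ELECTRA discriminator

text = "Le journal publie les nouvelles du jour."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```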
## Datasets
Version 1 of the Journaux-LM was pretrained on the following publicly available datasets:
* [`PleIAs/French-PD-Newspapers`](https://huggingface.co/datasets/PleIAs/French-PD-Newspapers)
In total, the pretraining corpus has a size of 408 GB.
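The corpus can be inspected directly from the Hub, for example in streaming mode so the full 408 GB does not need to be downloaded. A minimal sketch (the exact column names depend on the dataset schema):

```python
from datasets import load_dataset

# Stream the corpus instead of downloading it locally.
dataset = load_dataset("PleIAs/French-PD-Newspapers", split="train", streaming=True)

# Peek at the first example; column names may differ from what is assumed here.
example = next(iter(dataset))
print(example.keys())
```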
## Benchmarks (Named Entity Recognition)
We compare our Journaux-LM directly to the French Europeana BERT model (Journaux-LM is intended as its successor) on various downstream tasks from the [hmBench](https://github.com/stefan-it/hmBench) repository, which is focused on Named Entity Recognition.
We report the micro F1-score averaged over 5 runs with different seeds, and we select the best hyper-parameter configuration on the development set of each dataset to report the final test score.
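A minimal sketch of this selection protocol (the result structure and configuration names below are hypothetical; the actual runs use the hmBench configurations):

```python
import statistics

# Hypothetical per-configuration results: dev/test micro F1 for 5 seeds each.
results = {
    "bs16_lr5e-5": {"dev": [86.1, 86.3, 86.2, 86.4, 86.0], "test": [83.2, 83.5, 83.3, 83.6, 83.4]},
    "bs32_lr3e-5": {"dev": [85.8, 85.9, 86.0, 85.7, 85.9], "test": [83.0, 83.1, 82.9, 83.2, 83.0]},
}

# Pick the configuration with the best seed-averaged development score ...
best_config = max(results, key=lambda cfg: statistics.mean(results[cfg]["dev"]))

# ... and report the seed-averaged test score of exactly that configuration.
final_score = statistics.mean(results[best_config]["test"])
print(best_config, round(final_score, 2))
```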
### Development Set
The results on the development set can be seen in the following table:
| Model \ Dataset | [AjMC][1] | [ICDAR][2] | [LeTemps][3] | [NewsEye][4] | [HIPE-2020][5] | Avg. |
|:--------------------|:----------|:-----------|:-------------|:-------------|:---------------|:----------|
| [Europeana BERT][6] | 85.70     | 77.63      | 67.14        | 82.68        | 85.98          | 79.83     |
| Journaux-LM v1 | 86.25 | 78.51 | 67.76 | 84.07 | 88.17 | 80.95 |
Our Journaux-LM leads to an average performance boost of 1.12 percentage points compared to the French Europeana BERT model.
### Test Set
The final results on the test set can be seen here:
| Model \ Dataset | [AjMC][1] | [ICDAR][2] | [LeTemps][3] | [NewsEye][4] | [HIPE-2020][5] | Avg. |
|:--------------------|:----------|:-----------|:-------------|:-------------|:---------------|:----------|
| [Europeana BERT][6] | 81.06 | 78.17 | 67.22 | 73.51 | 81.00 | 76.19 |
| Journaux-LM v1 | 83.41 | 77.73 | 67.11 | 74.48 | 83.14 | 77.17 |
Our Journaux-LM outperforms the French Europeana BERT model by 0.98 percentage points on average.
[1]: https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-ajmc.md
[2]: https://github.com/stefan-it/historic-domain-adaptation-icdar
[3]: https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-letemps.md
[4]: https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-newseye.md
[5]: https://github.com/hipe-eval/HIPE-2022-data/blob/main/documentation/README-hipe2020.md
[6]: https://huggingface.co/dbmdz/bert-base-french-europeana-cased
# Changelog
* 02.11.2024: Initial version of the model. More details are coming very soon!
# Acknowledgements
Research supported with Cloud TPUs from Google's [TPU Research Cloud](https://sites.research.google/trc/about/) (TRC).
Many thanks for providing access to the TPUs ❤️
Made from Bavarian Oberland with ❤️ and 🥨.