Update README.md
README.md
CHANGED
@@ -3,69 +3,37 @@ language: fr
tags:
- Early Modern French
- Historical
+ - NER
license: apache-2.0
datasets:
- -
+ - freemner
---

<a href="https://portizs.eu/publication/2022/lrec/dalembert/">
<img width="300px" src="https://portizs.eu/publication/2022/lrec/dalembert/featured_hu18bf34d40cdc71c744bdd15e48ff0b23_61788_720x2500_fit_q100_h2_lanczos_3.webp">
</a>

- # D'AlemBERT
-
- This model is a [
- introduced in [this paper](https://aclanthology.org/2022.
-
- D'AlemBERT is a transformers model pretrained on raw texts only, with no humans labelling them in any way, using an automatic process to generate inputs and labels from those texts, based on the RoBERTa base model. More precisely, it was pretrained
- with one objective:
-
- - Masked language modeling (MLM): this is part of the original training loss of the BERT base model. When taking a
- sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the
- model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which
- usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future
- tokens. It allows the model to learn a bidirectional representation of the sentence.
-
- ## Intended uses & limitations
-
- You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task.
-
- Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at a model like GPT.
-
- The model is primarily intended for use in Digital Humanities and Historical NLP.
-
- ### Limitations and bias
-
- This model is trained on historical French data starting from the 16th c., so it might produce results that seem extremely biased by today's standards. It might not work well on contemporary data, and it is not intended to be used on such data.
-
- This bias will also affect all fine-tuned versions of this model.
-
- ## Training data
-
- D'AlemBERT was pretrained on the non-freely available version of the [FreEMmax corpus](https://doi.org/10.5281/zenodo.6481135), a dataset
- consisting of more than 180k tokens from 22 different sources, comprising French textual data from the 16th c. to the early 20th c.
+ # D'AlemBERT-NER model
+
+ This model is a version of [D'AlemBERT](https://huggingface.co/pjox/DalemBERT) fine-tuned on the [FreEMNER corpus](https://doi.org/10.5281/zenodo.6481135) for Early Modern French. It was
+ introduced in [this paper](https://aclanthology.org/2022.coling-1.327/).
+
+ This model will be uploaded soon!

### BibTeX entry and citation info

```bibtex
- @inproceedings{gabay-
-     title = "
-     author = "
-       Bawden, Rachel and
-       Gambette, Philippe and
-       Sagot, Beno{\^\i}t",
-     booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
-     month = jun,
+ @inproceedings{ortiz-suarez-gabay-2022-data,
+     title = "A Data-driven Approach to Named Entity Recognition for Early {M}odern {F}rench",
+     author = "Ortiz Suarez, Pedro and
+       Gabay, Simon",
+     booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
+     month = oct,
    year = "2022",
-     address = "
-     publisher = "
-     url = "https://aclanthology.org/2022.
-     pages = "
-     abstract = "
+     address = "Gyeongju, Republic of Korea",
+     publisher = "International Committee on Computational Linguistics",
+     url = "https://aclanthology.org/2022.coling-1.327",
+     pages = "3722--3730",
+     abstract = "Named entity recognition has become an increasingly useful tool for digital humanities research, specially when it comes to historical texts. However, historical texts pose a wide range of challenges to both named entity recognition and natural language processing in general that are still difficult to address even with modern neural methods. In this article we focus in named entity recognition for historical French, and in particular for Early Modern French (16th-18th c.), i.e. Ancien R{\'e}gime French. However, instead of developing a specialised architecture to tackle the particularities of this state of language, we opt for a data-driven approach by developing a new corpus with fine-grained entity annotation, covering three centuries of literature corresponding to the early modern period; we try to annotate as much data as possible producing a corpus that is many times bigger than the most popular NER evaluation corpora for both Contemporary English and French. We then fine-tune existing state-of-the-art architectures for Early Modern and Contemporary French, obtaining results that are on par with those of the current state-of-the-art NER systems for Contemporary English. Both the corpus and the fine-tuned models are released.",
}
```
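
The base card above notes that the raw D'AlemBERT model can be used for masked language modelling. A minimal sketch with the 🤗 Transformers `fill-mask` pipeline, assuming the `pjox/DalemBERT` checkpoint linked above and a RoBERTa-style `<mask>` token (both assumptions; the example sentence is invented):

```python
# Minimal fill-mask sketch. Assumes the pjox/DalemBERT checkpoint linked
# above and RoBERTa's "<mask>" token; the example sentence is invented.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="pjox/DalemBERT")

# Print the top predictions for the masked word.
for pred in fill_mask("Le roy est <mask> en son chasteau."):
    print(pred["token_str"], round(pred["score"], 3))
```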
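
The fine-tuned NER model itself is not uploaded yet, so the following token-classification sketch is hypothetical: the model ID is a placeholder, and the entity labels depend on the FreEMNER annotation scheme.

```python
# Hypothetical NER sketch: the model is not yet uploaded, so the ID below is
# a placeholder, and the label set depends on the FreEMNER annotation scheme.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="pjox/dalembert-ner",     # placeholder ID, not a real repository
    aggregation_strategy="simple",  # merge subword pieces into entity spans
)

for entity in ner("Jean de La Fontaine naquit à Chasteau-Thierry en 1621."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```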