---
language: fr
license: apache-2.0
tags:
  - legal
  - feature-extraction
datasets: maastrichtlawtech/bsard
pipeline_tag: fill-mask
widget:
  - text: >-
      Chaque commune de la Région peut adopter un <mask> communal de
      développement, applicable à l'ensemble de son territoire.
library_name: transformers
---

# Legal-DistilCamemBERT-base

This is a [DistilCamemBERT](https://huggingface.co/cmarkea/distilcamembert-base) model further pre-trained on more than 22,000 legal articles from Belgian legislation written in French.

## Usage

```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the encoder weights from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("maastrichtlawtech/legal-distilcamembert")
model = AutoModel.from_pretrained("maastrichtlawtech/legal-distilcamembert")
```
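
Since the model was trained with a masked language modeling objective, it can also be queried for masked-token predictions. The following is a minimal sketch, assuming the checkpoint ships its MLM head and uses CamemBERT's `<mask>` token; the helper names `load_fill_mask` and `top_tokens` are hypothetical and only serve the illustration.

```python
from typing import Callable, List, Tuple

MODEL_ID = "maastrichtlawtech/legal-distilcamembert"


def load_fill_mask(model_id: str = MODEL_ID):
    """Build a fill-mask pipeline (downloads the checkpoint on first call)."""
    # Imported lazily so that top_tokens below has no hard dependency.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_tokens(fill_mask: Callable, text: str, k: int = 5) -> List[Tuple[str, float]]:
    """Return the k best (token, score) candidates for the single <mask> in text."""
    # For one masked position, the pipeline returns a list of dicts
    # that include the keys "token_str" and "score".
    predictions = fill_mask(text, top_k=k)
    return [(p["token_str"].strip(), float(p["score"])) for p in predictions]
```

Calling `top_tokens(load_fill_mask(), "Chaque commune de la Région peut adopter un <mask> communal de développement, applicable à l'ensemble de son territoire.")` should return the highest-scoring candidates for the masked word; the exact tokens and scores depend on the checkpoint.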

## Training

#### Background

We start from the [distilcamembert-base](https://huggingface.co/cmarkea/distilcamembert-base) checkpoint and further pre-train it with a masked language modeling (MLM) objective on French-language legislation, using the [`run_mlm.py` script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py) from Hugging Face.

#### Hyperparameters

We train the model on a single Tesla V100 GPU with 32 GB of memory for 200 epochs (about 50k steps) with a batch size of 32. We use the AdamW optimizer with an initial learning rate of 5e-5, a weight decay of 0.01, learning rate warmup over the first 500 steps, and linear decay of the learning rate afterwards. The sequence length is limited to 512 tokens.
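
To make the learning rate schedule concrete, here is a small dependency-free sketch of linear warmup followed by linear decay. It assumes the rate decays to zero at the last step and that training runs for exactly 50,000 steps (the figure above is approximate); the function name `linear_warmup_decay` is hypothetical and this is not the training code itself.

```python
BASE_LR = 5e-5         # initial learning rate stated above
WARMUP_STEPS = 500     # warmup length stated above
TOTAL_STEPS = 50_000   # assumption: "about 50k steps" taken as exactly 50,000


def linear_warmup_decay(
    step: int,
    base_lr: float = BASE_LR,
    warmup: int = WARMUP_STEPS,
    total: int = TOTAL_STEPS,
) -> float:
    """Learning rate at a given optimizer step."""
    if step < warmup:
        # Ramp up linearly from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup)
    # Then decay linearly from base_lr down to 0 at the final step (assumed).
    return base_lr * max(0.0, (total - step) / max(1, total - warmup))


if __name__ == "__main__":
    for s in (0, 250, 500, 25_250, 50_000):
        print(s, linear_warmup_decay(s))
```

Under these assumptions the rate peaks at 5e-5 at step 500 and is back to half of that value at step 25,250, midway through the decay phase.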

#### Data

We use the [Belgian Statutory Article Retrieval Dataset (BSARD)](https://huggingface.co/datasets/maastrichtlawtech/bsard) to further pre-train the model. BSARD is a native French dataset for studying legal information retrieval that includes more than 22,600 statutory articles from Belgian legislation.

## Citation

```bibtex
@inproceedings{louis2023finding,
  title = {Finding the Law: Enhancing Statutory Article Retrieval via Graph Neural Networks},
  author = {Louis, Antoine and van Dijck, Gijs and Spanakis, Gerasimos},
  booktitle = {Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics},
  month = may,
  year = {2023},
  address = {Dubrovnik, Croatia},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2023.eacl-main.203/},
  pages = {2753--2768},
}
```

[//]: # (https://arxiv.org/abs/2301.12847) |