	---
	language: fr
	license: apache-2.0
	tags:
	- legal
	- feature-extraction
	datasets: maastrichtlawtech/bsard
	pipeline_tag: fill-mask
	widget:
	- text: >-
	    Chaque commune de la Région peut adopter un <mask> communal de
	    développement, applicable à l'ensemble de son territoire.
	library_name: transformers
	---

	# Legal-DistilCamemBERT-base

	This is a [DistilCamemBERT](https://huggingface.co/cmarkea/distilcamembert-base) model further pre-trained on more than 22,000 French-language legal articles from Belgian legislation.

	## Usage

	```python
	from transformers import AutoTokenizer, AutoModel

	tokenizer = AutoTokenizer.from_pretrained("maastrichtlawtech/legal-distilcamembert")
	model = AutoModel.from_pretrained("maastrichtlawtech/legal-distilcamembert")
	```
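
Since the model is tagged for feature extraction, the loaded encoder can be used to turn a passage into a single vector. The following is a minimal sketch, not a recommended recipe: the model card does not say which pooling strategy to use, so mean pooling over non-padding tokens is an assumption, and `mean_pool` and `embed` are hypothetical helper names introduced here for illustration.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose attention-mask entry is 1.

    Works on plain Python lists for a single sequence, so it can be
    checked without installing torch or transformers.
    """
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed(text, model_name="maastrichtlawtech/legal-distilcamembert"):
    """Return one embedding (list of floats) for `text`.

    Assumption: mean pooling of the last hidden state; the model card
    does not prescribe a pooling method.
    """
    # Imported lazily so mean_pool stays usable without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Truncate to the 512-token limit used during pre-training.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)

    hidden = outputs.last_hidden_state[0].tolist()  # one vector per token
    mask = inputs["attention_mask"][0].tolist()
    return mean_pool(hidden, mask)
```

Calling `embed("Chaque commune de la Région peut adopter un plan communal de développement.")` downloads the checkpoint on first use and returns a list with one float per hidden dimension.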

	## Training

	#### Background

	We start from the [distilcamembert-base](https://huggingface.co/cmarkea/distilcamembert-base) checkpoint and further pre-train it with a masked language modeling (MLM) objective on French-language legislation, using the [`run_mlm.py` script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py) from Hugging Face.

	#### Hyperparameters

	We train the model on a single Tesla V100 GPU with 32 GB of memory for 200 epochs (i.e., ~50k steps) with a batch size of 32. We use the AdamW optimizer with an initial learning rate of 5e-05, a weight decay of 0.01, learning rate warmup over the first 500 steps, and linear decay of the learning rate afterwards. The sequence length is limited to 512 tokens.
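
To make the warmup-then-decay schedule concrete, here is a small sketch that computes the learning rate at a given step. It assumes the rate rises linearly from 0 to 5e-05 over the first 500 steps and then falls linearly to 0 at the final step; the total of 50,000 steps is the rounded "~50k" figure above, not an exact count, and `learning_rate` is a hypothetical helper name.

```python
def learning_rate(step, peak_lr=5e-5, warmup_steps=500, total_steps=50_000):
    """Linear warmup followed by linear decay.

    Assumptions: warmup starts at 0, decay ends at 0, and
    total_steps=50_000 stands in for the approximate "~50k" steps.
    """
    if step < warmup_steps:
        # Warmup phase: scale up proportionally to the step index.
        return peak_lr * step / warmup_steps
    # Decay phase: scale down with the fraction of post-warmup steps left.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    for step in (0, 250, 500, 25_250, 50_000):
        print(f"step {step:>6}: lr = {learning_rate(step):.2e}")
```

Under these assumptions the rate reaches its peak at step 500 and is back to half the peak at step 25,250, the midpoint of the decay phase.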

	#### Data

	We use the [Belgian Statutory Article Retrieval Dataset (BSARD)](https://huggingface.co/datasets/maastrichtlawtech/bsard) to further pre-train the model. BSARD is a French-native dataset for studying legal information retrieval that includes more than 22,600 statutory articles from Belgian legislation.

	## Citation

	```bibtex
	@inproceedings{louis2023finding,
	title = {Finding the Law: Enhancing Statutory Article Retrieval via Graph Neural Networks},
	author = {Louis, Antoine and van Dijck, Gijs and Spanakis, Gerasimos},
	booktitle = {Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics},
	month = may,
	year = {2023},
	address = {Dubrovnik, Croatia},
	publisher = {Association for Computational Linguistics},
	url = {https://aclanthology.org/2023.eacl-main.203/},
	pages = {2753--2768},
	}
	```
	[//]: # (https://arxiv.org/abs/2301.12847)