---
language: it
license: afl-3.0
widget:
- text: Il [MASK] ha chiesto revocarsi l'obbligo di pagamento
---

<img  src="https://huggingface.co/dlicari/Italian-Legal-BERT/resolve/main/ITALIAN_LEGAL_BERT.jpg" width="600"/> 
<h1> ITALIAN-LEGAL-BERT: A pre-trained Transformer Language Model for Italian Law </h1>

ITALIAN-LEGAL-BERT is based on <a href="https://huggingface.co/dbmdz/bert-base-italian-xxl-cased">bert-base-italian-xxl-cased</a>, with additional pre-training on Italian civil law corpora.
It achieves better results than the general-purpose Italian BERT on several domain-specific tasks.



<b>ITALIAN-LEGAL-BERT variants [NEW!!!]</b>
* <a href="https://huggingface.co/dlicari/Italian-Legal-BERT-SC">FROM SCRATCH</a>, the ITALIAN-LEGAL-BERT variant pre-trained from scratch on Italian legal documents (<a href="https://huggingface.co/dlicari/Italian-Legal-BERT-SC">ITA-LEGAL-BERT-SC</a>), based on the CamemBERT architecture
* <a href="https://huggingface.co/dlicari/distil-ita-legal-bert">DISTILLED</a>, a distilled version of ITALIAN-LEGAL-BERT (<a href="https://huggingface.co/dlicari/distil-ita-legal-bert">DISTIL-ITA-LEGAL-BERT</a>)
* For long documents
  * [LSG ITA LEGAL BERT](https://huggingface.co/dlicari/lsg16k-Italian-Legal-BERT), Local-Sparse-Global version of ITALIAN-LEGAL-BERT (FURTHER PRETRAINED)
  * [LSG ITA LEGAL BERT-SC](https://huggingface.co/dlicari/lsg16k-Italian-Legal-BERT-SC), Local-Sparse-Global version of ITALIAN-LEGAL-BERT-SC (FROM SCRATCH)
     
*Note: We are working on an extended version of the paper with more details and the results of these new models. We will post an update soon.*

<h2>Training procedure</h2> 
We initialized ITALIAN-LEGAL-BERT from ITALIAN XXL BERT
and pre-trained it for an additional 4 epochs on 3.7 GB of preprocessed text from the National Jurisprudential
Archive, using the Hugging Face PyTorch-Transformers library. We used the BERT architecture
with a language modeling head on top and the following settings: AdamW optimizer, initial learning rate 5e-5 (with
linear learning rate decay, ending at 2.525e-9), sequence length 512, batch size 10 (imposed
by GPU capacity), 8.4 million training steps, on a single 16 GB V100 GPU.
<p />
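
The linear learning rate decay mentioned above can be illustrated with a small sketch. It assumes a plain linear ramp from the initial rate down to zero with no warmup phase, which the card does not state; the function name `linear_decay_lr` is hypothetical and is not part of the training code. The card reports a final rate of 2.525e-9 rather than exactly zero, and this sketch does not try to reproduce that figure.

```python
# Figures taken from the training description above.
INITIAL_LR = 5e-5
TOTAL_STEPS = 8_400_000
BATCH_SIZE = 10

# Sequences processed over the whole run, ASSUMING one batch per step
# (no gradient accumulation); the card does not say either way.
SEQUENCES_SEEN = TOTAL_STEPS * BATCH_SIZE


def linear_decay_lr(step, total_steps=TOTAL_STEPS, initial_lr=INITIAL_LR):
    """Learning rate after `step` updates, decaying linearly to zero.

    Hypothetical helper: assumes no warmup and a floor of zero.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    remaining_fraction = (total_steps - step) / total_steps
    return initial_lr * remaining_fraction


if __name__ == "__main__":
    # Print the rate at the start, the midpoint, and the end of training.
    for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(step, linear_decay_lr(step))
```

Under these assumptions the rate halves at the midpoint of training and reaches zero at the final step.
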
<h2> Usage </h2> 

The ITALIAN-LEGAL-BERT model can be loaded as follows:

```python
from transformers import AutoModel, AutoTokenizer
model_name = "dlicari/Italian-Legal-BERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
```

You can run inference with ITALIAN-LEGAL-BERT using the `fill-mask` pipeline from the Transformers library:
```python
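from transformers import pipeline

model_name = "dlicari/Italian-Legal-BERT"
# Build a masked-language-model pipeline; the model is downloaded on first use.
fill_mask = pipeline("fill-mask", model=model_name)
# Returns the most likely replacements for [MASK], best first.
fill_mask("Il [MASK] ha chiesto revocarsi l'obbligo di pagamento")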
#[{'sequence': "Il ricorrente ha chiesto revocarsi l'obbligo di pagamento",'score': 0.7264330387115479},
# {'sequence': "Il convenuto ha chiesto revocarsi l'obbligo di pagamento",'score': 0.09641049802303314},
# {'sequence': "Il resistente ha chiesto revocarsi l'obbligo di pagamento",'score': 0.039877112954854965},
# {'sequence': "Il lavoratore ha chiesto revocarsi l'obbligo di pagamento",'score': 0.028993653133511543},
# {'sequence': "Il Ministero ha chiesto revocarsi l'obbligo di pagamento", 'score': 0.025297977030277252}]
```

This [COLAB: ITALIAN-LEGAL-BERT: Minimal Start for Italian Legal Downstream Tasks](https://colab.research.google.com/drive/1ZOWaWnLaagT_PX6MmXMP2m3MAOVXkyRK?usp=sharing)
shows how to use the model for sentence similarity, sentence classification, and named entity recognition.
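
For sentence similarity, one common approach is to mean-pool the token embeddings of each sentence and compare the resulting vectors by cosine similarity. The sketch below illustrates that idea only; it is an assumption and not necessarily the method used in the Colab. The helper names `mean_pool`, `cosine_similarity`, and `embed_sentences` are hypothetical, and the pooling is written in plain Python for readability rather than speed.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors of one sentence, skipping padding (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embed_sentences(sentences, model_name="dlicari/Italian-Legal-BERT"):
    """Return one mean-pooled vector per sentence (hypothetical helper).

    Needs `torch` and `transformers`; downloads the model on first call.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    batch = tokenizer(sentences, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():
        # Shape: (batch, tokens, hidden size)
        hidden = model(**batch).last_hidden_state
    masks = batch["attention_mask"].tolist()
    return [mean_pool(vecs, mask) for vecs, mask in zip(hidden.tolist(), masks)]
```

With the model available, `a, b = embed_sentences([first, second])` followed by `cosine_similarity(a, b)` gives a score in the range -1 to 1, where higher values indicate more similar sentences. Other pooling choices, such as taking the `[CLS]` vector, are equally plausible.
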

<img  src="https://huggingface.co/dlicari/Italian-Legal-BERT/resolve/main/semantic_text_similarity.jpg" width="700"/> 



<h2> Citation </h2>
If you find our resource or paper useful, please consider including the following citation in your paper.

```bibtex
@inproceedings{licari_italian-legal-bert_2022,
	address = {Bozen-Bolzano, Italy},
	series = {{CEUR} {Workshop} {Proceedings}},
	title = {{ITALIAN}-{LEGAL}-{BERT}: {A} {Pre}-trained {Transformer} {Language} {Model} for {Italian} {Law}},
	volume = {3256},
	shorttitle = {{ITALIAN}-{LEGAL}-{BERT}},
	url = {https://ceur-ws.org/Vol-3256/#km4law3},
	language = {en},
	urldate = {2022-11-19},
	booktitle = {Companion {Proceedings} of the 23rd {International} {Conference} on {Knowledge} {Engineering} and {Knowledge} {Management}},
	publisher = {CEUR},
	author = {Licari, Daniele and Comandè, Giovanni},
	editor = {Symeonidou, Danai and Yu, Ran and Ceolin, Davide and Poveda-Villalón, María and Audrito, Davide and Caro, Luigi Di and Grasso, Francesca and Nai, Roberto and Sulis, Emilio and Ekaputra, Fajar J. and Kutz, Oliver and Troquard, Nicolas},
	month = sep,
	year = {2022},
	note = {ISSN: 1613-0073},
	file = {Full Text PDF:https://ceur-ws.org/Vol-3256/km4law3.pdf},
}

```