---
language: it
license: apache-2.0
widget:
- text: "Il [MASK] ha chiesto revocarsi l'obbligo di pagamento"
---

<img  src="https://huggingface.co/dlicari/Italian-Legal-BERT/resolve/main/ITALIAN_LEGAL_BERT.jpg" width="600"/> 
<h1> ITALIAN-LEGAL-BERT: A Pre-trained Transformer Language Model for Italian Law </h1>

ITALIAN-LEGAL-BERT is based on <a href="https://huggingface.co/dbmdz/bert-base-italian-xxl-cased">bert-base-italian-xxl-cased</a>, with additional pre-training of the Italian BERT model on Italian civil law corpora. 
It achieves better results than the general-purpose Italian BERT on several domain-specific tasks.

<h2>Training procedure</h2> 
We initialized ITALIAN-LEGAL-BERT with ITALIAN XXL BERT and pre-trained it for an additional 4 epochs on 3.7 GB of preprocessed text from the National Jurisprudential Archive, using the Hugging Face PyTorch-Transformers library. We used the BERT architecture with a language modeling head on top, the AdamW optimizer, an initial learning rate of 5e-5 (with linear learning rate decay, ending at 2.525e-9), sequence length 512, and batch size 10 (imposed by GPU capacity), for 8.4 million training steps on a single V100 16 GB GPU.
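
For reference, below is a minimal sketch of how such continued masked-language-model pretraining could be set up with the Hugging Face Trainer. The corpus file path, preprocessing, and several arguments are illustrative assumptions, not the exact configuration used to train ITALIAN-LEGAL-BERT.

```python
# Hypothetical sketch of continued MLM pretraining; the corpus path and
# preprocessing are placeholders, not the original training setup.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "dbmdz/bert-base-italian-xxl-cased"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForMaskedLM.from_pretrained(base_model)  # BERT + LM head

# Plain-text legal corpus, one document per line (placeholder path)
dataset = load_dataset("text", data_files={"train": "civil_law_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# 15% masking is the standard BERT MLM setting
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="italian-legal-bert",
    num_train_epochs=4,
    per_device_train_batch_size=10,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
)

Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```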
<h2> Usage </h2> 

The ITALIAN-LEGAL-BERT model can be loaded as follows:

```python
from transformers import AutoModel, AutoTokenizer
model_name = "dlicari/Italian-Legal-BERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
```

You can use the Transformers fill-mask pipeline to run inference with ITALIAN-LEGAL-BERT. 
```python
from transformers import pipeline
model_name = "dlicari/Italian-Legal-BERT"
fill_mask = pipeline("fill-mask", model_name)
fill_mask("Il [MASK] ha chiesto revocarsi l'obbligo di pagamento")
#[{'sequence': "Il ricorrente ha chiesto revocarsi l'obbligo di pagamento",'score': 0.7264330387115479},
# {'sequence': "Il convenuto ha chiesto revocarsi l'obbligo di pagamento",'score': 0.09641049802303314},
# {'sequence': "Il resistente ha chiesto revocarsi l'obbligo di pagamento",'score': 0.039877112954854965},
# {'sequence': "Il lavoratore ha chiesto revocarsi l'obbligo di pagamento",'score': 0.028993653133511543},
# {'sequence': "Il Ministero ha chiesto revocarsi l'obbligo di pagamento", 'score': 0.025297977030277252}]
```

The Colab notebook [ITALIAN-LEGAL-BERT: Minimal Start for Italian Legal Downstream Tasks](https://colab.research.google.com/drive/1aXOmqr70fjm8lYgIoGJMZDsK0QRIL4Lt?usp=sharing) shows how to use it for sentence similarity, sentence classification, and named entity recognition.

<img  src="https://huggingface.co/dlicari/Italian-Legal-BERT/resolve/main/semantic_text_similarity.jpg" width="700"/> 
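
As one possible approach to the sentence-similarity use case, here is a hedged sketch that mean-pools the last hidden states and compares sentences with cosine similarity; the notebook may use a different pooling strategy or library, and the example sentences are illustrative only.

```python
# Hypothetical sentence-similarity sketch using mean-pooled embeddings;
# the Colab notebook may implement this differently.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "dlicari/Italian-Legal-BERT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embed(sentences):
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state      # (batch, seq_len, dim)
    mask = enc["attention_mask"].unsqueeze(-1)       # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)      # mean pooling

emb = embed(["Il ricorrente ha chiesto revocarsi l'obbligo di pagamento",
             "Il convenuto ha chiesto il rigetto della domanda"])
similarity = torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=0)
print(float(similarity))
```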



<h2> Citation </h2>
If you find our resource or paper useful, please consider citing it in your work.

```
@inproceedings{ita_legalbert_2022,
     author = {Daniele Licari and Giovanni Comandè},
     title = {ITALIAN-LEGAL-BERT: A Pre-trained Transformer Language Model for Italian Law},
     booktitle = {Proceedings of the Knowledge Management for Law Workshop (KM4LAW)},
     note = {Accepted for publication},
     year = {2022}
}

```