rufimelo committed on
Commit
947975d
1 Parent(s): 678c9f6

Update README.md

Files changed (1)
  1. README.md +17 -44
README.md CHANGED
@@ -10,7 +10,6 @@ tags:
  datasets:
  - assin
  - assin2
-
  widget:
  - source_sentence: "O advogado apresentou as provas ao juíz."
  sentences:
@@ -21,37 +20,25 @@ widget:
  metrics:
  - bleu
  ---
-
- # rufimelo/Legal-SBERTimbau-nli-large
-
  This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 1024 dimensional dense vector space and can be used for tasks like clustering or semantic search.
- Legal-SBERTimbau-large is based on Legal-BERTimbau-large which derives from [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) Large.
- It is adapted to the Portuguese legal domain.
-
  ## Usage (Sentence-Transformers)
-
  Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
-
  ```
  pip install -U sentence-transformers
  ```
-
  Then you can use the model like this:
-
  ```python
  from sentence_transformers import SentenceTransformer
  sentences = ["Isto é um exemplo", "Isto é um outro exemplo"]

- model = SentenceTransformer('rufimelo/Legal-SBERTimbau-nli-large')
  embeddings = model.encode(sentences)
  print(embeddings)
  ```
-
-
-
  ## Usage (HuggingFace Transformers)
-
-
  ```python
  from transformers import AutoTokenizer, AutoModel
  import torch
@@ -63,13 +50,12 @@ def mean_pooling(model_output, attention_mask):
      input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
      return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

-
  # Sentences we want sentence embeddings for
  sentences = ['This is an example sentence', 'Each sentence is converted']

  # Load model from HuggingFace Hub
- tokenizer = AutoTokenizer.from_pretrained('rufimelo/Legal-SBERTimbau-nli-large')
- model = AutoModel.from_pretrained('rufimelo/Legal-SBERTimbau-nli-large}')

  # Tokenize sentences
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
@@ -77,26 +63,21 @@ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tenso
  # Compute token embeddings
  with torch.no_grad():
      model_output = model(**encoded_input)
-
  # Perform pooling. In this case, mean pooling.
  sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
-
  print("Sentence embeddings:")
  print(sentence_embeddings)
  ```
-
-
  ## Evaluation Results STS
-
  | Model| Dataset | PearsonCorrelation |
  | ---------------------------------------- | ---------- | ---------- |
- | Legal-SBERTimbau-large| Assin | 0.76629 |
- | Legal-SBERTimbau-large| Assin2| 0.82357 |
- | Legal-SBERTimbau-base| Assin | 0.71457 |
- | Legal-SBERTimbau-base| Assin2| 0.73545|
- | Legal-SBERTimbau-sts-large| Assin | 0.76299 |
- | Legal-SBERTimbau-sts-large| Assin2| 0.81121 |
- | Legal-SBERTimbau-sts-large| stsb_multi_mt pt| 0.81726 |
  | ---------------------------------------- | ---------- |---------- |
  | paraphrase-multilingual-mpnet-base-v2| Assin | 0.71457|
  | paraphrase-multilingual-mpnet-base-v2| Assin2| 0.79831 |
@@ -104,26 +85,18 @@ print(sentence_embeddings)
  | paraphrase-multilingual-mpnet-base-v2 Fine tuned with assin(s)| Assin | 0.77641 |
  | paraphrase-multilingual-mpnet-base-v2 Fine tuned with assin(s)| Assin2| 0.79831 |
  | paraphrase-multilingual-mpnet-base-v2 Fine tuned with assin(s)| stsb_multi_mt pt| 0.84575 |
-
-
  ## Training
-
- Legal-SBERTimbau-large is based on Legal-BERTimbau-large which derives from [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) Large.
- It was trained for Natural Language Inference (NLI). This was chosen due to the lack of Portuguese available data.
- In addition to that, it was submitted to a fine tuning stage with the [assin](https://huggingface.co/datasets/assin) and [assin2](https://huggingface.co/datasets/assin2) datasets.
-
  ## Full Model Architecture
  ```
  SentenceTransformer(
- (0): Transformer({'max_seq_length': 75, 'do_lower_case': False}) with Transformer model: BertModel
- (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  )
  ```
-
  ## Citing & Authors
-
  If you use this work, please cite BERTimbau's work:
-
  ```bibtex
  @inproceedings{souza2020bertimbau,
  author = {F{\'a}bio Souza and
 
  datasets:
  - assin
  - assin2

  widget:
  - source_sentence: "O advogado apresentou as provas ao juíz."
  sentences:

  metrics:
  - bleu
  ---
+ # rufimelo/Legal-SBERTimbau-sts-large

  This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 1024 dimensional dense vector space and can be used for tasks like clustering or semantic search.
+ rufimelo/Legal-SBERTimbau-sts-large is based on Legal-BERTimbau-large, which derives from [BERTimbau](https://huggingface.co/neuralmind/bert-large-portuguese-cased) Large.
+ It is adapted to the Portuguese legal domain and trained for Semantic Textual Similarity (STS) on Portuguese datasets.
 
  ## Usage (Sentence-Transformers)

  Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

  ```
  pip install -U sentence-transformers
  ```

  Then you can use the model like this:

  ```python
  from sentence_transformers import SentenceTransformer
  sentences = ["Isto é um exemplo", "Isto é um outro exemplo"]

+ model = SentenceTransformer('rufimelo/Legal-SBERTimbau-sts-large')
  embeddings = model.encode(sentences)
  print(embeddings)
  ```
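The embeddings can then be compared directly, for example with cosine similarity. A minimal sketch using the `util.cos_sim` helper from sentence-transformers (the sentence pair is illustrative, not from the original card):

```python
from sentence_transformers import SentenceTransformer, util

# Encode an illustrative pair of Portuguese sentences
model = SentenceTransformer('rufimelo/Legal-SBERTimbau-sts-large')
sentences = ["O advogado apresentou as provas ao juíz.", "O juíz analisou as provas apresentadas pelo advogado."]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two embeddings (closer to 1 means more similar)
score = util.cos_sim(embeddings[0], embeddings[1])
print(score)
```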
 
 
 
  ## Usage (HuggingFace Transformers)

  ```python
  from transformers import AutoTokenizer, AutoModel
  import torch

      input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
      return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

  # Sentences we want sentence embeddings for
  sentences = ['This is an example sentence', 'Each sentence is converted']

  # Load model from HuggingFace Hub
+ tokenizer = AutoTokenizer.from_pretrained('rufimelo/Legal-SBERTimbau-sts-large')
+ model = AutoModel.from_pretrained('rufimelo/Legal-SBERTimbau-sts-large')

  # Tokenize sentences
  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

  # Compute token embeddings
  with torch.no_grad():
      model_output = model(**encoded_input)

  # Perform pooling. In this case, mean pooling.
  sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

  print("Sentence embeddings:")
  print(sentence_embeddings)
  ```
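From the pooled embeddings, similarity scores can be computed with plain PyTorch. A minimal sketch, reusing `sentence_embeddings` from the block above:

```python
import torch.nn.functional as F

# Cosine similarity between the first two pooled sentence embeddings
score = F.cosine_similarity(sentence_embeddings[0].unsqueeze(0), sentence_embeddings[1].unsqueeze(0))
print(score.item())
```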
 
 
  ## Evaluation Results STS

  | Model| Dataset | PearsonCorrelation |
  | ---------------------------------------- | ---------- | ---------- |
+ | Legal-SBERTimbau-sts-large| Assin | 0.76629 |
+ | Legal-SBERTimbau-sts-large| Assin2| 0.82357 |
+ | Legal-SBERTimbau-sts-base| Assin | 0.71457 |
+ | Legal-SBERTimbau-sts-base| Assin2| 0.73545|
+ | Legal-SBERTimbau-sts-large-v2| Assin | 0.76299 |
+ | Legal-SBERTimbau-sts-large-v2| Assin2| 0.81121 |
+ | Legal-SBERTimbau-sts-large-v2| stsb_multi_mt pt| 0.81726 |
  | ---------------------------------------- | ---------- |---------- |
  | paraphrase-multilingual-mpnet-base-v2| Assin | 0.71457|
  | paraphrase-multilingual-mpnet-base-v2| Assin2| 0.79831 |

  | paraphrase-multilingual-mpnet-base-v2 Fine tuned with assin(s)| Assin | 0.77641 |
  | paraphrase-multilingual-mpnet-base-v2 Fine tuned with assin(s)| Assin2| 0.79831 |
  | paraphrase-multilingual-mpnet-base-v2 Fine tuned with assin(s)| stsb_multi_mt pt| 0.84575 |

  ## Training
+ rufimelo/Legal-SBERTimbau-sts-large is based on Legal-BERTimbau-large, which derives from [BERTimbau](https://huggingface.co/neuralmind/bert-base-portuguese-cased) Large.
+ It was trained for Semantic Textual Similarity (STS), being fine-tuned on the [assin](https://huggingface.co/datasets/assin) and [assin2](https://huggingface.co/datasets/assin2) datasets.
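The exact training script is not part of this card; a minimal sketch of STS fine-tuning with sentence-transformers on such data might look like the following (the starting checkpoint, example pair, score scaling, and hyperparameters are assumptions, not the author's exact setup):

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed starting checkpoint (illustrative); the card states the model starts from Legal-BERTimbau-large
model = SentenceTransformer('rufimelo/Legal-BERTimbau-large')

# STS pairs with gold similarity rescaled to [0, 1] (assin/assin2 annotate relatedness on a 1-5 scale)
train_examples = [
    InputExample(texts=["O advogado apresentou as provas ao juíz.",
                        "O juíz analisou as provas apresentadas."], label=0.8),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# CosineSimilarityLoss pushes the cosine of the two sentence embeddings towards the gold score
train_loss = losses.CosineSimilarityLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```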
 
 
 
  ## Full Model Architecture
  ```
  SentenceTransformer(
+ (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel
+ (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
  )
  ```
 
  ## Citing & Authors

  If you use this work, please cite BERTimbau's work:

  ```bibtex
  @inproceedings{souza2020bertimbau,
  author = {F{\'a}bio Souza and