tomaarsen HF staff committed on
Commit
f00c051
·
1 Parent(s): 106d24a

Upload model

README.md ADDED
@@ -0,0 +1,231 @@
+ ---
+ language:
+ - en
+ library_name: span-marker
+ tags:
+ - span-marker
+ - token-classification
+ - ner
+ - named-entity-recognition
+ - generated_from_span_marker_trainer
+ datasets:
+ - tomaarsen/ner-orgs
+ metrics:
+ - precision
+ - recall
+ - f1
+ widget:
+ - text: Hallacas are also commonly consumed in eastern Cuba parts of Colombia, Ecuador,
+     Aruba, and Curaçao.
+ - text: The co-production of Yvon Michel's GYM and Jean Bédard's Interbox promotions
+     and televised via HBO, has trumped a proposed HBO -televised rematch between Jean
+     Pascal and RING and WBC 175-pound champion Chad Dawson that was slated for the
+     same date at Bell Centre in Montreal.
+ - text: The synoptic conditions see a low over southern Norway, bringing warm south
+     and southwesterly flows of air up from the inner continental areas of Russia and
+     Belarus.
+ - text: The RCIS recommended amongst other things that the Australian Security Intelligence
+     Organisation (ASIO) areas of investigation be widened to include terrorism.
+ - text: The large network had multiple campuses in Minnesota, Wisconsin, and South
+     Dakota.
+ pipeline_tag: token-classification
+ co2_eq_emissions:
+   emissions: 532.6472478623315
+   source: codecarbon
+   training_type: fine-tuning
+   on_cloud: false
+   cpu_model: 13th Gen Intel(R) Core(TM) i7-13700K
+   ram_total_size: 31.777088165283203
+   hours_used: 3.696
+   hardware_used: 1 x NVIDIA GeForce RTX 3090
+ base_model: bert-base-cased
+ model-index:
+ - name: SpanMarker with bert-base-cased on FewNERD, CoNLL2003, OntoNotes v5, and MultiNERD
+   results:
+   - task:
+       type: token-classification
+       name: Named Entity Recognition
+     dataset:
+       name: FewNERD, CoNLL2003, OntoNotes v5, and MultiNERD
+       type: tomaarsen/ner-orgs
+       split: test
+     metrics:
+     - type: f1
+       value: 0.0
+       name: F1
+     - type: precision
+       value: 0.0
+       name: Precision
+     - type: recall
+       value: 0.0
+       name: Recall
+ ---
+
+ # SpanMarker with bert-base-cased on FewNERD, CoNLL2003, OntoNotes v5, and MultiNERD
+
+ This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [FewNERD, CoNLL2003, OntoNotes v5, and MultiNERD](https://huggingface.co/datasets/tomaarsen/ner-orgs) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [bert-base-cased](https://huggingface.co/bert-base-cased) as the underlying encoder.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** SpanMarker
+ - **Encoder:** [bert-base-cased](https://huggingface.co/bert-base-cased)
+ - **Maximum Sequence Length:** 256 tokens
+ - **Maximum Entity Length:** 8 words
+ - **Training Dataset:** [FewNERD, CoNLL2003, OntoNotes v5, and MultiNERD](https://huggingface.co/datasets/tomaarsen/ner-orgs)
+ - **Language:** en
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
+ - **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)
+
+ ### Model Labels
+ | Label | Examples |
+ |:------|:---------------------------------------------|
+ | ORG | "IAEA", "Church 's Chicken", "Texas Chicken" |
+
+ ## Evaluation
+
+ ### Metrics
+ | Label | Precision | Recall | F1 |
+ |:--------|:----------|:-------|:----|
+ | **all** | 0.0 | 0.0 | 0.0 |
+ | ORG | 0.0 | 0.0 | 0.0 |
+
+ ## Uses
+
+ ### Direct Use for Inference
+
+ ```python
+ from span_marker import SpanMarkerModel
+
+ # Download from the 🤗 Hub
+ model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-orgs")
+ # Run inference
+ entities = model.predict("The large network had multiple campuses in Minnesota, Wisconsin, and South Dakota.")
+ ```
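
Review note: the snippet above stops at `entities = model.predict(...)`. The sketch below shows one way to consume that result. It assumes `predict` returns a list of dicts with `span`, `label`, `score`, `char_start_index` and `char_end_index` keys, which should be checked against the installed SpanMarker version. The helper `keep_confident` and the sample values are hypothetical and only stand in for real model output.

```python
from typing import Dict, List, Union

Entity = Dict[str, Union[str, float, int]]


def keep_confident(entities: List[Entity], min_score: float = 0.5) -> List[Entity]:
    """Keep entities scoring at least min_score, ordered by start offset."""
    kept = [e for e in entities if e["score"] >= min_score]
    return sorted(kept, key=lambda e: e["char_start_index"])


# Hypothetical values in the assumed output shape; real ones come from model.predict(...).
sample_entities: List[Entity] = [
    {"span": "WBC", "label": "ORG", "score": 0.91, "char_start_index": 40, "char_end_index": 43},
    {"span": "HBO", "label": "ORG", "score": 0.97, "char_start_index": 10, "char_end_index": 13},
    {"span": "Bell", "label": "ORG", "score": 0.32, "char_start_index": 60, "char_end_index": 64},
]

for entity in keep_confident(sample_entities, min_score=0.5):
    print(f'{entity["label"]}: {entity["span"]} ({entity["score"]:.2f})')
```

The threshold of 0.5 is arbitrary; a suitable value depends on the precision/recall trade-off of the application.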
+
+ ### Downstream Use
+ You can fine-tune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ ```python
+ from datasets import load_dataset
+ from span_marker import SpanMarkerModel, Trainer
+
+ # Download from the 🤗 Hub
+ model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-orgs")
+ # Specify a Dataset with "tokens" and "ner_tags" columns
+ dataset = load_dataset("conll2003")  # For example CoNLL2003
+
+ # Initialize a Trainer using the pretrained model & dataset
+ trainer = Trainer(
+     model=model,
+     train_dataset=dataset["train"],
+     eval_dataset=dataset["validation"],
+ )
+ trainer.train()
+ trainer.save_model("tomaarsen/span-marker-bert-base-orgs-finetuned")
+ ```
+ </details>
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Set Metrics
+ | Training set | Min | Median | Max |
+ |:----------------------|:----|:--------|:----|
+ | Sentence length | 1 | 22.1911 | 267 |
+ | Entities per sentence | 0 | 0.8144 | 39 |
+
+ ### Training Hyperparameters
+ - learning_rate: 5e-05
+ - train_batch_size: 32
+ - eval_batch_size: 32
+ - seed: 42
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - lr_scheduler_warmup_ratio: 0.1
+ - num_epochs: 3
+
+ ### Training Results
+ | Epoch | Step | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
+ |:------:|:-----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
+ | 0.3273 | 3000 | 0.0052 | 0.0 | 0.0 | 0.0 | 0.9413 |
+ | 0.6546 | 6000 | 0.0047 | 0.0 | 0.0 | 0.0 | 0.9334 |
+ | 0.9819 | 9000 | 0.0045 | 0.0 | 0.0 | 0.0 | 0.9376 |
+ | 1.3092 | 12000 | 0.0047 | 0.0 | 0.0 | 0.0 | 0.9377 |
+ | 1.6365 | 15000 | 0.0045 | 0.0 | 0.0 | 0.0 | 0.9339 |
+ | 1.9638 | 18000 | 0.0046 | 0.0 | 0.0 | 0.0 | 0.9373 |
+ | 2.2911 | 21000 | 0.0054 | 0.0 | 0.0 | 0.0 | 0.9351 |
+ | 2.6184 | 24000 | 0.0053 | 0.0 | 0.0 | 0.0 | 0.9373 |
+ | 2.9457 | 27000 | 0.0052 | 0.0 | 0.0 | 0.0 | 0.9359 |
+
+ ### Environmental Impact
+ Carbon emissions were measured using [CodeCarbon](https://github.com/mlco2/codecarbon).
+ - **Carbon Emitted**: 0.533 kg of CO2
+ - **Hours Used**: 3.696 hours
+
+ ### Training Hardware
+ - **On Cloud**: No
+ - **GPU Model**: 1 x NVIDIA GeForce RTX 3090
+ - **CPU Model**: 13th Gen Intel(R) Core(TM) i7-13700K
+ - **RAM Size**: 31.78 GB
+
+ ### Framework Versions
+ - Python: 3.9.16
+ - SpanMarker: 1.5.1.dev
+ - Transformers: 4.30.0
+ - PyTorch: 2.0.1+cu118
+ - Datasets: 2.14.0
+ - Tokenizers: 0.13.3
+
+ ## Citation
+
+ ### BibTeX
+ ```
+ @software{Aarsen_SpanMarker,
+     author = {Aarsen, Tom},
+     license = {Apache-2.0},
+     title = {{SpanMarker for Named Entity Recognition}},
+     url = {https://github.com/tomaarsen/SpanMarkerNER}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
added_tokens.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "<end>": 28997,
+   "<start>": 28996
+ }
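
Review note: the two marker tokens take ids 28996 and 28997, and `config.json` in this commit reports a `vocab_size` of 28998. A small sanity check, sketched below, confirms that the added ids sit directly after the base vocabulary, which works out to 28996 entries (the size of the `bert-base-cased` vocabulary). The helper name `base_vocab_size` is hypothetical.

```python
import json

# Contents of added_tokens.json from this commit.
added_tokens = json.loads('{"<end>": 28997, "<start>": 28996}')

# vocab_size as reported in config.json from this commit.
CONFIG_VOCAB_SIZE = 28998


def base_vocab_size(added: dict, total: int) -> int:
    """Vocabulary size before the added tokens were appended."""
    return total - len(added)


base = base_vocab_size(added_tokens, CONFIG_VOCAB_SIZE)

# The added tokens should occupy the ids directly after the base vocabulary.
expected_ids = list(range(base, CONFIG_VOCAB_SIZE))
assert sorted(added_tokens.values()) == expected_ids

print(base, sorted(added_tokens, key=added_tokens.get))
```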
config.json ADDED
@@ -0,0 +1,114 @@
+ {
+   "architectures": [
+     "SpanMarkerModel"
+   ],
+   "encoder": {
+     "_name_or_path": "bert-base-cased",
+     "add_cross_attention": false,
+     "architectures": [
+       "BertForMaskedLM"
+     ],
+     "attention_probs_dropout_prob": 0.1,
+     "bad_words_ids": null,
+     "begin_suppress_tokens": null,
+     "bos_token_id": null,
+     "chunk_size_feed_forward": 0,
+     "classifier_dropout": null,
+     "cross_attention_hidden_size": null,
+     "decoder_start_token_id": null,
+     "diversity_penalty": 0.0,
+     "do_sample": false,
+     "early_stopping": false,
+     "encoder_no_repeat_ngram_size": 0,
+     "eos_token_id": null,
+     "exponential_decay_length_penalty": null,
+     "finetuning_task": null,
+     "forced_bos_token_id": null,
+     "forced_eos_token_id": null,
+     "gradient_checkpointing": false,
+     "hidden_act": "gelu",
+     "hidden_dropout_prob": 0.1,
+     "hidden_size": 768,
+     "id2label": {
+       "0": "O",
+       "1": "B-ORG",
+       "2": "I-ORG"
+     },
+     "initializer_range": 0.02,
+     "intermediate_size": 3072,
+     "is_decoder": false,
+     "is_encoder_decoder": false,
+     "label2id": {
+       "B-ORG": 1,
+       "I-ORG": 2,
+       "O": 0
+     },
+     "layer_norm_eps": 1e-12,
+     "length_penalty": 1.0,
+     "max_length": 20,
+     "max_position_embeddings": 512,
+     "min_length": 0,
+     "model_type": "bert",
+     "no_repeat_ngram_size": 0,
+     "num_attention_heads": 12,
+     "num_beam_groups": 1,
+     "num_beams": 1,
+     "num_hidden_layers": 12,
+     "num_return_sequences": 1,
+     "output_attentions": false,
+     "output_hidden_states": false,
+     "output_scores": false,
+     "pad_token_id": 0,
+     "position_embedding_type": "absolute",
+     "prefix": null,
+     "problem_type": null,
+     "pruned_heads": {},
+     "remove_invalid_values": false,
+     "repetition_penalty": 1.0,
+     "return_dict": true,
+     "return_dict_in_generate": false,
+     "sep_token_id": null,
+     "suppress_tokens": null,
+     "task_specific_params": null,
+     "temperature": 1.0,
+     "tf_legacy_loss": false,
+     "tie_encoder_decoder": false,
+     "tie_word_embeddings": true,
+     "tokenizer_class": null,
+     "top_k": 50,
+     "top_p": 1.0,
+     "torch_dtype": null,
+     "torchscript": false,
+     "transformers_version": "4.30.0",
+     "type_vocab_size": 2,
+     "typical_p": 1.0,
+     "use_bfloat16": false,
+     "use_cache": true,
+     "vocab_size": 28998
+   },
+   "entity_max_length": 8,
+   "id2label": {
+     "0": "O",
+     "1": "ORG"
+   },
+   "id2reduced_id": {
+     "0": 0,
+     "1": 1,
+     "2": 1
+   },
+   "label2id": {
+     "O": 0,
+     "ORG": 1
+   },
+   "marker_max_length": 128,
+   "max_next_context": null,
+   "max_prev_context": null,
+   "model_max_length": 256,
+   "model_max_length_default": 512,
+   "model_type": "span-marker",
+   "span_marker_version": "1.5.1.dev",
+   "torch_dtype": "float32",
+   "trained_with_document_context": false,
+   "transformers_version": "4.30.0",
+   "vocab_size": 28998
+ }
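
Review note: the config carries two label spaces. The encoder keeps the IOB2 tags (`O`, `B-ORG`, `I-ORG`), while the top-level `id2label` holds the span-level labels (`O`, `ORG`), and `id2reduced_id` links the two. The sketch below copies the three mappings from this file and checks that collapsing a tag through `id2reduced_id` gives the same label as stripping its `B-`/`I-` prefix. The helper names are hypothetical, and this illustrates the data in the config rather than SpanMarker's internal code.

```python
# Mappings copied from config.json in this commit (keys converted to int).
encoder_id2label = {0: "O", 1: "B-ORG", 2: "I-ORG"}
id2reduced_id = {0: 0, 1: 1, 2: 1}
reduced_id2label = {0: "O", 1: "ORG"}


def reduce_tag(tag_id: int) -> str:
    """Map an encoder-level tag id to its span-level label."""
    return reduced_id2label[id2reduced_id[tag_id]]


def strip_scheme(tag: str) -> str:
    """Drop a leading B-/I- prefix; leave other tags such as "O" unchanged."""
    return tag.split("-", 1)[1] if tag[:2] in ("B-", "I-") else tag


for tag_id, tag in encoder_id2label.items():
    # Both routes should agree for every tag in this config.
    assert reduce_tag(tag_id) == strip_scheme(tag)
    print(f"{tag_id} {tag} -> {reduce_tag(tag_id)}")
```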
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f64ee8bee4e465b21fba71e70d47d4bb19ba4eef09d7565dc544b41248ae8e58
+ size 433332917
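
Review note: `pytorch_model.bin` is stored as a Git LFS pointer, so the diff shows only the pointer's three `key value` lines and not the weights. A minimal sketch of reading those fields follows; `parse_lfs_pointer` is a hypothetical helper, not part of any Git LFS tooling.

```python
# Pointer file contents from this commit.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:f64ee8bee4e465b21fba71e70d47d4bb19ba4eef09d7565dc544b41248ae8e58
size 433332917
"""


def parse_lfs_pointer(text: str) -> dict:
    """Split each "key value" line and break the oid into algorithm and digest."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


pointer = parse_lfs_pointer(POINTER)
print(f'{pointer["algorithm"]} digest, {pointer["size"] / 2**20:.1f} MiB')
```

The size field puts the checkpoint at roughly 413 MiB.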
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "add_prefix_space": true,
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_lower_case": false,
+   "entity_max_length": 8,
+   "marker_max_length": 128,
+   "mask_token": "[MASK]",
+   "model_max_length": 256,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
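
Review note: `entity_max_length`, `marker_max_length` and `model_max_length` together bound how much work one sentence generates. The sketch below is a back-of-the-envelope estimate, not SpanMarker's actual implementation. It assumes that every word span of up to `entity_max_length` words is a candidate and that each sample holds at most `marker_max_length` start/end marker pairs. Both function names are hypothetical.

```python
import math

# Values from tokenizer_config.json in this commit.
ENTITY_MAX_LENGTH = 8
MARKER_MAX_LENGTH = 128
MODEL_MAX_LENGTH = 256


def count_candidate_spans(num_words: int, max_span_words: int = ENTITY_MAX_LENGTH) -> int:
    """Number of contiguous word spans with 1..max_span_words words."""
    longest = min(max_span_words, num_words)
    return sum(num_words - k + 1 for k in range(1, longest + 1))


def samples_needed(num_words: int, markers_per_sample: int = MARKER_MAX_LENGTH) -> int:
    """Samples required if each one carries at most markers_per_sample spans (assumption)."""
    return math.ceil(count_candidate_spans(num_words) / markers_per_sample)


# A 22-word sentence, close to the training-set figure reported in the README.
print(count_candidate_spans(22), samples_needed(22))

# Text tokens plus one start and one end marker per span slot.
print(MODEL_MAX_LENGTH + 2 * MARKER_MAX_LENGTH)
```

Under these assumptions the text budget plus both marker blocks comes to 512 positions, which matches `model_max_length_default` and the encoder's `max_position_embeddings` in `config.json`.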
vocab.txt ADDED
The diff for this file is too large to render. See raw diff