tomaarsen (HF staff) committed
Commit 5da50f4
1 Parent(s): 522cc63

Upload README.md

Files changed (1):
1. README.md +137 -11
README.md CHANGED
@@ -1,4 +1,7 @@
 ---
 library_name: span-marker
 tags:
 - span-marker
@@ -6,35 +9,133 @@ tags:
 - ner
 - named-entity-recognition
 - generated_from_span_marker_trainer
 metrics:
 - precision
 - recall
 - f1
-widget: []
 pipeline_tag: token-classification
 ---
 
-# SpanMarker
 
-This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that can be used for Named Entity Recognition.
 
 ## Model Details
 
 ### Model Description
 
 - **Model Type:** SpanMarker
-<!-- - **Encoder:** [Unknown](https://huggingface.co/models/unknown) -->
 - **Maximum Sequence Length:** 256 tokens
 - **Maximum Entity Length:** 8 words
-<!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
-<!-- - **Language:** Unknown -->
-<!-- - **License:** Unknown -->
 
 ### Model Sources
 
 - **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
 - **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)
 
 ## Uses
 
 ### Direct Use for Inference
@@ -43,9 +144,9 @@ This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that ca
 from span_marker import SpanMarkerModel
 
 # Download from the 🤗 Hub
-model = SpanMarkerModel.from_pretrained("span_marker_model_id")
 # Run inference
-entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
 ```
 
 ### Downstream Use
@@ -57,7 +158,7 @@ You can finetune this model on your own dataset.
 from span_marker import SpanMarkerModel, Trainer
 
 # Download from the 🤗 Hub
-model = SpanMarkerModel.from_pretrained("span_marker_model_id")
 
 # Specify a Dataset with "tokens" and "ner_tags" columns
 dataset = load_dataset("conll2003")  # For example CoNLL2003
@@ -69,7 +170,7 @@ trainer = Trainer(
     eval_dataset=dataset["validation"],
 )
 trainer.train()
-trainer.save_model("span_marker_model_id-finetuned")
 ```
 </details>
 
 
@@ -93,6 +194,31 @@ trainer.save_model("span_marker_model_id-finetuned")
93
 
94
  ## Training Details
95
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
96
  ### Framework Versions
97
 
98
  - Python: 3.9.16
 
 ---
+language:
+- en
+license: cc-by-4.0
 library_name: span-marker
 tags:
 - span-marker
 
 - ner
 - named-entity-recognition
 - generated_from_span_marker_trainer
+datasets:
+- EMBO/SourceData
 metrics:
 - precision
 - recall
 - f1
+widget:
+- text: Comparison of ENCC-derived neurospheres treated with intestinal extract
+    from hypoganglionosis rats, hypoganglionosis treated with Fecal microbiota transplantation
+    (FMT) sham rat. Comparison of neuronal markers. (J) Immunofluorescence stain
+    number of PGP9.5+. Nuclei were stained blue with DAPI; Triangles indicate
+    PGP9.5+.
+- text: 'Histochemical (H & E) immunostaining (red) show T (CD3+) neutrophil
+    (Ly6b+) infiltration in skin of mice in (A). Scale bar, 100 μm. (of CD3
+    Ly6b immunostaining from CsA treated mice represent seperate analyses performed
+    on serial thin sections.) of epidermal thickness, T (CD3+) neutrophil (Ly6b+)
+    infiltration (red) in skin thin sections from (C), (n = 6). Data
+    information: Data represent mean ± SD. * P < 0.05, * * P < 0.01 by two
+    -Mann-Whitney; two independent experiments.'
+- text: 'C African green monkey kidney epithelial (Vero) were transfected with NC,
+    siMLKL, or miR-324-5p for 48 h. qPCR for expression of MLKL. Data information:
+    data are represented as means ± SD of three biological replicates. Statistical
+    analyses were performed using unpaired Student '' s t -. experiments were performed
+    at least three times, representative data are shown.'
+- text: (F) Binding between FTCD p47 between p47 p97 is necessary for mitochondria
+    aggregation mediated by FTCDwt-HA-MAO. HeLa Tet-off inducibly expressing
+    FTCDwt-HA-MAO were transfected with mammalian expression constructs of
+    siRNA-insensitive Flag-tagged p47wt / mutants at same time as treatment of p47
+    siRNA, cultured for 24 hrs. were further cultured in DOX-free medium for 48 hrs
+    for induction of FTCD-HA-MAO. After fixation, were visualized with a monoclonal
+    antibody to mitochondria polyclonal antibodies to HA Flag. Panels a-l display
+    representative. Scale bar = 10 μm. (G) Binding between FTCD p97 is necessary
+    for mitochondria aggregation mediated by FTCDwt-HA-MAO. HeLa Tet-off inducibly
+    expressing FTCDwt-HA-MAO were transfected with mammalian expression construct
+    of siRNA-insensitive Flag-tagged p97wt / mutant at same time as treatment
+    with p97 siRNA. following procedures were same as in (F). Panels a-i display
+    representative. Scale bar = 10 μm. (H) results of of (F) (G). Results
+    are shown as mean ± SD of five sets of independent experiments, with 100 counted
+    in each group in each independent experiment. Asterisks indicate a significant
+    difference at P < 0.01 compared with siRNA treatment alone ('none') compared
+    with mutant expression (Bonferroni method).
+- text: (b) Parkin is recruited selectively to depolarized mitochondria directs
+    mitophagy. HeLa transfected with HA-Parkin were treated with CCCP for indicated
+    times. Mitochondria were stained by anti-TOM20 (pseudo coloured; blue) a
+    ΔΨm dependent MitoTracker (red). Parkin was stained with anti-HA (green).
+    Without treatment, mitochondria are intact stained by both mitochondrial
+    markers, whereas Parkin is equally distributed in cytoplasm. After 2 h of CCCP
+    treatment, mitochondria are depolarized as shown by loss of MitoTracker. Parkin
+    completely translocates to mitochondria clustering at perinuclear regions. After
+    24h of CCCP treatment, massive loss of mitochondria is observed as shown by
+    disappearance of mitochondrial marker. Only Parkin-positive show mitochondrial
+    clustering clearance, in contrast to adjacent untransfected. Scale bars, 10
+    μm.
 pipeline_tag: token-classification
+base_model: bert-base-uncased
+model-index:
+- name: SpanMarker with bert-base-uncased on SourceData
+  results:
+  - task:
+      type: token-classification
+      name: Named Entity Recognition
+    dataset:
+      name: SourceData
+      type: EMBO/SourceData
+      split: test
+    metrics:
+    - type: f1
+      value: 0.8336481983993405
+      name: F1
+    - type: precision
+      value: 0.8345368269032392
+      name: Precision
+    - type: recall
+      value: 0.8327614603348888
+      name: Recall
 ---
 
+# SpanMarker with bert-base-uncased on SourceData
 
+This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [SourceData](https://huggingface.co/datasets/EMBO/SourceData) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [bert-base-uncased](https://huggingface.co/bert-base-uncased) as the underlying encoder.
 
 ## Model Details
 
 ### Model Description
 
 - **Model Type:** SpanMarker
+- **Encoder:** [bert-base-uncased](https://huggingface.co/bert-base-uncased)
 - **Maximum Sequence Length:** 256 tokens
 - **Maximum Entity Length:** 8 words
+- **Training Dataset:** [SourceData](https://huggingface.co/datasets/EMBO/SourceData)
+- **Language:** en
+- **License:** cc-by-4.0
 
 ### Model Sources
 
 - **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
 - **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)
 
+### Model Labels
+| Label          | Examples                                                |
+|:---------------|:--------------------------------------------------------|
+| CELL_LINE      | "293T", "WM266.4 451Lu", "501mel"                       |
+| CELL_TYPE      | "BMDMs", "protoplasts", "epithelial"                    |
+| DISEASE        | "melanoma", "lung metastasis", "breast prostate cancer" |
+| EXP_ASSAY      | "interactions", "Yeast two-hybrid", "BiFC"              |
+| GENEPROD       | "CPL1", "FREE1 CPL1", "FREE1"                           |
+| ORGANISM       | "Arabidopsis", "yeast", "seedlings"                     |
+| SMALL_MOLECULE | "polyacrylamide", "CHX", "SDS polyacrylamide"           |
+| SUBCELLULAR    | "proteasome", "D-bodies", "plasma"                      |
+| TISSUE         | "Colon", "roots", "serum"                               |
+
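The label inventory above can also be used to initialize a fresh SpanMarker model from the base encoder, e.g. to reproduce or extend this training run. A minimal sketch; the IOB2 expansion of the labels is an assumption about the dataset's tagging scheme, while the two length limits come from the Model Description above:

```python
from span_marker import SpanMarkerModel

LABELS = [
    "CELL_LINE", "CELL_TYPE", "DISEASE", "EXP_ASSAY", "GENEPROD",
    "ORGANISM", "SMALL_MOLECULE", "SUBCELLULAR", "TISSUE",
]

# Assumed IOB2 scheme: an "O" tag plus B-/I- tags for every label.
iob2_labels = ["O"] + [f"{p}-{label}" for label in LABELS for p in ("B", "I")]

model = SpanMarkerModel.from_pretrained(
    "bert-base-uncased",
    labels=iob2_labels,
    model_max_length=256,  # "Maximum Sequence Length" above
    entity_max_length=8,   # "Maximum Entity Length" above
)
```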
+## Evaluation
+
+### Metrics
+| Label          | Precision | Recall | F1     |
+|:---------------|:----------|:-------|:-------|
+| **all**        | 0.8345    | 0.8328 | 0.8336 |
+| CELL_LINE      | 0.9060    | 0.8866 | 0.8962 |
+| CELL_TYPE      | 0.7365    | 0.7746 | 0.7551 |
+| DISEASE        | 0.6204    | 0.6531 | 0.6363 |
+| EXP_ASSAY      | 0.7224    | 0.7096 | 0.7160 |
+| GENEPROD       | 0.8944    | 0.8960 | 0.8952 |
+| ORGANISM       | 0.8752    | 0.8902 | 0.8826 |
+| SMALL_MOLECULE | 0.8304    | 0.8223 | 0.8263 |
+| SUBCELLULAR    | 0.7859    | 0.7699 | 0.7778 |
+| TISSUE         | 0.8134    | 0.8056 | 0.8094 |
+
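Per-label quality varies noticeably (DISEASE and EXP_ASSAY trail the stronger labels by 15-20 F1 points), so downstream pipelines may want to filter predictions by confidence. A minimal sketch, assuming the `score` key that `predict` returns in recent span_marker releases; the 0.7 threshold is purely illustrative:

```python
from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-sourcedata")
# Example sentence built from the label examples above (melanoma, CHX).
entities = model.predict("Melanoma cells were treated with CHX for 24 hrs.")

# Drop low-confidence spans; the cutoff could be tuned per label if e.g.
# DISEASE predictions need stricter filtering than GENEPROD predictions.
confident = [entity for entity in entities if entity["score"] >= 0.7]
```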
 ## Uses
 
 ### Direct Use for Inference
 
 from span_marker import SpanMarkerModel
 
 # Download from the 🤗 Hub
+model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-sourcedata")
 # Run inference
+entities = model.predict("Comparison of ENCC-derived neurospheres treated with intestinal extract from hypoganglionosis rats, hypoganglionosis treated with Fecal microbiota transplantation (FMT) sham rat. Comparison of neuronal markers. (J) Immunofluorescence stain number of PGP9.5+. Nuclei were stained blue with DAPI; Triangles indicate PGP9.5+.")
 ```
 
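`predict` returns a plain list of dictionaries, one per detected span, which makes post-processing straightforward. A minimal sketch; the `span`, `label`, and `score` keys follow the span_marker documentation for recent releases and should be treated as assumptions for your installed version:

```python
from span_marker import SpanMarkerModel

model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-sourcedata")
# Example sentence adapted from the widget examples above.
entities = model.predict("HeLa cells transfected with HA-Parkin were treated with CCCP.")

# Print each detected span with its predicted label and confidence.
for entity in entities:
    print(f"{entity['span']!r}: {entity['label']} (score={entity['score']:.2f})")
```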
 ### Downstream Use
 
 from span_marker import SpanMarkerModel, Trainer
 
 # Download from the 🤗 Hub
+model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-sourcedata")
 
 # Specify a Dataset with "tokens" and "ner_tags" columns
 dataset = load_dataset("conll2003")  # For example CoNLL2003

     eval_dataset=dataset["validation"],
 )
 trainer.train()
+trainer.save_model("tomaarsen/span-marker-bert-base-uncased-sourcedata-finetuned")
 ```
 </details>
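The diff elides the middle of this snippet (the `trainer = Trainer(` construction is unchanged context, visible only in the hunk header above). For reference, a complete runnable version, under the assumption that the elided part only constructs the Trainer with a `train_dataset` argument, plus the `from datasets import load_dataset` import the fragment implies:

```python
from datasets import load_dataset
from span_marker import SpanMarkerModel, Trainer

# Download from the 🤗 Hub
model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-sourcedata")

# Specify a Dataset with "tokens" and "ner_tags" columns
dataset = load_dataset("conll2003")  # For example CoNLL2003

# Initialize a Trainer and finetune on the new dataset
trainer = Trainer(
    model=model,
    train_dataset=dataset["train"],  # assumed; only eval_dataset is visible in the diff
    eval_dataset=dataset["validation"],
)
trainer.train()
trainer.save_model("tomaarsen/span-marker-bert-base-uncased-sourcedata-finetuned")
```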
 
 
 ## Training Details
 
+### Training Set Metrics
+| Training set          | Min | Median  | Max  |
+|:----------------------|:----|:--------|:-----|
+| Sentence length       | 4   | 71.0253 | 2609 |
+| Entities per sentence | 0   | 8.3186  | 162  |
+
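These statistics can be recomputed from the preprocessed dataset. A minimal sketch; the `tokens`/`ner_tags` column names, the IOB2 string tags, and loading `EMBO/SourceData` without a configuration name are all assumptions about the preprocessing, not guarantees of the dataset's actual schema:

```python
from statistics import median

from datasets import load_dataset

# Assumed: a split with "tokens" (list of words) and "ner_tags" (IOB2 strings).
dataset = load_dataset("EMBO/SourceData", split="train")

sentence_lengths = [len(tokens) for tokens in dataset["tokens"]]
entities_per_sentence = [
    sum(tag.startswith("B-") for tag in tags)  # each B- tag opens one entity
    for tags in dataset["ner_tags"]
]

for name, values in (
    ("Sentence length", sentence_lengths),
    ("Entities per sentence", entities_per_sentence),
):
    print(f"{name}: min={min(values)}, median={median(values)}, max={max(values)}")
```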
+### Training Hyperparameters
+- learning_rate: 5e-05
+- train_batch_size: 32
+- eval_batch_size: 32
+- seed: 42
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: linear
+- lr_scheduler_warmup_ratio: 0.1
+- num_epochs: 3
+
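span_marker's Trainer accepts a standard transformers `TrainingArguments`, so these hyperparameters map onto it directly. A rough reconstruction; `output_dir` is invented for illustration, the `per_device_*` mapping of the batch sizes is an assumption, and the listed Adam settings are the transformers defaults, so they need no explicit arguments:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="models/span-marker-bert-base-uncased-sourcedata",  # hypothetical
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=3,
)
# Then: trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
```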
+### Training Results
+| Epoch  | Step  | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
+|:------:|:-----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
+| 0.5237 | 3000  | 0.0162          | 0.7972               | 0.8162            | 0.8065        | 0.9520              |
+| 1.0473 | 6000  | 0.0155          | 0.8188               | 0.8251            | 0.8219        | 0.9560              |
+| 1.5710 | 9000  | 0.0155          | 0.8213               | 0.8324            | 0.8268        | 0.9563              |
+| 2.0946 | 12000 | 0.0163          | 0.8315               | 0.8347            | 0.8331        | 0.9581              |
+| 2.6183 | 15000 | 0.0167          | 0.8303               | 0.8378            | 0.8340        | 0.9582              |
+
 ### Framework Versions
 
 - Python: 3.9.16