tomaarsen/span-marker-bert-base-uncased-sourcedata

@@ -1,4 +1,7 @@
 ---
 library_name: span-marker
 tags:
 - span-marker
@@ -6,35 +9,133 @@ tags:
 - ner
 - named-entity-recognition
 - generated_from_span_marker_trainer
 metrics:
 - precision
 - recall
 - f1
-widget: []
 pipeline_tag: token-classification
 ---
-# SpanMarker
-This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that can be used for Named Entity Recognition.
 ## Model Details
 ### Model Description
 - **Model Type:** SpanMarker
-<!-- - **Encoder:** [Unknown](https://huggingface.co/models/unknown) -->
 - **Maximum Sequence Length:** 256 tokens
 - **Maximum Entity Length:** 8 words
-<!-- - **Training Dataset:** [Unknown](https://huggingface.co/datasets/unknown) -->
-<!-- - **Language:** Unknown -->
-<!-- - **License:** Unknown -->
 ### Model Sources
 - **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
 - **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)
 ## Uses
 ### Direct Use for Inference
@@ -43,9 +144,9 @@ This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model that ca
 from span_marker import SpanMarkerModel
 # Download from the 🤗 Hub
-model = SpanMarkerModel.from_pretrained("span_marker_model_id")
 # Run inference
-entities = model.predict("Amelia Earhart flew her single engine Lockheed Vega 5B across the Atlantic to Paris.")
 ```
 ### Downstream Use
@@ -57,7 +158,7 @@ You can finetune this model on your own dataset.
 from span_marker import SpanMarkerModel, Trainer
 # Download from the 🤗 Hub
-model = SpanMarkerModel.from_pretrained("span_marker_model_id")
 # Specify a Dataset with "tokens" and "ner_tag" columns
 dataset = load_dataset("conll2003") # For example CoNLL2003
@@ -69,7 +170,7 @@ trainer = Trainer(
     eval_dataset=dataset["validation"],
 )
 trainer.train()
-trainer.save_model("span_marker_model_id-finetuned")
 ```
 </details>
@@ -93,6 +194,31 @@ trainer.save_model("span_marker_model_id-finetuned")
 ## Training Details
 ### Framework Versions
 - Python: 3.9.16

 ---
+language:
+- en
+license: cc-by-4.0
 library_name: span-marker
 tags:
 - span-marker
 - ner
 - named-entity-recognition
 - generated_from_span_marker_trainer
+datasets:
+- EMBO/SourceData
 metrics:
 - precision
 - recall
 - f1
+widget:
+- text: Comparison of ENCC-derived neurospheres treated with intestinal extract
+    from hypoganglionosis rats, hypoganglionosis treated with Fecal microbiota transplantation
+    (FMT) sham rat. Comparison of neuronal markers. (J) Immunofluorescence stain
+    number of PGP9.5+. Nuclei were stained blue with DAPI; Triangles indicate
+    PGP9.5+.
+- text: 'Histochemical (H & E) immunostaining (red) show T (CD3+) neutrophil
+    (Ly6b+) infiltration in skin of mice in (A). Scale bar, 100 μm. (of CD3
+    Ly6b immunostaining from CsA treated mice represent seperate analyses performed
+    on serial thin sections.) of epidermal thickness, T (CD3+) neutrophil (Ly6b+)
+    infiltration (red) in skin thin sections from (C), (n = 6). Data
+    information: Data represent mean ± SD. * P < 0.05, * * P < 0.01 by two
+    -Mann-Whitney; two independent experiments.'
+- text: 'C African green monkey kidney epithelial (Vero) were transfected with NC,
+    siMLKL, or miR-324-5p for 48 h. qPCR for expression of MLKL. Data information:
+    data are represented as means ± SD of three biological replicates. Statistical
+    analyses were performed using unpaired Student '' s t -. experiments were performed
+    at least three times, representative data are shown.'
+- text: (F) Binding between FTCD p47 between p47 p97 is necessary for mitochondria
+    aggregation mediated by FTCDwt-HA-MAO. HeLa Tet-off inducibly expressing
+    FTCDwt-HA-MAO were transfected with mammalian expression constructs of
+    siRNA-insensitive Flag-tagged p47wt / mutants at same time as treatment of p47
+    siRNA, cultured for 24 hrs. were further cultured in DOX-free medium for 48 hrs
+    for induction of FTCD-HA-MAO. After fixation, were visualized with a monoclonal
+    antibody to mitochondria polyclonal antibodies to HA Flag. Panels a-l display
+    representative. Scale bar = 10 μm. (G) Binding between FTCD p97 is necessary
+    for mitochondria aggregation mediated by FTCDwt-HA-MAO. HeLa Tet-off inducibly
+    expressing FTCDwt-HA-MAO were transfected with mammalian expression construct
+    of siRNA-insensitive Flag-tagged p97wt / mutant at same time as treatment
+    with p97 siRNA. following procedures were same as in (F). Panels a-i display
+    representative. Scale bar = 10 μm. (H) results of of (F) (G). Results
+    are shown as mean ± SD of five sets of independent experiments, with 100 counted
+    in each group in each independent experiment. Asterisks indicate a significant
+    difference at P < 0.01 compared with siRNA treatment alone ('none') compared
+    with mutant expression (Bonferroni method).
+- text: (b) Parkin is recruited selectively to depolarized mitochondria directs
+    mitophagy. HeLa transfected with HA-Parkin were treated with CCCP for indicated
+    times. Mitochondria were stained by anti-TOM20 (pseudo coloured; blue) a
+    ΔΨm dependent MitoTracker (red). Parkin was stained with anti-HA (green).
+    Without treatment, mitochondria are intact stained by both mitochondrial
+    markers, whereas Parkin is equally distributed in cytoplasm. After 2 h of CCCP
+    treatment, mitochondria are depolarized as shown by loss of MitoTracker. Parkin
+    completely translocates to mitochondria clustering at perinuclear regions. After
+    24h of CCCP treatment, massive loss of mitochondria is observed as shown by
+    disappearance of mitochondrial marker. Only Parkin-positive show mitochondrial
+    clustering clearance, in contrast to adjacent untransfected. Scale bars, 10
+    μm.
 pipeline_tag: token-classification
+base_model: bert-base-uncased
+model-index:
+- name: SpanMarker with bert-base-uncased on SourceData
+  results:
+  - task:
+      type: token-classification
+      name: Named Entity Recognition
+    dataset:
+      name: SourceData
+      type: EMBO/SourceData
+      split: test
+    metrics:
+    - type: f1
+      value: 0.8336481983993405
+      name: F1
+    - type: precision
+      value: 0.8345368269032392
+      name: Precision
+    - type: recall
+      value: 0.8327614603348888
+      name: Recall
 ---
+# SpanMarker with bert-base-uncased on SourceData
+This is a [SpanMarker](https://github.com/tomaarsen/SpanMarkerNER) model trained on the [SourceData](https://huggingface.co/datasets/EMBO/SourceData) dataset that can be used for Named Entity Recognition. This SpanMarker model uses [bert-base-uncased](https://huggingface.co/bert-base-uncased) as the underlying encoder.
 ## Model Details
 ### Model Description
 - **Model Type:** SpanMarker
+- **Encoder:** [bert-base-uncased](https://huggingface.co/bert-base-uncased)
 - **Maximum Sequence Length:** 256 tokens
 - **Maximum Entity Length:** 8 words
+- **Training Dataset:** [SourceData](https://huggingface.co/datasets/EMBO/SourceData)
+- **Language:** en
+- **License:** cc-by-4.0
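As a rough illustration of what a maximum entity length of 8 words implies, the sketch below enumerates every candidate span of up to 8 words in a pre-tokenized sentence. The function name `candidate_spans` and the example sentence are hypothetical, and this only illustrates the idea of bounded span enumeration; it is not SpanMarker's internal implementation.

```python
from typing import List, Tuple


def candidate_spans(words: List[str], max_entity_length: int = 8) -> List[Tuple[int, int]]:
    """Return (start, end) word indices, end exclusive, for every span of at most max_entity_length words."""
    spans = []
    for start in range(len(words)):
        # A span may not run past the end of the sentence.
        longest_end = min(start + max_entity_length, len(words))
        for end in range(start + 1, longest_end + 1):
            spans.append((start, end))
    return spans


# Hypothetical pre-tokenized sentence of 10 words.
words = "HeLa cells were treated with CCCP for 2 h .".split()
spans = candidate_spans(words)
print(len(words), "words ->", len(spans), "candidate spans")
```

For this 10-word sentence the enumeration yields 52 candidate spans. Under this reading of the limit, a mention longer than 8 words has no candidate span covering it, so it could not be returned as a single entity.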
 ### Model Sources
 - **Repository:** [SpanMarker on GitHub](https://github.com/tomaarsen/SpanMarkerNER)
 - **Thesis:** [SpanMarker For Named Entity Recognition](https://raw.githubusercontent.com/tomaarsen/SpanMarkerNER/main/thesis.pdf)
+### Model Labels
+| Label          | Examples                                                |
+|:---------------|:--------------------------------------------------------|
+| CELL_LINE      | "293T", "WM266.4 451Lu", "501mel"                     |
+| CELL_TYPE      | "BMDMs", "protoplasts", "epithelial"                    |
+| DISEASE        | "melanoma", "lung metastasis", "breast prostate cancer" |
+| EXP_ASSAY      | "interactions", "Yeast two-hybrid", "BiFC"            |
+| GENEPROD       | "CPL1", "FREE1 CPL1", "FREE1"                           |
+| ORGANISM       | "Arabidopsis", "yeast", "seedlings"                     |
+| SMALL_MOLECULE | "polyacrylamide", "CHX", "SDS polyacrylamide"           |
+| SUBCELLULAR    | "proteasome", "D-bodies", "plasma"                    |
+| TISSUE         | "Colon", "roots", "serum"                               |
+## Evaluation
+### Metrics
+| Label          | Precision | Recall | F1     |
+|:---------------|:----------|:-------|:-------|
+| **all**        | 0.8345    | 0.8328 | 0.8336 |
+| CELL_LINE      | 0.9060    | 0.8866 | 0.8962 |
+| CELL_TYPE      | 0.7365    | 0.7746 | 0.7551 |
+| DISEASE        | 0.6204    | 0.6531 | 0.6363 |
+| EXP_ASSAY      | 0.7224    | 0.7096 | 0.7160 |
+| GENEPROD       | 0.8944    | 0.8960 | 0.8952 |
+| ORGANISM       | 0.8752    | 0.8902 | 0.8826 |
+| SMALL_MOLECULE | 0.8304    | 0.8223 | 0.8263 |
+| SUBCELLULAR    | 0.7859    | 0.7699 | 0.7778 |
+| TISSUE         | 0.8134    | 0.8056 | 0.8094 |
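The three columns are related: assuming F1 is the usual harmonic mean of precision and recall, the overall F1 can be recomputed from the other two values. The sketch below does this for the unrounded overall precision and recall listed in the `model-index` metadata; the helper name `f1_score` is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overall test-set precision and recall as reported in the model-index metadata.
overall_precision = 0.8345368269032392
overall_recall = 0.8327614603348888

overall_f1 = f1_score(overall_precision, overall_recall)
print(round(overall_f1, 4))
```

The recomputed value agrees with the reported overall F1 of 0.8336. The per-label rows can be checked the same way, although their inputs are already rounded to four decimals, so the last digit may differ.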
 ## Uses
 ### Direct Use for Inference
 ```python
 from span_marker import SpanMarkerModel
 # Download from the 🤗 Hub
+model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-sourcedata")
 # Run inference
+entities = model.predict("Comparison of ENCC-derived neurospheres treated with intestinal extract from hypoganglionosis rats, hypoganglionosis treated with Fecal microbiota transplantation (FMT) sham rat. Comparison of neuronal markers. (J) Immunofluorescence stain number of PGP9.5+. Nuclei were stained blue with DAPI; Triangles indicate PGP9.5+.")
 ```
 ### Downstream Use
 You can finetune this model on your own dataset.
 <details><summary>Click to expand</summary>
 ```python
 from datasets import load_dataset
 from span_marker import SpanMarkerModel, Trainer
 # Download from the 🤗 Hub
+model = SpanMarkerModel.from_pretrained("tomaarsen/span-marker-bert-base-uncased-sourcedata")
 # Specify a Dataset with "tokens" and "ner_tags" columns
 dataset = load_dataset("conll2003") # For example CoNLL2003
 # Initialize a Trainer using the pretrained model & dataset
 trainer = Trainer(
     model=model,
     train_dataset=dataset["train"],
     eval_dataset=dataset["validation"],
 )
 trainer.train()
+trainer.save_model("tomaarsen/span-marker-bert-base-uncased-sourcedata-finetuned")
 ```
 </details>
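`model.predict` returns a list of dictionaries, one per predicted entity. The sketch below groups such predictions by label and drops low-confidence ones. It assumes each dictionary carries `span`, `label`, and `score` keys; the `sample_entities` list is hand-written for illustration rather than real model output, and `group_by_label` and the 0.5 threshold are hypothetical choices.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_label(entities: List[dict], min_score: float = 0.5) -> Dict[str, List[str]]:
    """Collect entity texts per label, keeping only predictions scored at or above min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["label"]].append(entity["span"])
    return dict(grouped)


# Hand-written stand-in for the list returned by model.predict(...); not real model output.
sample_entities = [
    {"span": "rats", "label": "ORGANISM", "score": 0.97},
    {"span": "PGP9.5", "label": "GENEPROD", "score": 0.91},
    {"span": "DAPI", "label": "SMALL_MOLECULE", "score": 0.88},
    {"span": "Immunofluorescence", "label": "EXP_ASSAY", "score": 0.42},
]

print(group_by_label(sample_entities))
```

Grouping by label keeps the post-processing independent of the model itself, so the same helper can be applied to the output of a finetuned checkpoint as long as the assumed keys are present.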
 ## Training Details
+### Training Set Metrics
+| Training set          | Min | Mean    | Max  |
+|:----------------------|:----|:--------|:-----|
+| Sentence length       | 4   | 71.0253 | 2609 |
+| Entities per sentence | 0   | 8.3186  | 162  |
+### Training Hyperparameters
+- learning_rate: 5e-05
+- train_batch_size: 32
+- eval_batch_size: 32
+- seed: 42
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: linear
+- lr_scheduler_warmup_ratio: 0.1
+- num_epochs: 3
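To make the `lr_scheduler_type: linear` and `lr_scheduler_warmup_ratio: 0.1` settings concrete, the sketch below computes the learning rate at a given step. It assumes the usual behaviour of a linear schedule with warmup (ramp up from 0 to the peak rate over the first 10% of steps, then decay linearly to 0). The function name `linear_schedule_lr` and the `total_steps` value are hypothetical; the actual number of training steps is not stated in this card.

```python
import math


def linear_schedule_lr(step: int, total_steps: int, base_lr: float = 5e-5, warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: ramp linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # hypothetical value, chosen only to keep the numbers readable
for step in (0, 50, 100, 550, 1000):
    print(step, linear_schedule_lr(step, total_steps))
```

With these illustrative numbers the rate peaks at 5e-05 at step 100, is back at half the peak by step 550, and reaches 0 at the final step.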
+### Training Results
+| Epoch  | Step  | Validation Loss | Validation Precision | Validation Recall | Validation F1 | Validation Accuracy |
+|:------:|:-----:|:---------------:|:--------------------:|:-----------------:|:-------------:|:-------------------:|
+| 0.5237 | 3000  | 0.0162          | 0.7972               | 0.8162            | 0.8065        | 0.9520              |
+| 1.0473 | 6000  | 0.0155          | 0.8188               | 0.8251            | 0.8219        | 0.9560              |
+| 1.5710 | 9000  | 0.0155          | 0.8213               | 0.8324            | 0.8268        | 0.9563              |
+| 2.0946 | 12000 | 0.0163          | 0.8315               | 0.8347            | 0.8331        | 0.9581              |
+| 2.6183 | 15000 | 0.0167          | 0.8303               | 0.8378            | 0.8340        | 0.9582              |
 ### Framework Versions
 - Python: 3.9.16