---
language:
- en
tags:
- ner
- chemical
- bionlp
- bc4cdr
- bioinformatics
license: apache-2.0
datasets:
- bionlp
- bc4cdr
widget:
- text: "Serotonin receptor 2A (HTR2A) gene polymorphism predicts treatment response to venlafaxine XR in generalized anxiety disorder."
---

# NER to find Chemicals

> The model was trained on the bionlp and bc4cdr datasets, starting from this [pubmed-pretrained roberta model](/raynardj/roberta-pubmed).

All the labels (the possible token classes):

```json
{
  "label2id": {
    "O": 0,
    "Chemical": 1
  }
}
```

Notice that we removed the 'B-', 'I-' prefixes from the data labels. 🗡
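
If you want to double-check the mapping, the flat label set is stored in the model config. A minimal sketch using the standard `transformers` `AutoConfig` API:

```python
from transformers import AutoConfig

# load the config of this checkpoint and inspect its label mapping
config = AutoConfig.from_pretrained("raynardj/ner-chemical-bionlp-bc5cdr-pubmed")

print(config.label2id)  # should match the mapping above
print(config.id2label)  # the reverse mapping used when decoding predictions
```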

## This is the template we suggest for using the model

Of course I'm well aware of the `aggregation_strategy` argument offered by Hugging Face, but because of the way the model was trained, the loss for trailing subword tokens is discarded: only the label of the first subword token of each word is not -100. After much searching I couldn't find a way to reproduce that behaviour with the default pipeline, so I wrote an inference class myself.
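
For context, the label alignment described above looks roughly like this at training time (a minimal sketch, not the exact training code; the example words and word-level labels are made up):

```python
from transformers import AutoTokenizer

# the pubmed roberta tokenizer linked above; any fast tokenizer behaves the same way
tokenizer = AutoTokenizer.from_pretrained("raynardj/roberta-pubmed")

words = ["Cisplatin", "induced", "nephrotoxicity"]  # hypothetical example sentence
word_labels = [1, 0, 0]                             # 1 = Chemical, 0 = O (word-level labels)

enc = tokenizer(words, is_split_into_words=True)

aligned = []
previous_word_id = None
for word_id in enc.word_ids():
    if word_id is None:                  # special tokens such as <s> and </s>
        aligned.append(-100)
    elif word_id != previous_word_id:    # first subword of a word keeps the real label
        aligned.append(word_labels[word_id])
    else:                                # trailing subwords are masked out of the loss
        aligned.append(-100)
    previous_word_id = word_id

print(aligned)
```

With that in mind, the suggested usage is: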

```python
# forgebox is a helper library published by the model's author
!pip install forgebox

from forgebox.hf.train import NERInference

# load the fine-tuned checkpoint into the inference helper
ner = NERInference.from_pretrained("raynardj/ner-chemical-bionlp-bc5cdr-pubmed")

# run prediction on a list of texts; the results come back as a dataframe
a_df = ner.predict(["text1", "text2"])
```
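
If you'd rather not add the forgebox dependency, here is a rough manual equivalent with plain `transformers` (a sketch that assumes a fast tokenizer; the example sentence is only illustrative):

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

name = "raynardj/ner-chemical-bionlp-bc5cdr-pubmed"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name)

text = "Naloxone reverses the antihypertensive effect of clonidine."  # illustrative input
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits
pred_ids = logits.argmax(-1)[0].tolist()

# keep only the prediction for the first subword of each word,
# mirroring the -100 masking used during training
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
seen = set()
for token, word_id, pred_id in zip(tokens, enc.word_ids(), pred_ids):
    if word_id is None or word_id in seen:
        continue
    seen.add(word_id)
    print(token, model.config.id2label[pred_id])
```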

> Check out our NER models for:

* [gene and gene products](/raynardj/ner-gene-dna-rna-jnlpba-pubmed)
* [chemical substance](/raynardj/ner-chemical-bionlp-bc5cdr-pubmed)
* [disease](/raynardj/ner-disease-ncbi-bionlp-bc5cdr-pubmed)