raynardj/ner-chemical-bionlp-bc5cdr-pubmed


raynardj committed on Nov 4, 2021

Commit

ab2deef

·

1 Parent(s): e48fe2f

a readme for 1st time

Files changed (1)

README.md +91 -0

README.md ADDED

	@@ -0,0 +1,91 @@

+---
+language:
+- en
+tags:
+- ner
+- chemical
+- bionlp
+- bc5cdr
+- bioinformatics
+license: apache-2.0
+datasets:
+- bionlp
+- bc5cdr
+widget:
+- text: "Serotonin receptor 2A (HTR2A) gene polymorphism predicts treatment response to venlafaxine XR in generalized anxiety disorder."
+---
+# NER to find Chemicals
+> The model was trained on the bionlp and bc5cdr datasets, starting from this [pubmed-pretrained roberta model](/raynardj/roberta-pubmed)
+The possible token classes (labels) are:
+```json
+{"label2id":
+  {
+    "O": 0,
+    "Chemical": 1
+  }
+}
+```
+Note: we removed the 'B-', 'I-', etc. prefixes from the data labels.🗡
+## This is the template we suggest for using the model
+```python
+from transformers import pipeline
+PRETRAINED = "raynardj/ner-chemical-bionlp-bc5cdr-pubmed"
+ner = pipeline(task="ner",model=PRETRAINED, tokenizer=PRETRAINED)
+ner("Your text", aggregation_strategy="first")
+```
+And here is how to merge consecutive tokens in the output into whole entities ⭐️
+```python
+import pandas as pd
+from transformers import AutoTokenizer
+tokenizer = AutoTokenizer.from_pretrained(PRETRAINED)
+def clean_output(outputs):
+    results = []
+    current = []
+    last_idx = 0
+    # group tokens whose positions (index) are consecutive
+    for output in outputs:
+        if output["index"]-1==last_idx:
+            current.append(output)
+        else:
+            results.append(current)
+            current = [output, ]
+        last_idx = output["index"]
+    if len(current)>0:
+        results.append(current)
+    # join each group of tokens back into one string
+    strings = []
+    for c in results:
+        tokens = []
+        starts = []
+        ends = []
+        for o in c:
+            tokens.append(o['word'])
+            starts.append(o['start'])
+            ends.append(o['end'])
+        new_str = tokenizer.convert_tokens_to_string(tokens)
+        if new_str!='':
+            strings.append(dict(
+                word=new_str,
+                start = min(starts),
+                end = max(ends),
+                entity = c[0]['entity']
+            ))
+    return strings
+def entity_table(pipeline, **pipeline_kw):
+    if "aggregation_strategy" not in pipeline_kw:
+        pipeline_kw["aggregation_strategy"] = "first"
+    def create_table(text):
+        return pd.DataFrame(
+            clean_output(
+                pipeline(text, **pipeline_kw)
+            )
+        )
+    return create_table
+# will return a dataframe
+entity_table(ner)(YOUR_VERY_CONTENTFUL_TEXT)
+```
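
The grouping step inside `clean_output` can be tried without downloading the model. The following is a minimal sketch, not part of the commit: `group_consecutive` and `mock_outputs` are hypothetical names, and the token dicts are made up for illustration rather than real model output. It assumes each prediction carries the `index`, `start`, and `end` keys that `clean_output` reads; whether the pipeline actually emits `index` may depend on the aggregation strategy and the installed transformers version.

```python
def group_consecutive(outputs):
    """Split token-level predictions into runs of consecutive "index" values."""
    groups = []
    current = []
    last_idx = None
    for output in outputs:
        if last_idx is not None and output["index"] - 1 == last_idx:
            # this token directly follows the previous one: same entity span
            current.append(output)
        else:
            # a gap in positions starts a new span
            if current:
                groups.append(current)
            current = [output]
        last_idx = output["index"]
    if current:
        groups.append(current)
    return groups


# Hypothetical predictions (illustrative only, not produced by the model)
mock_outputs = [
    {"index": 3, "word": "ven", "start": 10, "end": 13, "entity": "Chemical"},
    {"index": 4, "word": "lafaxine", "start": 13, "end": 21, "entity": "Chemical"},
    {"index": 9, "word": "serotonin", "start": 40, "end": 49, "entity": "Chemical"},
]

groups = group_consecutive(mock_outputs)
# character span covered by each group
spans = [(min(o["start"] for o in g), max(o["end"] for o in g)) for g in groups]
print(spans)
```

Unlike the loop in `clean_output`, this sketch never appends an empty group when the first token's index is not 1, so no later emptiness check is needed.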