guishe/nuner-v1_orgs · not able to identify organizations in lowercase.

Sep 25

This model does not detect organization names written in lowercase. For example, given the sentence "I want to know more about Facebook", it returns the output below:
[{'entity_group': 'ORG', 'score': 0.8983287215232849, 'word': ' Facebook', 'start': 26, 'end': 34}]
But if I use "I want to know more about facebook", it does not identify anything.

guishe

Owner Oct 9

edited Oct 9

Hi! Sorry for the late response; I couldn't answer earlier.

Yes, I fine-tuned this model from numind/NuNER-v1.0, which is trained to act as a generic token-level embedding generator for NER tasks. NuNER, in turn, is built on FacebookAI/roberta-base, a pre-trained LM that uses a case-sensitive tokenizer.

Thus, at each stage of the process (pre-training, training generic token-level embeddings for NER, and fine-tuning on a specific NER dataset), the model was able to learn patterns from tokens with cased letters. That does not mean it cannot work at all on lowercased sentences, but it will definitely work worse, simply because it wasn't trained on them. This matters even more for NER, where uppercase letters play an important role in deciding what is and is not an entity.
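To see the case sensitivity at the tokenizer level, one could compare how the base tokenizer splits a sentence as written and lowercased. The following is only a minimal sketch: the helper names (`tokenize_variants`, `differs_by_case`, `load_roberta_tokenizer`) are hypothetical and not part of the thread, and loading the tokenizer assumes `transformers` is installed and the Hub is reachable.

```python
def tokenize_variants(tokenizer, text):
    """Tokenize `text` as written and lowercased; return both token lists."""
    return {
        "cased": tokenizer.tokenize(text),
        "lowercased": tokenizer.tokenize(text.lower()),
    }


def differs_by_case(tokenizer, text):
    """True if lowercasing `text` changes the token sequence."""
    variants = tokenize_variants(tokenizer, text)
    return variants["cased"] != variants["lowercased"]


def load_roberta_tokenizer():
    """Load the base model's tokenizer (downloaded from the Hub on first use)."""
    # Imported lazily so the helpers above work with any object
    # that exposes a `tokenize(text)` method.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
```

Calling `tokenize_variants(load_roberta_tokenizer(), "I want to know more about Facebook")` prints nothing by itself; inspecting the two returned lists shows whether the model receives different tokens for "Facebook" and "facebook".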

If you need to work with uncased text, I would recommend looking for NER models fine-tuned from google-bert/bert-base-uncased, or from any other model pre-trained with an uncased tokenizer.

Here is an example showing that the model can detect lowercased "facebook", but whether it does depends on the context and text length, and it will certainly miss many entities:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained(
    "guishe/nuner-v1_orgs",
    model_max_length=512,
    add_prefix_space=True,
)
model = AutoModelForTokenClassification.from_pretrained("guishe/nuner-v1_orgs")

ner_pipe = pipeline(
    task="ner",
    model=model,
    tokenizer=tokenizer,
    device=device,
    stride=16,
    aggregation_strategy="simple",
)

text = """New body to handle disputes between EU users and Facebook, TikTok, YouTube. STOCKHOLM, Oct 8 (Reuters) - Social media users in the European Union will be able to make complaints against Facebook, ByteDance's TikTok and Alphabet's (GOOGL.O), opens new tab YouTube over content moderation to a new independent body set up in Ireland.
The body, supported by Meta Platforms' (META.O), opens new tab Oversight Board Trust and certified by Ireland's media regulator, will act as an out-of-court dispute settlement body under the EU Digital Services Act (DSA)."""

ner_pipe(text.lower())

[{'entity_group': 'ORG',
  'score': 0.5538119,
  'word': ' facebook',
  'start': 186,
  'end': 194},
 {'entity_group': 'ORG',
  'score': 0.7759812,
  'word': ' by',
  'start': 196,
  'end': 198},
 {'entity_group': 'ORG',
  'score': 0.520252,
  'word': 'tedance',
  'start': 198,
  'end': 205},
 {'entity_group': 'ORG',
  'score': 0.59606785,
  'word': ' alphabet',
  'start': 219,
  'end': 227}]

Compare this with the output for the original cased text, ner_pipe(text):

[{'entity_group': 'ORG',
  'score': 0.66524726,
  'word': ' EU',
  'start': 36,
  'end': 38},
 {'entity_group': 'ORG',
  'score': 0.96004397,
  'word': ' Facebook',
  'start': 49,
  'end': 57},
 {'entity_group': 'ORG',
  'score': 0.9502812,
  'word': ' TikTok',
  'start': 59,
  'end': 65},
 {'entity_group': 'ORG',
  'score': 0.94544315,
  'word': ' YouTube',
  'start': 67,
  'end': 74},
 {'entity_group': 'ORG',
  'score': 0.9496799,
  'word': 'Reuters',
  'start': 94,
  'end': 101},
 {'entity_group': 'ORG',
  'score': 0.75246876,
  'word': ' European Union',
  'start': 131,
  'end': 145},
 {'entity_group': 'ORG',
  'score': 0.9829754,
  'word': ' Facebook',
  'start': 186,
  'end': 194},
 {'entity_group': 'ORG',
  'score': 0.9930758,
  'word': ' ByteDance',
  'start': 196,
  'end': 205},
 {'entity_group': 'ORG',
  'score': 0.95183456,
  'word': ' TikTok',
  'start': 208,
  'end': 214},
 {'entity_group': 'ORG',
  'score': 0.98598623,
  'word': ' Alphabet',
  'start': 219,
  'end': 227},
 {'entity_group': 'ORG',
  'score': 0.8133524,
  'word': 'GOOGL',
  'start': 231,
  'end': 236},
 {'entity_group': 'ORG',
  'score': 0.6480048,
  'word': 'O',
  'start': 237,
  'end': 238},
 {'entity_group': 'ORG',
  'score': 0.93195355,
  'word': ' YouTube',
  'start': 255,
  'end': 262},
 {'entity_group': 'ORG',
  'score': 0.8795058,
  'word': " Meta Platforms'",
  'start': 355,
  'end': 370},
 {'entity_group': 'ORG',
  'score': 0.8048644,
  'word': 'META',
  'start': 372,
  'end': 376},
 {'entity_group': 'ORG',
  'score': 0.64038706,
  'word': 'O',
  'start': 377,
  'end': 378},
 {'entity_group': 'ORG',
  'score': 0.8921781,
  'word': ' Oversight Board Trust',
  'start': 395,
  'end': 416}]
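The gap between the two runs can be quantified by comparing character spans. The sketch below is an illustration, not part of the original reply: the helper names are hypothetical, and the sample data is a four-entity subset of the cased output plus the full lowercased output shown above. It assumes that lowercasing does not shift character offsets, which holds for this ASCII text.

```python
def span_set(entities):
    """Return the set of (start, end) character spans in a pipeline output."""
    return {(e["start"], e["end"]) for e in entities}


def missed_when_lowercased(cased, lowercased):
    """Entities from the cased run whose exact span is absent from the lowercased run."""
    found = span_set(lowercased)
    return [e for e in cased if (e["start"], e["end"]) not in found]


def span_recall(cased, lowercased):
    """Fraction of cased-run spans reproduced exactly in the lowercased run."""
    cased_spans = span_set(cased)
    if not cased_spans:
        return 0.0
    return len(cased_spans & span_set(lowercased)) / len(cased_spans)


# Subset of the cased output above (characters 186 to 227 only).
cased_subset = [
    {"word": " Facebook", "start": 186, "end": 194},
    {"word": " ByteDance", "start": 196, "end": 205},
    {"word": " TikTok", "start": 208, "end": 214},
    {"word": " Alphabet", "start": 219, "end": 227},
]

# The lowercased output above, reduced to word and offsets.
lowercased_subset = [
    {"word": " facebook", "start": 186, "end": 194},
    {"word": " by", "start": 196, "end": 198},
    {"word": "tedance", "start": 198, "end": 205},
    {"word": " alphabet", "start": 219, "end": 227},
]

print(span_recall(cased_subset, lowercased_subset))  # 0.5
print([e["word"] for e in missed_when_lowercased(cased_subset, lowercased_subset)])
# [' ByteDance', ' TikTok']
```

Exact-span matching is strict: "bytedance" counts as missed because the lowercased run split it into " by" and "tedance". Applied to the full outputs above, only 2 of the 17 cased spans (Facebook at 186 and Alphabet at 219) are reproduced exactly on the lowercased text.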

Best,

Guille

guishe changed discussion status to closed Oct 16