ner-bert-indonesian-v1

Model Description

ner-bert-indonesian-v1 is a fine-tuned version of google-bert/bert-base-multilingual-uncased for named-entity recognition (NER) in Indonesian. In version 1, the model assigns each token one of the following 4 labels:

  • O - Others (tokens and entities not yet recognized by the model) - Lainnya
  • Person - Orang
  • Organisation - Organisasi
  • Place - Tempat/Lokasi

Usage

Using pipelines

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained('wuriyanto/ner-bert-indonesian-v1')
model = AutoModelForTokenClassification.from_pretrained('wuriyanto/ner-bert-indonesian-v1')

nlp = pipeline("ner", model=model, tokenizer=tokenizer)
example = "OpenAI adalah laboratorium penelitan kecerdasan buatan yang terdiri atas perusahaan waralaba OpenAI LP dan perusahaan induk nirlabanya, OpenAI Inc. Para pendirinya (sam altman) terdorong oleh ketakutan mereka akan kemungkinan bahwa kecerdasan buatan dapat mengancam keberadaan manusia, perusahaan ini ada di amerika serikat. PT. Indodana , salah satu perusahann di Indonesia mulai mengadopsi teknologi ini."

ner_results = nlp(example)
for n in ner_results:
    print(n)
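
The pipeline returns one dictionary per token, so a common next step is to drop the "O" tokens and low-confidence predictions. Here is a minimal sketch of that filtering, assuming each dictionary carries the `entity`, `score`, and `word` keys that the token-classification pipeline normally emits. The `filter_entities` helper, the `min_score` threshold, and the sample records (including their label names and scores) are illustrative assumptions, not real output of this model; check `model.config.id2label` for the label names the checkpoint actually uses.

```python
def filter_entities(ner_results, min_score=0.5, outside_label="O"):
    """Keep token predictions that are not `outside_label` and meet `min_score`."""
    return [
        r for r in ner_results
        if r["entity"] != outside_label and r["score"] >= min_score
    ]

# Hypothetical records shaped like pipeline output (values are made up).
sample = [
    {"entity": "Organisation", "score": 0.98, "word": "openai"},
    {"entity": "O", "score": 0.99, "word": "adalah"},
    {"entity": "Person", "score": 0.41, "word": "sam"},
]

kept = filter_entities(sample, min_score=0.5)
for r in kept:
    print(r["word"], r["entity"], round(r["score"], 2))
```

With the real pipeline, the same helper could be applied to `ner_results` directly. The threshold is a judgment call and would need tuning on your own data.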

Using a custom parser

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

id_to_label = {0: 'O', 1: 'Place', 2: 'Organisation', 3: 'Person'}

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('wuriyanto/ner-bert-indonesian-v1')
model = AutoModelForTokenClassification.from_pretrained('wuriyanto/ner-bert-indonesian-v1')

def tokenize_input(sentence):
    # Return PyTorch tensors; truncate input that exceeds the model's maximum length
    tokenized_input = tokenizer(sentence, return_tensors="pt", padding=True, truncation=True)
    return tokenized_input

def predict_ner(sentence):
    inputs = tokenize_input(sentence)

    with torch.no_grad():
        outputs = model(**inputs)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=2)

    # Convert predictions and tokens back to readable format
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    predicted_labels = [id_to_label[p.item()] for p in predictions[0]]

    # Merge subwords and filter out special tokens
    merged_tokens, merged_labels = [], []
    current_token, current_label = "", None
    for token, label in zip(tokens, predicted_labels):
        # Skip special tokens ([CLS], [SEP]) and commas/periods labeled "O"
        if token in ["[CLS]", "[SEP]"] or (label == "O" and token in [",", "."]):
            continue
        if token.startswith("##"):
            # Continuation piece: append it to the current word
            current_token += token[2:]
            # If the word is still labeled "O", adopt the continuation piece's label
            if current_label == 'O':
                current_label = label
        else:
            if current_token:
                merged_tokens.append(current_token)
                merged_labels.append(current_label)
            current_token = token
            current_label = label
    if current_token:
        merged_tokens.append(current_token)
        merged_labels.append(current_label)

    results = list(zip(merged_tokens, merged_labels))
    return results

sentence = "OpenAI adalah laboratorium penelitan kecerdasan buatan yang terdiri atas perusahaan waralaba OpenAI LP dan perusahaan induk nirlabanya, OpenAI Inc. Para pendirinya (sam altman) terdorong oleh ketakutan mereka akan kemungkinan bahwa kecerdasan buatan dapat mengancam keberadaan manusia, perusahaan ini ada di amerika serikat. PT. Indodana , salah satu perusahann di Indonesia mulai mengadopsi teknologi ini."
results = predict_ner(sentence)
print(results)
for token, label in results:
    print(f"{token}: {label}")
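
The parser above yields one (token, label) pair per word, so a multi-word name such as "sam altman" comes back as two separate pairs. Here is a minimal sketch of merging adjacent words that share a label into a single span, assuming the list-of-tuples shape returned by `predict_ner`. The `group_entities` helper and the sample pairs are hypothetical and written for illustration; they are not output from the model.

```python
from itertools import groupby

def group_entities(token_label_pairs, outside_label="O"):
    """Merge runs of adjacent tokens sharing one label into (text, label) spans."""
    spans = []
    for label, run in groupby(token_label_pairs, key=lambda pair: pair[1]):
        if label == outside_label:
            continue  # drop tokens that are not part of any entity
        text = " ".join(token for token, _ in run)
        spans.append((text, label))
    return spans

# Hypothetical (token, label) pairs in the shape returned by predict_ner.
pairs = [
    ("openai", "Organisation"), ("lp", "Organisation"),
    ("dan", "O"),
    ("sam", "Person"), ("altman", "Person"),
    ("di", "O"),
    ("amerika", "Place"), ("serikat", "Place"),
]
print(group_entities(pairs))
```

Because the label set has no begin/inside (B-/I-) distinction, two different entities of the same type that sit directly next to each other would be merged into one span by this approach.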

Dataset and citation info

@article{DBLP:journals/corr/abs-1810-04805,
  author    = {Jacob Devlin and
               Ming{-}Wei Chang and
               Kenton Lee and
               Kristina Toutanova},
  title     = {{BERT:} Pre-training of Deep Bidirectional Transformers for Language
               Understanding},
  journal   = {CoRR},
  volume    = {abs/1810.04805},
  year      = {2018},
  url       = {http://arxiv.org/abs/1810.04805},
  archivePrefix = {arXiv},
  eprint    = {1810.04805},
  timestamp = {Tue, 30 Oct 2018 20:39:56 +0100},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1810-04805.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}