license: cc-by-nc-sa-4.0
language:
  - tr
pipeline_tag: token-classification
tags:
  - legal

NER Model for Legal Texts

Released in January 2024, this is a Turkish BERT language model pretrained from scratch, using an optimized BERT architecture, on a 2 GB Turkish legal corpus. The corpus was drawn from law-related theses available in the Higher Education Board National Thesis Center (YÖKTEZ). The pretrained model was then fine-tuned for Named Entity Recognition (NER) on human-annotated datasets provided by NewMind, a legal tech company in Istanbul, Turkey.

Our paper describes how this model was trained and reports that it outperforms previous approaches.


Overview

  • Preprint Paper: https://arxiv.org/abs/2407.00648
  • Architecture: Optimized BERT Base
  • Language: Turkish
  • Supported Labels:
    • Person
    • Law
    • Publication
    • Government
    • Corporation
    • Other
    • Project
    • Money
    • Date
    • Location
    • Court

Model Name: LegalTurk Optimized BERT


How to Use

Use a pipeline as a high-level helper

from transformers import pipeline

# Load the NER pipeline; aggregation_strategy="simple" merges sub-word tokens into entity spans
ner = pipeline("ner", model="farnazzeidi/ner-legalturk-bert-model", aggregation_strategy="simple")

# Input text
text = "Burada, Tebligat Kanunu ile VUK düzenlemesi ayrımına dikkat etmek gerekir."

# Get predictions: a list of dicts with entity_group, score, word, start, and end
predictions = ner(text)
print(predictions)
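
The pipeline output can be post-processed with plain Python. The following is a minimal sketch, assuming the list-of-dicts shape that `aggregation_strategy='simple'` produces (`entity_group`, `score`, `word`, `start`, `end`). The helper name `group_entities`, the `min_score` threshold, and the sample records are hypothetical and are not real output from this model; the label strings the model actually emits may differ from the names used here.

```python
from collections import defaultdict


def group_entities(predictions, min_score=0.5):
    """Group aggregated NER predictions by entity type.

    Spans whose confidence score is below `min_score` are dropped.
    """
    grouped = defaultdict(list)
    for pred in predictions:
        if pred["score"] >= min_score:
            grouped[pred["entity_group"]].append(pred["word"])
    return dict(grouped)


# Illustrative records in the shape the pipeline returns (NOT real model output).
sample = [
    {"entity_group": "Law", "score": 0.98, "word": "Tebligat Kanunu", "start": 8, "end": 23},
    {"entity_group": "Law", "score": 0.91, "word": "VUK", "start": 28, "end": 31},
    {"entity_group": "Other", "score": 0.32, "word": "Burada", "start": 0, "end": 6},
]

print(group_entities(sample))
```

With real predictions, calling `group_entities(predictions)` on the pipeline result would give one list of surface strings per entity type; the threshold is a judgment call and should be tuned on your own data.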

Load model directly

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("farnazzeidi/ner-legalturk-bert-model")
model = AutoModelForTokenClassification.from_pretrained("farnazzeidi/ner-legalturk-bert-model")

text = "Burada, Tebligat Kanunu ile VUK düzenlemesi ayrımına dikkat etmek gerekir."
inputs = tokenizer(text, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label per token (softmax is not needed before argmax)
label_ids = outputs.logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Map label ids to label names, skipping special tokens such as [CLS] and [SEP]
predictions = [
    (token, model.config.id2label[label_id.item()])
    for token, label_id in zip(tokens, label_ids)
    if token not in tokenizer.all_special_tokens
]

print(predictions)
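
Loading the model directly yields one label per sub-word token rather than per word. The following is a minimal sketch of merging those pieces back into words, assuming the tokenizer marks continuation pieces with the usual BERT WordPiece `##` prefix. The helper name `merge_wordpieces`, the first-piece labeling rule, and the sample tokens and labels are hypothetical and are not taken from this model.

```python
def merge_wordpieces(token_labels):
    """Merge WordPiece sub-tokens (prefixed with '##') into whole words.

    Each merged word keeps the label predicted for its first sub-token.
    """
    merged = []
    for token, label in token_labels:
        if token.startswith("##") and merged:
            # Continuation piece: append it to the previous word, keep that word's label
            word, first_label = merged[-1]
            merged[-1] = (word + token[2:], first_label)
        else:
            merged.append((token, label))
    return merged


# Illustrative (token, label) pairs; the split and the labels are made up.
sample = [("Teb", "LAW"), ("##ligat", "LAW"), ("Kanunu", "LAW"), ("ile", "O")]

print(merge_wordpieces(sample))
```

Taking the first piece's label is only one possible convention; a majority vote over the pieces of a word is a common alternative.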

Authors

Farnaz Zeidi, Mehmet Fatih Amasyali, Çigdem Erol


License

This model is shared under the CC BY-NC-SA 4.0 License. You are free to use, share, and adapt the model for non-commercial purposes, provided that you give appropriate credit to the authors and distribute any adaptations under the same license.

For commercial use, please contact zeidi.uni@gmail.com.