
TavBERT base model

A Turkish BERT-style masked language model operating over characters, pre-trained by masking spans of characters, similarly to SpanBERT (Joshi et al., 2020).
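To illustrate what masking spans of characters means, here is a minimal sketch in plain Python. It is not the authors' pre-training code. The function name `mask_char_spans` and all its parameters are hypothetical, and the defaults (15% masking budget, span lengths drawn from a geometric distribution with p = 0.2, clipped at 10) are borrowed from the SpanBERT paper, not confirmed for TavBERT.

```python
import random

MASK = "[MASK]"  # mask token string, matching the one used in the usage example


def mask_char_spans(text, mask_ratio=0.15, p=0.2, max_span=10, seed=None):
    """Replace random contiguous character spans with MASK.

    Hypothetical helper: the defaults follow SpanBERT and are assumptions here.
    Returns (tokens, masked_positions), with one token per input character.
    """
    rng = random.Random(seed)
    tokens = list(text)
    n = len(tokens)
    # Number of characters to mask; at least one for non-empty input.
    budget = max(1, round(n * mask_ratio)) if n else 0
    masked = set()
    attempts = 0
    while len(masked) < budget and attempts < 10 * n:
        attempts += 1
        # Sample a span length from a geometric distribution, clipped at max_span.
        length = 1
        while length < max_span and rng.random() > p:
            length += 1
        # Never exceed the remaining budget or the text length.
        length = min(length, budget - len(masked), n)
        start = rng.randrange(0, n - length + 1)
        span = range(start, start + length)
        if any(i in masked for i in span):
            continue  # skip spans that overlap an already masked region
        masked.update(span)
    for i in masked:
        tokens[i] = MASK
    return tokens, sorted(masked)


if __name__ == "__main__":
    toks, positions = mask_char_spans("Merhaba dünya, bugün hava çok güzel.", seed=0)
    print("".join(toks))
    print(positions)
```

Because each character is its own token, a masked span of length k becomes k consecutive `[MASK]` tokens, which is the same input shape the usage example on this card builds by hand.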

How to use

import numpy as np
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model = AutoModelForMaskedLM.from_pretrained("tau/tavbert-tr")
tokenizer = AutoTokenizer.from_pretrained("tau/tavbert-tr")
model.eval()  # inference mode: disables dropout

def mask_sentence(sent, span_len=5):
    # Pick a random start for the masked span; requires len(sent) > span_len.
    start_pos = np.random.randint(0, len(sent) - span_len)
    masked_sent = sent[:start_pos] + '[MASK]' * span_len + sent[start_pos + span_len:]
    print("Masked sentence:", masked_sent)
    with torch.no_grad():
        # Drop the first and last positions (special tokens) so that the
        # remaining rows line up with the characters of the sentence.
        output = model(**tokenizer(masked_sent, return_tensors='pt'))['logits'][0][1:-1]
    # Most likely token id at each masked position (argmax over the vocabulary).
    preds = torch.argmax(output, dim=1)[start_pos:start_pos + span_len].tolist()
    pred_sent = sent[:start_pos] + ''.join(tokenizer.convert_ids_to_tokens(preds)) + sent[start_pos + span_len:]
    print("Model's prediction:", pred_sent)

Training data

The Turkish section of OSCAR (Ortiz, 2019): 27 GB of text, 77 million sentences.
