## Acknowledgement
Google supported this work by providing Google Cloud credits. Thank you, Google, for supporting open source! 🎉
## What is this?
This model is a fine-tuned version of [dbmdz/bert-base-turkish-cased](https://huggingface.co/dbmdz/bert-base-turkish-cased) for zero-shot tasks in Turkish. It was fine-tuned on an NLI task with `sentence-transformers` and uses the `mean` of the token embeddings as the aggregation function. I also converted it to TensorFlow, with the aggregation function rewritten in TF, to use it in [my `ai-aas` repo on GitHub](https://github.com/monatis/ai-aas) for production-grade deployment. A training sketch is shown below, and a simple usage example follows in the next section.
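For reference, a model of this kind can be trained with `sentence-transformers` roughly as follows. This is a minimal sketch of the standard NLI recipe, not the exact training script; the example pairs, label mapping, and hyperparameters are illustrative assumptions.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

# Mean-pooling model on top of the Turkish BERT checkpoint.
word_emb = models.Transformer('dbmdz/bert-base-turkish-cased')
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode='mean')
model = SentenceTransformer(modules=[word_emb, pooling])

# Illustrative NLI pairs; in practice these come from an NLI dataset.
# Label mapping is assumed here (e.g. 0=contradiction, 1=entailment, 2=neutral).
train_examples = [
    InputExample(texts=['Bir adam yemek yiyor.', 'Bir adam bir şeyler yiyor.'], label=1),
    InputExample(texts=['Bir adam yemek yiyor.', 'Bir adam koşuyor.'], label=0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```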
## Usage
```python
import time
import tensorflow as tf
from transformers import TFAutoModel, AutoTokenizer
texts = ["Galatasaray, bu akşamki maçın ardından şampiyonluğunu ilan etmeye hazırlanıyor."]
labels = ["spor", "siyaset", "kültür"]
model_name = 'mys/bert-base-turkish-cased-nli-mean'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModel.from_pretrained(model_name)
def label_text(model, tokenizer, texts, labels):
    texts_length = len(texts)
    tokens = tokenizer(texts + labels, padding=True, return_tensors='tf')
    embs = model(**tokens)[0]

    # Mean-pool the token embeddings, ignoring padding positions.
    attention_masks = tf.cast(tokens['attention_mask'], tf.float32)
    sample_length = tf.reduce_sum(attention_masks, axis=-1, keepdims=True)
    masked_embs = embs * tf.expand_dims(attention_masks, axis=-1)
    masked_embs = tf.reduce_sum(masked_embs, axis=1) / tf.cast(sample_length, tf.float32)

    # Inner product between text embeddings and label embeddings,
    # then softmax over the labels to get probabilities.
    dists = tf.experimental.numpy.inner(masked_embs[:texts_length], masked_embs[texts_length:])
    scores = tf.nn.softmax(dists)
    results = list(zip(labels, scores.numpy().squeeze().tolist()))
    sorted_results = sorted(results, key=lambda x: x[1], reverse=True)
    sorted_results = [{"label": label, "score": f"{score:.4f}"} for label, score in sorted_results]
    return sorted_results

start = time.time()
sorted_results = label_text(model, tokenizer, texts, labels)
elapsed = time.time() - start
print(sorted_results)
print(f"Processed in {elapsed:.2f} secs")
```
Output:
```shell
[{'label': 'spor', 'score': '1.0000'}, {'label': 'siyaset', 'score': '0.0000'}, {'label': 'kültür', 'score': '0.0000'}]
Processed in 0.22 secs
```
## How it works
The `label_text()` function runs the BERT model on the concatenation of `texts` and `labels`, then aggregates the per-token hidden states output by the model into a single vector per sequence via masked mean pooling. The inner product of the text embeddings and the label embeddings is used as the similarity metric, and `softmax` converts these similarity scores into probabilities.
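To make the pooling-and-scoring logic easier to follow in isolation, here is a toy sketch on dummy tensors; the shapes, mask values, and printed numbers are made up for illustration.

```python
import tensorflow as tf

# Toy tensors: 1 text + 2 labels = batch of 3 sequences,
# 3 tokens each, hidden size 4. (Shapes and values are illustrative.)
embs = tf.random.normal((3, 3, 4))
attention_mask = tf.constant([[1., 1., 1.],
                              [1., 1., 0.],   # one padding token
                              [1., 0., 0.]])  # two padding tokens

# Masked mean pooling: zero out padding positions,
# then divide by the true token count per sequence.
masked = embs * tf.expand_dims(attention_mask, axis=-1)
pooled = tf.reduce_sum(masked, axis=1) / tf.reduce_sum(attention_mask, axis=-1, keepdims=True)

# Inner product of the text embedding with each label embedding,
# softmax over the labels to obtain probabilities.
dists = tf.experimental.numpy.inner(pooled[:1], pooled[1:])
probs = tf.nn.softmax(dists)
print(probs.numpy())  # e.g. [[0.61 0.39]]
```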
## Dataset
> [Emrah Budur](https://scholar.google.com/citations?user=zSNd03UAAAAJ), [Rıza Özçelik](https://www.cmpe.boun.edu.tr/~riza.ozcelik), [Tunga Güngör](https://www.cmpe.boun.edu.tr/~gungort/) and [Christopher Potts](https://web.stanford.edu/~cgpotts). 2020.
> Data and Representation for Turkish Natural Language Inference. To appear in Proceedings of EMNLP. [[pdf]](https://arxiv.org/abs/2004.14963) [[bib]](https://tabilab.cmpe.boun.edu.tr/datasets/nli_datasets/nli-tr.bib)
```
@inproceedings{budur-etal-2020-data,
    title = "Data and Representation for Turkish Natural Language Inference",
    author = "Budur, Emrah and
      \"{O}z\c{c}elik, R{\i}za and
      G\"{u}ng\"{o}r, Tunga and
      Potts, Christopher",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics"
}
```