## Acknowledgement
Google supported this work by providing Google Cloud credits. Thank you, Google, for supporting open source! 🎉

## What is this?
This model is a finetuned version of [dbmdz/bert-base-turkish-cased](https://huggingface.co/dbmdz/bert-base-turkish-cased) intended for zero-shot tasks in Turkish. It was finetuned on an NLI task with `sentence-transformers` and uses the `mean` of the token embeddings as the aggregation function. I also converted it to TensorFlow, with the aggregation function rewritten in TF, for production-grade deployment in [my `ai-aas` repo on GitHub](https://github.com/monatis/ai-aas). A `sentence-transformers` sketch is shown right below, and a plain TensorFlow usage example follows in the Usage section.
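
Since the model was trained with `sentence-transformers`, it can also be loaded through that library to get sentence embeddings directly. The snippet below is a minimal sketch assuming a standard `sentence-transformers` installation; `models.Transformer` and `models.Pooling` are the library's generic building blocks, and mean pooling is chosen here to match the training setup described above.

```python
from sentence_transformers import SentenceTransformer, models

# Minimal sketch: wrap the checkpoint as a sentence-transformers model
# with mean pooling over token embeddings (Pooling defaults to mean).
word_embedding_model = models.Transformer('mys/bert-base-turkish-cased-nli-mean')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
st_model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

embeddings = st_model.encode(["Galatasaray, bu akşamki maçın ardından şampiyonluğunu ilan etmeye hazırlanıyor."])
print(embeddings.shape)  # (1, 768) for a BERT-base model
```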

## Usage
```python
import time
import tensorflow as tf
from transformers import TFAutoModel, AutoTokenizer

texts = ["Galatasaray, bu akşamki maçın ardından şampiyonluğunu ilan etmeye hazırlanıyor."]
labels = ["spor", "siyaset", "kültür"]
model_name = 'mys/bert-base-turkish-cased-nli-mean'

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModel.from_pretrained(model_name)


def label_text(model, tokenizer, texts, labels):
    texts_length = len(texts)
    # Encode the texts and the candidate labels in a single batch.
    tokens = tokenizer(texts + labels, padding=True, return_tensors='tf')
    embs = model(**tokens)[0]  # per-token hidden states: (batch, seq_len, hidden)

    # Mean-pool the token embeddings, ignoring padding tokens via the attention mask.
    attention_masks = tf.cast(tokens['attention_mask'], tf.float32)
    sample_length = tf.reduce_sum(attention_masks, axis=-1, keepdims=True)
    masked_embs = embs * tf.expand_dims(attention_masks, axis=-1)
    masked_embs = tf.reduce_sum(masked_embs, axis=1) / tf.cast(sample_length, tf.float32)

    # Inner product between text and label embeddings, then softmax over the labels.
    dists = tf.experimental.numpy.inner(masked_embs[:texts_length], masked_embs[texts_length:])
    scores = tf.nn.softmax(dists)
    results = list(zip(labels, scores.numpy().squeeze().tolist()))
    sorted_results = sorted(results, key=lambda x: x[1], reverse=True)
    sorted_results = [{"label": label, "score": f"{score:.4f}"} for label, score in sorted_results]
    return sorted_results


start = time.time()
sorted_results = label_text(model, tokenizer, texts, labels)
elapsed = time.time() - start

print(sorted_results)
print(f"Processed in {alapsed:.2f} secs")
```

Output:
```shell
[{'label': 'spor', 'score': '1.0000'}, {'label': 'siyaset', 'score': '0.0000'}, {'label': 'kültür', 'score': '0.0000'}]
Processed in 0.22 secs
```

## How it works
The `label_text()` function runs the BERT model on a concatenation of `texts` and `labels` as a single batch, and it aggregates the per-token hidden states output by the BERT model into a single vector per sequence (mean pooling over non-padding tokens). Then the inner product of the text embeddings and the label embeddings is computed as the similarity metric, and `softmax` is applied to convert these similarity values into probabilities.
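
To make the scoring step concrete, here is a tiny NumPy sketch of the inner-product-plus-softmax step with made-up 3-dimensional embeddings; the vectors are illustrative only and are not produced by the model.

```python
import numpy as np

# Toy embeddings standing in for one text and three labels; the values are made up.
text_emb = np.array([[0.9, 0.1, 0.0]])
label_embs = np.array([
    [0.8, 0.2, 0.1],   # "spor"
    [0.1, 0.7, 0.3],   # "siyaset"
    [0.0, 0.3, 0.8],   # "kültür"
])

# Inner product as the similarity metric, then softmax over the labels.
sims = text_emb @ label_embs.T  # shape (1, 3)
probs = np.exp(sims) / np.exp(sims).sum(axis=-1, keepdims=True)
print(dict(zip(["spor", "siyaset", "kültür"], probs.squeeze().round(4))))
```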

## Dataset
>[Emrah Budur](https://scholar.google.com/citations?user=zSNd03UAAAAJ), [Rıza Özçelik](https://www.cmpe.boun.edu.tr/~riza.ozcelik), [Tunga Güngör](https://www.cmpe.boun.edu.tr/~gungort/)  and [Christopher Potts](https://web.stanford.edu/~cgpotts). 2020. 
Data and Representation for Turkish Natural Language Inference. To appear in Proceedings of EMNLP. [[pdf]](https://arxiv.org/abs/2004.14963) [[bib]](https://tabilab.cmpe.boun.edu.tr/datasets/nli_datasets/nli-tr.bib)

```
@inproceedings{budur-etal-2020-data,
    title = "Data and Representation for Turkish Natural Language Inference",
    author = "Budur, Emrah and
      \"{O}z\c{c}elik, R{\i}za and
      G\"{u}ng\"{o}r, Tunga",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics"
}
```