---
license: apache-2.0
library_name: transformers
language:
- en
- pt
pipeline_tag: translation
---
# Transformer-eng-por

## Model Overview

The transformer-eng-por model is an encoder-decoder Transformer trained for English-to-Portuguese translation.

### Details

- **Size:** 23,805,216 parameters
- **Model type:** Transformer
- **Optimizer:** `rmsprop`
- **Number of Epochs:** 30
- **Embedding dimensionality:** 256
- **Dense dimensionality:** 2048
- **Attention heads:** 8
- **Vocabulary size:** 20000
- **Sequence length:** 20
- **Hardware:** Tesla V4
- **Emissions:** Not measured
- **Total Energy Consumption:** Not measured
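
As an illustration of how these hyperparameters fit together, the sketch below assembles an encoder-decoder model with the dimensions listed above using the repository's `keras_transformer_blocks` module. The constructor signatures (`PositionalEmbedding(sequence_length, vocab_size, embed_dim)`, `TransformerEncoder(embed_dim, dense_dim, num_heads)`, `TransformerDecoder(embed_dim, dense_dim, num_heads)`) and the loss function are assumptions based on the common Keras implementation of these blocks, not the exact training code.

```python
# Illustrative sketch only: block signatures and the loss are assumed,
# not taken from the original training script.
import keras
from keras import layers
from keras_transformer_blocks import (TransformerEncoder,
                                      PositionalEmbedding,
                                      TransformerDecoder)

vocab_size = 20000      # Vocabulary size
sequence_length = 20    # Sequence length
embed_dim = 256         # Embedding dimensionality
dense_dim = 2048        # Dense dimensionality
num_heads = 8           # Attention heads

# Encoder: embeds the English tokens and applies self-attention.
encoder_inputs = keras.Input(shape=(None,), dtype="int64", name="english")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(encoder_inputs)
encoder_outputs = TransformerEncoder(embed_dim, dense_dim, num_heads)(x)

# Decoder: embeds the shifted Portuguese tokens and attends to the encoder output.
decoder_inputs = keras.Input(shape=(None,), dtype="int64", name="portuguese")
x = PositionalEmbedding(sequence_length, vocab_size, embed_dim)(decoder_inputs)
x = TransformerDecoder(embed_dim, dense_dim, num_heads)(x, encoder_outputs)
decoder_outputs = layers.Dense(vocab_size, activation="softmax")(x)

transformer = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
transformer.compile(optimizer="rmsprop",
                    loss="sparse_categorical_crossentropy",
                    metrics=["accuracy"])
```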

### How to Use

```python
import tensorflow as tf
import numpy as np
import string
import keras
import re

strip_chars = string.punctuation
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")


def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(lowercase, f"[{re.escape(strip_chars)}]", "")

portuguese_vocabulary_path = hf_hub_download(
    repo_id="AiresPucrs/transformer-eng-por",
    filename="keras_transformer_blocks.py",
    repo_type='model',
    local_dir="./")

from keras_transformer_blocks import TransformerEncoder, PositionalEmbedding, TransformerDecoder

transformer = keras.models.load_model("/content/transformer-eng-por/transformer-eng-por.h5",
    custom_objects={"TransformerEncoder": TransformerEncoder,
        "PositionalEmbedding": PositionalEmbedding,
        "TransformerDecoder": TransformerDecoder})

with open('portuguese_vocabulary.txt', encoding='utf-8', errors='backslashreplace') as fp:
    portuguese_vocab = [line.strip() for line in fp]
    fp.close()

with open('english_vocabulary.txt', encoding='utf-8', errors='backslashreplace') as fp:
    english_vocab = [line.strip() for line in fp]
    fp.close()


# The target vectorizer keeps one extra position (20 + 1) because the decoder
# input/target pair is built by shifting the Portuguese sequence by one token.
target_vectorization = tf.keras.layers.TextVectorization(max_tokens=20000,
                                                         output_mode="int",
                                                         output_sequence_length=21,
                                                         standardize=custom_standardization,
                                                         vocabulary=portuguese_vocab)

source_vectorization = tf.keras.layers.TextVectorization(max_tokens=20000,
                                                         output_mode="int",
                                                         output_sequence_length=20,
                                                         vocabulary=english_vocab)

portuguese_index_lookup = dict(zip(range(len(portuguese_vocab)), portuguese_vocab))
max_decoded_sentence_length = 20


def decode_sequence(input_sentence):
    """Greedy decoding: feed the tokens generated so far back into the decoder
    and pick the most probable next token until [end] or the length limit."""
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"

    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization([decoded_sentence])[:, :-1]
        predictions = transformer([tokenized_input_sentence, tokenized_target_sentence])
        sampled_token_index = np.argmax(predictions[0, i, :])
        sampled_token = portuguese_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence


eng_sentences = ["What is its name?",
                 "How old are you?",
                 "I know you know where Mary is.",
                 "We will show Tom.",
                 "What do you all do?",
                 "Don't do it!"]

for sentence in eng_sentences:
    print(f"English sentence:\n{sentence}")
    print(f'Portuguese translation:\n{decode_sequence(sentence)}')
    print('-' * 50)
```
This will output the following:

```
English sentence:
What is its name?
Portuguese translation:
[start] qual é o nome dele [end]
--------------------------------------------------
English sentence:
How old are you?
Portuguese translation:
[start] quantos anos você tem [end]
--------------------------------------------------
English sentence:
I know you know where Mary is.
Portuguese translation:
[start] eu sei que você sabe onde mary está [end]
--------------------------------------------------
English sentence:
We will show Tom.
Portuguese translation:
[start] vamos ligar para o tom [end]
--------------------------------------------------
English sentence:
What do you all do?
Portuguese translation:
[start] o que vocês todos nós têm feito [end]
--------------------------------------------------
English sentence:
Don't do it!
Portuguese translation:
[start] não faça isso [end]
--------------------------------------------------
```
## Intended Use

This model was created for research purposes only. Specifically, it was designed to translate sentences from English to Portuguese. We do not recommend any application of this model outside this scope.


## Performance Metrics

Accuracy: 76.46%
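
The evaluation protocol behind this figure is not detailed in this card. If it corresponds to token-level accuracy on the target sequences, it could be computed with something like the masked variant sketched below, which ignores padding positions; treat this purely as an illustration, not the script that produced the 76.46% number.

```python
# Illustrative sketch of token-level accuracy that ignores padding (index 0).
# This is NOT the script that produced the reported figure.
import tensorflow as tf

def masked_token_accuracy(y_true, y_pred):
    """y_true: (batch, seq_len) integer ids; y_pred: (batch, seq_len, vocab) probabilities."""
    pred_ids = tf.cast(tf.argmax(y_pred, axis=-1), y_true.dtype)
    matches = tf.cast(pred_ids == y_true, tf.float32)
    mask = tf.cast(y_true != 0, tf.float32)  # 0 is the padding index in TextVectorization
    return tf.reduce_sum(matches * mask) / tf.maximum(tf.reduce_sum(mask), 1.0)
```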


## Training Data

The model was trained on the [English-Portuguese translation](https://www.kaggle.com/datasets/nageshsingh/englishportuguese-translation) dataset from Kaggle, which consists of English sentences paired with their Portuguese translations.
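
Although the exact preprocessing script is not included here, the `[start]` and `[end]` markers in the model's output suggest the target sentences were wrapped in those tokens before vectorization. A hedged sketch of that preparation step, assuming a tab-separated sentence-pair file, is shown below.

```python
# Hedged sketch of the data preparation implied by the [start]/[end] markers.
# The tab-separated file format and column order are assumptions.
import random

def make_pairs(path):
    pairs = []
    with open(path, encoding="utf-8") as fp:
        for line in fp:
            english, portuguese = line.rstrip("\n").split("\t")
            pairs.append((english, f"[start] {portuguese} [end]"))
    random.shuffle(pairs)
    return pairs
```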

## Limitations

In translating `What is its name?`, the `transformer` model produces `[start] qual é o nome dele [end]`, making a gender assumption (rendering "its" as "dele", i.e., "his") even though the source sentence is not gendered.
Errors like this are common in NLP, and algorithmic bias is one of the major problems associated with using language models
in real applications. In conclusion, we do not recommend using this model in real-world applications.
It was developed solely for academic and educational purposes.

## Cite as
```latex
@misc{teenytinycastle,
    doi = {10.5281/zenodo.7112065},
    url = {https://github.com/Nkluge-correa/teeny-tiny_castle},
    author = {Nicholas Kluge Corr{\^e}a},
    title = {Teeny-Tiny Castle},
    year = {2024},
    publisher = {GitHub},
    journal = {GitHub repository},
}
```
## License
The transformer-eng-por model is licensed under the Apache License, Version 2.0. See the LICENSE file for more details.