Commit 7727851 (parent: 957cebf) by KameliaZaman: Update README.md

README.md CHANGED

library_name: transformers
pipeline_tag: translation
---

<a name="readme-top"></a>

<div align="center">
  <img src="https://huggingface.co/KameliaZaman/French-to-English-Translation/resolve/main/assets/logo.jpg" alt="Logo" width="100" height="100">

  <h3 align="center">French to English Machine Translation</h3>

  <p align="center">
    French to English language translation using a sequence-to-sequence transformer.
    <br />
    <a href="https://huggingface.co/spaces/KameliaZaman/French-to-English-Translation">View Demo</a>
  </p>
</div>

<!-- TABLE OF CONTENTS -->
<details>
  <summary>Table of Contents</summary>
  <ol>
    <li>
      <a href="#about-the-project">About The Project</a>
      <ul>
        <li><a href="#built-with">Built With</a></li>
      </ul>
    </li>
    <li>
      <a href="#getting-started">Getting Started</a>
      <ul>
        <li><a href="#dependencies">Dependencies</a></li>
        <li><a href="#installation">Installation</a></li>
      </ul>
    </li>
    <li><a href="#usage">Usage</a></li>
    <li><a href="#contributing">Contributing</a></li>
    <li><a href="#license">License</a></li>
    <li><a href="#contact">Contact</a></li>
  </ol>
</details>

<!-- ABOUT THE PROJECT -->
## About The Project

<img src="https://huggingface.co/KameliaZaman/French-to-English-Translation/resolve/main/assets/About.png" alt="Logo" width="500" height="500">

This project aims to develop a machine translation system for translating French text into English. The system utilizes state-of-the-art neural network architectures and techniques in natural language processing (NLP) to accurately translate French sentences into their corresponding English equivalents.

<p align="right">(<a href="#readme-top">back to top</a>)</p>

### Built With

* [![Python][Python]][Python-url]
* [![TensorFlow][TensorFlow]][TensorFlow-url]
* [![Keras][Keras]][Keras-url]
* [![NumPy][NumPy]][NumPy-url]
* [![Pandas][Pandas]][Pandas-url]

<p align="right">(<a href="#readme-top">back to top</a>)</p>

<!-- GETTING STARTED -->
## Getting Started

Please follow these simple steps to set up this project locally.

### Dependencies

Here is a list of all the libraries, packages, and other dependencies that need to be installed to run this project, along with the commands to install them:

* TensorFlow 2.16.1
```sh
conda install -c conda-forge tensorflow
```
* Keras 2.15.0
```sh
conda install -c conda-forge keras
```
* Gradio 4.24.0
```sh
conda install -c conda-forge gradio
```
* NumPy 1.26.4
```sh
conda install -c conda-forge numpy
```

### Alternative: Export Environment

Alternatively, the full conda environment can be exported so that the project can be recreated with all of its dependencies:

```sh
conda env export > requirements.txt
```

Recreate it using:

```sh
conda env create -f requirements.txt
```

### Installation

```sh
# clone project
git clone https://huggingface.co/spaces/KameliaZaman/French-to-English-Translation/tree/main

# install dependencies
pip install -r requirements.txt

# run the app
python app.py
```

<p align="right">(<a href="#readme-top">back to top</a>)</p>

<!-- USAGE EXAMPLES -->
## Usage

#### Dataset

The dataset comes from [Kaggle](https://www.kaggle.com/datasets/devicharith/language-translation-englishfrench) and contains two columns: one holds English words/sentences and the other holds the corresponding French words/sentences.
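
As a quick sanity check, the CSV can be loaded with pandas and inspected before any training; the file name and column names below are the ones used by the deployment script later in this README, and the snippet assumes the file sits in the working directory.

```python
import pandas as pd

# Peek at the first rows of the parallel corpus
dataset = pd.read_csv("./eng_-french.csv", nrows=5)
print(dataset.columns.tolist())  # ['English words/sentences', 'French words/sentences']
print(dataset.head())
```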

#### Model Architecture

The model is an encoder-decoder Long Short-Term Memory (LSTM) network with an embedding layer, built as a neural machine translation system in which a sequence-to-sequence framework with an attention mechanism is applied.

<img src="https://huggingface.co/KameliaZaman/French-to-English-Translation/resolve/main/assets/arch.png" alt="Logo" width="500" height="500">

#### Data Preparation

- The parallel corpus containing French and English sentences is preprocessed.
- Text is tokenized and converted into numerical representations suitable for input to the neural network (see the sketch below).
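
To illustrate that encoding step, here is a minimal, self-contained sketch using the same Keras utilities the deployment script below relies on; the toy French sentences are placeholders rather than rows from the actual corpus.

```python
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Toy cleaned source sentences standing in for the French column of the corpus
french_lines = ["je suis etudiant", "il fait froid", "ou est la gare"]

# Fit a word-level tokenizer and turn each sentence into a list of integer ids
tokenizer = Tokenizer()
tokenizer.fit_on_texts(french_lines)
sequences = tokenizer.texts_to_sequences(french_lines)

# Zero-pad all sequences to the length of the longest sentence
max_length = max(len(line.split()) for line in french_lines)
padded = pad_sequences(sequences, maxlen=max_length, padding='post')
print(padded.shape)  # (3, 4): one row of integer ids per sentence
```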

#### Model Training

- The sequence-to-sequence model is constructed, comprising an encoder and decoder.
- Training data is fed into the model, and its parameters are optimized using backpropagation and gradient descent.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, Embedding, RepeatVector, TimeDistributed
from keras.callbacks import EarlyStopping

def create_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
    # Encoder-decoder LSTM: embed the source, encode it into a fixed vector,
    # repeat that vector for every target timestep, then decode word by word
    model = Sequential()
    model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
    model.add(LSTM(n_units))
    model.add(RepeatVector(tar_timesteps))
    model.add(LSTM(n_units, return_sequences=True))
    model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
    return model

model = create_model(src_vocab_size, tar_vocab_size, src_length, tar_length, 256)
model.compile(optimizer='adam', loss='categorical_crossentropy')

history = model.fit(trainX,
                    trainY,
                    epochs=20,
                    batch_size=64,
                    validation_split=0.1,
                    verbose=1,
                    callbacks=[
                        EarlyStopping(
                            monitor='val_loss',
                            patience=10,
                            restore_best_weights=True
                        )
                    ])
```

<img src="https://huggingface.co/KameliaZaman/French-to-English-Translation/resolve/main/assets/train_loss.png" alt="Logo" width="500" height="500">

#### Model Evaluation

- The trained model is evaluated on the test set to measure its accuracy.
- Metrics such as the BLEU score have been used to quantify the quality of the translations (see the sketch below).

<img src="https://huggingface.co/KameliaZaman/French-to-English-Translation/resolve/main/assets/train_acc.png" alt="Logo" width="500" height="500">
<img src="https://huggingface.co/KameliaZaman/French-to-English-Translation/resolve/main/assets/test_acc.png" alt="Logo" width="500" height="500">
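
The BLEU computation itself is not shown in this README, so the following is only a sketch of how it could be done with NLTK's `corpus_bleu` (already imported in the deployment script below). It assumes a held-out set of encoded French sequences `testX` and English reference sentences `test_references` (both hypothetical names), plus the `predict_seq` helper defined in the Deployment section.

```python
from nltk.translate.bleu_score import corpus_bleu

# Translate each held-out source sequence and collect tokenized hypotheses/references
references, hypotheses = [], []
for i, source in enumerate(testX):
    source = source.reshape((1, source.shape[0]))             # one padded sequence
    translation = predict_seq(model, tar_tokenizer, source)   # predicted English text
    hypotheses.append(translation.split())
    references.append([test_references[i].split()])           # list of reference token lists

print("BLEU-1: %.3f" % corpus_bleu(references, hypotheses, weights=(1.0, 0, 0, 0)))
print("BLEU-4: %.3f" % corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25)))
```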

#### Deployment

- Gradio is utilized for deploying the trained model.
- Users can input French text, and the model will translate it to English.

```python
import string
import re
from unicodedata import normalize
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential, load_model
from keras.layers import LSTM, Dense, Embedding, RepeatVector, TimeDistributed
from keras.callbacks import EarlyStopping
from nltk.translate.bleu_score import corpus_bleu
import pandas as pd
from string import punctuation
import matplotlib.pyplot as plt
from IPython.display import Markdown, display
import gradio as gr
import tensorflow as tf
from tensorflow.keras.models import load_model

total_sentences = 10000

dataset = pd.read_csv("./eng_-french.csv", nrows=total_sentences)

def clean(string):
    # Clean the string
    string = string.replace("\u202f", " ")  # Replace no-break space with space
    string = string.lower()

    # Delete the punctuation and the numbers
    for p in punctuation + "«»" + "0123456789":
        string = string.replace(p, " ")

    string = re.sub(r'\s+', ' ', string)
    string = string.strip()

    return string

dataset = dataset.sample(frac=1, random_state=0)
dataset["English words/sentences"] = dataset["English words/sentences"].apply(lambda x: clean(x))
dataset["French words/sentences"] = dataset["French words/sentences"].apply(lambda x: clean(x))

dataset = dataset.values
dataset = dataset[:total_sentences]

source_str, target_str = "French", "English"
idx_src, idx_tar = 1, 0

def create_tokenizer(lines):
    # fit a tokenizer
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(lines)
    return tokenizer

def max_len(lines):
    # max sentence length
    return max(len(line.split()) for line in lines)

def encode_sequences(tokenizer, length, lines):
    # encode and pad sequences
    X = tokenizer.texts_to_sequences(lines)  # integer encode sequences
    X = pad_sequences(X, maxlen=length, padding='post')  # pad sequences with 0 values
    return X

def word_for_id(integer, tokenizer):
    # map an integer to a word
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

def predict_seq(model, tokenizer, source):
    # generate target from a source sequence
    prediction = model.predict(source, verbose=0)[0]
    integers = [np.argmax(vector) for vector in prediction]
    target = list()
    for i in integers:
        word = word_for_id(i, tokenizer)
        if word is None:
            break
        target.append(word)
    return ' '.join(target)

src_tokenizer = create_tokenizer(dataset[:, idx_src])
src_vocab_size = len(src_tokenizer.word_index) + 1
src_length = max_len(dataset[:, idx_src])
tar_tokenizer = create_tokenizer(dataset[:, idx_tar])

model = load_model('./french_to_english_translator.h5')

def translate_french_english(french_sentence):
    # Clean the input sentence
    french_sentence = clean(french_sentence)
    # Tokenize and pad the input sentence
    input_sequence = encode_sequences(src_tokenizer, src_length, [french_sentence])
    # Generate the translation
    english_translation = predict_seq(model, tar_tokenizer, input_sequence)
    return english_translation

gr.Interface(
    fn=translate_french_english,
    inputs="text",
    outputs="text",
    title="French to English Translator",
    description="Translate French sentences to English."
).launch()
```
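
Outside the web UI, the same function that the Gradio interface wraps can be called directly once the script's definitions are loaded; the sentence below is only an illustrative input.

```python
# Call the translation function directly (without the Gradio UI)
print(translate_french_english("je suis etudiant"))
```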

<img src="https://huggingface.co/KameliaZaman/French-to-English-Translation/resolve/main/assets/About.png" alt="Logo" width="500" height="500">

<p align="right">(<a href="#readme-top">back to top</a>)</p>

<!-- CONTRIBUTING -->
## Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".
Don't forget to give the project a star! Thanks again!

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

<p align="right">(<a href="#readme-top">back to top</a>)</p>

<!-- LICENSE -->
## License

Distributed under the MIT License. See [MIT License](LICENSE) for more information.

<p align="right">(<a href="#readme-top">back to top</a>)</p>

<!-- CONTACT -->
## Contact

Kamelia Zaman Moon - kamelia.stu2017@juniv.edu

Project Link: [https://huggingface.co/spaces/KameliaZaman/French-to-English-Translation](https://huggingface.co/spaces/KameliaZaman/French-to-English-Translation/tree/main)

<p align="right">(<a href="#readme-top">back to top</a>)</p>

[Python]: https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54
[Python-url]: https://www.python.org/
[TensorFlow]: https://img.shields.io/badge/TensorFlow-%23FF6F00.svg?style=for-the-badge&logo=TensorFlow&logoColor=white
[TensorFlow-url]: https://tensorflow.org/
[Keras]: https://img.shields.io/badge/Keras-%23D00000.svg?style=for-the-badge&logo=Keras&logoColor=white
[Keras-url]: https://keras.io/
[NumPy]: https://img.shields.io/badge/numpy-%23013243.svg?style=for-the-badge&logo=numpy&logoColor=white
[NumPy-url]: https://numpy.org/
[Pandas]: https://img.shields.io/badge/pandas-%23150458.svg?style=for-the-badge&logo=pandas&logoColor=white
[Pandas-url]: https://pandas.pydata.org/