Commit 7727851 by KameliaZaman
1 parent: 957cebf

Update README.md

  1. README.md +303 -41
README.md CHANGED
@@ -8,59 +8,107 @@ metrics:
8
  library_name: transformers
9
  pipeline_tag: translation
10
  ---
 
11
 
12
- # French to English Machine Translation
 
13
 
14
- - **Author:** Kamelia Zaman Moon
15
- - **Project link:** https://huggingface.co/spaces/KameliaZaman/French-to-English-Translation
16
- - **Language(s):** Python
17
- - **License:** MIT
18
- - **Contact:** kamelia.stu2017@juniv.edu
19
 
20
- ## Table of Contents
21
- - [Introduction](#introduction)
22
- - [Model Architecture](#model-architecture)
23
- - [How-to Guide](#how-to-guide)
24
- - [License](#license)
25
- - [Contributors](#contributors)
26
 
27
- ## 1. Introduction
28
  This project aims to develop a machine translation system for translating French text into English. The system utilizes state-of-the-art neural network architectures and techniques in natural language processing (NLP) to accurately translate French sentences into their corresponding English equivalents.
29
 
30
- ## 2. Model Architecture
31
- The machine translation model employs a sequence-to-sequence architecture, specifically utilizing a recurrent neural network (RNN) with an attention mechanism. The model is trained on a parallel corpus consisting of aligned French and English sentences. Key components of the model include encoder and decoder networks, attention mechanism, and tokenization for text processing.
32
- ```
33
- ── eng_-french.csv - text dataset.
34
 
35
- ── french_to_english_translator.h5 - generated model.
36
 
37
- ── french_to_english_translation_using_seq2seq.ipynb - preprocesses input, trains, saves and evaluates the model.
 
 
 
 
38
 
39
- ── app.py - this module starts the app interface.
40
 
41
- ── README.md - readme file of this project.
 
42
 
43
- ── requirements.txt - list of required packages.
44
- ```
45
 
46
- ## 3. How-to Guide
47
- ### 3.1. Data Preparation
48
- - The parallel corpus containing French and English sentences is preprocessed.
49
- - Text is tokenized and converted into numerical representations suitable for input to the neural network.
50
 
51
- ### 3.2. Model Training
52
- - The sequence-to-sequence model is constructed, comprising an encoder and decoder.
53
- - Training data is fed into the model, and parameters are optimized using backpropagation and gradient descent algorithms.
54
 
55
- ### 3.3. Model Evaluation
56
- - The trained model is evaluated on the test set to measure its accuracy.
57
- - Metrics such as the BLEU score are used to quantify the quality of translations.
58
 
59
- ### 3.4. Deployment
60
- - Gradio is utilized for deploying the trained model.
61
- - Users enter French text, and the model translates it to English.
62
 
63
- ```bash
64
  # clone project
65
  git clone https://huggingface.co/spaces/KameliaZaman/French-to-English-Translation
66
 
@@ -74,8 +122,222 @@ pip install -r requirements.txt
74
  python app.py
75
  ```
76
 
77
- ## 4. License
78
- This project is licensed under the [MIT License](LICENSE).
79
 
80
- ## 5. Contributors
81
- - Kamelia Zaman Moon - kamelia.stu2017@juniv.edu
8
  library_name: transformers
9
  pipeline_tag: translation
10
  ---
11
+ <a name="readme-top"></a>
12
 
13
+ <div align="center">
14
+ <img src="https://huggingface.co/KameliaZaman/French-to-English-Translation/resolve/main/assets/logo.jpg" alt="Logo" width="100" height="100">
15
 
16
+ <h3 align="center">French to English Machine Translation</h3>
 
 
 
 
17
 
18
+ <p align="center">
19
+ French-to-English translation using a sequence-to-sequence LSTM model.
20
+ <br />
21
+ <a href="https://huggingface.co/spaces/KameliaZaman/French-to-English-Translation">View Demo</a>
22
+ </p>
23
+ </div>
24
+
25
+ <!-- TABLE OF CONTENTS -->
26
+ <details>
27
+ <summary>Table of Contents</summary>
28
+ <ol>
29
+ <li>
30
+ <a href="#about-the-project">About The Project</a>
31
+ <ul>
32
+ <li><a href="#built-with">Built With</a></li>
33
+ </ul>
34
+ </li>
35
+ <li>
36
+ <a href="#getting-started">Getting Started</a>
37
+ <ul>
38
+ <li><a href="#dependencies">Dependencies</a></li>
39
+ <li><a href="#installation">Installation</a></li>
40
+ </ul>
41
+ </li>
42
+ <li><a href="#usage">Usage</a></li>
43
+ <li><a href="#contributing">Contributing</a></li>
44
+ <li><a href="#license">License</a></li>
45
+ <li><a href="#contact">Contact</a></li>
46
+ </ol>
47
+ </details>
48
+
49
+ <!-- ABOUT THE PROJECT -->
50
+ ## About The Project
51
+
52
+ <img src="https://huggingface.co/KameliaZaman/French-to-English-Translation/resolve/main/assets/About.png" alt="Logo" width="500" height="500">
53
 
 
54
  This project aims to develop a machine translation system for translating French text into English. The system utilizes state-of-the-art neural network architectures and techniques in natural language processing (NLP) to accurately translate French sentences into their corresponding English equivalents.
55
 
56
+ <p align="right">(<a href="#readme-top">back to top</a>)</p>
 
 
 
57
 
58
+ ### Built With
59
 
60
+ * [![Python][Python]][Python-url]
61
+ * [![TensorFlow][TensorFlow]][TensorFlow-url]
62
+ * [![Keras][Keras]][Keras-url]
63
+ * [![NumPy][NumPy]][NumPy-url]
64
+ * [![Pandas][Pandas]][Pandas-url]
65
 
66
+ <p align="right">(<a href="#readme-top">back to top</a>)</p>
67
 
68
+ <!-- GETTING STARTED -->
69
+ ## Getting Started
70
 
71
+ Please follow these simple steps to set up this project locally.
 
72
 
73
+ ### Dependencies
 
 
 
74
 
75
+ Here is the list of all libraries, packages and other dependencies that need to be installed to run this project.
 
 
76
 
77
+ Each dependency can be installed with conda:
78
+ * TensorFlow 2.16.1
79
+ ```sh
80
+ conda install -c conda-forge tensorflow
81
+ ```
82
+ * Keras 2.15.0
83
+ ```sh
84
+ conda install -c conda-forge keras
85
+ ```
86
+ * Gradio 4.24.0
87
+ ```sh
88
+ conda install -c conda-forge gradio
89
+ ```
90
+ * NumPy 1.26.4
91
+ ```sh
92
+ conda install -c conda-forge numpy
93
+ ```
94
 
95
+ ### Alternative: Export Environment
96
+
97
+ Alternatively, export the full conda environment to a file so that it can be recreated with all dependencies in one step.
98
 
99
+ ```sh
100
+ conda env export > environment.yml
101
+ ```
102
+
103
+ Recreate it using:
104
+
105
+ ```sh
106
+ conda env create -f environment.yml
107
+ ```
108
+
109
+ ### Installation
110
+
111
+ ```sh
112
  # clone project
113
  git clone https://huggingface.co/spaces/KameliaZaman/French-to-English-Translation
114
 
 
122
  python app.py
123
  ```
124
 
125
+ <p align="right">(<a href="#readme-top">back to top</a>)</p>
126
+
127
+ <!-- USAGE EXAMPLES -->
128
+ ## Usage
129
+
130
+ #### Dataset
131
+
132
+ The dataset is from https://www.kaggle.com/datasets/devicharith/language-translation-englishfrench. It has two columns: one with English words/sentences and the other with the corresponding French words/sentences.
133
+
134
+ #### Model Architecture
135
+
136
+ The model is an encoder-decoder Long Short-Term Memory (LSTM) network with an embedding layer. It follows a neural machine translation design in which a sequence-to-sequence framework with attention mechanisms is applied.
137
+
138
+ <img src="https://huggingface.co/KameliaZaman/French-to-English-Translation/resolve/main/assets/arch.png" alt="Logo" width="500" height="500">
139
+
140
+ #### Data Preparation
141
+ - The parallel corpus containing French and English sentences is preprocessed.
142
+ - Text is tokenized and converted into numerical representations suitable for input to the neural network.
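The two preparation steps above can be illustrated with a minimal, dependency-free sketch. It is not the project's own preprocessing: `build_word_index` and `encode_and_pad` are hypothetical helpers, ids are assigned in order of first appearance (the Keras `Tokenizer` orders words by frequency instead), and the sample sentences are made up for illustration.

```python
def build_word_index(sentences):
    """Assign each distinct word an integer id starting at 1 (0 is reserved for padding)."""
    word_index = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index


def encode_and_pad(sentences, word_index, length):
    """Map words to ids, skip unknown words, and right-pad with zeros to a fixed length."""
    encoded = []
    for sentence in sentences:
        ids = [word_index[w] for w in sentence.split() if w in word_index]
        ids = ids[:length]  # simplification: overly long sentences are cut at the end
        encoded.append(ids + [0] * (length - len(ids)))
    return encoded


# Illustrative sentences only; they are not taken from the dataset.
corpus = ["je suis fatigué", "je suis ici"]
word_index = build_word_index(corpus)
max_length = max(len(s.split()) for s in corpus)
padded = encode_and_pad(corpus + ["je suis"], word_index, max_length)

print(word_index)  # {'je': 1, 'suis': 2, 'fatigué': 3, 'ici': 4}
print(padded)      # [[1, 2, 3], [1, 2, 4], [1, 2, 0]]
```

Reserving id 0 for padding is what lets an embedding layer with `mask_zero=True` ignore the padded positions.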
143
+
144
+ #### Model Training
145
+ - The sequence-to-sequence model is constructed, comprising an encoder and decoder.
146
+ - Training data is fed into the model, and parameters are optimized using backpropagation and gradient descent algorithms.
147
+
148
+ ```python
149
+ from keras.callbacks import EarlyStopping
+ from keras.layers import LSTM, Dense, Embedding, RepeatVector, TimeDistributed
+ from keras.models import Sequential
+
+ def create_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
+     # Encoder: embed the source tokens and compress the sentence into a single vector
+     model = Sequential()
+     model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
+     model.add(LSTM(n_units))
+     # Decoder: repeat the encoded vector once per target timestep and predict a word at each step
+     model.add(RepeatVector(tar_timesteps))
+     model.add(LSTM(n_units, return_sequences=True))
+     model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
+     return model
+
+ # src_vocab_size, tar_vocab_size, src_length, tar_length, trainX and trainY
+ # are assumed to be defined earlier, during data preparation
+ model = create_model(src_vocab_size, tar_vocab_size, src_length, tar_length, 256)
+ model.compile(optimizer='adam', loss='categorical_crossentropy')
+
+ # Stop early if the validation loss has not improved for 10 epochs
+ history = model.fit(
+     trainX,
+     trainY,
+     epochs=20,
+     batch_size=64,
+     validation_split=0.1,
+     verbose=1,
+     callbacks=[
+         EarlyStopping(
+             monitor='val_loss',
+             patience=10,
+             restore_best_weights=True
+         )
+     ])
175
+ ```
176
+
177
+ <img src="https://huggingface.co/KameliaZaman/French-to-English-Translation/resolve/main/assets/train_loss.png" alt="Logo" width="500" height="500">
178
+
179
+ #### Model Evaluation
180
+ - The trained model is evaluated on the test set to measure its accuracy.
181
+ - Metrics such as the BLEU score are used to quantify the quality of translations.
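To make the BLEU metric concrete, here is a simplified sentence-level sketch in plain Python. It is an illustration under stated assumptions, not the project's evaluation code (the project imports `corpus_bleu` from NLTK): it uses a single reference, whitespace tokenization, uniform weights over 1- to 4-grams, and no smoothing. The function names `ngram_counts` and `bleu` are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference, candidate, max_n=4):
    """Unsmoothed BLEU of one candidate sentence against one reference sentence."""
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, any empty n-gram order gives a score of 0
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty: penalize candidates shorter than the reference
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision_sum / max_n)


reference = "the cat sat on the mat"
print(bleu(reference, "the cat sat on the mat"))  # identical sentences score 1.0
print(bleu(reference, "the cat sat on a mat"))    # partial overlap scores between 0 and 1
```

Because no smoothing is applied, sentences shorter than four words always score 0 here; library implementations offer smoothing options for that case.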
182
+
183
+ <img src="https://huggingface.co/KameliaZaman/French-to-English-Translation/resolve/main/assets/train_acc.png" alt="Logo" width="500" height="500">
184
+ <img src="https://huggingface.co/KameliaZaman/French-to-English-Translation/resolve/main/assets/test_acc.png" alt="Logo" width="500" height="500">
185
+
186
+ #### Deployment
187
+ - Gradio is utilized for deploying the trained model.
188
+ - Users enter French text, and the model translates it to English.
189
+
190
+ ```python
191
+ import string
+ import re
+ from unicodedata import normalize
+ import numpy as np
+ from keras.preprocessing.text import Tokenizer
+ from keras.preprocessing.sequence import pad_sequences
+ from keras.utils import to_categorical
+ from keras.models import Sequential
+ from keras.layers import LSTM, Dense, Embedding, RepeatVector, TimeDistributed
+ from keras.callbacks import EarlyStopping
+ from nltk.translate.bleu_score import corpus_bleu
+ import pandas as pd
+ from string import punctuation
+ import matplotlib.pyplot as plt
+ from IPython.display import Markdown, display
+ import gradio as gr
+ import tensorflow as tf
+ from tensorflow.keras.models import load_model
+
+ total_sentences = 10000
+
+ dataset = pd.read_csv("./eng_-french.csv", nrows=total_sentences)
+
+ def clean(text):
+     # Normalize a sentence: lowercase it and strip punctuation, digits and extra whitespace
+     text = text.replace("\u202f", " ")  # Replace narrow no-break space with a regular space
+     text = text.lower()
+
+     # Replace punctuation and digits with spaces
+     for p in punctuation + "«»" + "0123456789":
+         text = text.replace(p, " ")
+
+     text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace
+     text = text.strip()
+
+     return text
+
+ dataset = dataset.sample(frac=1, random_state=0)
+ dataset["English words/sentences"] = dataset["English words/sentences"].apply(clean)
+ dataset["French words/sentences"] = dataset["French words/sentences"].apply(clean)
+
+ dataset = dataset.values
+ dataset = dataset[:total_sentences]
+
+ source_str, target_str = "French", "English"
+ idx_src, idx_tar = 1, 0  # column 1 is the source (French), column 0 is the target (English)
+
+ def create_tokenizer(lines):
+     # fit a tokenizer
+     tokenizer = Tokenizer()
+     tokenizer.fit_on_texts(lines)
+     return tokenizer
+
+ def max_len(lines):
+     # max sentence length
+     return max(len(line.split()) for line in lines)
+
+ def encode_sequences(tokenizer, length, lines):
+     # encode and pad sequences
+     X = tokenizer.texts_to_sequences(lines)  # integer encode sequences
+     X = pad_sequences(X, maxlen=length, padding='post')  # pad sequences with 0 values
+     return X
+
+ def word_for_id(integer, tokenizer):
+     # map an integer to a word
+     for word, index in tokenizer.word_index.items():
+         if index == integer:
+             return word
+     return None
+
+ def predict_seq(model, tokenizer, source):
+     # generate target from a source sequence
+     prediction = model.predict(source, verbose=0)[0]
+     integers = [np.argmax(vector) for vector in prediction]
+     target = list()
+     for i in integers:
+         word = word_for_id(i, tokenizer)
+         if word is None:  # id 0 (padding) has no word, so stop here
+             break
+         target.append(word)
+     return ' '.join(target)
+
+ # Rebuild the tokenizers from the same data so that word ids match those used in training
+ src_tokenizer = create_tokenizer(dataset[:, idx_src])
+ src_vocab_size = len(src_tokenizer.word_index) + 1
+ src_length = max_len(dataset[:, idx_src])
+ tar_tokenizer = create_tokenizer(dataset[:, idx_tar])
+
+ model = load_model('./french_to_english_translator.h5')
+
+ def translate_french_english(french_sentence):
+     # Clean the input sentence
+     french_sentence = clean(french_sentence)
+     # Tokenize and pad the input sentence
+     input_sequence = encode_sequences(src_tokenizer, src_length, [french_sentence])
+     # Generate the translation
+     english_translation = predict_seq(model, tar_tokenizer, input_sequence)
+     return english_translation
+
+ gr.Interface(
+     fn=translate_french_english,
+     inputs="text",
+     outputs="text",
+     title="French to English Translator",
+     description="Translate French sentences to English."
+ ).launch()
296
+ ```
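The decoding step performed by `predict_seq` can be shown in isolation with a small, dependency-free sketch. It assumes the model output is a list of per-timestep probability vectors over the target vocabulary; the `greedy_decode` helper, the toy `index_word` table and the probabilities are all hypothetical. Unlike `word_for_id`, which scans the tokenizer's `word_index` on every lookup, the sketch uses an inverted id-to-word dictionary.

```python
def greedy_decode(probabilities, index_word):
    """Pick the most probable word id at each timestep; stop at the first id
    that has no word (for example padding id 0)."""
    words = []
    for step in probabilities:
        best_id = max(range(len(step)), key=step.__getitem__)  # argmax over the vocabulary
        word = index_word.get(best_id)
        if word is None:
            break
        words.append(word)
    return " ".join(words)


# Toy vocabulary and model output, made up for illustration.
index_word = {1: "i", 2: "am", 3: "tired"}
probabilities = [
    [0.1, 0.7, 0.1, 0.1],     # id 1 -> "i"
    [0.1, 0.1, 0.6, 0.2],     # id 2 -> "am"
    [0.0, 0.1, 0.1, 0.8],     # id 3 -> "tired"
    [0.9, 0.05, 0.03, 0.02],  # id 0 -> padding, decoding stops
]
print(greedy_decode(probabilities, index_word))  # i am tired
```

Greedy decoding keeps only the single best word per step; it is simple and fast, but it cannot revise an earlier choice the way beam search can.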
297
+
298
+ <img src="https://huggingface.co/KameliaZaman/French-to-English-Translation/resolve/main/assets/About.png" alt="Logo" width="500" height="500">
299
+
300
+ <p align="right">(<a href="#readme-top">back to top</a>)</p>
301
+
302
+ <!-- CONTRIBUTING -->
303
+ ## Contributing
304
+
305
+ Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.
306
+
307
+ If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement".
308
+ Don't forget to give the project a star! Thanks again!
309
+
310
+ 1. Fork the Project
311
+ 2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
312
+ 3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
313
+ 4. Push to the Branch (`git push origin feature/AmazingFeature`)
314
+ 5. Open a Pull Request
315
+
316
+ <p align="right">(<a href="#readme-top">back to top</a>)</p>
317
+
318
+ <!-- LICENSE -->
319
+ ## License
320
+
321
+ Distributed under the MIT License. See [MIT License](LICENSE) for more information.
322
+
323
+ <p align="right">(<a href="#readme-top">back to top</a>)</p>
324
+
325
+ <!-- CONTACT -->
326
+ ## Contact
327
+
328
+ Kamelia Zaman Moon - kamelia.stu2017@juniv.edu
329
+
330
+ Project Link: [https://huggingface.co/spaces/KameliaZaman/French-to-English-Translation](https://huggingface.co/spaces/KameliaZaman/French-to-English-Translation/tree/main)
331
+
332
+ <p align="right">(<a href="#readme-top">back to top</a>)</p>
333
 
334
+ [Python]: https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54
335
+ [Python-url]: https://www.python.org/
336
+ [TensorFlow]: https://img.shields.io/badge/TensorFlow-%23FF6F00.svg?style=for-the-badge&logo=TensorFlow&logoColor=white
337
+ [TensorFlow-url]: https://tensorflow.org/
338
+ [Keras]: https://img.shields.io/badge/Keras-%23D00000.svg?style=for-the-badge&logo=Keras&logoColor=white
339
+ [Keras-url]: https://keras.io/
340
+ [NumPy]: https://img.shields.io/badge/numpy-%23013243.svg?style=for-the-badge&logo=numpy&logoColor=white
341
+ [NumPy-url]: https://numpy.org/
342
+ [Pandas]: https://img.shields.io/badge/pandas-%23150458.svg?style=for-the-badge&logo=pandas&logoColor=white
343
+ [Pandas-url]: https://pandas.pydata.org/