--- license: mit language: - en - fr metrics: - bleu library_name: transformers pipeline_tag: translation ---
Logo

French to English Machine Translation

French to English language translation using sequence to sequence transformer.
View Demo

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. License
  6. Contact
## About The Project Logo This project aims to develop a machine translation system for translating French text into English. The system utilizes state-of-the-art neural network architectures and techniques in natural language processing (NLP) to accurately translate French sentences into their corresponding English equivalents.

(back to top)

### Built With * [![Python][Python]][Python-url] * [![TensorFlow][TensorFlow]][TensorFlow-url] * [![Keras][Keras]][Keras-url] * [![NumPy][NumPy]][NumPy-url] * [![Pandas][Pandas]][Pandas-url]

(back to top)

## Getting Started Please follow these simple steps to setup this project locally. ### Dependencies Here are the list all libraries, packages and other dependencies that need to be installed to run this project. For example, this is how you would list them: * TensorFlow 2.16.1 ```sh conda install -c conda-forge tensorflow ``` * Keras 2.15.0 ```sh conda install -c conda-forge keras ``` * Gradio 4.24.0 ```sh conda install -c conda-forge gradio ``` * NumPy 1.26.4 ```sh conda install -c conda-forge numpy ``` ### Alternative: Export Environment Alternatively, clone the project repository, install it and have all dependencies needed. ```sh conda env export > requirements.txt ``` Recreate it using: ```sh conda env create -f requirements.txt ``` ### Installation ```sh # clone project git clone https://huggingface.co/spaces/KameliaZaman/French-to-English-Translation/tree/main # go inside the project directory cd French-to-English-Translation # install the required packages pip install -r requirements.txt # run the gradio app python app.py ```

(back to top)

## Usage #### Dataset Dataset is from "https://www.kaggle.com/datasets/devicharith/language-translation-englishfrench" which contains 2 columns where one column has english words/sentences and the other one has french words/sentence #### Model Architecture The model architecture consists of an Encoder-Decoder Long Short-Term Memory network with an embedding layer. It was built on a Neural Machine Translation architecture where sequence-to-sequence framework with attention mechanisms was applied. Logo #### Data Preparation - The parallel corpus containing French and English sentences is preprocessed. - Text is tokenized and converted into numerical representations suitable for input to the neural network. #### Model Training - The sequence-to-sequence model is constructed, comprising an encoder and decoder. - Training data is fed into the model, and parameters are optimized using backpropagation and gradient descent algorithms. ```sh def create_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units): # Create the model model = Sequential() model.add(Embedding(src_vocab_size, n_units, input_length=src_length, mask_zero=True)) model.add(LSTM(n_units)) model.add(RepeatVector(tar_timesteps)) model.add(LSTM(n_units, return_sequences=True)) model.add(TimeDistributed(Dense(tar_vocab, activation='softmax'))) return model model = create_model(src_vocab_size, tar_vocab_size, src_length, tar_length, 256) model.compile(optimizer='adam', loss='categorical_crossentropy') history = model.fit(trainX, trainY, epochs=20, batch_size=64, validation_split=0.1, verbose=1, callbacks=[ EarlyStopping( monitor='val_loss', patience=10, restore_best_weights=True ) ]) ``` Logo #### Model Evaluation - The trained model is evaluated on the test set to measure its accuracy. - Metrics such as BLEU score has been used to quantify the quality of translations. Logo Logo #### Deployment - Gradio is utilized for deploying the trained model. - Users can input a French text, and the model will translate it to English. ```sh import string import re from unicodedata import normalize import numpy as np from keras.preprocessing.text import Tokenizer from keras.preprocessing.sequence import pad_sequences from keras.utils import to_categorical from keras.models import Sequential,load_model from keras.layers import LSTM,Dense,Embedding,RepeatVector,TimeDistributed from keras.callbacks import EarlyStopping from nltk.translate.bleu_score import corpus_bleu import pandas as pd from string import punctuation import matplotlib.pyplot as plt from IPython.display import Markdown, display import gradio as gr import tensorflow as tf from tensorflow.keras.models import load_model total_sentences = 10000 dataset = pd.read_csv("./eng_-french.csv", nrows = total_sentences) def clean(string): # Clean the string string = string.replace("\u202f"," ") # Replace no-break space with space string = string.lower() # Delete the punctuation and the numbers for p in punctuation + "«»" + "0123456789": string = string.replace(p," ") string = re.sub('\s+',' ', string) string = string.strip() return string dataset = dataset.sample(frac=1, random_state=0) dataset["English words/sentences"] = dataset["English words/sentences"].apply(lambda x: clean(x)) dataset["French words/sentences"] = dataset["French words/sentences"].apply(lambda x: clean(x)) dataset = dataset.values dataset = dataset[:total_sentences] source_str, target_str = "French", "English" idx_src, idx_tar = 1, 0 def create_tokenizer(lines): # fit a tokenizer tokenizer = Tokenizer() tokenizer.fit_on_texts(lines) return tokenizer def max_len(lines): # max sentence length return max(len(line.split()) for line in lines) def encode_sequences(tokenizer, length, lines): # encode and pad sequences X = tokenizer.texts_to_sequences(lines) # integer encode sequences X = pad_sequences(X, maxlen=length, padding='post') # pad sequences with 0 values return X def word_for_id(integer, tokenizer): # map an integer to a word for word, index in tokenizer.word_index.items(): if index == integer: return word return None def predict_seq(model, tokenizer, source): # generate target from a source sequence prediction = model.predict(source, verbose=0)[0] integers = [np.argmax(vector) for vector in prediction] target = list() for i in integers: word = word_for_id(i, tokenizer) if word is None: break target.append(word) return ' '.join(target) src_tokenizer = create_tokenizer(dataset[:, idx_src]) src_vocab_size = len(src_tokenizer.word_index) + 1 src_length = max_len(dataset[:, idx_src]) tar_tokenizer = create_tokenizer(dataset[:, idx_tar]) model = load_model('./french_to_english_translator.h5') def translate_french_english(french_sentence): # Clean the input sentence french_sentence = clean(french_sentence) # Tokenize and pad the input sentence input_sequence = encode_sequences(src_tokenizer, src_length, [french_sentence]) # Generate the translation english_translation = predict_seq(model, tar_tokenizer, input_sequence) return english_translation gr.Interface( fn=translate_french_english, inputs="text", outputs="text", title="French to English Translator", description="Translate French sentences to English." ).launch() ``` Logo

(back to top)

## Contributing Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**. If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again! 1. Fork the Project 2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`) 3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`) 4. Push to the Branch (`git push origin feature/AmazingFeature`) 5. Open a Pull Request

(back to top)

## License Distributed under the MIT License. See [MIT License](LICENSE) for more information.

(back to top)

## Contact Kamelia Zaman Moon - kamelia.stu2017@juniv.edu Project Link: [https://huggingface.co/spaces/KameliaZaman/French-to-English-Translation](https://huggingface.co/spaces/KameliaZaman/French-to-English-Translation/tree/main)

(back to top)

[Python]: https://img.shields.io/badge/python-3670A0?style=for-the-badge&logo=python&logoColor=ffdd54 [Python-url]: https://www.python.org/ [TensorFlow]: https://img.shields.io/badge/TensorFlow-%23FF6F00.svg?style=for-the-badge&logo=TensorFlow&logoColor=white [TensorFlow-url]: https://tensorflow.org/ [Keras]: https://img.shields.io/badge/Keras-%23D00000.svg?style=for-the-badge&logo=Keras&logoColor=white [Keras-url]: https://keras.io/ [NumPy]: https://img.shields.io/badge/numpy-%23013243.svg?style=for-the-badge&logo=numpy&logoColor=white [NumPy-url]: https://numpy.org/ [Pandas]: https://img.shields.io/badge/pandas-%23150458.svg?style=for-the-badge&logo=pandas&logoColor=white [Pandas-url]: https://pandas.pydata.org/