---
license: mit
datasets:
- bentrevett/multi30k
language:
- en
library_name: transformers
pipeline_tag: translation
---

The translator app:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65b2665fee3f66b2b0f7b765/nVheCVJjZiCK3cvof6x84.png)

# Model Name

German to English Translator

# Model Description

This model translates German text into English. It is trained as a sequence-to-sequence Transformer (Seq2SeqTransformer).

- **Developed by:** Neelima Monjusha Preeti
- **Model type:** Seq2SeqTransformer
- **Language(s):** Python
- **License:** MIT
- **Contact:** monjusha.2017@juniv.edu

# Task Description

This app translates German into English. The input sentence is tokenized, passed through the encoder and decoder of the trained Seq2SeqTransformer, and the English translation is returned as output.

# Data Processing

First the source and target languages are defined and tokenization is set up. Tokenizers for German and English are initialized using spaCy: the get_tokenizer function from torchtext is used to obtain a spaCy tokenizer for each language. A function yield_tokens tokenizes sentences from the data iterator for both the source and target languages.

Special symbols and indices: indices are defined for unknown words (UNK_IDX), padding (PAD_IDX), beginning of sequence (BOS_IDX), and end of sequence (EOS_IDX), with the matching special symbols `['<unk>', '<pad>', '<bos>', '<eos>']`.

Then the vocabulary is built. For each language (source and target), the code iterates over the training data and builds a vocabulary using the build_vocab_from_iterator function, reusing the tokenization function defined earlier. The vocabulary is built with a minimum frequency of 1 (so all tokens are included) and the special symbols are inserted first. For each language's vocabulary, the default index for unknown tokens (UNK_IDX) is set.

```python
from typing import Iterable, List

from torchtext.data.utils import get_tokenizer
from torchtext.datasets import Multi30k
from torchtext.vocab import build_vocab_from_iterator

SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

token_transform = {}
vocab_transform = {}

token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')

def yield_tokens(data_iter: Iterable, language: str) -> List[str]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}
    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])

# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Training data iterator
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                    min_freq=1,
                                                    specials=special_symbols,
                                                    special_first=True)

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    vocab_transform[ln].set_default_index(UNK_IDX)
```

# Model Architecture

For machine translation a Seq2SeqTransformer is used. The class PositionalEncoding(nn.Module) adds positional encodings to the token embeddings, while the class TokenEmbedding(nn.Module) converts token indices into dense embeddings using an embedding layer.
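The card describes these two helper modules but does not include their code. Below is a minimal sketch of what they typically look like in this kind of Seq2SeqTransformer setup, following the standard PyTorch translation-tutorial pattern; the exact signatures (for example `maxlen`) are assumptions, not necessarily this repository's implementation.

```python
import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    """Add fixed sinusoidal positional encodings to the token embeddings."""
    def __init__(self, emb_size: int, dropout: float, maxlen: int = 5000):
        super().__init__()
        den = torch.exp(-torch.arange(0, emb_size, 2) * math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)  # shape: (maxlen, 1, emb_size)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: torch.Tensor) -> torch.Tensor:
        # Add the encoding for the first seq_len positions, then apply dropout.
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])


class TokenEmbedding(nn.Module):
    """Map token indices to dense embeddings, scaled by sqrt(emb_size)."""
    def __init__(self, vocab_size: int, emb_size: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
```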
The parameters defined and initialized for the model are:

- **num_encoder_layers:** number of layers in the encoder stack (3)
- **num_decoder_layers:** number of layers in the decoder stack (3)
- **emb_size:** dimensionality of the token embeddings (512)
- **nhead:** number of attention heads in the multi-head attention mechanism (512)
- **src_vocab_size:** vocabulary size of the source language
- **tgt_vocab_size:** vocabulary size of the target language
- **dim_feedforward:** dimensionality of the feed-forward network (default: 512)
- **dropout:** dropout probability (default: 0.1)

The loss function and optimizer are defined as:

```python
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)

optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
```

The input is then passed through the encoder and decoder layers. The helper functions and transform dictionary are:

```python
sequential_transforms(*transforms)
tensor_transform(token_ids: List[int])
collate_fn(batch)
text_transform = {}
```

These utility functions and transformations handle the preprocessing of the text data: tokenization, numericalization, adding special tokens, and collating samples into batch tensors suitable for training a sequence-to-sequence transformer model. The model is then trained as a Seq2SeqTransformer and evaluated with the evaluate(model) function.

# Result Analysis

greedy_decode() takes the following parameters:

- **model:** the sequence-to-sequence transformer model
- **src:** the source sequence tensor
- **src_mask:** the mask for the source sequence
- **max_len:** the maximum length of the output sequence
- **start_symbol:** the index of the start symbol in the target vocabulary

and returns the generated target sequence tensor ys, which contains the complete translation. A sketch of this decoding loop is shown after the test input below.

## Test input:

The function for translating German to English is translate().

```python
def translate(src_sentence: str):
    model = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                               NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)
    model.load_state_dict(torch.load('./transformer_model.pth'))
    model.to(DEVICE)
    model.eval()
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    tgt_tokens = greedy_decode(
        model, src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")
```

This function first loads the saved model, then applies the text transform to the input sentence and runs greedy_decode to obtain the translated output, which it returns as a string.
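greedy_decode itself is not reproduced in this card. The sketch below shows a typical greedy decoding loop consistent with the description above; it assumes tutorial-style model.encode, model.decode, and model.generator methods and a generate_square_subsequent_mask helper, none of which are shown here, so treat it as illustrative rather than this repository's exact code.

```python
import torch

def greedy_decode(model, src, src_mask, max_len, start_symbol):
    # DEVICE, EOS_IDX and generate_square_subsequent_mask are assumed to be
    # defined elsewhere in the training script, alongside the code shown above.
    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)
    # Encode the source sentence once.
    memory = model.encode(src, src_mask)
    # Start the target sequence with the start symbol (<bos>).
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for _ in range(max_len - 1):
        memory = memory.to(DEVICE)
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        # Pick the most probable next token.
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()
        ys = torch.cat([ys, torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys
```

At each step the most probable token is appended to the output sequence, and decoding stops when the end-of-sequence token is produced or max_len is reached.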
# Hugging Face Interface:

To create the interface, gradio and torch are imported, along with the Seq2SeqTransformer class and the translate and greedy_decode functions from germantoenglish.py.

```python
import gradio as gr
import torch
from germantoenglish import Seq2SeqTransformer, translate, greedy_decode
```

The app takes a German sentence as input and shows the translated English text as output.

```python
if __name__ == "__main__":
    iface = gr.Interface(
        fn=translate,
        inputs=[
            gr.components.Textbox(label="Text")
        ],
        outputs=["text"],
        cache_examples=False,
        title="GermanToEnglish",
    )
    iface.launch(share=True)
```

The app interface looks like this:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65b2665fee3f66b2b0f7b765/J_Q4eqXiN7cNuhOM3NbjR.png)

# Project Structure

```bash
|---Readme.md
|
|---germantoenglish.py - full code for processing, training, and evaluation
|
|---app.py - creates the app interface
|
|---Modeltensors - tensor file needed for loading the app
|
|---requirements.txt - packages and dataset that need to be installed for the app to work
|
|---translate_model.pth - the model file loaded by the app
```

# How to Run

```bash
git clone https://huggingface.co/spaces/neelimapreeti297/GermanToEnglish
cd GermanToEnglish
pip install -r requirements.txt
python app.py
```

# License

This project is licensed under the MIT License.

# Contributor

Neelima Monjusha Preeti - monjusha.stu2017@juniv.edu

App link: https://huggingface.co/spaces/neelimapreeti297/GermanToEnglish