# Sequence to Sequence Language Transliteration using RNNs and Transformers

This repository contains the files for the third assignment of the course CS6910 - Deep Learning at IIT Madras. The Transformer part was added later and was not part of the assignment.

It implements an encoder-decoder architecture with and without an attention mechanism, and later with Transformers, and uses these models to perform transliteration on the Aksharantar dataset (English-Hindi transliteration pairs) provided. The recurrent models were built using the RNN, LSTM and GRU cells provided by PyTorch. The Transformer architecture is built from scratch following the "Attention is All You Need" paper, using only the basic feed-forward and embedding layers from PyTorch.

Jump to Section: [Usage](#usage)

Report: [Report](https://wandb.ai/iitmadras/CS6910_Assignment_3/reports/CS6910-Assignment-3-Report--Vmlldzo0MzQyNDk5)

## Encoder

The encoder is a simple cell of either LSTM, RNN or GRU. The input to the encoder is a sequence of characters and the output is a sequence of hidden states. The hidden state of the last time step is used as the context vector for the decoder.

The encoder can also be a Transformer encoder with multiple layers, each containing a self-attention mechanism. The output generated by the encoder is fed to the Transformer decoder.

## Decoder

The decoder is again a simple cell of either LSTM, RNN or GRU. The input to the decoder is the hidden state of the encoder and the output of the previous time step. The output of the decoder is a sequence of characters. The decoder has an additional fully connected layer followed by a log softmax, which is used to predict the next character.

The decoder can also be a Transformer decoder with multiple layers, each containing masked self-attention and masked cross-attention mechanisms. The output generated by the encoder is fed as input to the Transformer decoder. The model predicts one character at a time, and this next-character prediction is repeated to generate the complete target sequence in Hindi.
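The next-character generation described above can be sketched as a greedy decoding loop. This is a minimal, framework-free illustration and not the repository's code: `step_fn`, the `SOS`/`EOS` token ids, `max_len` and the toy decoder are all hypothetical stand-ins. `step_fn` plays the role of one decoder step (cell, fully connected layer, log softmax).

```python
from typing import Callable, List, Sequence, Tuple

SOS, EOS = 0, 1  # hypothetical start/end-of-sequence token ids


def greedy_decode(
    step_fn: Callable[[int, object], Tuple[Sequence[float], object]],
    context: object,
    max_len: int = 30,
) -> List[int]:
    """Generate target token ids one character at a time.

    step_fn(prev_token, state) -> (log_probs over the target vocab, new_state)
    stands in for a single decoder step.
    """
    tokens: List[int] = []
    prev, state = SOS, context  # the decoder starts from the encoder's context
    for _ in range(max_len):
        log_probs, state = step_fn(prev, state)
        # Greedy choice: take the most likely next character.
        prev = max(range(len(log_probs)), key=lambda i: log_probs[i])
        if prev == EOS:
            break
        tokens.append(prev)
    return tokens


# Toy stand-in for a trained decoder: emits tokens 2, 3, 4 and then EOS.
def toy_step(prev_token, state):
    target = [2, 3, 4, EOS][state]
    log_probs = [-10.0] * 5
    log_probs[target] = 0.0
    return log_probs, state + 1


print(greedy_decode(toy_step, context=0))  # prints [2, 3, 4]
```

In a real model the returned ids would be mapped back to Hindi characters through the target vocabulary. Beam search is a common alternative to the greedy choice shown here.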
## Attention Mechanism

The attention mechanism is implemented as dot-product attention. The attention weights are obtained by taking a softmax over the dot products between the decoder hidden state and each encoder hidden state; the attention output is the sum of the encoder hidden states weighted by these values. This attention output is then concatenated with the decoder hidden state and passed through a fully connected layer to get the output of the decoder.

## Dataset

The dataset used is the Aksharantar dataset provided by the course. For each language in a subset of Indian languages, the dataset contains 3 files, namely `train.csv`, `valid.csv` and `test.csv`. I have used the Hindi dataset for this assignment. The dataset contains 2 columns, namely `English` and `Hindi`, which hold the input and output strings respectively.

## Used Python Libraries and Version

- Python 3.10.9
- PyTorch 1.13.1
- Pandas 1.5.3

## Usage

To run the training code for the standard encoder-decoder architecture using the best set of hyperparameters, run the following command:

```bash
python3 train.py
```

To run the training code for the encoder-decoder architecture with attention mechanism using the best set of hyperparameters, run the following command:

```bash
python3 train_attention.py
```

To run the inference code for the standard encoder-decoder architecture using the best set of hyperparameters, run the following command. (This uses the state dicts stored in the `best_models` folder and creates a file named `test_gen.txt` with the test predictions.)

```bash
python3 test_best_vanilla.py
```

To run the inference code for the encoder-decoder architecture with attention mechanism using the best set of hyperparameters, run the following command. (This uses the state dicts stored in the `best_models` folder and creates a file named `test_gen.txt` with the test predictions.)

```bash
python3 test_best_attention.py
```

To run with custom hyperparameters, first list the available options with the following command:

```bash
python3 train.py -h
```

```bash
# The output of the above command is as follows:
usage: train.py [-h] [-es EMBED_SIZE] [-hs HIDDEN_SIZE] [-ct CELL_TYPE] [-nl NUM_LAYERS] [-d DROPOUT] [-lr LEARNING_RATE] [-o OPTIMIZER] [-l LANGUAGE]

Transliteration Model

options:
  -h, --help            show this help message and exit
  -es EMBED_SIZE, --embed_size EMBED_SIZE
                        Embedding Size, good_choices = [8, 16, 32]
  -hs HIDDEN_SIZE, --hidden_size HIDDEN_SIZE
                        Hidden Size, good_choices = [128, 256, 512]
  -ct CELL_TYPE, --cell_type CELL_TYPE
                        Cell Type, choices: [LSTM, GRU, RNN]
  -nl NUM_LAYERS, --num_layers NUM_LAYERS
                        Number of Layers, choices: [1, 2, 3]
  -d DROPOUT, --dropout DROPOUT
                        Dropout, good_choices: [0, 0.1, 0.2]
  -lr LEARNING_RATE, --learning_rate LEARNING_RATE
                        Learning Rate, good_choices: [0.0005, 0.001, 0.005]
  -o OPTIMIZER, --optimizer OPTIMIZER
                        Optimizer, choices: [SGD, ADAM]
  -l LANGUAGE, --language LANGUAGE
                        Language
```

To run the training code for the attention mechanism with custom hyperparameters, list the available options with the following command:

```bash
python3 train_attention.py -h
```

```bash
usage: train_attention.py [-h] [-es EMBED_SIZE] [-hs HIDDEN_SIZE] [-ct CELL_TYPE] [-nl NUM_LAYERS] [-dr DROPOUT] [-lr LEARNING_RATE] [-op OPTIMIZER] [-wd WEIGHT_DECAY] [-l LANG]

Transliteration Model with Attention

options:
  -h, --help            show this help message and exit
  -es EMBED_SIZE, --embed_size EMBED_SIZE
                        Embedding size
  -hs HIDDEN_SIZE, --hidden_size HIDDEN_SIZE
                        Hidden size
  -ct CELL_TYPE, --cell_type CELL_TYPE
                        Cell type
  -nl NUM_LAYERS, --num_layers NUM_LAYERS
                        Number of layers
  -dr DROPOUT, --dropout DROPOUT
                        Dropout
  -lr LEARNING_RATE, --learning_rate LEARNING_RATE
                        Learning rate
  -op OPTIMIZER, --optimizer OPTIMIZER
                        Optimizer
  -wd WEIGHT_DECAY, --weight_decay WEIGHT_DECAY
                        Weight decay
  -l LANG, --lang LANG  Language
```
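To show how the flags listed in the `train.py` help text fit together, here is a hypothetical `argparse` parser reconstructed from that help output alone. The argument types and the absence of defaults are assumptions; the actual `train.py` may declare its options differently and presumably defaults to its best hyperparameters.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags shown in `train.py -h`."""
    parser = argparse.ArgumentParser(
        prog="train.py", description="Transliteration Model"
    )
    # Types below are assumptions inferred from the suggested values.
    parser.add_argument("-es", "--embed_size", type=int,
                        help="Embedding Size, good_choices = [8, 16, 32]")
    parser.add_argument("-hs", "--hidden_size", type=int,
                        help="Hidden Size, good_choices = [128, 256, 512]")
    parser.add_argument("-ct", "--cell_type", type=str,
                        help="Cell Type, choices: [LSTM, GRU, RNN]")
    parser.add_argument("-nl", "--num_layers", type=int,
                        help="Number of Layers, choices: [1, 2, 3]")
    parser.add_argument("-d", "--dropout", type=float,
                        help="Dropout, good_choices: [0, 0.1, 0.2]")
    parser.add_argument("-lr", "--learning_rate", type=float,
                        help="Learning Rate, good_choices: [0.0005, 0.001, 0.005]")
    parser.add_argument("-o", "--optimizer", type=str,
                        help="Optimizer, choices: [SGD, ADAM]")
    parser.add_argument("-l", "--language", type=str, help="Language")
    return parser


# Parse an example set of custom hyperparameters taken from the suggested values.
args = build_parser().parse_args(
    ["-es", "16", "-hs", "256", "-ct", "GRU", "-nl", "2",
     "-d", "0.1", "-lr", "0.001", "-o", "ADAM"]
)
print(args.cell_type, args.hidden_size, args.learning_rate)  # prints GRU 256 0.001
```

On the command line, the same choice of hyperparameters would be passed as `python3 train.py -es 16 -hs 256 -ct GRU -nl 2 -d 0.1 -lr 0.001 -o ADAM`. Note that `train_attention.py` uses `-dr` and `-op` instead of `-d` and `-o`, and adds `-wd` for weight decay.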