Spaces:
Runtime error
Runtime error
harveen
commited on
Commit
·
4192287
1
Parent(s):
87dca7d
Changing structure
Browse filesThis view is limited to 50 files because it contains too many changes.
See raw diff
- indicTrans/IndicTrans_training.ipynb → IndicTrans_training.ipynb +0 -0
- indicTrans/LICENSE → LICENSE +0 -0
- README.md +287 -28
- indicTrans/api.py → api.py +0 -0
- indicTrans/apply_bpe_traindevtest_notag.sh → apply_bpe_traindevtest_notag.sh +0 -0
- indicTrans/apply_single_bpe_traindevtest_notag.sh → apply_single_bpe_traindevtest_notag.sh +0 -0
- indicTrans/binarize_training_exp.sh → binarize_training_exp.sh +0 -0
- indicTrans/compute_bleu.sh → compute_bleu.sh +0 -0
- indicTrans/.gitignore +0 -143
- indicTrans/README.md +0 -296
- indicTrans/indicTrans_Finetuning.ipynb → indicTrans_Finetuning.ipynb +0 -0
- indicTrans/indicTrans_python_interface.ipynb → indicTrans_python_interface.ipynb +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/LICENSE +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/README.md +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/contrib/README.md +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/contrib/correct_moses_tokenizer.py +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/contrib/hindi_to_kannada_transliterator.py +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/contrib/indic_scraper_project_sample.ipynb +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/docs/Makefile +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/docs/cmd.rst +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/docs/code.rst +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/docs/conf.py +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/docs/index.rst +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.MD +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.cli.rst +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.morph.rst +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.normalize.rst +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.pdf +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.rst +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.script.rst +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.syllable.rst +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.tokenize.rst +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.transliterate.rst +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/docs/make.bat +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/docs/modules.rst +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/__init__.py +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/cli/__init__.py +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/cli/cliparser.py +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/common.py +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/langinfo.py +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/loader.py +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/morph/__init__.py +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/morph/unsupervised_morph.py +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/normalize/__init__.py +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/normalize/indic_normalize.py +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/script/__init__.py +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/script/english_script.py +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/script/indic_scripts.py +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/script/phonetic_sim.py +0 -0
- {indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/syllable/__init__.py +0 -0
indicTrans/IndicTrans_training.ipynb → IndicTrans_training.ipynb
RENAMED
File without changes
|
indicTrans/LICENSE → LICENSE
RENAMED
File without changes
|
README.md
CHANGED
@@ -1,37 +1,296 @@
|
|
1 |
-
|
2 |
-
|
3 |
-
|
4 |
-
|
5 |
-
|
6 |
-
|
7 |
-
app_file: app.py
|
8 |
-
pinned: false
|
9 |
-
---
|
10 |
|
11 |
-
|
12 |
|
13 |
-
|
14 |
-
|
|
|
|
|
|
|
15 |
|
16 |
-
`emoji`: _string_
|
17 |
-
Space emoji (emoji-only character allowed)
|
18 |
|
19 |
-
`colorFrom`: _string_
|
20 |
-
Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)
|
21 |
|
22 |
-
`colorTo`: _string_
|
23 |
-
Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)
|
24 |
|
25 |
-
|
26 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
27 |
|
28 |
-
`sdk_version` : _string_
|
29 |
-
Only applicable for `streamlit` SDK.
|
30 |
-
See [doc](https://hf.co/docs/hub/spaces) for more info on supported versions.
|
31 |
|
32 |
-
|
33 |
-
|
34 |
-
|
35 |
|
36 |
-
|
37 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
<div align="center">
|
2 |
+
<h1><b><i>IndicTrans</i></b></h1>
|
3 |
+
<a href="http://indicnlp.ai4bharat.org/samanantar">Website</a> |
|
4 |
+
<a href="https://arxiv.org/abs/2104.05596">Paper</a> |
|
5 |
+
<a href="https://youtu.be/QwYPOd1eBtQ?t=383">Video</a><br><br>
|
6 |
+
</div>
|
|
|
|
|
|
|
7 |
|
8 |
+
**IndicTrans** is a Transformer-4x ( ~434M ) multilingual NMT model trained on [Samanantar](https://indicnlp.ai4bharat.org/samanantar) dataset which is the largest publicly available parallel corpora collection for Indic languages at the time of writing ( 14 April 2021 ). It is a single script model i.e we convert all the Indic data to the Devanagari script which allows for ***better lexical sharing between languages for transfer learning, prevents fragmentation of the subword vocabulary between Indic languages and allows using a smaller subword vocabulary***. We currently release two models - Indic to English and English to Indic and support the following 11 indic languages:
|
9 |
|
10 |
+
| <!-- --> | <!-- --> | <!-- --> | <!-- --> |
|
11 |
+
| ------------- | -------------- | ------------ | ----------- |
|
12 |
+
| Assamese (as) | Hindi (hi) | Marathi (mr) | Tamil (ta) |
|
13 |
+
| Bengali (bn) | Kannada (kn) | Oriya (or) | Telugu (te) |
|
14 |
+
| Gujarati (gu) | Malayalam (ml) | Punjabi (pa) |
|
15 |
|
|
|
|
|
16 |
|
|
|
|
|
17 |
|
|
|
|
|
18 |
|
19 |
+
- [Updates](#updates)
|
20 |
+
- [Download IndicTrans models:](#download-indictrans-models)
|
21 |
+
- [Using the model for translating any input](#using-the-model-for-translating-any-input)
|
22 |
+
- [Finetuning the model on your input dataset](#finetuning-the-model-on-your-input-dataset)
|
23 |
+
- [Mining Indic to Indic pairs from english centric corpus](#mining-indic-to-indic-pairs-from-english-centric-corpus)
|
24 |
+
- [Installation](#installation)
|
25 |
+
- [How to train the indictrans model on your training data?](#how-to-train-the-indictrans-model-on-your-training-data)
|
26 |
+
- [Network & Training Details](#network--training-details)
|
27 |
+
- [Folder Structure](#folder-structure)
|
28 |
+
- [Citing](#citing)
|
29 |
+
- [License](#license)
|
30 |
+
- [Contributors](#contributors)
|
31 |
+
- [Contact](#contact)
|
32 |
|
|
|
|
|
|
|
33 |
|
34 |
+
## Updates
|
35 |
+
<details><summary>Click to expand </summary>
|
36 |
+
18 December 2021
|
37 |
|
38 |
+
```
|
39 |
+
Tutorials updated with latest model links
|
40 |
+
```
|
41 |
+
|
42 |
+
|
43 |
+
26 November 2021
|
44 |
+
```
|
45 |
+
- v0.3 models are now available for download
|
46 |
+
```
|
47 |
+
|
48 |
+
27 June 2021
|
49 |
+
```
|
50 |
+
- Updated links for indic to indic model
|
51 |
+
- Add more comments to training scripts
|
52 |
+
- Add link to [Samanantar Video](https://youtu.be/QwYPOd1eBtQ?t=383)
|
53 |
+
- Add folder structure in readme
|
54 |
+
- Add python wrapper for model inference
|
55 |
+
```
|
56 |
+
|
57 |
+
09 June 2021
|
58 |
+
```
|
59 |
+
- Updated links for models
|
60 |
+
- Added Indic to Indic model
|
61 |
+
```
|
62 |
+
|
63 |
+
09 May 2021
|
64 |
+
```
|
65 |
+
- Added fix for finetuning on datasets where some lang pairs are not present. Previously the script assumed the finetuning dataset will have data for all 11 indic lang pairs
|
66 |
+
- Added colab notebook for finetuning instructions
|
67 |
+
```
|
68 |
+
</details>
|
69 |
+
|
70 |
+
## Download IndicTrans models:
|
71 |
+
|
72 |
+
Indic to English: [v0.3](https://storage.googleapis.com/samanantar-public/V0.3/models/indic-en.zip)
|
73 |
+
|
74 |
+
English to Indic: [v0.3](https://storage.googleapis.com/samanantar-public/V0.3/models/en-indic.zip)
|
75 |
+
|
76 |
+
Indic to Indic: [v0.3](https://storage.googleapis.com/samanantar-public/V0.3/models/m2m.zip)
|
77 |
+
|
78 |
+
|
79 |
+
|
80 |
+
## Using the model for translating any input
|
81 |
+
|
82 |
+
The model is trained on single sentences and hence, users need to split parapgraphs to sentences before running the translation when using our command line interface (The python interface has `translate_paragraph` method to handle multi sentence translations).
|
83 |
+
|
84 |
+
Note: IndicTrans is trained with a max sequence length of **200** tokens (subwords). If your sentence is too long (> 200 tokens), the sentence will be truncated to 200 tokens before translation.
|
85 |
+
|
86 |
+
Here is an example snippet to split paragraphs into sentences for English and Indic languages supported by our model:
|
87 |
+
```python
|
88 |
+
# install these libraries
|
89 |
+
# pip install mosestokenizer
|
90 |
+
# pip install indic-nlp-library
|
91 |
+
|
92 |
+
from mosestokenizer import *
|
93 |
+
from indicnlp.tokenize import sentence_tokenize
|
94 |
+
|
95 |
+
INDIC = ["as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te"]
|
96 |
+
|
97 |
+
def split_sentences(paragraph, language):
|
98 |
+
if language == "en":
|
99 |
+
with MosesSentenceSplitter(language) as splitter:
|
100 |
+
return splitter([paragraph])
|
101 |
+
elif language in INDIC:
|
102 |
+
return sentence_tokenize.sentence_split(paragraph, lang=language)
|
103 |
+
|
104 |
+
split_sentences("""COVID-19 is caused by infection with the severe acute respiratory
|
105 |
+
syndrome coronavirus 2 (SARS-CoV-2) virus strain. The disease is mainly transmitted via the respiratory
|
106 |
+
route when people inhale droplets and particles that infected people release as they breathe, talk, cough, sneeze, or sing. """, language='en')
|
107 |
+
|
108 |
+
>> ['COVID-19 is caused by infection with the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus strain.',
|
109 |
+
'The disease is mainly transmitted via the respiratory route when people inhale droplets and particles that infected people release as they breathe, talk, cough, sneeze, or sing.']
|
110 |
+
|
111 |
+
split_sentences("""இத்தொற்றுநோய் உலகளாவிய சமூக மற்றும் பொருளாதார சீர்குலைவை ஏற்படுத்தியுள்ளது.இதனால் பெரும் பொருளாதார மந்தநிலைக்குப் பின்னர் உலகளவில் மிகப்பெரிய மந்தநிலை ஏற்பட்டுள்ளது. இது விளையாட்டு,மத, அரசியல் மற்றும் கலாச்சார நிகழ்வுகளை ஒத்திவைக்க அல்லது ரத்து செய்ய வழிவகுத்தது.
|
112 |
+
அச்சம் காரணமாக முகக்கவசம், கிருமிநாசினி உள்ளிட்ட பொருட்களை அதிக நபர்கள் வாங்கியதால் விநியோகப் பற்றாக்குறை ஏற்பட்டது.""",
|
113 |
+
language='ta')
|
114 |
+
|
115 |
+
>> ['இத்தொற்றுநோய் உலகளாவிய சமூக மற்றும் பொருளாதார சீர்குலைவை ஏற்படுத்தியுள்ளது.',
|
116 |
+
'இதனால் பெரும் பொருளாதார மந்தநிலைக்குப் பின்னர் உலகளவில் மிகப்பெரிய மந்தநிலை ஏற்பட்டுள்ளது.',
|
117 |
+
'இது விளையாட்டு,மத, அரசியல் மற்றும் கலாச்சார நிகழ்வுகளை ஒத்திவைக்க அல்லது ரத்து செய்ய வழிவகுத்தது.',
|
118 |
+
'அச்சம் காரணமாக முகக்கவசம், கிருமிநாசினி உள்ளிட்ட பொருட்களை அதிக நபர்கள் வாங்கியதால் விநியோகப் பற்றாக்குறை ஏற்பட்டது.']
|
119 |
+
|
120 |
+
|
121 |
+
```
|
122 |
+
|
123 |
+
Follow the colab notebook to setup the environment, download the trained _IndicTrans_ models and translating your own text.
|
124 |
+
|
125 |
+
Command line interface --> [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4Bharat/indicTrans/blob/main/indictrans_fairseq_inference.ipynb)
|
126 |
+
|
127 |
+
|
128 |
+
Python interface --> [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4Bharat/indicTrans/blob/main/indicTrans_python_interface.ipynb)
|
129 |
+
|
130 |
+
The python interface is useful in case you want to reuse the model for multiple translations and do not want to reinitialize the model each time
|
131 |
+
|
132 |
+
|
133 |
+
## Finetuning the model on your input dataset
|
134 |
+
|
135 |
+
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4Bharat/indicTrans/blob/main/indicTrans_Finetuning.ipynb)
|
136 |
+
|
137 |
+
The colab notebook can be used to setup the environment, download the trained _IndicTrans_ models and prepare your custom dataset for funetuning the indictrans model. There is also a section on mining indic to indic data from english centric corpus for finetuning indic to indic model.
|
138 |
+
|
139 |
+
**Note**: Since this is a big model (400M params), you might not be able to train with reasonable batch sizes in the free google Colab account. We are planning to release smaller models (after pruning / distallation) soon.
|
140 |
+
|
141 |
+
## Mining Indic to Indic pairs from english centric corpus
|
142 |
+
|
143 |
+
The `extract_non_english_pairs` in `scripts/extract_non_english_pairs.py` can be used to mine indic to indic pairs from english centric corpus.
|
144 |
+
|
145 |
+
As described in the [paper](https://arxiv.org/pdf/2104.05596.pdf) (section 2.5) , we use a very strict deduplication criterion to avoid the creation of very similar parallel sentences. For example, if an en sentence is aligned to *M* hi sentences and *N* ta sentences, then we would get *MN* hi-ta pairs. However, these pairs would be very similar and not contribute much to the training process. Hence, we retain only 1 randomly chosen pair out of these *MN* pairs.
|
146 |
+
|
147 |
+
```bash
|
148 |
+
extract_non_english_pairs(indir, outdir, LANGS):
|
149 |
+
"""
|
150 |
+
Extracts non-english pair parallel corpora
|
151 |
+
indir: contains english centric data in the following form:
|
152 |
+
- directory named en-xx for language xx
|
153 |
+
- each directory contains a train.en and train.xx
|
154 |
+
outdir: output directory to store mined data for each pair.
|
155 |
+
One directory is created for each pair.
|
156 |
+
LANGS: list of languages in the corpus (other than English).
|
157 |
+
The language codes must correspond to the ones used in the
|
158 |
+
files and directories in indir. Prefarably, sort the languages
|
159 |
+
in this list in alphabetic order. outdir will contain data for xx-yy,
|
160 |
+
but not for yy-xx, so it will be convenient to have this list in sorted order.
|
161 |
+
"""
|
162 |
+
```
|
163 |
+
|
164 |
+
## Installation
|
165 |
+
<details><summary>Click to expand </summary>
|
166 |
+
|
167 |
+
```bash
|
168 |
+
cd indicTrans
|
169 |
+
git clone https://github.com/anoopkunchukuttan/indic_nlp_library.git
|
170 |
+
git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
|
171 |
+
git clone https://github.com/rsennrich/subword-nmt.git
|
172 |
+
# install required libraries
|
173 |
+
pip install sacremoses pandas mock sacrebleu tensorboardX pyarrow indic-nlp-library
|
174 |
+
|
175 |
+
# Install fairseq from source
|
176 |
+
git clone https://github.com/pytorch/fairseq.git
|
177 |
+
cd fairseq
|
178 |
+
pip install --editable ./
|
179 |
+
|
180 |
+
```
|
181 |
+
</details>
|
182 |
+
|
183 |
+
## How to train the indictrans model on your training data?
|
184 |
+
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4Bharat/indicTrans/blob/main/IndicTrans_training.ipynb)
|
185 |
+
|
186 |
+
|
187 |
+
Follow the colab notebook to setup the environment, download the dataset and train the indicTrans model
|
188 |
+
|
189 |
+
## Network & Training Details
|
190 |
+
|
191 |
+
- Architechture: IndicTrans uses 6 encoder and decoder layers, input embeddings of size 1536 with 16 attention heads and
|
192 |
+
feedforward dimension of 4096 with total number of parameters of 434M
|
193 |
+
- Loss: Cross entropy loss
|
194 |
+
- Optimizer: Adam
|
195 |
+
- Label Smoothing: 0.1
|
196 |
+
- Gradient clipping: 1.0
|
197 |
+
- Learning rate: 5e-4
|
198 |
+
- Warmup_steps: 4000
|
199 |
+
|
200 |
+
Please refer to section 4, 5 of our [paper](https://arxiv.org/ftp/arxiv/papers/2104/2104.05596.pdf) for more details on training/experimental setup.
|
201 |
+
|
202 |
+
## Folder Structure
|
203 |
+
```
|
204 |
+
|
205 |
+
IndicTrans
|
206 |
+
│ .gitignore
|
207 |
+
│ apply_bpe_traindevtest_notag.sh # apply bpe for joint vocab (Train, dev and test)
|
208 |
+
│ apply_single_bpe_traindevtest_notag.sh # apply bpe for seperate vocab (Train, dev and test)
|
209 |
+
│ binarize_training_exp.sh # binarize the training data after preprocessing for fairseq-training
|
210 |
+
│ compute_bleu.sh # Compute blue scores with postprocessing after translating with `joint_translate.sh`
|
211 |
+
│ indictrans_fairseq_inference.ipynb # colab example to show how to use model for inference
|
212 |
+
│ indicTrans_Finetuning.ipynb # colab example to show how to use model for finetuning on custom domain data
|
213 |
+
│ joint_translate.sh # used for inference (see colab inference notebook for more details on usage)
|
214 |
+
│ learn_bpe.sh # learning joint bpe on preprocessed text
|
215 |
+
│ learn_single_bpe.sh # learning seperate bpe on preprocessed text
|
216 |
+
│ LICENSE
|
217 |
+
│ prepare_data.sh # prepare data given an experiment dir (this does preprocessing,
|
218 |
+
│ # building vocab, binarization ) for bilingual training
|
219 |
+
│ prepare_data_joint_training.sh # prepare data given an experiment dir (this does preprocessing,
|
220 |
+
│ # building vocab, binarization ) for joint training
|
221 |
+
│ README.md
|
222 |
+
│
|
223 |
+
├───legacy # old unused scripts
|
224 |
+
├───model_configs # custom model configrations are stored here
|
225 |
+
│ custom_transformer.py # contains custom 4x transformer models
|
226 |
+
│ __init__.py
|
227 |
+
├───inference
|
228 |
+
│ custom_interactive.py # for python wrapper around fairseq-interactive
|
229 |
+
│ engine.py # python interface for model inference
|
230 |
+
└───scripts # stores python scripts that are used by other bash scripts
|
231 |
+
│ add_joint_tags_translate.py # add lang tags to the processed training data for bilingual training
|
232 |
+
│ add_tags_translate.py # add lang tags to the processed training data for joint training
|
233 |
+
│ clean_vocab.py # clean vocabulary after building with subword_nmt
|
234 |
+
│ concat_joint_data.py # concatenates lang pair data and creates text files to keep track
|
235 |
+
│ # of number of lines in each lang pair.
|
236 |
+
│ extract_non_english_pairs.py # Mining Indic to Indic pairs from english centric corpus
|
237 |
+
│ postprocess_translate.py # Postprocesses translations
|
238 |
+
│ preprocess_translate.py # Preprocess translations and for script conversion (from indic to devnagiri)
|
239 |
+
│ remove_large_sentences.py # to remove large sentences from training data
|
240 |
+
└───remove_train_devtest_overlaps.py # Finds and removes overlaped data of train with dev and test sets
|
241 |
+
```
|
242 |
+
|
243 |
+
|
244 |
+
## Citing
|
245 |
+
|
246 |
+
If you are using any of the resources, please cite the following article:
|
247 |
+
```
|
248 |
+
@misc{ramesh2021samanantar,
|
249 |
+
title={Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
|
250 |
+
author={Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Kumar Deepak and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
|
251 |
+
year={2021},
|
252 |
+
eprint={2104.05596},
|
253 |
+
archivePrefix={arXiv},
|
254 |
+
primaryClass={cs.CL}
|
255 |
+
}
|
256 |
+
```
|
257 |
+
|
258 |
+
We would like to hear from you if:
|
259 |
+
|
260 |
+
- You are using our resources. Please let us know how you are putting these resources to use.
|
261 |
+
- You have any feedback on these resources.
|
262 |
+
|
263 |
+
|
264 |
+
|
265 |
+
### License
|
266 |
+
|
267 |
+
The IndicTrans code (and models) are released under the MIT License.
|
268 |
+
|
269 |
+
|
270 |
+
### Contributors
|
271 |
+
|
272 |
+
- Gowtham Ramesh, <sub>([RBCDSAI](https://rbcdsai.iitm.ac.in), [IITM](https://www.iitm.ac.in))</sub>
|
273 |
+
- Sumanth Doddapaneni, <sub>([RBCDSAI](https://rbcdsai.iitm.ac.in), [IITM](https://www.iitm.ac.in))</sub>
|
274 |
+
- Aravinth Bheemaraj, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
|
275 |
+
- Mayank Jobanputra, <sub>([IITM](https://www.iitm.ac.in))</sub>
|
276 |
+
- Raghavan AK, <sub>([AI4Bharat](https://ai4bharat.org))</sub>
|
277 |
+
- Ajitesh Sharma, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
|
278 |
+
- Sujit Sahoo, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
|
279 |
+
- Harshita Diddee, <sub>([AI4Bharat](https://ai4bharat.org))</sub>
|
280 |
+
- Mahalakshmi J, <sub>([AI4Bharat](https://ai4bharat.org))</sub>
|
281 |
+
- Divyanshu Kakwani, <sub>([IITM](https://www.iitm.ac.in), [AI4Bharat](https://ai4bharat.org))</sub>
|
282 |
+
- Navneet Kumar, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
|
283 |
+
- Aswin Pradeep, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
|
284 |
+
- Kumar Deepak, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
|
285 |
+
- Vivek Raghavan, <sub>([EkStep](https://ekstep.in))</sub>
|
286 |
+
- Anoop Kunchukuttan, <sub>([Microsoft](https://www.microsoft.com/en-in/), [AI4Bharat](https://ai4bharat.org))</sub>
|
287 |
+
- Pratyush Kumar, <sub>([RBCDSAI](https://rbcdsai.iitm.ac.in), [AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in))</sub>
|
288 |
+
- Mitesh Shantadevi Khapra, <sub>([RBCDSAI](https://rbcdsai.iitm.ac.in), [AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in))</sub>
|
289 |
+
|
290 |
+
|
291 |
+
|
292 |
+
### Contact
|
293 |
+
|
294 |
+
- Anoop Kunchukuttan ([anoop.kunchukuttan@gmail.com](mailto:anoop.kunchukuttan@gmail.com))
|
295 |
+
- Mitesh Khapra ([miteshk@cse.iitm.ac.in](mailto:miteshk@cse.iitm.ac.in))
|
296 |
+
- Pratyush Kumar ([pratyush@cse.iitm.ac.in](mailto:pratyush@cse.iitm.ac.in))
|
indicTrans/api.py → api.py
RENAMED
File without changes
|
indicTrans/apply_bpe_traindevtest_notag.sh → apply_bpe_traindevtest_notag.sh
RENAMED
File without changes
|
indicTrans/apply_single_bpe_traindevtest_notag.sh → apply_single_bpe_traindevtest_notag.sh
RENAMED
File without changes
|
indicTrans/binarize_training_exp.sh → binarize_training_exp.sh
RENAMED
File without changes
|
indicTrans/compute_bleu.sh → compute_bleu.sh
RENAMED
File without changes
|
indicTrans/.gitignore
DELETED
@@ -1,143 +0,0 @@
|
|
1 |
-
#ignore libs folder we use
|
2 |
-
indic_nlp_library
|
3 |
-
indic_nlp_resources
|
4 |
-
subword-nmt
|
5 |
-
|
6 |
-
# Byte-compiled / optimized / DLL files
|
7 |
-
__pycache__/
|
8 |
-
*.py[cod]
|
9 |
-
*$py.class
|
10 |
-
|
11 |
-
# C extensions
|
12 |
-
*.so
|
13 |
-
|
14 |
-
# Distribution / packaging
|
15 |
-
.Python
|
16 |
-
build/
|
17 |
-
develop-eggs/
|
18 |
-
dist/
|
19 |
-
downloads/
|
20 |
-
eggs/
|
21 |
-
.eggs/
|
22 |
-
lib/
|
23 |
-
lib64/
|
24 |
-
parts/
|
25 |
-
sdist/
|
26 |
-
var/
|
27 |
-
wheels/
|
28 |
-
share/python-wheels/
|
29 |
-
*.egg-info/
|
30 |
-
.installed.cfg
|
31 |
-
*.egg
|
32 |
-
MANIFEST
|
33 |
-
|
34 |
-
# PyInstaller
|
35 |
-
# Usually these files are written by a python script from a template
|
36 |
-
# before PyInstaller builds the exe, so as to inject date/other infos into it.
|
37 |
-
*.manifest
|
38 |
-
*.spec
|
39 |
-
|
40 |
-
# Installer logs
|
41 |
-
pip-log.txt
|
42 |
-
pip-delete-this-directory.txt
|
43 |
-
|
44 |
-
# Unit test / coverage reports
|
45 |
-
htmlcov/
|
46 |
-
.tox/
|
47 |
-
.nox/
|
48 |
-
.coverage
|
49 |
-
.coverage.*
|
50 |
-
.cache
|
51 |
-
nosetests.xml
|
52 |
-
coverage.xml
|
53 |
-
*.cover
|
54 |
-
*.py,cover
|
55 |
-
.hypothesis/
|
56 |
-
.pytest_cache/
|
57 |
-
cover/
|
58 |
-
|
59 |
-
# Translations
|
60 |
-
*.mo
|
61 |
-
*.pot
|
62 |
-
|
63 |
-
# Django stuff:
|
64 |
-
*.log
|
65 |
-
local_settings.py
|
66 |
-
db.sqlite3
|
67 |
-
db.sqlite3-journal
|
68 |
-
|
69 |
-
# Flask stuff:
|
70 |
-
instance/
|
71 |
-
.webassets-cache
|
72 |
-
|
73 |
-
# Scrapy stuff:
|
74 |
-
.scrapy
|
75 |
-
|
76 |
-
# Sphinx documentation
|
77 |
-
docs/_build/
|
78 |
-
|
79 |
-
# PyBuilder
|
80 |
-
.pybuilder/
|
81 |
-
target/
|
82 |
-
|
83 |
-
# Jupyter Notebook
|
84 |
-
.ipynb_checkpoints
|
85 |
-
|
86 |
-
# IPython
|
87 |
-
profile_default/
|
88 |
-
ipython_config.py
|
89 |
-
|
90 |
-
# pyenv
|
91 |
-
# For a library or package, you might want to ignore these files since the code is
|
92 |
-
# intended to run in multiple environments; otherwise, check them in:
|
93 |
-
# .python-version
|
94 |
-
|
95 |
-
# pipenv
|
96 |
-
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
|
97 |
-
# However, in case of collaboration, if having platform-specific dependencies or dependencies
|
98 |
-
# having no cross-platform support, pipenv may install dependencies that don't work, or not
|
99 |
-
# install all needed dependencies.
|
100 |
-
#Pipfile.lock
|
101 |
-
|
102 |
-
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
|
103 |
-
__pypackages__/
|
104 |
-
|
105 |
-
# Celery stuff
|
106 |
-
celerybeat-schedule
|
107 |
-
celerybeat.pid
|
108 |
-
|
109 |
-
# SageMath parsed files
|
110 |
-
*.sage.py
|
111 |
-
|
112 |
-
# Environments
|
113 |
-
.env
|
114 |
-
.venv
|
115 |
-
env/
|
116 |
-
venv/
|
117 |
-
ENV/
|
118 |
-
env.bak/
|
119 |
-
venv.bak/
|
120 |
-
|
121 |
-
# Spyder project settings
|
122 |
-
.spyderproject
|
123 |
-
.spyproject
|
124 |
-
|
125 |
-
# Rope project settings
|
126 |
-
.ropeproject
|
127 |
-
|
128 |
-
# mkdocs documentation
|
129 |
-
/site
|
130 |
-
|
131 |
-
# mypy
|
132 |
-
.mypy_cache/
|
133 |
-
.dmypy.json
|
134 |
-
dmypy.json
|
135 |
-
|
136 |
-
# Pyre type checker
|
137 |
-
.pyre/
|
138 |
-
|
139 |
-
# pytype static type analyzer
|
140 |
-
.pytype/
|
141 |
-
|
142 |
-
# Cython debug symbols
|
143 |
-
cython_debug/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
indicTrans/README.md
DELETED
@@ -1,296 +0,0 @@
|
|
1 |
-
<div align="center">
|
2 |
-
<h1><b><i>IndicTrans</i></b></h1>
|
3 |
-
<a href="http://indicnlp.ai4bharat.org/samanantar">Website</a> |
|
4 |
-
<a href="https://arxiv.org/abs/2104.05596">Paper</a> |
|
5 |
-
<a href="https://youtu.be/QwYPOd1eBtQ?t=383">Video</a><br><br>
|
6 |
-
</div>
|
7 |
-
|
8 |
-
**IndicTrans** is a Transformer-4x ( ~434M ) multilingual NMT model trained on [Samanantar](https://indicnlp.ai4bharat.org/samanantar) dataset which is the largest publicly available parallel corpora collection for Indic languages at the time of writing ( 14 April 2021 ). It is a single script model i.e we convert all the Indic data to the Devanagari script which allows for ***better lexical sharing between languages for transfer learning, prevents fragmentation of the subword vocabulary between Indic languages and allows using a smaller subword vocabulary***. We currently release two models - Indic to English and English to Indic and support the following 11 indic languages:
|
9 |
-
|
10 |
-
| <!-- --> | <!-- --> | <!-- --> | <!-- --> |
|
11 |
-
| ------------- | -------------- | ------------ | ----------- |
|
12 |
-
| Assamese (as) | Hindi (hi) | Marathi (mr) | Tamil (ta) |
|
13 |
-
| Bengali (bn) | Kannada (kn) | Oriya (or) | Telugu (te) |
|
14 |
-
| Gujarati (gu) | Malayalam (ml) | Punjabi (pa) |
|
15 |
-
|
16 |
-
|
17 |
-
|
18 |
-
|
19 |
-
- [Updates](#updates)
|
20 |
-
- [Download IndicTrans models:](#download-indictrans-models)
|
21 |
-
- [Using the model for translating any input](#using-the-model-for-translating-any-input)
|
22 |
-
- [Finetuning the model on your input dataset](#finetuning-the-model-on-your-input-dataset)
|
23 |
-
- [Mining Indic to Indic pairs from english centric corpus](#mining-indic-to-indic-pairs-from-english-centric-corpus)
|
24 |
-
- [Installation](#installation)
|
25 |
-
- [How to train the indictrans model on your training data?](#how-to-train-the-indictrans-model-on-your-training-data)
|
26 |
-
- [Network & Training Details](#network--training-details)
|
27 |
-
- [Folder Structure](#folder-structure)
|
28 |
-
- [Citing](#citing)
|
29 |
-
- [License](#license)
|
30 |
-
- [Contributors](#contributors)
|
31 |
-
- [Contact](#contact)
|
32 |
-
|
33 |
-
|
34 |
-
## Updates
|
35 |
-
<details><summary>Click to expand </summary>
|
36 |
-
18 December 2021
|
37 |
-
|
38 |
-
```
|
39 |
-
Tutorials updated with latest model links
|
40 |
-
```
|
41 |
-
|
42 |
-
|
43 |
-
26 November 2021
|
44 |
-
```
|
45 |
-
- v0.3 models are now available for download
|
46 |
-
```
|
47 |
-
|
48 |
-
27 June 2021
|
49 |
-
```
|
50 |
-
- Updated links for indic to indic model
|
51 |
-
- Add more comments to training scripts
|
52 |
-
- Add link to [Samanantar Video](https://youtu.be/QwYPOd1eBtQ?t=383)
|
53 |
-
- Add folder structure in readme
|
54 |
-
- Add python wrapper for model inference
|
55 |
-
```
|
56 |
-
|
57 |
-
09 June 2021
|
58 |
-
```
|
59 |
-
- Updated links for models
|
60 |
-
- Added Indic to Indic model
|
61 |
-
```
|
62 |
-
|
63 |
-
09 May 2021
|
64 |
-
```
|
65 |
-
- Added fix for finetuning on datasets where some lang pairs are not present. Previously the script assumed the finetuning dataset will have data for all 11 indic lang pairs
|
66 |
-
- Added colab notebook for finetuning instructions
|
67 |
-
```
|
68 |
-
</details>
|
69 |
-
|
70 |
-
## Download IndicTrans models:
|
71 |
-
|
72 |
-
Indic to English: [v0.3](https://storage.googleapis.com/samanantar-public/V0.3/models/indic-en.zip)
|
73 |
-
|
74 |
-
English to Indic: [v0.3](https://storage.googleapis.com/samanantar-public/V0.3/models/en-indic.zip)
|
75 |
-
|
76 |
-
Indic to Indic: [v0.3](https://storage.googleapis.com/samanantar-public/V0.3/models/m2m.zip)
|
77 |
-
|
78 |
-
|
79 |
-
|
80 |
-
## Using the model for translating any input
|
81 |
-
|
82 |
-
The model is trained on single sentences and hence, users need to split parapgraphs to sentences before running the translation when using our command line interface (The python interface has `translate_paragraph` method to handle multi sentence translations).
|
83 |
-
|
84 |
-
Note: IndicTrans is trained with a max sequence length of **200** tokens (subwords). If your sentence is too long (> 200 tokens), the sentence will be truncated to 200 tokens before translation.
|
85 |
-
|
86 |
-
Here is an example snippet to split paragraphs into sentences for English and Indic languages supported by our model:
|
87 |
-
```python
|
88 |
-
# install these libraries
|
89 |
-
# pip install mosestokenizer
|
90 |
-
# pip install indic-nlp-library
|
91 |
-
|
92 |
-
from mosestokenizer import *
|
93 |
-
from indicnlp.tokenize import sentence_tokenize
|
94 |
-
|
95 |
-
INDIC = ["as", "bn", "gu", "hi", "kn", "ml", "mr", "or", "pa", "ta", "te"]
|
96 |
-
|
97 |
-
def split_sentences(paragraph, language):
|
98 |
-
if language == "en":
|
99 |
-
with MosesSentenceSplitter(language) as splitter:
|
100 |
-
return splitter([paragraph])
|
101 |
-
elif language in INDIC:
|
102 |
-
return sentence_tokenize.sentence_split(paragraph, lang=language)
|
103 |
-
|
104 |
-
split_sentences("""COVID-19 is caused by infection with the severe acute respiratory
|
105 |
-
syndrome coronavirus 2 (SARS-CoV-2) virus strain. The disease is mainly transmitted via the respiratory
|
106 |
-
route when people inhale droplets and particles that infected people release as they breathe, talk, cough, sneeze, or sing. """, language='en')
|
107 |
-
|
108 |
-
>> ['COVID-19 is caused by infection with the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) virus strain.',
|
109 |
-
'The disease is mainly transmitted via the respiratory route when people inhale droplets and particles that infected people release as they breathe, talk, cough, sneeze, or sing.']
|
110 |
-
|
111 |
-
split_sentences("""இத்தொற்றுநோய் உலகளாவிய சமூக மற்றும் பொருளாதார சீர்குலைவை ஏற்படுத்தியுள்ளது.இதனால் பெரும் பொருளாதார மந்தநிலைக்குப் பின்னர் உலகளவில் மிகப்பெரிய மந்தநிலை ஏற்பட்டுள்ளது. இது விளையாட்டு,மத, அரசியல் மற்றும் கலாச்சார நிகழ்வுகளை ஒத்திவைக்க அல்லது ரத்து செய்ய வழிவகுத்தது.
|
112 |
-
அச்சம் காரணமாக முகக்கவசம், கிருமிநாசினி உள்ளிட்ட பொருட்களை அதிக நபர்கள் வாங்கியதால் விநியோகப் பற்றாக்குறை ஏற்பட்டது.""",
|
113 |
-
language='ta')
|
114 |
-
|
115 |
-
>> ['இத்தொற்றுநோய் உலகளாவிய சமூக மற்றும் பொருளாதார சீர்குலைவை ஏற்படுத்தியுள்ளது.',
|
116 |
-
'இதனால் பெரும் பொருளாதார மந்தநிலைக்குப் பின்னர் உலகளவில் மிகப்பெரிய மந்தநிலை ஏற்பட்டுள்ளது.',
|
117 |
-
'இது விளையாட்டு,மத, அரசியல் மற்றும் கலாச்சார நிகழ்வுகளை ஒத்திவைக்க அல்லது ரத்து செய்ய வழிவகுத்தது.',
|
118 |
-
'அச்சம் காரணமாக முகக்கவசம், கிருமிநாசினி உள்ளிட்ட பொருட்களை அதிக நபர்கள் வாங்கியதால் விநியோகப் பற்றாக்குறை ஏற்பட்டது.']
|
119 |
-
|
120 |
-
|
121 |
-
```
|
122 |
-
|
123 |
-
Follow the colab notebook to setup the environment, download the trained _IndicTrans_ models and translate your own text.
|
124 |
-
|
125 |
-
Command line interface --> [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4Bharat/indicTrans/blob/main/indictrans_fairseq_inference.ipynb)
|
126 |
-
|
127 |
-
|
128 |
-
Python interface --> [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4Bharat/indicTrans/blob/main/indicTrans_python_interface.ipynb)
|
129 |
-
|
130 |
-
The python interface is useful in case you want to reuse the model for multiple translations and do not want to reinitialize the model each time
|
131 |
-
|
132 |
-
|
133 |
-
## Finetuning the model on your input dataset
|
134 |
-
|
135 |
-
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4Bharat/indicTrans/blob/main/indicTrans_Finetuning.ipynb)
|
136 |
-
|
137 |
-
The colab notebook can be used to setup the environment, download the trained _IndicTrans_ models and prepare your custom dataset for finetuning the indictrans model. There is also a section on mining indic to indic data from english centric corpus for finetuning indic to indic model.
|
138 |
-
|
139 |
-
**Note**: Since this is a big model (400M params), you might not be able to train with reasonable batch sizes in the free Google Colab account. We are planning to release smaller models (after pruning / distillation) soon.
|
140 |
-
|
141 |
-
## Mining Indic to Indic pairs from english centric corpus
|
142 |
-
|
143 |
-
The `extract_non_english_pairs` in `scripts/extract_non_english_pairs.py` can be used to mine indic to indic pairs from english centric corpus.
|
144 |
-
|
145 |
-
As described in the [paper](https://arxiv.org/pdf/2104.05596.pdf) (section 2.5) , we use a very strict deduplication criterion to avoid the creation of very similar parallel sentences. For example, if an en sentence is aligned to *M* hi sentences and *N* ta sentences, then we would get *MN* hi-ta pairs. However, these pairs would be very similar and not contribute much to the training process. Hence, we retain only 1 randomly chosen pair out of these *MN* pairs.
|
146 |
-
|
147 |
-
```bash
|
148 |
-
extract_non_english_pairs(indir, outdir, LANGS):
|
149 |
-
"""
|
150 |
-
Extracts non-english pair parallel corpora
|
151 |
-
indir: contains english centric data in the following form:
|
152 |
-
- directory named en-xx for language xx
|
153 |
-
- each directory contains a train.en and train.xx
|
154 |
-
outdir: output directory to store mined data for each pair.
|
155 |
-
One directory is created for each pair.
|
156 |
-
LANGS: list of languages in the corpus (other than English).
|
157 |
-
The language codes must correspond to the ones used in the
|
158 |
-
files and directories in indir. Preferably, sort the languages
|
159 |
-
in this list in alphabetic order. outdir will contain data for xx-yy,
|
160 |
-
but not for yy-xx, so it will be convenient to have this list in sorted order.
|
161 |
-
"""
|
162 |
-
```
|
163 |
-
|
164 |
-
## Installation
|
165 |
-
<details><summary>Click to expand </summary>
|
166 |
-
|
167 |
-
```bash
|
168 |
-
cd indicTrans
|
169 |
-
git clone https://github.com/anoopkunchukuttan/indic_nlp_library.git
|
170 |
-
git clone https://github.com/anoopkunchukuttan/indic_nlp_resources.git
|
171 |
-
git clone https://github.com/rsennrich/subword-nmt.git
|
172 |
-
# install required libraries
|
173 |
-
pip install sacremoses pandas mock sacrebleu tensorboardX pyarrow indic-nlp-library
|
174 |
-
|
175 |
-
# Install fairseq from source
|
176 |
-
git clone https://github.com/pytorch/fairseq.git
|
177 |
-
cd fairseq
|
178 |
-
pip install --editable ./
|
179 |
-
|
180 |
-
```
|
181 |
-
</details>
|
182 |
-
|
183 |
-
## How to train the indictrans model on your training data?
|
184 |
-
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/AI4Bharat/indicTrans/blob/main/IndicTrans_training.ipynb)
|
185 |
-
|
186 |
-
|
187 |
-
Follow the colab notebook to setup the environment, download the dataset and train the indicTrans model
|
188 |
-
|
189 |
-
## Network & Training Details
|
190 |
-
|
191 |
-
- Architecture: IndicTrans uses 6 encoder and decoder layers, input embeddings of size 1536 with 16 attention heads and
|
192 |
-
feedforward dimension of 4096 with total number of parameters of 434M
|
193 |
-
- Loss: Cross entropy loss
|
194 |
-
- Optimizer: Adam
|
195 |
-
- Label Smoothing: 0.1
|
196 |
-
- Gradient clipping: 1.0
|
197 |
-
- Learning rate: 5e-4
|
198 |
-
- Warmup_steps: 4000
|
199 |
-
|
200 |
-
Please refer to section 4, 5 of our [paper](https://arxiv.org/ftp/arxiv/papers/2104/2104.05596.pdf) for more details on training/experimental setup.
|
201 |
-
|
202 |
-
## Folder Structure
|
203 |
-
```
|
204 |
-
|
205 |
-
IndicTrans
|
206 |
-
│ .gitignore
|
207 |
-
│ apply_bpe_traindevtest_notag.sh # apply bpe for joint vocab (Train, dev and test)
|
208 |
-
│ apply_single_bpe_traindevtest_notag.sh # apply bpe for separate vocab (Train, dev and test)
|
209 |
-
│ binarize_training_exp.sh # binarize the training data after preprocessing for fairseq-training
|
210 |
-
│ compute_bleu.sh # Compute BLEU scores with postprocessing after translating with `joint_translate.sh`
|
211 |
-
│ indictrans_fairseq_inference.ipynb # colab example to show how to use model for inference
|
212 |
-
│ indicTrans_Finetuning.ipynb # colab example to show how to use model for finetuning on custom domain data
|
213 |
-
│ joint_translate.sh # used for inference (see colab inference notebook for more details on usage)
|
214 |
-
│ learn_bpe.sh # learning joint bpe on preprocessed text
|
215 |
-
│ learn_single_bpe.sh # learning separate bpe on preprocessed text
|
216 |
-
│ LICENSE
|
217 |
-
│ prepare_data.sh # prepare data given an experiment dir (this does preprocessing,
|
218 |
-
│ # building vocab, binarization ) for bilingual training
|
219 |
-
│ prepare_data_joint_training.sh # prepare data given an experiment dir (this does preprocessing,
|
220 |
-
│ # building vocab, binarization ) for joint training
|
221 |
-
│ README.md
|
222 |
-
│
|
223 |
-
├───legacy # old unused scripts
|
224 |
-
├───model_configs # custom model configurations are stored here
|
225 |
-
│ custom_transformer.py # contains custom 4x transformer models
|
226 |
-
│ __init__.py
|
227 |
-
├───inference
|
228 |
-
│ custom_interactive.py # for python wrapper around fairseq-interactive
|
229 |
-
│ engine.py # python interface for model inference
|
230 |
-
└───scripts # stores python scripts that are used by other bash scripts
|
231 |
-
│ add_joint_tags_translate.py # add lang tags to the processed training data for bilingual training
|
232 |
-
│ add_tags_translate.py # add lang tags to the processed training data for joint training
|
233 |
-
│ clean_vocab.py # clean vocabulary after building with subword_nmt
|
234 |
-
│ concat_joint_data.py # concatenates lang pair data and creates text files to keep track
|
235 |
-
│ # of number of lines in each lang pair.
|
236 |
-
│ extract_non_english_pairs.py # Mining Indic to Indic pairs from english centric corpus
|
237 |
-
│ postprocess_translate.py # Postprocesses translations
|
238 |
-
│ preprocess_translate.py # Preprocess translations and for script conversion (from indic to Devanagari)
|
239 |
-
│ remove_large_sentences.py # to remove large sentences from training data
|
240 |
-
└───remove_train_devtest_overlaps.py # Finds and removes overlapping data of train with dev and test sets
|
241 |
-
```
|
242 |
-
|
243 |
-
|
244 |
-
## Citing
|
245 |
-
|
246 |
-
If you are using any of the resources, please cite the following article:
|
247 |
-
```
|
248 |
-
@misc{ramesh2021samanantar,
|
249 |
-
title={Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages},
|
250 |
-
author={Gowtham Ramesh and Sumanth Doddapaneni and Aravinth Bheemaraj and Mayank Jobanputra and Raghavan AK and Ajitesh Sharma and Sujit Sahoo and Harshita Diddee and Mahalakshmi J and Divyanshu Kakwani and Navneet Kumar and Aswin Pradeep and Kumar Deepak and Vivek Raghavan and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
|
251 |
-
year={2021},
|
252 |
-
eprint={2104.05596},
|
253 |
-
archivePrefix={arXiv},
|
254 |
-
primaryClass={cs.CL}
|
255 |
-
}
|
256 |
-
```
|
257 |
-
|
258 |
-
We would like to hear from you if:
|
259 |
-
|
260 |
-
- You are using our resources. Please let us know how you are putting these resources to use.
|
261 |
-
- You have any feedback on these resources.
|
262 |
-
|
263 |
-
|
264 |
-
|
265 |
-
### License
|
266 |
-
|
267 |
-
The IndicTrans code (and models) are released under the MIT License.
|
268 |
-
|
269 |
-
|
270 |
-
### Contributors
|
271 |
-
|
272 |
-
- Gowtham Ramesh, <sub>([RBCDSAI](https://rbcdsai.iitm.ac.in), [IITM](https://www.iitm.ac.in))</sub>
|
273 |
-
- Sumanth Doddapaneni, <sub>([RBCDSAI](https://rbcdsai.iitm.ac.in), [IITM](https://www.iitm.ac.in))</sub>
|
274 |
-
- Aravinth Bheemaraj, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
|
275 |
-
- Mayank Jobanputra, <sub>([IITM](https://www.iitm.ac.in))</sub>
|
276 |
-
- Raghavan AK, <sub>([AI4Bharat](https://ai4bharat.org))</sub>
|
277 |
-
- Ajitesh Sharma, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
|
278 |
-
- Sujit Sahoo, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
|
279 |
-
- Harshita Diddee, <sub>([AI4Bharat](https://ai4bharat.org))</sub>
|
280 |
-
- Mahalakshmi J, <sub>([AI4Bharat](https://ai4bharat.org))</sub>
|
281 |
-
- Divyanshu Kakwani, <sub>([IITM](https://www.iitm.ac.in), [AI4Bharat](https://ai4bharat.org))</sub>
|
282 |
-
- Navneet Kumar, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
|
283 |
-
- Aswin Pradeep, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
|
284 |
-
- Kumar Deepak, <sub>([Tarento](https://www.linkedin.com/company/tarento-group/), [EkStep](https://ekstep.in))</sub>
|
285 |
-
- Vivek Raghavan, <sub>([EkStep](https://ekstep.in))</sub>
|
286 |
-
- Anoop Kunchukuttan, <sub>([Microsoft](https://www.microsoft.com/en-in/), [AI4Bharat](https://ai4bharat.org))</sub>
|
287 |
-
- Pratyush Kumar, <sub>([RBCDSAI](https://rbcdsai.iitm.ac.in), [AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in))</sub>
|
288 |
-
- Mitesh Shantadevi Khapra, <sub>([RBCDSAI](https://rbcdsai.iitm.ac.in), [AI4Bharat](https://ai4bharat.org), [IITM](https://www.iitm.ac.in))</sub>
|
289 |
-
|
290 |
-
|
291 |
-
|
292 |
-
### Contact
|
293 |
-
|
294 |
-
- Anoop Kunchukuttan ([anoop.kunchukuttan@gmail.com](mailto:anoop.kunchukuttan@gmail.com))
|
295 |
-
- Mitesh Khapra ([miteshk@cse.iitm.ac.in](mailto:miteshk@cse.iitm.ac.in))
|
296 |
-
- Pratyush Kumar ([pratyush@cse.iitm.ac.in](mailto:pratyush@cse.iitm.ac.in))
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
indicTrans/indicTrans_Finetuning.ipynb → indicTrans_Finetuning.ipynb
RENAMED
File without changes
|
indicTrans/indicTrans_python_interface.ipynb → indicTrans_python_interface.ipynb
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/LICENSE
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/README.md
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/contrib/README.md
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/contrib/correct_moses_tokenizer.py
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/contrib/hindi_to_kannada_transliterator.py
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/contrib/indic_scraper_project_sample.ipynb
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/Makefile
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/cmd.rst
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/code.rst
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/conf.py
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/index.rst
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.MD
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.cli.rst
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.morph.rst
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.normalize.rst
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.pdf
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.rst
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.script.rst
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.syllable.rst
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.tokenize.rst
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/indicnlp.transliterate.rst
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/make.bat
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/docs/modules.rst
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/__init__.py
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/cli/__init__.py
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/cli/cliparser.py
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/common.py
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/langinfo.py
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/loader.py
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/morph/__init__.py
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/morph/unsupervised_morph.py
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/normalize/__init__.py
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/normalize/indic_normalize.py
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/script/__init__.py
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/script/english_script.py
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/script/indic_scripts.py
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/script/phonetic_sim.py
RENAMED
File without changes
|
{indicTrans/indic_nlp_library → indic_nlp_library}/indicnlp/syllable/__init__.py
RENAMED
File without changes
|