[tokenizer](#tokenizer) | [model](#model) | [datasets](#datasets) | [plots](#plots) | [fine tuning](#fine-tuning)

## Tokenizer {#tokenizer}

We trained our tokenizer with [sentencepiece](https://github.com/google/sentencepiece)'s unigram algorithm and then loaded it as an MT5TokenizerFast.
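
A minimal sketch of this step, assuming an illustrative corpus path, vocabulary size, and output prefix (these values are not from the original training script):

```python
import sentencepiece as spm
from transformers import MT5TokenizerFast

# Train a unigram SentencePiece model on a plain-text corpus
# (one training example per line). All paths and values are illustrative.
spm.SentencePieceTrainer.train(
    input="corpus.txt",            # assumed path to the training text
    model_prefix="code_unigram",   # writes code_unigram.model / code_unigram.vocab
    vocab_size=32000,              # illustrative value
    model_type="unigram",
)

# Wrap the trained SentencePiece model as an MT5 fast tokenizer.
tokenizer = MT5TokenizerFast(vocab_file="code_unigram.model")
print(tokenizer.tokenize("def add(a, b): return a + b"))
```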

## Model {#model}

We used the [MT5-base](https://huggingface.co/google/mt5-base) model.
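
For reference, the checkpoint can be loaded from the Hub like this (a standard `transformers` call, not the project's exact training code):

```python
from transformers import MT5ForConditionalGeneration

# Load the pretrained MT5-base checkpoint from the Hugging Face Hub.
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-base")
print(f"{model.num_parameters():,} parameters")
```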

## Datasets {#datasets}

We trained the model on the [Code Search Net](https://huggingface.co/datasets/code_search_net) dataset and some data scraped from the internet. We maintained a list of datasets, where each dataset contained code in a single language.
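
A sketch of how such a per-language list can be built with `datasets`, assuming the Hub id `code_search_net` and a few of its language configs; the scraped data is not shown:

```python
from datasets import load_dataset

# Illustrative subset of CodeSearchNet language configs.
languages = ["python", "java", "go"]

# One dataset per language, so each entry holds code in a single language.
datasets_by_language = {
    lang: load_dataset("code_search_net", lang, split="train")
    for lang in languages
}
```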

## Plots {#plots}

[train loss](#train_loss) | [evaluation loss](#eval_loss) | [evaluation accuracy](#eval_acc) | [learning rate](#lrs)

### Train loss {#train_loss}

![train loss](train_loss.png)

### Evaluation loss {#eval_loss}

![eval loss](eval_loss.png)

### Evaluation accuracy {#eval_acc}

![eval accuracy](eval_accuracy.png)

### Learning rate {#lrs}

![learning rate](learning_rate.png)

## Fine tuning {#fine-tuning}

We fine-tuned the model on the [CodeXGLUE code-to-code-trans dataset](https://huggingface.co/datasets/code_x_glue_cc_code_to_code_trans) and scraped data.
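
For reference, the fine-tuning dataset can be loaded like this (the `java`/`cs` field names below are those of the CodeXGLUE dataset; the scraped data is omitted):

```python
from datasets import load_dataset

# CodeXGLUE code-to-code translation (Java <-> C#) pairs.
dataset = load_dataset("code_x_glue_cc_code_to_code_trans", split="train")
example = dataset[0]
print(example["java"][:80])
print(example["cs"][:80])
```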