ncoop57 committed
Commit 681857f
1 Parent(s): 7e348aa

Update README.md

Files changed (1)
  1. README.md +41 -4
README.md CHANGED
 
## Model Description

GPT-Neo-125M-Code-Clippy is a [GPT-Neo-125M model](https://huggingface.co/EleutherAI/gpt-neo-125M) finetuned using causal language modeling on our version of the Code Clippy Data dataset, which contains duplicates and was scraped from public GitHub repositories (more information in the provided link). This model is specialized to autocomplete methods in multiple programming languages. As discussed in OpenAI's [Codex paper](https://arxiv.org/abs/2107.03374), we modified the GPT-Neo model and tokenizer to accommodate additional whitespace characters. Specifically, we added the tokens `["\t\t", " ", " ", " "]`, and since they are all related to indentation, we initialized the embeddings of these tokens with the same weights as the `\t` token already present in the model, in the hope that the model will learn to associate these whitespace characters with indentation faster.
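
The snippet below is a minimal sketch of that modification, assuming the standard `transformers` token-adding and embedding-resizing APIs and shown with the PyTorch classes rather than the Flax ones used for training; the multi-space strings are placeholders for the token list above, and the actual training code may differ in detail.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")

# Placeholder indentation tokens (a double tab plus multi-space runs).
new_tokens = ["\t\t", "  ", "    ", "        "]
tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix to cover the newly added vocabulary entries.
model.resize_token_embeddings(len(tokenizer))

# Seed each new embedding with the weights of the existing "\t" token so the
# model starts out treating these tokens as indentation-like.
tab_id = tokenizer.encode("\t")[0]
embeddings = model.get_input_embeddings()
with torch.no_grad():
    for token in new_tokens:
        token_id = tokenizer.convert_tokens_to_ids(token)
        embeddings.weight[token_id] = embeddings.weight[tab_id]
```

Copying the `\t` weights rather than using random initialization simply gives the new tokens a sensible starting point; finetuning then adjusts them.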

## Training data

## Training procedure

The training script used to train this model can be found [here](https://github.com/ncoop57/gpt-code-clippy/blob/camera-ready/training/run_clm_streaming_flax.py).

To reproduce the training, one can use this command with the above script:
```bash
./run_clm_streaming_flax_v2.py \
--output_dir $HOME/gpt-neo-125M-code-clippy \
--model_name_or_path="flax-community/gpt-neo-125M-code-clippy" \
--dataset_name $HOME/gpt-code-clippy/code_clippy.py \
--data_dir /home/shared/code_clippy_data \
--text_column_name="text" \
--do_train --do_eval \
--block_size="2048" \
--per_device_train_batch_size="8" \
--per_device_eval_batch_size="16" \
--preprocessing_num_workers="8" \
--learning_rate="1e-4" \
--max_steps 100000 \
--warmup_steps 2500 \
--decay_steps 25000 \
--adam_beta1="0.9" \
--adam_beta2="0.95" \
--weight_decay="0.1" \
--overwrite_output_dir \
--logging_steps="100" \
--eval_steps="500" \
--push_to_hub="False" \
--report_to="all" \
--dtype="bfloat16" \
--skip_memory_metrics="True" \
--save_steps="500" \
--save_total_limit 10 \
--gradient_accumulation_steps 16 \
--report_to="wandb" \
--run_name="125m_1e-4lr_1024bs" \
--max_eval_samples 2000 \
--save_optimizer true
```
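
As a rough sanity check on these flags, the "1024bs" tag in `--run_name` is consistent with the per-device batch size and gradient accumulation above once multiplied by the number of devices. The device count below is an assumption (e.g. a single 8-core TPU); it is not stated in the command itself.

```py
# Back-of-the-envelope effective (global) batch size for the command above.
num_devices = 8                    # assumed, e.g. one TPU v3-8; not given in the command
per_device_train_batch_size = 8    # --per_device_train_batch_size
gradient_accumulation_steps = 16   # --gradient_accumulation_steps

effective_batch_size = num_devices * per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 1024, matching the "1024bs" tag in --run_name
```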
 
## Intended Use and Limitations

The model is finetuned on text files from GitHub repositories (mostly programming languages, but also markdown and other project-related files).

### How to use

You can use this model directly with a pipeline for text generation.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, FlaxAutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("flax-community/gpt-neo-125M-code-clippy")

tokenizer = AutoTokenizer.from_pretrained("flax-community/gpt-neo-125M-code-clippy")
```
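
Below is a minimal end-to-end generation sketch using the loaded model and tokenizer; the prompt and decoding settings are illustrative choices, not values from the model card.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("flax-community/gpt-neo-125M-code-clippy")
tokenizer = AutoTokenizer.from_pretrained("flax-community/gpt-neo-125M-code-clippy")

# Illustrative prompt: the start of a Python method for the model to complete.
prompt = "def greet(name):\n    "

inputs = tokenizer(prompt, return_tensors="pt")
# Decoding settings are arbitrary; tune them for your use case.
output = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=True,
    temperature=0.2,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```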