Commit baf7f9c · 0 Parent(s), committed by IntelligenzaArtificiale and mamta

Duplicate from codeparrot/code-generation-models

Co-authored-by: Mamta Narang <mamta@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,27 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zstandard filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,46 @@
+ ---
+ title: Code generation with 🤗
+ emoji: ✨
+ colorFrom: purple
+ colorTo: yellow
+ sdk: streamlit
+ sdk_version: 1.9.0
+ app_file: app.py
+ models:
+ - lvwerra/codeparrot
+ - Salesforce/codegen-16B-mono
+ - facebook/incoder-6B
+ datasets:
+ - lvwerra/github-code
+ - openai_humaneval
+ - the_pile
+ - code_search_net
+ - mbpp
+ - loubnabnl/apps
+ - lvwerra/codeparrot-clean
+ pinned: true
+ license: apache-2.0
+ duplicated_from: codeparrot/code-generation-models
+ ---
+
+ # Configuration
+ `title`: _string_
+ Display title for the Space
+ `emoji`: _string_
+ Space emoji (emoji-only character allowed)
+ `colorFrom`: _string_
+ Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)
+ `colorTo`: _string_
+ Color for Thumbnail gradient (red, yellow, green, blue, indigo, purple, pink, gray)
+ `sdk`: _string_
+ Can be either `gradio` or `streamlit`
+ `sdk_version`: _string_
+ Only applicable for `streamlit` SDK.
+ See [doc](https://hf.co/docs/hub/spaces) for more info on supported versions.
+
+ `app_file`: _string_
+ Path to your main application file (which contains either `gradio` or `streamlit` Python code).
+ Path is relative to the root of the repository.
+
+ `pinned`: _boolean_
+ Whether the Space stays on top of your list.
app.py ADDED
@@ -0,0 +1,210 @@
+ import json
+ import os
+ import pandas as pd
+ import requests
+ import threading
+ import streamlit as st
+ from datasets import load_dataset, load_metric
+
+ MODELS = ["CodeParrot", "InCoder", "CodeGen", "PolyCoder"]
+ GENERATION_MODELS = ["CodeParrot", "InCoder", "CodeGen"]
+
+
+ @st.cache()
+ def load_examples():
+     with open("utils/examples.json", "r") as f:
+         examples = json.load(f)
+     return examples
+
+
+ def load_evaluation():
+     # load task 2 of HumanEval and code_eval_metric
+     os.environ["HF_ALLOW_CODE_EVAL"] = "1"
+     human_eval = load_dataset("openai_humaneval")
+     entry_point = f"check({human_eval['test'][2]['entry_point']})"
+     test_func = "\n" + human_eval["test"][2]["test"] + "\n" + entry_point
+     code_eval = load_metric("code_eval")
+     return code_eval, test_func
+
+
+ def read_markdown(path):
+     with open(path, "r") as f:
+         output = f.read()
+     st.markdown(output, unsafe_allow_html=True)
+
+
+ def generate_code(
+     generations, model_name, gen_prompt, max_new_tokens, temperature, seed
+ ):
+     # call space using its API endpoint
+     url = (
+         f"https://hf.space/embed/codeparrot/{model_name.lower()}-subspace/+/api/predict/"
+     )
+     r = requests.post(
+         url=url, json={"data": [gen_prompt, max_new_tokens, temperature, seed]}
+     )
+     generated_text = r.json()["data"][0]
+     generations.append({model_name: generated_text})
+
+
+ def generate_code_threads(
+     generations, models, gen_prompt, max_new_tokens, temperature, seed
+ ):
+     threads = []
+     for model_name in models:
+         # create the thread
+         threads.append(
+             threading.Thread(
+                 target=generate_code,
+                 args=(
+                     generations,
+                     model_name,
+                     gen_prompt,
+                     max_new_tokens,
+                     temperature,
+                     seed,
+                 ),
+             )
+         )
+         threads[-1].start()
+
+     for t in threads:
+         t.join()
+
+ @st.cache(show_spinner=False)
+ def generate_teaser(gen_prompt):
+     generations = []
+     generate_code(generations, "CodeParrot", gen_prompt, 8, 0.2, 42)
+     return generations[0]["CodeParrot"]
+
+ st.set_page_config(page_icon=":laptop:", layout="wide")
+ with open("utils/table_contents.md", "r") as f:
+     contents = f.read()
+
+ st.sidebar.markdown(contents)
+
+ # Introduction
+ st.title("Code generation with 🤗")
+ read_markdown("utils/summary.md")
+ ## teaser
+ example_text = "def print_hello_world():"
+ col1, col2, col3 = st.columns([1, 2, 1])
+ with col2:
+     gen_prompt = st.text_area(
+         "",
+         value=example_text,
+         height=100,
+     ).strip()
+     if st.button("Generate code!", key=1):
+         with st.spinner("Generating code..."):
+             st.code(generate_teaser(gen_prompt))
+ read_markdown("utils/intro.md")
+
+ # Code datasets
+ st.subheader("1 - Code datasets")
+ read_markdown("datasets/intro.md")
+ read_markdown("datasets/github_code.md")
+ col1, col2 = st.columns([1, 2])
+ with col1:
+     selected_model = st.selectbox("", MODELS, key=1)
+ read_markdown(f"datasets/{selected_model.lower()}.md")
+
+
+ # Model architecture
+ st.subheader("2 - Model architecture")
+ read_markdown("architectures/intro.md")
+ col1, col2 = st.columns([1, 2])
+ with col1:
+     selected_model = st.selectbox("", MODELS, key=2)
+ read_markdown(f"architectures/{selected_model.lower()}.md")
+
+ # Model evaluation
+ st.subheader("3 - Code model evaluation")
+ read_markdown("evaluation/intro.md")
+ read_markdown("evaluation/demo_humaneval.md")
+ ## quiz
+ st.markdown("Below you can try solving this problem or visualize the solution of CodeParrot:")
+ with open("evaluation/problem.md", "r") as f:
+     problem = f.read()
+ with open("evaluation/solution.md", "r") as f:
+     solution = f.read()
+
+ candidate_solution = st.text_area(
+     "Complete the problem:",
+     value=problem,
+     height=240,
+ ).strip()
+ if st.button("Test my solution", key=2):
+     with st.spinner("Testing..."):
+         code_eval, test_func = load_evaluation()
+         test_cases = [test_func]
+         candidates = [[candidate_solution]]
+         pass_at_k, _ = code_eval.compute(references=test_cases, predictions=candidates)
+         text = "Your solution didn't pass the test, pass@1 is 0 😕" if pass_at_k['pass@1'] < 1 else "Congrats your pass@1 is 1! 🎉"
+         st.markdown(text)
+ if st.button("Show model solution", key=3):
+     st.markdown(solution)
+
+ # Code generation
+ st.subheader("4 - Code generation ✨")
+ read_markdown("generation/intro.md")
+ col1, col2, col3 = st.columns([7, 1, 6])
+ with col1:
+     st.markdown("**Models**")
+     selected_models = st.multiselect(
+         "Select code generation models to compare:",
+         GENERATION_MODELS,
+         default=GENERATION_MODELS,
+         key=3,
+     )
+     st.markdown(" ")
+     st.markdown("**Examples**")
+     examples = load_examples()
+     example_names = [example["name"] for example in examples]
+     name2id = dict([(name, i) for i, name in enumerate(example_names)])
+     selected_example = st.selectbox(
+         "Select one of the following examples or implement yours:", example_names
+     )
+     example_text = examples[name2id[selected_example]]["value"]
+     default_length = examples[name2id[selected_example]]["length"]
+ with col3:
+     st.markdown("**Generation settings**")
+     temperature = st.slider(
+         "Temperature:", value=0.2, min_value=0.1, step=0.1, max_value=2.0
+     )
+     max_new_tokens = st.slider(
+         "Number of tokens to generate:",
+         value=default_length,
+         min_value=8,
+         step=4,
+         max_value=256,
+     )
+     seed = st.slider("Random seed:", value=42, min_value=0, step=1, max_value=1000)
+ gen_prompt = st.text_area(
+     "Generate code with prompt:",
+     value=example_text,
+     height=200,
+ ).strip()
+ if st.button("Generate code!", key=4):
+     with st.spinner("Generating code..."):
+         # use threading
+         generations = []
+         generate_code_threads(
+             generations,
+             selected_models,
+             gen_prompt=gen_prompt,
+             max_new_tokens=max_new_tokens,
+             temperature=temperature,
+             seed=seed,
+         )
+         # show results in selection order; a model whose request failed has no entry
+         for model_name in selected_models:
+             for generation in generations:
+                 if model_name in generation:
+                     st.markdown(f"**{model_name}**")
+                     st.code(generation[model_name])
+         if len(generations) < len(selected_models):
+             st.markdown("<span style='color:red'>Warning: Some models timed out. Try again later or reduce the number of tokens to generate. You can also try generating code using the original subspaces: [InCoder](https://huggingface.co/spaces/loubnabnl/incoder-subspace), [CodeGen](https://huggingface.co/spaces/loubnabnl/codegen-subspace), [CodeParrot](https://huggingface.co/spaces/loubnabnl/codeparrot-subspace)</span>", unsafe_allow_html=True)
+
+ # Resources
+ st.subheader("Resources")
+ read_markdown("utils/resources.md")
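The one-thread-per-model fan-out in `generate_code_threads` can be sketched without Streamlit or network access. This is a minimal sketch under stated assumptions, not the app's code: `fan_out`, `fetch`, and `fake_fetch` are hypothetical names, `fetch` stands in for the HTTP request made in `generate_code`, and failures are recorded per model instead of being lost when a thread raises.

```python
import threading


def fan_out(models, fetch):
    """Call fetch(model_name) for each model in its own thread.

    `fetch` is a hypothetical stand-in for the request made in generate_code.
    Returns (results, errors), both keyed by model name.
    """
    results, errors = {}, {}
    lock = threading.Lock()

    def worker(name):
        try:
            text = fetch(name)
        except Exception as exc:  # a failed request should not hide the others
            with lock:
                errors[name] = str(exc)
        else:
            with lock:
                results[name] = text

    threads = [threading.Thread(target=worker, args=(name,)) for name in models]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, errors


def fake_fetch(name):
    # Simulated endpoint: one model times out, the others answer.
    if name == "InCoder":
        raise TimeoutError("simulated timeout")
    return f"# completion from {name}"


results, errors = fan_out(["CodeParrot", "InCoder", "CodeGen"], fake_fetch)
```

Keying the results by model name means the display loop can look each model up directly, and the `errors` mapping says which model failed rather than only that one did.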
architectures/codegen.md ADDED
@@ -0,0 +1,25 @@
+ The CodeGen architecture follows a standard transformer decoder with left-to-right causal masking, rotary position embeddings [(Su et al., 2021)](https://arxiv.org/abs/2104.09864) for positional encoding, and a context length of 2048. CodeGen models are trained in various sizes.
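To make the rotary position embedding concrete, here is a minimal pure-Python sketch of the idea from Su et al. (2021): each consecutive pair of vector components is rotated by an angle proportional to the token position. The function names, the base of 10000, and the toy vectors are illustrative assumptions; this is not CodeGen's actual implementation.

```python
import math


def rotary_embed(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    if len(vec) % 2:
        raise ValueError("vector length must be even")
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency; lower pairs rotate faster.
        theta = position * base ** (-i / dim)
        cos, sin = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos - y * sin, x * sin + y * cos])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Toy query/key vectors (made-up values).
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# Both pairs of positions are 2 apart, so the attention scores should agree.
near = dot(rotary_embed(q, 3), rotary_embed(k, 1))
shifted = dot(rotary_embed(q, 10), rotary_embed(k, 8))
```

The two scores match because the dot product of two rotated vectors depends only on the difference between their positions, which is the property that makes the encoding relative.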
+
+ <div align="center">
+
+ |Model | # parameters |
+ | - | - |
+ | [Salesforce/codegen-350M-mono](https://huggingface.co/Salesforce/codegen-350M-mono) | 350M |
+ | [Salesforce/codegen-2B-mono](https://huggingface.co/Salesforce/codegen-2B-mono) | 2.7B |
+ | [Salesforce/codegen-6B-mono](https://huggingface.co/Salesforce/codegen-6B-mono) | 6.1B |
+ | [Salesforce/codegen-16B-mono](https://huggingface.co/Salesforce/codegen-16B-mono) | 16.1B |
+
+ </div>
+
+
+ You can load the model and tokenizer directly from 🤗 [`transformers`](https://huggingface.co/docs/transformers/index):
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ tokenizer = AutoTokenizer.from_pretrained('Salesforce/codegen-16B-mono')
+ model = AutoModelForCausalLM.from_pretrained('Salesforce/codegen-16B-mono')
+
+ inputs = tokenizer("def hello_world():", return_tensors="pt")
+ outputs = model(**inputs)
+ ```
architectures/codeparrot.md ADDED
@@ -0,0 +1,33 @@
+ [CodeParrot](https://huggingface.co/codeparrot/codeparrot) uses the GPT-2 architecture with a BPE tokenizer trained on Python code from the training split of the data, and a context length of 1024. This model was released as an educational tool for training large language models from scratch on code, with detailed tutorials and descriptions of the training process. It makes use of 🤗 [`accelerate`](https://huggingface.co/docs/accelerate/index) for distributed training and mixed precision. See this [blog](https://huggingface.co/blog/codeparrot) and [repo](https://github.com/huggingface/transformers/tree/main/examples/research_projects/codeparrot) for more details.
+
+ <div align="center">
+
+ |Model | # parameters |
+ | - | - |
+ | [codeparrot-small](https://huggingface.co/codeparrot/codeparrot-small) | 110M |
+ | [codeparrot](https://huggingface.co/codeparrot/codeparrot) | 1.5B |
+
+ </div>
+
+
+ You can load the model and tokenizer directly from 🤗 [`transformers`](https://huggingface.co/docs/transformers/index):
+
+ ```python
+ # AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ tokenizer = AutoTokenizer.from_pretrained("codeparrot/codeparrot")
+ model = AutoModelForCausalLM.from_pretrained("codeparrot/codeparrot")
+
+ inputs = tokenizer("def hello_world():", return_tensors="pt")
+ outputs = model(**inputs)
+
+ ```
+
+ You can also use `pipeline` to generate code:
+
+ ```python
+ from transformers import pipeline
+
+ pipe = pipeline("text-generation", model="codeparrot/codeparrot")
+ outputs = pipe("def hello_world():")
+ ```
architectures/incoder.md ADDED
@@ -0,0 +1,31 @@
+ [InCoder](https://huggingface.co/facebook/incoder-6B) uses a decoder-only Transformer with a Causal Masking objective to train a left-to-right language model to fill in masked token segments, with a context length of 2048.
+ <div align="center">
+
+ |Model | # parameters |
+ | - | - |
+ | [facebook/incoder-1B](https://huggingface.co/facebook/incoder-1B) |1.3B |
+ | [facebook/incoder-6B](https://huggingface.co/facebook/incoder-6B) |6.7B |
+
+ </div>
+
+ The [Causal Masking objective](https://arxiv.org/abs/2201.07520) is a hybrid of causal and masked language modeling: "it combines the benefit of per-token generation with optional bi-directionality specifically tailored to prompting".
+ During the training of InCoder, spans of code were randomly masked and moved to the end of each file, which allows for bidirectional context. The figure below, from the InCoder [paper](https://arxiv.org/pdf/2204.05999.pdf), illustrates the training process.
+
+ <p align="center">
+ <img src="https://huggingface.co/datasets/loubnabnl/repo-images/raw/main/incoder.png" alt="drawing" width="750"/>
+ </p>
+
+ So in addition to program synthesis (via left-to-right generation), InCoder can also perform editing (via infilling). The model gives promising results in some zero-shot code infilling tasks such as type prediction, variable renaming and comment generation.
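The "mask a span and move it to the end" transformation described above can be sketched on a plain token list. This is an illustrative sketch, not InCoder's preprocessing code: the function name is hypothetical, and the sentinel strings `<|mask:0|>` and `<|endofmask|>` are assumptions chosen for readability that may not match the real tokenizer's special tokens.

```python
def causal_mask_example(tokens, start, end, sentinel="<|mask:0|>", eom="<|endofmask|>"):
    """Move tokens[start:end] to the end of the sequence.

    The span is replaced in place by a sentinel, then appended after a second
    copy of the sentinel and closed by an end-of-mask marker.
    """
    if not 0 <= start < end <= len(tokens):
        raise ValueError("span must be a non-empty slice of tokens")
    span = tokens[start:end]
    return tokens[:start] + [sentinel] + tokens[end:] + [sentinel] + span + [eom]


tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
masked = causal_mask_example(tokens, 3, 6)  # mask the argument list "a , b"
```

Trained left-to-right on sequences like `masked`, the model sees both the code before and the code after the gap by the time it has to produce the span. At inference, a prompt that ends with the second sentinel asks the model to generate the missing span.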
+
+ You can load the model and tokenizer directly from 🤗 [`transformers`](https://huggingface.co/docs/transformers/index):
+
+ ```python
+ # AutoModelForCausalLM replaces the deprecated AutoModelWithLMHead
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ tokenizer = AutoTokenizer.from_pretrained("facebook/incoder-6B")
+ model = AutoModelForCausalLM.from_pretrained("facebook/incoder-6B")
+
+ inputs = tokenizer("def hello_world():", return_tensors="pt")
+ outputs = model(**inputs)
+
+ ```
architectures/intro.md ADDED
@@ -0,0 +1,8 @@
+ Various architectures are used in code generation models, but most of them use the auto-regressive left-to-right setting, such as GPT. InCoder, however, uses a decoder-only Transformer with a Causal Masking objective,
+ which combines next-token prediction with bidirectional context through masking. AlphaCode uses an encoder-decoder architecture.
+
+ <p align="center">
+ <img src="https://huggingface.co/datasets/loubnabnl/repo-images/resolve/main/model_size.png" alt="drawing" width="440"/>
+ </p>
+
+ For model-specific information about each architecture, please select a model below:
architectures/polycoder.md ADDED
@@ -0,0 +1,14 @@
+ [PolyCoder](https://github.com/VHellendoorn/Code-LMs) uses the GPT-2 architecture, with a BPE tokenizer trained on a random 5% subset of the data (all languages), and a context length of 2048. To study the effect of scaling the model size, the model was trained in 3 different sizes.
+
+ <div align="center">
+
+ |Model | # parameters |
+ | - | - |
+ | GPT2 | 160M |
+ | GPT2 | 400M |
+ | GPT2 | 2.7B |
+
+ </div>
+
+
+ PolyCoder is currently being integrated in 🤗 `transformers`. Meanwhile it can be loaded following the instructions in the original GitHub [repo](https://github.com/vhellendoorn/code-lms#models).
datasets/.ipynb_checkpoints/codeparrot-checkpoint.txt ADDED
@@ -0,0 +1,9 @@
+ [CodeParrot](https://huggingface.co/lvwerra/codeparrot) was trained on **50GB** of Python data from GitHub repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
+ - Exact match deduplication
+ - Filtering
+     - Average line length < 100
+     - Maximum line length < 1000
+     - Alphanumeric characters fraction > 0.25
+     - Remove auto-generated files (keyword search)
+
+ For more details see the preprocessing script in the transformers repository [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot).
datasets/.ipynb_checkpoints/opt-checkpoint.txt ADDED
@@ -0,0 +1,2 @@
+ [OPT](https://huggingface.co/facebook/opt-30b) was trained on 5 filtered datasets of textual documents. One of them, [The Pile](https://arxiv.org/pdf/2101.00027v1.pdf), includes code; from it, OPT used the subsets *Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics and HackerNews*.
+ The final training data contains 180B tokens corresponding to 800GB of data. For more details please refer to this [paper](https://arxiv.org/abs/2205.01068).
datasets/codegen.md ADDED
@@ -0,0 +1,14 @@
+ [CodeGen](https://huggingface.co/Salesforce/codegen-16B-mono) is a model for conversational program synthesis, where each problem is interactively solved in multiple steps, each consisting of a natural language specification from the user and a synthesized subprogram from the system.
+
+ It was sequentially trained on three datasets:
+ - [The Pile](https://huggingface.co/datasets/the_pile)
+ - A 341GB subset of Google’s [BigQuery dataset](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code) of code files from multiple programming languages, keeping only 6: C, C++, Go, Java, JavaScript, and Python
+ - 217GB of Python data from GitHub repositories
+
+ The second and third datasets used the following preprocessing:
+ - Exact match deduplication
+ - Filtering:
+     - Average line length < 100 tokens
+     - Maximum line length < 1000
+     - Files in which more than 90% of characters are decimal or hexadecimal digits are removed
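Exact match deduplication, the first preprocessing step listed above, can be sketched as follows. The use of SHA-256 content hashes is an assumption for illustration (the source does not say how matches were detected), and `exact_dedup` is a hypothetical helper name.

```python
import hashlib


def exact_dedup(files):
    """Keep the first occurrence of each distinct file content."""
    seen = set()
    unique = []
    for content in files:
        # Hashing keeps memory bounded per file: only the digest is stored.
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(content)
    return unique


corpus = ["print('a')\n", "print('b')\n", "print('a')\n"]
deduped = exact_dedup(corpus)
```

Only byte-identical files are removed this way; near-duplicates that differ by a comment or whitespace survive exact matching.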
datasets/codeparrot.md ADDED
@@ -0,0 +1,9 @@
+ [CodeParrot](https://huggingface.co/lvwerra/codeparrot) is a code generation model trained on **50GB** of pre-processed Python data from GitHub repositories: [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
+ - Exact match deduplication
+ - Filtering:
+     - Average line length < 100 characters
+     - Maximum line length < 1000 characters
+     - Alphanumeric characters fraction > 0.25
+     - Remove auto-generated files (keyword search)
+
+ For more details see the preprocessing script in the transformers repository [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot).
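The line-length and alphanumeric-fraction filters listed above can be sketched as a small predicate. This is a minimal sketch, not the preprocessing script itself: the helper names are hypothetical, line lengths are assumed to be measured in characters, and the keyword search for auto-generated files is left out.

```python
def line_stats(code):
    """Return (mean line length, max line length), both in characters."""
    lines = code.splitlines() or [""]
    lengths = [len(line) for line in lines]
    return sum(lengths) / len(lengths), max(lengths)


def alnum_fraction(code):
    """Fraction of characters in the file that are alphanumeric."""
    if not code:
        return 0.0
    return sum(ch.isalnum() for ch in code) / len(code)


def keep_file(code, max_mean=100, max_line=1000, min_alnum=0.25):
    """True if the file passes all three heuristics."""
    mean_len, longest = line_stats(code)
    return mean_len < max_mean and longest < max_line and alnum_fraction(code) > min_alnum


good = "def add(a, b):\n    return a + b\n"
minified = "x=" + "1," * 2000  # a single 4002-character line

kept = keep_file(good)
dropped = not keep_file(minified)
```

The thresholds are keyword arguments, so the same predicate can be reused with other settings, such as the stricter alphanumeric fraction some models use.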
datasets/github_code.md ADDED
@@ -0,0 +1,26 @@
+ We also released the [GitHub code dataset](https://huggingface.co/datasets/codeparrot/github-code), 1TB of code data from GitHub repositories in 32 programming languages. It was created from the public GitHub dataset on Google [BigQuery](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code). If you don't want to download the dataset because of memory limitations, you can load it in streaming mode, which creates an iterable dataset:
+
+ ```python
+ from datasets import load_dataset
+
+ ds = load_dataset("codeparrot/github-code", streaming=True, split="train")
+ print(next(iter(ds)))
+
+ # OUTPUT:
+ {
+     'code': "import mod189 from './mod189';\nvar value=mod189+1;\nexport default value;\n",
+     'repo_name': 'MirekSz/webpack-es6-ts',
+     'path': 'app/mods/mod190.js',
+     'language': 'JavaScript',
+     'license': 'isc',
+     'size': 73
+ }
+
+ ```
+ You can see that in addition to the code, the samples include some metadata: repo name, path, language, license, and the size of the file. Below is the distribution of programming languages in this dataset.
+
+ <p align="center">
+ <img src="https://huggingface.co/datasets/codeparrot/github-code/resolve/main/github-code-stats-alpha.png" alt="drawing" width="650"/>
+ </p>
+
+ For model-specific information about the pretraining dataset, please select a model below:
datasets/incoder.md ADDED
@@ -0,0 +1,14 @@
+ [InCoder](https://huggingface.co/facebook/incoder-6B) is a code generation model that also allows code editing via [infilling](https://arxiv.org/pdf/2204.05999.pdf). It was trained on **216 GB** of preprocessed data from GitHub and Stack Overflow, covering 28 programming languages. 52 GB is Python code, 107 GB is code in other programming languages, and 57 GB is Stack Overflow content that isn't code.
+
+ The GitHub data was cleaned with the following steps:
+ - Average line length < 100 tokens
+ - Maximum line length < 3000 tokens
+ - Alphanumeric characters fraction > 0.4
+ - Remove auto-generated files (keyword search)
+
+ The second component of the data consists of questions, answers, and comments from Stack Overflow. It includes:
+ - all questions that have at least one answer
+ - up to ten answers with a non-negative score (sorted by score) per question
+ - up to five comments per question/answer
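The three Stack Overflow selection rules above can be sketched on plain dictionaries. The record layout, the function name, and the choice to keep the first five comments are assumptions made for illustration; the source does not specify the data format or which comments were kept.

```python
def select_posts(questions, max_answers=10, max_comments=5):
    """Apply the selection rules to a list of question dicts.

    Each question is assumed to look like:
    {"body": str, "comments": [str],
     "answers": [{"body": str, "score": int, "comments": [str]}]}
    """
    selected = []
    for q in questions:
        if not q["answers"]:
            continue  # questions without any answer are dropped
        # Keep non-negative answers, best score first, at most max_answers.
        answers = [a for a in q["answers"] if a["score"] >= 0]
        answers.sort(key=lambda a: a["score"], reverse=True)
        kept = [
            {**a, "comments": a["comments"][:max_comments]}
            for a in answers[:max_answers]
        ]
        selected.append({**q, "comments": q["comments"][:max_comments], "answers": kept})
    return selected


questions = [
    {"body": "q1", "comments": ["c"] * 7, "answers": [
        {"body": "bad", "score": -2, "comments": []},
        {"body": "ok", "score": 1, "comments": []},
        {"body": "best", "score": 5, "comments": ["c"] * 6},
    ]},
    {"body": "q2", "comments": [], "answers": []},
]
selected = select_posts(questions)
```

The input records are copied rather than modified, so the same raw dump can be filtered again with different limits.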
+
+ Exact match deduplication was performed on code files. For more details please refer to this [paper](https://arxiv.org/pdf/2204.05999.pdf).
datasets/intro.md ADDED
@@ -0,0 +1,8 @@
+ Most code models are trained on data from public software repositories hosted on GitHub. Some also include code coupled with natural text from platforms such as Stack Overflow. Additional datasets can be crafted based on the target task of the model. [AlphaCode](https://arxiv.org/pdf/2203.07814v1.pdf), for instance, was fine-tuned on [CodeContests](https://github.com/deepmind/code_contests), a competitive programming dataset for machine learning. Another popular dataset is [The Pile](https://huggingface.co/datasets/the_pile), a large corpus containing both natural language texts and code from different sources such as StackExchange dumps and popular (>100 stars) GitHub repositories. It can be useful for models intended to translate between natural language and code; it was used in [CodeGen](https://arxiv.org/pdf/2203.13474.pdf), for instance.
+
+ Below is the distribution of the pretraining data size of some code models. We provide model-specific information for open-source models later in this section:
+ <p align="center">
+ <img src="https://huggingface.co/datasets/loubnabnl/repo-images/resolve/main/data_distrub.png" alt="drawing" width="440"/>
+ </p>
+
+ Some other useful datasets that are available on the 🤗 Hub are [CodeSearchNet](https://huggingface.co/datasets/code_search_net), a corpus of 2 million (comment, code) pairs from open-source libraries hosted on GitHub for several programming languages, and [Mostly Basic Python Problems (mbpp)](https://huggingface.co/datasets/mbpp), a benchmark of around 1,000 crowd-sourced Python programming problems for entry-level programmers, where each problem consists of a task description, a code solution and 3 automated test cases. This dataset was used in the [InCoder](https://huggingface.co/facebook/incoder-6B) evaluation, in addition to [HumanEval](https://huggingface.co/datasets/openai_humaneval), which we will present later. You can also find [APPS](https://huggingface.co/datasets/loubnabnl/apps), a benchmark with 10,000 problems consisting of programming questions in English and code solutions in Python. This dataset was also used in the Codex evaluation along with HumanEval.
datasets/polycoder.md ADDED
@@ -0,0 +1,5 @@
+ The [PolyCoder paper](https://arxiv.org/pdf/2202.13169v3.pdf) gives a nice comparison of existing code models. The authors also trained a code generation model on **249GB** of data (after preprocessing), consisting of repositories with at least 50 stars for 12 popular programming languages, collected from GitHub in October 2021. The data used the following preprocessing:
+ - Exact match deduplication
+ - Filtering:
+     - Average line length < 100 tokens
+     - Maximum line length < 1000
evaluation/demo_humaneval.md ADDED
@@ -0,0 +1,55 @@
+
+ We can load the HumanEval dataset and the pass@k metric from 🤗 [`datasets`](https://huggingface.co/docs/datasets/index) and 🤗 [`evaluate`](https://huggingface.co/docs/evaluate/index):
+
+ ```python
+ from datasets import load_dataset
+ from evaluate import load
+
+ human_eval = load_dataset("openai_humaneval")
+ code_eval_metric = load("code_eval")
+ ```
+
+ We can easily compute the pass@k for a problem that asks for the implementation of a function that sums two integers:
+
+ ```python
+ test_cases = ["assert add(2,3)==5"]
+ candidates = [["def add(a,b): return a*b", "def add(a, b): return a+b"]]
+ pass_at_k, results = code_eval_metric.compute(references=test_cases, predictions=candidates, k=[1, 2])
+ print(pass_at_k)
+ # {'pass@1': 0.5, 'pass@2': 1.0}
+ ```
+
+ To better understand how the pass@k metric works, we will illustrate it with a concrete example from the HumanEval dataset. We select the problem below and see how CodeParrot 🦜 (110M) performs and which code completions pass the unit tests:
+
+ **Problem:**
+
+ ```python
+
+ def truncate_number(number: float) -> float:
+     """ Given a positive floating point number, it can be decomposed into
+     and integer part (largest integer smaller than given number) and decimals
+     (leftover part always smaller than 1).
+
+     Return the decimal part of the number.
+     >>> truncate_number(3.5)
+     0.5
+     """
+ ```
+
+ Instead of 200 candidate solutions, we will only generate 20 samples for illustration purposes. We use nucleus sampling with top-p where `p=0.95`, `temperature=0.2`, and sample tokens from the model until we encounter a stop sequence indicating the end of a method: ‘\nclass’, ‘\ndef’, ‘\n#’, ‘\nif’, or ‘\nprint’. For more details about decoding strategies for language generation, we recommend this [blog](https://huggingface.co/blog/how-to-generate).
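Cutting a sampled completion at the first stop sequence can be sketched as a small post-processing step. The helper name and the sample completion are hypothetical; only the list of stop sequences comes from the text above.

```python
STOP_SEQUENCES = ["\nclass", "\ndef", "\n#", "\nif", "\nprint"]


def truncate_at_stop(completion, stop_sequences=STOP_SEQUENCES):
    """Cut the completion at the earliest stop sequence, if any."""
    cut = len(completion)
    for stop in stop_sequences:
        index = completion.find(stop)
        if index != -1:
            cut = min(cut, index)
    return completion[:cut]


# A made-up completion in which the model starts a second function.
raw = "    return number % 1\n\ndef helper():\n    pass\n"
body = truncate_at_stop(raw)
```

Taking the earliest match across all stop sequences matters: scanning for them one after another and stopping at the first that matches could keep text that an earlier-occurring sequence should have removed.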
+
+ **Remark**:
+
+ Regarding the temperature parameter, the authors of the [Codex](https://arxiv.org/pdf/2107.03374.pdf) paper observed that the best-performing temperature increases as the number of permitted samples k increases. Similar behavior was also observed in [CodeGen](https://arxiv.org/pdf/2203.13474.pdf). When a model is only allowed a few samples to pass unit tests, it is beneficial to use the learned distribution, through a low temperature, to select candidates that are likely to pass. But when a model is allowed more chances with a high k, using a higher sampling temperature to tilt the learned model distribution lets it explore diverse samples and thus have a greater chance of synthesizing a correct program.
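The effect of temperature on the sampling distribution can be seen in a few lines. This is a minimal sketch: the function name is hypothetical and the logits are made-up scores for three candidate tokens, not model outputs.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into sampling probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.2)  # low temperature
flat = softmax_with_temperature(logits, 2.0)   # high temperature
```

With the low temperature almost all probability mass sits on the top-scoring token, while the high temperature spreads it across the alternatives, which is the exploration effect described in the remark.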
+
+
+ For our experiment, we compute pass@1, pass@10 and pass@20, each corresponding to the unit test pass rate when selecting respectively 1, 10 and 20 samples from the candidate solutions.
+
+ ```
+
+ Results: {'pass@1': 0.1, 'pass@10': 0.7631, 'pass@20': 1.0}
+
+ ```
+
+ If we take a closer look at the unit test results for each candidate solution, we find that 2 passed the unit test. This means that we have 2 correct solutions among 20, which corresponds to our pass@1 value `2/20 = 0.1`. The scores pass@10 and pass@20 are higher, because the more samples we select from the candidate completions, the more likely we are to include the correct implementation. As for pass@20, it is `1`, since selecting all 20 candidates always includes a correct solution, which gives a 100% success rate.
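These numbers can be checked with the unbiased estimator from the Codex paper, `pass@k = 1 - C(n - c, k) / C(n, k)`, where `n` is the number of samples and `c` the number that pass. The sketch below is a plain-Python version for a single problem, not the `code_eval` implementation; the function name is hypothetical.

```python
import math


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: 1 - C(n - c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, which yields pass@k = 1.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# 20 samples, 2 of which pass the unit tests.
scores = {f"pass@{k}": round(pass_at_k(20, 2, k), 4) for k in (1, 10, 20)}
```

For `n=20` and `c=2` this gives 0.1 for pass@1 and 1.0 for pass@20, and a pass@10 that agrees with the value reported above up to rounding in the last digit.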
evaluation/eval_table.md ADDED
@@ -0,0 +1,26 @@
+ Table 1 below shows the HumanEval scores of CodeParrot, InCoder, PolyCoder, CodeGen and Codex (not open-source).
+
+ <div align="center">
+
+ |Model | pass@1 | pass@10 | pass@100|
+ |-------|--------|---------|---------|
+ |CodeParrot (110M) | 3.80% | 6.57% | 12.78% |
+ |CodeParrot (1.5B) | 3.58% | 8.03% | 14.96% |
+ |CodeParrot (1.5B) | 3.99% | 8.69% | 17.88% |
+ |||||
+ |InCoder (6.7B) | 15.2% | 27.8% | 47.00% |
+ |||||
+ |PolyCoder (160M)| 2.13% | 3.35% | 4.88% |
+ |PolyCoder (400M)| 2.96% | 5.29% | 11.59% |
+ |PolyCoder (2.7B)| 5.59% | 9.84% | 17.68% |
+ |||||
+ |CodeGen-Mono (350M)| 12.76% | 23.11% | 35.19% |
+ |CodeGen-Mono (2.7B)| 23.70% | 36.64% | 57.01% |
+ |CodeGen-Mono (6.1B)| 26.13% | 42.29% | 65.82% |
+ |CodeGen-Mono (16.1B)| **29.28%** | **49.86%** | **75.00%** |
+ |||||
+ |Codex (25M)| 3.21% | 7.1% | 12.89%|
+ |Codex (300M)| 13.17%| 20.37% | 36.27% |
+ |Codex (12B)| 28.81%| 46.81% | 72.31% |
+
+ </div>
evaluation/intro.md ADDED
@@ -0,0 +1,7 @@
+ A natural way to evaluate programs is to check whether they pass unit tests. This is the idea behind the [pass@k](https://huggingface.co/metrics/code_eval) metric, a popular evaluation framework for code generation models, computed on the [HumanEval](https://huggingface.co/datasets/openai_humaneval) dataset, which was introduced in the [Codex paper](https://arxiv.org/pdf/2107.03374v2.pdf). The dataset includes 164 handwritten programming problems. In the pass@k metric, k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported.
+ In most papers, 200 candidate program completions are sampled, and pass@1, pass@10, and pass@100 are computed using an unbiased sampling estimator.
+
+ This plot shows the pass@100 by model size, for CodeParrot, InCoder, PolyCoder, CodeGen and Codex (not open-source):
+ <p align="center">
+ <img src="https://huggingface.co/datasets/loubnabnl/repo-images/resolve/main/pass@100_figure.png" alt="drawing" width="550"/>
+ </p>
evaluation/problem.md ADDED
@@ -0,0 +1,9 @@
+ def truncate_number(number: float) -> float:
+     """ Given a positive floating point number, it can be decomposed into
+     and integer part (largest integer smaller than given number) and decimals
+     (leftover part always smaller than 1).
+     Return the decimal part of the number.
+     >>> truncate_number(3.5)
+     0.5
+     """
+
evaluation/solution.md ADDED
@@ -0,0 +1,13 @@
1
+ ```python
2
+
3
+ def truncate_number(number: float) -> float:
4
+     """ Given a positive floating point number, it can be decomposed into
5
+     and integer part (largest integer smaller than given number) and decimals
6
+     (leftover part always smaller than 1).
7
+
8
+     Return the decimal part of the number.
9
+     >>> truncate_number(3.5)
10
+     0.5
11
+     """
12
+     return number % 1
13
+ ```
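To illustrate what "passes the unit tests" means for this problem, here is a minimal sketch that executes a candidate completion and runs a few assertions against it. The helper name `passes_tests` and the test cases are assumptions made for illustration and are not the official HumanEval harness. A real evaluation would run untrusted model output in a sandbox rather than calling `exec` directly.

```python
# A candidate completion, as a model might return it (body only shown in full).
CANDIDATE = '''
def truncate_number(number: float) -> float:
    return number % 1
'''

# Illustrative checks only (assumed, not copied from the benchmark).
TESTS = '''
assert truncate_number(3.5) == 0.5
assert abs(truncate_number(1.25) - 0.25) < 1e-6
assert abs(truncate_number(123.0) - 0.0) < 1e-6
'''


def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate defines code that satisfies every assertion."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # run the assertions against it
    except Exception:
        # Any failure (syntax error, runtime error, failed assertion) counts as unsolved.
        return False
    return True
```

Under pass@k, a problem counts as solved when `passes_tests` would return `True` for at least one of the k sampled candidates.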
generation/intro.md ADDED
@@ -0,0 +1,4 @@
1
+ In this section you can prompt the following models to generate Python code: CodeParrot 1.5B, InCoder 1B and CodeGen 2B.
2
+
3
+ * For CodeGen, there are larger [models](https://huggingface.co/Salesforce/codegen-16B-mono) available on the 🤗 Hub with 6.1B and 16.1B parameters, but we use the 2B version to have models of comparable size in this demo. For InCoder too, there is a larger [model](https://huggingface.co/spaces/facebook/incoder-6B) with 6B parameters.
4
+ * For InCoder, you can also try the original [demo](https://huggingface.co/spaces/facebook/incoder-demo), which has more tasks and examples.
requirements.txt ADDED
@@ -0,0 +1,3 @@
1
+ git+https://github.com/huggingface/transformers
2
+ torch
3
+ protobuf~=3.19.0
utils/.ipynb_checkpoints/intro-checkpoint.txt ADDED
@@ -0,0 +1,5 @@
1
+ In this space you can compare some of the features of code generation models:
2
+ * Pretraining datasets
3
+ * Model Architecture
4
+ * Model evaluation
5
+ You can also test their code generation capabilities ✨.
utils/data_preview.csv ADDED
@@ -0,0 +1,408 @@
1
+ code,repo_name,path,language,license,size
2
+ "/* { dg-do compile } */
3
+ /* { dg-options ""-mavx512f -O2 -masm=att"" } */
4
+ /* { dg-final { scan-assembler-times ""vmovss\[ \\t\]+\\(%\[a-z0-9,]*\\), %xmm\[0-9\]+\{%k\[1-7\]\}(?:\n|\[ \\t\]+#)"" 1 } } */
5
+ /* { dg-final { scan-assembler-times ""vmovss\[ \\t\]+\\(%\[a-z0-9,]*\\), %xmm\[0-9\]+\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)"" 1 } } */
6
+ /* { dg-final { scan-assembler-times ""vmovss\[ \\t\]+%xmm\[0-9\]+, %xmm\[0-9\]+, %xmm\[0-9\]+\{%k\[1-7\]\}(?:\n|\[ \\t\]+#)"" 1 } } */
7
+ /* { dg-final { scan-assembler-times ""vmovss\[ \\t\]+%xmm\[0-9\]+, %xmm\[0-9\]+, %xmm\[0-9\]+\{%k\[1-7\]\}\{z\}(?:\n|\[ \\t\]+#)"" 1 } } */
8
+ /* { dg-final { scan-assembler-times ""vmovss\[ \\t\]+%xmm\[0-9\]+, \\(%\[a-z0-9,]*\\)\{%k\[1-7\]\}(?:\n|\[ \\t\]+#)"" 1 } } */
9
+
10
+ #include <immintrin.h>
11
+
12
+ volatile __m128 x1, x2, x3;
13
+ volatile __mmask8 m;
14
+ float *volatile p;
15
+
16
+ void extern
17
+ avx512f_test (void)
18
+ {
19
+ x1 = _mm_mask_load_ss (x1, m, p);
20
+ x1 = _mm_maskz_load_ss (m, p);
21
+ x1 = _mm_mask_move_ss (x1, m, x2, x3);
22
+ x1 = _mm_maskz_move_ss (m, x2, x3);
23
+ _mm_mask_store_ss (p, m, x1);
24
+ }
25
+ ",Gurgel100/gcc,gcc/testsuite/gcc.target/i386/avx512f-vmovss-1.c,C,gpl-2.0,1037
26
+ "from virtTrinity import picker
27
+ from virtTrinity.providers.virsh_cmd import data
28
+ from virtTrinity.providers.virsh_cmd.utils import virsh
29
+ from virtTrinity.providers.virsh_cmd.picker.command import CmdPicker
30
+
31
+
32
+ class OptSetPicker(picker.PickerBase):
33
+ depends_on = CmdPicker
34
+ data_type = data.VirshOptSet()
35
+
36
+ types = {
37
+ ""positive"": {
38
+ ""patterns"": None,
39
+ ""data_type"": data.OptSet(),
40
+ },
41
+ ""miss_dep"": {
42
+ ""patterns"": r""command '.*' requires .* option"",
43
+ ""data_type"": data.MissingDepOptSet(),
44
+ },
45
+ ""other"": {
46
+ ""patterns"": [
47
+ r""command '.*' doesn't support option --.*"",
48
+ # r""command or command group '.*' doesn't exist"",
49
+ ]
50
+ },
51
+ }
52
+
53
+ def prerequisite(self):
54
+ return self.test.cmd in virsh.commands
55
+
56
+ def apply(self, result):
57
+ self.test.options = result
58
+ ",Hao-Liu/virt-trinity,virtTrinity/providers/virsh_cmd/picker/optset.py,Python,gpl-2.0,913
59
+ "package com.suscipio_solutions.consecro_mud.Abilities.Spells;
60
+ import java.util.LinkedList;
61
+ import java.util.Vector;
62
+
63
+ import com.suscipio_solutions.consecro_mud.Abilities.interfaces.Ability;
64
+ import com.suscipio_solutions.consecro_mud.Common.interfaces.CMMsg;
65
+ import com.suscipio_solutions.consecro_mud.Items.interfaces.Item;
66
+ import com.suscipio_solutions.consecro_mud.Items.interfaces.Wearable;
67
+ import com.suscipio_solutions.consecro_mud.Locales.interfaces.Room;
68
+ import com.suscipio_solutions.consecro_mud.MOBS.interfaces.MOB;
69
+ import com.suscipio_solutions.consecro_mud.core.CMClass;
70
+ import com.suscipio_solutions.consecro_mud.core.CMLib;
71
+ import com.suscipio_solutions.consecro_mud.core.CMStrings;
72
+ import com.suscipio_solutions.consecro_mud.core.interfaces.Environmental;
73
+ import com.suscipio_solutions.consecro_mud.core.interfaces.Physical;
74
+
75
+
76
+ @SuppressWarnings(""rawtypes"")
77
+ public class Spell_SpyingStone extends Spell
78
+ {
79
+ @Override public String ID() { return ""Spell_SpyingStone""; }
80
+ private final static String localizedName = CMLib.lang().L(""Spying Stone"");
81
+ @Override public String name() { return localizedName; }
82
+ private final static String localizedStaticDisplay = CMLib.lang().L(""(Spying Stone)"");
83
+ @Override public String displayText() { return localizedStaticDisplay; }
84
+ @Override protected int canAffectCode(){return CAN_ITEMS;}
85
+ @Override protected int canTargetCode(){return Ability.CAN_ITEMS;}
86
+ @Override public int classificationCode(){return Ability.ACODE_SPELL|Ability.DOMAIN_DIVINATION;}
87
+ @Override public int abstractQuality(){ return Ability.QUALITY_INDIFFERENT;}
88
+
89
+ protected LinkedList<String> msgs=new LinkedList<String>();
90
+
91
+ @Override
92
+ public void executeMsg(final Environmental myHost, final CMMsg msg)
93
+ {
94
+ super.executeMsg(myHost, msg);
95
+ if((msg.targetMinor()==CMMsg.TYP_SPEAK)
96
+ &&((msg.source()==invoker())
97
+ ||((invoker()!=null) && msg.source().Name().equalsIgnoreCase(invoker().Name())))
98
+ &&(msg.target()==affected)
99
+ &&(msg.sourceMessage().toUpperCase().indexOf(""SPEAK"")>=0))
100
+ {
101
+ final Room room=CMLib.map().roomLocation(affected);
102
+ if(room!=null)
103
+ {
104
+ final StringBuilder str=new StringBuilder("""");
105
+ for(final String m : msgs)
106
+ str.append(m).append(""\n\r"");
107
+ if(str.length()==0) str.append(L(""Nothing!""));
108
+ room.showHappens(CMMsg.MSG_SPEAK, affected,L(""^S<S-NAME> grow(s) a mouth and say(s) '^N@x1^S'^N"",str.toString()));
109
+ msgs.clear();
110
+ }
111
+ }
112
+ else
113
+ if((msg.othersCode()!=CMMsg.NO_EFFECT)
114
+ &&(msg.othersMessage()!=null)
115
+ &&(msg.othersMessage().length()>0))
116
+ msgs.add(CMLib.coffeeFilter().fullOutFilter(null, null, msg.source(), msg.target(), msg.tool(), CMStrings.removeColors(msg.othersMessage()), false));
117
+ }
118
+
119
+ @Override
120
+ public boolean invoke(MOB mob, Vector commands, Physical givenTarget, boolean auto, int asLevel)
121
+ {
122
+ final Physical target=getTarget(mob,mob.location(),givenTarget,commands,Wearable.FILTER_ANY);
123
+ if(target==null) return false;
124
+
125
+ if(!(target instanceof Item))
126
+ {
127
+ mob.tell(L(""You can't cast this spell on that.""));
128
+ return false;
129
+ }
130
+
131
+ if(target.fetchEffect(this.ID())!=null)
132
+ {
133
+ mob.tell(L(""@x1 is already a spying stone!"",target.name(mob)));
134
+ return false;
135
+ }
136
+
137
+ if(!super.invoke(mob,commands,givenTarget,auto,asLevel))
138
+ return false;
139
+
140
+ final boolean success=proficiencyCheck(mob,0,auto);
141
+
142
+ if(success)
143
+ {
144
+ final CMMsg msg=CMClass.getMsg(mob,target,this,verbalCastCode(mob,target,auto),auto?"""":L(""^S<S-NAME> point(s) <S-HIS-HER> finger at <T-NAMESELF>, incanting.^?""));
145
+ if(mob.location().okMessage(mob,msg))
146
+ {
147
+ mob.location().send(mob,msg);
148
+ beneficialAffect(mob,target,asLevel,0);
149
+ mob.location().show(mob,target,CMMsg.MSG_OK_VISUAL,L(""<T-NAME> open(s) a pair of strange eyes, which become transluscent.""));
150
+ }
151
+ }
152
+ else
153
+ beneficialWordsFizzle(mob,target,L(""<S-NAME> point(s) at <T-NAMESELF>, incanting, but nothing happens.""));
154
+
155
+
156
+ // return whether it worked
157
+ return success;
158
+ }
159
+ }
160
+ ",ConsecroMUD/ConsecroMUD,com/suscipio_solutions/consecro_mud/Abilities/Spells/Spell_SpyingStone.java,Java,apache-2.0,3919
161
+ "# -*- encoding: utf-8 -*-
162
+ '''
163
+ HubbleStack Nebula-to-Splunk returner
164
+
165
+ Deliver HubbleStack Nebula query data into Splunk using the HTTP
166
+ event collector. Required config/pillar settings:
167
+
168
+ .. code-block:: yaml
169
+
170
+ hubblestack:
171
+ returner:
172
+ splunk:
173
+ - token: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
174
+ indexer: splunk-indexer.domain.tld
175
+ index: hubble
176
+ sourcetype_nebula: hubble_osquery
177
+
178
+ You can also add a `custom_fields` argument which is a list of keys to add to
179
+ events with using the results of config.get(<custom_field>). These new keys
180
+ will be prefixed with 'custom_' to prevent conflicts. The values of these keys
181
+ should be strings or lists (will be sent as CSV string), do not choose grains
182
+ or pillar values with complex values or they will be skipped.
183
+
184
+ Additionally, you can define a fallback_indexer which will be used if a default
185
+ gateway is not defined.
186
+
187
+ .. code-block:: yaml
188
+
189
+ hubblestack:
190
+ returner:
191
+ splunk:
192
+ - token: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
193
+ indexer: splunk-indexer.domain.tld
194
+ index: hubble
195
+ sourcetype_nebula: hubble_osquery
196
+ fallback_indexer: splunk-indexer.loc.domain.tld
197
+ custom_fields:
198
+ - site
199
+ - product_group
200
+ '''
201
+ import socket
202
+
203
+ # Imports for http event forwarder
204
+ import requests
205
+ import json
206
+ import time
207
+ from datetime import datetime
208
+ from hubblestack.hec import http_event_collector, get_splunk_options, make_hec_args
209
+
210
+ import logging
211
+
212
+ _max_content_bytes = 100000
213
+ http_event_collector_debug = False
214
+ RETRY = False
215
+
216
+ log = logging.getLogger(__name__)
217
+
218
+
219
+ def returner(ret):
220
+ try:
221
+ opts_list = get_splunk_options( sourcetype_nebula='hubble_osquery',
222
+ add_query_to_sourcetype=True, _nick={'sourcetype_nebula': 'sourcetype'})
223
+
224
+ for opts in opts_list:
225
+ logging.debug('Options: %s' % json.dumps(opts))
226
+ custom_fields = opts['custom_fields']
227
+
228
+ # Set up the fields to be extracted at index time. The field values must be strings.
229
+ # Note that these fields will also still be available in the event data
230
+ index_extracted_fields = []
231
+ try:
232
+ index_extracted_fields.extend(__opts__.get('splunk_index_extracted_fields', []))
233
+ except TypeError:
234
+ pass
235
+
236
+ # Set up the collector
237
+ args, kwargs = make_hec_args(opts)
238
+ hec = http_event_collector(*args, **kwargs)
239
+
240
+ # st = 'salt:hubble:nova'
241
+ data = ret['return']
242
+ minion_id = ret['id']
243
+ jid = ret['jid']
244
+ global RETRY
245
+ RETRY = ret['retry']
246
+ master = __grains__['master']
247
+ fqdn = __grains__['fqdn']
248
+ # Sometimes fqdn is blank. If it is, replace it with minion_id
249
+ fqdn = fqdn if fqdn else minion_id
250
+ try:
251
+ fqdn_ip4 = __grains__.get('local_ip4')
252
+ if not fqdn_ip4:
253
+ fqdn_ip4 = __grains__['fqdn_ip4'][0]
254
+ except IndexError:
255
+ try:
256
+ fqdn_ip4 = __grains__['ipv4'][0]
257
+ except IndexError:
258
+ raise Exception('No ipv4 grains found. Is net-tools installed?')
259
+ if fqdn_ip4.startswith('127.'):
260
+ for ip4_addr in __grains__['ipv4']:
261
+ if ip4_addr and not ip4_addr.startswith('127.'):
262
+ fqdn_ip4 = ip4_addr
263
+ break
264
+ local_fqdn = __grains__.get('local_fqdn', __grains__['fqdn'])
265
+
266
+ # Sometimes fqdn reports a value of localhost. If that happens, try another method.
267
+ bad_fqdns = ['localhost', 'localhost.localdomain', 'localhost6.localdomain6']
268
+ if fqdn in bad_fqdns:
269
+ new_fqdn = socket.gethostname()
270
+ if '.' not in new_fqdn or new_fqdn in bad_fqdns:
271
+ new_fqdn = fqdn_ip4
272
+ fqdn = new_fqdn
273
+
274
+ # Get cloud details
275
+ cloud_details = __grains__.get('cloud_details', {})
276
+
277
+ if not data:
278
+ return
279
+ else:
280
+ for query in data:
281
+ for query_name, query_results in query.iteritems():
282
+ if 'data' not in query_results:
283
+ query_results['data'] = [{'error': 'result missing'}]
284
+ for query_result in query_results['data']:
285
+ event = {}
286
+ payload = {}
287
+ event.update(query_result)
288
+ event.update({'query': query_name})
289
+ event.update({'job_id': jid})
290
+ event.update({'master': master})
291
+ event.update({'minion_id': minion_id})
292
+ event.update({'dest_host': fqdn})
293
+ event.update({'dest_ip': fqdn_ip4})
294
+ event.update({'dest_fqdn': local_fqdn})
295
+ event.update({'system_uuid': __grains__.get('system_uuid')})
296
+
297
+ event.update(cloud_details)
298
+
299
+ for custom_field in custom_fields:
300
+ custom_field_name = 'custom_' + custom_field
301
+ custom_field_value = __salt__['config.get'](custom_field, '')
302
+ if isinstance(custom_field_value, (str, unicode)):
303
+ event.update({custom_field_name: custom_field_value})
304
+ elif isinstance(custom_field_value, list):
305
+ custom_field_value = ','.join(custom_field_value)
306
+ event.update({custom_field_name: custom_field_value})
307
+
308
+ payload.update({'host': fqdn})
309
+ payload.update({'index': opts['index']})
310
+ if opts['add_query_to_sourcetype']:
311
+ payload.update({'sourcetype': ""%s_%s"" % (opts['sourcetype'], query_name)})
312
+ else:
313
+ payload.update({'sourcetype': opts['sourcetype']})
314
+
315
+ # Remove any empty fields from the event payload
316
+ remove_keys = [k for k in event if event[k] == """"]
317
+ for k in remove_keys:
318
+ del event[k]
319
+
320
+ payload.update({'event': event})
321
+
322
+ # Potentially add metadata fields:
323
+ fields = {}
324
+ for item in index_extracted_fields:
325
+ if item in payload['event'] and not isinstance(payload['event'][item], (list, dict, tuple)):
326
+ fields[""meta_%s"" % item] = str(payload['event'][item])
327
+ if fields:
328
+ payload.update({'fields': fields})
329
+
330
+ # If the osquery query includes a field called 'time' it will be checked.
331
+ # If it's within the last year, it will be used as the eventtime.
332
+ event_time = query_result.get('time', '')
333
+ try:
334
+ if (datetime.fromtimestamp(time.time()) - datetime.fromtimestamp(float(event_time))).days > 365:
335
+ event_time = ''
336
+ except Exception:
337
+ event_time = ''
338
+ finally:
339
+ hec.batchEvent(payload, eventtime=event_time)
340
+
341
+ hec.flushBatch()
342
+ except Exception:
343
+ log.exception('Error ocurred in splunk_nebula_return')
344
+ return
345
+ ",basepi/hubble,hubblestack/extmods/returners/splunk_nebula_return.py,Python,apache-2.0,7889
346
+ "// Copyright (c) 2012 The Chromium Authors. All rights reserved.
347
+ // Use of this source code is governed by a BSD-style license that can be
348
+ // found in the LICENSE file.
349
+
350
+ #ifndef CHROME_BROWSER_UI_VIEWS_TAB_ICON_VIEW_MODEL_H_
351
+ #define CHROME_BROWSER_UI_VIEWS_TAB_ICON_VIEW_MODEL_H_
352
+
353
+ namespace ui {
354
+ class ImageModel;
355
+ } // namespace ui
356
+
357
+ // Classes implement this interface to provide state for the TabIconView.
358
+ class TabIconViewModel {
359
+ public:
360
+ // Returns true if the TabIconView should show a loading animation.
361
+ virtual bool ShouldTabIconViewAnimate() const = 0;
362
+
363
+ // Returns the favicon to display in the icon view
364
+ virtual ui::ImageModel GetFaviconForTabIconView() = 0;
365
+
366
+ protected:
367
+ virtual ~TabIconViewModel() {}
368
+ };
369
+
370
+ #endif // CHROME_BROWSER_UI_VIEWS_TAB_ICON_VIEW_MODEL_H_
371
+ ",ric2b/Vivaldi-browser,chromium/chrome/browser/ui/views/tab_icon_view_model.h,C,bsd-3-clause,784
372
+ "//
373
+ // HealthKit.h
374
+ // HealthKit
375
+ //
376
+ // Copyright (c) 2013-2014 Apple Inc. All rights reserved.
377
+ //
378
+
379
+ #import <HealthKit/HKActivitySummary.h>
380
+ #import <HealthKit/HKActivitySummaryQuery.h>
381
+ #import <HealthKit/HKAnchoredObjectQuery.h>
382
+ #import <HealthKit/HKCategorySample.h>
383
+ #import <HealthKit/HKCorrelation.h>
384
+ #import <HealthKit/HKCorrelationQuery.h>
385
+ #import <HealthKit/HKDefines.h>
386
+ #import <HealthKit/HKDeletedObject.h>
387
+ #import <HealthKit/HKDevice.h>
388
+ #import <HealthKit/HKHealthStore.h>
389
+ #import <HealthKit/HKMetadata.h>
390
+ #import <HealthKit/HKObject.h>
391
+ #import <HealthKit/HKObjectType.h>
392
+ #import <HealthKit/HKObserverQuery.h>
393
+ #import <HealthKit/HKQuantity.h>
394
+ #import <HealthKit/HKQuantitySample.h>
395
+ #import <HealthKit/HKQuery.h>
396
+ #import <HealthKit/HKSample.h>
397
+ #import <HealthKit/HKSampleQuery.h>
398
+ #import <HealthKit/HKSource.h>
399
+ #import <HealthKit/HKSourceQuery.h>
400
+ #import <HealthKit/HKSourceRevision.h>
401
+ #import <HealthKit/HKStatistics.h>
402
+ #import <HealthKit/HKStatisticsCollectionQuery.h>
403
+ #import <HealthKit/HKStatisticsQuery.h>
404
+ #import <HealthKit/HKTypeIdentifiers.h>
405
+ #import <HealthKit/HKUnit.h>
406
+ #import <HealthKit/HKWorkout.h>
407
+ #import <HealthKit/HKWorkoutSession.h>
408
+ ",rweichler/cylinder,deps/iPhoneOS9.3.sdk/System/Library/Frameworks/HealthKit.framework/Headers/HealthKit.h,C,mit,1159
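The preview file above stores whole source files in the `code` column, so each record spans many lines and inner double quotes are doubled (`""`). A minimal sketch of reading such a file with Python's standard `csv` module is shown below. The sample record, repository name and the helper `read_preview` are hypothetical and only mirror the column layout of `utils/data_preview.csv`.

```python
import csv
import io

# One hypothetical record with the same columns as utils/data_preview.csv.
# The code field is quoted, contains newlines, and escapes quotes by doubling them.
SAMPLE = (
    'code,repo_name,path,language,license,size\n'
    '"x = ""hi""\nprint(x)\n",example/repo,src/hello.py,Python,mit,18\n'
)


def read_preview(text: str):
    """Parse the CSV text into a list of dicts keyed by the header row."""
    # newline="" leaves line endings untouched so the csv module can handle
    # newlines embedded inside quoted fields.
    return list(csv.DictReader(io.StringIO(text, newline="")))


rows = read_preview(SAMPLE)
```

For a file on disk, the same reader can be applied to `open(path, newline="")` instead of the in-memory buffer.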
utils/examples.json ADDED
@@ -0,0 +1,43 @@
1
+ [
2
+ {
3
+ "name": "Hello World!",
4
+ "value": "def print_hello_world():\n \"\"\"Print 'Hello World!'.\"\"\"",
5
+ "length": 8
6
+ },
7
+ {
8
+ "name": "Scikit-Learn",
9
+ "value": "import numpy as np\nfrom sklearn.ensemble import RandomForestClassifier\n\n# create training data\nX = np.random.randn(100, 100)\ny = np.random.randint(0, 2, 100)\n\n# setup train test split",
10
+ "length": 52
11
+ },
12
+ {
13
+ "name": "Transformers",
14
+ "value": "from transformers import AutoTokenizer, AutoModelForSequenceClassification\n\n# build a BERT classifier",
15
+ "length": 48
16
+ },
17
+ {
18
+ "name": "Count words",
19
+ "value": "def count_words(filename):\n \"\"\"Count the number of occurrences of each word in the file\"\"\"",
20
+ "length": 48
21
+ },
22
+ {
23
+ "name": "Is e in L",
24
+ "value": "def is_in_list(L, e):\n \"\"\"Find if list L contains the element e.\"\"\"",
25
+ "length": 32
26
+ },
27
+ {
28
+ "name": "unittest",
29
+ "value": "def is_even(value):\n \"\"\"Returns True if value is an even number.\"\"\"\n return value % 2 == 0\n\n# setup unit tests for is_even\nimport unittest",
30
+ "length": 52
31
+
32
+ },
33
+ {
34
+ "name": "Pizza Problem",
35
+ "value": "def exercise():\n \"\"\"Marie ordered one chicken meal that costs 12 dollars, and some boxes of pizza. Marie paid a total of 44 dollars. How many boxes of pizza did Marie order if each box costs 8 dollars?\"\"\"",
36
+ "length": 52
37
+ },
38
+ {
39
+ "name": "Pandas",
40
+ "value": "# load dataframe from csv\ndf = pd.read_csv(filename)\n\n# columns: \"age_group\", \"income\"\n# calculate average income per age group",
41
+ "length": 16
42
+ }
43
+ ]
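Each entry in `utils/examples.json` pairs a display name with a prompt and a number. A minimal sketch of loading such entries is given below. It assumes that `value` is the prompt text and that `length` is a suggested number of tokens to generate; the diff itself does not state this, and the helper `load_examples` is hypothetical.

```python
import json

# Two entries in the same shape as utils/examples.json.
# A raw string keeps the JSON escape sequences (\n, \") intact for json.loads.
EXAMPLES_JSON = r'''
[
  {
    "name": "Hello World!",
    "value": "def print_hello_world():\n    \"\"\"Print 'Hello World!'.\"\"\"",
    "length": 8
  },
  {
    "name": "Is e in L",
    "value": "def is_in_list(L, e):\n    \"\"\"Find if list L contains the element e.\"\"\"",
    "length": 32
  }
]
'''


def load_examples(text: str) -> dict:
    """Map each example name to its (prompt, length) pair."""
    return {ex["name"]: (ex["value"], ex["length"]) for ex in json.loads(text)}


examples = load_examples(EXAMPLES_JSON)
prompt, length = examples["Hello World!"]
```

Keying the examples by name makes it straightforward to populate a selection widget and look up the matching prompt when the user picks an entry.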
utils/intro.md ADDED
@@ -0,0 +1,9 @@
1
+ ## Introduction
2
+
3
+ The application of language models to code generation has sparked great interest recently. You have probably heard of [Codex](https://arxiv.org/pdf/2107.03374v2.pdf), the model behind [GitHub Copilot](https://copilot.github.com/), or [AlphaCode](https://www.deepmind.com/blog/competitive-programming-with-alphacode) for competition-level programming. These models aren't open-source, and it is hard to reproduce them with a limited budget and incomplete information about their training. Luckily, the ML community has contributed some code models to allow for further research.
4
+
5
+ However, it can be easy to get lost between models. At Hugging Face we aim to democratize ML and centralize all information in the 🤗 ecosystem to make the usage of open-source tools easier and more efficient. Code models aren't an exception: you can find all open-source models on the Hub, along with several code datasets and evaluation metrics. In this blog we will give an overview of these tools and how to use them.
6
+
7
+ <p align="center">
8
+ <img src="https://huggingface.co/datasets/loubnabnl/repo-images/resolve/main/pipeline.png" alt="drawing" width="550"/>
9
+ </p>
utils/resources.md ADDED
@@ -0,0 +1,6 @@
1
+ - Natural Language Processing with Transformers [Tunstall et al., 2022](https://www.oreilly.com/library/view/natural-language-processing/9781098103231/).
2
+ - Evaluating Large Language Models Trained on Code [Chen et al., 2021](https://arxiv.org/abs/2107.03374).
3
+ - Competition-Level Code Generation with AlphaCode [Li et al., 2022](https://arxiv.org/abs/2203.07814).
4
+ - InCoder: A Generative Model for Code Infilling and Synthesis [Fried et al., 2022](https://arxiv.org/abs/2204.05999).
5
+ - A Conversational Paradigm for Program Synthesis [Nijkamp et al., 2022](https://arxiv.org/abs/2203.13474).
6
+ - A Systematic Evaluation of Large Language Models of Code [Xu et al., 2022](https://arxiv.org/abs/2202.13169).
utils/summary.md ADDED
@@ -0,0 +1,7 @@
1
+ This is an **interactive** blog that provides an overview of open-source language models for code generation. This post presents:
2
+ * code datasets
3
+ * model architecture
4
+ * model evaluation
5
+
6
+ We also give examples and tips for using the 🤗 Hub for this task.
7
+ At the end of this blog, you will find a **demo** to test and compare code generation across multi-billion parameter code models directly in the browser! Here's a small teaser ✨:
utils/table_contents.md ADDED
@@ -0,0 +1,18 @@
1
+ ### 📖 Table of contents 📖
2
+
3
+ 1 - Code datasets
4
+
5
+ 2 - Model architecture
6
+
7
+ 3 - Model evaluation
8
+
9
+ 4 - Code generation
10
+
11
+ For each section, you can choose to view information about four code generation models:
12
+
13
+ * [CodeParrot](https://huggingface.co/lvwerra/codeparrot)
14
+ * [InCoder](https://huggingface.co/facebook/incoder-6B)
15
+ * [CodeGen](https://huggingface.co/Salesforce/codegen-16B-mono)
16
+ * [PolyCoder](https://github.com/vhellendoorn/code-lms)
17
+
18
+ In section 4, you get to prompt the models and test their **code generation** capabilities ✨!