add SC2 and stack v2 (#6)
- add SC2 and stack v2 (0faf2d62e3e90853afef8ac575c321f7b2a99900)
- Update README.md (b999cd3ce6eee6ac24a1081104e871f11964a170)
- Update README.md (9fa76a4a6877c0f95179f43a060106511ba931cc)
- Update README.md (499d29517c1c6b96f293c86604c973a65e773bbf)
- Update README.md (2ae84d12d083027e2ab6ae4c112283425e543b76)
Co-authored-by: Loubna Ben Allal <loubnabnl@users.noreply.huggingface.co>
README.md CHANGED
@@ -19,6 +19,39 @@ pinned: false
BigCode is an open scientific collaboration working on responsible training of large language models for coding applications. You can find more information on the main [website](https://www.bigcode-project.org/) or follow BigCode on [Twitter](https://twitter.com/BigCodeProject). In this organization you can find the artifacts of this collaboration: **StarCoder**, a state-of-the-art language model for code, **OctoPack**, artifacts for instruction tuning large code models, **The Stack**, the largest available pretraining dataset with permissive code, and **SantaCoder**, a 1.1B-parameter model for code.

+---
+<details>
+<summary>
+<b><font size="+1">💫StarCoder 2</font></b>
+</summary>
+StarCoder2 models are a series of 3B, 7B, and 15B models trained on 3.3 to 4.3 trillion tokens of code from The Stack v2 dataset, which spans over 600 programming languages. The models use grouped-query attention (GQA), a context window of 16,384 tokens with sliding-window attention of 4,096 tokens, and were trained with the Fill-in-the-Middle objective (see the usage sketch after this section).
+
+### Models
+- [Paper](https://drive.google.com/file/d/17iGn3c-sYNiLyRSY-A85QOzgzGnGiVI3/view): A technical report about StarCoder2.
+- [GitHub](https://github.com/bigcode-project/starcoder2): All you need to know about using or fine-tuning StarCoder2.
+- [StarCoder2-15B](https://huggingface.co/bigcode/starcoder2-15b): 15B model trained on 600+ programming languages and 4.3T tokens.
+- [StarCoder2-7B](https://huggingface.co/bigcode/starcoder2-7b): 7B model trained on 17 programming languages for 3.7T tokens.
+- [StarCoder2-3B](https://huggingface.co/bigcode/starcoder2-3b): 3B model trained on 17 programming languages for 3.3T tokens.
+
+### Data & Governance
+- [Governance Card](https://huggingface.co/datasets/bigcode/governance-card): A card outlining the governance of the model.
+- [StarCoder2 License Agreement](https://huggingface.co/spaces/bigcode/bigcode-model-license-agreement): The model is licensed under the BigCode OpenRAIL-M v1 license agreement.
+- [The Stack train smol](https://huggingface.co/datasets/bigcode/the-stack-v2-train-smol-ids): The Software Heritage identifiers for the training dataset of StarCoder2-3B and StarCoder2-7B, with 600B+ unique tokens.
+- [The Stack train full](https://huggingface.co/datasets/bigcode/the-stack-v2-train-full-ids): The Software Heritage identifiers for the training dataset of StarCoder2-15B, with 900B+ unique tokens.
+- [StarCoder2 Search](https://huggingface.co/spaces/bigcode/search-v2): Full-text search over code in the pretraining dataset.
+- [StarCoder2 Membership Test](https://stack-dev.dataportraits.org/): Blazing-fast check of whether code was present in the pretraining dataset.
+</details>
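
As a quick illustration of the checkpoints listed above, here is a minimal usage sketch, assuming the `transformers` library; the FIM token names follow the original StarCoder convention and should be verified against the StarCoder2 tokenizer:

```python
# Minimal sketch: generate with StarCoder2-3B, including a Fill-in-the-Middle
# prompt. Assumes `pip install transformers torch`; the <fim_*> token names
# are the StarCoder convention and should be checked on the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Plain left-to-right completion.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=48)
print(tokenizer.decode(outputs[0]))

# Fill-in-the-Middle: the model fills the span between prefix and suffix.
fim_prompt = "<fim_prefix>def area(r):\n    return <fim_suffix>\n<fim_middle>"
inputs = tokenizer(fim_prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0]))
```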
+---
+<details>
+<summary>
+<b><font size="+1">📑The Stack v2</font></b>
+</summary>
+The Stack v2 is a 67.5TB dataset of source code in over 600 programming languages with permissive licenses or no license.
+
+- [The Stack v2](https://huggingface.co/datasets/bigcode/the-stack-v2): Exact-deduplicated version of The Stack v2.
+- [The Stack v2 dedup](https://huggingface.co/datasets/bigcode/the-stack-v2-dedup): Near-deduplicated version of The Stack v2 (recommended for training).
+- [Am I in the Stack](https://huggingface.co/spaces/bigcode/in-the-stack): Check whether your data is in The Stack and request opt-out.
+</details>
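
To peek at the data, a minimal streaming sketch with the `datasets` library; the per-language config name and the row schema are assumptions to verify on the dataset card (the dataset is gated, and per the training-data links above, The Stack v2 is distributed as Software Heritage identifiers plus metadata rather than inline file contents):

```python
# Minimal sketch: stream a few rows of The Stack v2.
# Assumptions: a per-language config such as "Python" exists, and rows hold
# Software Heritage blob identifiers plus repo metadata; see the dataset card
# for how to resolve blobs into file contents. Requires accepting the gating
# terms and logging in with an HF token.
from datasets import load_dataset

ds = load_dataset("bigcode/the-stack-v2", "Python", split="train", streaming=True)
for row in ds.take(3):
    print(row)  # inspect the available fields (e.g. blob_id, repo metadata)
```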
---
<details>
<summary>

@@ -72,12 +105,10 @@
<summary>
<b><font size="+1">📑The Stack</font></b>
</summary>
-The Stack is a 6.4TB of source code in 358 programming languages from permissive licenses.
+The Stack v1 is a 6.4TB dataset of source code in 358 programming languages from permissive licenses.

- [The Stack](https://huggingface.co/datasets/bigcode/the-stack): Exact-deduplicated version of The Stack.
- [The Stack dedup](https://huggingface.co/datasets/bigcode/the-stack-dedup): Near-deduplicated version of The Stack (recommended for training).
-- [The Stack issues](https://huggingface.co/datasets/bigcode/the-stack-github-issues): Collection of GitHub issues.
-- [The Stack Metadata](https://huggingface.co/datasets/bigcode/the-stack-metadata): Metadata of the repositories in The Stack.
- [Am I in the Stack](https://huggingface.co/spaces/bigcode/in-the-stack): Check whether your data is in The Stack and request opt-out.
</details>
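
Likewise for v1, a minimal streaming sketch with the `datasets` library; the `data/<language>` layout and the `content` column are assumptions to verify on the dataset card:

```python
# Minimal sketch: stream only the Python subset of The Stack v1 instead of
# downloading all 6.4TB. The data_dir layout and the `content` field are
# assumptions to verify; the dataset is gated, so accept the terms and log in.
from datasets import load_dataset

ds = load_dataset("bigcode/the-stack", data_dir="data/python", split="train", streaming=True)
for row in ds.take(1):
    print(row["content"][:200])
```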
---