title: README
emoji: 
colorFrom: gray
colorTo: red
sdk: static
pinned: false

BigCode

BigCode is an open scientific collaboration working on the responsible training of large language models for coding applications. You can find more information on the main website or follow BigCode on Twitter. In this organization you can find the artifacts of this collaboration: StarCoder, a state-of-the-art language model for code; OctoPack, artifacts for instruction tuning large code models; The Stack, the largest available pretraining dataset of permissively licensed code; and SantaCoder, a 1.1B parameter model for code.


💫StarCoder 2

StarCoder2 models are a series of 3B, 7B, and 15B models trained on 3.3 to 4.3 trillion tokens of code from The Stack v2 dataset. The 15B model covers over 600 programming languages; the 3B and 7B models cover 17. The models use grouped-query attention (GQA), a context window of 16,384 tokens with sliding window attention of 4,096 tokens, and were trained with the Fill-in-the-Middle objective.
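To make the sliding window attention mentioned above concrete, here is a minimal sketch of a causal mask restricted to a fixed window. It is an illustration only, not the StarCoder2 implementation: the function name is hypothetical, the toy sizes stand in for the real 16,384 / 4,096 values, and it assumes the window counts the current token (conventions differ between implementations).

```python
from typing import List


def sliding_window_causal_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Return mask[i][j] == True when query position i may attend to key j.

    Assumption: the window includes the current token, so each position
    sees itself plus at most (window - 1) earlier positions.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Toy sizes for readability; the text above gives 16,384 and 4,096.
mask = sliding_window_causal_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each row has at most `window` allowed positions, so the attention cost per token stays bounded as the sequence grows toward the full context length.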

Models

  • Paper: A technical report about StarCoder2.
  • GitHub: All you need to know about using or fine-tuning StarCoder2.
  • StarCoder2-15B: 15B model trained on 600+ programming languages and 4.3T tokens.
  • StarCoder2-7B: 7B model trained on 17 programming languages and 3.7T tokens.
  • StarCoder2-3B: 3B model trained on 17 programming languages and 3.3T tokens.

Data & Governance


📑The Stack v2

The Stack v2 is a 67.5TB dataset of source code in over 600 programming languages with permissive licenses or no license.

💫StarCoder

StarCoder is a 15.5B-parameter language model for code trained on 1T tokens across 80+ programming languages. It uses multi-query attention (MQA) for efficient generation, has a context window of 8,192 tokens, and supports fill-in-the-middle.
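As a rough illustration of fill-in-the-middle, the sketch below assembles a prompt from the code before and after a gap, so that a model would generate the missing middle. The helper name is hypothetical, and the sentinel strings are an assumption about the tokenizer's special tokens; check the tokenizer of the checkpoint you use, since other models in the family may spell them differently.

```python
# Assumed sentinel tokens; verify against the tokenizer's special tokens.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Arrange prefix and suffix so the model continues with the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


prompt = build_fim_prompt(
    prefix="def print_hello_world():\n    ",
    suffix="\n    print('Hello world!')",
)
print(prompt)
```

The resulting string would be passed to the model like any other prompt; the text generated after the final sentinel is the proposed infill.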

Models

  • Paper: A technical report about StarCoder.
  • GitHub: All you need to know about using or fine-tuning StarCoder.
  • StarCoder: StarCoderBase further trained on Python.
  • StarCoderBase: Trained on 80+ languages from The Stack.
  • StarCoder+: StarCoderBase further trained on English web data.
  • StarEncoder: Encoder model trained on The Stack.
  • StarPii: StarEncoder-based PII detector.

Tools & Demos

Data & Governance


🐙OctoPack

OctoPack consists of data, evals & models relating to Code LLMs that follow human instructions.
  • Paper: Research paper with details about all components of OctoPack.
  • GitHub: All code used for the creation of OctoPack.
  • CommitPack: 4TB of Git commits.
  • Am I in the CommitPack: Check if your code is in the CommitPack.
  • CommitPackFT: 2GB of high-quality Git commits that resemble instructions.
  • HumanEvalPack: Benchmark for Code Fixing/Explaining/Synthesizing across Python/JavaScript/Java/Go/C++/Rust.
  • OctoCoder: Instruction tuned model of StarCoder by training on CommitPackFT.
  • OctoCoder Demo: Play with OctoCoder.
  • OctoGeeX: Instruction tuned model of CodeGeeX2 by training on CommitPackFT.

📑The Stack

The Stack v1 is a 6.4TB dataset of source code in 358 programming languages with permissive licenses.

  • The Stack: Exact deduplicated version of The Stack.
  • The Stack dedup: Near deduplicated version of The Stack (recommended for training).
  • Am I in the Stack: Check if your data is in The Stack and request opt-out.
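
The difference between the exact and near deduplicated versions can be sketched as follows. This is a toy illustration under stated assumptions, not the pipeline used to build The Stack: the function names, the whitespace tokenization, the pairwise Jaccard comparison, and the 0.7 threshold are all hypothetical choices made for readability.

```python
import hashlib
from typing import List, Set


def exact_dedup(files: List[str]) -> List[str]:
    """Keep the first copy of each byte-identical file."""
    seen: Set[str] = set()
    kept: List[str] = []
    for text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the whitespace-separated token sets."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def near_dedup(files: List[str], threshold: float = 0.7) -> List[str]:
    """Drop a file if it is too similar to any file already kept.

    The threshold is illustrative; this pairwise loop is quadratic and
    only suitable for tiny examples.
    """
    kept: List[str] = []
    for text in files:
        if all(jaccard(text, other) < threshold for other in kept):
            kept.append(text)
    return kept


files = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a + b\n",         # byte-identical copy
    "def add(a, b):\n    return a + b  # sum\n",  # small edit of the first file
    "def mul(a, b):\n    return a * b\n",
]
print(len(exact_dedup(files)), len(near_dedup(files)))
```

Exact deduplication removes only the byte-identical copy, while near deduplication also drops the lightly edited variant, which is why a near deduplicated set is smaller and usually preferable for training.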

🎅SantaCoder

SantaCoder, aka smol StarCoder, has the same architecture but was trained only on Python, Java, and JavaScript.