michaelfeil committed
Commit 1910834
1 Parent(s): 25c484c
Upload Phind/Phind-CodeLlama-34B-v2 ctranslate2 weights
Browse files
- .gitattributes +8 -0
- README.md +162 -0
- config.json +30 -0
- generation_config.json +6 -0
- model.bin +3 -0
- special_tokens_map.json +24 -0
- tokenizer_config.json +37 -0
- vocabulary.json +0 -0
- vocabulary.txt +0 -0
.gitattributes
CHANGED
@@ -33,3 +33,11 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+pytorch_model-00006-of-00007.bin filter=lfs diff=lfs merge=lfs -text
+pytorch_model-00007-of-00007.bin filter=lfs diff=lfs merge=lfs -text
+pytorch_model.bin.index.json filter=lfs diff=lfs merge=lfs -text
+pytorch_model-00001-of-00007.bin filter=lfs diff=lfs merge=lfs -text
+pytorch_model-00002-of-00007.bin filter=lfs diff=lfs merge=lfs -text
+pytorch_model-00003-of-00007.bin filter=lfs diff=lfs merge=lfs -text
+pytorch_model-00004-of-00007.bin filter=lfs diff=lfs merge=lfs -text
+pytorch_model-00005-of-00007.bin filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,162 @@
---
license: llama2
model-index:
- name: Phind-CodeLlama-34B-v2
  results:
  - task:
      type: text-generation
    dataset:
      type: openai_humaneval
      name: HumanEval
    metrics:
    - name: pass@1
      type: pass@1
      value: 73.8%
      verified: false
tags:
- ctranslate2
- int8
- float16
- code llama
---
# Fast-Inference with Ctranslate2
Speed up inference while reducing memory by 2x-4x using int8 inference in C++ on CPU or GPU.

Quantized version of [Phind/Phind-CodeLlama-34B-v2](https://huggingface.co/Phind/Phind-CodeLlama-34B-v2)
```bash
pip install "hf-hub-ctranslate2>=2.12.0" "ctranslate2>=3.17.1"
```

```python
from hf_hub_ctranslate2 import GeneratorCT2fromHfHub
# from transformers import AutoTokenizer

model_name = "michaelfeil/ct2fast-Phind-CodeLlama-34B-v2"

model = GeneratorCT2fromHfHub(
    # load in int8 on CUDA
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
    # tokenizer=AutoTokenizer.from_pretrained("{ORG}/{NAME}")
)
outputs = model.generate(
    text=["def fibonacci(", "User: How are you doing? Bot:"],
    max_length=64,
    include_prompt_in_result=False,
)
print(outputs)
```

Checkpoint compatible with [ctranslate2>=3.17.1](https://github.com/OpenNMT/CTranslate2)
and [hf-hub-ctranslate2>=2.12.0](https://github.com/michaelfeil/hf-hub-ctranslate2):
- `compute_type=int8_float16` for `device="cuda"`
- `compute_type=int8` for `device="cpu"`
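
The CPU path uses the second option; a minimal sketch reusing the loader shown above (same repo and API as in the CUDA example, only the device and compute type change):

```python
from hf_hub_ctranslate2 import GeneratorCT2fromHfHub

# Same checkpoint, loaded with 8-bit weights on the CPU instead of CUDA.
model = GeneratorCT2fromHfHub(
    model_name_or_path="michaelfeil/ct2fast-Phind-CodeLlama-34B-v2",
    device="cpu",
    compute_type="int8",
)
outputs = model.generate(text=["def fibonacci("], max_length=64)
print(outputs)
```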

Converted on 2023-10-08; the conversion applied one adjustment relative to the original checkpoint:
```
LLama-2 -> removed <pad> token.
```
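
The exact conversion command is not recorded in this commit; for reference, a conversion of this kind typically goes through CTranslate2's `TransformersConverter` (a sketch under that assumption, with illustrative option values):

```python
# Sketch only: a typical CTranslate2 conversion for this model.
# The options shown are assumptions, not the recorded invocation.
from ctranslate2.converters import TransformersConverter

converter = TransformersConverter(
    "Phind/Phind-CodeLlama-34B-v2",  # source Hugging Face repo
    load_as_float16=True,            # keep weights in float16 while converting
)
converter.convert("ct2fast-Phind-CodeLlama-34B-v2", quantization="float16")
```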

# Licence and other remarks:
This is just a quantized version. Licence conditions are intended to be identical to those of the original Hugging Face repo.

# Original description


# **Phind-CodeLlama-34B-v2**
We've fine-tuned Phind-CodeLlama-34B-v1 on an additional 1.5B tokens of high-quality programming-related data, achieving **73.8% pass@1** on HumanEval. It's the current state-of-the-art amongst open-source models.

Furthermore, this model is **instruction-tuned** on the Alpaca/Vicuna format to be steerable and easy-to-use.

More details can be found on our [blog post](https://www.phind.com/blog/code-llama-beats-gpt4).

## Model Details
This model is fine-tuned from Phind-CodeLlama-34B-v1 and achieves **73.8% pass@1** on HumanEval.

Phind-CodeLlama-34B-v2 is **multi-lingual** and is proficient in Python, C/C++, TypeScript, Java, and more.

## Dataset Details
We fine-tuned on a proprietary dataset of 1.5B tokens of high-quality programming problems and solutions. This dataset consists of instruction-answer pairs instead of code-completion examples, making it structurally different from HumanEval. LoRA was not used -- both models are native fine-tunes. We used DeepSpeed ZeRO 3 and Flash Attention 2 to train these models in 15 hours on 32 A100-80GB GPUs, with a sequence length of 4096 tokens. A sketch of that kind of training configuration follows.
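
The DeepSpeed setup named above corresponds to a ZeRO stage-3 configuration; a minimal sketch of what such a config looks like (only ZeRO 3 itself comes from the card; every concrete value below is an illustrative assumption):

```python
# Hypothetical DeepSpeed ZeRO-3 config of the kind described above.
ds_config = {
    "zero_optimization": {
        "stage": 3,              # shard parameters, gradients, and optimizer state
        "overlap_comm": True,    # overlap communication with computation
    },
    "bf16": {"enabled": True},   # matches the bfloat16 torch_dtype in config.json
    "gradient_accumulation_steps": 1,
    "train_micro_batch_size_per_gpu": 1,
}
```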

## How to Get Started with the Model

Make sure to install Transformers from the main git branch:

```bash
pip install git+https://github.com/huggingface/transformers.git
```

## How to Prompt the Model
This model accepts the Alpaca/Vicuna instruction format.

For example:

```
### System Prompt
You are an intelligent programming assistant.

### User Message
Implement a linked list in C++

### Assistant
...
```
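
Assembling that format programmatically is a small helper; a minimal sketch (the `build_prompt` name is ours, not part of the original card):

```python
def build_prompt(system: str, user: str) -> str:
    # Alpaca/Vicuna-style format from the example above; the model's
    # completion is expected to follow the "### Assistant" header.
    return (
        f"### System Prompt\n{system}\n\n"
        f"### User Message\n{user}\n\n"
        "### Assistant\n"
    )

prompt = build_prompt(
    "You are an intelligent programming assistant.",
    "Implement a linked list in C++",
)
```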

## How to reproduce HumanEval Results

To reproduce our results:

```python
from transformers import AutoTokenizer, LlamaForCausalLM
from human_eval.data import write_jsonl, read_problems
from tqdm import tqdm

# initialize the model
model_path = "Phind/Phind-CodeLlama-34B-v2"
model = LlamaForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# HumanEval helper
def generate_one_completion(prompt: str):
    tokenizer.pad_token = tokenizer.eos_token
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=4096)

    # Generate
    generate_ids = model.generate(
        inputs.input_ids.to("cuda"),
        max_new_tokens=384,
        do_sample=True,
        top_p=0.75,
        top_k=40,
        temperature=0.1,
    )
    completion = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    completion = completion.replace(prompt, "").split("\n\n\n")[0]

    return completion

# perform HumanEval
problems = read_problems()

num_samples_per_task = 1
samples = [
    dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
    for task_id in tqdm(problems)
    for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)

# run `evaluate_functional_correctness samples.jsonl` in your HumanEval code sandbox
```

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->
This model has undergone very limited testing. Additional safety testing should be performed before any real-world deployments.

## Training details

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

- **Hardware Type:** 32x A100-80GB
- **Hours used:** 480 GPU-hours
- **Cloud Provider:** AWS
- **Compute Region:** us-east-1
config.json
ADDED
@@ -0,0 +1,30 @@
{
  "_name_or_path": "/fsx/Phind-CodeLlama-34B-v1",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 22016,
  "max_position_embeddings": 16384,
  "model_type": "llama",
  "num_attention_heads": 64,
  "num_hidden_layers": 48,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 1000000,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.33.0.dev0",
  "use_cache": true,
  "vocab_size": 32000,
  "bos_token": "<s>",
  "eos_token": "</s>",
  "layer_norm_epsilon": 1e-05,
  "unk_token": "<unk>"
}
generation_config.json
ADDED
@@ -0,0 +1,6 @@
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "transformers_version": "4.33.0.dev0"
}
model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:48749279396a6bcf43d64cd78889928cff7aedcbfeff7e8147af1105728609a7
size 33758632127
special_tokens_map.json
ADDED
@@ -0,0 +1,24 @@
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "</s>",
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer_config.json
ADDED
@@ -0,0 +1,37 @@
{
  "add_bos_token": true,
  "add_eos_token": false,
  "bos_token": {
    "__type": "AddedToken",
    "content": "<s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "clean_up_tokenization_spaces": false,
  "eos_token": {
    "__type": "AddedToken",
    "content": "</s>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "legacy": null,
  "model_max_length": 4096,
  "pad_token": null,
  "padding_side": "right",
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": {
    "__type": "AddedToken",
    "content": "<unk>",
    "lstrip": false,
    "normalized": true,
    "rstrip": false,
    "single_word": false
  },
  "use_default_system_prompt": true
}
vocabulary.json
ADDED
The diff for this file is too large to render. See raw diff.
vocabulary.txt
ADDED
The diff for this file is too large to render. See raw diff.