michaelfeil committed on
Commit 76033bc
1 Parent(s): 92e4e7b

Create README.md

Files changed (1)
  1. README.md +63 -0
README.md ADDED
---
license: mit
tags:
- ctranslate2
---
# Fast-Inference with Ctranslate2
Speed up inference by 2x-8x using int8 inference in C++.

Quantized version of [databricks/dolly-v2-12b](https://huggingface.co/databricks/dolly-v2-12b).
```bash
pip install "hf_hub_ctranslate2>=1.0.0" "ctranslate2>=3.13.0"
```

Checkpoint compatible with [ctranslate2](https://github.com/OpenNMT/CTranslate2) and [hf-hub-ctranslate2](https://github.com/michaelfeil/hf-hub-ctranslate2):
- `compute_type=int8_float16` for `device="cuda"`
- `compute_type=int8` for `device="cpu"` (a CPU sketch follows the example below)

```python
from hf_hub_ctranslate2 import GeneratorCT2fromHfHub

model_name = "michaelfeil/ct2fast-dolly-v2-12b"
# load the int8_float16-quantized checkpoint on CUDA
model = GeneratorCT2fromHfHub(
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
)
outputs = model.generate(
    text=["How do you call a fast Flan-ingo?", "User: How are you doing?"],
)
print(outputs)
```
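
For machines without a GPU, a minimal sketch of the same call using the CPU settings from the list above (the prompt text is illustrative):
```python
from hf_hub_ctranslate2 import GeneratorCT2fromHfHub

# assumption: same API as above, swapping in device="cpu" with compute_type="int8"
model_cpu = GeneratorCT2fromHfHub(
    model_name_or_path="michaelfeil/ct2fast-dolly-v2-12b",
    device="cpu",
    compute_type="int8",
)
print(model_cpu.generate(text=["How are you doing?"]))
```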

# License and other remarks:
This is just a quantized version. License conditions are intended to be identical to those of the original Hugging Face repo.

# Usage of Dolly-v2:
According to the instruction pipeline of databricks/dolly-v2-12b:
```python
# from https://huggingface.co/databricks/dolly-v2-12b
def encode_prompt(instruction):
    INSTRUCTION_KEY = "### Instruction:"
    RESPONSE_KEY = "### Response:"
    END_KEY = "### End"
    INTRO_BLURB = (
        "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    )

    # This is the prompt that is used for generating responses using an already trained model.
    # It ends with the response key, where the job of the model is to provide the completion
    # that follows it (i.e. the response itself).
    PROMPT_FOR_GENERATION_FORMAT = """{intro}
{instruction_key}
{instruction}
{response_key}
""".format(
        intro=INTRO_BLURB,
        instruction_key=INSTRUCTION_KEY,
        instruction="{instruction}",
        response_key=RESPONSE_KEY,
    )
    return PROMPT_FOR_GENERATION_FORMAT.format(instruction=instruction)
```
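
For illustration, a short sketch that feeds a prompt encoded with `encode_prompt` above into the quantized generator (the instruction text is made up for this example):
```python
from hf_hub_ctranslate2 import GeneratorCT2fromHfHub

model = GeneratorCT2fromHfHub(
    model_name_or_path="michaelfeil/ct2fast-dolly-v2-12b",
    device="cuda",
    compute_type="int8_float16",
)
# encode_prompt is defined in the snippet above
prompt = encode_prompt("Explain int8 quantization in one sentence.")
print(model.generate(text=[prompt]))
```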