Update README.md
README.md
````diff
@@ -36,6 +36,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
 
 tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictalm-7b-instruct')
+# If you don't have cuda installed, remove the `.cuda()` call at the end
 model = AutoModelForCausalLM.from_pretrained('dicta-il/dictalm-7b-instruct', trust_remote_code=True).cuda()
 
 model.eval()
@@ -55,6 +56,25 @@ with torch.inference_mode():
 print(tokenizer.batch_decode(model.generate(**kwargs), skip_special_tokens=True))
 ```
 
+### Alternative ways to initialize the model:
+
+If you have multiple smaller GPUs, and the package `accelerate` is installed, you can initialize the model split across the devices:
+```python
+model = AutoModelForCausalLM.from_pretrained('dicta-il/dictalm-7b-instruct', trust_remote_code=True, device_map='auto')
+```
+
+If you are running on Linux and have the `bitsandbytes` package installed, you can initialize the model in 4/8 bit inference mode:
+```python
+model = AutoModelForCausalLM.from_pretrained('dicta-il/dictalm-7b-instruct', trust_remote_code=True, load_in_8bit=True)
+```
+
+If you have [FlashAttention](https://github.com/Dao-AILab/flash-attention) installed in your environment, you can instruct the model to use the flash attention implementation (either V1 or V2, whichever is installed):
+```python
+model = AutoModelForCausalLM.from_pretrained('dicta-il/dictalm-7b-instruct', trust_remote_code=True, use_flash_attention=True)
+```
+
+
+
 There are many different parameters you can input into `kwargs` for different results (greedy, beam search, different sampling configurations, longer/shorter responses, etc.).
 
 You can view the full list of parameters you can pass to the `generate` function [here](https://huggingface.co/docs/transformers/v4.33.0/en/main_classes/text_generation#transformers.GenerationMixin.generate).
````
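As a sketch of the kinds of `generate` settings the README alludes to, here are three example `kwargs` configurations. The parameter names (`do_sample`, `num_beams`, `temperature`, `top_p`, `max_new_tokens`) are standard `transformers` generation parameters; the specific values are illustrative assumptions, not taken from the repository:

```python
# Illustrative generation settings for model.generate(**kwargs).
# Values below are example choices, not recommendations from the model card.

# Greedy decoding: deterministic, picks the highest-probability token each step.
greedy_kwargs = {"do_sample": False, "max_new_tokens": 128}

# Beam search: tracks num_beams candidate sequences and returns the best one.
beam_kwargs = {"num_beams": 4, "early_stopping": True, "max_new_tokens": 128}

# Sampling: temperature and top_p trade determinism for variety; higher
# temperature tends to give more diverse, less predictable responses.
sampling_kwargs = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
    "max_new_tokens": 256,
}
```

Any one of these dicts would be merged with the tokenized inputs (the `kwargs` built earlier in the snippet) before calling `model.generate(**kwargs)`.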