Update README.md
README.md
````diff
@@ -36,6 +36,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
 import torch
 
 tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictalm-7b-instruct')
+# If you don't have cuda installed, remove the `.cuda()` call at the end
 model = AutoModelForCausalLM.from_pretrained('dicta-il/dictalm-7b-instruct', trust_remote_code=True).cuda()
 
 model.eval()
@@ -55,6 +56,25 @@ with torch.inference_mode():
 print(tokenizer.batch_decode(model.generate(**kwargs), skip_special_tokens=True))
 ```
 
+### Alternative ways to initialize the model:
+
+If you have multiple smaller GPUs, and the package `accelerate` is installed, you can initialize the model split across the devices:
+```python
+model = AutoModelForCausalLM.from_pretrained('dicta-il/dictalm-7b-instruct', trust_remote_code=True, device_map='auto')
+```
+
+If you are running on Linux and have the `bitsandbytes` package installed, you can initialize the model in 4/8 bit inference mode:
+```python
+model = AutoModelForCausalLM.from_pretrained('dicta-il/dictalm-7b-instruct', trust_remote_code=True, load_in_8bit=True)
+```
+
+If you have [FlashAttention](https://github.com/Dao-AILab/flash-attention) installed in your environment, you can instruct the model to use the flash attention implementation (either V1 or V2, whichever is installed):
+```python
+model = AutoModelForCausalLM.from_pretrained('dicta-il/dictalm-7b-instruct', trust_remote_code=True, use_flash_attention=True)
+```
+
+
+
 There are many different parameters you can input into `kwargs` for different results (greedy, beam search, different sampling configurations, longer/shorter responses, etc.).
 
 You can view the full list of parameters you can pass to the `generate` function [here](https://huggingface.co/docs/transformers/v4.33.0/en/main_classes/text_generation#transformers.GenerationMixin.generate).
````
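As a sketch of the kinds of `generate` settings the README alludes to, here are three example `kwargs` configurations. The parameter names (`do_sample`, `num_beams`, `temperature`, `top_p`, `max_new_tokens`) are standard `transformers` generation parameters; the specific values are illustrative assumptions, not taken from the repository:

```python
# Illustrative generation settings for model.generate(**kwargs).
# Values below are example choices, not recommendations from the model card.

# Greedy decoding: deterministic, picks the highest-probability token each step.
greedy_kwargs = {"do_sample": False, "max_new_tokens": 128}

# Beam search: tracks num_beams candidate sequences and returns the best one.
beam_kwargs = {"num_beams": 4, "early_stopping": True, "max_new_tokens": 128}

# Sampling: temperature and top_p trade determinism for variety; higher
# temperature tends to give more diverse, less predictable responses.
sampling_kwargs = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
    "max_new_tokens": 256,
}
```

Any one of these dicts would be merged with the tokenized inputs (the `kwargs` built earlier in the snippet) before calling `model.generate(**kwargs)`.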