Shaltiel committed
Commit 1915b64
1 Parent(s): f5c007b

Update README.md

Files changed (1)
  1. README.md +20 -0
README.md CHANGED
@@ -36,6 +36,7 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
  import torch

  tokenizer = AutoTokenizer.from_pretrained('dicta-il/dictalm-7b-instruct')
+ # If you don't have cuda installed, remove the `.cuda()` call at the end
  model = AutoModelForCausalLM.from_pretrained('dicta-il/dictalm-7b-instruct', trust_remote_code=True).cuda()

  model.eval()
@@ -55,6 +56,25 @@ with torch.inference_mode():
  print(tokenizer.batch_decode(model.generate(**kwargs), skip_special_tokens=True))
  ```

+ ### Alternative ways to initialize the model:
+
+ If you have multiple smaller GPUs and the `accelerate` package is installed, you can initialize the model split across the devices:
+ ```python
+ model = AutoModelForCausalLM.from_pretrained('dicta-il/dictalm-7b-instruct', trust_remote_code=True, device_map='auto')
+ ```
+
+ If you are running on Linux and have the `bitsandbytes` package installed, you can initialize the model in 4/8-bit inference mode:
+ ```python
+ model = AutoModelForCausalLM.from_pretrained('dicta-il/dictalm-7b-instruct', trust_remote_code=True, load_in_8bit=True)
+ ```
+
+ If you have [FlashAttention](https://github.com/Dao-AILab/flash-attention) installed in your environment, you can instruct the model to use the flash attention implementation (either V1 or V2, whichever is installed):
+ ```python
+ model = AutoModelForCausalLM.from_pretrained('dicta-il/dictalm-7b-instruct', trust_remote_code=True, use_flash_attention=True)
+ ```
+
+
+
  There are many different parameters you can pass via `kwargs` for different results (greedy, beam search, different sampling configurations, longer/shorter responses, etc.).

  You can view the full list of parameters you can pass to the `generate` function [here](https://huggingface.co/docs/transformers/v4.33.0/en/main_classes/text_generation#transformers.GenerationMixin.generate).
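
The added section mentions 4/8-bit loading but only shows the 8-bit flag. As a minimal sketch, and assuming the installed `transformers` and `bitsandbytes` versions support 4-bit quantization, the 4-bit variant uses the analogous flag:

```python
# Assumption: 4-bit loading is supported by the installed transformers/bitsandbytes versions.
model = AutoModelForCausalLM.from_pretrained('dicta-il/dictalm-7b-instruct', trust_remote_code=True, load_in_4bit=True)
```

For the `kwargs` passed to `generate`, here is a minimal sketch of the kinds of settings the last paragraph refers to. The parameter names are standard `transformers` `generate` arguments; the specific values are illustrative and not taken from the model card:

```python
import torch

# Assumes `model`, `tokenizer`, and the tokenized inputs `kwargs` from the snippet above.
with torch.inference_mode():
    # Greedy decoding (the default), capped at 128 new tokens
    print(tokenizer.batch_decode(model.generate(**kwargs, max_new_tokens=128), skip_special_tokens=True))

    # Beam search with 4 beams
    print(tokenizer.batch_decode(model.generate(**kwargs, num_beams=4, max_new_tokens=128), skip_special_tokens=True))

    # Nucleus sampling for more varied responses
    print(tokenizer.batch_decode(model.generate(**kwargs, do_sample=True, top_p=0.9, temperature=0.7, max_new_tokens=256), skip_special_tokens=True))
```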