avans06 committed
Commit 1461bd0
1 Parent(s): 6e4cc9d

Update README.md

Files changed (1)
  1. README.md +30 -72
README.md CHANGED
@@ -17,6 +17,10 @@ tags:
  - pytorch
  - llama
  - llama-3
  extra_gated_prompt: "### LLAMA 3.1 COMMUNITY LICENSE AGREEMENT\nLlama 3.1 Version\
  \ Release Date: July 23, 2024\n\"Agreement\" means the terms and conditions for\
  \ use, reproduction, distribution and modification of the Llama Materials set forth\
@@ -189,6 +193,18 @@ extra_gated_description: The information you provide will be collected, stored,
  extra_gated_button_content: Submit
  ---
 
  ## Model Information
 
  The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out). The Llama 3.1 instruction tuned text only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open source and closed chat models on common industry benchmarks.
@@ -289,95 +305,37 @@ Where to send questions or comments about the model Instructions on how to provi
 
  ## How to use
 
- This repository contains two versions of Meta-Llama-3.1-8B-Instruct, for use with transformers and with the original `llama` codebase.
 
- ### Use with transformers
 
- Starting with `transformers >= 4.43.0` onward, you can run conversational inference using the Transformers `pipeline` abstraction or by leveraging the Auto classes with the `generate()` function.
-
- Make sure to update your transformers installation via `pip install --upgrade transformers`.
 
  ```python
  import transformers
- import torch
-
- model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
 
- pipeline = transformers.pipeline(
-     "text-generation",
-     model=model_id,
-     model_kwargs={"torch_dtype": torch.bfloat16},
-     device_map="auto",
- )
 
  messages = [
  {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
  {"role": "user", "content": "Who are you?"},
  ]
 
- outputs = pipeline(
  messages,
- max_new_tokens=256,
  )
- print(outputs[0]["generated_text"][-1])
- ```
-
- Note: You can also find detailed recipes on how to use the model locally, with `torch.compile()`, assisted generations, quantised and more at [`huggingface-llama-recipes`](https://github.com/huggingface/huggingface-llama-recipes)
-
- ### Tool use with transformers
-
- LLaMA-3.1 supports multiple tool use formats. You can see a full guide to prompt formatting [here](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/).
-
- Tool use is also supported through [chat templates](https://huggingface.co/docs/transformers/main/chat_templating#advanced-tool-use--function-calling) in Transformers.
- Here is a quick example showing a single simple tool:
-
- ```python
- # First, define a tool
- def get_current_temperature(location: str) -> float:
-     """
-     Get the current temperature at a location.
-
-     Args:
-         location: The location to get the temperature for, in the format "City, Country"
-     Returns:
-         The current temperature at the specified location in the specified units, as a float.
-     """
-     return 22.  # A real function should probably actually get the temperature!
-
- # Next, create a chat and apply the chat template
- messages = [
-     {"role": "system", "content": "You are a bot that responds to weather queries."},
-     {"role": "user", "content": "Hey, what's the temperature in Paris right now?"}
- ]
-
- inputs = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True)
- ```
-
- You can then generate text from this input as normal. If the model generates a tool call, you should add it to the chat like so:
-
- ```python
- tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
- messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
- ```
 
- and then call the tool and append the result, with the `tool` role, like so:
-
- ```python
- messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})
- ```
-
- After that, you can `generate()` again to let the model use the tool result in the chat. Note that this was a very brief introduction to tool calling - for more information,
- see the [LLaMA prompt format docs](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/) and the Transformers [tool use documentation](https://huggingface.co/docs/transformers/main/chat_templating#advanced-tool-use--function-calling).
 
 
- ### Use with `llama`
-
- Please, follow the instructions in the [repository](https://github.com/meta-llama/llama)
-
- To download Original checkpoints, see the example command below leveraging `huggingface-cli`:
-
- ```
- huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --include "original/*" --local-dir Meta-Llama-3.1-8B-Instruct
  ```
 
  ## Hardware and Software
 
  - pytorch
  - llama
  - llama-3
+ - ctranslate2
+ - quantization
+ - int8
+ - float16
  extra_gated_prompt: "### LLAMA 3.1 COMMUNITY LICENSE AGREEMENT\nLlama 3.1 Version\
  \ Release Date: July 23, 2024\n\"Agreement\" means the terms and conditions for\
  \ use, reproduction, distribution and modification of the Llama Materials set forth\
 
  extra_gated_button_content: Submit
  ---
 
+ ## meta-llama/Meta-Llama-3.1-8B-Instruct for CTranslate2
+
+ **This model is an int8_float16 quantized version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and can be used with [CTranslate2](https://github.com/OpenNMT/CTranslate2).**
+
+ ## Conversion details
+
+ The original model was converted in October 2024 with the following command:
+ ```
+ ct2-transformers-converter --model Path\To\Local\meta-llama\Meta-Llama-3.1-8B-Instruct \
+   --quantization int8_float16 --output_dir Meta-Llama-3.1-8B-Instruct-ct2-int8_float16
+ ```
+
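The same conversion can also be run from Python through CTranslate2's converter API. The snippet below is a minimal sketch, assuming the original meta-llama/Meta-Llama-3.1-8B-Instruct weights are accessible locally or from the Hugging Face Hub; the `copy_files` list is illustrative and only shows how tokenizer files can be bundled into the output directory.

```python
# Sketch of the equivalent conversion using the CTranslate2 Python API.
# The copy_files list is illustrative; adjust it to the tokenizer files you need.
from ctranslate2.converters import TransformersConverter

converter = TransformersConverter(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    copy_files=["tokenizer.json", "tokenizer_config.json", "special_tokens_map.json"],
)
converter.convert(
    "Meta-Llama-3.1-8B-Instruct-ct2-int8_float16",
    quantization="int8_float16",
)
```

Either route produces a directory that `ctranslate2.Generator` can load directly.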
  ## Model Information
 
  The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out). The Llama 3.1 instruction tuned text only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open source and closed chat models on common industry benchmarks.
 
 
  ## How to use
 
+ This repository is for use with [CTranslate2](https://github.com/OpenNMT/CTranslate2).
 
+ ### Use with CTranslate2
 
+ This example code is adapted from the [CTranslate2 Transformers guide](https://opennmt.net/CTranslate2/guides/transformers.html#mpt) and the [Transformers AutoTokenizer documentation](https://huggingface.co/docs/transformers/main_classes/tokenizer).
+ More detailed information about the `generate_batch` method can be found in the [CTranslate2 Generator.generate_batch documentation](https://opennmt.net/CTranslate2/python/ctranslate2.Generator.html#ctranslate2.Generator.generate_batch).
 
  ```python
+ import ctranslate2
+ import huggingface_hub
  import transformers
 
+ model_id = "avans06/Meta-Llama-3.1-8B-Instruct-ct2-int8_float16"
+ # ctranslate2.Generator expects a local directory, so download the converted model first.
+ model_path = huggingface_hub.snapshot_download(model_id)
+ model = ctranslate2.Generator(model_path, device="auto", compute_type="int8_float16")
+ tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
 
  messages = [
  {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
  {"role": "user", "content": "Who are you?"},
  ]
 
+ input_ids = tokenizer.apply_chat_template(
  messages,
+ add_generation_prompt=True,
  )
 
+ input_tokens = tokenizer.convert_ids_to_tokens(input_ids)
 
+ results = model.generate_batch([input_tokens], include_prompt_in_result=False, max_length=256)
+ output = tokenizer.decode(results[0].sequences_ids[0])
 
+ print(output)
  ```
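The call above decodes greedily; `generate_batch` also accepts sampling and stopping options. The variant below is a sketch that reuses the `model`, `tokenizer`, and `input_tokens` objects from the example above, with illustrative parameter values rather than tuned recommendations.

```python
# Sketch: the same generation call with sampling enabled and an explicit stop token.
# Parameter values are illustrative, not tuned recommendations.
results = model.generate_batch(
    [input_tokens],
    include_prompt_in_result=False,
    max_length=256,
    sampling_temperature=0.7,   # sample instead of greedy decoding
    sampling_topk=50,           # consider only the 50 most likely tokens per step
    end_token="<|eot_id|>",     # stop at Llama 3.1's end-of-turn token
)
print(tokenizer.decode(results[0].sequences_ids[0]))
```

For token-by-token streaming, CTranslate2 also provides `Generator.generate_tokens`, which yields generation results step by step.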
 
  ## Hardware and Software