Update README.md
README.md
CHANGED
@@ -17,6 +17,10 @@ tags:
- pytorch
- llama
- llama-3
extra_gated_prompt: "### LLAMA 3.1 COMMUNITY LICENSE AGREEMENT\nLlama 3.1 Version\
\ Release Date: July 23, 2024\n\"Agreement\" means the terms and conditions for\
\ use, reproduction, distribution and modification of the Llama Materials set forth\
@@ -189,6 +193,18 @@ extra_gated_description: The information you provide will be collected, stored,
extra_gated_button_content: Submit
---

## Model Information

The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out). The Llama 3.1 instruction tuned text only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open source and closed chat models on common industry benchmarks.
@@ -289,95 +305,37 @@ Where to send questions or comments about the model Instructions on how to provi

## How to use

-This repository contains two versions of Meta-Llama-3.1-8B-Instruct, for use with transformers and with the original `llama` codebase.

-### Use with transformers

-Starting with `transformers >= 4.43.0` onward, you can run conversational inference using the Transformers `pipeline` abstraction or by leveraging the Auto classes with the `generate()` function.
-
-Make sure to update your transformers installation via `pip install --upgrade transformers`.

```python
import transformers
-import torch
-
-model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

-pipeline = transformers.pipeline(
-    "text-generation",
-    model=model_id,
-    model_kwargs={"torch_dtype": torch.bfloat16},
-    device_map="auto",
-)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

-outputs = pipeline(
    messages,
-    max_new_tokens=256,
)
-print(outputs[0]["generated_text"][-1])
-```
-
-Note: You can also find detailed recipes on how to use the model locally, with `torch.compile()`, assisted generations, quantised and more at [`huggingface-llama-recipes`](https://github.com/huggingface/huggingface-llama-recipes)
-
-### Tool use with transformers
-
-LLaMA-3.1 supports multiple tool use formats. You can see a full guide to prompt formatting [here](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/).
-
-Tool use is also supported through [chat templates](https://huggingface.co/docs/transformers/main/chat_templating#advanced-tool-use--function-calling) in Transformers.
-Here is a quick example showing a single simple tool:
-
-```python
-# First, define a tool
-def get_current_temperature(location: str) -> float:
-    """
-    Get the current temperature at a location.
-
-    Args:
-        location: The location to get the temperature for, in the format "City, Country"
-    Returns:
-        The current temperature at the specified location in the specified units, as a float.
-    """
-    return 22.  # A real function should probably actually get the temperature!
-
-# Next, create a chat and apply the chat template
-messages = [
-    {"role": "system", "content": "You are a bot that responds to weather queries."},
-    {"role": "user", "content": "Hey, what's the temperature in Paris right now?"}
-]
-
-inputs = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True)
-```
-
-You can then generate text from this input as normal. If the model generates a tool call, you should add it to the chat like so:
-
-```python
-tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
-messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
-```

-and then call the tool and append the result, with the `tool` role, like so:
-
-```python
-messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})
-```
-
-After that, you can `generate()` again to let the model use the tool result in the chat. Note that this was a very brief introduction to tool calling - for more information,
-see the [LLaMA prompt format docs](https://llama.meta.com/docs/model-cards-and-prompt-formats/llama3_1/) and the Transformers [tool use documentation](https://huggingface.co/docs/transformers/main/chat_templating#advanced-tool-use--function-calling).


-### Use with `llama`
-
-Please, follow the instructions in the [repository](https://github.com/meta-llama/llama)
-
-To download Original checkpoints, see the example command below leveraging `huggingface-cli`:
-
-```
-huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --include "original/*" --local-dir Meta-Llama-3.1-8B-Instruct
```

## Hardware and Software

@@ -17,6 +17,10 @@ tags:
- pytorch
- llama
- llama-3
+- ctranslate2
+- quantization
+- int8
+- float16
extra_gated_prompt: "### LLAMA 3.1 COMMUNITY LICENSE AGREEMENT\nLlama 3.1 Version\
\ Release Date: July 23, 2024\n\"Agreement\" means the terms and conditions for\
\ use, reproduction, distribution and modification of the Llama Materials set forth\
@@ -189,6 +193,18 @@ extra_gated_description: The information you provide will be collected, stored,
extra_gated_button_content: Submit
---

+## meta-llama/Meta-Llama-3.1-8B-Instruct for CTranslate2
+
+**This model is an int8_float16 quantized version of [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) and can be used with [CTranslate2](https://github.com/OpenNMT/CTranslate2).**
+
+## Conversion details
+
+The original model was converted in October 2024 with the following command:
+```
+ct2-transformers-converter --model Path\To\Local\meta-llama\Meta-Llama-3.1-8B-Instruct \
+--quantization int8_float16 --output_dir Meta-Llama-3.1-8B-Instruct-ct2-int8_float16
+```
+
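For reference, the same conversion can also be driven from Python through CTranslate2's converter API. This is a minimal sketch, not taken from this model card; it assumes you have been granted access to the gated Meta repository and are logged in to the Hugging Face Hub:

```python
from ctranslate2.converters import TransformersConverter

# Equivalent of the ct2-transformers-converter CLI call above. The model id is
# fetched from the Hub if it is not a local path; access to the gated repo and
# a prior `huggingface-cli login` are assumed.
converter = TransformersConverter("meta-llama/Meta-Llama-3.1-8B-Instruct")
converter.convert(
    "Meta-Llama-3.1-8B-Instruct-ct2-int8_float16",
    quantization="int8_float16",
)
```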
## Model Information

The Meta Llama 3.1 collection of multilingual large language models (LLMs) is a collection of pretrained and instruction tuned generative models in 8B, 70B and 405B sizes (text in/text out). The Llama 3.1 instruction tuned text only models (8B, 70B, 405B) are optimized for multilingual dialogue use cases and outperform many of the available open source and closed chat models on common industry benchmarks.

@@ -289,95 +305,37 @@ Where to send questions or comments about the model Instructions on how to provi

## How to use

+This repository is for use with [CTranslate2](https://github.com/OpenNMT/CTranslate2).

+### Use with CTranslate2

+This example code is adapted from the [CTranslate2 Transformers guide](https://opennmt.net/CTranslate2/guides/transformers.html#mpt) and the [AutoTokenizer documentation](https://huggingface.co/docs/transformers/main_classes/tokenizer).
+More detailed information about the `generate_batch` method can be found at [CTranslate2_Generator.generate_batch](https://opennmt.net/CTranslate2/python/ctranslate2.Generator.html#ctranslate2.Generator.generate_batch).
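Note that `ctranslate2.Generator` expects a local directory containing the converted model files. If you pass the bare Hub repo id as in the example below, make sure the files are present locally first; a hypothetical way to do that (assuming the `huggingface_hub` package) is:

```python
from huggingface_hub import snapshot_download

# Download the converted CTranslate2 files and use the returned local path
# in place of the bare repo id when constructing the Generator.
local_dir = snapshot_download("avans06/Meta-Llama-3.1-8B-Instruct-ct2-int8_float16")
```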

```python
+import ctranslate2
import transformers

+model_id = "avans06/Meta-Llama-3.1-8B-Instruct-ct2-int8_float16"
+model = ctranslate2.Generator(model_id, device="auto", compute_type="int8_float16")
+tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

+input_ids = tokenizer.apply_chat_template(
    messages,
+    add_generation_prompt=True
)

+# CTranslate2 generation works on string tokens rather than token ids.
+input_tokens = tokenizer.convert_ids_to_tokens(input_ids)

+results = model.generate_batch([input_tokens], include_prompt_in_result=False, max_length=256)
+output = tokenizer.decode(results[0].sequences_ids[0])

+print(output)
```
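The `generate_batch` call above decodes greedily with mostly default settings. As a sketch of what else the API accepts (the parameter values here are illustrative, not from this model card), sampling can be enabled like so:

```python
# Illustrative only: top-k sampling instead of greedy decoding.
# See the generate_batch documentation linked above for the full parameter list.
results = model.generate_batch(
    [input_tokens],
    include_prompt_in_result=False,
    max_length=256,
    sampling_topk=10,          # sample among the 10 most likely tokens
    sampling_temperature=0.7,  # soften the distribution before sampling
)
```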

## Hardware and Software