---
license: cc-by-nc-4.0
language:
- en
pipeline_tag: text-generation
inference: false
fine-tuning: true
tags:
- nvidia
- conversational
- llama2
- rlhf
datasets:
- Anthropic/hh-rlhf
---

# NV-Llama2-70B-RLHF-Chat 

## Description
NV-Llama2-70B-RLHF-Chat is a 70 billion parameter generative language model instruction-tuned from the [Llama2-70B](https://huggingface.co/meta-llama/Llama-2-70b) model. It accepts inputs with a context length of up to 4,096 tokens. The model has been fine-tuned for instruction following using Supervised Fine-tuning (SFT) on [NVIDIA SFT Datablend v1](https://huggingface.co/datasets/nvidia/sft_datablend_v1) [^1] and Reinforcement Learning from Human Feedback (RLHF) on the [HH-RLHF dataset](https://huggingface.co/datasets/Anthropic/hh-rlhf), achieving 7.59 on MT-Bench and demonstrating strong performance on academic benchmarks.

NV-Llama2-70B-RLHF-Chat is trained with NVIDIA [NeMo-Aligner](https://github.com/NVIDIA/NeMo-Aligner), a scalable toolkit for performant and efficient model alignment. NeMo-Aligner is built using the [NeMo Framework](https://github.com/NVIDIA/NeMo), which allows training to scale to thousands of GPUs using tensor, data, and pipeline parallelism for all components of alignment. All of our checkpoints are cross-compatible with the NeMo ecosystem, allowing for inference deployment and further customization.

Try this model instantly and for free, hosted by us at [NVIDIA AI Playground](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-foundation/models/nv-llama2-70b-rlhf). You can use it in the provided UI or through a limited-access API (up to 10,000 requests within 30 days). If you need more requests, we demonstrate how to set up an inference server below.

[^1]: as well as ~5k proprietary datapoints that we are unable to release due to data vendor restrictions 

<img src="https://huggingface.co/nvidia/NV-Llama2-70B-RLHF-Chat/resolve/main/mtbench_categories.png" alt="MT Bench Categories" width="800" style="margin-left:auto; margin-right:auto; display:block"/>


## Model Architecture
- **Architecture Type:** Transformer
- **Network Architecture:** Llama 2

## Prompt Format
| Single-Turn                                                                     | Single-Turn with Context                                                                     | Multi-Turn                                                                                                                                                   |
|---------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------|
| \<extra_id_0>System<br><br>\<extra_id_1>User<br>{prompt}<br>\<extra_id_1>Assistant | \<extra_id_0>System<br><br>\<extra_id_1>User<br>{context}<br>{prompt}<br>\<extra_id_1>Assistant | \<extra_id_0>System<br><br>\<extra_id_1>User<br>{prompt 1}<br>\<extra_id_1>Assistant<br>{response 1}<br>\<extra_id_1>User<br>{prompt 2}<br>\<extra_id_1>Assistant |
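
The three layouts in the table share one pattern, so they can be produced by a single helper. The following is a minimal sketch, not part of the NeMo API: `build_prompt`, its parameters, and the constant names are hypothetical and introduced here for illustration. The trailing newline after `Assistant` follows the templates used in the inference examples later in this card.

```python
from typing import List, Optional, Tuple

SYSTEM_HEADER = "<extra_id_0>System\n\n"  # empty system turn, as in the table
USER_TAG = "<extra_id_1>User\n"
ASSISTANT_TAG = "<extra_id_1>Assistant\n"


def build_prompt(
    history: List[Tuple[str, str]],
    prompt: str,
    context: Optional[str] = None,
) -> str:
    """Render completed (user, assistant) turns followed by a new user prompt."""
    parts = [SYSTEM_HEADER]
    for user_msg, assistant_msg in history:
        parts.append(f"{USER_TAG}{user_msg}\n{ASSISTANT_TAG}{assistant_msg}\n")
    # Context, when given, goes on its own line before the prompt.
    user_text = f"{context}\n{prompt}" if context else prompt
    parts.append(f"{USER_TAG}{user_text}\n{ASSISTANT_TAG}")
    return "".join(parts)


# Single-turn, single-turn with context, and multi-turn, in that order.
print(build_prompt([], "Write an introduction about NVIDIA."))
print(build_prompt([], "What is climate change?", context="Climate change refers to ..."))
print(build_prompt([("Hello", "Hi, how can I help?")], "Summarize our chat."))
```

Passing an empty `history` yields the single-turn layout; each completed exchange appended to `history` extends the prompt to the multi-turn layout.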


## Software Integration for Inference
- **Runtime Engine(s):** NVIDIA AI Enterprise
- **Toolkit:** NeMo Framework
- **Supported Hardware Architecture Compatibility:** H100, A100 80GB, A100 40GB

### Steps to Run Inference
We demonstrate inference using [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo), which allows hassle-free model deployment based on [NVIDIA TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), a highly optimized inference solution focussing on high throughput and low latency.

Prerequisite: You need a machine with at least four 40GB or two 80GB NVIDIA GPUs, and 300GB of free disk space.
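
As a rough sanity check on these GPU requirements, the arithmetic below estimates the memory taken by the weights alone. It assumes 16-bit (2-byte) weights, which is an assumption made here for illustration and is not stated in this card; activations, KV cache, and runtime overhead are not counted.

```python
PARAMS = 70_000_000_000  # roughly 70 billion parameters (from the model description)
BYTES_PER_PARAM = 2      # assumption: 16-bit weights (fp16 or bf16)

weights_gb = PARAMS * BYTES_PER_PARAM / 1e9  # decimal gigabytes

for num_gpus, gb_per_gpu in [(4, 40), (2, 80)]:
    total_gb = num_gpus * gb_per_gpu
    headroom_gb = total_gb - weights_gb
    print(
        f"{num_gpus} x {gb_per_gpu}GB = {total_gb}GB total, "
        f"about {headroom_gb:.0f}GB left after about {weights_gb:.0f}GB of weights"
    )
```

Under that assumption, both listed configurations provide the same total memory, with a modest margin beyond the weights.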

1. Please sign up to get free and immediate access to [NVIDIA NeMo Framework container](https://developer.nvidia.com/nemo-framework). If you don’t have an NVIDIA NGC account, you will be prompted to sign up for an account before proceeding.

2. If you don’t already have an NVIDIA NGC key, sign in to [NVIDIA NGC](https://ngc.nvidia.com/setup), select `organization/team: ea-bignlp/ga-participants`, and click Generate API key. Save this key for the next step.

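3. On your machine, log in to `nvcr.io` with Docker. When prompted, enter the literal string `$oauthtoken` as the username and your saved NGC key as the password.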
    ```bash
    docker login nvcr.io
    Username: $oauthtoken
    Password: <Your Saved NGC Key>
    ```
4. Download the required container.
    ```bash
    docker pull nvcr.io/ea-bignlp/ga-participants/nemofw-inference:23.10
    ```
5. Download the checkpoint.
    ```bash
    git lfs install
    git clone https://huggingface.co/nvidia/NV-Llama2-70B-RLHF-Chat
    cd NV-Llama2-70B-RLHF-Chat
    git lfs pull
    cd ..  # return to the parent directory so the next step can cd into the clone
    ```
6. Convert checkpoint into NeMo format.
    ```bash
    cd NV-Llama2-70B-RLHF-Chat
    tar -cvf NV-Llama2-70B-RLHF-Chat.nemo .
    mv NV-Llama2-70B-RLHF-Chat.nemo ../
    cd ..
    rm -r NV-Llama2-70B-RLHF-Chat
    ```
7. Run Docker container.
    ```bash
    docker run --gpus all -it --rm --shm-size=300g -p 8000:8000 -v ${PWD}/NV-Llama2-70B-RLHF-Chat.nemo:/opt/checkpoints/NV-Llama2-70B-RLHF-Chat.nemo -w /opt/NeMo nvcr.io/ea-bignlp/ga-participants/nemofw-inference:23.10
    ```
8. Within the container, start the server in the background. This step converts the NeMo checkpoint to TRT-LLM format and then deploys it with TRT-LLM. For an explanation of each argument and advanced usage, please refer to the [NeMo FW Deployment Guide](https://docs.nvidia.com/nemo-framework/user-guide/latest/deployingthenemoframeworkmodel.html).
    ```bash
    python scripts/deploy/deploy_triton.py --nemo_checkpoint /opt/checkpoints/NV-Llama2-70B-RLHF-Chat.nemo --model_type="llama" --triton_model_name NV-Llama2-70B-RLHF-Chat --triton_http_address 0.0.0.0 --triton_port 8000 --num_gpus 2 --max_input_len 3072 --max_output_len 1024 --max_batch_size 1 &
    ```
9. Once the server is ready (i.e. when you see the messages below), you can launch your client code.
    ```
    Started HTTPService at 0.0.0.0:8000
    Started GRPCInferenceService at 0.0.0.0:8001
    Started Metrics Service at 0.0.0.0:8002
    ```
    An example for single-turn closed QA with context:
    ```python
    from nemo.deploy import NemoQuery

    PROMPT_TEMPLATE = """<extra_id_0>System

    <extra_id_1>User
    This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions based on the context. The assistant should also indicate when the answer cannot be found in the context.
    Context: {context}
    Please give a full and complete answer for the question. {prompt}
    <extra_id_1>Assistant
    """

    context = "Climate change refers to long-term shifts in temperatures and weather patterns. Such shifts can be natural, due to changes in the sun’s activity or large volcanic eruptions. But since the 1800s, human activities have been the main driver of climate change, primarily due to the burning of fossil fuels like coal, oil and gas."
    question = "What did Michael Jackson achieve?"
    prompt = PROMPT_TEMPLATE.format(context=context, prompt=question)
    print(prompt)

    nq = NemoQuery(url="localhost:8000", model_name="NV-Llama2-70B-RLHF-Chat")
    output = nq.query_llm(prompts=[prompt], max_output_token=256, top_k=1, top_p=0.0, temperature=1.0)

    # This container does not currently support stop words; as a workaround, cut the output at the next turn marker.
    output = output[0][0].split("\n<extra_id_1>")[0]
    print(output)
    ```

    An example for multi-turn conversation:
    ```python
    from nemo.deploy import NemoQuery

    PROMPT_TEMPLATE1 = """<extra_id_0>System

    <extra_id_1>User
    {prompt1}
    <extra_id_1>Assistant
    """
    PROMPT_TEMPLATE2 = """<extra_id_0>System

    <extra_id_1>User
    {prompt1}
    <extra_id_1>Assistant
    {response1}
    <extra_id_1>User
    {prompt2}
    <extra_id_1>Assistant
    """

    nq = NemoQuery(url="localhost:8000", model_name="NV-Llama2-70B-RLHF-Chat")
    # Turn 1
    question1 = "Write an introduction about NVIDIA."
    prompt = PROMPT_TEMPLATE1.format(prompt1=question1)
    print(prompt)

    output = nq.query_llm(prompts=[prompt], max_output_token=256, top_k=1, top_p=0.0, temperature=1.0)

    # This container does not currently support stop words; as a workaround, cut the output at the next turn marker.
    response1 = output[0][0].split("\n<extra_id_1>")[0]
    print(response1)

    # Turn 2
    question2 = "Can you write it in a poem in the style of Shakespeare?"
    prompt = PROMPT_TEMPLATE2.format(prompt1=question1, response1=response1, prompt2=question2)
    print(prompt)

    output = nq.query_llm(prompts=[prompt], max_output_token=256, top_k=1, top_p=0.0, temperature=1.0)

    response2 = output[0][0].split("\n<extra_id_1>")[0]
    print(response2)
    ```
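
The stop-word workaround used in both examples above can be wrapped in a small helper. This is a minimal sketch: `truncate_at_turn_marker` and `fake_generate` are hypothetical names introduced here, and the stand-in generator replaces a running server so that the snippet runs on its own.

```python
TURN_MARKER = "\n<extra_id_1>"  # each new turn in the prompt format starts with this tag


def truncate_at_turn_marker(text: str, marker: str = TURN_MARKER) -> str:
    """Keep only the text generated before the next turn marker, if any."""
    return text.split(marker, 1)[0]


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a server call; it overruns into a new turn."""
    return "NVIDIA designs GPUs.\n<extra_id_1>User\nTell me more."


raw = fake_generate("<extra_id_0>System\n\n<extra_id_1>User\nWho is NVIDIA?\n<extra_id_1>Assistant\n")
print(truncate_at_turn_marker(raw))
```

With a live server, the argument would be `output[0][0]` from `nq.query_llm(...)`, as in the examples above. Text without a marker passes through unchanged.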


## Evaluation

### MT-bench

| Total | Writing | Roleplay | Extraction | STEM | Humanities | Reasoning | Math | Coding |
|:-------:|:-------:|:--------:|:----------:|:----:|:----------:|:-----------:|:------:|:--------:|
| 7.59  |   9.15  |   8.90   |    8.80    | 8.60 |    9.65    |   6.25      | 4.65 | 4.70   |
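
The Total column appears to be the unweighted mean of the eight category scores. That is an inference from the numbers in the table, not something stated in this card; the check below uses `Decimal` to avoid floating-point rounding surprises.

```python
from decimal import Decimal, ROUND_HALF_UP

# Category scores copied from the MT-bench table above.
category_scores = {
    "Writing": "9.15",
    "Roleplay": "8.90",
    "Extraction": "8.80",
    "STEM": "8.60",
    "Humanities": "9.65",
    "Reasoning": "6.25",
    "Math": "4.65",
    "Coding": "4.70",
}

total = sum(Decimal(score) for score in category_scores.values())
mean = total / len(category_scores)
rounded = mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(f"sum={total} mean={mean} rounded={rounded}")
```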

### Academic Benchmarks

| MMLU<br>(5-shot) | HellaSwag<br>(0-shot) | ARC easy<br>(0-shot) | WinoGrande<br>(0-shot) | TruthfulQA MC2<br>(0-shot) | TriviaQA<br>(5-shot) |
|:----------------:|:---------------------:|:--------------------:|:----------------------:|:--------------------------:|:--------------------:|
|       68.04      |         84.04         |         83.67        |          79.40         |            58.16           |         80.86        |

## Intended use
- The NV-Llama2-70B-RLHF-Chat model is best suited to instruction-following chat use cases, including question answering, search, and summarization.
- Ethical use: Technology can have a profound impact on people and the world, and NVIDIA is committed to enabling trust and transparency in AI development. NVIDIA encourages users to adopt principles of AI ethics and trustworthiness to guide their decisions by following the guidelines in the [cc-by-nc-4.0](https://creativecommons.org/licenses/by-nc/4.0/legalcode.en) license.

## Limitations
- The model was trained on data, originally crawled from the Internet, that contains toxic language and societal biases. Therefore, the model may amplify those biases and return toxic responses, especially when given toxic prompts.
- The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text. It may also produce socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive.
- We recommend deploying the model with [NeMo Guardrails](https://github.com/NVIDIA/NeMo-Guardrails) to mitigate these potential issues.