nvidia/Llama2-70B-SteerLM-Chat


zhilinw committed on Nov 26, 2023

Commit 0de8691 • 1 Parent(s): 4d1b4ff

Update README.md

Files changed (1)

README.md +2 -2


@@ -65,7 +65,7 @@ H100, A100 80GB, A100 40GB
 We demonstrate inference using NVIDIA NeMo Framework, which allows hassle-free model deployment based on [NVIDIA TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), a highly optimized inference solution focussing on high throughput and low latency.
-Pre-requisite: you would need at least a machine with 4 40GB or 2 80GB NVIDIA GPUs, and 300GB of free disk space.
+Pre-requisite: You would need at least a machine with 4 40GB or 2 80GB NVIDIA GPUs, and 300GB of free disk space.
 1. Please sign up to get **free and immediate** access to [NVIDIA NeMo Framework container](https://developer.nvidia.com/nemo-framework). If you don’t have an NVIDIA NGC account, you will be prompted to sign up for an account before proceeding.
 2. If you don’t have an NVIDIA NGC API key, sign into [NVIDIA NGC](https://ngc.nvidia.com/setup), selecting organization/team: ea-bignlp/ga-participants and click Generate API key. Save this key for the next step. Else, skip this step.
@@ -104,7 +104,7 @@ Pre-requisite: you would need at least a machine with 4 40GB or 2 80GB NVIDIA GP
    python scripts/deploy/deploy_triton.py --nemo_checkpoint /opt/checkpoints/Llama2-70B-SteerLM-Chat.nemo --model_type="llama" --triton_model_name Llama2-70B-SteerLM-Chat --triton_http_address 0.0.0.0 --triton_port 8000 --num_gpus 2 --max_input_len 3072 --max_output_len 1024 --max_batch_size 1 &
    ```
-9. Once the server is ready in 20-45 mins depending on your computer (i.e. when you see this messages below), you are ready to launch your client code
+9. Once the server is ready (i.e. when you see this messages below), you are ready to launch your client code
     ```
     Started HTTPService at 0.0.0.0:8000
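
Step 9 asks the reader to watch for the `Started HTTPService at 0.0.0.0:8000` message before launching client code. As a rough sketch of automating that wait, the snippet below polls an HTTP readiness URL until it answers with 200. It assumes the deployed server exposes the standard Triton readiness endpoint `/v2/health/ready` on port 8000, which the README does not confirm. The helper names `http_ready` and `wait_until_ready`, and the one-hour and 30-second defaults, are hypothetical and chosen only for illustration.

```python
import time
import urllib.error
import urllib.request


def http_ready(url: str, timeout_s: float = 5.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(url, probe=http_ready, max_wait_s=3600.0,
                     interval_s=30.0, sleep=time.sleep, clock=time.monotonic):
    """Poll `probe(url)` until it succeeds or `max_wait_s` has elapsed.

    `probe`, `sleep`, and `clock` are injectable so the loop can be
    exercised without a running server.
    """
    deadline = clock() + max_wait_s
    while True:
        if probe(url):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    # Assumed endpoint: Triton's standard readiness route on the port
    # passed via --triton_port in the deploy command.
    ready = wait_until_ready("http://localhost:8000/v2/health/ready")
    print("server ready" if ready else "gave up waiting for the server")
```

Polling the HTTP port rather than parsing the log output keeps the check independent of the exact log wording, at the cost of relying on the assumed endpoint.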