Update README.md
README.md
CHANGED
@@ -65,7 +65,7 @@ H100, A100 80GB, A100 40GB
We demonstrate inference using NVIDIA NeMo Framework, which allows hassle-free model deployment based on [NVIDIA TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), a highly optimized inference solution focussing on high throughput and low latency.
-Pre-requisite:
1. Please sign up to get **free and immediate** access to [NVIDIA NeMo Framework container](https://developer.nvidia.com/nemo-framework). If you don’t have an NVIDIA NGC account, you will be prompted to sign up for an account before proceeding.
2. If you don’t have an NVIDIA NGC API key, sign into [NVIDIA NGC](https://ngc.nvidia.com/setup), selecting organization/team: ea-bignlp/ga-participants and click Generate API key. Save this key for the next step. Else, skip this step.
@@ -104,7 +104,7 @@ Pre-requisite: you would need at least a machine with 4 40GB or 2 80GB NVIDIA GP
python scripts/deploy/deploy_triton.py --nemo_checkpoint /opt/checkpoints/Llama2-70B-SteerLM-Chat.nemo --model_type="llama" --triton_model_name Llama2-70B-SteerLM-Chat --triton_http_address 0.0.0.0 --triton_port 8000 --num_gpus 2 --max_input_len 3072 --max_output_len 1024 --max_batch_size 1 &
```
-9. Once the server is ready
```
Started HTTPService at 0.0.0.0:8000
We demonstrate inference using NVIDIA NeMo Framework, which allows hassle-free model deployment based on [NVIDIA TRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), a highly optimized inference solution focussing on high throughput and low latency.
+Pre-requisite: You need a machine with at least four 40GB or two 80GB NVIDIA GPUs, and 300GB of free disk space.
1. Please sign up to get **free and immediate** access to [NVIDIA NeMo Framework container](https://developer.nvidia.com/nemo-framework). If you don’t have an NVIDIA NGC account, you will be prompted to sign up for an account before proceeding.
2. If you don’t have an NVIDIA NGC API key, sign into [NVIDIA NGC](https://ngc.nvidia.com/setup), selecting organization/team: ea-bignlp/ga-participants and click Generate API key. Save this key for the next step. Else, skip this step.
python scripts/deploy/deploy_triton.py --nemo_checkpoint /opt/checkpoints/Llama2-70B-SteerLM-Chat.nemo --model_type="llama" --triton_model_name Llama2-70B-SteerLM-Chat --triton_http_address 0.0.0.0 --triton_port 8000 --num_gpus 2 --max_input_len 3072 --max_output_len 1024 --max_batch_size 1 &
```
+9. Once the server is ready (i.e., when you see the message below), you can launch your client code:
```
Started HTTPService at 0.0.0.0:8000
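
Reviewer sketch (not part of this commit): the deploy command runs in the background with `&`, so a script that launches client code may want to wait until the server accepts connections. The helper below is a minimal, hypothetical example using only the Python standard library. The name `wait_for_port` and its default timeout and interval are assumptions, and an open TCP port is only a proxy for readiness. The `Started HTTPService at 0.0.0.0:8000` log line is still the signal the README itself names.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 600.0, interval: float = 5.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True as soon as a connection is accepted, or False once
    `timeout` seconds have passed without a successful connection.
    This is a hypothetical helper, not part of NeMo or Triton.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling `wait_for_port("localhost", 8000)` would match the `--triton_port 8000` value in the deploy command. The host name `localhost` is an assumption about where the client runs.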