Given setup scripts don't work
Hi, thanks for making this resource available. I've been trying to get inference running for this model (on a Linux machine). The Hugging Face documentation doesn't say anything about setting up the Docker container except for the container name, so I did the following to get it spinning:
docker pull nvcr.io/nvidia/nemo:24.01.framework
(I keep an external mount to store the Hugging Face hub and models.)
docker run \
  --gpus all \
  -it \
  --rm \
  --shm-size=16g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v /mnt/nvme/:/mnt/nvme/ \
  -v /mnt/efs/:/mnt/efs/ \
  --net=host \
  nvcr.io/nvidia/nemo:24.01.framework
(inside the container)
pip install nemo-aligner
For the actual run script, just passing Llama3-70B-SteerLM-RM as rm_model_file didn't work, so I had to do
git clone https://huggingface.co/nvidia/Llama3-70B-SteerLM-RM
(with git lfs installed) to get the model into my hub, and then I had to set HF_HOME and HF_TOKEN as exports. From there I set the actual path to the Hugging Face hub download in the run script:
python /opt/NeMo-Aligner/examples/nlp/gpt/serve_reward_model.py \
  rm_model_file=/mnt/nvme/prasann/huggingface/hub/Llama3-70B-SteerLM-RM/ \
  trainer.num_nodes=1 \
  trainer.devices=8 \
  ++model.tensor_model_parallel_size=8 \
  ++model.pipeline_model_parallel_size=1 \
  inference.micro_batch_size=1 \
  inference.port=1424
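In case it helps anyone reproducing this, the download and environment setup were roughly the following (the HF_HOME path below is my assumption of where the hub lives on my mount, and the token value is obviously a placeholder):

git lfs install
cd /mnt/nvme/prasann/huggingface/hub           # assumed hub location on the external mount
git clone https://huggingface.co/nvidia/Llama3-70B-SteerLM-RM
export HF_HOME=/mnt/nvme/prasann/huggingface   # assumed; point this at your hub's parent directory
export HF_TOKEN=<your huggingface token>       # placeholder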
This series of steps was required for me to get the server running in my Docker container (a machine with 8x 80GB GPUs).
The server seems to start up fine, is using the GPUs, and the log ends with this line:
I0621 15:26:05.340455 4623 model_lifecycle.cc:818] successfully loaded 'reward_model'
From here I check that the port and address are exposed, but I run into trouble when actually running the given calling scripts:
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_openassistant_data.py --output_directory=data/oasst
(runs fine)
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/attribute_annotate.py \
  --input-file=data/oasst/train.jsonl \
  --output-file=data/oasst/train_labeled.jsonl \
  --port=1424
The second script leads to a timeout, and the pytriton server logs don't show any sign of receiving a request.
pytriton.client.exceptions.PyTritonClientTimeoutError: Timeout occurred during inference request. Timeout: 60.0 s Message: timed out
This snippet
import requests

try:
    response = requests.get("http://localhost:1424/v2")
    response.raise_for_status()  # Raises an HTTPError if the status is 4xx, 5xx
    print("Available endpoints:", response.json())
except requests.exceptions.HTTPError as errh:
    print("HTTP Error:", errh)
outputs fine:
Available endpoints: {'name': 'triton', 'version': '2.39.0', 'extensions': ['classification', 'sequence', 'model_repository', 'model_repository(unload_dependents)', 'schedule_policy', 'model_configuration', 'system_shared_memory', 'cuda_shared_memory', 'binary_tensor_data', 'parameters', 'statistics', 'trace', 'logging']}
As does this snippet
try:
    response = requests.get("http://localhost:1424/v2/health/ready")
    response.raise_for_status()  # Raises an HTTPError if the status is 4xx, 5xx
    print("Server is accessible and ready.")
except requests.exceptions.HTTPError as errh:
    print("HTTP Error:", errh)
This makes it seem that the server is accessible. Does anyone know what might be going wrong? I had to improvise several of these setup steps since the given starter scripts didn't seem to work, so does the setup I've detailed above look OK? Alternatively, does anyone know of more detailed, up-to-date instructions anywhere in NVIDIA's documentation for running and querying an inference server for this model?
Actually, I think I resolved the issue; it seems this was just a GPU allocation problem. Regardless, I'd be curious whether anyone knows if the setup procedure I followed is OK (especially loading the weights via git clone from the Hugging Face repo). If so, other people may be able to use this procedure to get it running.
Hi @PrasannSinghal, thank you for your interest in this model, and we appreciate your patience.
I see a few issues in your docker command, specifically:
- --shm-size=16g - this seems too little for a 70B model; please increase it to, say, 200GB if possible.
- --net=host - this shouldn't be used, as the intended use case is to have the server and client both running in the same container. If you need to call the server from outside the container (i.e. have your client script outside), you can use port mapping directly instead of --net=host, which might have unexpected issues (see the sketch after this list).
- No need to do pip install nemo-aligner - it is already installed in the container.
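As a sketch (untested against your exact setup, and assuming you keep the same mounts and port 1424), a revised docker run along those lines might look like:

docker run \
  --gpus all \
  -it \
  --rm \
  --shm-size=200g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -v /mnt/nvme/:/mnt/nvme/ \
  -v /mnt/efs/:/mnt/efs/ \
  -p 1424:1424 \
  nvcr.io/nvidia/nemo:24.01.framework

(The -p 1424:1424 mapping is only needed if your client script runs outside the container; if server and client both run inside, you can drop it.)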
We tested our scripts in a SLURM environment, which is why we didn't share the specifics, as they're slightly different for every user depending on their setup. But if you have further questions, I'm happy to follow up here or through the email listed in the contact section of the README.
Sounds good, thanks for the info!