Model not working with accelerate for inference.

#25
by Satandon1999 - opened

Trying to do multi-GPU inference with the accelerate library, following the instructions here: https://medium.com/@geronimo7/llms-multi-gpu-inference-with-accelerate-5a8333e4c5db.
The process works perfectly fine for the Phi-3-mini models, but with this Phi-3-small model I hit the following error:
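For reference, the loading pattern that guide describes looks roughly like the sketch below. This is not my exact script; the model id, prompt handling, and generation arguments are illustrative. The key pieces are `device_map="auto"`, which lets accelerate shard the layers across the visible GPUs, and `trust_remote_code=True`, since Phi-3-small ships its own modeling code:

```python
def generate_sharded(model_id: str, prompt: str, max_new_tokens: int = 64) -> str:
    """Rough multi-GPU generation sketch (imports deferred so the
    function can be defined without torch/transformers installed)."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",       # accelerate places shards on available GPUs
        trust_remote_code=True,  # Phi-3-small uses custom modeling code
    )
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```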

e90707fca55744aebfc511579dbd663c00000C:353:385 [1] NCCL INFO [Service thread] Connection closed by localRank 1
sh: 1: cannot create 0.1/compile-ptx-log-7f1750: Directory nonexistent
SystemLog: Traceback (most recent call last):
SystemLog:   File "/mnt/azureml/cr/j/c3a3f3df23864dbdbb10a3f2d941acb8/exe/wd/run.py", line 141, in main
SystemLog:     output = model.generate(input_ids=input_ids,
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
SystemLog:     return func(*args, **kwargs)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/transformers/generation/utils.py", line 1758, in generate
SystemLog:     result = self._sample(
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/transformers/generation/utils.py", line 2397, in _sample
SystemLog:     outputs = self(
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
SystemLog:     return self._call_impl(*args, **kwargs)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
SystemLog:     return forward_call(*args, **kwargs)
SystemLog:   File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 956, in forward
SystemLog:     outputs = self.model(
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
SystemLog:     return self._call_impl(*args, **kwargs)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
SystemLog:     return forward_call(*args, **kwargs)
SystemLog:   File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 859, in forward
SystemLog:     layer_outputs = decoder_layer(
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
SystemLog:     return self._call_impl(*args, **kwargs)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
SystemLog:     return forward_call(*args, **kwargs)
SystemLog:   File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 671, in forward
SystemLog:     hidden_states, self_attn_weights, present_key_values = self.self_attn(
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
SystemLog:     return self._call_impl(*args, **kwargs)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
SystemLog:     return forward_call(*args, **kwargs)
SystemLog:   File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 616, in forward
SystemLog:     attn_function_output = self._apply_blocksparse_attention(
SystemLog:   File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 382, in _apply_blocksparse_attention
SystemLog:     context_layer = self._blocksparse_layer(
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
SystemLog:     return self._call_impl(*args, **kwargs)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
SystemLog:     return forward_call(*args, **kwargs)
SystemLog:   File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/triton_blocksparse_attention_layer.py", line 165, in forward
SystemLog:     return blocksparse_flash_attn_padded_fwd(
SystemLog:   File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/triton_flash_blocksparse_attn.py", line 996, in blocksparse_flash_attn_padded_fwd
SystemLog:     _fwd_kernel_batch_inference[grid](
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/runtime/jit.py", line 167, in <lambda>
SystemLog:     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 305, in run
SystemLog:     return self.fn.run(*args, **kwargs)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/runtime/jit.py", line 416, in run
SystemLog:     self.cache[device][key] = compile(
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/compiler/compiler.py", line 193, in compile
SystemLog:     next_module = compile_ir(module, metadata)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/compiler/backends/cuda.py", line 201, in <lambda>
SystemLog:     stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.capability)
SystemLog:   File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/compiler/backends/cuda.py", line 194, in make_cubin
SystemLog:     return compile_ptx_to_cubin(src, ptxas, capability, opt.enable_fp_fusion)
SystemLog: RuntimeError: `ptxas` failed with error code 2: 
SystemLog: 
SystemLog:ERROR:__main__:An error occurred during execution
Traceback (most recent call last):
  File "/mnt/azureml/cr/j/c3a3f3df23864dbdbb10a3f2d941acb8/exe/wd/run.py", line 201, in <module>
    main()
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/shrike/compliant_logging/exceptions.py", line 411, in wrapper
    print_prefixed_stack_trace_and_raise(
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/shrike/compliant_logging/exceptions.py", line 366, in print_prefixed_stack_trace_and_raise
    raise scrubbed_err  # type: ignore
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/shrike/compliant_logging/exceptions.py", line 406, in wrapper
    return function(*func_args, **func_kwargs)
  File "/mnt/azureml/cr/j/c3a3f3df23864dbdbb10a3f2d941acb8/exe/wd/run.py", line 141, in main
    output = model.generate(input_ids=input_ids,
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/transformers/generation/utils.py", line 1758, in generate
    result = self._sample(
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/transformers/generation/utils.py", line 2397, in _sample
    outputs = self(
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 956, in forward
    outputs = self.model(
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 859, in forward
    layer_outputs = decoder_layer(
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 671, in forward
    hidden_states, self_attn_weights, present_key_values = self.self_attn(
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 616, in forward
    attn_function_output = self._apply_blocksparse_attention(
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/modeling_phi3_small.py", line 382, in _apply_blocksparse_attention
    context_layer = self._blocksparse_layer(
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/triton_blocksparse_attention_layer.py", line 165, in forward
    return blocksparse_flash_attn_padded_fwd(
  File "/root/.cache/huggingface/modules/transformers_modules/microsoft/Phi-3-small-128k-instruct/351b4fdafe349962997fd94996349824a7cd0214/triton_flash_blocksparse_attn.py", line 996, in blocksparse_flash_attn_padded_fwd
    _fwd_kernel_batch_inference[grid](
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/runtime/jit.py", line 167, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/runtime/autotuner.py", line 305, in run
    return self.fn.run(*args, **kwargs)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/runtime/jit.py", line 416, in run
    self.cache[device][key] = compile(
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/compiler/compiler.py", line 193, in compile
    next_module = compile_ir(module, metadata)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/compiler/backends/cuda.py", line 201, in <lambda>
    stages["cubin"] = lambda src, metadata: self.make_cubin(src, metadata, options, self.capability)
  File "/azureml-envs/azureml_50694e7d12e9be98761297f3c3adb59f/lib/python3.10/site-packages/triton/compiler/backends/cuda.py", line 194, in make_cubin
    return compile_ptx_to_cubin(src, ptxas, capability, opt.enable_fp_fusion)
RuntimeError: `ptxas` failed with error code 2: 

Resolved by creating the directory "0.1" referenced in the error line (`sh: 1: cannot create 0.1/compile-ptx-log-7f1750: Directory nonexistent`). It appears Triton shells out to `ptxas` and tries to redirect its compile log into that relative path, which did not exist in the job's working directory, so `ptxas` exits with code 2.

Credit: https://github.com/vllm-project/vllm/issues/3926
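The workaround from the linked vLLM issue amounts to one line: create the missing relative directory before calling `generate`, so the `ptxas` compile-log redirect has somewhere to write. A minimal sketch (the directory name "0.1" is taken from the error message above):

```python
import os

# Create the relative directory that Triton's ptxas step tries to
# write its compile log into; exist_ok avoids racing other ranks.
os.makedirs("0.1", exist_ok=True)
```

Note this is relative to the process working directory, so it must be created wherever the inference script actually runs.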

Satandon1999 changed discussion status to closed
