|
======================== |
|
START TIME: Sat Jul 6 09:34:54 UTC 2024 |
|
python3 version = Python 3.10.14 |
|
======================== |
|
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well. |
|
Token is valid (permission: write). |
|
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token |
|
Login successful |
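
The credential warning above names the fix itself; a minimal sketch of the Python-side call, assuming the job logs in via the huggingface_hub API (the token string is a placeholder):

    from huggingface_hub import login

    # Validates the token and, unlike the default, also stores it in the
    # git credential helper, which silences the warning above.
    login(token="hf_xxx", add_to_git_credential=True)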
|
Already on 'bench_cluster' |
|
M examples/config_tiny_llama.py |
|
M examples/config_tiny_llama.yaml |
|
M examples/train_tiny_llama.sh |
|
Your branch is up to date with 'origin/bench_cluster'. |
|
Job status: RUNNING |
|
All eight per-node torch.distributed.run agents printed the same startup warning (timestamps 09:34:57,164 through 09:34:57,242); one copy follows:

[2024-07-06 09:34:57,164] torch.distributed.run: [WARNING]

[2024-07-06 09:34:57,164] torch.distributed.run: [WARNING] *****************************************

[2024-07-06 09:34:57,164] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

[2024-07-06 09:34:57,164] torch.distributed.run: [WARNING] *****************************************
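
The warning is advisory: torchrun pins OMP_NUM_THREADS to 1 when the variable is not already set. A sketch of setting it explicitly before workers spawn; the even split of cores across the 8 local ranks per node is a heuristic, not this job's actual setting:

    import os

    # Must run before torch/numpy create their thread pools.
    cores_per_rank = max(1, (os.cpu_count() or 1) // 8)  # 8 local ranks per node
    os.environ.setdefault("OMP_NUM_THREADS", str(cores_per_rank))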
|
[default0]:[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:48529 (errno: 98 - Address already in use). |
|
[default0]:[W socket.cpp:464] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use). |
|
[default0]:[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address. |
|
[default0]:Traceback (most recent call last): |
|
[default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default0]: trainer = DistributedTrainer(config_file) |
|
[default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 145, in __init__ |
|
[default0]: self.parallel_context = ParallelContext( |
|
[default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/context.py", line 53, in __init__ |
|
[default0]: dist.initialize_torch_distributed() |
|
[default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/distributed.py", line 278, in initialize_torch_distributed |
|
[default0]: dist.init_process_group( |
|
[default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 86, in wrapper |
|
[default0]: func_return = func(*args, **kwargs) |
|
[default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1177, in init_process_group |
|
[default0]: store, rank, world_size = next(rendezvous_iterator) |
|
[default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler |
|
[default0]: store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv) |
|
[default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store |
|
[default0]: return TCPStore( |
|
[default0]:torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:48529 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use). |
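
errno 98 (EADDRINUSE) means something on the head node already held port 48529 when the TCPStore tried to bind it. A quick probe for whether a candidate MASTER_PORT is free, run on the node that will host the store; this is a simplified IPv4 check, whereas the failing bind above was on the IPv6 wildcard [::]:

    import socket

    def port_is_free(port: int, host: str = "") -> bool:
        # Bind the way the TCPStore would; EADDRINUSE surfaces as OSError.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            try:
                s.bind((host, port))
            except OSError:
                return False
            return True

    print(port_is_free(48529))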
|
[2024-07-06 09:35:08,437] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183941 closing signal SIGTERM |
|
[2024-07-06 09:35:08,438] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183942 closing signal SIGTERM |
|
[2024-07-06 09:35:08,438] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183943 closing signal SIGTERM |
|
[2024-07-06 09:35:08,438] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183944 closing signal SIGTERM |
|
[2024-07-06 09:35:08,438] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183945 closing signal SIGTERM |
|
[2024-07-06 09:35:08,438] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183946 closing signal SIGTERM |
|
[2024-07-06 09:35:08,439] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183947 closing signal SIGTERM |
|
[2024-07-06 09:35:09,341] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 183940) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 |
|
Traceback (most recent call last): |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module> |
|
sys.exit(main()) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper |
|
return f(*args, **kwargs) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main |
|
run(args) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run |
|
elastic_launch( |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ |
|
return launch_agent(self._config, self._entrypoint, list(args)) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent |
|
raise ChildFailedError( |
|
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
|
============================================================ |
|
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED |
|
------------------------------------------------------------ |
|
Failures: |
|
<NO_OTHER_FAILURES> |
|
------------------------------------------------------------ |
|
Root Cause (first observed failure): |
|
[0]: |
|
time : 2024-07-06_09:35:08 |
|
host : ip-26-0-163-220.ec2.internal |
|
rank : 0 (local_rank: 0) |
|
exitcode : 1 (pid: 183940) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
============================================================ |
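
The summary reports "error_file: <N/A>"; the page linked above describes wrapping the entrypoint with the elastic @record decorator so worker tracebacks are captured in an error file. A sketch against a hypothetical main(), not the actual contents of run_train.py:

    from torch.distributed.elastic.multiprocessing.errors import record

    @record
    def main() -> None:
        ...  # training entrypoint

    if __name__ == "__main__":
        main()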
|
srun: error: ip-26-0-163-220: task 0: Exited with exit code 1 |
|
[2024-07-06 09:35:13,325] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-172-252.ec2.internal_3117434_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. |
|
[2024-07-06 09:35:13,371] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-18.ec2.internal_961861_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. |
|
[2024-07-06 09:35:13,378] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-45.ec2.internal_1411207_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. |
|
[2024-07-06 09:35:13,393] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-187.ec2.internal_423434_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. |
|
[2024-07-06 09:35:13,398] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-163-236.ec2.internal_1020445_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. |
|
[2024-07-06 09:35:13,402] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-75.ec2.internal_967841_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. |
|
[2024-07-06 09:35:13,422] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-173-7.ec2.internal_2779941_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. |
|
[2024-07-06 09:35:13,443-448] torch.distributed.elastic.multiprocessing.api: [WARNING] Each of the seven remaining node agents sent closing signal SIGTERM to its eight local worker processes: 961930-961937 (ip-26-0-164-18), 1020514-1020521 (ip-26-0-163-236), 1411275-1411282 (ip-26-0-164-45), 423506-423513 (ip-26-0-164-187), 3117503-3117510 (ip-26-0-172-252), 967910-967917 (ip-26-0-164-75), 2780010-2780017 (ip-26-0-173-7).
|
[2024-07-06 09:35:13,953] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-172-252.ec2.internal_3117434_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.

[2024-07-06 09:35:13,953] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-45.ec2.internal_1411207_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.

[2024-07-06 09:35:13,955] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-173-7.ec2.internal_2779941_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.

[2024-07-06 09:35:13,955] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-163-236.ec2.internal_1020445_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.

[2024-07-06 09:35:13,957] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-18.ec2.internal_961861_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.

[2024-07-06 09:35:14,052] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-187.ec2.internal_423434_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.

[2024-07-06 09:35:14,053] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-75.ec2.internal_967841_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.

Each of the seven remaining node agents then exited with the same traceback (the per-node copies were printed concurrently and interleaved in the raw output; a single copy follows):

Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
    return getattr(self._store, store_op)(*args, **kwargs)
torch.distributed.DistNetworkError: Broken pipe

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    result = agent.run()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
    result = self._invoke_run(role)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 900, in _invoke_run
    num_nodes_waiting = rdzv_handler.num_nodes_waiting()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1083, in num_nodes_waiting
    self._state_holder.sync()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 409, in sync
    get_response = self._backend.get_state()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
    base64_state: bytes = self._call_store("get", self._key)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
|
srun: error: ip-26-0-164-45: task 3: Exited with exit code 1 |
|
srun: error: ip-26-0-164-18: task 2: Exited with exit code 1 |
|
srun: error: ip-26-0-173-7: task 7: Exited with exit code 1 |
|
srun: error: ip-26-0-172-252: task 6: Exited with exit code 1 |
|
srun: error: ip-26-0-163-236: task 1: Exited with exit code 1 |
|
srun: error: ip-26-0-164-187: task 5: Exited with exit code 1 |
|
srun: error: ip-26-0-164-75: task 4: Exited with exit code 1 |
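
The root failure was the hard-coded rendezvous port colliding with an existing listener; once the rank-0 agent died, every other agent lost the C10d store and exited with the broken-pipe tracebacks above. One common mitigation is deriving MASTER_PORT from the Slurm job id so concurrent jobs cannot race for the same port; a sketch, with an arbitrary 20000-29999 range:

    import os

    # SLURM_JOB_ID is set by Slurm inside the allocation; falls back to 0 elsewhere.
    job_id = int(os.environ.get("SLURM_JOB_ID", "0"))
    os.environ["MASTER_PORT"] = str(20000 + job_id % 10000)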
|
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details. |
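
Enabling the suggested backend is a two-step change; a sketch, assuming the optional dependency is installed (pip install hf_transfer):

    import os

    # Must be set before huggingface_hub starts the transfer.
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

    from huggingface_hub import HfApi

    api = HfApi()
    # api.upload_file(...)  # uploads now go through the Rust-based hf_transfer backend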
|
|