|
======================== |
|
START TIME: Sat Jul 6 09:18:51 UTC 2024 |
|
python3 version = Python 3.10.14 |
|
======================== |
|
The token has not been saved to the git credentials helper. To also set the git credential, pass `add_to_git_credential=True` to this function directly, or use `--add-to-git-credential` when logging in via `huggingface-cli`.
|
Token is valid (permission: write). |
|
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token |
|
Login successful |
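For reference, the token message above comes from the `login()` helper in `huggingface_hub`. A minimal sketch of a login that also stores the token in the git credential helper (the token string is a placeholder, not the token used in this run):

    from huggingface_hub import login

    # Log in and additionally store the token in the git credential helper,
    # so that HTTPS pushes to the Hub do not prompt for credentials.
    # "hf_xxx" is a placeholder user access token.
    login(token="hf_xxx", add_to_git_credential=True)

The command-line equivalent, as the message notes, is `huggingface-cli login --add-to-git-credential`.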
|
fatal: Unable to create '/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/.git/index.lock': File exists. |
|
|
|
Another git process seems to be running in this repository, e.g. |
|
an editor opened by 'git commit'. Please make sure all processes |
|
are terminated then try again. If it still fails, a git process |
|
may have crashed in this repository earlier: |
|
remove the file manually to continue. |
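If no other git process is in fact running, the stale lock can be removed as the message suggests; a minimal sketch using the path from the error above:

    import os

    lock_path = "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/.git/index.lock"

    # Only remove the lock after confirming that no other git process
    # (for example an editor spawned by `git commit`) is still running.
    if os.path.exists(lock_path):
        os.remove(lock_path)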
|
Job status: RUNNING |
|
[2024-07-06 09:18:59,021] torch.distributed.run: [WARNING] |
|
[2024-07-06 09:18:59,021] torch.distributed.run: [WARNING] ***************************************** |
|
[2024-07-06 09:18:59,021] torch.distributed.run: [WARNING] Setting the OMP_NUM_THREADS environment variable to 1 by default for each process to avoid overloading the system; please tune the variable further for optimal performance in your application as needed.
|
[2024-07-06 09:18:59,021] torch.distributed.run: [WARNING] ***************************************** |
|
[2024-07-06 09:18:59,377] torch.distributed.run: [WARNING] |
|
[2024-07-06 09:18:59,377] torch.distributed.run: [WARNING] ***************************************** |
|
[2024-07-06 09:18:59,377] torch.distributed.run: [WARNING] Setting the OMP_NUM_THREADS environment variable to 1 by default for each process to avoid overloading the system; please tune the variable further for optimal performance in your application as needed.
|
[2024-07-06 09:18:59,377] torch.distributed.run: [WARNING] ***************************************** |
|
[2024-07-06 09:18:59,572] torch.distributed.run: [WARNING] |
|
[2024-07-06 09:18:59,572] torch.distributed.run: [WARNING] ***************************************** |
|
[2024-07-06 09:18:59,572] torch.distributed.run: [WARNING] Setting the OMP_NUM_THREADS environment variable to 1 by default for each process to avoid overloading the system; please tune the variable further for optimal performance in your application as needed.
|
[2024-07-06 09:18:59,572] torch.distributed.run: [WARNING] ***************************************** |
|
[2024-07-06 09:18:59,782] torch.distributed.run: [WARNING] |
|
[2024-07-06 09:18:59,782] torch.distributed.run: [WARNING] ***************************************** |
|
[2024-07-06 09:18:59,782] torch.distributed.run: [WARNING] Setting the OMP_NUM_THREADS environment variable to 1 by default for each process to avoid overloading the system; please tune the variable further for optimal performance in your application as needed.
|
[2024-07-06 09:18:59,782] torch.distributed.run: [WARNING] ***************************************** |
|
[2024-07-06 09:18:59,786] torch.distributed.run: [WARNING] |
|
[2024-07-06 09:18:59,786] torch.distributed.run: [WARNING] ***************************************** |
|
[2024-07-06 09:18:59,786] torch.distributed.run: [WARNING] Setting the OMP_NUM_THREADS environment variable to 1 by default for each process to avoid overloading the system; please tune the variable further for optimal performance in your application as needed.
|
[2024-07-06 09:18:59,786] torch.distributed.run: [WARNING] ***************************************** |
|
[2024-07-06 09:19:00,003] torch.distributed.run: [WARNING] |
|
[2024-07-06 09:19:00,003] torch.distributed.run: [WARNING] ***************************************** |
|
[2024-07-06 09:19:00,003] torch.distributed.run: [WARNING] Setting the OMP_NUM_THREADS environment variable to 1 by default for each process to avoid overloading the system; please tune the variable further for optimal performance in your application as needed.
|
[2024-07-06 09:19:00,003] torch.distributed.run: [WARNING] ***************************************** |
|
[2024-07-06 09:19:00,227] torch.distributed.run: [WARNING] |
|
[2024-07-06 09:19:00,227] torch.distributed.run: [WARNING] ***************************************** |
|
[2024-07-06 09:19:00,227] torch.distributed.run: [WARNING] Setting the OMP_NUM_THREADS environment variable to 1 by default for each process to avoid overloading the system; please tune the variable further for optimal performance in your application as needed.
|
[2024-07-06 09:19:00,227] torch.distributed.run: [WARNING] ***************************************** |
|
[2024-07-06 09:19:00,463] torch.distributed.run: [WARNING] |
|
[2024-07-06 09:19:00,463] torch.distributed.run: [WARNING] ***************************************** |
|
[2024-07-06 09:19:00,463] torch.distributed.run: [WARNING] Setting the OMP_NUM_THREADS environment variable to 1 by default for each process to avoid overloading the system; please tune the variable further for optimal performance in your application as needed.
|
[2024-07-06 09:19:00,463] torch.distributed.run: [WARNING] ***************************************** |
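The warning above means torchrun exports OMP_NUM_THREADS=1 into each worker's environment when the variable is not already set. A minimal sketch of requesting a different thread count, assuming it is done at the very top of the training entry script before torch is imported (the value 4 is purely illustrative, not a setting used in this run):

    import os

    # Illustrative only: request 4 OpenMP/intra-op threads per worker instead
    # of torchrun's default of 1. This must run before `import torch` so the
    # thread pool is sized from the new value; torch.set_num_threads() is an
    # alternative that also works after import.
    os.environ["OMP_NUM_THREADS"] = "4"

    import torch

    print(torch.get_num_threads())  # should now reflect the requested count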
|
[default0]:[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:47605 (errno: 98 - Address already in use). |
|
[default0]:[W socket.cpp:464] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use). |
|
[default0]:[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address. |
|
[default0]:Traceback (most recent call last): |
|
[default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module> |
|
[default0]: trainer = DistributedTrainer(config_file) |
|
[default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 145, in __init__ |
|
[default0]: self.parallel_context = ParallelContext( |
|
[default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/context.py", line 53, in __init__ |
|
[default0]: dist.initialize_torch_distributed() |
|
[default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/distributed.py", line 278, in initialize_torch_distributed |
|
[default0]: dist.init_process_group( |
|
[default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 86, in wrapper |
|
[default0]: func_return = func(*args, **kwargs) |
|
[default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1177, in init_process_group |
|
[default0]: store, rank, world_size = next(rendezvous_iterator) |
|
[default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler |
|
[default0]: store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv) |
|
[default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store |
|
[default0]: return TCPStore( |
|
[default0]:torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:47605 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use). |
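errno 98 above means the port used for the c10d TCPStore (47605 here) was already bound on the rank-0 host, typically by a stale process left over from an earlier job. As a sketch, one way to pick a port that is actually free before launching; the helper name is ours, not part of torchrun:

    import socket

    def find_free_port() -> int:
        # Bind to port 0 so the kernel assigns an unused TCP port; the number
        # returned can then be passed as MASTER_PORT (or --master-port) when
        # launching torchrun.
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("", 0))
            return s.getsockname()[1]

    print(find_free_port())

Note the inherent race: the port can be claimed again between this check and the actual launch, so retrying with a fresh port on errno 98 is the more robust pattern.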
|
[2024-07-06 09:19:11,395] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 136702 closing signal SIGTERM |
|
[2024-07-06 09:19:11,396] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 136703 closing signal SIGTERM |
|
[2024-07-06 09:19:11,396] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 136704 closing signal SIGTERM |
|
[2024-07-06 09:19:11,397] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 136705 closing signal SIGTERM |
|
[2024-07-06 09:19:11,397] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 136706 closing signal SIGTERM |
|
[2024-07-06 09:19:11,397] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 136707 closing signal SIGTERM |
|
[2024-07-06 09:19:11,398] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 136708 closing signal SIGTERM |
|
[2024-07-06 09:19:11,899] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 136701) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10 |
|
Traceback (most recent call last): |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module> |
|
sys.exit(main()) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper |
|
return f(*args, **kwargs) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main |
|
run(args) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run |
|
elastic_launch( |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__ |
|
return launch_agent(self._config, self._entrypoint, list(args)) |
|
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent |
|
raise ChildFailedError( |
|
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: |
|
============================================================ |
|
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED |
|
------------------------------------------------------------ |
|
Failures: |
|
<NO_OTHER_FAILURES> |
|
------------------------------------------------------------ |
|
Root Cause (first observed failure): |
|
[0]: |
|
time : 2024-07-06_09:19:11 |
|
host : ip-26-0-167-175.ec2.internal |
|
rank : 0 (local_rank: 0) |
|
exitcode : 1 (pid: 136701) |
|
error_file: <N/A> |
|
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html |
|
============================================================ |
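The `error_file: <N/A>` and `traceback` lines above mean the child's exception was not propagated to the launcher; the linked page documents the `record` decorator from torch.distributed.elastic for exactly this. A minimal sketch of applying it to a training entry point (the `main()` body is a placeholder, not the nanotron code):

    from torch.distributed.elastic.multiprocessing.errors import record

    @record
    def main() -> None:
        # Placeholder entry point. With @record, an uncaught exception raised
        # here is written to an error file that torchrun then surfaces in its
        # failure summary instead of reporting <N/A>.
        ...

    if __name__ == "__main__":
        main()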
|
srun: error: ip-26-0-167-175: task 0: Exited with exit code 1 |
|
[2024-07-06 09:19:15,625] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-168-52.ec2.internal_1169895_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. |
|
[2024-07-06 09:19:15,764] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-167-177.ec2.internal_64043_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. |
|
[2024-07-06 09:19:15,830] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-167-217.ec2.internal_191267_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. |
|
[2024-07-06 09:19:16,064] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-168-30.ec2.internal_61479_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. |
|
[2024-07-06 09:19:16,159] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-168-34.ec2.internal_1168489_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. |
|
[2024-07-06 09:19:16,269] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-168-95.ec2.internal_1185412_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. |
|
[2024-07-06 09:19:16,399] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-167-245.ec2.internal_2267265_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError. |
|
[2024-07-06 09:19:16,403] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2267344 closing signal SIGTERM |
|
[2024-07-06 09:19:16,404] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2267345 closing signal SIGTERM |
|
[2024-07-06 09:19:16,404] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2267346 closing signal SIGTERM |
|
[2024-07-06 09:19:16,404] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2267347 closing signal SIGTERM |
|
[2024-07-06 09:19:16,404] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2267348 closing signal SIGTERM |
|
[2024-07-06 09:19:16,404] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2267349 closing signal SIGTERM |
|
[2024-07-06 09:19:16,404] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2267350 closing signal SIGTERM |
|
[2024-07-06 09:19:16,405] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2267351 closing signal SIGTERM |
|
[2024-07-06 09:19:16,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 64119 closing signal SIGTERM |
|
[2024-07-06 09:19:16,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 64120 closing signal SIGTERM |
|
[2024-07-06 09:19:16,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 64121 closing signal SIGTERM |
|
[2024-07-06 09:19:16,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 64122 closing signal SIGTERM |
|
[2024-07-06 09:19:16,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 64123 closing signal SIGTERM |
|
[2024-07-06 09:19:16,405] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 191344 closing signal SIGTERM |
|
[2024-07-06 09:19:16,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 191345 closing signal SIGTERM |
|
[2024-07-06 09:19:16,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 64124 closing signal SIGTERM |
|
[2024-07-06 09:19:16,407] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 64125 closing signal SIGTERM |
|
[2024-07-06 09:19:16,407] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 64126 closing signal SIGTERM |
|
[2024-07-06 09:19:16,404] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 61557 closing signal SIGTERM |
|
[2024-07-06 09:19:16,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 191346 closing signal SIGTERM |
|
[2024-07-06 09:19:16,405] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 61558 closing signal SIGTERM |
|
[2024-07-06 09:19:16,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 191347 closing signal SIGTERM |
|
[2024-07-06 09:19:16,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 191348 closing signal SIGTERM |
|
[2024-07-06 09:19:16,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 191349 closing signal SIGTERM |
|
[2024-07-06 09:19:16,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 191350 closing signal SIGTERM |
|
[2024-07-06 09:19:16,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 191351 closing signal SIGTERM |
|
[2024-07-06 09:19:16,405] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 61559 closing signal SIGTERM |
|
[2024-07-06 09:19:16,405] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 61560 closing signal SIGTERM |
|
[2024-07-06 09:19:16,405] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 61561 closing signal SIGTERM |
|
[2024-07-06 09:19:16,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 61562 closing signal SIGTERM |
|
[2024-07-06 09:19:16,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 61563 closing signal SIGTERM |
|
[2024-07-06 09:19:16,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 61564 closing signal SIGTERM |
|
[2024-07-06 09:19:16,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1185488 closing signal SIGTERM |
|
[2024-07-06 09:19:16,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1185489 closing signal SIGTERM |
|
[2024-07-06 09:19:16,406] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1169971 closing signal SIGTERM |
|
[2024-07-06 09:19:16,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1185490 closing signal SIGTERM |
|
[2024-07-06 09:19:16,407] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1169972 closing signal SIGTERM |
|
[2024-07-06 09:19:16,407] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1169973 closing signal SIGTERM |
|
[2024-07-06 09:19:16,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1185491 closing signal SIGTERM |
|
[2024-07-06 09:19:16,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1185492 closing signal SIGTERM |
|
[2024-07-06 09:19:16,407] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1168564 closing signal SIGTERM |
|
[2024-07-06 09:19:16,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1185493 closing signal SIGTERM |
|
[2024-07-06 09:19:16,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1185494 closing signal SIGTERM |
|
[2024-07-06 09:19:16,409] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1185495 closing signal SIGTERM |
|
[2024-07-06 09:19:16,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1168565 closing signal SIGTERM |
|
[2024-07-06 09:19:16,407] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1169974 closing signal SIGTERM |
|
[2024-07-06 09:19:16,407] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1169975 closing signal SIGTERM |
|
[2024-07-06 09:19:16,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1168566 closing signal SIGTERM |
|
[2024-07-06 09:19:16,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1168567 closing signal SIGTERM |
|
[2024-07-06 09:19:16,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1168568 closing signal SIGTERM |
|
[2024-07-06 09:19:16,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1168569 closing signal SIGTERM |
|
[2024-07-06 09:19:16,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1168570 closing signal SIGTERM |
|
[2024-07-06 09:19:16,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1168571 closing signal SIGTERM |
|
[2024-07-06 09:19:16,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1169976 closing signal SIGTERM |
|
[2024-07-06 09:19:16,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1169977 closing signal SIGTERM |
|
[2024-07-06 09:19:16,408] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1169978 closing signal SIGTERM |
|
[2024-07-06 09:19:16,915] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-167-217.ec2.internal_191267_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError. |
|
[2024-07-06 09:19:16,914] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-168-30.ec2.internal_61479_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.

[2024-07-06 09:19:16,916] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-167-177.ec2.internal_64043_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.

[2024-07-06 09:19:16,917] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-168-34.ec2.internal_1168489_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.

[2024-07-06 09:19:16,918] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-167-245.ec2.internal_2267265_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.

[2024-07-06 09:19:16,918] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-168-52.ec2.internal_1169895_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.

[2024-07-06 09:19:17,020] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-168-95.ec2.internal_1185412_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.

The torchrun agent on each of these nodes then exited with the same traceback:

Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
    return getattr(self._store, store_op)(*args, **kwargs)
torch.distributed.DistNetworkError: Broken pipe

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    result = agent.run()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
    result = self._invoke_run(role)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 900, in _invoke_run
    num_nodes_waiting = rdzv_handler.num_nodes_waiting()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1083, in num_nodes_waiting
    self._state_holder.sync()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 409, in sync
    get_response = self._backend.get_state()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
    base64_state: bytes = self._call_store("get", self._key)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
|
srun: error: ip-26-0-167-217: task 2: Exited with exit code 1 |
|
srun: error: ip-26-0-167-245: task 3: Exited with exit code 1 |
|
srun: error: ip-26-0-168-30: task 4: Exited with exit code 1 |
|
srun: error: ip-26-0-168-52: task 6: Exited with exit code 1 |
|
srun: error: ip-26-0-167-177: task 1: Exited with exit code 1 |
|
srun: error: ip-26-0-168-34: task 5: Exited with exit code 1 |
|
srun: error: ip-26-0-168-95: task 7: Exited with exit code 1 |
|
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details. |
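As a sketch of what that hint refers to (assuming `hf_transfer` has been installed, e.g. with `pip install hf_transfer`): the accelerated backend is enabled through an environment variable that must be set before `huggingface_hub` is imported. The repo id and file name below are illustrative, not taken from this run:

    import os

    # Enable the Rust-based hf_transfer backend for faster large-file
    # transfers; must be set before importing huggingface_hub.
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

    from huggingface_hub import HfApi

    api = HfApi()
    # Illustrative upload, commented out so this sketch has no side effects:
    # api.upload_file(path_or_fileobj="log.out", path_in_repo="log.out",
    #                 repo_id="username/some-repo", repo_type="dataset")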
|
|