========================
START TIME: Sat Jul 6 09:34:54 UTC 2024
python3 version = Python 3.10.14
========================
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token
Login successful
Already on 'bench_cluster'
M examples/config_tiny_llama.py
M examples/config_tiny_llama.yaml
M examples/train_tiny_llama.sh
Your branch is up to date with 'origin/bench_cluster'.
Job status: RUNNING
[2024-07-06 09:34:57,164] torch.distributed.run: [WARNING]
[2024-07-06 09:34:57,164] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:57,164] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:34:57,164] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:57,165] torch.distributed.run: [WARNING]
[2024-07-06 09:34:57,165] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:57,165] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:34:57,165] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:57,168] torch.distributed.run: [WARNING]
[2024-07-06 09:34:57,168] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:57,168] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:34:57,168] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:57,167] torch.distributed.run: [WARNING]
[2024-07-06 09:34:57,167] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:57,167] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:34:57,167] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:57,169] torch.distributed.run: [WARNING]
[2024-07-06 09:34:57,169] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:57,169] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:34:57,169] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:57,181] torch.distributed.run: [WARNING]
[2024-07-06 09:34:57,181] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:57,181] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:34:57,181] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:57,197] torch.distributed.run: [WARNING]
[2024-07-06 09:34:57,197] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:57,197] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:34:57,197] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:57,242] torch.distributed.run: [WARNING]
[2024-07-06 09:34:57,242] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:57,242] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:34:57,242] torch.distributed.run: [WARNING] *****************************************
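The eight warnings above are one per torchrun agent: torchrun only sets OMP_NUM_THREADS=1 when the variable is not already present in the environment, so tuning it means exporting it before torchrun starts. Below is a minimal launcher sketch illustrating that; the thread count, script name and flags are assumptions for illustration, not taken from this job's scripts.
import os
import subprocess

# Hypothetical launcher wrapper (not from this repo): export OMP_NUM_THREADS before
# torchrun starts so its default of 1 never kicks in. The value 4 and the command
# line are placeholders; tune the thread count to the cores available per rank.
env = dict(os.environ, OMP_NUM_THREADS="4")
subprocess.run(
    ["torchrun", "--nproc_per_node=8", "run_train.py", "--config-file", "config.yaml"],
    env=env,
    check=True,
)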
[default0]:[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:48529 (errno: 98 - Address already in use).
[default0]:[W socket.cpp:464] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
[default0]:[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
[default0]:Traceback (most recent call last):
[default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module>
[default0]: trainer = DistributedTrainer(config_file)
[default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 145, in __init__
[default0]: self.parallel_context = ParallelContext(
[default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/context.py", line 53, in __init__
[default0]: dist.initialize_torch_distributed()
[default0]: File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/distributed.py", line 278, in initialize_torch_distributed
[default0]: dist.init_process_group(
[default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
[default0]: func_return = func(*args, **kwargs)
[default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1177, in init_process_group
[default0]: store, rank, world_size = next(rendezvous_iterator)
[default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
[default0]: store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
[default0]: File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store
[default0]: return TCPStore(
[default0]:torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:48529 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
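errno 98 means another process on ip-26-0-163-220 was already bound to the rendezvous port 48529, so the C10d TCPStore on the rank-0 node never came up and, as the rest of this log shows, every other node eventually lost the rendezvous. A minimal, generic sketch for picking a free MASTER_PORT before launching follows; it is not part of nanotron or this job's launch scripts.
import socket

def find_free_port() -> int:
    # Bind to port 0 so the OS picks an unused TCP port, then release it for the
    # launcher to reuse as MASTER_PORT (a short race window remains, but this avoids
    # handing torchrun a port that is already taken, which is what errno 98 reports).
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

if __name__ == "__main__":
    # e.g. export MASTER_PORT=$(python find_free_port.py) before calling srun/torchrun
    print(find_free_port())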
[2024-07-06 09:35:08,437] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183941 closing signal SIGTERM
[2024-07-06 09:35:08,438] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183942 closing signal SIGTERM
[2024-07-06 09:35:08,438] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183943 closing signal SIGTERM
[2024-07-06 09:35:08,438] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183944 closing signal SIGTERM
[2024-07-06 09:35:08,438] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183945 closing signal SIGTERM
[2024-07-06 09:35:08,438] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183946 closing signal SIGTERM
[2024-07-06 09:35:08,439] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183947 closing signal SIGTERM
[2024-07-06 09:35:09,341] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 183940) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
Traceback (most recent call last):
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-07-06_09:35:08
host : ip-26-0-163-220.ec2.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 183940)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
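The summary above reports error_file: <N/A> and points to the elastic errors page for enabling tracebacks. That page recommends decorating the training entrypoint with @record; a minimal sketch is below, assuming an entrypoint shaped like run_train.py (whether this repo already applies it is not visible from this log).
from torch.distributed.elastic.multiprocessing.errors import record

@record  # records the worker's traceback so the torchrun failure summary can show it instead of <N/A>
def main() -> None:
    ...  # placeholder: build the config / DistributedTrainer and run training here

if __name__ == "__main__":
    main()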
srun: error: ip-26-0-163-220: task 0: Exited with exit code 1
[2024-07-06 09:35:13,325] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-172-252.ec2.internal_3117434_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,371] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-18.ec2.internal_961861_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,378] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-45.ec2.internal_1411207_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,393] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-187.ec2.internal_423434_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,398] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-163-236.ec2.internal_1020445_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,402] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-75.ec2.internal_967841_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,422] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-173-7.ec2.internal_2779941_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,443] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 961930 closing signal SIGTERM
[2024-07-06 09:35:13,443] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 961931 closing signal SIGTERM
[2024-07-06 09:35:13,444] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 961932 closing signal SIGTERM
[2024-07-06 09:35:13,444] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 961933 closing signal SIGTERM
[2024-07-06 09:35:13,444] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 961934 closing signal SIGTERM
[2024-07-06 09:35:13,444] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 961935 closing signal SIGTERM
[2024-07-06 09:35:13,444] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 961936 closing signal SIGTERM
[2024-07-06 09:35:13,444] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 961937 closing signal SIGTERM
[2024-07-06 09:35:13,443] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1020514 closing signal SIGTERM
[2024-07-06 09:35:13,444] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1020515 closing signal SIGTERM
[2024-07-06 09:35:13,444] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1020516 closing signal SIGTERM
[2024-07-06 09:35:13,444] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1020517 closing signal SIGTERM
[2024-07-06 09:35:13,444] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1020518 closing signal SIGTERM
[2024-07-06 09:35:13,444] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1020519 closing signal SIGTERM
[2024-07-06 09:35:13,444] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1020520 closing signal SIGTERM
[2024-07-06 09:35:13,444] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1020521 closing signal SIGTERM
[2024-07-06 09:35:13,445] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1411275 closing signal SIGTERM
[2024-07-06 09:35:13,445] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1411276 closing signal SIGTERM
[2024-07-06 09:35:13,446] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1411277 closing signal SIGTERM
[2024-07-06 09:35:13,446] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1411278 closing signal SIGTERM
[2024-07-06 09:35:13,445] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 423506 closing signal SIGTERM
[2024-07-06 09:35:13,446] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1411279 closing signal SIGTERM
[2024-07-06 09:35:13,446] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 423507 closing signal SIGTERM
[2024-07-06 09:35:13,446] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1411280 closing signal SIGTERM
[2024-07-06 09:35:13,446] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3117503 closing signal SIGTERM
[2024-07-06 09:35:13,446] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1411281 closing signal SIGTERM
[2024-07-06 09:35:13,446] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3117504 closing signal SIGTERM
[2024-07-06 09:35:13,446] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1411282 closing signal SIGTERM
[2024-07-06 09:35:13,446] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 423508 closing signal SIGTERM
[2024-07-06 09:35:13,447] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 967910 closing signal SIGTERM
[2024-07-06 09:35:13,447] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2780010 closing signal SIGTERM
[2024-07-06 09:35:13,446] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 423509 closing signal SIGTERM
[2024-07-06 09:35:13,447] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 967911 closing signal SIGTERM
[2024-07-06 09:35:13,447] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2780011 closing signal SIGTERM
[2024-07-06 09:35:13,446] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 423510 closing signal SIGTERM
[2024-07-06 09:35:13,447] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3117505 closing signal SIGTERM
[2024-07-06 09:35:13,446] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 423511 closing signal SIGTERM
[2024-07-06 09:35:13,447] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3117506 closing signal SIGTERM
[2024-07-06 09:35:13,447] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2780012 closing signal SIGTERM
[2024-07-06 09:35:13,446] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 423512 closing signal SIGTERM
[2024-07-06 09:35:13,447] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 967912 closing signal SIGTERM
[2024-07-06 09:35:13,447] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3117507 closing signal SIGTERM
[2024-07-06 09:35:13,448] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2780013 closing signal SIGTERM
[2024-07-06 09:35:13,446] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 423513 closing signal SIGTERM
[2024-07-06 09:35:13,447] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 967913 closing signal SIGTERM
[2024-07-06 09:35:13,447] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3117508 closing signal SIGTERM
[2024-07-06 09:35:13,448] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2780014 closing signal SIGTERM
[2024-07-06 09:35:13,447] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 967914 closing signal SIGTERM
[2024-07-06 09:35:13,447] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3117509 closing signal SIGTERM
[2024-07-06 09:35:13,448] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2780015 closing signal SIGTERM
[2024-07-06 09:35:13,447] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 967915 closing signal SIGTERM
[2024-07-06 09:35:13,447] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 3117510 closing signal SIGTERM
[2024-07-06 09:35:13,448] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2780016 closing signal SIGTERM
[2024-07-06 09:35:13,448] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 967916 closing signal SIGTERM
[2024-07-06 09:35:13,448] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2780017 closing signal SIGTERM
[2024-07-06 09:35:13,448] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 967917 closing signal SIGTERM
[2024-07-06 09:35:13,953] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-172-252.ec2.internal_3117434_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,953] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-45.ec2.internal_1411207_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,955] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-173-7.ec2.internal_2779941_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,955] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-163-236.ec2.internal_1020445_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,957] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-18.ec2.internal_961861_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
The torchrun agent on each of these five nodes then exited with the same traceback:
Traceback (most recent call last):
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
torch.distributed.DistNetworkError: Broken pipe
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
result = agent.run()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
result = self._invoke_run(role)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 900, in _invoke_run
num_nodes_waiting = rdzv_handler.num_nodes_waiting()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1083, in num_nodes_waiting
self._state_holder.sync()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 409, in sync
get_response = self._backend.get_state()
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
base64_state: bytes = self._call_store("get", self._key)
File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
[2024-07-06 09:35:14,052] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-187.ec2.internal_423434_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:14,053] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-75.ec2.internal_967841_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
The torchrun agents on 'ip-26-0-164-187.ec2.internal' and 'ip-26-0-164-75.ec2.internal' then exited with the same RendezvousConnectionError traceback shown above (DistNetworkError: Broken pipe while reading the rendezvous state from the C10d store).
srun: error: ip-26-0-164-45: task 3: Exited with exit code 1
srun: error: ip-26-0-164-18: task 2: Exited with exit code 1
srun: error: ip-26-0-173-7: task 7: Exited with exit code 1
srun: error: ip-26-0-172-252: task 6: Exited with exit code 1
srun: error: ip-26-0-163-236: task 1: Exited with exit code 1
srun: error: ip-26-0-164-187: task 5: Exited with exit code 1
srun: error: ip-26-0-164-75: task 4: Exited with exit code 1
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations. See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
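A minimal sketch of acting on that hint, assuming `hf_transfer` has been installed with `pip install hf_transfer`; the repo id and file paths below are placeholders, not taken from this upload.
import os

# Must be set before huggingface_hub performs any transfer for hf_transfer to be used.
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import HfApi

HfApi().upload_file(
    path_or_fileobj="log.out",                                     # placeholder local file
    path_in_repo="llama-1B/64_GPUS/dp-1_tp-8_pp-8_mbz-2/log.out",  # placeholder destination
    repo_id="someuser/bench_cluster",                              # placeholder repo id
)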