========================
START TIME: Sat Jul 6 09:34:54 UTC 2024
python3 version = Python 3.10.14
========================
The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /admin/home/ferdinand_mom/.cache/huggingface/token
Login successful
Already on 'bench_cluster'
 M examples/config_tiny_llama.py
 M examples/config_tiny_llama.yaml
 M examples/train_tiny_llama.sh
Your branch is up to date with 'origin/bench_cluster'.
Job status: RUNNING
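[editor's note] The `add_to_git_credential` hint at the top of this log refers to the huggingface_hub login call. A minimal sketch of what acting on it would look like is below; the token literal is a placeholder, and whether this job logs in programmatically or via the CLI is an assumption, not something the log shows.

    # Sketch only: programmatic Hugging Face login; "hf_xxx" is a placeholder token.
    from huggingface_hub import login

    # add_to_git_credential=True also stores the token with the git credential helper,
    # which is what the message above is suggesting.
    login(token="hf_xxx", add_to_git_credential=True)

    # CLI equivalent: huggingface-cli login --add-to-git-credential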
[2024-07-06 09:34:57,164] torch.distributed.run: [WARNING]
[2024-07-06 09:34:57,164] torch.distributed.run: [WARNING] *****************************************
[2024-07-06 09:34:57,164] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-07-06 09:34:57,164] torch.distributed.run: [WARNING] *****************************************
[... the same OMP_NUM_THREADS warning is printed by the torch.distributed.run agent on each of the other seven nodes, timestamps 09:34:57,165 through 09:34:57,242 ...]
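[editor's note] torchrun only injects OMP_NUM_THREADS=1 when the variable is not already set, so the usual way to act on this warning is to export an explicit value in the environment the launcher inherits. A minimal sketch, assuming the value is chosen in the launch wrapper; the thread count of 4 is an arbitrary placeholder.

    # Sketch: set OMP_NUM_THREADS explicitly before torch is imported / torchrun is invoked,
    # so the default of 1 is a deliberate choice rather than a silent fallback.
    import os

    # Placeholder value; a common choice is cpus_per_task divided by the local world size.
    os.environ.setdefault("OMP_NUM_THREADS", "4")

    # Shell equivalent in the launch script: OMP_NUM_THREADS=4 torchrun ...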
[default0]:[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:48529 (errno: 98 - Address already in use).
[default0]:[W socket.cpp:464] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
[default0]:[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
[default0]:Traceback (most recent call last):
[default0]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py", line 233, in <module>
[default0]:    trainer = DistributedTrainer(config_file)
[default0]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/trainer.py", line 145, in __init__
[default0]:    self.parallel_context = ParallelContext(
[default0]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/parallel/context.py", line 53, in __init__
[default0]:    dist.initialize_torch_distributed()
[default0]:  File "/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/src/nanotron/distributed.py", line 278, in initialize_torch_distributed
[default0]:    dist.init_process_group(
[default0]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
[default0]:    func_return = func(*args, **kwargs)
[default0]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1177, in init_process_group
[default0]:    store, rank, world_size = next(rendezvous_iterator)
[default0]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
[default0]:    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
[default0]:  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store
[default0]:    return TCPStore(
[default0]:torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:48529 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
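[editor's note] This is the root failure of the job: rank 0 on ip-26-0-163-220 could not bind the c10d TCPStore because port 48529 was already in use (errno 98), so init_process_group never came up. One common mitigation, sketched below under the assumption that the launch script exports the master/rendezvous port itself, is to ask the OS for a free port on the head node right before launching.

    # Sketch only: probe for a free TCP port so the rendezvous/master port is not one that
    # is already taken. The script name and the MASTER_PORT convention are assumptions about
    # how this job is launched; there is still a small race between probing and binding,
    # but it avoids reusing a port pinned by a previous or still-draining job.
    import socket

    def find_free_port() -> int:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.bind(("", 0))              # port 0 = let the OS pick any free port
            return s.getsockname()[1]

    if __name__ == "__main__":
        # e.g. in the sbatch script: export MASTER_PORT=$(python3 pick_free_port.py)
        print(find_free_port())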
[2024-07-06 09:35:08,437] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183941 closing signal SIGTERM
[2024-07-06 09:35:08,438] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183942 closing signal SIGTERM
[2024-07-06 09:35:08,438] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183943 closing signal SIGTERM
[2024-07-06 09:35:08,438] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183944 closing signal SIGTERM
[2024-07-06 09:35:08,438] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183945 closing signal SIGTERM
[2024-07-06 09:35:08,438] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183946 closing signal SIGTERM
[2024-07-06 09:35:08,439] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 183947 closing signal SIGTERM
[2024-07-06 09:35:09,341] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 183940) of binary: /fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/python3.10
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/fsx/ferdinandmom/ferdinand-hf/bench_cluster/nanotron/run_train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-06_09:35:08
  host      : ip-26-0-163-220.ec2.internal
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 183940)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
srun: error: ip-26-0-163-220: task 0: Exited with exit code 1
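[editor's note] The failure record above has no traceback because the training entrypoint does not write one to an error file. The page it links to describes decorating the entrypoint with `record` from torch.distributed.elastic; a minimal sketch follows, where the `main(config_file)` shape is only an assumption about how run_train.py is organised, not its actual code.

    # Sketch, not the actual run_train.py: wrapping the entrypoint with @record lets the
    # elastic agent capture the failing rank's traceback into the error_file slot shown
    # in the failure record above.
    from torch.distributed.elastic.multiprocessing.errors import record

    @record
    def main(config_file: str) -> None:
        ...  # build the trainer from config_file and train

    if __name__ == "__main__":
        main("examples/config_tiny_llama.yaml")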
[2024-07-06 09:35:13,325] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-172-252.ec2.internal_3117434_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,371] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-18.ec2.internal_961861_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,378] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-45.ec2.internal_1411207_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,393] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-187.ec2.internal_423434_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,398] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-163-236.ec2.internal_1020445_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,402] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-75.ec2.internal_967841_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,422] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-173-7.ec2.internal_2779941_0' has failed to send a keep-alive heartbeat to the rendezvous 'none' due to an error of type RendezvousConnectionError.
[... between 09:35:13,443 and 09:35:13,448 the seven surviving agents each send SIGTERM to their eight local workers (PIDs 961930-961937, 1020514-1020521, 1411275-1411282, 423506-423513, 3117503-3117510, 967910-967917 and 2780010-2780017); the individual "Sending process N closing signal SIGTERM" warnings are interleaved in the original output ...]
[2024-07-06 09:35:13,953] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-172-252.ec2.internal_3117434_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,953] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-45.ec2.internal_1411207_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,955] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-173-7.ec2.internal_2779941_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,955] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-163-236.ec2.internal_1020445_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:13,957] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-18.ec2.internal_961861_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:14,052] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-187.ec2.internal_423434_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
[2024-07-06 09:35:14,053] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'ip-26-0-164-75.ec2.internal_967841_0' has failed to shutdown the rendezvous 'none' due to an error of type RendezvousConnectionError.
[... each of the seven agents then exits with the same traceback; in the original output the seven copies are interleaved line by line, so a single de-interleaved copy is reproduced here ...]
Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
    return getattr(self._store, store_op)(*args, **kwargs)
torch.distributed.DistNetworkError: Broken pipe

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    result = agent.run()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
    result = self._invoke_run(role)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 900, in _invoke_run
    num_nodes_waiting = rdzv_handler.num_nodes_waiting()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1083, in num_nodes_waiting
    self._state_holder.sync()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 409, in sync
    get_response = self._backend.get_state()
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
    base64_state: bytes = self._call_store("get", self._key)
  File "/fsx/ferdinandmom/miniforge3/envs/env-bench-cluster/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
srun: error: ip-26-0-164-45: task 3: Exited with exit code 1
srun: error: ip-26-0-164-18: task 2: Exited with exit code 1
srun: error: ip-26-0-173-7: task 7: Exited with exit code 1
srun: error: ip-26-0-172-252: task 6: Exited with exit code 1
srun: error: ip-26-0-163-236: task 1: Exited with exit code 1
srun: error: ip-26-0-164-187: task 5: Exited with exit code 1
srun: error: ip-26-0-164-75: task 4: Exited with exit code 1
Consider using `hf_transfer` for faster uploads. This solution comes with some limitations.
See https://huggingface.co/docs/huggingface_hub/hf_transfer for more details.
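[editor's note] Enabling `hf_transfer` is opt-in: the extra package must be installed and HF_HUB_ENABLE_HF_TRANSFER must be set before huggingface_hub is imported. A minimal sketch follows; the file name and repo_id are placeholders, since the log does not show what this job uploads.

    # Sketch only: opt into hf_transfer for uploads (requires `pip install hf_transfer`).
    import os
    os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # set before importing huggingface_hub

    from huggingface_hub import HfApi

    # Placeholder file and repo; replace with whatever this benchmark run actually pushes.
    HfApi().upload_file(
        path_or_fileobj="log.out",
        path_in_repo="log.out",
        repo_id="your-username/bench-cluster-logs",
    )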