	Install docker and docker-compose on Ubuntu 22.04

	__Prerequisites__:

	* An AWS account with a user that has the necessary permissions
	* The user's access key, available either as environment variables or as GitHub Actions secrets
	* An EC2 instance created in the AWS account to act as the runner
	* An S3 bucket created in the AWS account
	* An Amazon ECR (Elastic Container Registry) repository created in the AWS account

	__Local VM setup__:
	* Install the AWS CLI, then run `aws configure` to set the access key, secret key, and region
	```bash
	curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
	unzip awscliv2.zip
	sudo ./aws/install
	aws configure
	```


	__Install docker__:
	```bash
	sudo apt update
	sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
	curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
	echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
	sudo apt update
	sudo apt install -y docker-ce
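	sudo systemctl start docker
	sudo systemctl enable docker
	# Add the current user to the docker group so docker runs without sudo
	sudo usermod -aG docker $USER
	sudo systemctl restart docker
	# Reboot so the group change takes effect, then reconnect and run the checks below
	sudo reboot
	docker --version
	docker ps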
	```
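
The `usermod -aG docker` step only applies to new login sessions, which is one reason `docker ps` can still fail with a permission error right after installation. A minimal sketch of a check for this, assuming a hypothetical helper named `in_group` (not part of Docker or the original steps):

```bash
# Hypothetical helper: succeed if the current session's groups include "$1".
in_group() {
    id -nG | tr ' ' '\n' | grep -qx -- "$1"
}

if in_group docker; then
    echo "docker group is active in this session"
else
    echo "docker group not active yet: log out and back in, or run 'newgrp docker'"
fi
```

If the second message appears, reconnecting over SSH is usually enough; a full reboot is not strictly required for the group change.
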
	__Install docker-compose__:
	```bash
	# Remove any previously installed binary (-f: no error if it does not exist)
	sudo rm -f /usr/local/bin/docker-compose
	sudo curl -L "https://github.com/docker/compose/releases/download/v2.30.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
	sudo chmod +x /usr/local/bin/docker-compose
	docker-compose --version
	```
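
Before moving on to the runner setup, it can help to confirm that the tools installed so far are actually on `PATH`. The following is a small sketch under that assumption; `missing_tools` is a hypothetical helper name introduced here for illustration:

```bash
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools docker docker-compose aws)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on one line.
    echo "Not installed yet:" $missing
else
    echo "docker, docker-compose and aws are all on PATH"
fi
```
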

	__Github actions self-hosted runner__:
	```bash
	mkdir actions-runner && cd actions-runner
	curl -o actions-runner-linux-x64-2.320.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.320.0/actions-runner-linux-x64-2.320.0.tar.gz
	echo "93ac1b7ce743ee85b5d386f5c1787385ef07b3d7c728ff66ce0d3813d5f46900  actions-runner-linux-x64-2.320.0.tar.gz" | shasum -a 256 -c
	tar xzf ./actions-runner-linux-x64-2.320.0.tar.gz
	# Register the runner; get a fresh token from
	# https://github.com/soutrik71/pytorch-template-aws/settings/actions/runners/new?arch=x64&os=linux
	./config.sh --url https://github.com/soutrik71/pytorch-template-aws --token <Latest>
	# Start the runner (run from inside actions-runner/)
	./run.sh
	# To remove the runner later:
	./config.sh remove --token <>
	```
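
The `shasum -a 256 -c` line above guards against a corrupted or tampered runner download. The same idea can be sketched as a reusable function; `verify_sha256` is a hypothetical name, and this version assumes `sha256sum` from GNU coreutils is available instead of `shasum`:

```bash
# Hypothetical helper: succeed only if the SHA-256 of file "$1" equals "$2".
verify_sha256() {
    actual=$(sha256sum "$1" | cut -d ' ' -f 1)
    [ "$actual" = "$2" ]
}

# Demonstration on a throwaway file rather than the real runner tarball.
tmp=$(mktemp)
printf 'hello\n' > "$tmp"
expected=$(sha256sum "$tmp" | cut -d ' ' -f 1)

verify_sha256 "$tmp" "$expected" && echo "checksum OK"
verify_sha256 "$tmp" "0000" || echo "checksum mismatch detected"
rm -f "$tmp"
```

For the real download, the second argument would be the hash published on the runner's release page, and a mismatch should stop the setup before `tar` is run.
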
	__Activate aws cli__:
	```bash
	curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
	sudo apt install unzip
	unzip awscliv2.zip
	sudo ./aws/install
	aws --version
	aws configure

	```
	__S3 bucket operations__:
	```bash
	aws s3 cp data s3://deep-bucket-s3/data --recursive
	aws s3 ls s3://deep-bucket-s3
	aws s3 rm s3://deep-bucket-s3/data --recursive
	```

	__Cuda Update Setup__:
	```bash
	# if you already have nvidia drivers installed and you have a Tesla T4 GPU
	sudo apt update
	sudo apt upgrade
	sudo reboot

	sudo apt --fix-broken install
	sudo apt install ubuntu-drivers-common
	sudo apt autoremove

	nvidia-smi
	lsmod | grep nvidia

	sudo apt install nvidia-cuda-toolkit
	nvcc --version

	ls /usr/local/ | grep cuda
	ldconfig -p | grep cudnn
	lspci | grep -i nvidia

	```

	The output of the commands above gives the following picture of the GPU, CUDA, and environment setup:

	---

	### 1. GPU Details
	- Model: Tesla T4
	  - A popular NVIDIA GPU for deep learning and AI workloads.
	  - It belongs to the Turing architecture (TU104GL).

	- Memory: 16 GB
	  - Only 2 MiB is currently in use, indicating minimal GPU activity.

	- Temperature: 25°C
	  - The GPU is operating at a low temperature, suggesting no heavy utilization currently.

	- Power Usage: 11W / 70W
	  - The GPU is in idle or low-performance mode (P8).

	- MIG Mode: Not enabled.
	  - MIG (Multi-Instance GPU) is available only on certain data-center GPUs such as the NVIDIA A100; the T4 does not support it, so it is not applicable here.

	---

	### 2. Driver and CUDA Version
	- Driver Version: 535.216.03
	  - The installed NVIDIA driver supports CUDA 12.x.

	- CUDA Version (reported by `nvidia-smi`): 12.2
	  - This is the highest CUDA version the driver supports, not the version of any installed toolkit.

	---

	### 3. CUDA Toolkit Versions
	From the `nvcc` and file system checks:
	- Default `nvcc` Version: CUDA 10.1
	  - The system's default `nvcc` points to an older CUDA 10.1 installation (`nvcc --version` output shows CUDA 10.1).

	- Installed CUDA Toolkits:
	  - `cuda-12`
	  - `cuda-12.2`
	  - `cuda` (likely symlinked to `cuda-12.2`)

	Multiple CUDA versions are installed. The driver aligns with CUDA 12.2, while the default compiler (`nvcc`) is still from CUDA 10.1.

	---

	### 4. cuDNN Version
	From `cudnn_version.h` and `ldconfig`:
	- cuDNN Version: 9.5.1
	  - This cuDNN version is compatible with CUDA 12.x.
	- cuDNN Runtime: The libraries for cuDNN 9 are present under `/lib/x86_64-linux-gnu`.

	---

	### 5. NVIDIA Software Packages
	From `dpkg`:
	- NVIDIA Drivers: Driver version 535 is installed.
	- CUDA Toolkit: Multiple versions installed (`10.1`, `12`, `12.2`).
	- cuDNN: Versions for CUDA 12 and CUDA 12.6 are installed (`cudnn9-cuda-12`, `cudnn9-cuda-12-6`).

	---

	### 6. Other Observations
	- Graphics Settings Issue:
	  - `nvidia-settings` failed due to the lack of a display server connection (`Connection refused`). Most likely this is a headless server without a GUI environment.

	- OpenGL Tools Missing:
	  - The `glxinfo` command is missing, indicating the `mesa-utils` package needs to be installed.

	---

	### Summary of Setup
	- GPU: Tesla T4
	- Driver Version: 535.216.03
	- CUDA Version (driver-supported): 12.2
	- CUDA Toolkit Versions: 10.1 (default `nvcc`), 12, 12.2
	- cuDNN Version: 9.5.1 (compatible with CUDA 12.x)
	- Software Packages: NVIDIA drivers, CUDA, cuDNN installed
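
The mismatch described above (a driver that supports CUDA 12.2 next to a default `nvcc` from CUDA 10.1) can be detected by comparing the two version strings. The sketch below parses them from text; `nvcc_release` and `driver_cuda` are hypothetical helper names, and the sample lines are illustrative stand-ins for real `nvcc --version` and `nvidia-smi` output:

```bash
# Hypothetical helpers: extract version numbers from tool output passed as text.
nvcc_release() {      # e.g. "Cuda compilation tools, release 10.1, V10.1.243" -> 10.1
    printf '%s\n' "$1" | sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}
driver_cuda() {       # e.g. "... CUDA Version: 12.2 ..." -> 12.2
    printf '%s\n' "$1" | sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Illustrative sample lines; on a real machine pass "$(nvcc --version)" and "$(nvidia-smi)".
toolkit=$(nvcc_release "Cuda compilation tools, release 10.1, V10.1.243")
driver=$(driver_cuda "| NVIDIA-SMI 535.216.03   Driver Version: 535.216.03   CUDA Version: 12.2 |")

# Compare major versions only.
if [ "${toolkit%%.*}" != "${driver%%.*}" ]; then
    echo "nvcc is CUDA $toolkit but the driver reports CUDA $driver: check PATH"
fi
```

A differing major version usually means `PATH` resolves `nvcc` to an older toolkit, which is what putting `/usr/local/cuda-12.2/bin` first on `PATH` addresses.
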

	__CUDA New Installation__:
	```bash
	# if you don't have nvidia drivers installed and you have a Tesla T4 GPU
	lspci | grep -i nvidia # Check if the GPU is detected
	```
	To set up the T4 GPU from scratch, starting with no drivers or CUDA tools, and reproduce the configuration described above, follow these steps:

	---

	### 1. Update System
	Ensure the system is updated:
	```bash
	sudo apt update && sudo apt upgrade -y
	sudo reboot
	```

	---

	### 2. Install NVIDIA Driver
	#### a. Identify Required Driver
	The T4 GPU requires a compatible NVIDIA driver version. Based on your configurations, we will install Driver 535.

	#### b. Add Driver Repository
	Add the Ubuntu `graphics-drivers` PPA, which packages recent NVIDIA drivers (it is a Launchpad PPA, not a repository run by NVIDIA):
	```bash
	sudo apt install -y software-properties-common
	sudo add-apt-repository -y ppa:graphics-drivers/ppa
	sudo apt update
	```

	#### c. Install Driver
	Install the driver for the T4 GPU:
	```bash
	sudo apt install -y nvidia-driver-535
	```

	#### d. Verify Driver Installation
	Reboot the system and check the driver:
	```bash
	sudo reboot
	# after the machine is back up and you have reconnected:
	nvidia-smi
	```
	This should display the GPU model and driver version.

	---

	### 3. Install CUDA Toolkit
	#### a. Add CUDA Repository
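	Download and install the CUDA 12.2 repository. The commands below target Ubuntu 20.04 (`ubuntu2004`); on Ubuntu 22.04, NVIDIA's repository paths use `ubuntu2204` instead, so check the exact installer filename in NVIDIA's CUDA download archive before running them: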
	```bash
	wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
	sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
	wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda-repo-ubuntu2004-12-2-local_12.2.0-535.86.10-1_amd64.deb
	sudo dpkg -i cuda-repo-ubuntu2004-12-2-local_12.2.0-535.86.10-1_amd64.deb
	sudo cp /var/cuda-repo-ubuntu2004-12-2-local/cuda-*-keyring.gpg /usr/share/keyrings/
	sudo apt update
	```

	#### b. Install CUDA Toolkit
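	Install CUDA 12.2 (the `cuda` metapackage also pulls in NVIDIA driver packages; `cuda-toolkit-12-2` installs the toolkit only):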
	```bash
	sudo apt install -y cuda
	```

	#### c. Set Up Environment Variables
	Add CUDA binaries to the PATH and library paths:
	```bash
	echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> ~/.bashrc
	echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
	source ~/.bashrc
	```
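
Because the `echo ... >> ~/.bashrc` lines append every time they are run, repeating this step leaves duplicate entries on `PATH`. One way to avoid that is a guard that only prepends a directory when it is missing. This is a sketch, with `prepend_path` as a hypothetical helper name:

```bash
# Hypothetical helper: prepend "$1" to PATH only if it is not already there.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, leave PATH unchanged
        *) PATH="$1:$PATH" ;;
    esac
}

prepend_path /usr/local/cuda-12.2/bin
prepend_path /usr/local/cuda-12.2/bin   # second call is a no-op
export PATH
```

Placing the function and the call in `~/.bashrc` keeps `PATH` stable no matter how often the file is sourced.
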

	#### d. Verify CUDA Installation
	Check CUDA installation:
	```bash
	nvcc --version
	nvidia-smi
	```

	---

	### 4. Install cuDNN
	#### a. Download cuDNN
	Download cuDNN 9.5.1 (compatible with CUDA 12.x) from the [NVIDIA cuDNN page](https://developer.nvidia.com/cudnn). You’ll need to log in and download the appropriate `.deb` files for Ubuntu 20.04.

	#### b. Install cuDNN
	Install the downloaded `.deb` files:
	```bash
	sudo dpkg -i libcudnn9*.deb
	```

	#### c. Verify cuDNN
	Check the installed version:
	```bash
	cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
	```

	---

	### 5. Install NCCL and Other Libraries
	Install additional NVIDIA libraries (like NCCL) required for distributed deep learning:
	```bash
	sudo apt install -y libnccl2 libnccl-dev
	```

	---

	### 6. Install PyTorch
	#### a. Install Python Environment
	Install Python and `pip` if not already present:
	```bash
	sudo apt install -y python3 python3-pip
	```

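	#### b. Install PyTorch with CUDA 12 support
	Install a PyTorch build with a CUDA 12 runtime. PyTorch does not publish `cu122` wheels; the `cu121` wheels bundle their own CUDA 12.1 runtime and run on driver 535:
	```bash
	pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121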
	```

	#### c. Test PyTorch
	Run a quick test:
	```python
	import torch
	print(torch.cuda.is_available()) # Should return True
	print(torch.cuda.get_device_name(0)) # Should return "Tesla T4"
	```

	---

	### 7. Optional: Install Nsight Tools
	For debugging and profiling:
	```bash
	sudo apt install -y nsight-compute nsight-systems
	```

	---

	### 8. Check for OpenGL
	If you need OpenGL utilities (like `glxinfo`):
	```bash
	sudo apt install -y mesa-utils
	glxinfo | grep "OpenGL version"
	```

	---

	### 9. Validate Entire Setup
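	Build and run NVIDIA's `deviceQuery` sample to confirm the configuration. The samples are no longer shipped inside the toolkit directory, so clone them from NVIDIA's GitHub repository:
	```bash
	git clone --depth 1 --branch v12.2 https://github.com/NVIDIA/cuda-samples.git
	cd cuda-samples/Samples/1_Utilities/deviceQuery
	make
	./deviceQuery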
	```
	If successful, it should show details of the T4 GPU.

	---

	### Summary of Installed Components
	- GPU: Tesla T4
	- Driver: 535
	- CUDA Toolkit: 12.2
	- cuDNN: 9.5.1
	- PyTorch: Installed with CUDA 12.x support

	This setup ensures your system is ready for deep learning workloads with the T4 GPU.

	After the system-level setup:

	- Install conda and create a new environment for the project
	- Install pytorch and torchvision in the new environment
	- Install other dependencies like numpy, pandas, matplotlib, etc.
	- Run the project code in the new environment

	Then confirm from Python that the GPU is visible:
	```python
	import torch
	print(torch.cuda.is_available())
	print(torch.cuda.get_device_name(0))
	print(torch.version.cuda)
	```

	__CUDA Docker Setup__:
	```bash
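	# If you are using docker and want to run a container with CUDA support
	# (assumes NVIDIA's apt repository for the container toolkit is already configured)
	sudo apt install -y nvidia-container-toolkit
	nvidia-ctk --version
	# Register the NVIDIA runtime with Docker, then restart the daemon
	sudo nvidia-ctk runtime configure --runtime=docker
	sudo systemctl restart docker
	sudo systemctl status docker
	docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi
	# nvcc ships only in the devel image, not in base
	docker run --rm --gpus all nvidia/cuda:12.2.0-devel-ubuntu20.04 nvcc --version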
	```