**Install docker and docker-compose on Ubuntu 22.04**

__Prerequisites__:

* An AWS account with a user that has the necessary permissions
* The access key available either as environment variables or as GitHub Actions secrets
* An EC2 runner instance created and running in the AWS account
* An S3 bucket created in the AWS account
* An Amazon ECR (Elastic Container Registry) repository created in the AWS account

__Local VM setup__:

* Install the AWS CLI, then run `aws configure` to set the access key, the secret key, and the right region

```bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
aws configure
```

__Install docker__:

```bash
sudo apt update
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce
sudo systemctl start docker
sudo systemctl enable docker
# Allow the current user to run docker without sudo
sudo usermod -aG docker $USER
sudo systemctl restart docker
# Reboot (or log out and back in) so the group change takes effect
sudo reboot
# After the reboot, verify the installation
docker --version
docker ps
```

__Install docker-compose__:

```bash
# Remove any previous standalone binary (-f: no error if it does not exist)
sudo rm -f /usr/local/bin/docker-compose
# v2 release assets are published with lowercase names (docker-compose-linux-x86_64), so lowercase `uname -s`
sudo curl -L "https://github.com/docker/compose/releases/download/v2.30.0/docker-compose-$(uname -s | tr '[:upper:]' '[:lower:]')-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
docker-compose --version
```

__Github actions self-hosted runner__:

```bash
mkdir actions-runner && cd actions-runner
curl -o actions-runner-linux-x64-2.320.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.320.0/actions-runner-linux-x64-2.320.0.tar.gz
echo "93ac1b7ce743ee85b5d386f5c1787385ef07b3d7c728ff66ce0d3813d5f46900  actions-runner-linux-x64-2.320.0.tar.gz" | shasum -a 256 -c
tar xzf ./actions-runner-linux-x64-2.320.0.tar.gz
# The registration token is shown on:
# https://github.com/soutrik71/pytorch-template-aws/settings/actions/runners/new?arch=x64&os=linux
./config.sh --url https://github.com/soutrik71/pytorch-template-aws --token <>
# cd actions-runner/
./run.sh
# To remove the runner
./config.sh remove --token <>
```

__Activate aws cli__:

```bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
sudo apt install -y unzip
unzip awscliv2.zip
sudo ./aws/install
aws --version
aws configure
```

__S3 bucket operations__:

```bash
# Upload the local data directory
aws s3 cp data s3://deep-bucket-s3/data --recursive
# List the bucket contents
aws s3 ls s3://deep-bucket-s3
# Delete the uploaded data
aws s3 rm s3://deep-bucket-s3/data --recursive
```

__Cuda Update Setup__:

```bash
# If you already have NVIDIA drivers installed and you have a Tesla T4 GPU
sudo apt update
sudo apt upgrade
sudo reboot
# After the reboot
sudo apt --fix-broken install
sudo apt install ubuntu-drivers-common
sudo apt autoremove
nvidia-smi
lsmod | grep nvidia
# Ubuntu's packaged toolkit; it can be older than the CUDA version the driver supports
sudo apt install nvidia-cuda-toolkit
nvcc --version
ls /usr/local/ | grep cuda
ldconfig -p | grep cudnn
lspci | grep -i nvidia
```

The checks above gave the following picture of the GPU, CUDA, and environment setup:

---

### **1. GPU Details**

- **Model**: Tesla T4
  - A popular NVIDIA GPU for deep learning and AI workloads.
  - It belongs to the Turing architecture (TU104GL).
- **Memory**: 16 GB
  - Only **2 MiB is currently in use**, indicating minimal GPU activity.
- **Temperature**: 25°C
  - The GPU is running cool, which suggests no heavy utilization at the moment.
- **Power Usage**: 11W / 70W
  - The GPU is in an idle, low-power state (P8).
- **MIG Mode**: Not enabled.
  - MIG (Multi-Instance GPU) is available only on certain data-center GPUs such as the A100, so it does not apply to the T4.

---

### **2. Driver and CUDA Version**

- **Driver Version**: 535.216.03
  - The installed NVIDIA driver supports CUDA 12.x.
- **CUDA Version (reported by `nvidia-smi`)**: 12.2
  - This is the highest CUDA version the installed driver supports, not necessarily the version of the installed toolkit.

---

### **3. CUDA Toolkit Versions**

From the `nvcc` and file system checks:

- **Default `nvcc` Version**: CUDA 10.1
  - The system's default `nvcc` points to an older CUDA 10.1 installation (`nvcc --version` reports CUDA 10.1).
- **Installed CUDA Toolkits**:
  - `cuda-12`
  - `cuda-12.2`
  - `cuda` (likely symlinked to `cuda-12.2`)

Multiple CUDA versions are installed. The driver aligns with **CUDA 12.2**, while the default compiler (`nvcc`) is still the one from CUDA 10.1.

---

### **4. cuDNN Version**

From `cudnn_version.h` and `ldconfig`:

- **cuDNN Version**: 9.5.1
  - This cuDNN version is compatible with **CUDA 12.x**.
- **cuDNN Runtime**: The libraries for cuDNN 9 are present under `/lib/x86_64-linux-gnu`.

---

### **5. NVIDIA Software Packages**

From `dpkg`:

- **NVIDIA Drivers**: Driver version 535 is installed.
- **CUDA Toolkit**: Multiple versions installed (`10.1`, `12`, `12.2`).
- **cuDNN**: Packages for CUDA 12 and CUDA 12.6 are installed (`cudnn9-cuda-12`, `cudnn9-cuda-12-6`).

---

### **6. Other Observations**

- **Graphics Settings Issue**:
  - `nvidia-settings` failed because it could not connect to a display server (`Connection refused`). This is most likely a headless server without a GUI environment.
- **OpenGL Tools Missing**:
  - The `glxinfo` command is missing; it is provided by the `mesa-utils` package.

---

### **Summary of Setup**

- **GPU**: Tesla T4
- **Driver Version**: 535.216.03
- **CUDA Version supported by the driver**: 12.2
- **CUDA Toolkit Versions**: 10.1 (default `nvcc`), 12, 12.2
- **cuDNN Version**: 9.5.1 (compatible with CUDA 12.x)
- **Software Packages**: NVIDIA drivers, CUDA, cuDNN installed

__CUDA New Installation__:

```bash
# If you don't have NVIDIA drivers installed and you have a Tesla T4 GPU
lspci | grep -i nvidia # Check if the GPU is detected
```

To set up the T4 GPU from scratch, starting with no drivers or CUDA tools, and reproduce the configuration described above, follow these steps:

---

### **1. Update System**

Ensure the system is updated:

```bash
sudo apt update && sudo apt upgrade -y
sudo reboot
```

---

### **2. Install NVIDIA Driver**

#### **a. Identify Required Driver**

The T4 GPU requires a compatible NVIDIA driver version. To match the configuration above, install **Driver 535**.

#### **b. Add NVIDIA Repository**

Add the `graphics-drivers` PPA, which packages recent NVIDIA drivers for Ubuntu:

```bash
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:graphics-drivers/ppa
sudo apt update
```

#### **c. Install Driver**

Install the driver for the T4 GPU:

```bash
sudo apt install -y nvidia-driver-535
```

#### **d. Verify Driver Installation**

Reboot the system and check the driver:

```bash
sudo reboot
nvidia-smi
```

This should display the GPU model and driver version.

---

### **3. Install CUDA Toolkit**

#### **a. Add CUDA Repository**

Download and install the CUDA 12.2 local repository for Ubuntu 20.04. On Ubuntu 22.04, use the `ubuntu2204` repository and the matching installer from NVIDIA's CUDA download archive instead.

```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
# CUDA 12.2.0 is packaged with driver 535.54.03; confirm the exact file name in the CUDA download archive
wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda-repo-ubuntu2004-12-2-local_12.2.0-535.54.03-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-12-2-local_12.2.0-535.54.03-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2004-12-2-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt update
```

#### **b. Install CUDA Toolkit**

Install CUDA 12.2:

```bash
sudo apt install -y cuda
```

#### **c. Set Up Environment Variables**

Add the CUDA binaries and libraries to `PATH` and `LD_LIBRARY_PATH`, so that the CUDA 12.2 `nvcc` takes precedence over any older one:

```bash
echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
```

#### **d. Verify CUDA Installation**

Check the CUDA installation:

```bash
nvcc --version
nvidia-smi
```

---

### **4. Install cuDNN**

#### **a. Download cuDNN**

Download cuDNN 9.5.1 (compatible with CUDA 12.x) from the [NVIDIA cuDNN page](https://developer.nvidia.com/cudnn). You may need to log in. Download the `.deb` package for Ubuntu 20.04.

#### **b. Install cuDNN**

Install the downloaded `.deb` files. If the download is a local repository package rather than `libcudnn9*.deb` files, follow the commands shown on the download page, which install a package such as `cudnn9-cuda-12`.

```bash
sudo dpkg -i libcudnn9*.deb
```

#### **c. Verify cuDNN**

Check the installed version:

```bash
grep CUDNN_MAJOR -A 2 /usr/include/cudnn_version.h
```

---

### **5. Install NCCL and Other Libraries**

Install additional NVIDIA libraries (like NCCL) required for distributed deep learning:

```bash
sudo apt install -y libnccl2 libnccl-dev
```

---

### **6. Install PyTorch**

#### **a. Install Python Environment**

Install Python and `pip` if not already present:

```bash
sudo apt install -y python3 python3-pip
```

#### **b. Install PyTorch with CUDA 12.x**

PyTorch does not publish a `cu122` wheel index. The `cu121` wheels bundle their own CUDA runtime and run on a driver that supports CUDA 12.2:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

#### **c. Test PyTorch**

Run a quick test:

```python
import torch
print(torch.cuda.is_available())  # Should print True
print(torch.cuda.get_device_name(0))  # Should print "Tesla T4"
```

---

### **7. Optional: Install Nsight Tools**

For debugging and profiling:

```bash
sudo apt install -y nsight-compute nsight-systems
```

---

### **8. Check for OpenGL**

If you need OpenGL utilities (like `glxinfo`):

```bash
sudo apt install -y mesa-utils
glxinfo | grep "OpenGL version"
```

---

### **9. Validate Entire Setup**

Build and run the NVIDIA `deviceQuery` sample to confirm the configuration. CUDA 11.6 and later no longer ship the samples inside the toolkit directory, so fetch them from GitHub:

```bash
git clone --depth 1 --branch v12.2 https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/1_Utilities/deviceQuery
make
./deviceQuery
```

If successful, it should show details of the T4 GPU.

---

### **Summary of Installed Components**

- **GPU**: Tesla T4
- **Driver**: 535
- **CUDA Toolkit**: 12.2
- **cuDNN**: 9.5.1
- **PyTorch**: Installed from the CUDA 12.1 wheels, which run on the CUDA 12.2 driver

With these components in place, the system is ready for deep learning workloads on the T4 GPU.

Next steps for the project:

- Install conda and create a new environment for the project
- Install pytorch and torchvision in the new environment
- Install other dependencies like numpy, pandas, matplotlib, etc.
- Run the project code in the new environment

```python
>>> import torch
>>> print(torch.cuda.is_available())
>>> print(torch.cuda.get_device_name(0))
>>> print(torch.version.cuda)
```

__CUDA Docker Setup__:

```bash
# If you are using docker and want to run a container with CUDA support
sudo apt install -y nvidia-container-toolkit
nvidia-ctk --version
# Register the NVIDIA runtime with Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
sudo systemctl status docker
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi
# nvcc ships in the devel image, not in the base image
docker run --rm --gpus all nvidia/cuda:12.2.0-devel-ubuntu20.04 nvcc --version
```
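
The Cuda Update Setup findings showed `nvcc` reporting CUDA 10.1 while `nvidia-smi` reported CUDA 12.2. The sketch below is one way to spot that situation from a script. The function names (`driver_cuda_version`, `nvcc_cuda_version`, `compare_cuda_versions`) are hypothetical helpers introduced here for illustration, and the parsing assumes the usual `CUDA Version: X.Y` text in the `nvidia-smi` header and the `release X.Y` text in `nvcc --version`; adjust the patterns if your tool versions print something different.

```bash
#!/usr/bin/env bash
# Sketch: compare the CUDA version nvcc reports with the one the driver supports.
# All three function names are illustrative helpers, not part of any NVIDIA tool.

# Read nvidia-smi text on stdin, print the "X.Y" that follows "CUDA Version:".
driver_cuda_version() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Read `nvcc --version` text on stdin, print the "X.Y" that follows "release".
nvcc_cuda_version() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Compare only the major versions of a driver/toolkit pair.
# Prints "same-major", "different-major", or "unknown" if either value is empty.
compare_cuda_versions() {
  local driver="$1" toolkit="$2"
  if [ -z "$driver" ] || [ -z "$toolkit" ]; then
    echo "unknown"
  elif [ "${driver%%.*}" = "${toolkit%%.*}" ]; then
    echo "same-major"
  else
    echo "different-major"
  fi
}

# Query the real tools only when both are installed on this machine.
if command -v nvidia-smi >/dev/null 2>&1 && command -v nvcc >/dev/null 2>&1; then
  d="$(nvidia-smi | driver_cuda_version)"
  t="$(nvcc --version | nvcc_cuda_version)"
  echo "driver supports CUDA ${d}, nvcc is CUDA ${t}: $(compare_cuda_versions "$d" "$t")"
fi
```

A `different-major` result is not necessarily an error, since newer drivers can run code built with older toolkits. In this setup, though, it usually means `PATH` still resolves `nvcc` to the older packaged toolkit instead of `/usr/local/cuda-12.2/bin`.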