Install docker and docker-compose on Ubuntu 22.04
Prerequisites:
* Have an AWS account with a user that has the necessary permissions
* Have the access key available either in environment variables or in the GitHub Actions secrets
* Have an EC2 runner instance created/running in the AWS account
* Have an S3 bucket created in the AWS account
* Have an AWS container registry (ECR) created in the AWS account (a minimal creation sketch for the bucket and registry follows this list)
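If the bucket and registry do not exist yet, a minimal sketch for creating them with the AWS CLI (assumes the CLI is already configured; the bucket name matches the one used later in this document, the repository name and region are illustrative):

```bash
# Create the S3 bucket and an ECR repository used by the pipeline (names/region are examples)
aws s3 mb s3://deep-bucket-s3 --region us-east-1
aws ecr create-repository --repository-name pytorch-template-aws --region us-east-1
```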
Local VM setup:
* Install the AWS CLI and run `aws configure` to set the access key, secret key, and the right region
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
aws configure
Install docker:
sudo apt update
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce
sudo systemctl start docker
sudo systemctl enable docker
sudo usermod -aG docker $USER
sudo systemctl restart docker
sudo reboot
docker --version
docker ps
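As an optional end-to-end check that the daemon can pull and run containers (uses Docker's public hello-world image):

```bash
# Pulls and runs the hello-world image; prints a confirmation message if Docker works
docker run --rm hello-world
```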
Install docker-compose:
sudo rm -f /usr/local/bin/docker-compose  # remove any previously installed docker-compose binary
sudo curl -L "https://github.com/docker/compose/releases/download/v2.30.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
docker-compose --version
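To sanity-check docker-compose, a minimal throwaway stack can be brought up and torn down (the nginx service and file below are only an example, not part of the project):

```bash
# Write a single-service compose file and run it briefly to confirm docker-compose works
mkdir -p /tmp/compose-test && cd /tmp/compose-test
cat > docker-compose.yml <<'EOF'
services:
  web:
    image: nginx:alpine
    ports:
      - "8080:80"
EOF
docker-compose up -d                 # start the stack in the background
curl -s localhost:8080 | head -n 5   # should print the nginx welcome page
docker-compose down                  # tear it down again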
GitHub Actions self-hosted runner:
mkdir actions-runner && cd actions-runner
curl -o actions-runner-linux-x64-2.320.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.320.0/actions-runner-linux-x64-2.320.0.tar.gz
echo "93ac1b7ce743ee85b5d386f5c1787385ef07b3d7c728ff66ce0d3813d5f46900 actions-runner-linux-x64-2.320.0.tar.gz" | shasum -a 256 -c
tar xzf ./actions-runner-linux-x64-2.320.0.tar.gz
./config.sh --url https://github.com/soutrik71/pytorch-template-aws --token <Latest>
# cd actions-runner/
./run.sh
./config.sh remove --token <> # To remove the runner
# https://github.com/soutrik71/pytorch-template-aws/settings/actions/runners/new?arch=x64&os=linux
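To keep the runner alive across reboots instead of running `./run.sh` in a foreground shell, it can be installed as a systemd service with the `svc.sh` helper that ships in the runner package (run from the actions-runner directory after `config.sh`):

```bash
# Register and start the runner as a systemd service
sudo ./svc.sh install
sudo ./svc.sh start
sudo ./svc.sh status
# sudo ./svc.sh stop && sudo ./svc.sh uninstall   # to remove the service later
```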
Install and configure the AWS CLI:
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
sudo apt install unzip
unzip awscliv2.zip
sudo ./aws/install
aws --version
aws configure
S3 bucket operations:
aws s3 cp data s3://deep-bucket-s3/data --recursive
aws s3 ls s3://deep-bucket-s3
aws s3 rm s3://deep-bucket-s3/data --recursive
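For repeated uploads, `aws s3 sync` copies only changed files; a sketch using the same bucket:

```bash
# Incremental upload/download against the bucket used above
aws s3 sync data s3://deep-bucket-s3/data
aws s3 sync s3://deep-bucket-s3/data data            # pull the data back down
aws s3 sync data s3://deep-bucket-s3/data --delete   # mirror deletions too (use with care)
```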
CUDA Update Setup:
# if you already have nvidia drivers installed and you have a Tesla T4 GPU
sudo apt update
sudo apt upgrade
sudo reboot
sudo apt --fix-broken install
sudo apt install ubuntu-drivers-common
sudo apt autoremove
nvidia-smi
lsmod | grep nvidia
sudo apt install nvidia-cuda-toolkit
nvcc --version
ls /usr/local/ | grep cuda
ldconfig -p | grep cudnn
lspci | grep -i nvidia
Based on the output of the commands above, here is a breakdown of the GPU, CUDA, and environment setup:
---
### **1. GPU Details**
- **Model**: Tesla T4
- A popular NVIDIA GPU for deep learning and AI workloads.
- It belongs to the Turing architecture (TU104GL).
- **Memory**: 16 GB
- Only **2 MiB is currently in use**, indicating minimal GPU activity.
- **Temperature**: 25°C
- The GPU is operating at a low temperature, suggesting no heavy utilization currently.
- **Power Usage**: 11W / 70W
- The GPU is in idle or low-performance mode (P8).
- **MIG Mode**: Not enabled.
- MIG (Multi-Instance GPU) mode is only available on newer data-center GPUs such as the A100, so it is not applicable to the T4.
---
### **2. Driver and CUDA Version**
- **Driver Version**: 535.216.03
- Installed NVIDIA driver supports CUDA 12.x.
- **CUDA Runtime Version**: 12.2
- This is the active runtime version compatible with the driver.
---
### **3. CUDA Toolkit Versions**
From your `nvcc` and file system checks:
- **Default `nvcc` Version**: CUDA 10.1
- The system's default `nvcc` is pointing to an older CUDA 10.1 installation (`nvcc --version` output shows CUDA 10.1).
- **Installed CUDA Toolkits**:
- `cuda-12`
- `cuda-12.2`
- `cuda` (likely symlinked to `cuda-12.2`)
Multiple CUDA versions are installed. However, the runtime and drivers align with **CUDA 12.2**, while the default compiler (`nvcc`) is still from CUDA 10.1.
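If the 12.2 compiler should take precedence over the older default `nvcc`, one option (assuming the toolkits live under `/usr/local` as listed above) is to put CUDA 12.2 first on the PATH, as is also done in the fresh-install steps further below:

```bash
# Point the current shell at the CUDA 12.2 toolkit instead of the older default nvcc
export PATH=/usr/local/cuda-12.2/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH
nvcc --version   # should now report CUDA 12.2
```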
---
### **4. cuDNN Version**
From `cudnn_version.h` and `ldconfig`:
- **cuDNN Version**: 9.5.1
- This cuDNN version is compatible with **CUDA 12.x**.
- **cuDNN Runtime**: The libraries for cuDNN 9 are present under `/lib/x86_64-linux-gnu`.
---
### **5. NVIDIA Software Packages**
From `dpkg`:
- **NVIDIA Drivers**: Driver version 535 is installed.
- **CUDA Toolkit**: Multiple versions installed (`10.1`, `12`, `12.2`).
- **cuDNN**: Versions for CUDA 12 and CUDA 12.6 are installed (`cudnn9-cuda-12`, `cudnn9-cuda-12-6`).
---
### **6. Other Observations**
- **Graphics Settings Issue**:
- `nvidia-settings` failed due to the lack of a display server connection (`Connection refused`). Likely, this is a headless server without a GUI environment.
- **OpenGL Tools Missing**:
- `glxinfo` command is missing, indicating the `mesa-utils` package needs to be installed.
---
### **Summary of Setup**
- **GPU**: Tesla T4
- **Driver Version**: 535.216.03
- **CUDA Runtime Version**: 12.2
- **CUDA Toolkit Versions**: 10.1 (default `nvcc`), 12, 12.2
- **cuDNN Version**: 9.5.1 (compatible with CUDA 12.x)
- **Software Packages**: NVIDIA drivers, CUDA, cuDNN installed
CUDA New Installation:
# if you don't have nvidia drivers installed and you have a Tesla T4 GPU
lspci | grep -i nvidia # Check if the GPU is detected
To set up the T4 GPU from scratch, starting with no drivers or CUDA tools, and replicating the above configurations and drivers, follow these reverse-engineered steps:
---
### **1. Update System**
Ensure the system is updated:
```bash
sudo apt update && sudo apt upgrade -y
sudo reboot
```
### **2. Install NVIDIA Driver**
#### a. Identify Required Driver
The T4 GPU requires a compatible NVIDIA driver version. Based on your configurations, we will install Driver 535.
#### b. Add NVIDIA Repository
Add the official NVIDIA driver repository:
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:graphics-drivers/ppa
sudo apt update
#### c. Install Driver
Install the driver for the T4 GPU:
sudo apt install -y nvidia-driver-535
#### d. Verify Driver Installation
Reboot the system and check the driver:
sudo reboot
nvidia-smi
This should display the GPU model and driver version.
### **3. Install CUDA Toolkit**
#### a. Add CUDA Repository
Download and install the CUDA 12.2 repository for Ubuntu 20.04:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda-repo-ubuntu2004-12-2-local_12.2.0-535.86.10-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-12-2-local_12.2.0-535.86.10-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2004-12-2-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt update
#### b. Install CUDA Toolkit
Install CUDA 12.2:
sudo apt install -y cuda
#### c. Set Up Environment Variables
Add CUDA binaries to the PATH and library paths:
echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
#### d. Verify CUDA Installation
Check CUDA installation:
nvcc --version
nvidia-smi
### **4. Install cuDNN**
#### a. Download cuDNN
Download cuDNN 9.5.1 (compatible with CUDA 12.x) from the NVIDIA cuDNN page. You'll need to log in and download the appropriate `.deb` files for Ubuntu 20.04.
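Alternatively, if the NVIDIA apt repository from step 3 is already configured, the packaged build seen on the reference machine (`cudnn9-cuda-12`, listed in section 5 above) can usually be installed directly; a sketch, assuming that repository provides the package:

```bash
# Install cuDNN 9 for CUDA 12 from the NVIDIA apt repository (skips the manual .deb download)
sudo apt update
sudo apt install -y cudnn9-cuda-12
```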
#### b. Install cuDNN
Install the downloaded `.deb` files:
sudo dpkg -i libcudnn9*.deb
#### c. Verify cuDNN
Check the installed version:
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
### **5. Install NCCL and Other Libraries**
Install additional NVIDIA libraries (like NCCL) required for distributed deep learning:
sudo apt install -y libnccl2 libnccl-dev
### **6. Install PyTorch**
#### a. Install Python Environment
Install Python and pip if not already present:
sudo apt install -y python3 python3-pip
#### b. Install PyTorch with CUDA support
Install PyTorch built against CUDA 12.x (PyTorch publishes wheels for CUDA 12.1, which run fine under the 12.2 driver; there is no cu122 wheel index):
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
#### c. Test PyTorch
Run a quick test:
import torch
print(torch.cuda.is_available()) # Should return True
print(torch.cuda.get_device_name(0)) # Should return "Tesla T4"
### **7. Optional: Install Nsight Tools**
For debugging and profiling:
sudo apt install -y nsight-compute nsight-systems
### **8. Check for OpenGL**
If you need OpenGL utilities (like `glxinfo`):
sudo apt install -y mesa-utils
glxinfo | grep "OpenGL version"
### **9. Validate Entire Setup**
Run the NVIDIA deviceQuery sample to confirm the configuration. Recent CUDA toolkits no longer ship the samples under `/usr/local/cuda`; they are maintained in the NVIDIA/cuda-samples repository on GitHub:
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/1_Utilities/deviceQuery
make
./deviceQuery
If successful, it should show details of the T4 GPU.
### **Summary of Installed Components**
- **GPU**: Tesla T4
- **Driver**: 535
- **CUDA Toolkit**: 12.2
- **cuDNN**: 9.5.1
- **PyTorch**: Installed with CUDA 12.x support
This setup ensures your system is ready for deep learning workloads with the T4 GPU.
* Install conda and create a new environment for the project
* Install PyTorch and torchvision in the new environment (a minimal sketch follows this list)
* Install other dependencies like numpy, pandas, matplotlib, etc.
* Run the project code in the new environment
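A minimal sketch of that flow, assuming Miniconda and a CUDA 12.x PyTorch wheel (the environment name and Python version are illustrative):

```bash
# Install Miniconda non-interactively and create a project environment
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
bash ~/miniconda.sh -b -p $HOME/miniconda3
source $HOME/miniconda3/bin/activate
conda create -y -n pytorch-template python=3.10
conda activate pytorch-template

# PyTorch with a CUDA 12.x build, plus common dependencies
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install numpy pandas matplotlib
```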
Verify CUDA from PyTorch inside the new environment:
import torch
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))
print(torch.version.cuda)
__CUDA Docker Setup__:
```bash
# If you are using docker and want to run a container with CUDA support
sudo apt install -y nvidia-container-toolkit
nvidia-ctk --version
sudo nvidia-ctk runtime configure --runtime=docker  # register the NVIDIA runtime with Docker
sudo systemctl restart docker
sudo systemctl status docker
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.2.0-devel-ubuntu20.04 nvcc --version  # nvcc is only in the -devel image, not -base
```
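docker-compose can also request the GPU via the compose `deploy.resources.reservations.devices` syntax; a minimal sketch (the service name, image, and file name are illustrative):

```bash
# Write a minimal compose file whose service reserves all GPUs, then verify with nvidia-smi
cat > docker-compose.gpu.yml <<'EOF'
services:
  gpu-test:
    image: nvidia/cuda:12.2.0-base-ubuntu20.04
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
EOF
docker-compose -f docker-compose.gpu.yml up
```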