Spaces:

soutrik
/

gradio_demo_CatDogClassifier

Runtime error

File size: 10,849 Bytes

**Install docker and docker-compose on Ubuntu 22.04**
__PreRequisites__:

    * Have an aws account with a user that has the necessary permissions
    * Have the access key either on env variables or in the github actions secrets
    * Have an ec2 runner instance running/created in the aws account
    * Have a s3 bucket created in the aws account
    * Have aws container registry created in the aws account 
__Local VM setup__:
    * Install aws configure and setup the access key and secret key and the right zone
        ```bash
        curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
        unzip awscliv2.zip
        sudo ./aws/install   
        aws configure
        ```
    

__Install docker__:
```bash
sudo apt update
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce
sudo systemctl start docker
sudo systemctl enable docker
sudo usermod -aG docker $USER
sudo systemctl restart docker
sudo reboot
docker --version
docker ps
```
__Install docker-compose__:
```bash
sudo rm /usr/local/bin/docker-compose
sudo curl -L "https://github.com/docker/compose/releases/download/v2.30.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
docker-compose --version
```

__Github actions self-hosted runner__:
```bash
mkdir actions-runner && cd actions-runner
curl -o actions-runner-linux-x64-2.320.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.320.0/actions-runner-linux-x64-2.320.0.tar.gz
echo "93ac1b7ce743ee85b5d386f5c1787385ef07b3d7c728ff66ce0d3813d5f46900  actions-runner-linux-x64-2.320.0.tar.gz" | shasum -a 256 -c
tar xzf ./actions-runner-linux-x64-2.320.0.tar.gz
./config.sh --url https://github.com/soutrik71/pytorch-template-aws --token <Latest>
# cd actions-runner/
./run.sh
./config.sh remove --token <> # To remove the runner
# https://github.com/soutrik71/pytorch-template-aws/settings/actions/runners/new?arch=x64&os=linux
```
__Activate aws cli__:
```bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
sudo apt install unzip
unzip awscliv2.zip
sudo ./aws/install
aws --version
aws configure

```
__S3 bucket operations__:
```bash
aws s3 cp data s3://deep-bucket-s3/data --recursive
aws s3 ls s3://deep-bucket-s3
aws s3 rm s3://deep-bucket-s3/data --recursive
```

__Cuda Update Setup__:
```bash
# if you already have nvidia drivers installed and you have a Tesla T4 GPU
sudo apt update
sudo apt upgrade
sudo reboot

sudo apt --fix-broken install
sudo apt install ubuntu-drivers-common
sudo apt autoremove

nvidia-smi
lsmod | grep nvidia

sudo apt install nvidia-cuda-toolkit
nvcc --version

ls /usr/local/ | grep cuda
ldconfig -p | grep cudnn
lspci | grep -i nvidia

Based on the provided details, here is the breakdown of the information about your GPU, CUDA, and environment setup:

---

### **1. GPU Details**
- **Model**: Tesla T4  
  - A popular NVIDIA GPU for deep learning and AI workloads.  
  - It belongs to the Turing architecture (TU104GL).  

- **Memory**: 16 GB  
  - Only **2 MiB is currently in use**, indicating minimal GPU activity.

- **Temperature**: 25°C  
  - The GPU is operating at a low temperature, suggesting no heavy utilization currently.

- **Power Usage**: 11W / 70W  
  - The GPU is in idle or low-performance mode (P8).

- **MIG Mode**: Not enabled.  
  - MIG (Multi-Instance GPU) mode is specific to NVIDIA A100 and other GPUs, so it is not applicable here.

---

### **2. Driver and CUDA Version**
- **Driver Version**: 535.216.03  
  - Installed NVIDIA driver supports CUDA 12.x.

- **CUDA Runtime Version**: 12.2  
  - This is the active runtime version compatible with the driver.

---

### **3. CUDA Toolkit Versions**
From your `nvcc` and file system checks:
- **Default `nvcc` Version**: CUDA 10.1  
  - The system's default `nvcc` is pointing to an older CUDA 10.1 installation (`nvcc --version` output shows CUDA 10.1).  

- **Installed CUDA Toolkits**:
  - `cuda-12`
  - `cuda-12.2`
  - `cuda` (likely symlinked to `cuda-12.2`)
  
  Multiple CUDA versions are installed. However, the runtime and drivers align with **CUDA 12.2**, while the default compiler (`nvcc`) is still from CUDA 10.1.

---

### **4. cuDNN Version**
From `cudnn_version.h` and `ldconfig`:
- **cuDNN Version**: 9.5.1  
  - This cuDNN version is compatible with **CUDA 12.x**.
- **cuDNN Runtime**: The libraries for cuDNN 9 are present under `/lib/x86_64-linux-gnu`.

---

### **5. NVIDIA Software Packages**
From `dpkg`:
- **NVIDIA Drivers**: Driver version 535 is installed.
- **CUDA Toolkit**: Multiple versions installed (`10.1`, `12`, `12.2`).
- **cuDNN**: Versions for CUDA 12 and CUDA 12.6 are installed (`cudnn9-cuda-12`, `cudnn9-cuda-12-6`).

---

### **6. Other Observations**
- **Graphics Settings Issue**: 
  - `nvidia-settings` failed due to the lack of a display server connection (`Connection refused`). Likely, this is a headless server without a GUI environment.
  
- **OpenGL Tools Missing**: 
  - `glxinfo` command is missing, indicating the `mesa-utils` package needs to be installed.

---

### **Summary of Setup**
- **GPU**: Tesla T4  
- **Driver Version**: 535.216.03  
- **CUDA Runtime Version**: 12.2  
- **CUDA Toolkit Versions**: 10.1 (default `nvcc`), 12, 12.2  
- **cuDNN Version**: 9.5.1 (compatible with CUDA 12.x)  
- **Software Packages**: NVIDIA drivers, CUDA, cuDNN installed
```

__CUDA New Installation__:
```bash
# if you don't have nvidia drivers installed and you have a Tesla T4 GPU 
lspci | grep -i nvidia # Check if the GPU is detected
To set up the T4 GPU from scratch, starting with no drivers or CUDA tools, and replicating the above configurations and drivers, follow these reverse-engineered steps:

---

### **1. Update System**
Ensure the system is updated:
```bash
sudo apt update && sudo apt upgrade -y
sudo reboot
```

---

### **2. Install NVIDIA Driver**
#### **a. Identify Required Driver**
The T4 GPU requires a compatible NVIDIA driver version. Based on your configurations, we will install **Driver 535**.

#### **b. Add NVIDIA Repository**
Add the official NVIDIA driver repository:
```bash
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:graphics-drivers/ppa
sudo apt update
```

#### **c. Install Driver**
Install the driver for the T4 GPU:
```bash
sudo apt install -y nvidia-driver-535
```

#### **d. Verify Driver Installation**
Reboot the system and check the driver:
```bash
sudo reboot
nvidia-smi
```
This should display the GPU model and driver version.

---

### **3. Install CUDA Toolkit**
#### **a. Add CUDA Repository**
Download and install the CUDA 12.2 repository for Ubuntu 20.04:
```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda-repo-ubuntu2004-12-2-local_12.2.0-535.86.10-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-12-2-local_12.2.0-535.86.10-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2004-12-2-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt update
```

#### **b. Install CUDA Toolkit**
Install CUDA 12.2:
```bash
sudo apt install -y cuda
```

#### **c. Set Up Environment Variables**
Add CUDA binaries to the PATH and library paths:
```bash
echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
```

#### **d. Verify CUDA Installation**
Check CUDA installation:
```bash
nvcc --version
nvidia-smi
```

---

### **4. Install cuDNN**
#### **a. Download cuDNN**
Download cuDNN 9.5.1 (compatible with CUDA 12.x) from the [NVIDIA cuDNN page](https://developer.nvidia.com/cudnn). You’ll need to log in and download the appropriate `.deb` files for Ubuntu 20.04.

#### **b. Install cuDNN**
Install the downloaded `.deb` files:
```bash
sudo dpkg -i libcudnn9*.deb
```

#### **c. Verify cuDNN**
Check the installed version:
```bash
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
```

---

### **5. Install NCCL and Other Libraries**
Install additional NVIDIA libraries (like NCCL) required for distributed deep learning:
```bash
sudo apt install -y libnccl2 libnccl-dev
```

---

### **6. Install PyTorch**
#### **a. Install Python Environment**
Install Python and `pip` if not already present:
```bash
sudo apt install -y python3 python3-pip
```

#### **b. Install PyTorch with CUDA 12.2**
Install PyTorch with the appropriate CUDA runtime:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu122
```

#### **c. Test PyTorch**
Run a quick test:
```python
import torch
print(torch.cuda.is_available())  # Should return True
print(torch.cuda.get_device_name(0))  # Should return "Tesla T4"
```

---

### **7. Optional: Install Nsight Tools**
For debugging and profiling:
```bash
sudo apt install -y nsight-compute nsight-systems
```

---

### **8. Check for OpenGL**
If you need OpenGL utilities (like `glxinfo`):
```bash
sudo apt install -y mesa-utils
glxinfo | grep "OpenGL version"
```

---

### **9. Validate Entire Setup**
Run the NVIDIA sample tests to confirm the configuration:
```bash
cd /usr/local/cuda-12.2/samples/1_Utilities/deviceQuery
make
./deviceQuery
```
If successful, it should show details of the T4 GPU.

---

### **Summary of Installed Components**
- **GPU**: Tesla T4
- **Driver**: 535
- **CUDA Toolkit**: 12.2
- **cuDNN**: 9.5.1
- **PyTorch**: Installed with CUDA 12.2 support

This setup ensures your system is ready for deep learning workloads with the T4 GPU.

Install conda and create a new environment for the project
Install pytorch and torchvision in the new environment
Install other dependencies like numpy, pandas, matplotlib, etc.
Run the project code in the new environment
>>> import torch
>>> print(torch.cuda.is_available())
>>> print(torch.cuda.get_device_name(0))
>>> print(torch.version.cuda)
```
__CUDA Docker Setup__:
```bash
# If you are using docker and want to run a container with CUDA support
sudo apt install -y nvidia-container-toolkit
nvidia-ctk --version
sudo systemctl restart docker
sudo systemctl status docker
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvcc --version
```