Spaces:
Runtime error
Runtime error
File size: 10,849 Bytes
3cdf230 3749e6c 3cdf230 3749e6c 3cdf230 3580800 3cdf230 1567755 3cdf230 3749e6c 3cdf230 1567755 3cdf230 3580800 3749e6c 3580800 1567755 ed5638a 3580800 f25067e 1567755 3c9c835 3749e6c 3c9c835 1567755 3c9c835 ff35886 3749e6c ff35886 a31745a 035df3d |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 |
**Install docker and docker-compose on Ubuntu 22.04**
__PreRequisites__:
* Have an aws account with a user that has the necessary permissions
* Have the access key either on env variables or in the github actions secrets
* Have an ec2 runner instance running/created in the aws account
* Have a s3 bucket created in the aws account
* Have aws container registry created in the aws account
__Local VM setup__:
* Install aws configure and setup the access key and secret key and the right zone
```bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
aws configure
```
__Install docker__:
```bash
sudo apt update
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce
sudo systemctl start docker
sudo systemctl enable docker
sudo usermod -aG docker $USER
sudo systemctl restart docker
sudo reboot
docker --version
docker ps
```
__Install docker-compose__:
```bash
sudo rm /usr/local/bin/docker-compose
sudo curl -L "https://github.com/docker/compose/releases/download/v2.30.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
docker-compose --version
```
__Github actions self-hosted runner__:
```bash
mkdir actions-runner && cd actions-runner
curl -o actions-runner-linux-x64-2.320.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.320.0/actions-runner-linux-x64-2.320.0.tar.gz
echo "93ac1b7ce743ee85b5d386f5c1787385ef07b3d7c728ff66ce0d3813d5f46900 actions-runner-linux-x64-2.320.0.tar.gz" | shasum -a 256 -c
tar xzf ./actions-runner-linux-x64-2.320.0.tar.gz
./config.sh --url https://github.com/soutrik71/pytorch-template-aws --token <Latest>
# cd actions-runner/
./run.sh
./config.sh remove --token <> # To remove the runner
# https://github.com/soutrik71/pytorch-template-aws/settings/actions/runners/new?arch=x64&os=linux
```
__Activate aws cli__:
```bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
sudo apt install unzip
unzip awscliv2.zip
sudo ./aws/install
aws --version
aws configure
```
__S3 bucket operations__:
```bash
aws s3 cp data s3://deep-bucket-s3/data --recursive
aws s3 ls s3://deep-bucket-s3
aws s3 rm s3://deep-bucket-s3/data --recursive
```
__Cuda Update Setup__:
```bash
# if you already have nvidia drivers installed and you have a Tesla T4 GPU
sudo apt update
sudo apt upgrade
sudo reboot
sudo apt --fix-broken install
sudo apt install ubuntu-drivers-common
sudo apt autoremove
nvidia-smi
lsmod | grep nvidia
sudo apt install nvidia-cuda-toolkit
nvcc --version
ls /usr/local/ | grep cuda
ldconfig -p | grep cudnn
lspci | grep -i nvidia
Based on the provided details, here is the breakdown of the information about your GPU, CUDA, and environment setup:
---
### **1. GPU Details**
- **Model**: Tesla T4
- A popular NVIDIA GPU for deep learning and AI workloads.
- It belongs to the Turing architecture (TU104GL).
- **Memory**: 16 GB
- Only **2 MiB is currently in use**, indicating minimal GPU activity.
- **Temperature**: 25°C
- The GPU is operating at a low temperature, suggesting no heavy utilization currently.
- **Power Usage**: 11W / 70W
- The GPU is in idle or low-performance mode (P8).
- **MIG Mode**: Not enabled.
- MIG (Multi-Instance GPU) mode is specific to NVIDIA A100 and other GPUs, so it is not applicable here.
---
### **2. Driver and CUDA Version**
- **Driver Version**: 535.216.03
- Installed NVIDIA driver supports CUDA 12.x.
- **CUDA Runtime Version**: 12.2
- This is the active runtime version compatible with the driver.
---
### **3. CUDA Toolkit Versions**
From your `nvcc` and file system checks:
- **Default `nvcc` Version**: CUDA 10.1
- The system's default `nvcc` is pointing to an older CUDA 10.1 installation (`nvcc --version` output shows CUDA 10.1).
- **Installed CUDA Toolkits**:
- `cuda-12`
- `cuda-12.2`
- `cuda` (likely symlinked to `cuda-12.2`)
Multiple CUDA versions are installed. However, the runtime and drivers align with **CUDA 12.2**, while the default compiler (`nvcc`) is still from CUDA 10.1.
---
### **4. cuDNN Version**
From `cudnn_version.h` and `ldconfig`:
- **cuDNN Version**: 9.5.1
- This cuDNN version is compatible with **CUDA 12.x**.
- **cuDNN Runtime**: The libraries for cuDNN 9 are present under `/lib/x86_64-linux-gnu`.
---
### **5. NVIDIA Software Packages**
From `dpkg`:
- **NVIDIA Drivers**: Driver version 535 is installed.
- **CUDA Toolkit**: Multiple versions installed (`10.1`, `12`, `12.2`).
- **cuDNN**: Versions for CUDA 12 and CUDA 12.6 are installed (`cudnn9-cuda-12`, `cudnn9-cuda-12-6`).
---
### **6. Other Observations**
- **Graphics Settings Issue**:
- `nvidia-settings` failed due to the lack of a display server connection (`Connection refused`). Likely, this is a headless server without a GUI environment.
- **OpenGL Tools Missing**:
- `glxinfo` command is missing, indicating the `mesa-utils` package needs to be installed.
---
### **Summary of Setup**
- **GPU**: Tesla T4
- **Driver Version**: 535.216.03
- **CUDA Runtime Version**: 12.2
- **CUDA Toolkit Versions**: 10.1 (default `nvcc`), 12, 12.2
- **cuDNN Version**: 9.5.1 (compatible with CUDA 12.x)
- **Software Packages**: NVIDIA drivers, CUDA, cuDNN installed
```
__CUDA New Installation__:
```bash
# if you don't have nvidia drivers installed and you have a Tesla T4 GPU
lspci | grep -i nvidia # Check if the GPU is detected
To set up the T4 GPU from scratch, starting with no drivers or CUDA tools, and replicating the above configurations and drivers, follow these reverse-engineered steps:
---
### **1. Update System**
Ensure the system is updated:
```bash
sudo apt update && sudo apt upgrade -y
sudo reboot
```
---
### **2. Install NVIDIA Driver**
#### **a. Identify Required Driver**
The T4 GPU requires a compatible NVIDIA driver version. Based on your configurations, we will install **Driver 535**.
#### **b. Add NVIDIA Repository**
Add the official NVIDIA driver repository:
```bash
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:graphics-drivers/ppa
sudo apt update
```
#### **c. Install Driver**
Install the driver for the T4 GPU:
```bash
sudo apt install -y nvidia-driver-535
```
#### **d. Verify Driver Installation**
Reboot the system and check the driver:
```bash
sudo reboot
nvidia-smi
```
This should display the GPU model and driver version.
---
### **3. Install CUDA Toolkit**
#### **a. Add CUDA Repository**
Download and install the CUDA 12.2 repository for Ubuntu 20.04:
```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda-repo-ubuntu2004-12-2-local_12.2.0-535.86.10-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-12-2-local_12.2.0-535.86.10-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2004-12-2-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt update
```
#### **b. Install CUDA Toolkit**
Install CUDA 12.2:
```bash
sudo apt install -y cuda
```
#### **c. Set Up Environment Variables**
Add CUDA binaries to the PATH and library paths:
```bash
echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
```
#### **d. Verify CUDA Installation**
Check CUDA installation:
```bash
nvcc --version
nvidia-smi
```
---
### **4. Install cuDNN**
#### **a. Download cuDNN**
Download cuDNN 9.5.1 (compatible with CUDA 12.x) from the [NVIDIA cuDNN page](https://developer.nvidia.com/cudnn). You’ll need to log in and download the appropriate `.deb` files for Ubuntu 20.04.
#### **b. Install cuDNN**
Install the downloaded `.deb` files:
```bash
sudo dpkg -i libcudnn9*.deb
```
#### **c. Verify cuDNN**
Check the installed version:
```bash
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
```
---
### **5. Install NCCL and Other Libraries**
Install additional NVIDIA libraries (like NCCL) required for distributed deep learning:
```bash
sudo apt install -y libnccl2 libnccl-dev
```
---
### **6. Install PyTorch**
#### **a. Install Python Environment**
Install Python and `pip` if not already present:
```bash
sudo apt install -y python3 python3-pip
```
#### **b. Install PyTorch with CUDA 12.2**
Install PyTorch with the appropriate CUDA runtime:
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu122
```
#### **c. Test PyTorch**
Run a quick test:
```python
import torch
print(torch.cuda.is_available()) # Should return True
print(torch.cuda.get_device_name(0)) # Should return "Tesla T4"
```
---
### **7. Optional: Install Nsight Tools**
For debugging and profiling:
```bash
sudo apt install -y nsight-compute nsight-systems
```
---
### **8. Check for OpenGL**
If you need OpenGL utilities (like `glxinfo`):
```bash
sudo apt install -y mesa-utils
glxinfo | grep "OpenGL version"
```
---
### **9. Validate Entire Setup**
Run the NVIDIA sample tests to confirm the configuration:
```bash
cd /usr/local/cuda-12.2/samples/1_Utilities/deviceQuery
make
./deviceQuery
```
If successful, it should show details of the T4 GPU.
---
### **Summary of Installed Components**
- **GPU**: Tesla T4
- **Driver**: 535
- **CUDA Toolkit**: 12.2
- **cuDNN**: 9.5.1
- **PyTorch**: Installed with CUDA 12.2 support
This setup ensures your system is ready for deep learning workloads with the T4 GPU.
Install conda and create a new environment for the project
Install pytorch and torchvision in the new environment
Install other dependencies like numpy, pandas, matplotlib, etc.
Run the project code in the new environment
>>> import torch
>>> print(torch.cuda.is_available())
>>> print(torch.cuda.get_device_name(0))
>>> print(torch.version.cuda)
```
__CUDA Docker Setup__:
```bash
# If you are using docker and want to run a container with CUDA support
sudo apt install -y nvidia-container-toolkit
nvidia-ctk --version
sudo systemctl restart docker
sudo systemctl status docker
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvcc --version
```
|