**Install Docker and docker-compose on Ubuntu 22.04**
__Prerequisites__:

    * Have an AWS account with a user that has the necessary permissions
    * Have the access key either in environment variables or in the GitHub Actions secrets (see the sketch after this list)
    * Have an EC2 runner instance created and running in the AWS account
    * Have an S3 bucket created in the AWS account
    * Have an AWS container registry (ECR) created in the AWS account
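
The access key can be provided to the AWS CLI as standard environment variables on the VM, or stored as repository secrets (for example `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`) for GitHub Actions. A minimal sketch with placeholder values:

```bash
# Standard AWS CLI environment variables (placeholder values)
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export AWS_DEFAULT_REGION="<your-region>"   # region of the S3 bucket / ECR registry
```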
__Local VM setup__:
    * Install the AWS CLI and configure the access key, secret key, and the correct region (a non-interactive variant is sketched after the block below)
        ```bash
        curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
        unzip awscliv2.zip
        sudo ./aws/install   
        aws configure
        ```
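
    A non-interactive alternative to the `aws configure` prompt, assuming the keys are already exported as environment variables as in the prerequisites sketch:
        ```bash
        # Script the same settings instead of answering the interactive prompt
        aws configure set aws_access_key_id "$AWS_ACCESS_KEY_ID"
        aws configure set aws_secret_access_key "$AWS_SECRET_ACCESS_KEY"
        aws configure set default.region "<your-region>"

        # Sanity check: confirm the CLI can reach AWS with these credentials
        aws sts get-caller-identity
        ```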
    

__Install docker__:
```bash
sudo apt update
sudo apt install -y apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt update
sudo apt install -y docker-ce
sudo systemctl start docker
sudo systemctl enable docker
sudo usermod -aG docker $USER   # allow running docker without sudo (takes effect after re-login/reboot)
sudo systemctl restart docker
sudo reboot
# After the reboot, verify the installation:
docker --version
docker ps
```
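
After the reboot, it is worth confirming that Docker runs without `sudo` for the current user. A quick smoke test (the `hello-world` image is pulled from Docker Hub):

```bash
# Should work without sudo once the group change has taken effect
docker run --rm hello-world
docker info | grep -i "server version"
```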
__Install docker-compose__:
```bash
sudo rm -f /usr/local/bin/docker-compose   # remove any previously installed binary
sudo curl -L "https://github.com/docker/compose/releases/download/v2.30.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
docker-compose --version
```
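
As a quick check that Compose can run services, here is a minimal, throwaway `docker-compose.yml`; the `nginx` image and port mapping are placeholders, not part of this project:

```bash
# Write a disposable compose file and bring it up (illustrative only)
cat > /tmp/docker-compose.yml <<'EOF'
services:
  web:
    image: nginx:alpine
    ports:
      - "8080:80"
EOF

docker-compose -f /tmp/docker-compose.yml up -d
curl -s http://localhost:8080 >/dev/null && echo "compose service is up"
docker-compose -f /tmp/docker-compose.yml down
```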

__Github actions self-hosted runner__:
```bash
mkdir actions-runner && cd actions-runner
curl -o actions-runner-linux-x64-2.320.0.tar.gz -L https://github.com/actions/runner/releases/download/v2.320.0/actions-runner-linux-x64-2.320.0.tar.gz
echo "93ac1b7ce743ee85b5d386f5c1787385ef07b3d7c728ff66ce0d3813d5f46900  actions-runner-linux-x64-2.320.0.tar.gz" | shasum -a 256 -c
tar xzf ./actions-runner-linux-x64-2.320.0.tar.gz
./config.sh --url https://github.com/soutrik71/pytorch-template-aws --token <Latest>   # use the latest registration token
# cd actions-runner/
./run.sh                        # start listening for jobs
./config.sh remove --token <>   # to remove the runner
# Registration tokens are generated at:
# https://github.com/soutrik71/pytorch-template-aws/settings/actions/runners/new?arch=x64&os=linux
```
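
To keep the runner alive across reboots, it can also be installed as a systemd service with the `svc.sh` helper that ships in the runner package (run from the `actions-runner` directory after `config.sh` has completed):

```bash
# Install, start, and check the runner as a systemd service
sudo ./svc.sh install
sudo ./svc.sh start
sudo ./svc.sh status
```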
__Activate AWS CLI__:
```bash
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
sudo apt install unzip
unzip awscliv2.zip
sudo ./aws/install
aws --version
aws configure

```
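
Since the prerequisites include an AWS container registry, the same credentials can be used to log Docker into ECR so the runner can push and pull images. A sketch with a placeholder account ID and region:

```bash
# Log Docker into the ECR registry (account ID and region are placeholders)
AWS_ACCOUNT_ID="<your-account-id>"
AWS_REGION="<your-region>"
aws ecr get-login-password --region "$AWS_REGION" | \
  docker login --username AWS --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com"
```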
__S3 bucket operations__:
```bash
aws s3 cp data s3://deep-bucket-s3/data --recursive
aws s3 ls s3://deep-bucket-s3
aws s3 rm s3://deep-bucket-s3/data --recursive
```
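
For incremental uploads, `aws s3 sync` copies only new or changed files, which is usually faster than re-running `cp --recursive` (same bucket as above):

```bash
# Upload only new/changed files; --delete also removes remote files that no longer exist locally
aws s3 sync data s3://deep-bucket-s3/data --delete
```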

__CUDA Update Setup__:
```bash
# if you already have nvidia drivers installed and you have a Tesla T4 GPU
sudo apt update
sudo apt upgrade
sudo reboot

sudo apt --fix-broken install
sudo apt install ubuntu-drivers-common
sudo apt autoremove

nvidia-smi
lsmod | grep nvidia

sudo apt install nvidia-cuda-toolkit
nvcc --version

ls /usr/local/ | grep cuda
ldconfig -p | grep cudnn
lspci | grep -i nvidia
```

Based on the output of the commands above, here is a breakdown of the GPU, CUDA, and environment setup:

---

### **1. GPU Details**
- **Model**: Tesla T4  
  - A popular NVIDIA GPU for deep learning and AI workloads.  
  - It belongs to the Turing architecture (TU104GL).  

- **Memory**: 16 GB  
  - Only **2 MiB is currently in use**, indicating minimal GPU activity.

- **Temperature**: 25°C  
  - The GPU is operating at a low temperature, suggesting no heavy utilization currently.

- **Power Usage**: 11W / 70W  
  - The GPU is in idle or low-performance mode (P8).

- **MIG Mode**: Not enabled.  
  - MIG (Multi-Instance GPU) mode is only supported on data-center GPUs such as the A100 and H100, so it is not applicable to the T4.

---

### **2. Driver and CUDA Version**
- **Driver Version**: 535.216.03  
  - Installed NVIDIA driver supports CUDA 12.x.

- **CUDA Runtime Version**: 12.2  
  - This is the active runtime version compatible with the driver.

---

### **3. CUDA Toolkit Versions**
From your `nvcc` and file system checks:
- **Default `nvcc` Version**: CUDA 10.1  
  - The system's default `nvcc` is pointing to an older CUDA 10.1 installation (`nvcc --version` output shows CUDA 10.1).  

- **Installed CUDA Toolkits**:
  - `cuda-12`
  - `cuda-12.2`
  - `cuda` (likely symlinked to `cuda-12.2`)
  
  Multiple CUDA versions are installed. However, the runtime and drivers align with **CUDA 12.2**, while the default compiler (`nvcc`) is still from CUDA 10.1.
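
To make the default compiler match the 12.2 runtime, the simplest fix is to put the newer toolkit first on the `PATH` (the same environment-variable setup used in the fresh-install section below), assuming the toolkit lives under `/usr/local/cuda-12.2`:

```bash
# Point the shell at CUDA 12.2 so `nvcc --version` matches the driver/runtime
echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
nvcc --version   # should now report release 12.2
```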

---

### **4. cuDNN Version**
From `cudnn_version.h` and `ldconfig`:
- **cuDNN Version**: 9.5.1  
  - This cuDNN version is compatible with **CUDA 12.x**.
- **cuDNN Runtime**: The libraries for cuDNN 9 are present under `/lib/x86_64-linux-gnu`.

---

### **5. NVIDIA Software Packages**
From `dpkg`:
- **NVIDIA Drivers**: Driver version 535 is installed.
- **CUDA Toolkit**: Multiple versions installed (`10.1`, `12`, `12.2`).
- **cuDNN**: Versions for CUDA 12 and CUDA 12.6 are installed (`cudnn9-cuda-12`, `cudnn9-cuda-12-6`).

---

### **6. Other Observations**
- **Graphics Settings Issue**: 
  - `nvidia-settings` failed due to the lack of a display server connection (`Connection refused`). Likely, this is a headless server without a GUI environment.
  
- **OpenGL Tools Missing**: 
  - `glxinfo` command is missing, indicating the `mesa-utils` package needs to be installed.

---

### **Summary of Setup**
- **GPU**: Tesla T4  
- **Driver Version**: 535.216.03  
- **CUDA Runtime Version**: 12.2  
- **CUDA Toolkit Versions**: 10.1 (default `nvcc`), 12, 12.2  
- **cuDNN Version**: 9.5.1 (compatible with CUDA 12.x)  
- **Software Packages**: NVIDIA drivers, CUDA, cuDNN installed

__CUDA New Installation__:
```bash
# If you don't have NVIDIA drivers installed and you have a Tesla T4 GPU
lspci | grep -i nvidia # Check if the GPU is detected
```

To set up the T4 GPU from scratch, starting with no drivers or CUDA tools, and replicating the above configuration and drivers, follow these steps:

---

### **1. Update System**
Ensure the system is updated:
```bash
sudo apt update && sudo apt upgrade -y
sudo reboot
```

---

### **2. Install NVIDIA Driver**
#### **a. Identify Required Driver**
The T4 GPU requires a compatible NVIDIA driver version. Based on your configurations, we will install **Driver 535**.

#### **b. Add NVIDIA Repository**
Add the official NVIDIA driver repository:
```bash
sudo apt install -y software-properties-common
sudo add-apt-repository -y ppa:graphics-drivers/ppa
sudo apt update
```

#### **c. Install Driver**
Install the driver for the T4 GPU:
```bash
sudo apt install -y nvidia-driver-535
```

#### **d. Verify Driver Installation**
Reboot the system and check the driver:
```bash
sudo reboot
nvidia-smi
```
This should display the GPU model and driver version.

---

### **3. Install CUDA Toolkit**
#### **a. Add CUDA Repository**
Download and install the CUDA 12.2 local repository (the commands below use NVIDIA's Ubuntu 20.04 packages; on Ubuntu 22.04, use the corresponding `ubuntu2204` repository files from NVIDIA's download page):
```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda-repo-ubuntu2004-12-2-local_12.2.0-535.86.10-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2004-12-2-local_12.2.0-535.86.10-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2004-12-2-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt update
```

#### **b. Install CUDA Toolkit**
Install CUDA 12.2:
```bash
sudo apt install -y cuda
```

#### **c. Set Up Environment Variables**
Add CUDA binaries to the PATH and library paths:
```bash
echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
```

#### **d. Verify CUDA Installation**
Check CUDA installation:
```bash
nvcc --version
nvidia-smi
```

---

### **4. Install cuDNN**
#### **a. Download cuDNN**
Download cuDNN 9.5.1 (compatible with CUDA 12.x) from the [NVIDIA cuDNN page](https://developer.nvidia.com/cudnn). You’ll need to log in and download the appropriate `.deb` files for your Ubuntu release.

#### **b. Install cuDNN**
Install the downloaded `.deb` files:
```bash
sudo dpkg -i libcudnn9*.deb
```

#### **c. Verify cuDNN**
Check the installed version:
```bash
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
```

---

### **5. Install NCCL and Other Libraries**
Install additional NVIDIA libraries (like NCCL) required for distributed deep learning:
```bash
sudo apt install -y libnccl2 libnccl-dev
```

---

### **6. Install PyTorch**
#### **a. Install Python Environment**
Install Python and `pip` if not already present:
```bash
sudo apt install -y python3 python3-pip
```

#### **b. Install PyTorch with CUDA 12.x Support**
Install PyTorch wheels built against a CUDA 12.x runtime (the PyTorch wheel index publishes `cu121` builds, which run on a CUDA 12.2 driver):
```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

#### **c. Test PyTorch**
Run a quick test:
```python
import torch
print(torch.cuda.is_available())  # Should return True
print(torch.cuda.get_device_name(0))  # Should return "Tesla T4"
```
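
The same check can be run non-interactively from the shell, assuming `python3` resolves to the environment where PyTorch was installed:

```bash
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```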

---

### **7. Optional: Install Nsight Tools**
For debugging and profiling:
```bash
sudo apt install -y nsight-compute nsight-systems
```

---

### **8. Check for OpenGL**
If you need OpenGL utilities (like `glxinfo`):
```bash
sudo apt install -y mesa-utils
glxinfo | grep "OpenGL version"
```

---

### **9. Validate Entire Setup**
Run the NVIDIA sample tests to confirm the configuration (note that recent CUDA toolkits no longer ship the samples under `/usr/local/cuda`; if the directory below is missing, build `deviceQuery` from NVIDIA's `cuda-samples` GitHub repository):
```bash
cd /usr/local/cuda-12.2/samples/1_Utilities/deviceQuery
make
./deviceQuery
```
If successful, it should show details of the T4 GPU.

---

### **Summary of Installed Components**
- **GPU**: Tesla T4
- **Driver**: 535
- **CUDA Toolkit**: 12.2
- **cuDNN**: 9.5.1
- **PyTorch**: Installed with CUDA 12.2 support

This setup ensures your system is ready for deep learning workloads with the T4 GPU.

* Install conda and create a new environment for the project
* Install PyTorch and torchvision in the new environment
* Install other dependencies like numpy, pandas, matplotlib, etc.
* Run the project code in the new environment

```python
>>> import torch
>>> print(torch.cuda.is_available())
>>> print(torch.cuda.get_device_name(0))
>>> print(torch.version.cuda)
```
__CUDA Docker Setup__:
```bash
# If you are using docker and want to run a container with CUDA support
# (nvidia-container-toolkit comes from NVIDIA's container toolkit apt repository,
#  which must be configured before this install step)
sudo apt install -y nvidia-container-toolkit
nvidia-ctk --version
sudo systemctl restart docker
sudo systemctl status docker
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.2.0-devel-ubuntu20.04 nvcc --version   # nvcc is only in the -devel images, not -base
```
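
If the `--gpus all` test above complains that no NVIDIA runtime is available, the toolkit usually still needs to be registered with Docker. A short sketch using the `nvidia-ctk` CLI installed by the toolkit:

```bash
# Register the NVIDIA runtime in /etc/docker/daemon.json and restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Re-run the smoke test
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu20.04 nvidia-smi
```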