## __POETRY SETUP__

```bash
# Install poetry
conda create -n poetry_env python=3.10 -y
conda activate poetry_env
pip install poetry
poetry env info
poetry new pytorch_project
cd pytorch_project/
# fill in the pyproject.toml file, leaving out torch and torchvision for now
poetry install

# Add dependencies to the project for pytorch and torchvision
poetry source add --priority explicit pytorch_cpu https://download.pytorch.org/whl/cpu
poetry add --source pytorch_cpu torch torchvision
poetry lock
poetry show
poetry install --no-root

# Add dependencies to the project 
poetry add matplotlib
poetry add hydra-core
poetry add omegaconf
poetry add hydra_colorlog
poetry add --dev black  # dev-only dependency (e.g., a formatter)
poetry lock
poetry show

# Add a dependency with a specific version
poetry add <package_name>@<version>
```

| Type | Purpose | Installation Command |
| --- | --- | --- |
| Normal Dependency | Required for the app to run in production. | `poetry add <package>` |
| Development Dependency | Needed only during development (e.g., testing, linting). | `poetry add --dev <package>` |
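
Once the dependencies resolve, a quick sanity check confirms the CPU wheels are importable from the Poetry-managed environment (a minimal sketch; it only assumes the torch/torchvision packages added above):

```bash
# Run inside the project directory so Poetry picks up the right virtualenv
poetry run python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__)"
```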

## __MULTISTAGEDOCKER SETUP__

#### Step-by-Step Guide to Creating Dockerfile and docker-compose.yml for a New Code Repo

If you're new to the project and need to set up Docker and Docker Compose to run the training and inference steps, follow these steps.

---

### 1. Setting Up the Dockerfile

A Dockerfile is a set of instructions that Docker uses to create an image. In this case, we'll use a __multi-stage build__ to make the final image lightweight while managing dependencies with `Poetry`.

#### Step-by-Step Process for Creating the Dockerfile

1. __Choose a Base Image__:
   - We need to choose a Python image that matches the project's required version (e.g., Python 3.10.14).
   - Use the lightweight __`slim`__ version to minimize image size.

   ```Dockerfile
   FROM python:3.10.14-slim as builder
   ```

2. __Install Dependencies in the Build Stage__:
   - We'll use __Poetry__ for dependency management. Install it using `pip`.
   - Next, copy the `pyproject.toml` and `poetry.lock` files to the `/app` directory to install dependencies.

   ```Dockerfile
   RUN pip3 install poetry==1.7.1
   WORKDIR /app
   COPY pytorch_project/pyproject.toml pytorch_project/poetry.lock /app/
   ```

3. __Configure Poetry__:
   - Configure Poetry to install the dependencies in a virtual environment inside the project directory (not globally). This keeps everything contained and avoids conflicts with the system environment.

   ```Dockerfile
   ENV POETRY_NO_INTERACTION=1 \
       POETRY_VIRTUALENVS_IN_PROJECT=1 \
       POETRY_VIRTUALENVS_CREATE=true \
       POETRY_CACHE_DIR=/tmp/poetry_cache
   ```

4. __Install Dependencies__:
   - Use `poetry install --only main --no-root` to install only the main (non-dev) dependencies and skip installing the project package itself; the application code is copied in later, at the runtime stage.

   ```Dockerfile
   RUN --mount=type=cache,target=/tmp/poetry_cache poetry install --only main --no-root
   ```

5. __Build the Runtime Stage__:
   - Now, set up the final runtime image. This stage will only include the required application code and the virtual environment created in the first stage.
   - The final image will use the same Python base image but remain small by avoiding the re-installation of dependencies.

   ```Dockerfile
   FROM python:3.10.14-slim as runner
   WORKDIR /app
   COPY src /app/src
   COPY --from=builder /app/.venv /app/.venv
   ```

6. __Set Up the Path to Use the Virtual Environment__:
   - Update the `PATH` environment variable to use the Python binaries from the virtual environment.

   ```Dockerfile
   ENV PATH="/app/.venv/bin:$PATH"
   ```

7. __Set a Default Command__:
   - Finally, set the command that will be executed by default when the container is run. You can change or override this later in the Docker Compose file.

   ```Dockerfile
   CMD ["python", "-m", "src.train"]
   ```

### Final Dockerfile

```Dockerfile
# Stage 1: Build environment with Poetry and dependencies
FROM python:3.10.14-slim as builder
RUN pip3 install poetry==1.7.1
WORKDIR /app
COPY pytorch_project/pyproject.toml pytorch_project/poetry.lock /app/
ENV POETRY_NO_INTERACTION=1 \
    POETRY_VIRTUALENVS_IN_PROJECT=1 \
    POETRY_VIRTUALENVS_CREATE=true \
    POETRY_CACHE_DIR=/tmp/poetry_cache
RUN --mount=type=cache,target=/tmp/poetry_cache poetry install --only main --no-root

# Stage 2: Runtime environment
FROM python:3.10.14-slim as runner
WORKDIR /app
COPY src /app/src
COPY --from=builder /app/.venv /app/.venv
ENV PATH="/app/.venv/bin:$PATH"
CMD ["python", "-m", "src.train"]
```
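
With both stages in place, the image can be built and the default training command run locally (a minimal sketch; the `pytorch_project` tag is an arbitrary choice):

```bash
# BuildKit is required for the cache mount in the builder stage (default in recent Docker)
docker build -t pytorch_project .
docker run --rm pytorch_project   # runs the default CMD: python -m src.train
```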

---

### 2. Setting Up the docker-compose.yml File

The `docker-compose.yml` file is used to define and run multiple Docker containers as services. In this case, we need two services: one for __training__ and one for __inference__.

### Step-by-Step Process for Creating docker-compose.yml

1. __Define the Version__:
   - Docker Compose uses a versioning system. Use version `3.8`, which is widely supported and offers features such as networking and volume support.

   ```yaml
   version: '3.8'
   ```

2. __Set Up the `train` Service__:
   - The `train` service is responsible for running the training script. It builds the Docker image, runs the training command, and uses volumes to store the data, checkpoints, and artifacts.

   ```yaml
   services:
     train:
       build:
         context: .
       command: python -m src.train
       volumes:
         - data:/app/data
         - checkpoints:/app/checkpoints
         - artifacts:/app/artifacts
       shm_size: '2g'  # Increase shared memory to prevent DataLoader issues
       networks:
         - default
       env_file:
         - .env  # Load environment variables
   ```

3. __Set Up the `inference` Service__:
   - The `inference` service runs after the training has completed. It waits for a file (e.g., `train_done.flag`) to be created by the training process and then runs the inference script (a sketch of that completion signal follows this list).

   ```yaml
     inference:
       build:
         context: .
       command: /bin/bash -c "while [ ! -f /app/checkpoints/train_done.flag ]; do sleep 10; done; python -m src.infer"
       volumes:
         - checkpoints:/app/checkpoints
         - artifacts:/app/artifacts
       shm_size: '2g'
       networks:
         - default
       depends_on:
         - train
       env_file:
         - .env
   ```

4. __Define Shared Volumes__:
   - Volumes allow services to share data. Here, we define three shared volumes:
     - `data`: Stores the input data.
     - `checkpoints`: Stores the model checkpoints and the flag indicating training is complete.
     - `artifacts`: Stores the final model outputs or artifacts.

   ```yaml
   volumes:
     data:
     checkpoints:
     artifacts:
   ```

5. __Set Up Networking__:
   - Use the default network to allow the services to communicate.

   ```yaml
   networks:
     default:
   ```
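
The flag-file handshake assumes the training entrypoint drops `train_done.flag` into the shared `checkpoints` volume as its last step; a minimal sketch of that signal (where exactly `src.train` does this is an assumption):

```bash
# Equivalent shell form of the completion signal src.train is expected to emit when it finishes
touch /app/checkpoints/train_done.flag
```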

### Final docker-compose.yml

```yaml
version: '3.8'

services:
  train:
    build:
      context: .
    command: python -m src.train
    volumes:
      - data:/app/data
      - checkpoints:/app/checkpoints
      - artifacts:/app/artifacts
    shm_size: '2g'
    networks:
      - default
    env_file:
      - .env

  inference:
    build:
      context: .
    command: /bin/bash -c "while [ ! -f /app/checkpoints/train_done.flag ]; do sleep 10; done; python -m src.infer"
    volumes:
      - checkpoints:/app/checkpoints
      - artifacts:/app/artifacts
    shm_size: '2g'
    networks:
      - default
    depends_on:
      - train
    env_file:
      - .env

volumes:
  data:
  checkpoints:
  artifacts:

networks:
  default:
```
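
Both services reference an `.env` file, so one must exist (even an empty one) before bringing the stack up. A minimal usage sketch:

```bash
# Create a placeholder .env first (add real variables as needed)
touch .env

# Build the images and start the stack; inference waits for train_done.flag, then runs
docker compose up --build
```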

---

### Summary

1. __Dockerfile__:
   - A multi-stage Dockerfile is used to create a lightweight image where the dependencies are installed with Poetry and the application code is run using a virtual environment.
   - It ensures that all dependencies are isolated in a virtual environment, and the final container only includes what is necessary for the runtime.

2. __docker-compose.yml__:
   - The `docker-compose.yml` file defines two services:
     - __train__: Runs the training script and stores checkpoints.
     - __inference__: Waits for the training to finish and runs inference based on the saved model.
   - Shared volumes ensure that the services can access data, checkpoints, and artifacts.
   - `shm_size` is increased to prevent issues with DataLoader in PyTorch when using multiple workers.

This setup allows for easy management of multiple services using Docker Compose, ensuring reproducibility and simplicity.

## __References__

- <https://stackoverflow.com/questions/53835198/integrating-python-poetry-with-docker>
- <https://github.com/fralik/poetry-with-private-repos/blob/master/Dockerfile>
- <https://medium.com/@albertazzir/blazing-fast-python-docker-builds-with-poetry-a78a66f5aed0>
- <https://www.martinrichards.me/post/python_poetry_docker/>
- <https://gist.github.com/soof-golan/6ebb97a792ccd87816c0bda1e6e8b8c2>

8. ## __DVC SETUP__

First, install DVC and initialize it in the repository:

```bash
pip install dvc
dvc init
dvc version
dvc init -f
dvc config core.autostage true
dvc add data
dvc remote add -d myremote /tmp/dvcstore
dvc push
```
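
Because `core.autostage` is enabled, `dvc add` automatically stages the generated `data.dvc` pointer and `.gitignore` entries; they still need a Git commit so the data version is tied to Git history (a minimal sketch):

```bash
# core.autostage has already run `git add` on data.dvc and .gitignore
git commit -m "Track data directory with DVC"
```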

Add some more files to the data directory and run the following commands

```bash
dvc add data
dvc push
dvc pull
```

Next, go back one commit and run the following commands

```bash
git checkout HEAD~1
dvc checkout
# the data directory now has one file less (its earlier version)
```

Next, go back to the latest commit and run the following commands

```bash
git checkout -
dvc checkout
dvc pull
dvc commit
```

Next, run the following commands to add Google Drive as a remote

```bash
dvc remote add --default gdrive gdrive://1w2e3r4t5y6u7i8o9p0
dvc remote modify gdrive gdrive_acknowledge_abuse true
dvc remote modify gdrive gdrive_client_id <>
dvc remote modify gdrive gdrive_client_secret <>
# note: this does not work when used from a VM with port forwarding to the local machine
```

Next, run the following commands to add Azure Blob Storage as a remote

```bash
dvc remote remove azblob
dvc remote add --default azblob azure://mycontainer/myfolder
dvc remote modify --local azblob connection_string "<>"
dvc remote modify azblob  allow_anonymous_login true
dvc push -r azblob
# this works and requires no explicit login
```

Next we will add S3 as a remote

```bash
dvc remote add --default aws_remote s3://deep-bucket-s3/data
dvc remote modify --local aws_remote access_key_id <>
dvc remote modify --local aws_remote secret_access_key <>
dvc remote modify --local aws_remote region ap-south-1
dvc remote modify aws_remote region ap-south-1
dvc push -r aws_remote -v
```
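
To confirm the S3 remote is reachable before relying on it, a quick check (a minimal sketch using standard DVC commands):

```bash
# Compare the local cache against the S3 remote, then fetch anything missing
dvc status -c -r aws_remote
dvc pull -r aws_remote
```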

9. ## __HYDRA SETUP__

```bash
# Install hydra
pip install hydra-core hydra_colorlog omegaconf
# Fill in the configs folder with the files required by the project
# Run the following command to run the hydra experiment
# for train 
python -m src.hydra_test experiment=catdog_experiment ++task_name=train ++train=True ++test=False
# for eval
python -m src.hydra_test experiment=catdog_experiment ++task_name=eval ++train=False ++test=True
# for both
python -m src.hydra_test experiment=catdog_experiment task_name=train train=True test=True  # a bare key=value overrides an existing key, +key=value adds a new key, and ++key=value adds the key if missing or overrides it if present
```
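
As a quick illustration of the override prefixes (the `seed` and `extra_tag` keys below are hypothetical and only stand in for keys that do or do not already exist in the config):

```bash
python -m src.hydra_test experiment=catdog_experiment seed=123          # plain: override a key that already exists
python -m src.hydra_test experiment=catdog_experiment +extra_tag=demo   # +: add a key that is not in the config yet
python -m src.hydra_test experiment=catdog_experiment ++seed=123        # ++: add if missing, override if present
```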

10. ## __LOCAL SETUP__

```bash
 python -m src.train experiment=catdog_experiment ++task_name=train ++train=True ++test=False
 python -m src.train experiment=catdog_experiment ++task_name=eval ++train=False ++test=True
 python -m src.infer experiment=catdog_experiment
```

11. ## __DVC PIPELINE SETUP__

```bash
dvc repro
```
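
`dvc repro` executes the stages declared in `dvc.yaml`. A minimal sketch of what such a file might look like for this project (the stage name, dependency paths, and outputs below are assumptions, not the repo's actual pipeline):

```bash
# Hypothetical dvc.yaml illustrating the stage structure that dvc repro executes
cat > dvc.yaml <<'EOF'
stages:
  train:
    cmd: python -m src.train experiment=catdog_experiment
    deps:
      - data
      - src/train.py
    outs:
      - checkpoints
EOF
dvc repro
```
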
12. ## __DVC Experiments__
  - To run DVC experiments, keep a separate experiment_<>.yaml file for each experiment under the experiment folder inside the configs folder
  - Make sure to override the default values in each experiment_<>.yaml file for every parameter you want to change (a hypothetical sketch follows this list)
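
A minimal sketch of such an experiment override file, assuming the usual Hydra experiment-config pattern (the file name, keys, and values are hypothetical):

```bash
# Hypothetical configs/experiment/experiment_resnet.yaml showing the override pattern
cat > configs/experiment/experiment_resnet.yaml <<'EOF'
# @package _global_

# override only the values that differ from the defaults
model:
  lr: 0.001
trainer:
  max_epochs: 10
EOF
```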

13. ## __HYDRA Experiments__
  - Make sure to declare the config file in YAML format inside the hparam folder under configs
  - Set hparam to null in the train and eval config files
  - Run the following command to run the Hydra experiment:
  ```bash
   python -m src.train --multirun experiment=catdog_experiment_convnext ++task_name=train ++train=True ++test=False hparam=catdog_classifier_covnext
   python -m src.create_artifacts
  ```
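
The `hparam=catdog_classifier_covnext` override points at a sweep config; a minimal sketch of what such a file might contain (the parameter names and value ranges are assumptions for illustration):

```bash
# Hypothetical configs/hparam/catdog_classifier_covnext.yaml driving the --multirun sweep
cat > configs/hparam/catdog_classifier_covnext.yaml <<'EOF'
# @package _global_
hydra:
  sweeper:
    params:
      model.lr: 0.001,0.0001
      data.batch_size: 32,64
EOF
```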

14. ## __Latest Execution Command__

```bash
python -m src.train_optuna_callbacks experiment=catdog_experiment ++task_name=train ++train=True ++test=False
python -m src.train_optuna_callbacks experiment=catdog_experiment ++task_name=test ++train=False ++test=True
python -m src.infer experiment=catdog_experiment
```

15. ## __GPU Setup__
```bash
docker build -t my-gpu-app .
docker run --gpus all my-gpu-app
docker exec -it <container_id> /bin/bash
# pytorch/pytorch:2.2.2-cuda12.1-cudnn8-runtime supports cuda 12.1 and python 3.10.14
```
```yaml
# For Docker Compose, GPU access is requested with a device reservation like the following:
services:
  test:
    image: nvidia/cuda:12.3.1-base-ubuntu20.04
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```