# Install poetry
conda create -n poetry_env python=3.10 -y
conda activate poetry_env
pip install poetry
poetry env info
poetry new pytorch_project
cd pytorch_project/
# fill up the pyproject.toml file without pytorch and torchvision
poetry install
# Add dependencies to the project for pytorch and torchvision
poetry source add --priority explicit pytorch_cpu
poetry add --source pytorch_cpu torch torchvision
poetry lock
poetry show
# Add dependencies to the project
poetry add matplotlib
poetry add hydra-core
poetry add omegaconf
poetry add hydra_colorlog
poetry add --group dev black
poetry lock
poetry show
| Type | Purpose | Installation Command |
| --- | --- | --- |
| Normal Dependency | Required for the app to run in production. | `poetry add <package>` |
| Development Dependency | Needed only during development (e.g., testing, linting). | `poetry add --group dev <package>` (older Poetry releases also accept `--dev`) |
# Add a dependency to the project with a specific version
poetry add <package_name>@<version>
## Step-by-Step Guide to Creating Dockerfile and docker-compose.yml for a New Code Repo
If you're new to the project and need to set up Docker and Docker Compose to run training and inference, follow the steps below.
### 1. Setting Up the Dockerfile
A Dockerfile is a set of instructions that Docker uses to create an image. In this case, we'll use a __multi-stage build__ to make the final image lightweight while managing dependencies with `Poetry`.
#### Step-by-Step Process for Creating the Dockerfile
1. __Choose a Base Image__:
- We need to choose a Python image that matches the project's required version (e.g., Python 3.10.14).
- Use the lightweight __`slim`__ version to minimize image size.
FROM python:3.10.14-slim AS builder
2. __Install Dependencies in the Build Stage__:
- We'll use __Poetry__ for dependency management. Install it using `pip`.
- Next, copy the `pyproject.toml` and `poetry.lock` files to the `/app` directory to install dependencies.
RUN pip3 install poetry==1.7.1
# Work in /app so that later Poetry commands find pyproject.toml
WORKDIR /app
COPY pytorch_project/pyproject.toml pytorch_project/poetry.lock /app/
3. __Configure Poetry__:
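- Configure Poetry to create the virtual environment inside the project directory (`/app/.venv`) rather than in a global location. This keeps everything contained and avoids conflicts with the system environment.
- One way to do this, assuming Poetry's standard environment variables, is:
ENV POETRY_NO_INTERACTION=1 \
    POETRY_VIRTUALENVS_IN_PROJECT=1 \
    POETRY_VIRTUALENVS_CREATE=1 \
    POETRY_CACHE_DIR=/tmp/poetry_cache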
4. __Install Dependencies__:
- Use `poetry install --only main --no-root` to install only the main (non-development) dependencies and not the project package itself. The application code is not needed at this stage; it is copied in during the runtime stage.
RUN --mount=type=cache,target=/tmp/poetry_cache poetry install --only main --no-root
5. __Build the Runtime Stage__:
- Now, set up the final runtime image. This stage will only include the required application code and the virtual environment created in the first stage.
- The final image will use the same Python base image but remain small by avoiding the re-installation of dependencies.
FROM python:3.10.14-slim AS runner
# Run from /app so that `python -m src.train` can import the src package
WORKDIR /app
COPY src /app/src
COPY --from=builder /app/.venv /app/.venv
6. __Set Up the Path to Use the Virtual Environment__:
- Update the `PATH` environment variable to use the Python binaries from the virtual environment.
ENV PATH="/app/.venv/bin:$PATH"
7. __Set a Default Command__:
- Finally, set the command that will be executed by default when the container is run. You can change or override this later in the Docker Compose file.
CMD ["python", "-m", "src.train"]
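To confirm that the `PATH` change in step 6 took effect, a quick check inside the container is to ask Python whether it is running from a virtual environment. The sketch below is illustrative only; the helper name `running_in_venv` is hypothetical and not part of the project.

```python
import sys


def running_in_venv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # In a venv, sys.prefix points at the venv directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print(f"interpreter: {sys.executable}")
    print(f"virtual environment active: {running_in_venv()}")
```

Run inside the image, this should report an interpreter under `/app/.venv/bin` if the virtual environment was copied and `PATH` was set as described.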
### Final Dockerfile
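# Stage 1: Build environment with Poetry and dependencies
FROM python:3.10.14-slim AS builder
RUN pip3 install poetry==1.7.1
# Assumed Poetry settings: create the virtual environment in /app/.venv
# and keep the download cache in /tmp/poetry_cache
ENV POETRY_NO_INTERACTION=1 \
    POETRY_VIRTUALENVS_IN_PROJECT=1 \
    POETRY_VIRTUALENVS_CREATE=1 \
    POETRY_CACHE_DIR=/tmp/poetry_cache
WORKDIR /app
COPY pytorch_project/pyproject.toml pytorch_project/poetry.lock /app/
RUN --mount=type=cache,target=/tmp/poetry_cache poetry install --only main --no-root
# Stage 2: Runtime environment
FROM python:3.10.14-slim AS runner
WORKDIR /app
COPY src /app/src
COPY --from=builder /app/.venv /app/.venv
ENV PATH="/app/.venv/bin:$PATH"
CMD ["python", "-m", "src.train"]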
### 2. Setting Up the docker-compose.yml File
The `docker-compose.yml` file is used to define and run multiple Docker containers as services. In this case, we need two services: one for __training__ and one for __inference__.
#### Step-by-Step Process for Creating docker-compose.yml
1. __Define the Version__:
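- Docker Compose files have traditionally declared a file-format version. This guide uses `3.8`, which is widely supported. Recent Docker Compose releases that follow the Compose Specification treat the top-level `version` key as obsolete and ignore it, so it can be omitted there.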
version: '3.8'
2. __Set Up the `train` Service__:
- The `train` service is responsible for running the training script. It builds the Docker image, runs the training command, and uses volumes to store the data, checkpoints, and artifacts.
services:
  train:
    build:
      context: .
    command: python -m src.train
    volumes:
      - data:/app/data
      - checkpoints:/app/checkpoints
      - artifacts:/app/artifacts
    shm_size: '2g'  # Increase shared memory to prevent DataLoader issues
    networks:
      - default
    env_file:
      - .env  # Load environment variables
3. __Set Up the `inference` Service__:
- The `inference` service runs after the training has completed. It waits for a file (e.g., `train_done.flag`) to be created by the training process and then runs the inference script.
  inference:
    build:
      context: .
    command: /bin/bash -c "while [ ! -f /app/checkpoints/train_done.flag ]; do sleep 10; done; python -m src.infer"
    volumes:
      - checkpoints:/app/checkpoints
      - artifacts:/app/artifacts
    shm_size: '2g'
    networks:
      - default
    depends_on:
      - train
    env_file:
      - .env
4. __Define Shared Volumes__:
- Volumes allow services to share data. Here, we define three shared volumes:
- `data`: Stores the input data.
- `checkpoints`: Stores the model checkpoints and the flag indicating training is complete.
- `artifacts`: Stores the final model outputs or artifacts.
5. __Set Up Networking__:
- Use the default network to allow the services to communicate.
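The hand-off in step 3 depends on the training code creating `train_done.flag` in the shared `checkpoints` volume. How `src.train` does this is not shown here, so the following is only a sketch under assumptions: `mark_training_done` and `wait_for_flag` are hypothetical helpers, and the timeout is an addition that the shell loop in the compose command does not have.

```python
import tempfile
import time
from pathlib import Path
from typing import Optional

FLAG_NAME = "train_done.flag"  # file name taken from the compose command


def mark_training_done(checkpoint_dir: Path) -> Path:
    """Create the flag file once training has finished (hypothetical helper)."""
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    flag = checkpoint_dir / FLAG_NAME
    flag.touch()
    return flag


def wait_for_flag(
    checkpoint_dir: Path,
    poll_seconds: float = 10.0,
    timeout_seconds: Optional[float] = None,
) -> bool:
    """Poll until the flag exists; return False if the timeout expires first."""
    flag = checkpoint_dir / FLAG_NAME
    deadline = None if timeout_seconds is None else time.monotonic() + timeout_seconds
    while not flag.exists():
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True


if __name__ == "__main__":
    # Demonstrate both outcomes in a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        checkpoints = Path(tmp) / "checkpoints"
        print(wait_for_flag(checkpoints, poll_seconds=0.01, timeout_seconds=0.05))
        mark_training_done(checkpoints)
        print(wait_for_flag(checkpoints, poll_seconds=0.01, timeout_seconds=0.05))
```

In the real setup the training process would call something like `mark_training_done(Path("/app/checkpoints"))` as its last step. Note that a flag left over from an earlier run in the same volume would let inference start immediately, so it may be worth removing the flag when training begins.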
### Final docker-compose.yml
version: '3.8'
services:
  train:
    build:
      context: .
    command: python -m src.train
    volumes:
      - data:/app/data
      - checkpoints:/app/checkpoints
      - artifacts:/app/artifacts
    shm_size: '2g'
    networks:
      - default
    env_file:
      - .env
  inference:
    build:
      context: .
    command: /bin/bash -c "while [ ! -f /app/checkpoints/train_done.flag ]; do sleep 10; done; python -m src.infer"
    volumes:
      - checkpoints:/app/checkpoints
      - artifacts:/app/artifacts
    shm_size: '2g'
    networks:
      - default
    depends_on:
      - train
    env_file:
      - .env
volumes:
  data:
  checkpoints:
  artifacts:
### Summary
1. __Dockerfile__:
- A multi-stage Dockerfile is used to create a lightweight image where the dependencies are installed with Poetry and the application code is run using a virtual environment.
- It ensures that all dependencies are isolated in a virtual environment, and the final container only includes what is necessary for the runtime.
2. __docker-compose.yml__:
- The `docker-compose.yml` file defines two services:
- __train__: Runs the training script and stores checkpoints.
- __inference__: Waits for the training to finish and runs inference based on the saved model.
- Shared volumes ensure that the services can access data, checkpoints, and artifacts.
- `shm_size` is increased to prevent issues with DataLoader in PyTorch when using multiple workers.
This setup allows for easy management of multiple services using Docker Compose, ensuring reproducibility and simplicity.
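Whether the `shm_size` setting took effect can be checked from inside a running container by looking at the size of `/dev/shm`. The sketch below uses only the standard library; `shared_memory_gib` is a hypothetical helper name, and the path is a parameter so the function can also be tried on systems without `/dev/shm`.

```python
import os
import shutil


def shared_memory_gib(path: str = "/dev/shm") -> float:
    """Return the total size of the filesystem mounted at `path`, in GiB."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.total / 1024**3


if __name__ == "__main__":
    # Fall back to the current directory on systems without /dev/shm.
    target = "/dev/shm" if os.path.isdir("/dev/shm") else "."
    print(f"{target} total: {shared_memory_gib(target):.2f} GiB")
```

With `shm_size: '2g'` applied, running this inside the container should report roughly 2 GiB for `/dev/shm`.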
8. ## __DVC SETUP__
First, install DVC, then run the following commands:
dvc init
dvc version
dvc init -f
dvc config core.autostage true
dvc add data
dvc remote add -d myremote /tmp/dvcstore
dvc push
Add some more files to the data directory, then run the following commands:
dvc add data
dvc push
dvc pull
Next, go back one commit and run the following commands:
git checkout HEAD~1
dvc checkout
# the data directory now has one file fewer
Next, go back to the latest commit and run the following commands:
git checkout -
dvc checkout
dvc pull
dvc commit
Next, run the following commands to add Google Drive as a remote:
dvc remote add --default gdrive gdrive://1w2e3r4t5y6u7i8o9p0
dvc remote modify gdrive gdrive_acknowledge_abuse true
dvc remote modify gdrive gdrive_client_id <>
dvc remote modify gdrive gdrive_client_secret <>
# this does not work when run from a VM with port forwarding to the local machine
Next, run the following commands to add Azure Blob Storage as a remote:
dvc remote remove azblob
dvc remote add --default azblob azure://mycontainer/myfolder
dvc remote modify --local azblob connection_string "<>"
dvc remote modify azblob allow_anonymous_login true
dvc push -r azblob
# this works and requires no explicit login
Next, add S3 as a remote:
dvc remote add --default aws_remote s3://deep-bucket-s3/data
dvc remote modify --local aws_remote access_key_id <>
dvc remote modify --local aws_remote secret_access_key <>
dvc remote modify --local aws_remote region ap-south-1
dvc remote modify aws_remote region ap-south-1
dvc push -r aws_remote -v
9. ## __HYDRA SETUP__
# Install hydra
pip install hydra-core hydra_colorlog omegaconf
# Fill in the configs folder with the files required by the project
# Run the following command to run the hydra experiment
# for train
python -m src.hydra_test experiment=catdog_experiment ++task_name=train ++train=True ++test=False
# for eval
python -m src.hydra_test experiment=catdog_experiment ++task_name=eval ++train=False ++test=True
# for both
python -m src.hydra_test experiment=catdog_experiment task_name=train train=True test=True
# Note: +key=value adds a new key to the config; ++key=value overrides the key if it already exists and adds it otherwise
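The three override forms differ only in how they treat keys that are or are not already in the config. The sketch below mimics those rules on a plain dictionary to make the difference concrete. It is not Hydra's implementation: `apply_override` is a hypothetical helper that handles only flat keys and keeps every value as a string.

```python
def apply_override(config: dict, override: str) -> dict:
    """Apply one `key=value`, `+key=value` or `++key=value` override to a copy of config."""
    result = dict(config)
    if override.startswith("++"):
        key, value = override[2:].split("=", 1)
        result[key] = value  # override if present, add otherwise
    elif override.startswith("+"):
        key, value = override[1:].split("=", 1)
        if key in result:
            raise KeyError(f"'{key}' already exists; use '{key}=' or '++{key}='")
        result[key] = value  # add a new key only
    else:
        key, value = override.split("=", 1)
        if key not in result:
            raise KeyError(f"'{key}' is not in the config; use '+{key}=' or '++{key}='")
        result[key] = value  # change an existing key only
    return result


if __name__ == "__main__":
    cfg = {"task_name": "train", "train": "True"}
    cfg = apply_override(cfg, "task_name=eval")  # existing key: plain form works
    cfg = apply_override(cfg, "+test=True")      # new key: needs "+"
    cfg = apply_override(cfg, "++train=False")   # works whether or not the key exists
    print(cfg)
```

Using `++` throughout, as the train and eval commands above do, is the forgiving choice: the command works whether or not the key is already declared in the experiment config.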
10. ## __LOCAL SETUP__
python -m src.train experiment=catdog_experiment ++task_name=train ++train=True ++test=False
python -m src.train experiment=catdog_experiment ++task_name=eval ++train=False ++test=True
python -m src.infer experiment=catdog_experiment
dvc repro
12. ## _DVC Experiments_
- To run the DVC experiments, keep a separate `experiment_<>.yaml` file for each experiment in the `experiment` folder under `configs`.
- In each `experiment_<>.yaml` file, override the default value of every parameter that you want to change.
13. ## _HYDRA Experiments_
- Make sure to declare the config file in YAML format in the `hparam` folder under `configs`.
- Set `hparam` to `null` in the train and eval config files.
- Run the following commands to run the Hydra experiment:
python -m src.train --multirun experiment=catdog_experiment_convnext ++task_name=train ++train=True ++test=False hparam=catdog_classifier_covnext
python -m src.create_artifacts
14. ## __Latest Execution Command__
python -m src.train_optuna_callbacks experiment=catdog_experiment ++task_name=train ++train=True ++test=False
python -m src.train_optuna_callbacks experiment=catdog_experiment ++task_name=test ++train=False ++test=True
python -m src.infer experiment=catdog_experiment