## __POETRY SETUP__

```bash
# Create a conda environment and install Poetry inside it
conda create -n poetry_env python=3.10 -y
conda activate poetry_env
pip install poetry
poetry env info

# Create the project skeleton
poetry new pytorch_project
cd pytorch_project/

# Fill in the pyproject.toml file without pytorch and torchvision, then install
poetry install

# Add pytorch and torchvision from a dedicated CPU-only package source
# (`poetry source add` takes the source name followed by its index URL)
poetry source add --priority explicit pytorch_cpu <pytorch_cpu_index_url>
poetry add --source pytorch_cpu torch torchvision
poetry lock
poetry show

# Add the remaining dependencies to the project
poetry add matplotlib
poetry add hydra-core
poetry add omegaconf
poetry add hydra_colorlog
poetry add --dev black  # development-only dependency
poetry lock
poetry show
```

| Type | Purpose | Installation Command |
| --- | --- | --- |
| Normal Dependency | Required for the app to run in production. | `poetry add <package>` |
| Development Dependency | Needed only during development (e.g., testing, linting). | `poetry add --dev <package>` |

```bash
# Add a dependency pinned to a specific version
poetry add <package_name>@<version>
```
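
After `poetry install` and the `poetry add` commands above, it can be useful to confirm that the packages actually landed in the environment. The following is a minimal sketch using only the standard library; the helper names (`installed_version`, `dependency_report`) and the package list are illustrative assumptions, not part of the project.

```python
from importlib import metadata
from typing import Dict, List, Optional

# Package names taken from the `poetry add` commands above (assumed list).
EXPECTED_PACKAGES = ["torch", "torchvision", "matplotlib", "hydra-core", "omegaconf"]


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def dependency_report(packages: Optional[List[str]] = None) -> Dict[str, Optional[str]]:
    """Map each package name to its installed version (None if not installed)."""
    names = EXPECTED_PACKAGES if packages is None else packages
    return {name: installed_version(name) for name in names}


if __name__ == "__main__":
    for name, version in dependency_report().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Saved under a hypothetical name such as `check_deps.py`, it could be run with `poetry run python check_deps.py` so that it inspects the Poetry-managed environment rather than the system interpreter.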
## Step-by-Step Guide to Creating Dockerfile and docker-compose.yml for a New Code Repo

If you're new to the project and need to set up Docker and Docker Compose to run the training and inference steps, follow these steps.

---

### 1. Setting Up the Dockerfile

A Dockerfile is a set of instructions that Docker uses to create an image. In this case, we'll use a __multi-stage build__ to keep the final image lightweight while managing dependencies with `Poetry`.

#### Step-by-Step Process for Creating the Dockerfile

1. __Choose a Base Image__:
   - Choose a Python image that matches the project's required version (e.g., Python 3.10.14).
   - Use the lightweight __`slim`__ variant to minimize image size.

   ```Dockerfile
   FROM python:3.10.14-slim AS builder
   ```
2. __Install Dependencies in the Build Stage__:
   - We'll use __Poetry__ for dependency management. Install it using `pip`.
   - Next, copy the `pyproject.toml` and `poetry.lock` files to the `/app` directory so the dependencies can be installed from them.

   ```Dockerfile
   RUN pip3 install poetry==1.7.1
   WORKDIR /app
   COPY pytorch_project/pyproject.toml pytorch_project/poetry.lock /app/
   ```

3. __Configure Poetry__:
   - Configure Poetry to install the dependencies in a virtual environment inside the project directory (`/app/.venv`), not globally. This keeps everything contained, avoids conflicts with the system environment, and lets the runtime stage copy the environment as a single directory.
   - Point Poetry's cache at `/tmp/poetry_cache`, the directory mounted as a build cache in the next step.

   ```Dockerfile
   ENV POETRY_VIRTUALENVS_IN_PROJECT=1 \
       POETRY_CACHE_DIR=/tmp/poetry_cache
   ```

4. __Install Dependencies__:
   - Use `poetry install --only main --no-root` to install only the main (non-development) dependencies and not the package itself, since the project code is not needed at this stage.
   - The `--mount=type=cache` option requires BuildKit. It reuses Poetry's cache across builds without storing it in an image layer.

   ```Dockerfile
   RUN --mount=type=cache,target=/tmp/poetry_cache poetry install --only main --no-root
   ```
5. __Build the Runtime Stage__:
   - Now, set up the final runtime image. This stage includes only the required application code and the virtual environment created in the first stage.
   - The final image uses the same Python base image but stays small because the dependencies are copied over rather than re-installed, and Poetry itself is left behind in the builder stage.

   ```Dockerfile
   FROM python:3.10.14-slim AS runner
   WORKDIR /app
   COPY src /app/src
   COPY --from=builder /app/.venv /app/.venv
   ```

6. __Set Up the Path to Use the Virtual Environment__:
   - Update the `PATH` environment variable so that `python` resolves to the binary inside the virtual environment.

   ```Dockerfile
   ENV PATH="/app/.venv/bin:$PATH"
   ```

7. __Set a Default Command__:
   - Finally, set the command that is executed by default when the container runs. You can override it later in the Docker Compose file.

   ```Dockerfile
   CMD ["python", "-m", "src.train"]
   ```
### Final Dockerfile

```Dockerfile
# Stage 1: Build environment with Poetry and dependencies
FROM python:3.10.14-slim AS builder
RUN pip3 install poetry==1.7.1
WORKDIR /app
COPY pytorch_project/pyproject.toml pytorch_project/poetry.lock /app/
# Create the virtual environment in /app/.venv and keep Poetry's cache in /tmp
ENV POETRY_VIRTUALENVS_IN_PROJECT=1 \
    POETRY_CACHE_DIR=/tmp/poetry_cache
RUN --mount=type=cache,target=/tmp/poetry_cache poetry install --only main --no-root

# Stage 2: Runtime environment
FROM python:3.10.14-slim AS runner
WORKDIR /app
COPY src /app/src
COPY --from=builder /app/.venv /app/.venv
ENV PATH="/app/.venv/bin:$PATH"
CMD ["python", "-m", "src.train"]
```
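
Because the runtime stage relies on `PATH` to pick up the copied virtual environment, a quick check from inside the container can confirm that `python` really is the one in `/app/.venv`. The sketch below is one way to do that with the standard library; the function names are hypothetical and the default path simply mirrors the Dockerfile above.

```python
import sys
from pathlib import Path


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


def uses_expected_venv(expected: str = "/app/.venv") -> bool:
    """True when the active environment is the one copied from the builder stage."""
    return in_virtualenv() and Path(sys.prefix).resolve() == Path(expected).resolve()


if __name__ == "__main__":
    print(f"interpreter:     {sys.executable}")
    print(f"prefix:          {sys.prefix}")
    print(f"in venv:         {in_virtualenv()}")
    print(f"uses /app/.venv: {uses_expected_venv()}")
```

If the last line prints `False` inside the container, the `ENV PATH=...` instruction or the `COPY --from=builder` step is the first place to look.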
---

### 2. Setting Up the docker-compose.yml File

The `docker-compose.yml` file is used to define and run multiple Docker containers as services. In this case, we need two services: one for __training__ and one for __inference__.

#### Step-by-Step Process for Creating docker-compose.yml

1. __Define the Version__:
   - Older Docker Compose releases read a top-level `version` key to select the file format. Use `3.8`, which is widely supported and covers the networking and volume features used here. Compose V2 treats the key as obsolete and only informative, so it is optional there.

   ```yaml
   version: '3.8'
   ```
2. __Set Up the `train` Service__:
   - The `train` service is responsible for running the training script. It builds the Docker image, runs the training command, and uses volumes to store the data, checkpoints, and artifacts.

   ```yaml
   services:
     train:
       build:
         context: .
       command: python -m src.train
       volumes:
         - data:/app/data
         - checkpoints:/app/checkpoints
         - artifacts:/app/artifacts
       shm_size: '2g' # Increase shared memory to prevent DataLoader issues
       networks:
         - default
       env_file:
         - .env # Load environment variables
   ```

3. __Set Up the `inference` Service__:
   - The `inference` service runs after training has completed. `depends_on` only controls the order in which the containers start, so the command itself waits for a file (e.g., `train_done.flag`) that the training process creates, and only then runs the inference script.

   ```yaml
     inference:
       build:
         context: .
       command: /bin/bash -c "while [ ! -f /app/checkpoints/train_done.flag ]; do sleep 10; done; python -m src.infer"
       volumes:
         - checkpoints:/app/checkpoints
         - artifacts:/app/artifacts
       shm_size: '2g'
       networks:
         - default
       depends_on:
         - train
       env_file:
         - .env
   ```
4. __Define Shared Volumes__:
   - Volumes allow services to share data. Here, we define three shared volumes:
     - `data`: Stores the input data.
     - `checkpoints`: Stores the model checkpoints and the flag indicating training is complete.
     - `artifacts`: Stores the final model outputs or artifacts.

   ```yaml
   volumes:
     data:
     checkpoints:
     artifacts:
   ```

5. __Set Up Networking__:
   - Use the default network to allow the services to communicate.

   ```yaml
   networks:
     default:
   ```
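
The hand-off between the two services depends on the training code creating `train_done.flag` in the shared `checkpoints` volume. The sketch below shows one way the two sides of that handshake could look in Python; the function names, the polling defaults, and the idea of clearing a stale flag are assumptions for illustration and are not taken from the project's `src.train` or `src.infer`.

```python
import tempfile
import time
from pathlib import Path
from typing import Optional

FLAG_NAME = "train_done.flag"  # file name used in the compose command above


def clear_training_flag(checkpoint_dir: str) -> None:
    """Remove a stale flag left behind by a previous run (no-op if absent)."""
    (Path(checkpoint_dir) / FLAG_NAME).unlink(missing_ok=True)


def mark_training_done(checkpoint_dir: str) -> Path:
    """Create the flag file; intended as the very last step of training."""
    flag = Path(checkpoint_dir) / FLAG_NAME
    flag.parent.mkdir(parents=True, exist_ok=True)
    flag.touch()
    return flag


def wait_for_training(checkpoint_dir: str, poll_seconds: float = 10.0,
                      timeout_seconds: Optional[float] = None) -> bool:
    """Poll until the flag exists; return False if the timeout expires first."""
    flag = Path(checkpoint_dir) / FLAG_NAME
    deadline = None if timeout_seconds is None else time.monotonic() + timeout_seconds
    while not flag.exists():
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True


if __name__ == "__main__":
    # Local demonstration with a temporary directory standing in for /app/checkpoints.
    with tempfile.TemporaryDirectory() as ckpt_dir:
        clear_training_flag(ckpt_dir)
        print("before:", wait_for_training(ckpt_dir, 0.01, 0.05))
        mark_training_done(ckpt_dir)
        print("after: ", wait_for_training(ckpt_dir, 0.01, 0.05))
```

Named volumes persist between runs, so a flag left over from an earlier training run would let inference start immediately. Clearing it when training begins, as `clear_training_flag` does here, is one way to guard against that.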
### Final docker-compose.yml

```yaml
version: '3.8'

services:
  train:
    build:
      context: .
    command: python -m src.train
    volumes:
      - data:/app/data
      - checkpoints:/app/checkpoints
      - artifacts:/app/artifacts
    shm_size: '2g'
    networks:
      - default
    env_file:
      - .env

  inference:
    build:
      context: .
    command: /bin/bash -c "while [ ! -f /app/checkpoints/train_done.flag ]; do sleep 10; done; python -m src.infer"
    volumes:
      - checkpoints:/app/checkpoints
      - artifacts:/app/artifacts
    shm_size: '2g'
    networks:
      - default
    depends_on:
      - train
    env_file:
      - .env

volumes:
  data:
  checkpoints:
  artifacts:

networks:
  default:
```
---

### Summary

1. __Dockerfile__:
   - A multi-stage Dockerfile creates a lightweight image: dependencies are installed with Poetry in the build stage, and the application code runs from the resulting virtual environment.
   - All dependencies are isolated in that virtual environment, and the final container includes only what is necessary at runtime.
2. __docker-compose.yml__:
   - The `docker-compose.yml` file defines two services:
     - __train__: Runs the training script and stores checkpoints.
     - __inference__: Waits for the training to finish and runs inference based on the saved model.
   - Shared volumes ensure that the services can access data, checkpoints, and artifacts.
   - `shm_size` is increased to prevent issues with the PyTorch DataLoader when using multiple workers.

This setup lets Docker Compose manage both services from a single file, which keeps the workflow simple and reproducible.
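
The `shm_size` setting matters because DataLoader worker processes hand batches to the main process through shared memory, and Docker's default `/dev/shm` is only 64 MB. As a rough sketch, a training script could inspect the mount before choosing a worker count. The helper names and the 1 GiB threshold below are arbitrary assumptions, not values from the project.

```python
import os
import shutil
from typing import Optional


def shm_free_bytes(path: str = "/dev/shm") -> Optional[int]:
    """Free space on the shared-memory mount in bytes, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    return shutil.disk_usage(path).free


def suggest_num_workers(requested: int, min_free_bytes: int = 1 << 30,
                        path: str = "/dev/shm") -> int:
    """Fall back to 0 workers (load in the main process) when shared memory is tight."""
    free = shm_free_bytes(path)
    if free is not None and free < min_free_bytes:
        return 0
    return requested


if __name__ == "__main__":
    print("free shared memory (bytes):", shm_free_bytes())
    print("suggested num_workers:", suggest_num_workers(4))
```

The returned value could then be passed as `num_workers` when constructing the DataLoader.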
8. ## __DVC SETUP__

First, install DVC (for example with `pip install dvc`), then initialise it, track the `data` directory, and push it to a local remote:

```bash
dvc init
dvc version
dvc init -f                      # force re-initialisation if .dvc/ already exists
dvc config core.autostage true   # automatically `git add` the files DVC creates
dvc add data
dvc remote add -d myremote /tmp/dvcstore
dvc push
```

Add some more files to the `data` directory and run the following commands:

```bash
dvc add data
dvc push
dvc pull
```
Next, go back one commit and run the following commands:

```bash
git checkout HEAD~1
dvc checkout
# the data directory now contains one file fewer
```

Next, return to the latest commit and run the following commands:

```bash
git checkout -
dvc checkout
dvc pull
dvc commit
```
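
The pattern in the last two blocks is always the same: move Git to a revision, then let DVC bring the data in line with the pointer files from that revision. A small wrapper can make that pairing explicit. This is only a sketch; `switch_data_version` is a hypothetical helper, and it defaults to a dry run that prints the commands instead of executing them.

```python
import shlex
import subprocess
from typing import List


def switch_data_version(rev: str, dry_run: bool = True) -> List[List[str]]:
    """Check out a Git revision, then sync the DVC-tracked data to match it."""
    commands = [
        ["git", "checkout", rev],  # moves the .dvc pointer files to that revision
        ["dvc", "checkout"],       # makes the workspace data match those pointers
    ]
    for cmd in commands:
        if dry_run:
            print("would run:", shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)  # raises if a command fails
    return commands


if __name__ == "__main__":
    switch_data_version("HEAD~1")  # dry run: prints the two commands
```

Calling `switch_data_version("HEAD~1", dry_run=False)` from the repository root would execute both steps, assuming `git` and `dvc` are on the `PATH`.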
Next, run the following commands to add Google Drive as a remote:

```bash
dvc remote add --default gdrive gdrive://1w2e3r4t5y6u7i8o9p0
dvc remote modify gdrive gdrive_acknowledge_abuse true
dvc remote modify gdrive gdrive_client_id <client_id>
dvc remote modify gdrive gdrive_client_secret <client_secret>
# does not work when run from a VM with port forwarding to the local machine
```

Next, run the following commands to add Azure Blob Storage as a remote:

```bash
dvc remote remove azblob   # drop any previously configured remote with this name
dvc remote add --default azblob azure://mycontainer/myfolder
dvc remote modify --local azblob connection_string "<connection_string>"
dvc remote modify azblob allow_anonymous_login true
dvc push -r azblob
# this works and requires no explicit login
```

Next, add S3 as a remote:

```bash
dvc remote add --default aws_remote s3://deep-bucket-s3/data
# --local keeps the credentials in .dvc/config.local, which is not committed to Git
dvc remote modify --local aws_remote access_key_id <access_key_id>
dvc remote modify --local aws_remote secret_access_key <secret_access_key>
dvc remote modify --local aws_remote region ap-south-1
dvc remote modify aws_remote region ap-south-1
dvc push -r aws_remote -v
```
9. ## __HYDRA SETUP__

```bash
# Install hydra
pip install hydra-core hydra_colorlog omegaconf

# Fill up the configs folder with the files required by the project

# Override syntax: `key=value` overrides an existing key, `+key=value` adds a new key,
# and `++key=value` overrides the key if it exists and adds it otherwise.

# Run the hydra experiment
# for train
python -m src.hydra_test experiment=catdog_experiment ++task_name=train ++train=True ++test=False
# for eval
python -m src.hydra_test experiment=catdog_experiment ++task_name=eval ++train=False ++test=True
# for both
python -m src.hydra_test experiment=catdog_experiment task_name=train train=True test=True
```
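
The difference between the three override prefixes is easy to mix up, so here is a toy model of the rules in plain Python. It is an illustration of the semantics only, not Hydra's implementation: it works on a flat dictionary, keeps every value as a string, and `apply_override` is a made-up name.

```python
from typing import Dict


def apply_override(config: Dict[str, str], override: str) -> Dict[str, str]:
    """Apply one `key=value`, `+key=value`, or `++key=value` override to a flat dict."""
    key, _, value = override.partition("=")
    result = dict(config)  # leave the caller's dict untouched
    if key.startswith("++"):
        # Override the key if it exists, add it otherwise.
        result[key[2:]] = value
    elif key.startswith("+"):
        # Add a new key; it is an error if the key is already present.
        name = key[1:]
        if name in result:
            raise KeyError(f"'{name}' already exists; use '{name}=' or '++{name}='")
        result[name] = value
    else:
        # Plain override; it is an error if the key is missing.
        if key not in result:
            raise KeyError(f"'{key}' is not in the config; use '+{key}=' or '++{key}='")
        result[key] = value
    return result


if __name__ == "__main__":
    cfg = {"task_name": "train", "train": "True", "test": "False"}
    print(apply_override(cfg, "++task_name=eval"))
    print(apply_override(cfg, "+seed=42"))
```

This is why the commands above use `++task_name=...`: the override succeeds whether or not the selected experiment config already defines `task_name`.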
10. ## __LOCAL SETUP__

```bash
python -m src.train experiment=catdog_experiment ++task_name=train ++train=True ++test=False
python -m src.train experiment=catdog_experiment ++task_name=eval ++train=False ++test=True
python -m src.infer experiment=catdog_experiment
```

```bash
# Reproduce the pipeline stages defined in dvc.yaml
dvc repro
```
12. ## _DVC Experiments_

- To run the DVC experiments, keep a separate `experiment_<name>.yaml` file for each experiment in the `experiment` folder under `configs`.
- In each `experiment_<name>.yaml` file, override the default value of every parameter that you want to change.

13. ## _HYDRA Experiments_

- Declare the hyperparameter config as a YAML file in the `hparam` folder under `configs`.
- Set `hparam` to `null` in the train and eval config files.
- Run the following commands to launch the Hydra experiment:

```bash
python -m src.train --multirun experiment=catdog_experiment_convnext ++task_name=train ++train=True ++test=False hparam=catdog_classifier_covnext
python -m src.create_artifacts
```
14. ## __Latest Execution Command__

```bash
python -m src.train_optuna_callbacks experiment=catdog_experiment ++task_name=train ++train=True ++test=False
python -m src.train_optuna_callbacks experiment=catdog_experiment ++task_name=test ++train=False ++test=True
python -m src.infer experiment=catdog_experiment
```