POETRY SETUP
# Install poetry
conda create -n poetry_env python=3.10 -y
conda activate poetry_env
pip install poetry
poetry env info
poetry new pytorch_project
cd pytorch_project/
# fill up the pyproject.toml file without pytorch and torchvision
poetry install
# Add dependencies to the project for pytorch and torchvision
poetry source add --priority explicit pytorch_cpu https://download.pytorch.org/whl/cpu
poetry add --source pytorch_cpu torch torchvision
poetry lock
poetry show
poetry install --no-root
# Add dependencies to the project
poetry add matplotlib
poetry add hydra-core
poetry add omegaconf
poetry add hydra_colorlog
poetry add --group dev black # development-only dependency
poetry lock
poetry show
| Type | Purpose | Installation Command |
| --- | --- | --- |
| Normal Dependency | Required for the app to run in production. | `poetry add <package>` |
| Development Dependency | Needed only during development (e.g., testing, linting). | `poetry add --group dev <package>` |
# Add dependencies to the project with specific version
poetry add <package_name>@<version>
MULTISTAGE DOCKER SETUP
Step-by-Step Guide to Creating Dockerfile and docker-compose.yml for a New Code Repo
If you're new to the project and need to set up Docker and Docker Compose to run the training and inference steps, follow these steps.
1. Setting Up the Dockerfile
A Dockerfile is a set of instructions that Docker uses to create an image. In this case, we'll use a multi-stage build to make the final image lightweight while managing dependencies with Poetry.
Step-by-Step Process for Creating the Dockerfile
Choose a Base Image:
- We need to choose a Python image that matches the project's required version (e.g., Python 3.10.14).
- Use the lightweight `slim` version to minimize image size.
FROM python:3.10.14-slim as builder
Install Dependencies in the Build Stage:
- We'll use Poetry for dependency management. Install it using `pip`.
- Next, copy the `pyproject.toml` and `poetry.lock` files to the `/app` directory to install dependencies.
RUN pip3 install poetry==1.7.1
WORKDIR /app
COPY pytorch_project/pyproject.toml pytorch_project/poetry.lock /app/
Configure Poetry:
- Configure Poetry to install the dependencies in a virtual environment inside the project directory (not globally). This keeps everything contained and avoids conflicts with the system environment.
ENV POETRY_NO_INTERACTION=1 \
    POETRY_VIRTUALENVS_IN_PROJECT=1 \
    POETRY_VIRTUALENVS_CREATE=true \
    POETRY_CACHE_DIR=/tmp/poetry_cache
Install Dependencies:
- Use `poetry install --no-root` to install only the dependencies and not the package itself, since the project code is not needed at this stage. The `--only main` flag additionally skips development dependencies, and the cache mount (which requires BuildKit) reuses Poetry's download cache across builds.
RUN --mount=type=cache,target=/tmp/poetry_cache poetry install --only main --no-root
Build the Runtime Stage:
- Now, set up the final runtime image. This stage will only include the required application code and the virtual environment created in the first stage.
- The final image will use the same Python base image but remain small by avoiding the re-installation of dependencies.
FROM python:3.10.14-slim as runner
WORKDIR /app
COPY src /app/src
COPY --from=builder /app/.venv /app/.venv
Set Up the Path to Use the Virtual Environment:
- Update the `PATH` environment variable to use the Python binaries from the virtual environment.
ENV PATH="/app/.venv/bin:$PATH"
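As a quick sanity check of the PATH change above, a script run inside the container can confirm that the interpreter it is using belongs to a virtual environment. This is a minimal sketch; the function name running_in_virtualenv is an illustrative assumption and not part of the project.

```python
import sys


def running_in_virtualenv() -> bool:
    """Return True when the active interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print(f"interpreter: {sys.executable}")
    print(f"inside virtualenv: {running_in_virtualenv()}")
```

If the PATH entry is picked up as intended, the reported interpreter should live under /app/.venv/bin.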
Set a Default Command:
- Finally, set the command that will be executed by default when the container is run. You can change or override this later in the Docker Compose file.
CMD ["python", "-m", "src.train"]
Final Dockerfile
# Stage 1: Build environment with Poetry and dependencies
FROM python:3.10.14-slim as builder
RUN pip3 install poetry==1.7.1
WORKDIR /app
COPY pytorch_project/pyproject.toml pytorch_project/poetry.lock /app/
ENV POETRY_NO_INTERACTION=1 \
POETRY_VIRTUALENVS_IN_PROJECT=1 \
POETRY_VIRTUALENVS_CREATE=true \
POETRY_CACHE_DIR=/tmp/poetry_cache
RUN --mount=type=cache,target=/tmp/poetry_cache poetry install --only main --no-root
# Stage 2: Runtime environment
FROM python:3.10.14-slim as runner
WORKDIR /app
COPY src /app/src
COPY --from=builder /app/.venv /app/.venv
ENV PATH="/app/.venv/bin:$PATH"
CMD ["python", "-m", "src.train"]
2. Setting Up the docker-compose.yml File
The `docker-compose.yml` file is used to define and run multiple Docker containers as services. In this case, we need two services: one for training and one for inference.
Step-by-Step Process for Creating docker-compose.yml
Define the Version:
- Compose files have historically declared a file-format version. Use version `3.8`, which is widely supported by older docker-compose releases. Note that Docker Compose V2 treats the top-level version key as obsolete and ignores it with a warning.
version: '3.8'
Set Up the `train` Service:
- The `train` service is responsible for running the training script. It builds the Docker image, runs the training command, and uses volumes to store the data, checkpoints, and artifacts.
services:
  train:
    build:
      context: .
    command: python -m src.train
    volumes:
      - data:/app/data
      - checkpoints:/app/checkpoints
      - artifacts:/app/artifacts
    shm_size: '2g' # Increase shared memory to prevent DataLoader issues
    networks:
      - default
    env_file:
      - .env # Load environment variables
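PyTorch DataLoader workers pass batches between processes through shared memory, which is why the shm_size setting above matters. As a rough check, the following sketch reports how much shared memory a container actually has. The function name shm_size_bytes is an illustrative assumption, and the /dev/shm path applies to Linux containers only.

```python
import os
import shutil
from typing import Optional


def shm_size_bytes(path: str = "/dev/shm") -> Optional[int]:
    """Return the total size of the shared-memory mount, or None if it is absent."""
    if not os.path.isdir(path):
        # Not a Linux host or container, or shared memory is not mounted here.
        return None
    return shutil.disk_usage(path).total


if __name__ == "__main__":
    size = shm_size_bytes()
    if size is None:
        print("no /dev/shm mount found")
    else:
        print(f"/dev/shm total: {size / 2**20:.0f} MiB")
```

Running this inside the train container before and after changing shm_size is one way to confirm that the setting took effect.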
Set Up the `inference` Service:
- The `inference` service runs after the training has completed. It waits for a file (e.g., `train_done.flag`) to be created by the training process and then runs the inference script. The polling loop is needed because depends_on only orders container startup; it does not wait for the training command to finish.
  inference:
    build:
      context: .
    command: /bin/bash -c "while [ ! -f /app/checkpoints/train_done.flag ]; do sleep 10; done; python -m src.infer"
    volumes:
      - checkpoints:/app/checkpoints
      - artifacts:/app/artifacts
    shm_size: '2g'
    networks:
      - default
    depends_on:
      - train
    env_file:
      - .env
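The inference service above only starts once the flag file exists, so the training code has to create it as its last step. The following is a minimal sketch of one way to do that. The helper name mark_training_done is hypothetical; only the flag name train_done.flag and the /app/checkpoints location come from the compose file.

```python
import os
import tempfile
from pathlib import Path


def mark_training_done(checkpoint_dir: str, flag_name: str = "train_done.flag") -> Path:
    """Create the completion flag that the inference service polls for."""
    directory = Path(checkpoint_dir)
    directory.mkdir(parents=True, exist_ok=True)
    flag_path = directory / flag_name
    # Write to a temporary file and rename it, so the flag appears in one step
    # and only after its content has been written.
    fd, tmp_name = tempfile.mkstemp(dir=directory, prefix=".flag_")
    with os.fdopen(fd, "w") as handle:
        handle.write("done\n")
    os.replace(tmp_name, flag_path)
    return flag_path


if __name__ == "__main__":
    # In the container this directory would be /app/checkpoints (the shared volume).
    demo_dir = tempfile.mkdtemp()
    print(mark_training_done(demo_dir).name)
```

Because the checkpoints volume persists between runs, a flag left over from an earlier run would let inference start immediately. Deleting the flag at the start of training may be worth considering.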
Define Shared Volumes:
- Volumes allow services to share data. Here, we define three shared volumes:
  - `data`: Stores the input data.
  - `checkpoints`: Stores the model checkpoints and the flag indicating training is complete.
  - `artifacts`: Stores the final model outputs or artifacts.
volumes:
  data:
  checkpoints:
  artifacts:
Set Up Networking:
- Use the default network to allow the services to communicate.
networks:
  default:
Final docker-compose.yml
version: '3.8'
services:
  train:
    build:
      context: .
    command: python -m src.train
    volumes:
      - data:/app/data
      - checkpoints:/app/checkpoints
      - artifacts:/app/artifacts
    shm_size: '2g'
    networks:
      - default
    env_file:
      - .env
  inference:
    build:
      context: .
    command: /bin/bash -c "while [ ! -f /app/checkpoints/train_done.flag ]; do sleep 10; done; python -m src.infer"
    volumes:
      - checkpoints:/app/checkpoints
      - artifacts:/app/artifacts
    shm_size: '2g'
    networks:
      - default
    depends_on:
      - train
    env_file:
      - .env
volumes:
  data:
  checkpoints:
  artifacts:
networks:
  default:
Summary
Dockerfile:
- A multi-stage Dockerfile is used to create a lightweight image where the dependencies are installed with Poetry and the application code is run using a virtual environment.
- It ensures that all dependencies are isolated in a virtual environment, and the final container only includes what is necessary for the runtime.
docker-compose.yml:
- The `docker-compose.yml` file defines two services:
  - train: Runs the training script and stores checkpoints.
  - inference: Waits for the training to finish and runs inference based on the saved model.
- Shared volumes ensure that the services can access data, checkpoints, and artifacts.
- `shm_size` is increased to prevent issues with DataLoader in PyTorch when using multiple workers.
This setup allows for easy management of multiple services using Docker Compose, ensuring reproducibility and simplicity.
References
- https://stackoverflow.com/questions/53835198/integrating-python-poetry-with-docker
- https://github.com/fralik/poetry-with-private-repos/blob/master/Dockerfile
- https://medium.com/@albertazzir/blazing-fast-python-docker-builds-with-poetry-a78a66f5aed0
- https://www.martinrichards.me/post/python_poetry_docker/
- https://gist.github.com/soof-golan/6ebb97a792ccd87816c0bda1e6e8b8c2
First, install DVC (for example, pip install dvc), then initialize it and add a remote using the following commands
dvc init
dvc version
dvc init -f
dvc config core.autostage true
dvc add data
dvc remote add -d myremote /tmp/dvcstore
dvc push
Add some more files to the data directory and run the following commands
dvc add data
dvc push
dvc pull
Next, go back one commit and run the following commands (this assumes the updated data.dvc file was committed to Git after each dvc add)
git checkout HEAD~1
dvc checkout
# you will get one file less
Next, go back to the latest commit and run the following commands
git checkout -
dvc checkout
dvc pull
dvc commit
Next, run the following commands to add Google Drive as a remote
dvc remote add --default gdrive gdrive://1w2e3r4t5y6u7i8o9p0
dvc remote modify gdrive gdrive_acknowledge_abuse true
dvc remote modify gdrive gdrive_client_id <>
dvc remote modify gdrive gdrive_client_secret <>
# does not work when used from VM and port forwarding to local machine
Next, run the following commands to add Azure Blob Storage as a remote
dvc remote remove azblob
dvc remote add --default azblob azure://mycontainer/myfolder
dvc remote modify --local azblob connection_string "<>"
dvc remote modify azblob allow_anonymous_login true
dvc push -r azblob
# this works and requires no explicit login
Next we will add S3 as a remote
dvc remote add --default aws_remote s3://deep-bucket-s3/data
dvc remote modify --local aws_remote access_key_id <>
dvc remote modify --local aws_remote secret_access_key <>
dvc remote modify --local aws_remote region ap-south-1
dvc remote modify aws_remote region ap-south-1
dvc push -r aws_remote -v
# Install hydra
pip install hydra-core hydra_colorlog omegaconf
# Fill the configs folder with the files required by the project
# Run the following command to run the hydra experiment
# for train
python -m src.hydra_test experiment=catdog_experiment ++task_name=train ++train=True ++test=False
# for eval
python -m src.hydra_test experiment=catdog_experiment ++task_name=eval ++train=False ++test=True
# for both
# plain key=value overrides a key that already exists in the config;
# +key=value adds a new key; ++key=value overrides the key, or adds it if it is missing
python -m src.hydra_test experiment=catdog_experiment task_name=train train=True test=True
python -m src.train experiment=catdog_experiment ++task_name=train ++train=True ++test=False
python -m src.train experiment=catdog_experiment ++task_name=eval ++train=False ++test=True
python -m src.infer experiment=catdog_experiment
dvc repro
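The override prefixes used in the commands above follow three rules: a plain key=value changes a key that already exists, +key=value adds a new key, and ++key=value does either. The sketch below mimics those rules for flat keys on a plain dictionary. It is an illustration only and not Hydra's implementation: the function name apply_override is hypothetical, nested keys are not handled, and values stay strings, whereas Hydra parses them into typed values.

```python
def apply_override(config: dict, override: str) -> dict:
    """Apply one Hydra-style override to a copy of a flat config dictionary."""
    key_part, sep, value = override.partition("=")
    if not sep:
        raise ValueError(f"expected key=value, got {override!r}")
    result = dict(config)
    if key_part.startswith("++"):
        # ++key=value: override the key, or add it when it is missing.
        result[key_part[2:]] = value
    elif key_part.startswith("+"):
        # +key=value: add a new key; refuse if it already exists.
        key = key_part[1:]
        if key in result:
            raise KeyError(f"{key!r} already exists; use ++{key}=... to override it")
        result[key] = value
    else:
        # key=value: override only; refuse if the key is missing.
        if key_part not in result:
            raise KeyError(f"{key_part!r} is not in the config; use +{key_part}=... to add it")
        result[key_part] = value
    return result


if __name__ == "__main__":
    base = {"task_name": "train", "train": "True"}
    print(apply_override(base, "task_name=eval"))
    print(apply_override(base, "+test=False"))
    print(apply_override(base, "++train=False"))
```

Using ++ everywhere, as most commands in this README do, avoids having to know whether a key is already defined in the composed config.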
- To run the DVC experiments, keep different experiment_<>.yaml files in the experiment folder under the configs folder
- Make sure to override the default values in the experiment_<>.yaml file for each parameter that you want to change
- Make sure to declare the hparam config file in YAML format in the hparam folder under the configs folder
- Set hparam to null in the train and eval config files
- Run the following command to run the Hydra experiment
python -m src.train --multirun experiment=catdog_experiment_convnext ++task_name=train ++train=True ++test=False hparam=catdog_classifier_covnext
python -m src.create_artifacts
python -m src.train_optuna_callbacks experiment=catdog_experiment ++task_name=train ++train=True ++test=False
python -m src.train_optuna_callbacks experiment=catdog_experiment ++task_name=test ++train=False ++test=True
python -m src.infer experiment=catdog_experiment
docker build -t my-gpu-app .
docker run --gpus all my-gpu-app
docker exec -it <container_id> /bin/bash
# pytorch/pytorch:2.2.2-cuda12.1-cudnn8-runtime supports cuda 12.1 and python 3.10.14
# for Docker Compose, request the GPU with a device reservation similar to the following
services:
  test:
    image: nvidia/cuda:12.3.1-base-ubuntu20.04
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
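The test service above runs nvidia-smi to confirm that the container can see a GPU. The same check can be made from Python before training starts. This is a minimal sketch using only the standard library; the function name gpu_visible is an illustrative assumption. Where PyTorch is installed, torch.cuda.is_available() is the more direct check.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is on PATH and exits successfully."""
    if shutil.which("nvidia-smi") is None:
        # The NVIDIA tools are not exposed to this environment.
        return False
    try:
        completed = subprocess.run(
            ["nvidia-smi", "-L"],  # -L lists the GPUs visible to the driver
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return completed.returncode == 0


if __name__ == "__main__":
    print(f"GPU visible: {gpu_visible()}")
```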