soutrik.chowdhury

POETRY SETUP

# Install poetry
conda create -n poetry_env python=3.10 -y
conda activate poetry_env
pip install poetry
poetry env info
poetry new pytorch_project
cd pytorch_project/
# Fill in the pyproject.toml file, leaving out pytorch and torchvision for now
poetry install

# Add dependencies to the project for pytorch and torchvision
poetry source add --priority explicit pytorch_cpu https://download.pytorch.org/whl/cpu
poetry add --source pytorch_cpu torch torchvision
poetry lock
poetry show
poetry install --no-root

# Add dependencies to the project 
poetry add matplotlib
poetry add hydra-core
poetry add omegaconf
poetry add hydra_colorlog
poetry add --group dev black  # development-only dependency; --dev is deprecated since Poetry 1.2
poetry lock
poetry show

| Type | Purpose | Installation Command |
|---|---|---|
| Normal Dependency | Required for the app to run in production. | poetry add <package> |
| Development Dependency | Needed only during development (e.g., testing, linting). | poetry add --group dev <package> (--dev on Poetry older than 1.2) |

# Add dependencies to the project with specific version
poetry add <package_name>@<version>

MULTISTAGE DOCKER SETUP

Step-by-Step Guide to Creating Dockerfile and docker-compose.yml for a New Code Repo

If you're new to the project and need to set up Docker and Docker Compose to run the training and inference steps, follow these steps.


1. Setting Up the Dockerfile

A Dockerfile is a set of instructions that Docker uses to create an image. In this case, we'll use a multi-stage build to make the final image lightweight while managing dependencies with Poetry.

Step-by-Step Process for Creating the Dockerfile

  1. Choose a Base Image:

    • We need to choose a Python image that matches the project's required version (e.g., Python 3.10.14).
    • Use the lightweight slim version to minimize image size.
    FROM python:3.10.14-slim as builder
    
  2. Install Dependencies in the Build Stage:

    • We'll use Poetry for dependency management. Install it using pip.
    • Next, copy the pyproject.toml and poetry.lock files to the /app directory to install dependencies.
    RUN pip3 install poetry==1.7.1
    WORKDIR /app
    COPY pytorch_project/pyproject.toml pytorch_project/poetry.lock /app/
    
  3. Configure Poetry:

    • Configure Poetry to install the dependencies in a virtual environment inside the project directory (not globally). This keeps everything contained and avoids conflicts with the system environment.
    ENV POETRY_NO_INTERACTION=1 \
        POETRY_VIRTUALENVS_IN_PROJECT=1 \
        POETRY_VIRTUALENVS_CREATE=true \
        POETRY_CACHE_DIR=/tmp/poetry_cache
    
  4. Install Dependencies:

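    • Use poetry install --only main --no-root. The --only main option skips development dependencies, and --no-root installs the dependencies without installing the project package itself, whose source code is not needed at this stage. The cache mount keeps Poetry's download cache out of the image layers.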
    RUN --mount=type=cache,target=/tmp/poetry_cache poetry install --only main --no-root
    
  5. Build the Runtime Stage:

    • Now, set up the final runtime image. This stage will only include the required application code and the virtual environment created in the first stage.
    • The final image uses the same Python base image but stays small because Poetry and its cache are left behind in the builder stage; only the ready-made virtual environment is copied over.
    FROM python:3.10.14-slim as runner
    WORKDIR /app
    COPY src /app/src
    COPY --from=builder /app/.venv /app/.venv
    
  6. Set Up the Path to Use the Virtual Environment:

    • Update the PATH environment variable to use the Python binaries from the virtual environment.
    ENV PATH="/app/.venv/bin:$PATH"
    
  7. Set a Default Command:

    • Finally, set the command that will be executed by default when the container is run. You can change or override this later in the Docker Compose file.
    CMD ["python", "-m", "src.train"]
    

Final Dockerfile

# Stage 1: Build environment with Poetry and dependencies
FROM python:3.10.14-slim as builder
RUN pip3 install poetry==1.7.1
WORKDIR /app
COPY pytorch_project/pyproject.toml pytorch_project/poetry.lock /app/
ENV POETRY_NO_INTERACTION=1 \
    POETRY_VIRTUALENVS_IN_PROJECT=1 \
    POETRY_VIRTUALENVS_CREATE=true \
    POETRY_CACHE_DIR=/tmp/poetry_cache
RUN --mount=type=cache,target=/tmp/poetry_cache poetry install --only main --no-root

# Stage 2: Runtime environment
FROM python:3.10.14-slim as runner
WORKDIR /app
COPY src /app/src
COPY --from=builder /app/.venv /app/.venv
ENV PATH="/app/.venv/bin:$PATH"
CMD ["python", "-m", "src.train"]
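
To check that the runtime stage really runs the interpreter from /app/.venv rather than the system Python, a small script along these lines could be run inside the container. This is a minimal sketch; the helper name in_virtualenv is hypothetical and not part of the project.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("prefix:     ", sys.prefix)
    print("in venv:    ", in_virtualenv())
```

With the PATH set as in the Dockerfile, the interpreter path is expected to start with /app/.venv/bin.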

2. Setting Up the docker-compose.yml File

The docker-compose.yml file is used to define and run multiple Docker containers as services. In this case, we need two services: one for training and one for inference.

Step-by-Step Process for Creating docker-compose.yml

  1. Define the Version:

    • Older Compose files declare a file-format version; 3.8 is widely supported. Docker Compose V2 follows the Compose Specification, treats this top-level key as obsolete, and ignores it with a warning, so it can be omitted there.
    version: '3.8'
    
  2. Set Up the train Service:

    • The train service is responsible for running the training script. It builds the Docker image, runs the training command, and uses volumes to store the data, checkpoints, and artifacts.
    services:
      train:
        build:
          context: .
        command: python -m src.train
        volumes:
          - data:/app/data
          - checkpoints:/app/checkpoints
          - artifacts:/app/artifacts
        shm_size: '2g'  # Increase shared memory to prevent DataLoader issues
        networks:
          - default
        env_file:
          - .env  # Load environment variables
    
  3. Set Up the inference Service:

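    • The inference service should run only after training has completed. Since depends_on only controls start order and does not wait for the train container to finish, the command polls for a flag file (e.g., train_done.flag) that the training process creates, and then runs the inference script.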
      inference:
        build:
          context: .
        command: /bin/bash -c "while [ ! -f /app/checkpoints/train_done.flag ]; do sleep 10; done; python -m src.infer"
        volumes:
          - checkpoints:/app/checkpoints
          - artifacts:/app/artifacts
        shm_size: '2g'
        networks:
          - default
        depends_on:
          - train
        env_file:
          - .env
    
  4. Define Shared Volumes:

    • Volumes allow services to share data. Here, we define three shared volumes:
      • data: Stores the input data.
      • checkpoints: Stores the model checkpoints and the flag indicating training is complete.
      • artifacts: Stores the final model outputs or artifacts.
    volumes:
      data:
      checkpoints:
      artifacts:
    
  5. Set Up Networking:

    • Use the default network to allow the services to communicate.
    networks:
      default:
    

Final docker-compose.yml

version: '3.8'

services:
  train:
    build:
      context: .
    command: python -m src.train
    volumes:
      - data:/app/data
      - checkpoints:/app/checkpoints
      - artifacts:/app/artifacts
    shm_size: '2g'
    networks:
      - default
    env_file:
      - .env

  inference:
    build:
      context: .
    command: /bin/bash -c "while [ ! -f /app/checkpoints/train_done.flag ]; do sleep 10; done; python -m src.infer"
    volumes:
      - checkpoints:/app/checkpoints
      - artifacts:/app/artifacts
    shm_size: '2g'
    networks:
      - default
    depends_on:
      - train
    env_file:
      - .env

volumes:
  data:
  checkpoints:
  artifacts:

networks:
  default:
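
The inference command above polls for /app/checkpoints/train_done.flag, so the training code has to create that file when it finishes. The following is a minimal sketch of both sides of that handshake in Python; the helper names mark_training_done and wait_for_flag are hypothetical, and the actual src.train script may create the flag differently.

```python
import time
from pathlib import Path
from typing import Optional

# Path polled by the inference service in docker-compose.yml.
FLAG_PATH = Path("/app/checkpoints/train_done.flag")


def mark_training_done(flag_path: Path = FLAG_PATH) -> None:
    """Create the flag file (hypothetical helper) once training has finished."""
    flag_path.parent.mkdir(parents=True, exist_ok=True)
    flag_path.touch()


def wait_for_flag(
    flag_path: Path = FLAG_PATH,
    poll_seconds: float = 10.0,
    timeout_seconds: Optional[float] = None,
) -> bool:
    """Poll until the flag file exists; return False if the timeout expires first."""
    deadline = None if timeout_seconds is None else time.monotonic() + timeout_seconds
    while not flag_path.exists():
        if deadline is not None and time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)
    return True
```

Because named volumes persist between runs, a flag left over from an earlier run would let inference start immediately; deleting the flag at the start of training is one way to avoid that.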

Summary

  1. Dockerfile:

    • A multi-stage Dockerfile is used to create a lightweight image where the dependencies are installed with Poetry and the application code is run using a virtual environment.
    • It ensures that all dependencies are isolated in a virtual environment, and the final container only includes what is necessary for the runtime.
  2. docker-compose.yml:

    • The docker-compose.yml file defines two services:
      • train: Runs the training script and stores checkpoints.
      • inference: Waits for the training to finish and runs inference based on the saved model.
    • Shared volumes ensure that the services can access data, checkpoints, and artifacts.
    • shm_size is increased to prevent issues with DataLoader in PyTorch when using multiple workers.

This setup allows for easy management of multiple services using Docker Compose, ensuring reproducibility and simplicity.
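
To confirm that the shm_size setting took effect, the size of /dev/shm can be inspected from inside the container. This is a minimal sketch using only the standard library; the function name shared_memory_bytes is hypothetical.

```python
import os
import shutil
from typing import Optional


def shared_memory_bytes(path: str = "/dev/shm") -> Optional[int]:
    """Return the total size of the shared-memory mount in bytes, or None if absent."""
    if not os.path.isdir(path):
        return None  # e.g. on platforms without /dev/shm
    return shutil.disk_usage(path).total


if __name__ == "__main__":
    size = shared_memory_bytes()
    if size is None:
        print("/dev/shm not found")
    else:
        print(f"/dev/shm size: {size / 2**30:.2f} GiB")
```

With shm_size: '2g' the reported size should be about 2 GiB; without it, Docker's default of 64 MB applies.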

DVC SETUP

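First, install DVC (for example with pip install dvc), then run the following commands to initialize it, track the data directory, and push it to a local remote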

dvc init
dvc version
dvc init -f
dvc config core.autostage true
dvc add data
dvc remote add -d myremote /tmp/dvcstore
dvc push

Add some more files to the data directory and run the following commands

dvc add data
dvc push
dvc pull

Next, go back one commit and run the following commands

git checkout HEAD~1
dvc checkout
# the data directory now contains one file fewer

Next, go back to the latest commit and run the following commands

git checkout -
dvc checkout
dvc pull
dvc commit

Next, run the following commands to add Google Drive as a remote

dvc remote add --default gdrive gdrive://1w2e3r4t5y6u7i8o9p0
dvc remote modify gdrive gdrive_acknowledge_abuse true
dvc remote modify gdrive gdrive_client_id <>
dvc remote modify gdrive gdrive_client_secret <>
# does not work when run on a VM with port forwarding to the local machine

Next, run the following commands to add Azure Blob Storage as a remote

dvc remote remove azblob
dvc remote add --default azblob azure://mycontainer/myfolder
dvc remote modify --local azblob connection_string "<>"
dvc remote modify azblob allow_anonymous_login true
dvc push -r azblob
# this works and requires no explicit login

Next we will add S3 as a remote

dvc remote add --default aws_remote s3://deep-bucket-s3/data
dvc remote modify --local aws_remote access_key_id <>
dvc remote modify --local aws_remote secret_access_key <>
dvc remote modify --local aws_remote region ap-south-1
dvc remote modify aws_remote region ap-south-1
dvc push -r aws_remote -v

HYDRA SETUP

# Install hydra
pip install hydra-core hydra_colorlog omegaconf
# Fill in the configs folder with the files required by the project
# Run the following command to run the hydra experiment
# for train 
python -m src.hydra_test experiment=catdog_experiment ++task_name=train ++train=True ++test=False
# for eval
python -m src.hydra_test experiment=catdog_experiment ++task_name=eval ++train=False ++test=True
# for both
# key=value overrides an existing key, +key=value adds a new key to the config,
# and ++key=value overrides the key if it exists and adds it otherwise
python -m src.hydra_test experiment=catdog_experiment task_name=train train=True test=True
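
The difference between the three override forms can be illustrated with a small pure-Python emulation. This is only a sketch of the rules for a flat config composed with Hydra's default settings, not Hydra's own implementation; apply_override is a hypothetical helper, and unlike Hydra it keeps every value as a string.

```python
def apply_override(cfg: dict, override: str) -> dict:
    """Apply one command-line style override to a flat dict, mimicking Hydra's prefixes."""
    lhs, value = override.split("=", 1)
    if lhs.startswith("++"):
        # ++key=value: override the key if present, add it otherwise.
        key = lhs[2:]
    elif lhs.startswith("+"):
        # +key=value: add a new key; it must not exist yet.
        key = lhs[1:]
        if key in cfg:
            raise KeyError(f"'{key}' already exists; use '{key}=' or '++{key}='")
    else:
        # key=value: override an existing key; it must already exist.
        key = lhs
        if key not in cfg:
            raise KeyError(f"'{key}' is not in the config; use '+{key}=' or '++{key}='")
    # Return a new dict so the input config is left untouched.
    return {**cfg, key: value}


if __name__ == "__main__":
    cfg = {"task_name": "train", "train": "True"}
    cfg = apply_override(cfg, "++task_name=eval")  # overrides the existing key
    cfg = apply_override(cfg, "+seed=42")          # adds a new key
    print(cfg)
```

Using ++ everywhere, as in the train and eval commands above, is the most forgiving choice because it works whether or not the key is already defined in the config.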

LOCAL SETUP

 python -m src.train experiment=catdog_experiment ++task_name=train ++train=True ++test=False
 python -m src.train experiment=catdog_experiment ++task_name=eval ++train=False ++test=True
 python -m src.infer experiment=catdog_experiment

DVC PIPELINE SETUP

dvc repro

DVC Experiments

  • To run the DVC experiments, keep a separate experiment_<>.yaml file for each experiment in the experiment folder inside the configs folder
  • In each experiment_<>.yaml file, override the default value of every parameter that you want to change

HYDRA Experiments

  • Declare the hyperparameter config file in YAML format in the hparam folder inside the configs folder
  • Set hparam to null in the train and eval config files
  • Run the following commands to run the Hydra experiment
 python -m src.train --multirun experiment=catdog_experiment_convnext ++task_name=train ++train=True ++test=False hparam=catdog_classifier_covnext
 python -m src.create_artifacts

Latest Execution Command

python -m src.train_optuna_callbacks experiment=catdog_experiment ++task_name=train ++train=True ++test=False
python -m src.train_optuna_callbacks experiment=catdog_experiment ++task_name=test ++train=False ++test=True
python -m src.infer experiment=catdog_experiment

GPU Setup

docker build -t my-gpu-app .
docker run --gpus all my-gpu-app
docker exec -it <container_id> /bin/bash
# pytorch/pytorch:2.2.2-cuda12.1-cudnn8-runtime supports cuda 12.1 and python 3.10.14
# for Docker Compose, reserve the GPU with a service definition similar to the following
services:
  test:
    image: nvidia/cuda:12.3.1-base-ubuntu20.04
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
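
Once a container has been started with GPU access, a quick check from Python can confirm that PyTorch sees the device. This is a minimal sketch; describe_device is a hypothetical helper, and it falls back gracefully when torch is not installed.

```python
def describe_device() -> str:
    """Return 'cuda', 'cpu', or 'torch-missing' depending on what is available."""
    try:
        import torch
    except ImportError:
        return "torch-missing"
    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    device = describe_device()
    print("device:", device)
    if device == "cuda":
        import torch

        # Name of the first visible GPU.
        print("gpu:", torch.cuda.get_device_name(0))
```

If this reports cpu inside a container started with --gpus all, the installed torch build may be the CPU-only wheel from the pytorch_cpu source configured in the Poetry setup.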