---
license: apache-2.0
pipeline_tag: image-to-text
---

# Moonline

Moonline is a fork of [moondream2](https://huggingface.co/vikhyatk/moondream2). It combines moondream2's image-to-text generation with a modified version of outlines, so that the generated text conforms to a given pydantic model.

## Model Details

The weights and the model structure come directly from moondream2. The difference is that the Phi text model is swapped for a modified Phi model that generates text according to a given structure. Since the outlines API doesn't operate directly on embeddings, only the relevant parts are copy-pasted and adapted.
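To illustrate the idea, here is a simplified sketch of FSM-constrained decoding, not the actual moonline or outlines internals: at every decoding step, the logits are masked so that only tokens the finite-state machine currently allows can be sampled.

```python
# Simplified sketch of FSM-constrained decoding (not the real moonline/outlines
# code). The FSM here is a toy mapping: state -> {allowed_token_id: next_state}.
import math
import random


def constrained_sample(logits: list[float], allowed: set[int]) -> int:
    """Sample a token id, but only among the ids the FSM currently allows."""
    masked = [l if i in allowed else -math.inf for i, l in enumerate(logits)]
    # Softmax over the masked logits; disallowed tokens get probability 0.
    m = max(masked)
    exps = [math.exp(l - m) for l in masked]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy FSM that only accepts the token sequences [2, 0] or [2, 1].
fsm = {0: {2: 1}, 1: {0: 2, 1: 2}}

state, out = 0, []
while state in fsm and fsm[state]:
    logits = [0.3, 1.2, 0.8]  # stand-in for the language model's output
    token = constrained_sample(logits, allowed=set(fsm[state].keys()))
    out.append(token)
    state = fsm[state][token]
print(out)  # e.g. [2, 1]
```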

## How to use

The best way to start is by cloning the repo and running `example.py`. Make sure to set up a virtual environment and install the dependencies from `requirements.txt`.
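For reference, a typical setup might look like this (the clone URL below is an assumption; substitute the actual URL of this repository):

```bash
# Assumed clone URL -- replace with this repository's actual URL.
git clone https://huggingface.co/erikkaum/moonline
cd moonline
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python example.py
```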

`example.py` walks through a simple example of generating a description and a mood for the example farm image:

```python
from enum import Enum

from PIL import Image
from pydantic import BaseModel
from transformers import AutoTokenizer

from moonline import Moonline


def main():
    # The pydantic model that the generated text must conform to.
    class Mood(Enum):
        sad = "sad"
        happy = "happy"
        angry = "angry"
        neutral = "neutral"

    class ExampleModel(BaseModel):
        description: str
        mood: Mood

    prompt = f"""
    Your job is to describe the image.
    Please answer in json with the following format: {ExampleModel.__annotations__}
    """

    image_path = "example.png"

    model_id = "vikhyatk/moondream2"
    revision = "2024-04-02"
    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
    moonline = Moonline.from_pretrained(
        model_id,
        revision=revision,
    )
    moonline.eval()

    # Encode the image and build the FSM that constrains generation to the schema.
    image = Image.open(image_path)
    image_embeds = moonline.encode_image(image)
    fsm = moonline.generate_fsm(ExampleModel, tokenizer)

    answer = moonline.answer_question(image_embeds, prompt, tokenizer, fsm)
    print(f"answer: {answer}")


if __name__ == "__main__":
    main()
```

The result is something like this:

```json
{
  "description": "A cartoon house is shown sitting on a dirt road with a long gravel path. Plants and trees surround the house. In the distance, there is a canal or pond with ducks swimming about. The scene is full of greenery, and flowers bloom among the vegetation. The sky is a clear blue, and a lush, verdant landscape can be spotted in the background. There is a pathway leading towards the house.",
  "mood": "happy"
}
```
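Since generation is constrained to the schema, the answer can be parsed straight back into the pydantic model. A minimal sketch, assuming pydantic v2 (on v1, use `ExampleModel.parse_raw` instead):

```python
# `answer` is the JSON string returned by moonline.answer_question above.
result = ExampleModel.model_validate_json(answer)
print(result.mood)  # Mood.happy
```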

## Limitations

The model hallucinates, especially when the schema contains a field that has no counterpart in the image. This can be alleviated by making fields optional (allowing a `None` value), as sketched below, or by adding guidance in the prompt, but in my experience this doesn't fully solve the issue.
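For example, a schema can give the model an explicit `None` escape hatch (a sketch; the `license_plate` field is purely illustrative):

```python
from typing import Optional

from pydantic import BaseModel


class ExampleModel(BaseModel):
    description: str
    # Optional with a default of None lets the model emit null instead of
    # inventing a value when nothing in the image matches the field.
    license_plate: Optional[str] = None
```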

Moondream is also not specifically trained to produce JSON output. I expect results would improve by fine-tuning on JSON descriptions of images, especially on examples where some fields don't apply to the image.