PixelBytes: Unified Multimodal Generation

Welcome to the PixelBytes repository! This project features models designed to generate text and images simultaneously, pixel by pixel, using a unified embedding. (Test weights only.)

Overview

Key Concepts

  • Image Transformer: Generates images pixel by pixel.
  • Bi-Mamba+: A bidirectional model for time series prediction.
  • MambaByte: A token-free selective state-space model that operates directly on raw bytes.

The PixelBytes model generates mixed sequences of text and images, using line breaks to handle transitions between modalities and to keep image dimensions consistent.
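The sketch below illustrates how such a unified sequence could be built. It is not the project's actual code: the newline transition marker, the palette-index representation of pixels, and the +256 offset that separates pixel tokens from byte values are all illustrative assumptions.

```python
# Illustrative sketch (not the repository's actual tokenizer): interleaving
# text bytes and image pixels into one token stream, with newline bytes
# marking row/modality transitions as described above.
import numpy as np

NEWLINE = 0x0A  # assumed transition marker between text lines and image rows

def build_sequence(text: str, image: np.ndarray) -> list[int]:
    """Flatten text bytes and a palettised image into one token stream.

    `image` is assumed to be a 2-D array of palette indices; pixel tokens
    are offset by 256 so they do not collide with raw byte values.
    """
    seq = list(text.encode("utf-8")) + [NEWLINE]
    for row in image:                       # one image row per "line"
        seq.extend(int(p) + 256 for p in row)
        seq.append(NEWLINE)                 # keeps image width consistent
    return seq

# Example: a 2x3 dummy image with palette indices
tokens = build_sequence("Pikachu", np.array([[1, 2, 3], [4, 5, 6]]))
print(tokens[:12])
```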

Dataset

We use the PixelBytes-Pokemon dataset, available on Hugging Face, which contains text and image sequences of Pokémon for training our model.
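A quick way to inspect the data is with the Hugging Face `datasets` library. This assumes the dataset is published on the Hub as ffurfaro/PixelBytes-Pokemon with a default train split; column names may differ.

```python
# Hedged example: loading the dataset, assuming the Hub id
# "ffurfaro/PixelBytes-Pokemon" and a default "train" split.
from datasets import load_dataset

ds = load_dataset("ffurfaro/PixelBytes-Pokemon", split="train")
print(ds)      # inspect available columns
print(ds[0])   # first text/image sequence example
```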

Models Trained

  • 10 LSTM models: unidirectional and bidirectional, with 1, 2, or 3 layers (including a special configuration: p_embed with 3x hidden state and 3x embedding dimension)
  • 3 Mamba models: bidirectional with 1 or 2 layers, unidirectional with 2 layers
  • 2 Transformer models: 1 or 2 layers (a minimal sketch of one configuration follows this list)
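Below is a minimal PyTorch sketch of one such configuration, a 2-layer bidirectional LSTM over the unified embedding. The vocabulary size, layer widths, and class name are illustrative assumptions, not the repository's exact hyperparameters.

```python
# Minimal sketch of a bidirectional LSTM sequence model over the unified
# token vocabulary; sizes are assumptions, not the trained checkpoints'.
import torch
import torch.nn as nn

class PxByLSTM(nn.Module):
    def __init__(self, vocab_size=512, embed_dim=64, hidden=128,
                 num_layers=2, bidirectional=True):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden, num_layers=num_layers,
                           batch_first=True, bidirectional=bidirectional)
        out_dim = hidden * (2 if bidirectional else 1)
        self.head = nn.Linear(out_dim, vocab_size)    # next-token logits

    def forward(self, tokens):                        # tokens: (batch, seq_len)
        x = self.embed(tokens)
        x, _ = self.rnn(x)
        return self.head(x)                           # (batch, seq_len, vocab)

model = PxByLSTM()
logits = model(torch.randint(0, 512, (1, 32)))
print(logits.shape)  # torch.Size([1, 32, 512])
```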

Citation

Furfaro, F. (2024). PixelBytes: A Unified Multimodal Representation Learning Project. (https://github.com/fabienfrfr/PixelBytes)


Thank you for exploring PixelBytes! We hope this model aids your multimodal generation projects.
