PixelBytes: Unified Multimodal Generation
Welcome to the PixelBytes repository! This project features models designed to generate text and images jointly, pixel by pixel, using a unified embedding. (The weights released here are for testing only.)
Overview
Key Concepts
- Image Transformer: Generates images pixel by pixel.
- Bi-Mamba+: A bidirectional model for time series prediction.
- MambaByte: A token-free selective state-space model that operates directly on bytes.
The PixelBytes model generates mixed sequences of text and images, handling modality transitions with line breaks and keeping image dimensions consistent.
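For intuition, here is a minimal, hedged sketch of pixel-by-pixel autoregressive generation over a unified text/pixel vocabulary. The vocabulary split, the NEWLINE_ID transition token, and the toy LSTM below are illustrative assumptions, not the repository's actual architecture or API.

```python
# Minimal sketch (assumptions): a unified vocabulary covers text bytes and
# palette-quantized pixel values, and an autoregressive model predicts the
# next token from a shared embedding. All names here are illustrative.
import torch
import torch.nn as nn

VOCAB_SIZE = 256 + 64      # hypothetical: 256 text bytes + 64 pixel/palette codes
NEWLINE_ID = ord("\n")     # line breaks mark modality transitions / image row ends

class TinyPixelBytesLM(nn.Module):
    def __init__(self, vocab=VOCAB_SIZE, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)   # unified embedding for text + pixels
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids, state=None):
        x = self.embed(ids)
        x, state = self.rnn(x, state)
        return self.head(x), state

@torch.no_grad()
def generate(model, prompt_ids, max_new=64):
    ids = prompt_ids.clone()
    logits, state = model(ids.unsqueeze(0))
    for _ in range(max_new):
        next_id = logits[0, -1].argmax()                  # greedy decoding for brevity
        ids = torch.cat([ids, next_id.view(1)])
        logits, state = model(next_id.view(1, 1), state)  # feed one token at a time
    return ids

model = TinyPixelBytesLM()
out = generate(model, torch.tensor([NEWLINE_ID]), max_new=16)
print(out.tolist())
```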
Dataset
We use the PixelBytes-Pokemon dataset, available on Hugging Face: PixelBytes-Pokemon. It contains text and image sequences of Pokémon for training our model.
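If you want to pull the dataset programmatically, a minimal example with the `datasets` library follows; the Hub repository ID used here is an assumption about where the dataset is hosted.

```python
# A hedged example of loading the dataset with the `datasets` library.
# The repo ID "ffurfaro/PixelBytes-Pokemon" is an assumption, not confirmed by this README.
from datasets import load_dataset

ds = load_dataset("ffurfaro/PixelBytes-Pokemon")  # assumed Hub repo ID
print(ds)              # inspect available splits and columns
print(ds["train"][0])  # look at one text/image sequence example
```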
Models Trained
- 10 LSTM models: unidirectional and bidirectional, with 1, 2, or 3 layers (including a special config: p_embed with 3x hidden_state and 3x embedding_dim); see the configuration sketch after this list
- 3 Mamba models: bidirectional with 1 or 2 layers, and unidirectional with 2 layers
- 2 Transformer models: 1 or 2 layers
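To make the variants concrete, below is a minimal PyTorch sketch of how such LSTM configurations might be assembled. The argument names and the p_embed scaling mirror the list above but are illustrative assumptions, not the repository's exact constructor signature.

```python
# A minimal sketch, assuming each LSTM variant differs only in direction,
# depth, and hidden/embedding widths. Names below are hypothetical.
import torch.nn as nn

def make_lstm_variant(vocab=320, embedding_dim=64, hidden_state=64,
                      num_layers=2, bidirectional=True, p_embed=False):
    # p_embed: hypothetical flag for the "special config"; here it simply
    # triples the embedding and hidden widths as the list suggests.
    if p_embed:
        embedding_dim, hidden_state = 3 * embedding_dim, 3 * hidden_state
    return nn.ModuleDict({
        "embed": nn.Embedding(vocab, embedding_dim),
        "lstm": nn.LSTM(embedding_dim, hidden_state, num_layers=num_layers,
                        bidirectional=bidirectional, batch_first=True),
        "head": nn.Linear(hidden_state * (2 if bidirectional else 1), vocab),
    })

# Examples matching the list: a 3-layer bidirectional LSTM and the special config.
bi_lstm_3 = make_lstm_variant(num_layers=3, bidirectional=True)
special = make_lstm_variant(num_layers=1, bidirectional=False, p_embed=True)
```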
Citation
Furfaro, F. (2024). PixelBytes: A Unified Multimodal Representation Learning Project. GitHub repository: https://github.com/fabienfrfr/PixelBytes
Thank you for exploring PixelBytes! We hope this model aids your multimodal generation projects.