|
--- |
|
datasets: |
|
- common-canvas/commoncatalog-cc-by |
|
- madebyollin/megalith-10m |
|
- madebyollin/soa-full |
|
- alfredplpl/artbench-pd-256x256 |
|
language: |
|
- ja |
|
- en |
|
library_name: diffusers |
|
license: apache-2.0 |
|
pipeline_tag: text-to-image |
|
tags: |
|
- art |
|
--- |
|
|
|
# Model Card for CommonArt β |
|
|
|
![eyecatch](eyecatch.jpg) |
|
|
|
This is a text-to-image model trained on CC-BY-4.0, CC-0, and CC-0-like images.
|
|
|
## Updates |
|
- 2024/09/21: Updated the model. (30,000 L4 GPU hours)

- 2024/09/09: Released this model. (20,000 L4 GPU hours)
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
At AI Picasso, we develop AI technology through active dialogue with creators, aiming for mutual understanding and cooperation.

We strive to solve the challenges creators face and to grow together with them.

One such challenge is that some creators and fans who want to use image generation cannot, likely because permission was never obtained to use certain images for training.

To address this issue, we have developed CommonArt β. As it is still in beta, its capabilities are limited.

However, its architecture is expected to be the same as that of the final version.
|
|
|
#### Features of CommonArt β |
|
|
|
- Trained principally on images for which permission to train has been obtained
|
- Understands both Japanese and English text inputs directly |
|
- Minimizes the risk of exact reproduction of training images |
|
- Utilizes cutting-edge technology for high quality and efficiency |
|
|
|
### Misc. |
|
|
|
- **Developed by:** alfredplpl |
|
- **Funded by:** AI Picasso, Inc. |
|
- **Shared by:** AI Picasso, Inc. |
|
- **Model type:** Diffusion Transformer-based architecture
|
- **Language(s) (NLP):** Japanese, English |
|
- **License:** Apache-2.0 |
|
|
|
### Model Sources |
|
|
|
- **Repository:** [GitHub](https://github.com/PixArt-alpha/PixArt-sigma)

- **Paper:** [PIXART-δ](https://arxiv.org/abs/2401.05252)
|
|
|
## How to Get Started with the Model |
|
|
|
- diffusers, for GPUs with 16 GB+ of VRAM
|
|
|
1. Install libraries. |
|
|
|
```bash |
|
pip install transformers diffusers accelerate
|
``` |
|
|
|
2. Run the following script.
|
|
|
```python |
|
import torch |
|
from diffusers import Transformer2DModel, PixArtSigmaPipeline, AutoencoderKL, DPMSolverMultistepScheduler |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
|
# Prompts |
|
prompt = "カラフルなお花畑。赤、青、黄、紫、ピンクなどの色とりどりの花に溢れている。" |
|
neg_prompt = ""
|
|
|
# Settings |
|
device = "cuda" |
|
weight_dtype = torch.float32 |
|
weight_dtype_te = torch.bfloat16 |
|
generator = torch.Generator().manual_seed(44) |
|
|
|
# Load text encoder |
|
tokenizer = AutoTokenizer.from_pretrained("cyberagent/calm2-7b") |
|
text_encoder = AutoModelForCausalLM.from_pretrained( |
|
"cyberagent/calm2-7b", |
|
torch_dtype=weight_dtype_te, |
|
device_map=device |
|
) |
|
|
|
# Get text embeddings |
|
with torch.no_grad(): |
|
pos_ids = tokenizer( |
|
prompt, max_length=512, padding="max_length", truncation=True, return_tensors="pt", |
|
).to(device) |
|
pos_emb = text_encoder(pos_ids.input_ids, output_hidden_states=True, attention_mask=pos_ids.attention_mask) |
|
pos_emb = pos_emb.hidden_states[-1] |
|
neg_ids = tokenizer( |
|
neg_prompt, max_length=512, padding="max_length", truncation=True, return_tensors="pt", |
|
).to(device) |
|
neg_emb = text_encoder(neg_ids.input_ids, output_hidden_states=True, attention_mask=neg_ids.attention_mask) |
|
neg_emb = neg_emb.hidden_states[-1] |
|
|
|
# Important: free the text encoder to reclaim VRAM before loading the transformer

del text_encoder
|
|
|
# Load the transformer, VAE, and scheduler
|
transformer = Transformer2DModel.from_pretrained( |
|
"aipicasso/commonart-beta", |
|
torch_dtype=weight_dtype |
|
) |
|
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=weight_dtype) |
|
scheduler = DPMSolverMultistepScheduler()
|
|
|
pipe = PixArtSigmaPipeline( |
|
vae=vae, |
|
tokenizer=None, |
|
text_encoder=None, |
|
transformer=transformer, |
|
scheduler=scheduler |
|
) |
|
|
|
pipe.to(device) |
|
|
|
# Generate Image |
|
with torch.no_grad(): |
|
image = pipe( |
|
negative_prompt=None, |
|
prompt_embeds=pos_emb, |
|
negative_prompt_embeds=neg_emb, |
|
prompt_attention_mask=pos_ids.attention_mask, |
|
negative_prompt_attention_mask=neg_ids.attention_mask, |
|
max_sequence_length=512, |
|
width=512, |
|
height=512, |
|
num_inference_steps=20, |
|
generator=generator, |
|
guidance_scale=4.5).images[0] |
|
image.save("flowers.png")

```
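
Because the model understands both Japanese and English directly, the same script also works with an English prompt. For example, the `prompt` line above could be replaced with an English rendering of the same scene:

```python
# English equivalent of the Japanese prompt above
prompt = "A colorful flower field, full of red, blue, yellow, purple, and pink flowers."
```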
|
|
|
- diffusers, for GPUs with 8 GB of VRAM
|
1. Install libraries. |
|
|
|
```bash |
|
pip install transformers diffusers accelerate quanto
|
``` |
|
|
|
2. Run the following script.
|
|
|
```python |
|
import torch |
|
from diffusers import Transformer2DModel, PixArtSigmaPipeline, AutoencoderKL, DPMSolverMultistepScheduler |
|
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig |
|
|
|
# Prompts |
|
prompt = "カラフルなお花畑。赤、青、黄、紫、ピンクなどの色とりどりの花に溢れている。" |
|
neg_prompt = ""
|
|
|
# Settings |
|
device = "cuda" |
|
weight_dtype = torch.bfloat16 |
|
weight_dtype_te = torch.bfloat16 |
|
generator = torch.Generator().manual_seed(44) |
|
|
|
# Load text encoder |
|
tokenizer = AutoTokenizer.from_pretrained("cyberagent/calm2-7b") |
|
quantization_config = QuantoConfig(weights="int8") |
|
text_encoder = AutoModelForCausalLM.from_pretrained( |
|
"cyberagent/calm2-7b", |
|
quantization_config=quantization_config, |
|
torch_dtype=weight_dtype_te, |
|
device_map=device |
|
) |
|
|
|
# Get text embeddings |
|
with torch.no_grad(): |
|
pos_ids = tokenizer( |
|
prompt, max_length=512, padding="max_length", truncation=True, return_tensors="pt", |
|
).to(device) |
|
pos_emb = text_encoder(pos_ids.input_ids, output_hidden_states=True, attention_mask=pos_ids.attention_mask) |
|
pos_emb = pos_emb.hidden_states[-1] |
|
neg_ids = tokenizer( |
|
neg_prompt, max_length=512, padding="max_length", truncation=True, return_tensors="pt", |
|
).to(device) |
|
neg_emb = text_encoder(neg_ids.input_ids, output_hidden_states=True, attention_mask=neg_ids.attention_mask) |
|
neg_emb = neg_emb.hidden_states[-1] |
|
|
|
# Important: free the text encoder to reclaim VRAM before loading the transformer

del text_encoder
|
|
|
# Load the transformer, VAE, and scheduler
|
transformer = Transformer2DModel.from_pretrained( |
|
"aipicasso/commonart-beta", |
|
torch_dtype=weight_dtype |
|
) |
|
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=weight_dtype) |
|
scheduler = DPMSolverMultistepScheduler()
|
|
|
pipe = PixArtSigmaPipeline( |
|
vae=vae, |
|
tokenizer=None, |
|
text_encoder=None, |
|
transformer=transformer, |
|
scheduler=scheduler |
|
) |
|
|
|
pipe.to(device) |
|
|
|
# Generate Image |
|
with torch.no_grad(): |
|
image = pipe( |
|
negative_prompt=None, |
|
prompt_embeds=pos_emb, |
|
negative_prompt_embeds=neg_emb, |
|
prompt_attention_mask=pos_ids.attention_mask, |
|
negative_prompt_attention_mask=neg_ids.attention_mask, |
|
max_sequence_length=512, |
|
width=512, |
|
height=512, |
|
num_inference_steps=20, |
|
generator=generator, |
|
guidance_scale=4.5).images[0] |
|
image.save("flowers.png") |
|
``` |
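
If 8 GB is still not enough, diffusers can offload idle submodules to the CPU. This is a minimal sketch, assuming `accelerate` is installed; use it in place of the `pipe.to(device)` call in the script above:

```python
# Instead of pipe.to(device): keep each submodule on the GPU only while
# it runs, trading some speed for lower peak VRAM (requires accelerate).
pipe.enable_model_cpu_offload()
```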
|
|
|
## Uses |
|
|
|
### Direct Use |
|
|
|
- Assistance in creating illustrations, manga, and anime |
|
- For both commercial and non-commercial purposes |
|
- Communication with creators when making requests |
|
- Commercial provision of image generation services |
|
- Please be cautious when handling generated content |
|
- Self-expression |
|
- Using this AI to express "your" uniqueness |
|
- Research and development |
|
- Fine-tuning (also known as additional training) such as LoRA |
|
- Merging with other models |
|
- Examining the performance of this model using metrics like FID (see the sketch after this list)
|
- Education |
|
- Graduation projects for art school or vocational school students |
|
- University students' graduation theses or project assignments |
|
- Teachers demonstrating the current state of image generation AI |
|
- Uses described in the Hugging Face Community |
|
- Please ask questions in Japanese or English |
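
For the FID example above, here is a minimal sketch using `torchmetrics` (an assumption; this card does not prescribe an evaluation library, and the torchmetrics FID backend additionally requires `torch-fidelity`):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID compares Inception feature statistics of real vs. generated images.
fid = FrechetInceptionDistance(feature=2048)

# Placeholders: in practice, load real reference images and images
# generated by this model as uint8 tensors of shape (N, 3, H, W).
real_images = torch.randint(0, 256, (16, 3, 512, 512), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (16, 3, 512, 512), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(float(fid.compute()))  # lower is better
```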
|
|
|
### Out-of-Scope Use |
|
|
|
- Generating misinformation such as deepfakes.
|
|
|
## Bias, Risks, and Limitations |
|
|
|
See the Yahoo Flickr Creative Commons 100M dataset for more information. Its data was collected circa 2014 and is known to be biased toward internet-connected Western countries; some areas, such as the Global South, are underrepresented.
|
|
|
## Training Details |
|
|
|
### Training Data |
|
We used the following datasets to train the diffusion transformer (a loading sketch follows the list):
|
|
|
- [CommonCatalog-cc-by](https://huggingface.co/datasets/common-canvas/commoncatalog-cc-by) |
|
- [Megalith-10M](https://huggingface.co/datasets/madebyollin/megalith-10m) |
|
- [Smithsonian Open Access](https://huggingface.co/datasets/madebyollin/soa-full)
|
- [ArtBench (CC-0 only)](https://huggingface.co/datasets/alfredplpl/artbench-pd-256x256)
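
Each of these is hosted on the Hugging Face Hub, so a few records can be streamed for inspection with the `datasets` library. A minimal sketch, assuming a `train` split (split and field names vary per dataset):

```python
from datasets import load_dataset

# Stream a few records from one of the training datasets
# without downloading it in full.
ds = load_dataset("madebyollin/megalith-10m", split="train", streaming=True)

for i, record in enumerate(ds):
    print(record)  # inspect the available fields
    if i >= 2:
        break
```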
|
|
|
|
|
## Environmental Impact |
|
|
|
- **Hardware Type:** NVIDIA L4 |
|
- **Hours used:** 30,000
|
- **Cloud Provider:** Google Cloud |
|
- **Compute Region:** Japan |
|
- **Carbon Emitted:** none (carbon-free)
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture and Objective |
|
|
|
[PixArt-Σ-based architecture](https://github.com/PixArt-alpha/PixArt-sigma)
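
The transformer's hyperparameters (layer count, attention heads, hidden size, and so on) can be inspected by loading its configuration with diffusers, as in this quick sketch:

```python
from diffusers import Transformer2DModel

# Download the denoising transformer from the Hub and print its config.
transformer = Transformer2DModel.from_pretrained("aipicasso/commonart-beta")
print(transformer.config)
```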
|
|
|
### Compute Infrastructure |
|
|
|
Google Cloud (Tokyo Region). |
|
|
|
#### Hardware |
|
|
|
We used 4 nodes of NVIDIA L4 x8 instances (32 L4 GPUs in total).
|
|
|
#### Software |
|
|
|
[PixArt-Σ-based code](https://github.com/PixArt-alpha/PixArt-sigma)
|
|
|
## Model Card Contact |
|
|
|
- support@aipicasso.app |
|
|
|
## Acknowledgement
|
We appreciate the image providers.

We are **standing on the shoulders of giants**.
|
|