---
license: apache-2.0
language:
- en
library_name: diffusers
tags:
- text-to-image
- prior
- eclipse
- unclip
- kandinskyv2.2
---

# Introduction

<a href="https://colab.research.google.com/drive/1VcqzXZmilntec3AsIyzCqlstEhX4Pa1o?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

λ-ECLIPSE is a lightweight model for multi-concept personalization. It is a tiny text-to-image (T2I) prior designed for the Kandinsky v2.2 diffusion image generator.

λ-ECLIPSE extends the [ECLIPSE-Prior](https://huggingface.co/ECLIPSE-Community/ECLIPSE_KandinskyV22_Prior) by incorporating image-text interleaved data.

λ-ECLIPSE shows that training personalized T2I (P-T2I) models does not require extensive resources. For instance, λ-ECLIPSE is trained in a mere 74 GPU hours (A100), compared to its counterparts BLIP-Diffusion (2304 GPU hours) and Kosmos-G (12300 GPU hours).
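
To put these figures in perspective, the short Python sketch below computes how many times more GPU hours each counterpart used relative to λ-ECLIPSE. It uses only the numbers quoted above; the names `GPU_HOURS` and `relative_cost` are illustrative, and the ratios ignore any differences in hardware or training setup between the methods.

```python
# GPU hours as quoted above (the λ-ECLIPSE figure is in A100 hours).
GPU_HOURS = {
    "lambda-ECLIPSE": 74,
    "BLIP-Diffusion": 2304,
    "Kosmos-G": 12300,
}


def relative_cost(baseline: str = "lambda-ECLIPSE") -> dict:
    """Return each model's GPU hours as a multiple of the baseline's."""
    base = GPU_HOURS[baseline]
    return {name: hours / base for name, hours in GPU_HOURS.items()}


for name, ratio in relative_cost().items():
    print(f"{name}: {ratio:.1f}x")
```

By this measure, BLIP-Diffusion used roughly 31 times and Kosmos-G roughly 166 times the GPU hours of λ-ECLIPSE.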

- **Project Page:** [https://eclipse-t2i.github.io/Lambda-ECLIPSE/](https://eclipse-t2i.github.io/Lambda-ECLIPSE/)
- **GitHub:** [https://github.com/Maitreyapatel/lambda-eclipse-inference](https://github.com/Maitreyapatel/lambda-eclipse-inference)
- **Paper (arXiv):** [https://arxiv.org/abs/2402.05195](https://arxiv.org/abs/2402.05195)

Importantly, λ-ECLIPSE works purely in the CLIP latent space without any additional information. Hence, its performance can easily be improved via test-time adaptation, increasing concept alignment while retaining solid composition alignment.

![Qualitative example](./overview.png)

More examples at: [Gallery](https://eclipse-t2i.github.io/Lambda-ECLIPSE/gallery.html)

## Installation

```bash
git clone https://github.com/eclipse-t2i/lambda-eclipse-inference.git
cd lambda-eclipse-inference
conda create -p ./venv python=3.9
conda activate ./venv
pip install -r requirements.txt
```
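
After installing, it can help to confirm that the packages imported by the inference example on this page are visible to the active environment. The snippet below is a minimal sketch using only the standard library. `missing_packages` is a hypothetical helper, not part of the repository, and the package list is taken from the imports in the inference example rather than from `requirements.txt`, whose exact contents are not assumed here.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level packages imported by the inference example on this page.
required = ["torch", "transformers", "diffusers"]

missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages were found.")
```

If anything is reported as missing, the most likely cause is that `pip install` ran outside the activated environment.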

## Run Inference

<a href="https://colab.research.google.com/drive/1VcqzXZmilntec3AsIyzCqlstEhX4Pa1o?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

```python
import torch
from transformers import (
    CLIPTextModelWithProjection,
    CLIPTokenizer,
)
from src.pipelines.pipeline_kandinsky_subject_prior import KandinskyPriorPipeline
from src.priors.lambda_prior_transformer import PriorTransformer
from diffusers import DiffusionPipeline

# CLIP text encoder and tokenizer used by the prior.
text_encoder = CLIPTextModelWithProjection.from_pretrained(
    "laion/CLIP-ViT-bigG-14-laion2B-39B-b160k",
    projection_dim=1280,
    torch_dtype=torch.float32,
)
tokenizer = CLIPTokenizer.from_pretrained("laion/CLIP-ViT-bigG-14-laion2B-39B-b160k")

# Load the λ-ECLIPSE prior and plug it into the Kandinsky v2.2 prior pipeline.
prior = PriorTransformer.from_pretrained("ECLIPSE-Community/Lambda-ECLIPSE-Prior-v1.0")
pipe_prior = KandinskyPriorPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-prior",
    prior=prior,
    text_encoder=text_encoder,
    tokenizer=tokenizer,
).to("cuda")

# Kandinsky v2.2 decoder that turns image embeddings into images.
pipe = DiffusionPipeline.from_pretrained(
    "kandinsky-community/kandinsky-2-2-decoder"
).to("cuda")

# Placeholder inputs (hypothetical values): replace them with your own prompt,
# subject image paths, and the keyword that names each subject.
prompt = "a cat wearing glasses at a park"
subject1_path, subject1_name = "./cat.png", "cat"
subject2_path, subject2_name = "./glasses.png", "glasses"

raw_data = {
    "prompt": prompt,
    "subject_images": [subject1_path, subject2_path],
    "subject_keywords": [subject1_name, subject2_name],
}

# The prior maps the interleaved prompt and subject images to image embeddings.
image_emb, negative_image_emb = pipe_prior(
    raw_data=raw_data,
).to_tuple()

# The decoder generates the final image(s) from those embeddings.
image = pipe(
    image_embeds=image_emb,
    negative_image_embeds=negative_image_emb,
    num_inference_steps=50,
    guidance_scale=7.5,
).images

# Displays the result in a notebook; in a script, call image[0].save(...) instead.
image[0]
```
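
The `raw_data` dictionary appears to pair each subject image with the keyword that names it, so the two lists should have the same length and order. As a minimal sketch, the hypothetical helper below (`build_raw_data` is not part of the repository) checks this before the pipeline is called. The cap of four subjects mirrors the limit stated in the notes on this page, and the file-existence check is an added convenience; neither is something the pipeline is known to enforce in this way.

```python
import os

MAX_SUBJECTS = 4  # assumption: mirrors the "up to four concepts" note on this page


def build_raw_data(prompt, subject_images, subject_keywords, check_files=True):
    """Validate the inputs and return a dict with the keys used in the inference example."""
    if not prompt or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    if len(subject_images) != len(subject_keywords):
        raise ValueError("each subject image needs exactly one keyword")
    if not 1 <= len(subject_images) <= MAX_SUBJECTS:
        raise ValueError(f"expected between 1 and {MAX_SUBJECTS} subjects")
    if check_files:
        # Fail early with a clear message instead of deep inside the pipeline.
        missing = [path for path in subject_images if not os.path.isfile(path)]
        if missing:
            raise FileNotFoundError(f"subject image(s) not found: {missing}")
    return {
        "prompt": prompt,
        "subject_images": list(subject_images),
        "subject_keywords": list(subject_keywords),
    }


# Example with placeholder paths; the disk check is skipped for illustration.
raw_data = build_raw_data(
    "a cat wearing glasses at a park",
    ["./cat.png", "./glasses.png"],
    ["cat", "glasses"],
    check_files=False,
)
```

The returned dictionary can then be passed as `raw_data=raw_data` to the prior pipeline, as in the example above.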

## Important Notes (and Limitations)

- λ-ECLIPSE is trained to support up to four unique concepts. However, this version is trained on biased datasets that focus heavily on one and two subjects. Therefore, it may not perform as expected as the number of subjects increases.
- As this model is trained specifically for P-T2I, it might not perform well on traditional T2I tasks.
- λ-ECLIPSE achieves SOTA composition alignment while maintaining concept alignment. However, there is still a big gap compared to finetuning-based methods.