metadata

license: openrail++
tags:
  - stable-diffusion
  - text-to-image
  - core-ml

Stable Diffusion v2-1-base Model Card

This model was generated by Hugging Face using Apple’s repository, which is licensed under ASCL. This version contains 2-bit linearly quantized Core ML weights for iOS 17 or macOS 14. To use weights without quantization, please visit this model instead.
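To make "linearly quantized" concrete, here is a minimal pure-Python sketch of per-tensor linear (affine) quantization. It is an illustration only, not the conversion pipeline used for these weights: the function names `linear_quantize` and `dequantize` are hypothetical, and real Core ML tooling may quantize per channel or symmetrically.

```python
def linear_quantize(weights, bits=2):
    """Map floats to integer codes in [0, 2**bits - 1] with one scale and offset."""
    levels = 2 ** bits - 1                      # 3 steps, i.e. 4 codes, for 2 bits
    w_min, w_max = min(weights), max(weights)
    # Guard against a constant tensor, where the range would be zero.
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize(codes, scale, w_min):
    """Recover approximate floats from the integer codes."""
    return [c * scale + w_min for c in codes]


# Toy values, chosen only for illustration.
weights = [-0.9, -0.4, 0.1, 0.5, 0.9]
codes, scale, w_min = linear_quantize(weights, bits=2)
restored = dequantize(codes, scale, w_min)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

With 2 bits every weight is stored as one of four codes, so the reconstruction error per weight is bounded by half the step size (`scale / 2`). That coarse grid is the trade-off for the smaller file size.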

This model card focuses on the model associated with the Stable Diffusion v2-1-base model.

This stable-diffusion-2-1-base model fine-tunes stable-diffusion-2-base (512-base-ema.ckpt) for 220k extra steps with punsafe=0.98 on the same dataset.

The weights here have been converted to Core ML for use on Apple Silicon hardware.

There are 4 variants of the Core ML weights:

coreml-stable-diffusion-2-1-base
├── original
│   ├── compiled              # Swift inference, "original" attention
│   └── packages              # Python inference, "original" attention
└── split_einsum
    ├── compiled              # Swift inference, "split_einsum" attention
    └── packages              # Python inference, "split_einsum" attention
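The folder layout above can be turned into a small helper that fetches only the variant you need. This is a sketch under assumptions: the helper names `variant_subfolder` and `download_variant` are hypothetical, and the repository id is left as a parameter because this card does not state it. The download uses `snapshot_download` from the `huggingface_hub` package with its `allow_patterns` filter.

```python
from pathlib import PurePosixPath

ATTENTION_VARIANTS = ("original", "split_einsum")
# "compiled" folders target Swift inference, "packages" target Python inference.
RUNTIME_FOLDERS = {"swift": "compiled", "python": "packages"}


def variant_subfolder(attention, runtime):
    """Return the repository subfolder for an attention/runtime combination."""
    if attention not in ATTENTION_VARIANTS:
        raise ValueError(f"unknown attention variant: {attention!r}")
    if runtime not in RUNTIME_FOLDERS:
        raise ValueError(f"unknown runtime: {runtime!r}")
    return str(PurePosixPath(attention) / RUNTIME_FOLDERS[runtime])


def download_variant(repo_id, attention, runtime, local_dir="./models"):
    """Download a single variant folder from the Hub and return the local path."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    pattern = f"{variant_subfolder(attention, runtime)}/*"
    return snapshot_download(repo_id, allow_patterns=[pattern], local_dir=local_dir)
```

For example, `download_variant(repo_id, "split_einsum", "swift")` would fetch only the files under `split_einsum/compiled`, where `repo_id` is the Hub id of this repository.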

There are also two zip archives suitable for use in the Hugging Face demo app and other third-party tools:

coreml-stable-diffusion-2-1-base-palettized_original_compiled.zip contains the compiled, 6-bit model with ORIGINAL attention implementation.
coreml-stable-diffusion-2-1-base-palettized_split_einsum_v2_compiled.zip contains the compiled, 6-bit model with SPLIT_EINSUM_V2 attention implementation.

Please refer to https://huggingface.co/blog/diffusers-coreml for details.

Use it with 🧨 diffusers
Use it with the stablediffusion repository: download the v2-1_512-ema-pruned.ckpt here.

Model Details

Developed by: Robin Rombach, Patrick Esser
Model type: Diffusion-based text-to-image generation model
Language(s): English
License: CreativeML Open RAIL++-M License
Model Description: This is a model that can be used to generate and modify images based on text prompts. It is a Latent Diffusion Model that uses a fixed, pretrained text encoder (OpenCLIP-ViT/H).
Resources for more information: GitHub Repository.

Cite as:

@InProceedings{Rombach_2022_CVPR,
    author    = {Rombach, Robin and Blattmann, Andreas and Lorenz, Dominik and Esser, Patrick and Ommer, Bj\"orn},
    title     = {High-Resolution Image Synthesis With Latent Diffusion Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {10684-10695}
}

This model was quantized by Vishnou Vinayagame and adapted from the original by Pedro Cuenca, itself adapted from the work of Robin Rombach, Patrick Esser and David Ha. This model card was adapted by Pedro Cuenca from the original written by Robin Rombach, Patrick Esser and David Ha, and is based on the Stable Diffusion v1 and DALL-E Mini model cards.