LanguageBind committed f25d17e (parent: 9e40191)

Update README.md

README.md CHANGED
@@ -2,6 +2,9 @@
 license: mit
 ---

+# Please note that the weights for v1.2.0 29×720p and 93×480p were trained on Panda70M and have not undergone final high-quality data fine-tuning, so they may produce watermarks.
+
+# We fine-tuned 3.5k steps from 93×720p to get 93×480p for community research use.


 <h1 align="left"> <a href="https://github.com/PKU-YuanGroup/Open-Sora-Plan">Open-Sora Plan</a></h1>
@@ -170,14 +173,19 @@ Similar to previous work, we use a multi-stage training approach. With the 3D Di

 The video model is initialized with weights from a 480p image model. We first train 480p videos with 29 frames. Next, we adapt the weights to 720p resolution, training on approximately 7 million samples from Panda70M, filtered for aesthetic quality and motion. Finally, we refine the model with a higher-quality (HQ) subset of 1 million samples for fine-tuning 93-frame 720p videos. Below is our training card.

+
+
 | Name | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5 |
-
+|---|---|---|---|---|---|
 | Training Video Size | 1×320×240 | 1×640×480 | 29×640×480 | 29×1280×720 | 93×1280×720 |
 | Training Step | 146k | 200k | 30k | 21k | 3k |
 | Compute (#Num × #Hours) | 32 Ascend × 81 | 32 Ascend × 142 | 128 Ascend × 38 | 256 H100 × 64 | 256 H100 × 84 |
-| Checkpoint | - | - | - | - | [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0) |
+| Checkpoint | - | - | - | [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/29x720p) | [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/93x720p) |
 | Log | - | - | [wandb](https://api.wandb.ai/links/1471742727-Huawei/trdu2kba) | [wandb](https://api.wandb.ai/links/linbin/vvxvcd7s) | [wandb](https://api.wandb.ai/links/linbin/easg3qkl) |
-| Training Data | 10M SAM | 5M internal image data |
+| Training Data | [10M SAM](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/blob/main/anno_json/sam_image_11185255_resolution.json) | 5M internal image data | [6M HQ Panda70M](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/blob/main/anno_json/Panda70M_HQ6M.json) | [6M HQ Panda70M](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/blob/main/anno_json/Panda70M_HQ6M.json) | [1M HQ Panda70M](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/blob/main/anno_json/Panda70M_HQ1M.json) and [100k HQ data](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/anno_json) (collected in v1.1.0) |
+
+Additionally, we fine-tuned 3.5k steps from the final 93×720p to get [93×480p](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/93x480p) for community research use.
+

 ### Training Image-to-Video Diffusion Model

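The training card in this diff describes a staged recipe: each stage resumes from the previous stage's weights while the clip size, step count, and compute budget change. A minimal Python sketch of that schedule is below; the `Stage` dataclass, the `plan` helper, and the checkpoint naming are illustrative assumptions, not part of the Open-Sora-Plan codebase.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    frames: int   # temporal length of each training clip
    width: int
    height: int
    steps: int    # optimizer steps
    devices: int  # accelerators used (#Num)
    hours: int    # wall-clock hours (#Hours)


# The five stages from the README training card.
STAGES = [
    Stage("stage1", 1, 320, 240, 146_000, 32, 81),    # image pre-training at 320x240
    Stage("stage2", 1, 640, 480, 200_000, 32, 142),   # image pre-training at 640x480
    Stage("stage3", 29, 640, 480, 30_000, 128, 38),   # 29-frame 640x480 video
    Stage("stage4", 29, 1280, 720, 21_000, 256, 64),  # adapt weights to 1280x720
    Stage("stage5", 93, 1280, 720, 3_000, 256, 84),   # 93-frame 720p HQ fine-tune
]


def device_hours(stage: Stage) -> int:
    """Total accelerator-hours for one stage (#Num x #Hours from the table)."""
    return stage.devices * stage.hours


def plan(stages):
    """Yield one launch config per stage, threading a checkpoint path through
    so each stage initializes from its predecessor (hypothetical naming)."""
    ckpt = None
    for s in stages:
        yield {
            "init_from": ckpt,  # None for stage 1, previous checkpoint otherwise
            "size": (s.frames, s.width, s.height),
            "steps": s.steps,
        }
        ckpt = f"{s.name}.ckpt"
```

Under this reading of the table, the final HQ fine-tune alone costs 256 × 84 = 21,504 H100-hours, and only the first stage starts from scratch.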