LanguageBind committed f25d17e (parent: 9e40191)

Update README.md

README.md CHANGED
@@ -2,6 +2,9 @@
 license: mit
 ---

+# Please note that the weights for v1.2.0 29×720p and 93×480p were trained on Panda70M and have not undergone final high-quality data fine-tuning, so they may produce watermarks.
+
+# We fine-tuned 3.5k steps from 93×720p to get 93×480p for community research use.


 <h1 align="left"> <a href="https://github.com/PKU-YuanGroup/Open-Sora-Plan">Open-Sora Plan</a></h1>
@@ -170,14 +173,19 @@ Similar to previous work, we use a multi-stage training approach. With the 3D Di

 The video model is initialized with weights from a 480p image model. We first train 480p videos with 29 frames. Next, we adapt the weights to 720p resolution, training on approximately 7 million samples from Panda70M, filtered for aesthetic quality and motion. Finally, we refine the model with a higher-quality (HQ) subset of 1 million samples for fine-tuning 93-frame 720p videos. Below is our training card.

+
+
 | Name | Stage 1 | Stage 2 | Stage 3 | Stage 4 | Stage 5 |
-
+|---|---|---|---|---|---|
 | Training Video Size | 1×320×240 | 1×640×480 | 29×640×480 | 29×1280×720 | 93×1280×720 |
 | Training Step | 146k | 200k | 30k | 21k | 3k |
 | Compute (#Num × #Hours) | 32 Ascend × 81 | 32 Ascend × 142 | 128 Ascend × 38 | 256 H100 × 64 | 256 H100 × 84 |
-| Checkpoint | - | - | - | - | [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0) |
+| Checkpoint | - | - | - | [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/29x720p) | [HF](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/93x720p) |
 | Log | - | - | [wandb](https://api.wandb.ai/links/1471742727-Huawei/trdu2kba) | [wandb](https://api.wandb.ai/links/linbin/vvxvcd7s) | [wandb](https://api.wandb.ai/links/linbin/easg3qkl) |
-| Training Data | 10M SAM | 5M internal image data |
+| Training Data | [10M SAM](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/blob/main/anno_json/sam_image_11185255_resolution.json) | 5M internal image data | [6M HQ Panda70M](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/blob/main/anno_json/Panda70M_HQ6M.json) | [6M HQ Panda70M](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/blob/main/anno_json/Panda70M_HQ6M.json) | [1M HQ Panda70M](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/blob/main/anno_json/Panda70M_HQ1M.json) and [100k HQ data](https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/anno_json) (collected in v1.1.0) |
+
+Additionally, we fine-tuned 3.5k steps from the final 93×720p to get [93×480p](https://huggingface.co/LanguageBind/Open-Sora-Plan-v1.2.0/tree/main/93x480p) for community research use.
+

 ### Training Image-to-Video Diffusion Model

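The training card in this diff describes a staged recipe: each stage resumes from the previous stage's weights while the clip size, step count, and compute budget change. A minimal Python sketch of that schedule is below; the `Stage` dataclass, the `plan` helper, and the checkpoint naming are illustrative assumptions, not part of the Open-Sora-Plan codebase.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    frames: int   # temporal length of each training clip
    width: int
    height: int
    steps: int    # optimizer steps
    devices: int  # accelerators used (#Num)
    hours: int    # wall-clock hours (#Hours)


# The five stages from the README training card.
STAGES = [
    Stage("stage1", 1, 320, 240, 146_000, 32, 81),    # image pre-training at 320x240
    Stage("stage2", 1, 640, 480, 200_000, 32, 142),   # image pre-training at 640x480
    Stage("stage3", 29, 640, 480, 30_000, 128, 38),   # 29-frame 640x480 video
    Stage("stage4", 29, 1280, 720, 21_000, 256, 64),  # adapt weights to 1280x720
    Stage("stage5", 93, 1280, 720, 3_000, 256, 84),   # 93-frame 720p HQ fine-tune
]


def device_hours(stage: Stage) -> int:
    """Total accelerator-hours for one stage (#Num x #Hours from the table)."""
    return stage.devices * stage.hours


def plan(stages):
    """Yield one launch config per stage, threading a checkpoint path through
    so each stage initializes from its predecessor (hypothetical naming)."""
    ckpt = None
    for s in stages:
        yield {
            "init_from": ckpt,  # None for stage 1, previous checkpoint otherwise
            "size": (s.frames, s.width, s.height),
            "steps": s.steps,
        }
        ckpt = f"{s.name}.ckpt"
```

Under this reading of the table, the final HQ fine-tune alone costs 256 × 84 = 21,504 H100-hours, and only the first stage starts from scratch.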