Update README.md
README.md CHANGED
@@ -24,7 +24,7 @@ A series of CLIP ConvNeXt-XXLarge (a custom `timm` ConvNeXt size) models trained

| Model | Dataset | Resolution | AugReg | Top-1 ImageNet Zero-Shot (%) |
| ----- | ------- | ---------- | ------------ | --------- |
-| [convnext_xxlarge.laion2b_s34b_b82k-augreg](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg) | LAION-2B | 256x256 | RRC (0.33, 1.0), RE (0.35), SD (0.1)
+| [convnext_xxlarge.laion2b_s34b_b82k-augreg](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg) | LAION-2B | 256x256 | RRC (0.33, 1.0), RE (0.35), SD (0.1) | 79.1 |
| [convnext_xxlarge.laion2b_s34b_b82k-augreg-rewind](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-rewind) | LAION-2B | 256x256 | RRC (0.3, 1.0), RE (0.4), SD (0.1) | 79.3 |
| [convnext_xxlarge.laion2b_s34b_b82k-augreg-soup](CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg-soup) | LAION-2B | 256x256 | N/A | 79.4 |

RRC = Random Resize Crop (crop pcts), RE = Random Erasing (prob), SD = Stochastic Depth (prob) -- image tower only
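The AugReg column lists image-tower augmentation and regularization settings. As a rough illustrative sketch of what those abbreviations correspond to (assuming standard `torchvision` transforms and a `timm` ConvNeXt image tower; the normalization constants and exact transform order are assumptions here, not the pinned OpenCLIP preprocessing):

```python
import timm
from torchvision import transforms

# Rough mapping of the AugReg column for the -augreg model (values from the table above):
#   RRC (0.33, 1.0) -> RandomResizedCrop scale range
#   RE  (0.35)      -> RandomErasing probability (applied to the normalized tensor)
#   SD  (0.1)       -> stochastic depth ("drop path") rate, image tower only
# The -rewind variant would swap in scale=(0.3, 1.0) and p=0.4 per the table.
train_preprocess = transforms.Compose([
    transforms.RandomResizedCrop(256, scale=(0.33, 1.0)),
    transforms.ToTensor(),
    # Normalization constants shown are the common CLIP defaults -- an assumption;
    # the pinned values live in the model's preprocessing config.
    transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                         std=(0.26862954, 0.26130258, 0.27577711)),
    transforms.RandomErasing(p=0.35),
])

# Stochastic depth is a model-side knob, e.g. via timm's drop_path_rate argument:
image_tower = timm.create_model("convnext_xxlarge", pretrained=False, drop_path_rate=0.1)
```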
@@ -99,6 +99,20 @@ This model was trained with LAION-2B -- A 2 billion sample English subset of LAI

The main training run was done at a global batch size of 81920 for 256 checkpoint intervals of 135.6M samples each, for a total of ~34B samples seen over training.

+Many difficulties w/ both model numerical stability and cluster stability/performance were encountered while training this model. Initial attempts to train with float16 AMP and the default adam beta2 resulted in loss spikes and eventually NaN blow-ups. `beta2` was reduced to 0.97, which helped, but the loss / zero-shot curves were not tracking as expected. After switching to PyTorch nightlies, it was possible to use bfloat16 + AMP for training (as with the recent H/14, g/14, and G/14 models); `beta2` was returned to 0.98 and metrics improved.
+
+| Checkpoint Interval | Cluster   | # GPUs | # Nodes | GPU       | local BS | sample/s | sample/s/gpu | precision  | adam beta2 |
+|---------------------|-----------|--------|---------|-----------|----------|----------|--------------|------------|------------|
+| 1 - 2               | Stability | 1024   | 128     | A100 40GB | 80       | 37-40k   | 36-39        | amp + fp16 | 0.97       |
+| 3 - 32              | Stability | 512    | 64      | A100 80GB | 160      | 27-32k   | 52-62        | amp + fp16 | 0.97       |
+| 33 - 75             | Booster   | 1024   | 256     | A100 40GB | 80       | 48k      | 47           | amp + fp16 | 0.97       |
+| 76 - 165            | Booster   | 1024   | 256     | A100 40GB | 80       | 51k      | 50           | amp + bf16 | 0.98       |
+| 166 - 232           | Stability | 320    | 40      | A100 80GB | 256      | 18-19k   | 56-59        | amp + bf16 | 0.98       |
+| 233 - 249           | Booster   | 1024   | 256     | A100 40GB | 80       | 51k      | 50           | amp + bf16 | 0.98       |
+| 250 - 256           | Stability | 1024   | 128     | A100 40GB | 80       | 27-31k   | 26-30        | amp + bf16 | 0.98       |
+
+JUWELS Booster has 4x A100 GPUs per node w/ 4x HDR-200 IB adapters per node (200 Gbit/sec per GPU). The Stability setup used was 8x A100 GPUs per node w/ 400 Gbit/sec EFA connectivity per node (~50 Gbit/sec per GPU). Significant variation in training efficiency (throughput per GPU) was observed across the various configurations. The 1024-GPU configurations across both clusters were particularly prone to crashing (or very difficult to get running w/ a 'good' set of GPUs).
+
For 256x256 models, the slurm script w/ srun below was used for a 128-node, 8-GPU (40GB A100) configuration:

```
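The precision and `beta2` changes described above map onto a standard PyTorch AMP setup. Below is a minimal, hypothetical sketch (a toy model and placeholder loss on a CUDA device, not the actual OpenCLIP `training.main` loop): bfloat16 autocast needs no gradient scaler, whereas the earlier float16 runs relied on one and on the reduced `beta2=0.97`.

```python
import torch

# Toy stand-in; the real run trains the CLIP image/text towers via training.main.
model = torch.nn.Linear(512, 512).cuda()
use_bf16 = True  # bf16 + AMP avoided the fp16 loss spikes / NaNs described above

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,                                  # placeholder LR, not the run's schedule
    betas=(0.9, 0.98 if use_bf16 else 0.97),  # beta2 0.98 with bf16, 0.97 with fp16
)

amp_dtype = torch.bfloat16 if use_bf16 else torch.float16
# A gradient scaler is only needed for float16; bf16 has enough range to go without one.
scaler = torch.cuda.amp.GradScaler(enabled=not use_bf16)

x = torch.randn(80, 512, device="cuda")       # e.g. a local batch of 80
with torch.autocast(device_type="cuda", dtype=amp_dtype):
    loss = model(x).float().pow(2).mean()     # placeholder loss

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```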
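As a quick consistency check on the schedule and throughput numbers, the sketch below is pure arithmetic on figures quoted in this README (batch sizes, intervals, and table rows); nothing here is measured or new.

```python
# Main run: 256 checkpoint intervals of 135.6M samples at global batch 81920.
global_batch = 81920            # = 1024 GPUs x local batch 80 (or 512 x 160, etc.)
interval_samples = 135.6e6
intervals = 256

print(round(interval_samples / global_batch))   # ~1655 optimizer steps per interval
print(intervals * interval_samples / 1e9)       # ~34.7, i.e. the "~34B samples seen"

# Per-GPU throughput, e.g. the Booster rows: ~48k samples/s over 1024 GPUs
print(round(48_000 / 1024))                     # ~47 samples/s/gpu, matching the table

# Rewind run (last ~10%): 1088 GPUs x local batch 88
print(1088 * 88)                                # 95744, the larger rewind global batch
```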
@@ -131,6 +145,11 @@ srun --cpu_bind=v --accel-bind=gn python -m training.main \

```

For the rewind of the last 10%, a higher global batch size of 95744 was used w/ a higher LR and slightly increased augmentation strength. The slurm srun cmd below was used for 136 8-GPU (40GB A100) nodes:

+| Checkpoint Interval | Cluster   | # GPUs | # Nodes | GPU       | local BS | sample/s | sample/s/gpu | precision  | adam beta2 |
+|---------------------|-----------|--------|---------|-----------|----------|----------|--------------|------------|------------|
+| 231 - 256           | Stability | 1088   | 136     | A100 40GB | 88       | 32-35k   | 29-32        | amp + bf16 | 0.98       |
+
```
srun --cpu_bind=v --accel-bind=gn python -m training.main \
    --save-frequency 1 \