RuntimeError: CUDA error: invalid argument
I've churned through an evening and about $6 trying to get this to work, including trying multiple machine types (up to the highest-specced one) as well as different image sets and parameters.
Every time the thing fails with RuntimeError: CUDA error: invalid argument
(or on subsequent runs would fail with zero memory errors).
Anyone know what's going on here?
Starting single training...
Namespace(Session_dir='', adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, adam_weight_decay=0.01, cache_latents=False, center_crop=False, class_data_dir=None, class_prompt='', dump_only_text_encoder=False, gradient_accumulation_steps=1, gradient_checkpointing=False, hub_model_id=None, hub_token=None, image_captions_filename=True, instance_data_dir='instance_images', instance_prompt='', learning_rate=2e-06, local_rank=-1, logging_dir='logs', lr_scheduler='polynomial', lr_warmup_steps=0, max_grad_norm=1.0, max_train_steps=1800, mixed_precision='fp16', num_class_images=100, num_train_epochs=1, output_dir='output_model', pretrained_model_name_or_path='/home/user/.cache/huggingface/hub/models--multimodalart--sd-fine-tunable/snapshots/9dabd4dbbdd4c72e2ffbc8fb4e28debef0254949', prior_loss_weight=1.0, push_to_hub=False, resolution=512, sample_batch_size=4, save_n_steps=0, save_starting_step=1, scale_lr=False, seed=42, stop_text_encoder_training=270, tokenizer_name=None, train_batch_size=1, train_only_unet=False, train_text_encoder=True, use_8bit_adam=True, with_prior_preservation=False)
Enabling memory efficient attention with xformers...
/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/diffusers/configuration_utils.py:195: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
0%| | 0/1800 [00:00
Traceback (most recent call last):
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/routes.py", line 292, in run_predict
output = await app.blocks.process_api(
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1007, in process_api
result = await self.call_function(fn_index, inputs, iterator, request)
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 848, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "app.py", line 264, in train
run_training(args_general)
File "/home/user/app/train_dreambooth.py", line 771, in run_training
accelerator.backward(loss)
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/accelerate/accelerator.py", line 882, in backward
self.scaler.scale(loss).backward(**kwargs)
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/autograd/function.py", line 253, in apply
return user_fn(self, *args)
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/xformers/ops/memory_efficient_attention.py", line 422, in backward
) = torch.ops.xformers.efficient_attention_backward_cutlass(
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/_ops.py", line 143, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
I tried downgrading diffusers to version 0.10.0, but it didn't change anything; I still get this invalid argument error.
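One way to follow the hint at the end of the traceback is to set CUDA_LAUNCH_BLOCKING before anything initializes CUDA, so the failing kernel is reported at the call that actually launched it. This is only a minimal sketch: where exactly it would go in the Space's code is an assumption, and the helper name `cuda_launch_blocking_enabled` is made up for illustration.

```python
import os

# Assumption: this runs at the very top of the entry script, before torch
# (or anything else that touches CUDA) is imported. The CUDA runtime reads
# the variable when it initializes, so setting it later has no effect.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def cuda_launch_blocking_enabled() -> bool:
    """Hypothetical helper: report whether synchronous launches were requested."""
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"


if __name__ == "__main__":
    print("CUDA_LAUNCH_BLOCKING enabled:", cuda_launch_blocking_enabled())
```

With synchronous launches the run is slower, but the stack trace should point at the operation that really failed rather than at a later API call.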
Can you share the settings you are using when you hit this issue? I see quite a few users successfully training their models every day, so I would like to try to replicate your issue.
Thanks for your reply. Let me write down the steps as I am currently trying again:
- I give my duplicated space A10G Small hardware
- I restart the space (cloned from your latest version, 9 days ago. My requirements.txt modifications have been rolled back)
- Build failed after 30 min (first.log)
- Restart the space manually
- Build failed after 30 min (second.log)
- I factory reboot the space
- Build failed after 30 min (third.log)
- I decide to switch to an A10G Large
- I factory reboot the space
- Build failed after 30 min (fourth.log)
- I decide to delete my space and duplicate a brand new one
- Initial build completed in 2 minutes with the default CPU basic hardware (fifth.log)
- I give this space A10G Small hardware
- I did not fill in the HUGGING_FACE_HUB_TOKEN secret, as I don't know whether it should be the Hugging Face write token
- Finally, the build is complete (after 28 minutes) (sixth.log)
- I choose to train a person, based on 1.5
- I upload 14 512x512 pictures (513Kb total)
- I name my concept Niko (as it's his name)
- I don't use any custom settings
- I name my model niko-1-5
- I paste my Hugging Face Write Token
- I click Start training
- It failed after about a minute (seventh.log)
- I give up, but first I fall back to CPU basic hardware to avoid extra fees (as the space keeps running even though training crashed)
Thank you so much for the detailed report.
Some of those issues were caused by instabilities in the Hugging Face Spaces management infrastructure, and others by a bug in the HF Training Space itself.
I pushed a bugfix for those bugs today, and mounting new Spaces should be more stable now. Also, if a CUDA error does occur, the GPU is now removed automatically. Sorry for the hassle, and feel free to try again.
Nice, thanks for your support!
I just tried again with a brand new dreambooth-training space. HF helped today and I was able to get the app running on an A10G Small very quickly, but training failed early.
Here is the container output:
/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py:1222: UserWarning: The default_enabled parameter of queue has no effect and will be removed in a future version of gradio.
warnings.warn(
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
Traceback (most recent call last):
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/routes.py", line 337, in run_predict
output = await app.get_blocks().process_api(
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1015, in process_api
result = await self.call_function(
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 833, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/helpers.py", line 584, in tracked_fn
response = fn(*args)
File "app.py", line 204, in train
sleep_time = get_sleep_time(hf_token)
File "app.py", line 185, in get_sleep_time
return response.json()['runtime']['gcTimeout']
KeyError: 'gcTimeout'
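The KeyError above suggests the response no longer carried a `gcTimeout` entry under `runtime`. A defensive lookup along the following lines would avoid the crash. This is only a sketch: `extract_gc_timeout` and its `default` argument are hypothetical names, and the payload shape is inferred solely from the failing line `response.json()['runtime']['gcTimeout']`.

```python
from typing import Any, Dict, Optional


def extract_gc_timeout(payload: Dict[str, Any],
                       default: Optional[int] = None) -> Optional[int]:
    """Hypothetical helper: read payload['runtime']['gcTimeout'] without raising.

    Returns `default` when either key is missing or 'runtime' is not a dict.
    """
    runtime = payload.get("runtime")
    if not isinstance(runtime, dict):
        return default
    return runtime.get("gcTimeout", default)


if __name__ == "__main__":
    # Example payloads are invented for illustration only.
    print(extract_gc_timeout({"runtime": {"gcTimeout": 3600}}))
    print(extract_gc_timeout({"runtime": {}}))
```

Whether a missing value should fall back to a default or stop training with a clear message is a design decision for the Space's author.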
I tried again, deleting and duplicating the space, then refreshing the whole app once the hardware was ready.
I then get this training error message:
Unfortunately there was an error during training your niko-1-5 model.
Please check it out below. Feel free to report this issue to Dreambooth Training:
CUDA error: invalid argument CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
And here is the detailed log:
Starting single training...
Namespace(Session_dir='', adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, adam_weight_decay=0.01, cache_latents=False, center_crop=False, class_data_dir=None, class_prompt='', dump_only_text_encoder=False, gradient_accumulation_steps=1, gradient_checkpointing=False, hub_model_id=None, hub_token=None, image_captions_filename=True, instance_data_dir='instance_images', instance_prompt='', learning_rate=2e-06, local_rank=-1, logging_dir='logs', lr_scheduler='polynomial', lr_warmup_steps=0, max_grad_norm=1.0, max_train_steps=2100, mixed_precision='fp16', num_class_images=100, num_train_epochs=1, output_dir='output_model', pretrained_model_name_or_path='/home/user/.cache/huggingface/hub/models--multimodalart--sd-fine-tunable/snapshots/9dabd4dbbdd4c72e2ffbc8fb4e28debef0254949', prior_loss_weight=1.0, push_to_hub=False, resolution=512, sample_batch_size=4, save_n_steps=0, save_starting_step=1, scale_lr=False, seed=42, stop_text_encoder_training=1470, tokenizer_name=None, train_batch_size=1, train_only_unet=False, train_text_encoder=True, use_8bit_adam=True, with_prior_preservation=False)
Enabling memory efficient attention with xformers...
/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/diffusers/configuration_utils.py:195: FutureWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
deprecate("config-passed-as-path", "1.0.0", deprecation_message, standard_warn=False)
Niko Niko Adding Safety Checker to the model...
Traceback (most recent call last):
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/routes.py", line 337, in run_predict
output = await app.get_blocks().process_api(
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 1015, in process_api
result = await self.call_function(
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/blocks.py", line 833, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/gradio/helpers.py", line 584, in tracked_fn
response = fn(*args)
File "app.py", line 340, in train
push(model_name, where_to_upload, hf_token, which_model, True)
File "app.py", line 360, in push
convert("output_model", "model.ckpt")
File "/home/user/app/convertosd.py", line 270, in convert
unet_state_dict = torch.load(unet_path, map_location="cpu")
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/serialization.py", line 699, in load
with _open_file_like(f, 'rb') as opened_file:
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/serialization.py", line 230, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/user/.pyenv/versions/3.8.9/lib/python3.8/site-packages/torch/serialization.py", line 211, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'output_model/unet/diffusion_pytorch_model.bin'
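This FileNotFoundError looks like a follow-on failure: the conversion step ran even though the UNet weights were never written, presumably because training had already stopped on the CUDA error. A check like the one below, run before conversion, would surface that more clearly. It is a sketch only: `find_missing_weights` is a hypothetical helper, and the single relative path it checks is simply the one named in the error message.

```python
from pathlib import Path
from typing import Iterable, List


def find_missing_weights(
    output_dir: str,
    relative_paths: Iterable[str] = ("unet/diffusion_pytorch_model.bin",),
) -> List[str]:
    """Hypothetical helper: list expected weight files absent from output_dir."""
    root = Path(output_dir)
    return [rel for rel in relative_paths if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = find_missing_weights("output_model")
    if missing:
        print("Training did not produce:", ", ".join(missing))
    else:
        print("All expected weight files are present.")
```

If the list is non-empty, the earlier training error is the one worth reporting, not the missing file.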