OSError: [Errno 28] No space left on device

#39
by farmer21cn - opened

torchrun --nproc_per_node=4 train/fine-tune_on_custom_dataset.py \
    --model_name /root/fyj20220812/dev/whisper-large-v3 \
    --language en \
    --sampling_rate 16000 \
    --num_proc 1 \
    --train_strategy epoch \
    --learning_rate 5e-6 \
    --warmup 1000 \
    --train_batchsize 32 \
    --eval_batchsize 8 \
    --num_epochs 5 \
    --resume_from_ckpt None \
    --output_dir op_dir_epoch \
    --train_datasets /root/fyj20220812/data_process_whisper/train_des \
    --eval_datasets /root/fyj20220812/data_process_whisper/dev_des

WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

ARGUMENTS OF INTEREST:
{'model_name': '/root/fyj20220812/dev/whisper-large-v3', 'language': 'en', 'sampling_rate': 16000, 'num_proc': 1, 'train_strategy': 'epoch', 'learning_rate': 5e-06, 'warmup': 1000, 'train_batchsize': 32, 'eval_batchsize': 8, 'num_epochs': 5, 'num_steps': 100000, 'resume_from_ckpt': 'None', 'output_dir': 'op_dir_epoch', 'train_datasets': ['/root/fyj20220812/data_process_whisper/train_des'], 'eval_datasets': ['/root/fyj20220812/data_process_whisper/dev_des']}

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

DATASET PREPARATION IN PROGRESS...
Traceback (most recent call last):
  File "train/fine-tune_on_custom_dataset.py", line 264, in <module>
    raw_dataset["train"] = load_custom_dataset('train')
  File "train/fine-tune_on_custom_dataset.py", line 226, in load_custom_dataset
    ds.append(load_from_disk(dset))
  File "/usr/local/lib/python3.8/dist-packages/datasets/load.py", line 1884, in load_from_disk
    return Dataset.load_from_disk(dataset_path, keep_in_memory=keep_in_memory, storage_options=storage_options)
  File "/usr/local/lib/python3.8/dist-packages/datasets/arrow_dataset.py", line 1605, in load_from_disk
    fs.download(src_dataset_path, dest_dataset_path.as_posix(), recursive=True)
  File "/usr/local/lib/python3.8/dist-packages/fsspec/spec.py", line 1625, in download
    return self.get(rpath, lpath, recursive=recursive, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/fsspec/spec.py", line 976, in get
    self.get_file(rpath, lpath, callback=child, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/fsspec/implementations/local.py", line 142, in get_file
    return self.cp_file(path1, path2, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/fsspec/implementations/local.py", line 123, in cp_file
    shutil.copyfile(path1, path2)
  File "/usr/lib/python3.8/shutil.py", line 275, in copyfile
    _fastcopy_sendfile(fsrc, fdst)
  File "/usr/lib/python3.8/shutil.py", line 166, in _fastcopy_sendfile
    raise err from None
  File "/usr/lib/python3.8/shutil.py", line 152, in _fastcopy_sendfile
    sent = os.sendfile(outfd, infd, offset, blocksize)
OSError: [Errno 28] No space left on device: '/root/fyj20220812/data_process_whisper/train_des/data-00292-of-01036.arrow' -> '/tmp/tmpi_agix1s/root/fyj20220812/data_process_whisper/train_des/data-00292-of-01036.arrow'
OSError: [Errno 28] No space left on device: '/root/fyj20220812/data_process_whisper/train_des/data-00292-of-01036.arrow' -> '/tmp/tmp4_dd6668/root/fyj20220812/data_process_whisper/train_des/data-00292-of-01036.arrow'
OSError: [Errno 28] No space left on device: '/root/fyj20220812/data_process_whisper/train_des/data-00292-of-01036.arrow' -> '/tmp/tmpf30qnmru/root/fyj20220812/data_process_whisper/train_des/data-00292-of-01036.arrow'
OSError: [Errno 28] No space left on device: '/root/fyj20220812/data_process_whisper/train_des/data-00292-of-01036.arrow' -> '/tmp/tmpjwxuu8bt/root/fyj20220812/data_process_whisper/train_des/data-00292-of-01036.arrow'
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2445) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train/fine-tune_on_custom_dataset.py FAILED
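
Reading the traceback, each of the four ranks copies the `.arrow` shards of `train_des` into its own temporary directory under `/tmp`, and the copy fails at shard 292 of 1036. A rough way to check whether the temporary location can hold those copies is sketched below. The helper names `dir_size_bytes` and `temp_copy_fits` are hypothetical, and the sketch assumes the temporary directory is the one Python's `tempfile` module reports.

```python
import os
import shutil
import tempfile


def dir_size_bytes(path):
    """Sum the sizes of all regular files under `path` (recursively)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):
                total += os.path.getsize(file_path)
    return total


def temp_copy_fits(dataset_dir, num_procs, tmp_dir=None):
    """Check whether `num_procs` full copies of `dataset_dir` fit in `tmp_dir`.

    Returns (fits, needed_bytes, free_bytes). Assumes every process makes
    its own complete copy, as the four /tmp/tmp* paths in the log suggest.
    """
    tmp_dir = tmp_dir or tempfile.gettempdir()
    needed = dir_size_bytes(dataset_dir) * num_procs
    free = shutil.disk_usage(tmp_dir).free
    return needed <= free, needed, free


if __name__ == "__main__":
    # Path and process count taken from the torchrun command in this post.
    dataset_dir = "/root/fyj20220812/data_process_whisper/train_des"
    if os.path.isdir(dataset_dir):
        fits, needed, free = temp_copy_fits(dataset_dir, num_procs=4)
        gib = 2 ** 30
        print(f"needed: {needed / gib:.1f} GiB, free in temp dir: {free / gib:.1f} GiB, fits: {fits}")
    else:
        print(f"{dataset_dir} not found on this machine")
```

If the check reports that the copies do not fit, one option may be to point the `TMPDIR` environment variable at a larger volume before launching `torchrun`, since `tempfile.gettempdir()` honors it. Whether the installed `datasets` version creates its copy through `tempfile` is an assumption based on the `/tmp/tmp...` directory names in the log.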
