RuntimeError: Expected 2D (unbatched) or 3D (batched) input to conv1d, but got input of size: [1, 1, 2, 476544]
What is the cause of this problem?
I have the same problem and I don't know how to solve it. It seems like most wav2vec models that are fine-tuned for speech emotion recognition share a similar code structure, so I don't know how to avoid the predict() function.
It seems like there is a problem with the input tensor being passed to the convolutional layer of the network. The input tensor has a shape of [1, 1, 2, 476544], i.e. it is 4-dimensional, but the convolutional layer expects a 2D or 3D input tensor.
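To see the constraint in isolation, here is a minimal sketch (with shapes made up to match the error message) of what conv1d accepts and what it rejects:

import torch

conv = torch.nn.Conv1d(in_channels=1, out_channels=512, kernel_size=10)

ok  = torch.randn(1, 1, 476544)      # 3D [batch, channels, time]: accepted
bad = torch.randn(1, 1, 2, 476544)   # 4D (extra stereo dimension): rejected

conv(ok)   # works
conv(bad)  # RuntimeError: Expected 2D (unbatched) or 3D (batched) input to conv1d ...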
It is possible that the speech_file_to_array_fn() function is not returning the audio data in the expected shape, which then causes issues downstream in the prediction process.
But again, I don't really know how to solve this issue :/
I think the issue is in speech_file_to_array_fn. torchaudio.load returns a (waveform, sample_rate) tuple, and with the default channels_first=True the waveform has shape [channels, time] (see https://pytorch.org/audio/0.8.0/backend.html#torchaudio.backend.sox_io_backend.load). For a stereo file that means two channels, which is where the extra dimension comes from. I selected just the second element of speech_array (a single channel) and then the predict function worked for me.
import torchaudio

def speech_file_to_array_fn(path, sampling_rate):
    # torchaudio.load returns (waveform, sample_rate); the waveform is [channels, time]
    speech_array, _sampling_rate = torchaudio.load(path, format="mp3")
    # keep only the second channel so the array is 1D, then resample to the target rate
    resampler = torchaudio.transforms.Resample(_sampling_rate, sampling_rate)
    speech = resampler(speech_array[1]).squeeze().numpy()
    return speech
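If you do not want to discard a channel, an alternative sketch (my own, not from the model card, assuming the usual 16 kHz target rate of wav2vec2 checkpoints) is to downmix the stereo waveform to mono by averaging the channels before resampling:

import torchaudio

# Downmix stereo to mono by averaging the channels, then resample.
def speech_file_to_array_fn(path, sampling_rate=16000):
    speech_array, _sampling_rate = torchaudio.load(path, format="mp3")
    mono = speech_array.mean(dim=0)        # [channels, time] -> [time]
    resampler = torchaudio.transforms.Resample(_sampling_rate, sampling_rate)
    return resampler(mono).squeeze().numpy()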