---
license: apache-2.0
datasets:
- agkphysics/AudioSet
pipeline_tag: audio-classification
---

# Model Details

This is a CRNN sound event detection model pre-trained on [AudioSet](https://research.google.com/audioset/download.html) and then fine-tuned on [AudioSet-strong](https://research.google.com/audioset/download_strong.html). It contains 8 convolution layers and a GRU, with a time resolution of 40 ms and about 6.4 million parameters in total.

# Usage

```python
import torch
import torchaudio
from transformers import AutoModel

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(
    "wsntxxn/cnn8rnn-audioset-sed",
    trust_remote_code=True
).to(device)

# Load each file, resample to the model's sample rate, and downmix to mono
wav1, sr1 = torchaudio.load("/path/to/file1.wav")
wav1 = torchaudio.functional.resample(wav1, sr1, model.config.sample_rate)
wav1 = wav1.mean(0) if wav1.size(0) > 1 else wav1[0]

wav2, sr2 = torchaudio.load("/path/to/file2.wav")
wav2 = torchaudio.functional.resample(wav2, sr2, model.config.sample_rate)
wav2 = wav2.mean(0) if wav2.size(0) > 1 else wav2[0]

# Zero-pad the shorter waveform so both fit in one batch
wav_batch = torch.nn.utils.rnn.pad_sequence([wav1, wav2], batch_first=True).to(device)

with torch.no_grad():
    output = model(waveform=wav_batch)
# output: {
#     "framewise_output": (2, 447, n_frames),
#     "clipwise_output": (2, 447)
# }

# class names are in `model.classes`
# for example, the probability sequence of male speech is:
male_speech_prob = output["framewise_output"][:, model.classes.index("Male speech, man speaking"), :]
```
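
The framewise probabilities can be turned into event onsets and offsets by thresholding them and converting frame indices to seconds with the 40 ms time resolution stated above. The following is a minimal sketch of that idea, not part of the model's API: the helper name `frames_to_segments` and the default threshold of 0.5 are assumptions chosen for illustration, and a per-class threshold or extra smoothing may work better in practice.

```python
from typing import Iterable, List, Tuple

FRAME_SECONDS = 0.04  # 40 ms time resolution, as stated in Model Details


def frames_to_segments(
    probs: Iterable[float],
    threshold: float = 0.5,  # assumed value for illustration; tune per class
    frame_seconds: float = FRAME_SECONDS,
) -> List[Tuple[float, float]]:
    """Return (onset, offset) pairs in seconds for runs of frames above threshold."""
    segments: List[Tuple[float, float]] = []
    start = None  # index of the first frame of the current active run
    n_frames = 0
    for i, p in enumerate(probs):
        n_frames = i + 1
        active = p > threshold
        if active and start is None:
            start = i  # a new event begins at this frame
        elif not active and start is not None:
            # the event ended just before frame i
            segments.append((start * frame_seconds, i * frame_seconds))
            start = None
    if start is not None:
        # the event is still active at the end of the sequence
        segments.append((start * frame_seconds, n_frames * frame_seconds))
    return segments
```

Assuming the `(batch, classes, n_frames)` layout shown in the shape comment of the usage example, the sequence for one clip and one class could be passed as `output["framewise_output"][0, class_index].tolist()`, where `class_index` comes from `model.classes.index(...)`. For clips that were zero-padded in a batch, segments falling in the padded tail should be ignored.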