Whisper Medium Amharic FLEURS

This model is a fine-tuned version of openai/whisper-medium on the google/fleurs am_et dataset. It achieves the following results on the evaluation set:

Loss: 7.8670
Wer: 154.4118

Model description

The main Whisper Small Hugging Face page: Hugging Face - Whisper Small

Intended uses & limitations

For experimentation and curiosity.
Based on the Whisper paper (arXiv) and Benchmarking OpenAI Whisper for non-English ASR - Dan Shafer, there is a performance bias towards certain languages and curated datasets.
From the Whisper paper, am_et is a low-resource language (Table E), with WER results ranging from 120 to 229 depending on model size. Whisper small reports WER=120.2, indicating that more training time may improve the fine-tuning.
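
A WER above 100 can look like a reporting mistake, but the metric is the word-level edit distance divided by the number of reference words, so inserted words can push it past 100%. The following is a minimal pure-Python sketch of that standard definition. The function name `word_error_rate` and the example strings are made up for illustration, and the evaluation scripts used for this model may normalize text differently before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words (classic Levenshtein recurrence).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Two reference words, five unrelated hypothesis words:
# 2 substitutions + 3 insertions = 5 errors over 2 reference words.
print(word_error_rate("one two", "a b c d e"))  # 250.0
```

A model that emits long, unrelated output for short utterances can therefore score well above 100, which is the range reported for am_et here.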

Training and evaluation data

This model was trained and evaluated on the "test+validation" data from google/fleurs (google/fleurs - HuggingFace Datasets).

Training procedure

The training was done on Lambda Cloud GPU instances with A100 40GB GPUs, which were provided by the OpenAI Community Events Whisper Fine Tuning Event - Dec 2022. Training used run_speech_recognition_seq2seq_streaming.py from HuggingFace Community Events - Whisper, with the included whisper_python_am_et.ipynb used to set up the Lambda Cloud GPU/Colab environment. For Colab, you must reduce the train batch size to the recommended amount mentioned at Whisper Fine Tuning Event - Dec 2022, as the T4 GPUs have 16GB of memory. The notebook sets up the environment, logs into your Hugging Face account, and generates a bash script. The generated script, run.sh, was then run from the terminal with bash run.sh to start training, as described on the Whisper community events GitHub page.
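
To illustrate the "notebook generates a bash script" step, here is a hedged sketch of how such a script could be assembled in Python. It is not the author's notebook code. The argument values are taken from the hyperparameter list in this card, the flag names are assumed to match the arguments of run_speech_recognition_seq2seq_streaming.py, and `ARGS`, `FLAGS`, and `build_run_script` are hypothetical names. Split names and the language flag are deliberately left out because the card does not state them unambiguously.

```python
import shlex

# Key/value arguments; values come from the hyperparameter list in this card.
# Flag names are assumed to match run_speech_recognition_seq2seq_streaming.py.
ARGS = {
    "model_name_or_path": "openai/whisper-medium",
    "dataset_name": "google/fleurs",
    "dataset_config_name": "am_et",
    "max_steps": 3000,
    "per_device_train_batch_size": 32,
    "per_device_eval_batch_size": 16,
    "learning_rate": "1e-5",
    "warmup_steps": 500,
    "seed": 42,
    "output_dir": "./",
}

# Boolean switches (assumed); fp16 corresponds to "Native AMP" mixed precision.
FLAGS = ["do_train", "do_eval", "fp16", "streaming", "predict_with_generate"]


def build_run_script(args, flags):
    """Return the text of a run.sh that launches the training script."""
    parts = ["python run_speech_recognition_seq2seq_streaming.py"]
    parts += [f"--{name}={shlex.quote(str(value))}" for name, value in args.items()]
    parts += [f"--{name}" for name in flags]
    # Join with a backslash line continuation so the script stays readable.
    return " \\\n    ".join(parts) + "\n"


script = build_run_script(ARGS, FLAGS)
print(script)
```

Writing `script` to a file named run.sh and launching it with bash run.sh would mirror the workflow described above. On a 16GB T4, the `per_device_train_batch_size` value would need to be lowered.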

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 32
eval_batch_size: 16
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
training_steps: 3000
mixed_precision_training: Native AMP
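
As a rough illustration of what `lr_scheduler_type: linear` with 500 warmup steps and 3000 training steps implies, the sketch below computes the learning rate at a given step. It assumes the usual linear-warmup, linear-decay shape (ramp from 0 to the peak, then decay to 0). `linear_schedule_lr` is a hypothetical helper, not a library call, and the exact values used by the Trainer may differ slightly.

```python
def linear_schedule_lr(step, base_lr=1e-5, warmup_steps=500, total_steps=3000):
    """Learning rate at `step` under linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: ramp linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 1750, 3000):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the peak rate of 1e-05 is reached at step 500 and has halved again by step 1750.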

Training results

Training Loss	Epoch	Step	Validation Loss	Wer
0.0194	100.0	100	3.8540	147.9947
0.0001	200.0	200	4.1479	148.1283
0.0001	300.0	300	4.1840	150.5348
0.0001	400.0	400	4.3339	177.9412
0.0	500.0	500	4.5831	151.0695
0.0	600.0	600	4.9317	164.0374
0.0	700.0	700	5.3031	141.0428
0.0	800.0	800	5.6584	122.3262
0.0	900.0	900	5.9711	157.4866
0.0	1000.0	1000	6.2465	141.1765
0.0	1100.0	1100	6.4832	169.6524
0.0	1200.0	1200	6.6890	155.0802
0.0	1300.0	1300	6.8679	159.7594
0.0	1400.0	1400	7.0250	155.0802
0.0	1500.0	1500	7.1615	146.2567
0.0	1600.0	1600	7.2877	143.0481
0.0	1700.0	1700	7.3987	148.5294
0.0	1800.0	1800	7.5010	142.5134
0.0	1900.0	1900	7.5849	136.7647
0.0	2000.0	2000	7.6689	148.2620
0.0	2100.0	2100	7.6955	165.3743
0.0	2200.0	2200	7.7247	162.9679
0.0	2300.0	2300	7.7557	161.6310
0.0	2400.0	2400	7.7842	162.2995
0.0	2500.0	2500	7.8074	150.9358
0.0	2600.0	2600	7.8287	154.8128
0.0	2700.0	2700	7.8434	155.4813
0.0	2800.0	2800	7.8567	154.4118
0.0	2900.0	2900	7.8635	154.4118
0.0	3000.0	3000	7.8670	154.4118
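
The table can be checked programmatically for signs of overfitting. The sketch below copies the rows above into a list and looks for the step with the lowest WER and the step with the lowest validation loss. The variable names are illustrative only, and reading the pattern as overfitting is an interpretation, not something the training logs state.

```python
# (step, training loss, validation loss, WER) copied from the table above.
RESULTS = [
    (100, 0.0194, 3.8540, 147.9947), (200, 0.0001, 4.1479, 148.1283),
    (300, 0.0001, 4.1840, 150.5348), (400, 0.0001, 4.3339, 177.9412),
    (500, 0.0, 4.5831, 151.0695), (600, 0.0, 4.9317, 164.0374),
    (700, 0.0, 5.3031, 141.0428), (800, 0.0, 5.6584, 122.3262),
    (900, 0.0, 5.9711, 157.4866), (1000, 0.0, 6.2465, 141.1765),
    (1100, 0.0, 6.4832, 169.6524), (1200, 0.0, 6.6890, 155.0802),
    (1300, 0.0, 6.8679, 159.7594), (1400, 0.0, 7.0250, 155.0802),
    (1500, 0.0, 7.1615, 146.2567), (1600, 0.0, 7.2877, 143.0481),
    (1700, 0.0, 7.3987, 148.5294), (1800, 0.0, 7.5010, 142.5134),
    (1900, 0.0, 7.5849, 136.7647), (2000, 0.0, 7.6689, 148.2620),
    (2100, 0.0, 7.6955, 165.3743), (2200, 0.0, 7.7247, 162.9679),
    (2300, 0.0, 7.7557, 161.6310), (2400, 0.0, 7.7842, 162.2995),
    (2500, 0.0, 7.8074, 150.9358), (2600, 0.0, 7.8287, 154.8128),
    (2700, 0.0, 7.8434, 155.4813), (2800, 0.0, 7.8567, 154.4118),
    (2900, 0.0, 7.8635, 154.4118), (3000, 0.0, 7.8670, 154.4118),
]

best_by_wer = min(RESULTS, key=lambda row: row[3])
best_by_loss = min(RESULTS, key=lambda row: row[2])

# True when validation loss rises at every logged step.
val_losses = [row[2] for row in RESULTS]
loss_always_rising = all(a < b for a, b in zip(val_losses, val_losses[1:]))

print("lowest WER:", best_by_wer[3], "at step", best_by_wer[0])    # 122.3262 at step 800
print("lowest val loss:", best_by_loss[2], "at step", best_by_loss[0])  # 3.854 at step 100
print("validation loss always rising:", loss_always_rising)        # True
```

Training loss reaches 0.0 by step 500 while validation loss climbs at every evaluation, and the final checkpoint (WER 154.4118) is worse than the step-800 checkpoint (WER 122.3262). That pattern is consistent with the recommendation below to keep training runs short on small datasets.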

Recommendations

Limit training duration for smaller datasets to ~2000 to 3000 steps to avoid overfitting. 5000 steps using the HuggingFace - Whisper Small takes ~5 hrs on A100 GPUs (1 hr/1000 steps).

Encountered RuntimeError: The size of tensor a (504) must match the size of tensor b (448) at non-singleton dimension 1, which is related to Trainer RuntimeError, as some language datasets contain inputs with non-standard lengths. The link did not resolve my issue, and the error appears elsewhere too: Training languagemodel – RuntimeError the expanded size of the tensor (100) must match the existing size (64) at non singleton dimension 1. To circumvent this issue, the run.sh parameters were adjusted. Then run python run_eval_whisper_streaming.py --model_id="openai/whisper-small" --dataset="google/fleurs" --config="am_et" --batch_size=32 --max_eval_samples=64 --device=0 --language="am" to find the WER score manually. Otherwise, erroring out during evaluation prevents the trained model from being uploaded to Hugging Face.

Based on the Whisper paper (arXiv) and Benchmarking OpenAI Whisper for non-English ASR - Dan Shafer, there is a performance bias towards certain languages and curated datasets. The OpenAI fine-tuning community event provided ample free GPU time to help develop the model further and improve WER scores.
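
The 504 vs. 448 mismatch reported above is consistent with a tokenized transcript that is longer than the 448 decoder positions a Whisper checkpoint supports. One commonly used workaround, sketched below, is to drop examples whose label sequence exceeds that limit before training or evaluation. This is an assumption about the cause and is not the fix described in this card, which only says the run.sh parameters were adjusted. `MAX_LABEL_LENGTH`, `is_label_length_ok`, and the toy examples are hypothetical.

```python
MAX_LABEL_LENGTH = 448  # assumed decoder position limit for Whisper checkpoints


def is_label_length_ok(labels, max_length=MAX_LABEL_LENGTH):
    """Return True when the tokenized transcript fits in the decoder."""
    return len(labels) <= max_length


# Toy stand-ins for tokenized examples; real labels would come from the tokenizer.
examples = [
    {"id": "short", "labels": list(range(120))},
    {"id": "boundary", "labels": list(range(448))},
    {"id": "too_long", "labels": list(range(504))},
]

kept = [ex for ex in examples if is_label_length_ok(ex["labels"])]
print([ex["id"] for ex in kept])  # ['short', 'boundary']
```

With a Hugging Face `datasets` object, the same predicate could likely be applied through `filter` with `input_columns=["labels"]`, provided the preprocessing step stores the token ids under that column name.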

Environmental Impact

Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). In total, roughly 100 hours were used, primarily in US East/Asia Pacific (80%/20%), with AWS as the reference. Additional resources are available at Our World in Data - CO2 Emissions.

Hardware Type: AMD EPYC 7J13 64-Core Processor (30 core VM) 197GB RAM, with NVIDIA A100-SXM 40GB
Hours Used: 100 hrs
Cloud Provider: Lambda Cloud GPU
Compute Region: US East/Asia Pacific
Carbon Emitted: 12 kg (GPU) + 13 kg (CPU) = 25 kg (roughly the weight of 6.6 US gallons of water)
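
As a quick arithmetic check on the totals above, the sketch below adds the GPU and CPU estimates, converts the result to an equivalent weight of water, and derives an average rate per hour. It assumes one US gallon of water weighs about 3.785 kg. The constant names are illustrative and are not part of the Machine Learning Impact calculator.

```python
GPU_KG = 12.0        # GPU estimate reported in this card
CPU_KG = 13.0        # CPU estimate reported in this card
HOURS_USED = 100.0   # total compute hours reported in this card
KG_PER_US_GALLON_WATER = 3.785  # approximate mass of one US gallon of water

total_kg = GPU_KG + CPU_KG
gallons_of_water = total_kg / KG_PER_US_GALLON_WATER
kg_per_hour = total_kg / HOURS_USED

print(f"total: {total_kg:.0f} kg")
print(f"water equivalent: {gallons_of_water:.1f} US gallons")
print(f"average rate: {kg_per_hour:.2f} kg per hour")
```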

Framework versions

Transformers 4.26.0.dev0
Pytorch 1.13.1+cu117
Datasets 2.8.1.dev0
Tokenizers 0.13.2

Citation

@misc{https://doi.org/10.48550/arxiv.2212.04356,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  keywords = {Audio and Speech Processing (eess.AS), Computation and Language (cs.CL), Machine Learning (cs.LG), Sound (cs.SD), FOS: Electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

@article{owidco2andothergreenhousegasemissions,
    author = {Hannah Ritchie and Max Roser and Pablo Rosado},
    title = {CO₂ and Greenhouse Gas Emissions},
    journal = {Our World in Data},
    year = {2020},
    note = {https://ourworldindata.org/co2-and-other-greenhouse-gas-emissions}
}

drmeeseeks/whisper-medium-v2-amet