Whisper-WebUI

A Gradio-based browser interface for Whisper. You can use it as an Easy Subtitle Generator!

Notebook

If you wish to try this on Colab, you can do it here!

Features

- Generate subtitles from various sources, including:
  - Files
  - YouTube
  - Microphone
- Currently supported subtitle formats:
  - SRT
  - WebVTT
  - txt (plain text only, without timestamps)
- Speech-to-text translation
  - From other languages to English. (This is Whisper's end-to-end speech-to-text translation feature.)
- Text-to-text translation
  - Translate subtitle files using Facebook NLLB models
  - Translate subtitle files using the DeepL API
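To illustrate how the subtitle formats listed above differ, here is a minimal sketch that renders the same segments as SRT, WebVTT, and plain text. The function names and the `(start, end, text)` segment tuples are assumptions made for this example, not the project's actual code.

```python
def format_timestamp(seconds, decimal_marker):
    """Render a time in seconds as HH:MM:SS<marker>mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"


def to_srt(segments):
    # SRT: numbered cues, with a comma before the milliseconds.
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}"
        blocks.append(f"{index}\n{timing}\n{text.strip()}\n")
    return "\n".join(blocks)


def to_vtt(segments):
    # WebVTT: a "WEBVTT" header line, with a period before the milliseconds.
    blocks = ["WEBVTT\n"]
    for start, end, text in segments:
        timing = f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}"
        blocks.append(f"{timing}\n{text.strip()}\n")
    return "\n".join(blocks)


def to_txt(segments):
    # txt: the text only, one segment per line, no timestamps.
    return "\n".join(text.strip() for _, _, text in segments)


# Hypothetical segments, in seconds.
segments = [(0.0, 2.5, "Hello."), (2.5, 5.04, "Welcome to the demo.")]
print(to_srt(segments))
print(to_vtt(segments))
print(to_txt(segments))
```

The only differences between the two timed formats in this sketch are the `WEBVTT` header, the cue numbering, and the decimal marker in the timestamps.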

Installation and Running

Prerequisite

To run this WebUI, you need git, Python 3.8 to 3.10, a CUDA version above 12.0, and FFmpeg.

Please follow the links below to install the necessary software:

CUDA: https://developer.nvidia.com/cuda-downloads
git: https://git-scm.com/downloads
Python: https://www.python.org/downloads/ (If your Python version is too new, torch will not install properly.)
FFmpeg: https://ffmpeg.org/download.html

After installing FFmpeg, make sure to add the FFmpeg/bin folder to your system PATH!
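If you want to verify the prerequisites before installing, a small check along the following lines may help. It is only a sketch: the helper names are made up for this example, it assumes the executables are called `git` and `ffmpeg` on your PATH, and it does not check the CUDA version.

```python
import shutil
import sys

# Supported Python range taken from the prerequisites above.
SUPPORTED_MIN = (3, 8)
SUPPORTED_MAX = (3, 10)


def python_version_ok(version=None):
    """Return True if the (major, minor) version is within 3.8 to 3.10."""
    major_minor = tuple((version or sys.version_info)[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


def missing_tools(tools=("git", "ffmpeg")):
    """Return the tools that cannot be found on the system PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    if not python_version_ok():
        print(f"Unsupported Python version: {sys.version.split()[0]}")
    for tool in missing_tools():
        print(f"Not found on PATH: {tool}")
```

If `ffmpeg` is reported as missing right after installation, the FFmpeg/bin folder is probably not on your PATH yet.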

Automatic Installation

If you have satisfied the prerequisites listed above, you are now ready to start Whisper-WebUI.

Run Install.bat from Windows Explorer as a regular, non-administrator user. (Run install.sh if you are using Mac.)
After installation, run start-webui.bat. (Run start-webui.sh if you are using Mac.)
Open your web browser and go to http://localhost:7860

(If another Web UI is already running, this one will be hosted on a different port, such as localhost:7861, localhost:7862, and so on.)
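To find out which port is likely to be used when 7860 is taken, you can probe the ports yourself. The following is a minimal sketch of that kind of search, not the logic the WebUI itself uses; `first_free_port` and its parameters are hypothetical names for this example.

```python
import socket


def first_free_port(start=7860, attempts=100, host="127.0.0.1"):
    """Return the first port in [start, start + attempts) that can be bound."""
    for port in range(start, start + attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                # The port is in use (or otherwise unavailable); try the next one.
                continue
            return port
    raise RuntimeError(f"No free port found in {start}-{start + attempts - 1}")


if __name__ == "__main__":
    print(f"http://localhost:{first_free_port()}")
```

The socket is closed again as soon as the check succeeds, so another process can still claim the port between the check and the actual launch.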

You can also run the project with command-line arguments by running user-start-webui.bat; see the wiki for a guide to the arguments.

Using Docker

Build the image

docker build -t whisper-webui:latest .

Run the container

docker run --gpus all -d \
-v /path/to/models:/Whisper-WebUI/models \
-v /path/to/outputs:/Whisper-WebUI/outputs \
-p 7860:7860 \
whisper-webui:latest --server_name 0.0.0.0 --server_port 7860

VRAM Usage

This project is integrated with faster-whisper by default for better VRAM usage and transcription speed.

According to faster-whisper, the efficiency of the optimized whisper model is as follows:

Implementation	Precision	Beam size	Time	Max. GPU memory	Max. CPU memory
openai/whisper	fp16	5	4m30s	11325MB	9439MB
faster-whisper	fp16	5	54s	4755MB	3244MB
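Put as ratios, the figures in the table above work out to roughly 5x faster transcription, about 2.4x less GPU memory, and about 2.9x less CPU memory. The short sketch below reproduces that arithmetic; the helper name `parse_duration` and the dictionary layout are made up for this example.

```python
def parse_duration(text):
    """Convert a duration such as '4m30s' or '54s' to seconds."""
    minutes, _, seconds = text.rpartition("m")
    return int(minutes or 0) * 60 + int(seconds.rstrip("s"))


# Values copied from the table above (fp16, beam size 5).
benchmark = {
    "openai/whisper": {"time": "4m30s", "gpu_mb": 11325, "cpu_mb": 9439},
    "faster-whisper": {"time": "54s", "gpu_mb": 4755, "cpu_mb": 3244},
}

original = benchmark["openai/whisper"]
optimized = benchmark["faster-whisper"]

speedup = parse_duration(original["time"]) / parse_duration(optimized["time"])
gpu_ratio = original["gpu_mb"] / optimized["gpu_mb"]
cpu_ratio = original["cpu_mb"] / optimized["cpu_mb"]

print(f"Speedup: {speedup:.1f}x")
print(f"GPU memory: {gpu_ratio:.2f}x lower")
print(f"CPU memory: {cpu_ratio:.2f}x lower")
```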

If you want to use the original OpenAI Whisper implementation instead of the optimized one, you can set the command-line argument DISABLE_FASTER_WHISPER to True. See the wiki for more information.

Available models

This is Whisper's original VRAM usage table for models.

Size	Parameters	English-only model	Multilingual model	Required VRAM	Relative speed
tiny	39 M	`tiny.en`	`tiny`	~1 GB	~32x
base	74 M	`base.en`	`base`	~1 GB	~16x
small	244 M	`small.en`	`small`	~2 GB	~6x
medium	769 M	`medium.en`	`medium`	~5 GB	~2x
large	1550 M	N/A	`large`	~10 GB	1x

The .en models are English-only, and the cool thing is that you can use the Translate to English option with the "large" models!
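As a rough guide to choosing a model size, the sketch below picks the largest model from the table above whose required VRAM fits a given budget. The function name is hypothetical, and the figures are the approximate values for the original Whisper implementation, so treat the result as a starting point only.

```python
# (size, approximate required VRAM in GB), ordered from smallest to largest,
# copied from the table above.
MODELS = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def largest_model_for(vram_gb, english_only=False):
    """Return the largest model name that fits in vram_gb, or None."""
    fitting = [name for name, required in MODELS if required <= vram_gb]
    if not fitting:
        return None
    name = fitting[-1]
    # There is no English-only variant of the large model.
    if english_only and name != "large":
        return f"{name}.en"
    return name


print(largest_model_for(6))
print(largest_model_for(6, english_only=True))
```

With faster-whisper enabled, actual memory usage is likely to be lower than these figures suggest.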