Add optimization tips
#36
by ylacombe (HF staff) · opened
README.md CHANGED
@@ -127,6 +127,48 @@ scipy.io.wavfile.write("bark_out.wav", rate=sampling_rate, data=speech_values.cp
For more details on using the Bark model for inference with the 🤗 Transformers library, refer to the [Bark docs](https://huggingface.co/docs/transformers/model_doc/bark).
### Optimization tips
Refer to this [blog post](https://huggingface.co/blog/optimizing-bark#benchmark-results) to learn more about the following methods and to see a benchmark of their benefits.
#### Get significant speed-ups:
**Using 🤗 Better Transformer**
Better Transformer is an 🤗 Optimum feature that performs kernel fusion under the hood. You can gain a 20% to 30% speed-up with zero performance degradation. Exporting the model to 🤗 Better Transformer takes a single line of code:
```python
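# convert an already-loaded BarkModel in one line (requires 🤗 Optimum)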
model = model.to_bettertransformer()
```
Note that 🤗 Optimum must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/optimum/installation)
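
As a reference point, here's a minimal end-to-end sketch of the above; the checkpoint name matches this repository, but the prompt and the CUDA device are illustrative assumptions:

```python
from transformers import AutoProcessor, BarkModel

processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark").to("cuda")  # assumes a CUDA device

# one-line export to Better Transformer (requires 🤗 Optimum)
model = model.to_bettertransformer()

inputs = processor("Hello, this is a test.").to("cuda")  # illustrative prompt
speech_values = model.generate(**inputs)
```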
**Using Flash Attention 2**
Flash Attention 2 is an even faster, optimized version of the previous attention optimization:
```python
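# `device` is assumed to be a CUDA device, e.g. "cuda"; requires the flash-attn package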
model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16, use_flash_attention_2=True).to(device)
```
Make sure to load your model in half-precision (e.g. `torch.float16`) and to [install](https://github.com/Dao-AILab/flash-attention#installation-and-features) the latest version of Flash Attention 2.
**Note:** Flash Attention 2 is only available on newer GPUs; fall back to 🤗 Better Transformer if your GPU doesn't support it.
#### Reduce memory footprint:
**Using half-precision**
You can speed up inference and reduce your memory footprint by 50% simply by loading the model in half-precision (e.g. `torch.float16`).
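
For instance, a minimal sketch of loading in half-precision (the CUDA device is an assumption):

```python
import torch
from transformers import BarkModel

# loading the weights in float16 roughly halves the memory footprint
model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to("cuda")
```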
**Using CPU offload**
Bark is made up of 4 sub-models, which are called sequentially during audio generation. In other words, while one sub-model is in use, the other sub-models sit idle.
If you're using a CUDA device, a simple way to get an 80% reduction in memory footprint is to offload idle sub-models from the GPU to the CPU. This operation is called CPU offloading, and it takes one line of code:
```python
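# sub-models are moved to the GPU only when they are needed (requires 🤗 Accelerate)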
model.enable_cpu_offload()
```
Note that 🤗 Accelerate must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/accelerate/basic_tutorials/install)
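
These techniques can also be combined; the blog post linked above benchmarks such combinations. As a minimal sketch, assuming 🤗 Optimum and 🤗 Accelerate are both installed:

```python
import torch
from transformers import BarkModel

# half-precision weights + Better Transformer + CPU offloading
model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16)
model = model.to_bettertransformer()
model.enable_cpu_offload()  # handles device placement, so no explicit .to("cuda")
```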
## Suno Usage
You can also run Bark locally through the original [Bark library](https://github.com/suno-ai/bark):