Add optimization tips
#36
by ylacombe (HF staff) · opened
README.md CHANGED
@@ -127,6 +127,48 @@ scipy.io.wavfile.write("bark_out.wav", rate=sampling_rate, data=speech_values.cp
For more details on using the Bark model for inference with the 🤗 Transformers library, refer to the [Bark docs](https://huggingface.co/docs/transformers/model_doc/bark).
### Optimization tips
Refer to this [blog post](https://huggingface.co/blog/optimizing-bark#benchmark-results) to learn more about the following methods and to see a benchmark of their benefits.
#### Get significant speed-ups:
**Using 🤗 Better Transformer**
Better Transformer is an 🤗 Optimum feature that performs kernel fusion under the hood. You can gain a 20% to 30% speed-up with zero performance degradation. Exporting the model to 🤗 Better Transformer takes a single line of code:
```python
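# convert an already-loaded BarkModel in one line (requires 🤗 Optimum)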
model = model.to_bettertransformer()
```
Note that 🤗 Optimum must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/optimum/installation)
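
As a reference point, here's a minimal end-to-end sketch of the above; the checkpoint name matches this repository, but the prompt and the CUDA device are illustrative assumptions:

```python
from transformers import AutoProcessor, BarkModel

processor = AutoProcessor.from_pretrained("suno/bark")
model = BarkModel.from_pretrained("suno/bark").to("cuda")  # assumes a CUDA device

# one-line export to Better Transformer (requires 🤗 Optimum)
model = model.to_bettertransformer()

inputs = processor("Hello, this is a test.").to("cuda")  # illustrative prompt
speech_values = model.generate(**inputs)
```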
**Using Flash Attention 2**
Flash Attention 2 is an even faster, optimized version of the previous attention optimization:
```python
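# `device` is assumed to be a CUDA device, e.g. "cuda"; requires the flash-attn package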
model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16, use_flash_attention_2=True).to(device)
```
Make sure to load your model in half-precision (e.g. `torch.float16`) and to [install](https://github.com/Dao-AILab/flash-attention#installation-and-features) the latest version of Flash Attention 2.
**Note:** Flash Attention 2 is only available on newer GPUs; fall back to 🤗 Better Transformer if your GPU doesn't support it.
#### Reduce memory footprint:
**Using half-precision**
You can speed up inference and reduce your memory footprint by 50% simply by loading the model in half-precision (e.g. `torch.float16`).
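
For instance, a minimal sketch of loading in half-precision (the CUDA device is an assumption):

```python
import torch
from transformers import BarkModel

# loading the weights in float16 roughly halves the memory footprint
model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16).to("cuda")
```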
**Using CPU offload**
Bark is made up of 4 sub-models, which are called sequentially during audio generation. In other words, while one sub-model is in use, the other sub-models sit idle.
If you're using a CUDA device, a simple way to get an 80% reduction in memory footprint is to offload idle sub-models from the GPU to the CPU. This operation is called CPU offloading, and it takes one line of code:
```python
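# sub-models are moved to the GPU only when they are needed (requires 🤗 Accelerate)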
model.enable_cpu_offload()
```
Note that 🤗 Accelerate must be installed before using this feature. [Here's how to install it.](https://huggingface.co/docs/accelerate/basic_tutorials/install)
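
These techniques can also be combined; the blog post linked above benchmarks such combinations. As a minimal sketch, assuming 🤗 Optimum and 🤗 Accelerate are both installed:

```python
import torch
from transformers import BarkModel

# half-precision weights + Better Transformer + CPU offloading
model = BarkModel.from_pretrained("suno/bark", torch_dtype=torch.float16)
model = model.to_bettertransformer()
model.enable_cpu_offload()  # handles device placement, so no explicit .to("cuda")
```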
## Suno Usage
You can also run Bark locally through the original [Bark library](https://github.com/suno-ai/bark):