    assert len(continuation_enc) <= self.max_length
```

Here are the steps to adapt the original [vLLM](https://github.com/vllm-project/vllm) to ProSparse models.

1. Replace the file [vllm/model_executor/models/llama.py](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py) in the original vLLM with this [file](https://github.com/Raincleared-Song/DejaVu_predictor/blob/main/llama.py).
2. Replace the contents of the original [config.json](https://huggingface.co/SparseLLM/prosparse-llama-2-7b/blob/main/config.json) with this [file](https://github.com/Raincleared-Song/DejaVu_predictor/blob/main/config.json).
3. Set the environment variable `ACT_INFO`. To test the version without activation threshold shifting, run `export ACT_INFO=relu`; to test the version with activation threshold shifting, run `export ACT_INFO=fatrelu_0.01`.
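The steps above can be sketched as shell commands. The local checkout paths below are assumptions; adjust them to wherever you cloned vLLM, the DejaVu_predictor repository, and the ProSparse model:

```shell
# Steps 1-2: swap in the ProSparse-aware model file and matching config
# (paths are illustrative, not fixed by the repositories themselves).
# cp DejaVu_predictor/llama.py vllm/vllm/model_executor/models/llama.py
# cp DejaVu_predictor/config.json prosparse-llama-2-7b/config.json

# Step 3: select the activation variant via ACT_INFO.
export ACT_INFO=fatrelu_0.01   # with threshold shifting; use ACT_INFO=relu for the variant without it
echo "ACT_INFO=$ACT_INFO"
```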

### Inference Acceleration Effects

First, we utilize [PowerInfer](https://arxiv.org/pdf/2312.12456.pdf), a state-of-the-art acceleration framework that leverages activation sparsity. Since its inference speed and accuracy heavily rely on the performance of the activation predictors, we report the activation recall and predicted sparsity (two key metrics for evaluating an activation predictor) as well as the number of tokens generated per second by PowerInfer (with one A100 GPU and sufficient CPUs). The GGUF files and activation predictors for ProSparse-7B are available at [ProSparse-LLaMA-2-7B-GGUF](https://huggingface.co/PowerInfer/prosparse-llama-2-7b-gguf) ([duplicate](https://huggingface.co/SparseLLM/prosparse-llama-2-7b-gguf)) and [ProSparse-LLaMA-2-7B-Predictor](https://huggingface.co/PowerInfer/prosparse-llama-2-7b-predictor) ([duplicate](https://huggingface.co/SparseLLM/prosparse-llama-2-7b-predictor)), respectively.
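To make the two predictor metrics concrete, here is a minimal sketch of how activation recall and predicted sparsity could be computed from boolean activation masks. This is an illustrative definition only (the function name and mask layout are assumptions); the numbers reported here come from PowerInfer itself:

```python
import numpy as np

def recall_and_sparsity(actual_active, predicted_active):
    """Compute activation recall and predicted sparsity from boolean masks.

    Both arguments are boolean arrays of shape (num_tokens, num_neurons):
    actual_active marks neurons that truly fire, predicted_active marks
    neurons the activation predictor flags as firing.
    """
    actual_active = np.asarray(actual_active, dtype=bool)
    predicted_active = np.asarray(predicted_active, dtype=bool)
    # Recall: share of truly active neurons that the predictor also flags.
    recall = (actual_active & predicted_active).sum() / max(actual_active.sum(), 1)
    # Predicted sparsity: share of neurons the predictor marks as inactive
    # (higher means more computation can be skipped).
    sparsity = 1.0 - predicted_active.mean()
    return recall, sparsity

actual = np.array([[True, False, True, False]])
pred = np.array([[True, False, False, False]])
r, s = recall_and_sparsity(actual, pred)
# r == 0.5 (1 of 2 active neurons recovered), s == 0.75 (3 of 4 marked inactive)
```

A predictor with high recall but low predicted sparsity is accurate yet slow; PowerInfer's speedup depends on both being high at once.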