Potential ways to reduce inference latency on CPU cluster?
What are some potential approaches or techniques that can help reduce inference latency on a CPU cluster?
Hi @TheBacteria, Intel provides an effective LLM quantization tool, Intel Neural Compressor (https://github.com/intel/neural-compressor), to generate low-bit models (e.g., INT4/FP4/NF4 and INT8), and an LLM runtime, Intel Extension for Transformers (https://github.com/intel/intel-extension-for-transformers/tree/main), which demonstrates inference efficiency on Intel platforms by extending the Hugging Face Transformers APIs.
You can also refer to the paper (https://huggingface.co/papers/2311.16133) and the blog post (https://medium.com/intel-analytics-software/efficient-streaming-llm-with-intel-extension-for-transformers-runtime-31ee24577d26).
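For a concrete starting point, below is a minimal sketch of the INT4 weight-only path as a drop-in replacement for the Transformers API, based on the usage shown in the Intel Extension for Transformers README. The model name and generation settings are placeholders, so verify the exact API against the current version of the repo:

```python
# Minimal sketch: INT4 weight-only quantized inference on Intel CPUs
# using the Intel Extension for Transformers drop-in AutoModel class.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"  # placeholder; any HF causal LM

tokenizer = AutoTokenizer.from_pretrained(model_name)
# load_in_4bit=True quantizes the weights to 4-bit and dispatches
# inference to the optimized LLM runtime on Intel platforms.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```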
Fantastic! Thanks so much @lvkaokao for such a helpful response.