Potential ways to reduce inference latency on CPU cluster?
What are some potential approaches or techniques that can help reduce inference latency on a CPU cluster?
Hi @TheBacteria, Intel provides an effective LLM quantization tool, Intel Neural Compressor (https://github.com/intel/neural-compressor), to generate low-bit models (e.g., INT4/FP4/NF4 and INT8), and an LLM runtime, Intel Extension for Transformers (https://github.com/intel/intel-extension-for-transformers/tree/main), which demonstrates inference efficiency on Intel platforms by extending the Hugging Face Transformers APIs.
You can also refer to the paper (https://huggingface.co/papers/2311.16133) and the blog post (https://medium.com/intel-analytics-software/efficient-streaming-llm-with-intel-extension-for-transformers-runtime-31ee24577d26).
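For a concrete starting point, below is a minimal sketch of the INT4 weight-only path as a drop-in replacement for the Transformers API, based on the usage shown in the Intel Extension for Transformers README. The model name and generation settings are placeholders, so verify the exact API against the current version of the repo:

```python
# Minimal sketch: INT4 weight-only quantized inference on Intel CPUs
# using the Intel Extension for Transformers drop-in AutoModel class.
from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"  # placeholder; any HF causal LM

tokenizer = AutoTokenizer.from_pretrained(model_name)
# load_in_4bit=True quantizes the weights to 4-bit and dispatches
# inference to the optimized LLM runtime on Intel platforms.
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```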
Fantastic! Thanks so much @lvkaokao for such a helpful response.