Issue with q8_0

#1
by sm54 - opened

Hello,

I downloaded the q8_0 model and it is giving me a strange response, shown below. I am using text generation webui and using the chat template of "Custom (obtained from model metadata)". My other parameters are quite standard.

AI
How can I help you today?

You
Hello

AI
Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly … (the word repeats endlessly)

Qwen org

Try not offloading to CUDA, or enable flash attention (-fa in llama.cpp).
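For reference, a minimal llama.cpp invocation trying both workarounds might look like this (the model filename and prompt are placeholders, not from this thread):

```shell
# Workaround 1: enable flash attention with -fa
./llama-cli -m qwen2-q8_0.gguf -fa -p "Hello"

# Workaround 2: keep all layers on the CPU (no CUDA offload)
./llama-cli -m qwen2-q8_0.gguf -ngl 0 -p "Hello"
```

With -ngl 0 inference runs entirely on the CPU, which avoids the CUDA path at the cost of speed.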

With flash attention enabled I get the same result, and if I set GPU layers to zero and turn tensor cores off, I get this:

AI
How can I help you today?

You
Hello

AI
Blockly is a visual programming language that allows users to create programs using blocks. It is designed to be intuitive and easy to use, making it a popular choice for teaching programming concepts to beginners. Here are some ways Blockly can help you:

Educational Tool: Blockly is often used in educational settings to teach children and adults the basics of programming. It breaks down complex concepts into simple, manageable blocks that are easy to understand. This makes it an excellent tool for learning programming logic, algorithms, and basic syntax.
Interactive Learning: The visual nature of Blockly allows for interactive learning.

Okay, if I set it to CPU-only mode as well, it seems to work now; it just runs slowly.

Qwen org

Ah okay. I was hoping -fa would work, but apparently not. Either way, there's a CUDA bug.

Confirmed working after applying this patch.

I'm using jan.ai. I get the Blockly error when using CUDA; using the CPU works fine, but it's super slow. Is there any fix for CUDA?

I encountered the same issue. I was running Qwen2 on Ollama, and it returned a bunch of repeating characters. The issue was resolved by adding the environment variable OLLAMA_FLASH_ATTENTION=1. The relevant link is https://github.com/QwenLM/Qwen2?tab=readme-ov-file.

[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_FLASH_ATTENTION=1"

[Install]
WantedBy=default.target

Above is my Ollama service configuration. I didn't know how to add multiple settings to a single Environment entry, so I just added a second Environment line. Maybe that's wrong, but it worked.
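For what it's worth, repeating Environment= lines is valid in systemd; the more usual way to add variables to a packaged service, though, is a drop-in override instead of editing the unit file directly. A sketch:

```shell
# Create a drop-in override for the Ollama service
sudo systemctl edit ollama.service
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_FLASH_ATTENTION=1"

# Reload systemd and restart the service so the variable takes effect
sudo systemctl daemon-reload
sudo systemctl restart ollama.service

# Verify the environment the service will see
systemctl show ollama.service -p Environment
```

A drop-in survives package upgrades that overwrite the shipped unit file, which editing /etc/systemd/system/ollama.service directly does not guarantee.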

In LM Studio, after turning off GPU acceleration, this model finally works! Are we really expected not to use GPU acceleration? Other models don't have this problem.
[screenshot: GPU_OFFLOAD.png]

same

Qwen org

Update Ollama, or enable flash attention (-fa) if using llama.cpp.
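On Linux, both suggestions might look like this (the installer command is Ollama's standard install script; inspect it before piping to a shell):

```shell
# Re-run the official installer to update Ollama in place (Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Or enable flash attention for a one-off foreground run,
# without touching the systemd unit
OLLAMA_FLASH_ATTENTION=1 ollama serve
```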

jklj077 changed discussion status to closed
