Issue with q8_0
Hello,
I downloaded the q8_0 model and it is giving me a strange response, shown below. I am using text-generation-webui with the chat template set to "Custom (obtained from model metadata)". My other parameters are fairly standard.
AI
How can I help you today?
You
Hello
AI
Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly Blockly [... "Blockly" repeats until the token limit]
Try not offloading to CUDA, or enable flash attention (-fa in llama.cpp).
With flash attention enabled I get the same result. If I set GPU layers to zero and turn tensorcores off, I get this:
AI
How can I help you today?
You
Hello
AI
Blockly is a visual programming language that allows users to create programs using blocks. It is designed to be intuitive and easy to use, making it a popular choice for teaching programming concepts to beginners. Here are some ways Blockly can help you:
Educational Tool: Blockly is often used in educational settings to teach children and adults the basics of programming. It breaks down complex concepts into simple, manageable blocks that are easy to understand. This makes it an excellent tool for learning programming logic, algorithms, and basic syntax.
Interactive Learning: The visual nature of Blockly allows for interactive learning.
Okay, if I set it to CPU-only mode as well, it seems to work now; it's just slow.
Ah okay. I was hoping FA would work, but maybe not. Either way, there's a CUDA bug.
Confirmed working after applying this patch.
I'm using jan.ai, and I get the Blockly error when using CUDA; the CPU works fine, but it's super slow. Is there any fix for CUDA?
I encountered the same issue. I was running qwen2 on ollama, and it returned a bunch of repeating characters. The issue was resolved by adding the environment variable OLLAMA_FLASH_ATTENTION=1. The relevant link is https://github.com/QwenLM/Qwen2?tab=readme-ov-file.
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_FLASH_ATTENTION=1"
[Install]
WantedBy=default.target
Above is my Ollama service configuration. I didn't know how to put multiple settings into one Environment entry, so I added a second Environment line. Maybe that was wrong, but it worked.
Same issue here.
Update Ollama, or enable flash attention (-fa) if you're using llama.cpp directly.