This model cannot stop generating.
Is it only me?
Which library/software are you using? If it's LM Studio, you need to update to the latest version. If it's Llama.cpp, you should follow the template correctly: https://huggingface.co/MaziyarPanahi/Meta-Llama-3-8B-Instruct-GGUF/discussions/5
Your models don't stop generating. I tested the 8B Q6_K and Q8_0 as well as the 70B Q2_K. NousResearch also fixed their tokenizer later on to address this issue with the Llama 3 architecture. Maybe something from the llama.cpp commits may help.
Most quantized models by other authors do not stop in their output, except for the recently quantized ones. LM Studio's models work fine. I didn't test NousResearch's quants, as they only had 4-bit+ models.
I cannot reproduce this in LM Studio unfortunately. Also, others don't have any problem with it: https://huggingface.co/MaziyarPanahi/Meta-Llama-3-8B-Instruct-GGUF/discussions/5
Thanks for letting me know this.
I am already using the Llama 3 template from LM Studio, the same one you shared.
I will download the model again and let you know if the issue persists.
I downloaded the model again and experienced the same issue. I don't know why this is happening, but the "assistant" at the end and the consequent non-stop chatter make this model unusable. I hope the open-source community finds a fix for this novel Llama 3 architecture.
P.S. Can't wait for its finetunes.
Yes, maybe LM Studio is OK; I am using llamafile.
This reminds me of the older version of LM Studio with a bad Llama-3 prompt. Could you please show me your Stop Strings and LM Studio version? I would change the GGUF metadata to include that fix; it's just that I cannot reproduce this on my side, so I can't be 100% sure I fixed anything.
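For reference, the change I have in mind is pointing tokenizer.ggml.eos_token_id at the <|eot_id|> token. Here is a minimal sketch of that edit using the gguf Python package that ships with llama.cpp (the file name is a placeholder; llama.cpp's gguf_set_metadata.py script does essentially the same thing):

```python
# Minimal sketch: patch a GGUF's EOS token id in place so generation stops
# at <|eot_id|> (id 128009) instead of <|end_of_text|> (id 128001).
# Assumes `pip install gguf`; the model path below is a placeholder.
from gguf import GGUFReader

# "r+" memory-maps the file writable, so the edit lands directly on disk.
reader = GGUFReader("Meta-Llama-3-8B-Instruct.Q8_0.gguf", "r+")

field = reader.get_field("tokenizer.ggml.eos_token_id")
print("current EOS id:", field.parts[field.data[0]][0])

field.parts[field.data[0]][0] = 128009  # <|eot_id|>
print("new EOS id:", field.parts[field.data[0]][0])
```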
It is the latest version, and my stop strings were the same as the ones you posted earlier; I am attaching a screenshot. It is understandable that reconfiguring a big model is a lot of work and that being sure is important. I am also looking forward to making this work.
You can see the version in the About section. I just made this quick demo to demonstrate that everything works properly once the appropriate prompt template is set:
https://colab.research.google.com/drive/1HD-_evvGo1l1B-imVfQP7BKfDe-7BbE-?usp=sharing
It gives me an error:
{
  "cause": "(Exit code: 42). Unknown error. Try a different model and/or config.",
  "suggestion": "",
  "data": {
    "memory": {
      "ram_capacity": "47.80 GB",
      "ram_unused": "40.07 GB"
    },
    "gpu": {
      "type": "AmdOpenCL",
      "vram_recommended_capacity": "33.74 GB",
      "vram_unused": "33.74 GB"
    },
    "os": {
      "platform": "win32",
      "version": "10.0.22631",
      "supports_avx2": true
    },
    "app": {
      "version": "0.2.20",
      "downloadsDir": "E:\\AI"
    },
    "model": {}
  },
  "title": "Error loading model."
}
Reinstalled LM Studio; problem solved.
I also faced issues with this version: it keeps generating text without stopping.
After some research, it seems that other quants have encountered the same problem, and it was resolved by the GGUF creator modifying the end token.
With the model from the link below, the text generation stops as expected.
Link: https://huggingface.co/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF
Hi all, adding the string "assistant" (without the quotation marks) to Stop Strings in LM Studio works for me with the Llama 3 preset. It prevents the model from going on and on in my case.
@zerosystem Yes, but now you can't have the word "assistant" in your messages. It's a temporary workaround, not a sustainable one; the best thing is to be able to use the special tokens.
Yes, please be careful about adding stop strings that are not meant to be used to stop the stream.
Here you can see that, using the latest Llama.cpp and any of the GGUFs here, you get a stream that stops: https://colab.research.google.com/drive/1HD-_evvGo1l1B-imVfQP7BKfDe-7BbE-?usp=sharing
However, I can change the metadata for one or two models and see if that fixes it for those who couldn't find a way around it. Who can help me test this so I can safely do it for all?
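If you are scripting against these GGUFs, a safer pattern than a plain-word stop string is to let the runtime apply the chat template and stop at the end-of-turn token itself. A minimal sketch with llama-cpp-python (the model file name is a placeholder):

```python
# Minimal sketch: rely on the chat template and the EOS token instead of a
# plain-word stop string like "assistant", which would also truncate any
# legitimate reply that contains that word.
# Assumes `pip install llama-cpp-python`; the model path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="Meta-Llama-3-8B-Instruct.Q8_0.gguf", n_ctx=4096)

resp = llm.create_chat_completion(
    messages=[{"role": "user", "content": "In one sentence, what does an assistant do?"}],
    max_tokens=128,
)
print(resp["choices"][0]["message"]["content"])
```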
I went ahead and made the change, and I am uploading the models again. I realize that many users work around this at the application level; most don't even notice it. But some just cannot get around it, so I am re-uploading for them. Hope it helps.
This now works perfectly:
../apps/fine-tuning/quantize/gguf/llama.cpp/main -m Meta-Llama-3-8B-Instruct.Q2_K.gguf -p "<|start_header_id|>user<|end_header_id|>\n\nBuilding a website can be done in 10 simple steps:\nStep 1:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -n 1024
<|start_header_id|>user<|end_header_id|>\n\nBuilding a website can be done in 10 simple steps:\nStep 1:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nYou're talking about building a website! That's a great topic. Here's the rest of the text:
Building a website can be done in 10 simple steps:
Step 1: Define Your Purpose
* Determine the main purpose of your website
* Identify your target audience
* Decide what you want to achieve with your website
Step 2: Choose a Domain Name
* Research domain name options
* Check if the domain name is available
* Register your domain name
Step 3: Design Your Website
* Choose a website builder or CMS
* Create a wireframe of your website's structure
* Design your website's layout
Step 4: Develop Your Content
* Create high-quality content for your website
* Plan out your content in advance
* Create content calendar
Step 5: Create a Logo
* Design a logo that represents your brand
* Make sure your logo is simple and recognizable
Step 6: Build Your Website
* Choose a website builder or CMS
* Develop your website using the chosen platform
* Make sure your website is mobile-friendly
Step 7: Add Features and Functionality
* Add features and functionality to your website
* Make sure your website is user-friendly
Step 8: Test and Refine
* Test your website's functionality and features
* Make sure your website is bug-free
* Refine your website's design and features
Step 9: Launch Your Website
* Launch your website to the public
* Promote your website to your target audience
* Monitor your website's traffic and analytics
Step 10: Maintain and Update Your Website
* Keep your website up to date
* Make sure your website is secure and updated with the latest features
* Keep your website relevant to your target audience
That's it! Building a website can be a simple and straightforward process if you follow these steps.<|eot_id|> [end of text]
With the latest llama.cpp build, the team added official support for Llama 3 in convert.py, so no hacks should be needed anymore.
That configuration outlined in the Google Colab is, sadly, not correct.
EOS and BOS should be
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
instead of
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128001 '<|end_of_text|>'
I think simply converting with the most recent convert.py should fix this.
With that configuration there are no issues stopping.
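To check which ids a given GGUF actually carries, here is a minimal sketch using the gguf Python package (the file name is a placeholder):

```python
# Minimal sketch: read the BOS/EOS token ids straight from a GGUF file to
# confirm EOS is 128009 (<|eot_id|>) rather than 128001 (<|end_of_text|>).
# Assumes `pip install gguf`; the model path is a placeholder.
from gguf import GGUFReader

reader = GGUFReader("Meta-Llama-3-8B-Instruct.Q8_0.gguf")

for key in ("tokenizer.ggml.bos_token_id", "tokenizer.ggml.eos_token_id"):
    field = reader.get_field(key)
    print(key, "=", field.parts[field.data[0]][0])
```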
@Dampfinchen Thanks. I have checked; all these new GGUFs are identical to what the latest Llama.cpp conversion produces.
At this point, there shouldn't be any issue (which some experienced) using them anywhere.
Apologies for suggesting the "assistant" stop-string workaround! I am quite new to this stuff and did not think it through.
I am still having problems with the fp16 version not stopping in LM Studio. I am using the Llama 3 preset and have not changed anything.
Hi @zerosystem
If that's the only one that doesn't stop, shall I fix it now and upload?
I am uploading the new fp16 now; it was missed in last night's upload :) You should be able to re-download and use it in about 20 minutes.
Hi @MaziyarPanahi, yes please, thank you! I have tried your Q8_0 GGUF and it works.
@MaziyarPanahi Thanks so much, that is good to hear!
Fantastic! Many thanks for confirming :)
I have deployed my model to an AWS SageMaker endpoint and I'm facing this issue; can you please help me out here? I'm talking about the not-stopping issue.
Thanks
Please make sure:
- everything you are using is up to date, especially Llama.cpp
- follow the template exactly
Example:
<|start_header_id|>user<|end_header_id|>\n\nBuilding a website can be done in 10 simple steps:\nStep 1:<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
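If you are building the prompt string yourself (e.g., for a raw text-generation endpoint), here is a minimal sketch of assembling that template in Python; the helper name is just for illustration:

```python
# Minimal sketch of assembling the Llama 3 Instruct template shown above.
# The function name is illustrative; adapt it to your serving stack.
def build_llama3_prompt(user_message: str, system: str | None = None) -> str:
    parts = []
    if system:
        parts.append(f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>")
    parts.append(f"<|start_header_id|>user<|end_header_id|>\n\n{user_message}<|eot_id|>")
    # End with an open assistant header so the model writes the reply and
    # terminates it with <|eot_id|>.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

print(build_llama3_prompt("Building a website can be done in 10 simple steps:\nStep 1:"))
```

Note that llama.cpp usually prepends the <|begin_of_text|> BOS token itself, which is why the example above omits it; other stacks may need it added explicitly at the start.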