Prompt template
What is the prompt template?
prompt = "USER: write a poem about sky in 300 words ASSISTANT:"
Response:
I'm sorry, but I can't do that. A poem about the sky could take take a a a a a a a a a a a a a a a a (the single token "a" keeps repeating until the output is cut off)
https://github.com/lm-sys/FastChat/blob/main/docs/vicuna_weights_version.md#prompt-template
It requires transformers >= 4.31.0 for the rope scaling.
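For reference, a minimal sketch of both points: building the v1.1-style prompt by hand (the system-prompt wording is taken from the linked doc, so double-check it there) and confirming that the installed transformers actually supports the model's rope scaling config (the exact rope_scaling values come from the model's config.json and are not guaranteed here).

import transformers
from transformers import AutoConfig

# Vicuna v1.5 uses the v1.1-style template from the doc above:
# a system prompt, then alternating "USER:" / "ASSISTANT:" turns.
system = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions."
)
user_msg = "write a poem about sky in 300 words"
prompt = f"{system} USER: {user_msg} ASSISTANT:"
print(prompt)

# LLaMA rope scaling support landed in transformers 4.31.0; older versions do not
# apply the model's rope_scaling config, which matches the repetition seen above.
print(transformers.__version__)   # should be >= 4.31.0
config = AutoConfig.from_pretrained("lmsys/vicuna-13b-v1.5-16k")
print(config.rope_scaling)        # e.g. {"type": "linear", "factor": 4.0} -- check config.json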
I cannot make this prompt work in TGI. It writes a little and starts repeating everything!
input="""USER: Give me a 3 day plan to trip to Paris?"""
Day 1:
* Wake up early in the morning and head to the Eiffel Tower for sunrise.
* After the tower, take a stroll around the the beautiful Champs-Élyséeséesées ("ées" keeps repeating until the output is cut off)
input="""USER: Hi"""
, I'm trying to use the `get_object_or_40()` function in my code, but I'm getting an error message that says says says says says ("says" keeps repeating until the output is cut off)
I think we are missing some special tokens somewhere. (I tried the Llama-2 chat prompt template; it didn't work.)
Same here.
I was thinking it was just me. Thanks for reporting.
I am also experiencing the same problem with the standard HF transformers example:
from transformers import LlamaTokenizer, LlamaForCausalLM

tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-13b-v1.5-16k")
model = LlamaForCausalLM.from_pretrained("lmsys/vicuna-13b-v1.5-16k", device_map="auto")

inputs = tokenizer("How are you?", return_tensors="pt")
generate_ids = model.generate(inputs.input_ids.to("cuda:0"), max_length=16000)

# Decode the generated ids back into text before printing.
output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
Same here. V1.5 works fine, while V1.5-16K keeps repeating nonsense letters after a few words.
Did you use transformers >= 4.31.0?
Thanks
@lmzheng
It seems the latest TGI uses an older transformers
(https://github.com/huggingface/text-generation-inference/blob/main/server/requirements.txt#L53). Let me try a pure CausalLM and I will get back to you.
Did you use transformers >= 4.31.0?
Thanks, it seems the version of the transformers library was the problem. I upgraded it from 4.30.2 to 4.31.0 and the mumbling does not happen anymore.
However, I now sometimes run into OOM errors ("torch.cuda.OutOfMemoryError: CUDA out of memory.") on my GPU (Tesla V100, 31.75 GiB total capacity). Is this related to the memory needed for intermediate activations at the 16K context length?
@lmzheng
It works great with the plain CausalLM path on transformers 4.31.0. I'll wait for TGI to start using 4.31.0.
The memory usage is also pretty decent on small text; however, once I feed it lots of data I also get the same error:
OutOfMemoryError: CUDA out of memory. Tried to allocate 8.85 GiB (GPU 0; 79.19 GiB total capacity; 24.32 GiB
already allocated; 27.62 MiB free; 28.69 GiB reserved in total by PyTorch) If reserved memory is >> allocated
memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and
PYTORCH_CUDA_ALLOC_CONF
Not sure if there is any way around this, but would it be possible to calculate how much memory one needs to use the whole 16,000-token input length? (60 GB? 80 GB?)
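For a rough sense of scale, here is a back-of-the-envelope sketch (the layer count and hidden size are assumed values for a 13B LLaMA-style model, not read from the actual config):

# Rough fp16 memory estimate for a 13B LLaMA-style model at 16K context.
# Assumed architecture numbers; verify against the model's config.json.
num_layers = 40
hidden_size = 5120
bytes_fp16 = 2

weights_gb = 13e9 * bytes_fp16 / 1e9                                      # ~26 GB of weights
seq_len = 16000
kv_cache_gb = 2 * num_layers * hidden_size * seq_len * bytes_fp16 / 1e9   # ~13 GB of KV cache

print(f"weights ~{weights_gb:.0f} GB, KV cache at 16K ~{kv_cache_gb:.0f} GB")
# Attention buffers and other activations come on top of this, so a single
# 32 GB V100 is unlikely to hold the full 16K context; long inputs will need
# a larger card, multiple GPUs, or offloading/quantization.
# PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 only reduces fragmentation;
# it does not lower the total memory requirement.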
@marcelgoya follow this to use the transformers API: https://github.com/lm-sys/FastChat/blob/main/fastchat/serve/huggingface_api.py
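Roughly, that script boils down to something like the sketch below (load_model and get_conversation_template are FastChat helpers used in that file; treat the exact arguments as an approximation):

import torch
from fastchat.model import load_model, get_conversation_template

model_path = "lmsys/vicuna-13b-v1.5-16k"
model, tokenizer = load_model(model_path, device="cuda", num_gpus=1)

# Build the prompt with the model's own conversation template instead of a raw string.
conv = get_conversation_template(model_path)
conv.append_message(conv.roles[0], "Give me a 3 day plan for a trip to Paris?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer([prompt]).input_ids
output_ids = model.generate(
    torch.as_tensor(input_ids).cuda(),
    do_sample=True,
    temperature=0.7,
    max_new_tokens=512,
)

# Strip the prompt tokens and decode only the newly generated part.
output = tokenizer.decode(output_ids[0][len(input_ids[0]):], skip_special_tokens=True)
print(output)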
@lmzheng Will do that, many thanks!
Did you use transformers >= 4.31.0?
My transformers is 4.31.0, but I still have the same problem. How can I fix it?
I realized I don't see the issue with TGI 1.0, but I do with other containers like DJL. I think the issue may be related to RoPE scaling not being implemented.
I can confirm it works flawlessly with a fresh install. I just created a new Linux user on my GPU server, installed everything, and it ran like a charm. The quality is shockingly good. I used the OpenAI API interface to redirect some of my existing scripts to this endpoint and they just worked, even with very complex prompts and contexts :-) Well done, guys!
@rboehme86
Just out of curiosity, would you mind sharing your GPU specs and how much memory it uses when you feed it the full 16k input size?
I tried to use Vicuna-13b-16k with the vLLM worker (a feature of the FastChat library). In that case, it repeats a single word in the output.
To reproduce the error:
python3 -m fastchat.serve.vllm_worker --model-names "gpt-3.5-turbo,text-davinci-003,text-embedding-ada-002" --model-path lmsys/vicuna-13b-v1.5-16k --num-gpus 2
However, it works when I replace "vllm_worker" with "model_worker".