Failure to run inference when hosting the model with Gradio
Hi all, I had successfully run inference with TaiwanLLaMa by loading the model directly in a Jupyter notebook. However, when I tried to host the model as an API with Gradio (on 4x NVIDIA A100 GPUs), the status wheel just kept spinning and never produced any output. Since the same code works with Llama-2-7b-chat-hf, I wonder whether there is any architectural change in TaiwanLLaMa that prevents us from using the same code to get output. Thanks!
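For reference, this is roughly how we load and run the model in the notebook; the local path, dtype, and generation settings below are illustrative rather than the exact notebook code:

```python
# Minimal sketch of the notebook-style inference that works for us.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./taiwan-llama-checkpoint"  # hypothetical local checkpoint directory
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",  # spread the weights across the 4x A100 GPUs
)

prompt = "Hello, please introduce yourself."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```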
I have the same question, thanks!
There is no architectural change in TW LLM. Could you provide the Gradio script?
The script is listed here:
https://github.com/johannchu/taiwan-llama-on-gradio/blob/main/demo.py
To avoid downloading the model every time, we saved the model checkpoint and load it locally. To save bandwidth, we did not upload the checkpoint to the repo.
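The save-and-reload step looks roughly like the sketch below; the Hub id and local path are placeholders, not the exact ones we use:

```python
# One-time download and local save, so later runs load from disk.
from transformers import AutoModelForCausalLM, AutoTokenizer

hub_model_id = "ORGANIZATION/taiwan-llama"   # placeholder Hub id
local_dir = "./taiwan-llama-checkpoint"      # placeholder local directory

tokenizer = AutoTokenizer.from_pretrained(hub_model_id)
model = AutoModelForCausalLM.from_pretrained(hub_model_id)
tokenizer.save_pretrained(local_dir)
model.save_pretrained(local_dir)
# Afterwards the app loads with AutoModelForCausalLM.from_pretrained(local_dir).
```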
Also, to make sure there are no mistakes in the Gradio app itself, we added a "get_random_answer()" function to the script for testing. When "respond()" calls "get_random_answer()", the web app displays the chat content normally. However, when we switch to "get_reply_from_llm()" (i.e. the function that runs the LLaMa inference), the web app just keeps spinning and never responds.
What's interesting is that if we call "get_reply_from_llm()" from a plain script with the prompt hardcoded (instead of hosting the whole thing as an app), it works fine, albeit with a longer waiting time.
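For context, here is a simplified sketch of how the relevant pieces of demo.py fit together. Only the function names ("respond()", "get_random_answer()", "get_reply_from_llm()") match the actual script; the function bodies, the model path, and the UI wiring below are illustrative:

```python
import random
import torch
import gradio as gr
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical local checkpoint path; the real script loads our saved checkpoint.
model_path = "./taiwan-llama-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

def get_random_answer(message):
    # Dummy backend used only to verify the Gradio plumbing.
    return random.choice(["Hello!", "Sure.", "Could you elaborate?"])

def get_reply_from_llm(message):
    # Real backend: tokenize, generate, and decode with the loaded model.
    inputs = tokenizer(message, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=512)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def respond(message, chat_history):
    # Swapping the next line to get_random_answer(message) makes the app respond,
    # so the hang only appears on the model.generate() path.
    reply = get_reply_from_llm(message)
    chat_history.append((message, reply))
    return "", chat_history

with gr.Blocks() as demo:
    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    msg.submit(respond, [msg, chatbot], [msg, chatbot])

demo.launch()
```

With "get_random_answer()" swapped in, this structure behaves normally, which is why we suspect the problem is in the inference call rather than in the Gradio wiring itself.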