Abstract
GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on four diverse tasks: 1) solving math problems, 2) answering sensitive/dangerous questions, 3) generating code and 4) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 97.6%) but GPT-4 (June 2023) was very poor on these same questions (accuracy 2.4%). Interestingly, GPT-3.5 (June 2023) was much better than GPT-3.5 (March 2023) in this task. GPT-4 was less willing to answer sensitive questions in June than in March, and both GPT-4 and GPT-3.5 had more formatting mistakes in code generation in June than in March. Overall, our findings show that the behavior of the same LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLM quality.
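As a rough illustration of the kind of monitoring the abstract calls for, one could re-score a fixed, labelled set of yes/no questions (for example, "is N prime?") against the responses collected from each snapshot and compare the accuracies. The sketch below is not the paper's evaluation code: the function names `parse_yes_no`, `accuracy`, and `has_drifted`, the rule of taking the last standalone "yes"/"no" as the answer, and the 0.1 threshold are all assumptions made for illustration.

```python
import re
from typing import List, Optional


def parse_yes_no(response: str) -> Optional[bool]:
    """Return True/False for the last standalone yes/no in the text, else None.

    Assumption: the model states its final answer last, e.g. "... [Yes]".
    """
    hits = re.findall(r"\b(yes|no)\b", response.lower())
    if not hits:
        return None
    return hits[-1] == "yes"


def accuracy(responses: List[str], labels: List[bool]) -> float:
    """Fraction of responses whose parsed answer matches the label.

    Unparseable responses count as wrong.
    """
    if not labels or len(responses) != len(labels):
        raise ValueError("responses and labels must be non-empty and equal in length")
    correct = sum(parse_yes_no(r) == y for r, y in zip(responses, labels))
    return correct / len(labels)


def has_drifted(old_acc: float, new_acc: float, threshold: float = 0.1) -> bool:
    """Flag a change in accuracy larger than a (hypothetical) threshold."""
    return abs(new_acc - old_acc) > threshold
```

In use, the same question set would be sent to each dated snapshot, the raw responses stored, and `has_drifted(accuracy(old, labels), accuracy(new, labels))` checked on a schedule. Keeping the raw responses matters, because a change in answer format can look like a change in capability if only the parsed score is kept.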
Community
This is very interesting. I've been wondering to what degree RLHF might have a negative impact on models over time, since the feedback might actually be wrong.
@jasonparker The bigger impact was probably the changes that allowed function calling and greater attention to the system message.
This paper is good for pointing out that model checkpoints have serious drift on some tasks. As one of the biggest changes was attention to the system message, I think this research is seriously limited (of course behavior changed against the same prompts - that was the point of the 0613 update).
EDIT: I incorrectly assumed they conducted the review via the ChatGPT app interface.
For more context on this topic and paper, I found this analysis from Arvind Narayanan and Sayash Kapoor quite interesting as well: https://www.aisnakeoil.com/p/is-gpt-4-getting-worse-over-time
@thomwolf I pay a lot of attention to this very misconception and try to drill it into our team. The engineers have a tendency to think parts of prompts will be "reusable" (a reusable CoT phrase here, a guard clause to help it respond to jailbreak attempts there, etc.) because that's how they are used to building software. They have an instinct to build modular, composable things that add "capability".
But, as the link you shared puts it well, behavior and capability are different.
Monitors trends in performance of GPT-4 and GPT-3.5 (backend LLMs of ChatGPT) from March 2023 to June 2023 on diverse tasks: solving math (GPT-4 decreased and GPT-3.5 increased), answering sensitive/dangerous/controversial questions (GPT-4 decreased, GPT-3.5 slightly increased), generating code (both decreased), and visual reasoning/question-answering (both slightly increased). GPT-4's verbosity has decreased for math problems (CoT might fail for math). Both models have largely refrained from answering sensitive questions (high overlap over timeframe); GPT-3.5 is still susceptible to jailbreaking AIM (Always Intelligent Machiavellian) attacks (act maliciously in a story). Code generation now returns a triple-backtick (markdown syntax) for prompts (maybe remove it manually and check?). GPT-4 is better at visual reasoning; both have high overlap and small improvements. From Stanford, UC Berkeley.
Links: GitHub
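On the "maybe remove it manually and check?" point: a minimal sketch of what that check could look like, assuming the model output is Python and that "directly executable" is approximated by "compiles without a syntax error". The helper names `strip_code_fence` and `is_directly_executable` are hypothetical and are not taken from the paper's code.

```python
import re

# Opening ``` with an optional language tag, then the body, then a closing ```.
FENCE_RE = re.compile(r"```[a-zA-Z0-9_+-]*[ \t]*\n(.*?)```", re.DOTALL)


def strip_code_fence(response: str) -> str:
    """Return the body of the first fenced block, or the text unchanged if none."""
    match = FENCE_RE.search(response)
    if match:
        return match.group(1).strip()
    return response.strip()


def is_directly_executable(source: str) -> bool:
    """Approximate check (an assumption): does the text compile as Python?"""
    try:
        compile(source, "<llm-output>", "exec")
        return True
    except SyntaxError:
        return False


raw = "```python\ndef add(a, b):\n    return a + b\n```"
fenced_ok = is_directly_executable(raw)                      # fences break parsing
stripped_ok = is_directly_executable(strip_code_fence(raw))  # body alone compiles
```

Scoring both the raw and the stripped output separates a formatting change (added markdown fences) from an actual drop in code quality.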