Interpretation of scores

#2
by sanderland - opened

What is the meaning of the output scores?
I'm asking because they seem around 4-5 times as high as expected given a Bradley-Terry interpretation.

Hi,

The output score is a linear projection of the model's last hidden state. This is a mapping from R^D to R, where D is the size of the last hidden state, so the score itself is unbounded. The Bradley-Terry loss, -log(sigmoid(margin)), where the margin is the chosen score minus the rejected score, takes values in (0, inf). With a large positive margin the loss approaches 0, and with a large negative margin it grows without bound.
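To make the loss behaviour concrete, here is a minimal sketch of -log(sigmoid(margin)) in plain Python. The function name and the sample margins are illustrative assumptions, not taken from the Skywork training code; the loss is evaluated in its softplus form to avoid overflow at large margins.

```python
import math


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(margin)), with margin = chosen - rejected.

    Hypothetical helper for illustration only.
    Uses the identity -log(sigmoid(m)) = softplus(-m)
    = max(-m, 0) + log1p(exp(-|m|)), which is stable for large |m|.
    """
    margin = score_chosen - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Illustrative margins only: the loss shrinks toward 0 as the margin grows
    # and grows roughly linearly when the margin is strongly negative.
    for margin in (-30.0, -5.0, 0.0, 5.0, 30.0):
        print(f"margin={margin:+6.1f}  loss={bradley_terry_loss(margin, 0.0):.6g}")
```

At a margin of 0 the loss is log(2); nothing in the loss itself bounds how large the margin can become, which is why the raw score scale can differ between models.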

Can you clarify the "Bradley-Terry interpretation" and the source of this interpretation?

Bradley-Terry interpretation being sigmoid( s_a - s_b ) = p(a is the better completion)
Simply plotting a histogram of s_a - s_b over a large dataset shows the differences to be 4-5 times larger than for other reward models. I suppose this could be an artifact of your data, but I wanted to check that no scaling factor is applied, as your technical report is not out yet.
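A minimal sketch of this interpretation and of the scale comparison, assuming scores are available as (s_a, s_b) pairs. The helper names and the two example score lists are hypothetical and only illustrate how margin scales from two models could be compared; they are not measurements.

```python
import math
import statistics


def preference_probability(s_a: float, s_b: float) -> float:
    """Bradley-Terry probability that a is preferred: sigmoid(s_a - s_b)."""
    margin = s_a - s_b
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)  # avoids overflow for very negative margins
    return e / (1.0 + e)


def margin_scale(pairs):
    """Summarise the spread of s_a - s_b over a list of (s_a, s_b) pairs."""
    margins = [a - b for a, b in pairs]
    return {
        "mean_abs": statistics.fmean(abs(m) for m in margins),
        "stdev": statistics.pstdev(margins),
    }


if __name__ == "__main__":
    # Hypothetical scores for the same three pairs from two reward models.
    model_x = [(1.2, 0.4), (0.1, 0.9), (2.0, 0.5)]
    model_y = [(5.1, 1.3), (0.2, 4.0), (8.4, 1.9)]
    ratio = margin_scale(model_y)["mean_abs"] / margin_scale(model_x)["mean_abs"]
    print(f"mean |margin| ratio (y / x): {ratio:.2f}")
    print(f"p(a preferred), first pair of x: {preference_probability(*model_x[0]):.3f}")
```

If the ratio of mean absolute margins between two models is well above 1, the same pair of completions maps to much more extreme preference probabilities under the model with the larger scale.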

Skywork org

Yes, this should be an artifact of either the dataset or the loss function. We did not apply any post-processing to the final reward or the margin during training.

Some other reward models may use different loss functions and scaling, which results in rewards with a different range.

I know the points of comparison use the Bradley-Terry loss. I suspect it's either your highly filtered dataset making the model somewhat biased to higher score differences, or I am making a mistake in inference. Will give an example soon.

Hmm, it may be due to flash attention or quantization?
'Skywork/Skywork-Reward-Gemma-2-27B-nofa': {'better': -27.5, 'worse': -21.125},
'Skywork/Skywork-Reward-Gemma-2-27B': {'better': -29.125, 'worse': 1.0703125},

Note the roughly 30-logit difference in the attn_implementation="flash_attention_2" case (the second entry). A sample is below, but this happens throughout.
(Mind that the actual accuracy is good in the FA case and much worse otherwise, which is strange!)

 'prompt': [{'role': 'User',
   'content': 'How does Sony plan to expand in the next 5 years?'}],
 'better_completion': "As an AI language model, I extrapolate from my last update in October 2021 and can't provide real-time facts or upcoming plans. Still, you can draw certain insights from recent years about Sony's prospective expansion plans:\n    \n1. Evolving Game Streaming: Sony's PlayStation Now will likely be improved upon and expanded to leverage the growing market of game streaming services.\n\n2. Investment in Robotics & AI: Sony has shown interest in AI and robotics. Their tech subsidiary, Sony AI, launched in 2019, indicates plans for exploring AI's potential across gaming, photography, and gastronomy.\n\n3. Expansion of Film and TV Production: Sony Pictures entertainment continues to invest in original content, digital transformation, and consolidate its position as a distinguished multimedia company.\n\n4. Virtual Reality: Sony Interactive Entertainment has a high interest in expanding its work within the VR sector, shown by the development of new VR hardware for their PlayStation systems.\n\n5. Mobile Gaming: Sony's shift to mobile gaming might continue, adapting more PlayStation games for smartphones.\n\n6. 5G Technology: Sony Mobile is likely to continue developing devices that incorporate 5G technology.\n\n7. Expanding Sensor Business: Sony has been investing in its image sensors widely used in various digital cameras and smartphones.\n\n8. Sustainability Projects: Sony announced that it is planning on becoming zero environmental footprint throughout their product life cycle by 2050.\n\nPlease check Sony's official announcements for the most accurate and latest expansion plans.",
 'worse_completion': "Sony has outlined a few key strategies to expand over the next 5 years:\n\n1. Focus on high-growth business areas: Sony plans to focus its investments and resources on high-growth business areas such as gaming, music, image sensors, and entertainment. These are areas where Sony already has a strong presence and competitive advantage, so continued focus here can drive growth.\n\n2. Expand the global reach: Sony aims to expand its customer base globally, especially in emerging markets like India, the Middle East, Africa, etc. This includes strategies like introducing more locally-relevant products, partnerships, and marketing. Expanding globally can tap into new sources of demand and revenue growth.\n\n3. Pursue acquisitions and strategic partnerships: Sony plans to pursue acquisitions, joint ventures, and strategic partnerships as a way to gain new technologies, intellectual property, and customer access. Recent examples include Sony's partnership with Insomniac Games and acquisition of EMI Music Publishing. Acquisitions and partnerships are a key way for Sony to accelerate its growth strategies.\n\n4. Strengthen digital and services: Sony aims to leverage its content and hardware to build out a strong services and digital component of its business. This includes areas like paid digital subscriptions, advertising, e-commerce, as well as bundled hardware and content packages. Focusing on digital and services can diversify Sony's revenue streams and capture new sources of value. \n\n5. Continue cost optimization: Underlying these growth strategies, Sony aims to continue optimizing its cost structure through initiatives like supply chain management, reducing operating expenses, leveraging common platforms and components across devices, and restructuring efforts. Cost optimization can boost Sony's overall profitability and fund its growth investments.\n\nSo in summary, expect Sony to focus on gaming, music and entertainment; expand in emerging markets; pursue strategic partnerships and acquisitions; build out digital services; and optimize its operational costs over the next 5 years. The goal is to tap into new demand sources to drive revenue growth and higher profitability.",

We have a note of this issue in the model card, which suggests using flash attention 2 or eager instead of the default sdpa.

Also see: https://github.com/allenai/reward-bench/issues/163

Just for clarity, although accuracy is better with flash attention, the extreme score differences also appear in this case, such as the 30 logit difference above.
This corresponds to 99.9999999999923% certainty in favour of preferring the 'worse' completion, which seems extreme, even if this rewardbench sample isn't the best.
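For reference, a short sketch of how that figure follows from the scores reported earlier in this thread under the Bradley-Terry reading. The helper name is an assumption; the score values are the ones quoted above, and the complement is computed directly so that the tiny residual probability is not lost to rounding.

```python
import math

# Scores quoted earlier in this thread for the same sample.
reported = {
    "Skywork/Skywork-Reward-Gemma-2-27B-nofa": {"better": -27.5, "worse": -21.125},
    "Skywork/Skywork-Reward-Gemma-2-27B": {"better": -29.125, "worse": 1.0703125},
}


def implied_preference(scores):
    """Return (margin, P(worse preferred), P(better preferred)).

    Hypothetical helper: margin = s_worse - s_better, P = sigmoid(margin).
    Assumes margin >= 0, which holds for both entries above.
    """
    margin = scores["worse"] - scores["better"]
    p_better = math.exp(-margin) / (1.0 + math.exp(-margin))
    p_worse = 1.0 / (1.0 + math.exp(-margin))
    return margin, p_worse, p_better


if __name__ == "__main__":
    for name, scores in reported.items():
        margin, p_worse, p_better = implied_preference(scores)
        print(f"{name}: margin={margin:.4f}  "
              f"P(worse preferred)={p_worse:.16f}  P(better preferred)={p_better:.3e}")
```

Both configurations rank the 'worse' completion higher on this sample; the difference is in how extreme the implied probability is.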
