Very flat output? "Probabilities" all close to zero.
Using the sample code, the results look a bit strange: the "probabilities" come out almost uniformly zero. The scoring function looks like a good match for the original, so could there be an issue with the tokenizer somehow?
@Moghrua this behaves very differently than softmax, where the output is forced to sum to 1. In many cases you can end up with a lot of low scores if none of the texts is a great match. I've definitely been able to get scores of .5 all the way to .97. Sometimes .1-.2 is a pretty good match.
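To make that contrast concrete, here's a minimal sketch (the logits are made up for illustration, not real model outputs) of how independent sigmoid scoring differs from softmax:

```python
import numpy as np

# Made-up image-text logits (scale * cosine_sim + bias), one per candidate text.
logits = np.array([-12.0, -9.0, 0.07])

# SigLIP scores each text independently, so nothing forces the outputs to sum to 1.
siglip_probs = 1.0 / (1.0 + np.exp(-logits))

# CLIP-style softmax makes the labels compete and always sums to 1.
shifted = np.exp(logits - logits.max())
softmax_probs = shifted / shifted.sum()

print(siglip_probs.round(3))   # mostly near zero; the sum can be far below 1
print(softmax_probs.round(3))  # sums to 1 regardless of how poor the matches are
```

With sigmoid, two poor matches and one decent one give you two near-zero scores and one middling score; softmax over the same logits would inflate the best label toward 1.0.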
If you cut and paste the provided beignet example, it will output:
Label probabilities: [('a Dog.', 0.0), ('a cat', 0.0), ('a donut', 0.0), ('A Beignet.', 0.517)]
If you suspect any tokenizer issues, you can double-check by comparing with https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/SigLIP_demo.ipynb ... I have done some testing and it seemed to compare well, but there could be texts that don't tokenize the same...
@talrejanikhil I've observed it can be exceedingly fussy/specific as to what's going to yield a high prob ... e.g., twiddling yours a bit:
[('a dog', 0.0), ('a cat on a catfood box', 0.024), ('a catfood box', 0.351), ('a beignet', 0.0)]
So yeah, I think this is the usual behaviour. It also seems a bit sensitive to preprocessing / weight translation; the prob swings relative to the output of the reference jax version can be a bit higher than I'd expect. So when unsure, you could try similar prompts in their notebook...
For example, if you do use softmax, it will obviously push up the probs so that they sum to 1.0:
Label probabilities: [('a dog', 0.02), ('a cat', 0.98), ('a beignet', 0.0)]
Yes, that's true. I actually do miss the high probs that the CLIP model outputs.
@Moghrua @talrejanikhil I have observed the same behavior, so I had to normalize the outputs here: https://huggingface.co/spaces/merve/multilingual-zero-shot-image-clf I guess since the zero-shot accuracy is still better than other models' (as claimed by the paper), you just need to stretch the outputs to actually see that?
@merve do you have code to show how you normalized the outputs?
@talrejanikhil it's pretty much making it add up to one proportionally, nothing fancy, here: https://huggingface.co/spaces/merve/multilingual-zero-shot-image-clf/blob/2958a16dc88a49f703e872fb79af237d544c5a18/app.py#L65
@merve @talrejanikhil FYI, up to some numerical differences, sigmoid + normalizing like this is essentially softmax.
It looks/feels nicer in that everything adding up to 1.0 must be a probability, but it's pretty obvious there's little to no calibration there. In either case, the sigmoid output is probably more closely calibrated w.r.t. what was seen in the training distribution...
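A quick numerical check of that near-equivalence (made-up logits; for the strongly negative logits SigLIP tends to produce, sigmoid(x) = exp(x) / (1 + exp(x)) is approximately exp(x), so normalizing sigmoid outputs lands almost exactly on softmax):

```python
import numpy as np

logits = np.array([-12.0, -9.0, -4.0])  # typical SigLIP range: strongly negative

sig = 1.0 / (1.0 + np.exp(-logits))
normalized = sig / sig.sum()                     # proportional rescale, as in the Space above
softmax = np.exp(logits) / np.exp(logits).sum()  # plain softmax over the same logits

# For x << 0 the two transforms nearly coincide.
print(np.abs(normalized - softmax).max())
```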
Hi guys, I haven't read all of this, but the model being generally "more conservative" is totally expected. As Ross says, the model is not calibrated, because it's a "raw" model. What calibration makes most sense depends on your data/task. I guess we should explain this more somewhere at some point.
The good news is that calibrating it is very easy. If you have a dataset representative of your task, you can simply adjust the bias value (a single scalar!) by hand or grid-search, so that the probabilities look like you prefer them. I've done this many times on many tasks, and it works flawlessly. Actually, our official SigLIP colab even contains an interactive demo that shows this:
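A sketch of what that grid search could look like (everything here is hypothetical: `raw_logits` stands in for the model's pre-bias scale * cosine-similarity scores, and `targets` are 0/1 match labels from your own representative dataset):

```python
import numpy as np

# Hypothetical pre-bias logits and 0/1 match labels for a small held-out set.
raw_logits = np.array([5.1, -2.4, 3.8, -4.9, 1.5, -0.7])
targets    = np.array([1,    0,   1,    0,   1,    0])

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Grid-search a single scalar bias against binary cross-entropy.
best_bias, best_loss = None, np.inf
for bias in np.linspace(-20.0, 5.0, 251):
    p = sigmoid(raw_logits + bias)
    loss = -np.mean(targets * np.log(p + 1e-9)
                    + (1 - targets) * np.log(1 - p + 1e-9))
    if loss < best_loss:
        best_bias, best_loss = bias, loss

print(best_bias)  # the value to plug into the model's logit bias
```

Since it's a single scalar, hand-tuning until the probabilities "look right" for your task works just as well as any optimizer.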
My question would still be: how could we do this in Hugging Face? Is there a way to set the bias parameter?
Actually I figured this out myself. You can do something like this:
```python
import torch
from torch import nn
from transformers import AutoModel, AutoProcessor

model_name = 'google/siglip-so400m-patch14-384'
model = AutoModel.from_pretrained(model_name)
# Set your bias value here (less negative than the learned value -> higher probs):
model.logit_bias = nn.Parameter(torch.tensor([-10.0]))
processor = AutoProcessor.from_pretrained(model_name)
```
This significantly increased the probability values for the example I posted above.
FWIW the same applies to the OpenCLIP variant of the model; once the model is created, `model.logit_bias = nn.Parameter(torch.tensor([-10.0]))` will be equivalent.