gokaygokay/Florence-2 · Adding a space between the task prompt and the text input gives better/proper results

Jun 20

I think the code should be changed like this:

    if text_input is None:
        prompt = task_prompt
    else:
-        prompt = task_prompt + text_input
+        prompt = task_prompt + " " + text_input

For example:

test image: http://farm3.staticflickr.com/2386/2532343535_41a2d3a9a0_z.jpg (from coco)
task prompt: Region to Description
text input: man on the back (without a space)

output:

{'<REGION_TO_DESCRIPTION>': 'A woman with a large backpack in an airport terminal.'}

text input: man on the back (with a space prepended)
output:

{'<REGION_TO_DESCRIPTION>': "person on the back of a large green backpack with straps and buckles. \n\nThe backpack appears to be made of a durable material and has multiple pockets and compartments for storage. The straps are adjustable and the buckles are silver. The backpack is resting on a blue and white checkered floor.\n\nThere is a person's leg visible on the right side of the image, but they are not clearly visible. The background is blurred, so it is difficult to make out any other details."}

I think the example code from microsoft/Florence-2-large is wrong.
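To make the proposed change concrete, here is a minimal sketch of the two ways of joining the strings. `build_prompt` and its `sep` parameter are hypothetical names used only for illustration; they are not taken from the Space's code or from the microsoft/Florence-2-large example.

```python
from typing import Optional


def build_prompt(task_prompt: str, text_input: Optional[str] = None, sep: str = " ") -> str:
    """Join a task token and an optional text input (hypothetical helper)."""
    if text_input is None:
        # Tasks without extra input use the task token alone.
        return task_prompt
    # sep="" mirrors the current behaviour, sep=" " the proposed one.
    return task_prompt + sep + text_input


without_space = build_prompt("<REGION_TO_DESCRIPTION>", "man on the back", sep="")
with_space = build_prompt("<REGION_TO_DESCRIPTION>", "man on the back")

# repr() makes the single differing character visible.
print(repr(without_space))
print(repr(with_space))
```

The two prompts differ by one character, so any difference in output comes from how the model's tokenizer treats that boundary.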

gokaygokay

Owner Jun 20

I will test with some pictures and apply your recommendations afterwards. Thanks for feedback.

gokaygokay

Owner Jun 20

edited Jun 20

In "Region to Description" you need to give "BBOX" coordinates like 'loc_52 loc_332 loc_932 loc_774' instead of plain "text input". That's why you are getting different results. For other tasks the results looked identical to me.
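As a rough sketch of how such a coordinate string could be built from a pixel bounding box: the helper below is hypothetical, and it assumes that each coordinate is scaled to an integer bin in the range 0 to 999 relative to the image width or height, and that each token is written as `<loc_N>`. Both are assumptions to verify against the sample notebook; the post above writes the tokens without angle brackets.

```python
from typing import Tuple


def bbox_to_loc_string(
    bbox: Tuple[int, int, int, int],
    image_width: int,
    image_height: int,
    bins: int = 1000,  # assumed number of location bins
) -> str:
    """Convert a pixel box (x1, y1, x2, y2) to a location-token string (hypothetical helper)."""
    x1, y1, x2, y2 = bbox

    def quantize(value: int, size: int) -> int:
        # Floor-scale the pixel value into [0, bins - 1].
        return max(0, min(bins - 1, value * bins // size))

    coords = [
        quantize(x1, image_width),
        quantize(y1, image_height),
        quantize(x2, image_width),
        quantize(y2, image_height),
    ]
    # Assumed token format: one <loc_N> token per coordinate, no separators.
    return "".join(f"<loc_{c}>" for c in coords)


# Example with a made-up 1000x500 image and box.
text_input = bbox_to_loc_string((52, 166, 932, 387), image_width=1000, image_height=500)
print(text_input)
```

The resulting string would then be passed as the text input for the Region to Description task in place of a free-text phrase.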

flrngel

Jun 20

https://huggingface.co/microsoft/Florence-2-large/blob/main/sample_inference.ipynb

I have checked the original code. You are right. Thanks!

flrngel changed discussion status to closed Jun 20