Fix: avoid double-rescaling of images in the BLIP-2 inference pipeline #11
Solus-sano wants to merge 1 commit into ControlNet:master
Conversation
I use exactly the same image and config, and it seems to work well for me without this change. But I checked in … So I'm not sure if this change will work well for different …
---
Hi — first, my apologies for raising the earlier alarm before I had walked through the full code path.

```python
if len(image) > 0 and 'float' in str(image[0].dtype) and image[0].max() <= 1:
    image = [im * 255 for im in image]
```

I just use `self.qa()` to debug and see the warning. My current versions are:
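For reference, the warning can be reproduced outside the repo with just the Hugging Face image processor. This is a minimal sketch assuming the default BLIP preprocessing settings (rescale_factor = 1/255); resizing and normalization are disabled only to isolate the extra rescale:

```python
import numpy as np
from transformers import BlipImageProcessor

# Feed an image that is already a float array in [0, 1] through the BLIP
# image processor with its default do_rescale=True. Resize/normalize are
# turned off here only to make the effect of the second rescale obvious.
proc = BlipImageProcessor(do_resize=False, do_normalize=False)
img = np.random.rand(224, 224, 3).astype(np.float32)   # already rescaled once

out = proc(images=img, return_tensors="np")
# Depending on the transformers version, this logs the "trying to rescale
# already rescaled images" warning, and pixel values shrink to ~[0, 1/255].
print(out["pixel_values"].max())
```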
But I am encountering difficulties in reproducing the reported results (~26.8% accuracy on OK-VQA, far below the reported 46-48%). Following suggestions from a related issue thread, I optimized the code generation prompt. This revision noticeably reduced the "null" and "continue" outputs, and accuracy improved to 34.2%. However, this is still considerably lower than the expected performance. Upon further investigation, I observed that in several "bad cases" the captions generated by BLIP-2 are inaccurate and deviate significantly from the actual image content. Could you please provide some advice to help me resolve this discrepancy? I am considering several possibilities:
---
Thank you for confirming it. I will close the pull request as it is not required.
For the non-RL version, the results depend heavily on the prompt quality, which affects the performance significantly.
I don't think that will be the problem.
I think yes. Also, I should point out that the reported numbers in the paper are calculated following previous works, i.e. excluding runtime failures. Besides, the non-RL version is a few percentage points worse than the RL version.
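To make that evaluation convention concrete, here is a small illustrative sketch; the record format and field names are hypothetical, not the repo's actual result objects. The point is that runtime failures are dropped from the denominator rather than scored as zero:

```python
# Hypothetical per-sample records; the repo's real result format differs.
results = [
    {"score": 1.0, "runtime_failure": False},
    {"score": 0.0, "runtime_failure": True},   # e.g. the generated program crashed
    {"score": 0.6, "runtime_failure": False},
]

def accuracy(records, exclude_failures=True):
    # Following prior work, samples that fail at runtime are excluded from
    # the denominator instead of being counted as incorrect answers.
    kept = [r for r in records if not (exclude_failures and r["runtime_failure"])]
    return sum(r["score"] for r in kept) / max(len(kept), 1)

print(accuracy(results))                          # 0.8   (failures excluded)
print(accuracy(results, exclude_failures=False))  # ~0.53 (failures count as wrong)
```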
While validating OK-VQA results, I noticed that BLIP-2 sometimes produces captions that are clearly unrelated to the input image.
For example, when using BLIP-2 to caption the first image (COCO_val2014_000000297147.jpg) in OK-VQA, the result is "a black and white image of a sculpture".
After tracing the preprocessing steps, I found that the same image is rescaled twice: the images already arrive as float arrays in [0, 1], and the Hugging Face image processor then multiplies them by 1/255 again.
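Written out on a random image, the arithmetic looks like this (a standalone sketch assuming the processor's default CLIP normalization constants; in the real pipeline these steps happen in the data loading code and inside the HF image processor, not in one place):

```python
import numpy as np

mean = np.array([0.48145466, 0.4578275, 0.40821073])   # CLIP image mean
std = np.array([0.26862954, 0.26130258, 0.27577711])   # CLIP image std

img = np.random.rand(224, 224, 3)      # first rescale: floats already in [0, 1]
pixels = img * (1.0 / 255.0)           # second rescale applied by the processor
normed = (pixels - mean) / std         # normalization step

print(pixels.max())                    # ~0.004 instead of ~1.0
print(normed.min(), normed.max())      # nearly constant, close to -mean/std
```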
The model therefore “sees” an almost-black image, which explains the degraded caption quality. Transformers even emits a warning about trying to rescale already rescaled images.
Proposed Change
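The change is the guard quoted earlier in the thread: convert float inputs that are already in [0, 1] back to the 0-255 range before they reach the processor, so its 1/255 rescale is applied only once. A sketch of how it sits in a caption call (the checkpoint name and surrounding code are illustrative, not the repo's actual wrapper):

```python
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Illustrative setup; the repo's own BLIP-2 wrapper and checkpoint may differ.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

def caption(image):
    # `image` is expected to be a list of HWC numpy arrays.
    # Proposed guard: if images arrive as floats already scaled to [0, 1],
    # bring them back to [0, 255] so the processor's 1/255 rescale is not
    # applied a second time.
    if len(image) > 0 and 'float' in str(image[0].dtype) and image[0].max() <= 1:
        image = [im * 255 for im in image]
    inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
    out = model.generate(**inputs)
    return [c.strip() for c in processor.batch_decode(out, skip_special_tokens=True)]
```

An alternative with the same effect would be to pass `do_rescale=False` to the image processor whenever the inputs are already in [0, 1], but the guard above keeps the calling code unchanged for uint8 inputs.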