Example showing separate embedding retrieval
#7, opened by schneeman
The model card shows this code as a mechanism to get image embeds directly:
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")
image_embeds = model.get_image_features(
    pixel_values=inputs['pixel_values'],
)
Will this return the ALIGNED embeddings, or the raw embeddings from EfficientNet? I ask because when I run the following code, the image embeddings returned by get_image_features and the image embeddings returned by the combined forward pass do not match. I suspect the former are the unaligned embeddings.
import requests
import torch
from PIL import Image
from transformers import AlignProcessor, AlignModel
processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
model = AlignModel.from_pretrained("kakaobrain/align-base")
# image embeddings
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")
image_embeds = model.get_image_features(
    pixel_values=inputs['pixel_values'],
)
# text embeddings
text = "an image of a cat"
inputs = processor(text=text, return_tensors="pt")
text_embeds = model.get_text_features(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    token_type_ids=inputs['token_type_ids'],
)
print(text_embeds.shape, image_embeds.shape)
# combined embeddings
inputs = processor(text=text, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
comb_text_embeds = outputs.text_embeds
comb_image_embeds = outputs.image_embeds
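To make the mismatch concrete, here is a minimal sketch of how the two sets of embeddings could be compared. It tests one possible explanation, which is an assumption on my part and not something confirmed in this thread: that the combined forward pass returns L2-normalized embeddings while get_image_features and get_text_features return them unnormalized. The helper names l2_normalize and compare_embeds are hypothetical and not part of transformers.

```python
import torch


def l2_normalize(embeds: torch.Tensor) -> torch.Tensor:
    """Scale each row of a (batch, dim) tensor to unit L2 norm."""
    return embeds / embeds.norm(p=2, dim=-1, keepdim=True)


def compare_embeds(
    separate: torch.Tensor, combined: torch.Tensor, atol: float = 1e-5
) -> dict:
    """Compare two embedding batches of the same shape.

    Hypothetical helper: reports whether the tensors match as-is,
    whether they match once both are L2-normalized, and their
    row-wise cosine similarity.
    """
    return {
        "identical": torch.allclose(separate, combined, atol=atol),
        "identical_after_l2_norm": torch.allclose(
            l2_normalize(separate), l2_normalize(combined), atol=atol
        ),
        "cosine_similarity": torch.nn.functional.cosine_similarity(
            separate, combined, dim=-1
        ),
    }
```

With the variables from the code above, this could be called as compare_embeds(image_embeds.detach(), comb_image_embeds) and compare_embeds(text_embeds.detach(), comb_text_embeds). If "identical" is False but "identical_after_l2_norm" is True, the two code paths would differ only by a scale factor, which would suggest the separately retrieved embeddings live in the same space as the combined ones. If both are False, the suspicion that they are unaligned would be harder to rule out.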