kakaobrain/align-base · Example showing separate embedding retrieval

The model card shows this code as a way to get image embeddings directly:

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

image_embeds = model.get_image_features(
    pixel_values=inputs['pixel_values'],
)

Will this return the ALIGNED embeddings, or the raw embeddings from EfficientNet? I ask because when I run the following code, the image embeddings from get_image_features do not match the image_embeds returned by the full forward pass. I suspect get_image_features returns the unaligned embeddings.

import requests
import torch
from PIL import Image
from transformers import AlignProcessor, AlignModel

processor = AlignProcessor.from_pretrained("kakaobrain/align-base")
model = AlignModel.from_pretrained("kakaobrain/align-base")

# image embeddings
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

image_embeds = model.get_image_features(
    pixel_values=inputs['pixel_values'],
)

# text embeddings
text = "an image of a cat"
inputs = processor(text=text, return_tensors="pt")

text_embeds = model.get_text_features(
    input_ids=inputs['input_ids'],
    attention_mask=inputs['attention_mask'],
    token_type_ids=inputs['token_type_ids'],
)

print(text_embeds.shape, image_embeds.shape)

# combined embeddings
inputs = processor(text=text, images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

comb_text_embeds = outputs.text_embeds
comb_image_embeds = outputs.image_embeds
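
One check that might narrow this down is whether the two results differ only by scale rather than by direction. As far as I can tell, the forward pass L2-normalizes image_embeds and text_embeds before returning them, while get_image_features and get_text_features return unnormalized vectors. That is an assumption about the transformers implementation and worth verifying against the installed version. The sketch below uses plain Python lists so it runs without the model. The helper names (l2_normalize, max_abs_diff, differ_only_by_scale) and the toy vectors are hypothetical, not part of the library.

```python
import math


def l2_normalize(vec):
    """Scale a vector (list of floats) to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def differ_only_by_scale(a, b, tol=1e-4):
    """True if a and b match after L2 normalization, i.e. point in the same direction."""
    return max_abs_diff(l2_normalize(a), l2_normalize(b)) <= tol


# Toy stand-ins for the real embeddings (hypothetical values, not model output).
raw = [3.0, 4.0, 12.0]
normalized = l2_normalize(raw)

print(differ_only_by_scale(raw, normalized))          # same direction, different scale
print(differ_only_by_scale(raw, [3.0, 4.0, -12.0]))   # different direction
```

With the tensors from the code above, the same check would be differ_only_by_scale(image_embeds[0].detach().tolist(), comb_image_embeds[0].detach().tolist()), and likewise for text_embeds and comb_text_embeds. If it returns True, the mismatch is only normalization. If it returns False, the two paths really do produce different embeddings.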