EuroSAT Landcover Classification using CLIP

CLIP (Contrastive Language–Image Pretraining) is a neural network model developed by OpenAI that learns a shared embedding space for images and text, so it can score how well a piece of text describes an image. It matches images with text rather than generating either one.

According to the paper, CLIP builds on a large body of work on zero-shot transfer, natural language supervision, and multimodal learning.

CLIP uses an abundantly available source of supervision: the text paired with images found across the internet. Given an image, CLIP's task is to predict which of a set of 32,768 randomly sampled text snippets was actually paired with it in the dataset.
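
To make this matching task concrete, here is a minimal sketch for a single image: similarity scores against a set of candidate captions are turned into probabilities with a softmax, and the loss is the negative log-probability of the caption that was really paired with the image. The scores and function names are illustrative assumptions, not values or code from the paper, and CLIP itself computes this over large batches, symmetrically for images and texts.

```python
import math


def softmax(scores):
    """Convert raw similarity scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matching_loss(scores, true_index):
    """Cross-entropy loss for picking the caption paired with the image."""
    probs = softmax(scores)
    return -math.log(probs[true_index])


# Hypothetical similarity scores between one image and four candidate captions.
scores = [2.0, 0.5, -1.0, 0.1]

probs = softmax(scores)
predicted = max(range(len(probs)), key=probs.__getitem__)
print("predicted caption index:", predicted)
print("loss if caption 0 is correct:", round(matching_loss(scores, 0), 4))
print("loss if caption 2 is correct:", round(matching_loss(scores, 2), 4))
```

The loss is small when the true caption already has the highest score and large otherwise, which is what pushes matching image and text embeddings together during training.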

ViT-B/32

In this project, we fine-tune CLIP with a ViT-B/32 image encoder. ViT-B/32 is a specific variant of the Vision Transformer (ViT), an architecture that applies the transformer model, originally designed for natural language processing, to computer vision tasks. Here are some details about ViT-B/32:

Architecture:
- Transformer Backbone: ViT uses a transformer architecture, which relies on self-attention mechanisms to process input data.
- Patch Embeddings: The input image is divided into fixed-size patches, which are then linearly embedded into a sequence of vectors.
- Position Embeddings: Since the transformer model does not inherently understand the order of the patches, position embeddings are added to the patch embeddings to retain spatial information.

Model Information:
- B: The "B" in ViT-B/32 stands for "Base", which indicates the model's scale. ViT models come in various sizes, with Base being moderate compared to the larger variants (Large and Huge).
- 32: This number denotes the size of the patches into which the input image is divided. For ViT-B/32, the image is split into 32x32 pixel patches.
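
As a quick check of what the patch size implies, the sketch below computes the sequence length the transformer sees. It assumes the 224x224 input resolution produced by CLIP's ViT-B/32 preprocessing and one class token prepended to the patch sequence; the function name is illustrative.

```python
def vit_sequence_length(image_size: int, patch_size: int, class_token: bool = True) -> int:
    """Number of tokens a ViT processes for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    return num_patches + (1 if class_token else 0)


# ViT-B/32 at 224x224: 7 patches per side, 49 patches, plus one class token.
print(vit_sequence_length(224, 32))                     # 50
print(vit_sequence_length(224, 32, class_token=False))  # 49
```

EuroSAT images are much smaller than 224x224, so the preprocessing transform returned by clip.load upsamples them before they are split into patches.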

Dataset

The training dataset contains 22,011 images divided into 10 categories. These categories are:

annual crop land
brushland or shrubland
forest
highway or road
industrial buildings or commercial buildings
lake or sea
pasture land
permanent crop land
residential buildings or homes or apartments
river

The test dataset consists of 5,000 images divided into the same 10 categories.

Data Preparation (as suggested by OpenAI)

As mentioned in the paper, when the name of a class is the only information provided to CLIP's text encoder, the lack of context leaves it unable to tell which sense of the word is meant. Hence, a good default template for satellite imagery would be "a satellite photo of a {label}".

After rewriting our ground truth text descriptions with such a template (templates are provided by OpenAI for different datasets here), the class prompts should look like this:

classes = [
    'a centered satellite photo of forest',
    'a centered satellite photo of permanent crop land',
    'a centered satellite photo of residential buildings or homes or apartments',
    'a centered satellite photo of river',
    'a centered satellite photo of pasture land',
    'a centered satellite photo of lake or sea',
    'a centered satellite photo of brushland or shrubland',
    'a centered satellite photo of annual crop land',
    'a centered satellite photo of industrial buildings or commercial buildings',
    'a centered satellite photo of highway or road',
]

Note that 'a centered satellite photo of {label}' is one of many template prompts provided by OpenAI.
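
Rather than typing each prompt by hand, the list can be generated from the class names. This is a small sketch, not necessarily how the notebook builds it; the helper name build_prompts is an assumption, and the labels are the ten categories listed in the Dataset section.

```python
# Class names as listed in the Dataset section.
labels = [
    "annual crop land",
    "brushland or shrubland",
    "forest",
    "highway or road",
    "industrial buildings or commercial buildings",
    "lake or sea",
    "pasture land",
    "permanent crop land",
    "residential buildings or homes or apartments",
    "river",
]

TEMPLATE = "a centered satellite photo of {label}"


def build_prompts(labels, template=TEMPLATE):
    """Fill the prompt template once per class name."""
    return [template.format(label=label) for label in labels]


classes = build_prompts(labels)
for prompt in classes:
    print(prompt)
```

Keeping the template in one place makes it easy to swap in another of the provided prompts and compare results. Whatever order is used, the predicted index must be mapped back through the same list.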

Installing Dependencies

!pip install transformers
!pip install torch==1.7.1 torchvision
!pip install ftfy regex tqdm
!pip install git+https://github.com/openai/CLIP.git

Pre-Trained Model

To run inference on the pre-trained model, run the following script:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    
    logits_per_image, logits_per_text = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print("Label probs:", probs)  # prints: [[0.9927937  0.00421068 0.00299572]]

#source: https://github.com/openai/CLIP

Running the pre-trained model on the EuroSAT dataset in this way, with the class prompts above as the candidate texts, gave an accuracy of 42.18%.
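
The script above scores a single image, and the evaluation loop behind the accuracy figure is not shown in this README. One plausible way to get such a number is to collect the predicted class index for every image and compare it with the true index. The helper below is a sketch under that assumption; the name top1_accuracy and the sample values are hypothetical.

```python
def top1_accuracy(predicted, actual):
    """Fraction of samples whose predicted class index equals the true one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty dataset")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical class indices for six images (not real EuroSAT results).
predicted = [0, 2, 2, 5, 9, 1]
actual = [0, 2, 3, 5, 9, 4]

print(f"accuracy: {top1_accuracy(predicted, actual):.2%}")
```

In practice, predicted would hold the argmax of logits_per_image for each image, and actual the position of its ground-truth label in the prompt list.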

Fine Tuning

Next, I fine-tuned the model on this dataset. The code can be found in the Jupyter notebook.

After fine-tuning, the accuracy of the model rose to 73.76%.

The model was trained for 16 epochs (half the number mentioned in the paper) on an L4 GPU.

The fine-tuned model is saved and can be used to run inference.

To run inference using the fine-tuned model, use the following script:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
# load the fine-tuned weights onto the same device as the model
model.load_state_dict(torch.load("euroSATclip.pt", map_location=device))

classes = ["a centered satellite photo of annual crop land",
          "a centered satellite photo of forest",
          "a centered satellite photo of lake or sea",
          "a centered satellite photo of pasture land",
          "a centered satellite photo of permanent crop land",
          "a centered satellite photo of river",
          "a centered satellite photo of residential buildings or homes or apartments",
          "a centered satellite photo of industrial buildings or commercial buildings",
          "a centered satellite photo of highway or road",
          "a centered satellite photo of brushland or shrubland"]

# load the image to classify
image = Image.open('<image-path>')

# preprocess takes the PIL image directly; it must not be opened a second time
image_encoded = preprocess(image).unsqueeze(0).to(device)
text_encoded = clip.tokenize(classes).to(device)

with torch.no_grad():
    image_features = model.encode_image(image_encoded)
    text_features = model.encode_text(text_encoded)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    
    similarity = (image_features @ text_features.T).squeeze()
    best_match_idx = similarity.argmax().item()
    best_description = classes[best_match_idx]
    print(best_description)
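
The script above prints only the single best match. To see how confident the model is, the cosine similarities can be scaled and passed through a softmax. The sketch below does this in plain Python on a list of similarities, such as the output of similarity.tolist(). The scale of 100 mirrors the multiplier used in OpenAI's zero-shot example and is an assumption here, since the fine-tuned model's learned scale may differ; the function name and sample values are hypothetical.

```python
import math


def rank_labels(similarities, labels, scale=100.0, k=3):
    """Return the k most likely (label, probability) pairs, best first."""
    logits = [scale * s for s in similarities]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical cosine similarities for three classes (not real model output).
labels = ["forest", "river", "pasture land"]
similarities = [0.31, 0.24, 0.29]

for label, prob in rank_labels(similarities, labels, k=2):
    print(f"{label}: {prob:.3f}")
```

Without the scale factor, cosine similarities are so close together that the softmax is nearly uniform, which is why the multiplier matters when reporting probabilities.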

The fine-tuned model can be found in the files section of this repo as "euroSATclip.pt".