---
license: mit
base_model:
  - mistralai/Pixtral-12B-2409
pipeline_tag: image-text-to-text
library_name: transformers
tags:
  - lora
datasets:
  - Multimodal-Fatima/FGVC_Aircraft_train
  - takara-ai/FloodNet_2021-Track_2_Dataset_HF
---

# pixtral_aerial_VQA_adapter

## Model Details

- Type: LoRA adapter (a hedged loading sketch with `peft` follows this list)
- Total Parameters: 6,225,920
- Memory Usage: 23.75 MB
- Precision: torch.float32
- Layer Types:
  - lora_A: 40
  - lora_B: 40
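
A minimal loading sketch with `peft`, assuming the base checkpoint loads as a transformers LLaVA-style Pixtral model; the adapter repo id `takara-ai/pixtral_aerial_VQA_adapter` is inferred from this card's title and may differ.

```python
import torch
from peft import PeftModel
from transformers import AutoProcessor, LlavaForConditionalGeneration

BASE_ID = "Ertugrul/Pixtral-12B-Captioner-Relaxed"   # base model listed under Training Procedure
ADAPTER_ID = "takara-ai/pixtral_aerial_VQA_adapter"  # assumed adapter repo id

processor = AutoProcessor.from_pretrained(BASE_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    BASE_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Attach the LoRA weights (the 40 lora_A / lora_B pairs) on top of the frozen base model.
model = PeftModel.from_pretrained(model, ADAPTER_ID)
model.eval()
```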

## Intended Use

- Primary intended use: processing aerial footage of construction sites for structural and construction surveying.
- Can also be applied to other detailed VQA use cases involving aerial footage (a hedged inference sketch follows below).
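
Continuing from the loading sketch above, a minimal VQA example on a single aerial image; the image path and question are placeholders, and the chat-template usage assumes a recent transformers release where the Pixtral processor supports `apply_chat_template`.

```python
import torch
from PIL import Image

# `processor` and `model` come from the loading sketch above.
image = Image.open("aerial_site.jpg").convert("RGB")  # placeholder aerial photo

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "How many buildings in this scene show visible structural damage?"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=[image], text=prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)

print(processor.decode(output[0], skip_special_tokens=True))
```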

## Training Data

- Datasets (a loading sketch for the two public datasets follows this list):
  1. FloodNet Track 2 dataset
  2. Subset of the FGVC Aircraft dataset
  3. Custom dataset of 10 image-caption pairs created using Pixtral
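
The two public datasets are listed in this card's metadata and can be pulled with the `datasets` library; the split names below are assumptions, so check each dataset card for the actual splits.

```python
from datasets import load_dataset

# Repo ids are taken from the card metadata; split names are assumptions.
floodnet = load_dataset("takara-ai/FloodNet_2021-Track_2_Dataset_HF", split="train")
aircraft = load_dataset("Multimodal-Fatima/FGVC_Aircraft_train", split="train")

print(floodnet)
print(aircraft)
```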

## Training Procedure

- Training method: LoRA (Low-Rank Adaptation); an illustrative `peft` configuration is sketched below
- Base model: Ertugrul/Pixtral-12B-Captioner-Relaxed
- Training hardware: Nebius-hosted NVIDIA H100 machine
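
For illustration only, a `peft` configuration of the kind used for this sort of LoRA fine-tune; the rank, alpha, dropout, and target modules are not stated on this card, so the values below are assumptions rather than the actual training setup.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

base = LlavaForConditionalGeneration.from_pretrained(
    "Ertugrul/Pixtral-12B-Captioner-Relaxed",
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=16,                       # assumed rank
    lora_alpha=32,              # assumed scaling factor
    lora_dropout=0.05,          # assumed dropout
    target_modules=["q_proj"],  # assumed; one target module across the decoder layers
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base, lora_config)
# Reports the trainable (LoRA) parameter count; with the real configuration this
# would correspond to the ~6.2M parameters listed under Model Details.
peft_model.print_trainable_parameters()
```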

## Citation

```bibtex
@misc{rahnemoonfar2020floodnet,
  title={FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding},
  author={Maryam Rahnemoonfar and Tashnim Chowdhury and Argho Sarkar and Debvrat Varshney and Masoud Yari and Robin Murphy},
  year={2020},
  eprint={2012.02951},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  doi={10.48550/arXiv.2012.02951}
}
```