LLaVA-3D
Table of Contents
- Model Summary
- Use
- Training
- Citation
Model Summary
The LLaVA-3D model is a 7B-parameter model trained on LLaVA-3D-Instruct-1M, based on LLaVA-v1.5-7B.
- Repository: ZCMax/LLaVA-3D
- Project Website: zcmax.github.io/projects/LLaVA-3D
- Paper: LLaVA-3D (arXiv:2409.18125)
- Point of Contact: Chenming Zhu
- Languages: English
Use
Intended use
The model was trained on LLaVA-3D-Instruct-1M and can take a single image as input for 2D tasks and posed RGB-D images for 3D tasks.
Feel free to share your generations in the Community tab!
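A minimal loading sketch, assuming LLaVA-3D keeps the loader interface of the upstream LLaVA codebase (`llava.model.builder.load_pretrained_model`); the `model_name` string is an assumption, so consult the ZCMax/LLaVA-3D repository for the exact entry points and for how posed RGB-D inputs are passed:

```python
# Loading sketch; assumes the LLaVA-3D repo keeps the upstream LLaVA loader
# interface. Verify against ZCMax/LLaVA-3D before use.
from llava.model.builder import load_pretrained_model

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path="ChaimZhu/LLaVA-3D-7B",  # this Hugging Face checkpoint
    model_base=None,
    model_name="llava-3d-7b",           # assumed name string
)
model.eval()  # inference: 2D tasks take a single image, 3D tasks take posed RGB-D frames
```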
Training
Model
- Pretraining Stage: scene-level and region-level caption data, 1 epoch, projector only
- Instruction Tuning Stage: a mixture of 1M high-quality 2D and 3D data, 1 epoch, full model (see the sketch after this list)
- Precision: bfloat16
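The two stages differ mainly in which parameters are updated. The following PyTorch sketch illustrates that freezing pattern; the `mm_projector` module name follows the LLaVA-1.5 convention and is an assumption about LLaVA-3D's internals, not a confirmed attribute name:

```python
import torch

def configure_stage(model: torch.nn.Module, stage: str) -> None:
    """Illustrative freezing pattern for the two training stages above."""
    if stage == "pretrain":
        # Stage 1: only the multimodal projector learns from caption data.
        for name, param in model.named_parameters():
            param.requires_grad = "mm_projector" in name  # assumed module name
    elif stage == "instruction_tuning":
        # Stage 2: the full model is updated on the 1M instruction mixture.
        for param in model.parameters():
            param.requires_grad = True
    model.to(torch.bfloat16)  # the card reports bfloat16 precision
```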
Hardware & Software
- GPUs: 8 × NVIDIA A100 (for training the whole model series)
- Orchestration: Hugging Face Trainer (a configuration sketch follows this list)
- Neural networks: PyTorch
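As a rough illustration of this setup, a Hugging Face `TrainingArguments` configuration matching the card's reported settings (1 epoch, bfloat16) might look as follows; the output path and batch size are placeholders, not values from the card:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./llava-3d-7b",      # hypothetical path
    num_train_epochs=1,              # both stages run for 1 epoch
    bf16=True,                       # bfloat16 precision, as reported
    per_device_train_batch_size=16,  # placeholder, not from the card
)
```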
Citation
@article{zhu2024llava,
  title={LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness},
  author={Zhu, Chenming and Wang, Tai and Zhang, Wenwei and Pang, Jiangmiao and Liu, Xihui},
  journal={arXiv preprint arXiv:2409.18125},
  year={2024}
}