Core techniques: (1) a unified grounding loss, (2) language-aware deep fusion, and (3) pre-training with both types of data.
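As a rough illustration of the first technique, the sketch below scores every image region against every text token with a dot product and trains those scores against a binary match matrix. The function names, the toy shapes, and the choice of binary cross-entropy are assumptions made for illustration and do not come from the source. The fusion step and the pre-training data are not modeled here.

```python
import numpy as np

def alignment_logits(region_feats, token_feats):
    """Dot-product score between every region and every token.

    region_feats: (num_regions, dim), token_feats: (num_tokens, dim).
    Returns a (num_regions, num_tokens) matrix of logits.
    """
    return region_feats @ token_feats.T

def grounding_loss(logits, targets):
    """Mean binary cross-entropy on logits (numerically stable form).

    targets[i, j] is 1 if region i matches token j, else 0.
    BCE is an assumption here; other classification losses could be used.
    """
    per_pair = (np.maximum(logits, 0)
                - logits * targets
                + np.log1p(np.exp(-np.abs(logits))))
    return float(per_pair.mean())

# Toy example: 3 regions, 5 prompt tokens, 8-dim features (all hypothetical).
rng = np.random.default_rng(0)
regions = rng.normal(size=(3, 8))
tokens = rng.normal(size=(5, 8))

# Region 0 matches token 1, region 2 matches token 3, region 1 is background.
targets = np.zeros((3, 5))
targets[0, 1] = 1.0
targets[2, 3] = 1.0

loss = grounding_loss(alignment_logits(regions, tokens), targets)
```

Under this reading, a detection dataset and a grounding dataset can share the same loss: class names placed in a text prompt play the role of tokens, so both kinds of annotation reduce to a region-by-token match matrix.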