why not include Qwen1.5-MoE-A2.7B in the table?
#4 · opened by J22
IMHO, Qwen1.5-MoE-A2.7B is a SOTA MoE model with ~2B active parameters.
Before comparing, it would be good to know how many tokens the model was trained on and what data was used (including for the original dense model before upcycling). Furthermore, it should be considered concurrent work.