First MoE-based DiT video generator with fine-grained experts that specialize across motion, content, and style, scaling model capacity without sacrificing inference efficiency (see the routing sketch after this list).
A single AR-diffusion framework that supports text-to-image, image editing, and text-to-video generation with shared representations and training recipes.
Mammoth25 (20B-A4.5B) delivers 11–15× lower video generation latency than state-of-the-art video generators (e.g., LongCat-Video) while maintaining strong performance on VBench2 with a 60.97% total score.
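For intuition on the fine-grained MoE design mentioned in the first highlight, below is a minimal, hypothetical sketch of token-wise top-k routing over many small experts in a transformer FFN, the kind of layer an MoE-based DiT swaps in for its dense feed-forward. Every name and size here (`FineGrainedMoEFFN`, `num_experts=64`, `top_k=4`, etc.) is an illustrative assumption, not MammothModa's actual implementation.

```python
# Illustrative sketch only -- not MammothModa's code. All sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedMoEFFN(nn.Module):
    """Token-wise top-k routing over many small ('fine-grained') experts."""
    def __init__(self, dim=1024, num_experts=64, expert_dim=256, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)
        # Many small expert FFNs instead of a few large ones.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, expert_dim), nn.GELU(), nn.Linear(expert_dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (batch, tokens, dim)
        scores = self.router(x)                          # (B, T, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # route each token to k experts
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[..., k] == e                  # tokens sent to expert e at slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

x = torch.randn(2, 16, 1024)
print(FineGrainedMoEFFN()(x).shape)  # torch.Size([2, 16, 1024])
```

Because each token activates only `top_k` small experts, total parameter count can grow while per-token compute stays close to that of a much smaller dense model; the "20B-A4.5B" name above suggests roughly 20B total parameters with about 4.5B active per token.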
The images and videos used as insertion conditions in these examples are sourced from publicly available channels or generated by models, and are intended solely to demonstrate the capabilities of this research. If you have any concerns, please contact us (lijunqiao.123@bytedance.com) and we will remove the relevant examples promptly.
@misc{bytedance_MammothModa_2025,
  title        = {MammothModa: A Unified AR-Diffusion Framework for Understanding and Generation},
  author       = {ByteDance Research},
  year         = {2025},
  howpublished = {\url{https://github.com/bytedance/mammothmoda}},
  note         = {Code and models}
}