MammothModa

A Unified AR-Diffusion Framework for Visual Understanding and Generation

Equipped with a Frontier Mixture-of-Experts DiT for Video Generation
GitHub Tech Report
Experimental Highlights
MammothModa is built around a Mixture-of-Experts DiT backbone that unifies multimodal understanding and generation within a single AR-Diffusion architecture.
MoE DiT architecture and expert ablations
MoE DiT Architecture

First MoE-based DiT video generator with fine-grained experts that specialize over motion, content, and style, enabling scalable capacity without sacrificing efficiency.
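
As a rough illustration of how a Mixture-of-Experts layer can add capacity without running every expert on every token, the sketch below implements generic top-k routing in plain Python. It is not MammothModa's implementation: the helper names (`top_k_route`, `moe_layer`), the toy router weights, the scaling "experts", and the choice of `k=2` are all assumptions made for this example.

```python
import math

def top_k_route(logits, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)          # subtract max for numerical stability
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(token, router, experts, k=2):
    """Route one token vector to k experts and mix their outputs by gate weight."""
    # One router row per expert; the logit is a dot product with the token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router]
    out = [0.0] * len(token)
    for idx, weight in top_k_route(logits, k):
        expert_out = experts[idx](token)        # only the selected experts run
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out

# Hypothetical toy setup: four "experts" that merely scale their input.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

mixed = moe_layer([2.0, 1.0], router, experts, k=2)
print(mixed)
```

In this toy setup only two of the four experts are evaluated for the token, which is the general sense in which a sparse MoE keeps per-token compute below what the total parameter count would suggest.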

Unified AR-diffusion architecture across modalities
Unified Multimodal Design

A single AR-diffusion framework that supports text-to-image, image editing, and text-to-video generation with shared representations and training recipes.

Inference latency and VBench2 radar comparison
Efficient & Strong on Public Benchmarks

Mammoth25 (20B-A4.5B) delivers 11–15× lower video latency than state-of-the-art video generators (e.g., LongCat-Video) while maintaining strong performance on VBench2, with a total score of 60.97%.
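
The arithmetic behind these figures can be sketched as follows. This assumes the "20B-A4.5B" label follows the common convention of 20B total parameters with 4.5B active per token, which the page does not spell out. The 330-second baseline is a purely hypothetical number chosen to show what an 11–15× speedup would mean; it is not a measured result.

```python
def active_fraction(total_params_b, active_params_b):
    """Share of parameters used per token, given sizes in billions."""
    return active_params_b / total_params_b

def latency_range(baseline_seconds, min_speedup, max_speedup):
    """Latency interval implied by a speedup range over a baseline."""
    return baseline_seconds / max_speedup, baseline_seconds / min_speedup

# Assumed reading of "20B-A4.5B": 20B total, 4.5B active.
frac = active_fraction(20.0, 4.5)
print(f"active share: {frac:.1%}")

# Hypothetical baseline latency of 330 s, for illustration only.
fastest, slowest = latency_range(330.0, 11, 15)
print(f"implied latency: {fastest:.0f} to {slowest:.0f} s")
```

Under that reading, roughly a fifth of the model's parameters would be active for any given token.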

Text-to-Video
MammothModa leverages a MoE DiT architecture to deliver strong, scalable text-to-video generation.
Instruction Following
Depict a scenario where an individual is faced with a difficult choice that tests their principles. The camera captures the weight of the decision,
A close-up shot of a fox cautiously approaching the camera, sniffing at the lens curiously. The camera captures the fox’s inquisitive expression, the twitching of its nose, and the rustling of leaves under its paws.
An aerial shot of a swarm of drones flying in formation, mimicking the behavior of a flock of birds. The camera captures the synchronized movements, the technological design of the drones, and the interplay between natural and artificial intelligence.
A deserted tropical island with crystal-clear waters, dense palm trees, and a sandy beach. The camera slowly pans across the island, revealing a hidden cove with a small, makeshift shelter and a campfire. The animation emphasizes the serene beauty of the natural environment, with attention to the gentle sway of the palm trees and the sparkling reflections on the water.
A close-up shot of a hummingbird hovering in front of a brightly colored flower, its wings a blur of motion. The camera focuses on the precision of its hovering, the quick darting movements, and the delicate sipping of nectar.
A wide shot of a group of friends, all in their late 20s, enjoying a sunny day in a backyard. Some are grilling food, others are sitting at a picnic table, and a couple is playing with a dog. The camera captures the laughter, the food being passed around, and the casual, friendly interactions.
Motion Quality
Show a group of people debating a challenging decision that could have far-reaching consequences. The camera highlights the varying perspectives and moral considerations.
A static close-up shot shows a digital clock with a green number rapidly counting up from 10:47 to 12:45. The entire scene in the video remains unchanged.
The camera remains still, a girl with a ponytail and wearing a yellow dress walks forward with a metal bucket of water in her hand, the background is a garden path, soft afternoon sunlight.
Camera & Scenes
A man runs through a forest with the camera fixed to his chest, showing his frantic expressions and the trees rushing past in a blur.
A high-paced shot of a pack of wolves working together to hunt a deer in a forest. The camera captures their coordinated movements, strategic positioning, and the silent communication among the pack members as they close in on their prey.
The camera remains still, a boy with short blonde hair and wearing a green shirt blew gas into the balloon, the background is a bright living room, soft afternoon light.
More Cases
Additional video generation examples showcasing various scenarios and styles.
Multimodal Editing (Coming Soon)
Multimodal editing capabilities are under active development. Stay tuned for upcoming examples of instruction-guided and multimodal editing with MammothModa2.

Ethical Considerations

The input images and videos used in these examples are sourced from publicly available channels or generated by models, and are intended solely to demonstrate the capabilities of this research. If you have any concerns, please contact us (lijunqiao.123@bytedance.com) and we will promptly remove the relevant examples.

BibTeX

@misc{bytedance_MammothModa_2025,
  title = {MammothModa: A Unified AR-Diffusion Framework for Understanding and Generation},
  author = {ByteDance Research},
  year = {2025},
  howpublished = {\url{https://github.com/bytedance/mammothmoda}},
  note = {Code and models}
}