Alive: Animate Your World with Lifelike Audio-Video Generation

Alive Logo

📄 Technical Report  |  🌊 Project Page  |  🎬 Online Demo (Coming Soon)


Overview

Alive is a unified audio-video generation model that adapts pretrained Text-to-Video (T2V) models to audio-video generation and animation. Built on the MMDiT architecture, it achieves industry-grade performance for lifelike audio-video generation and animation.

Key Features

  • 🎬 Unified Audio-Video Generation: Simultaneously supports Text-to-Video&Audio (T2VA) and Reference-to-Video&Audio (animation) within a single framework.
  • ⚙️ Advanced Architecture: Features TA-CrossAttn for temporally-aligned cross-modal fusion and UniTemp-RoPE for precise audio-visual alignment.
  • 📊 High-Quality Data Pipeline: Comprehensive audio-video captioning and quality control for million-scale training data.
  • 🏆 SOTA Performance: Consistently outperforms open-source models and matches or surpasses state-of-the-art commercial solutions.

Demo Video

Alive Demo Video


Benchmark Evaluation

Alive-Bench 1.0

We introduce a comprehensive benchmark for joint audio-visual generation that evaluates model performance along six complementary axes: motion quality, visual aesthetics, visual prompt following, audio quality, audio prompt following, and audio-video synchronization, covering 20+ fine-grained dimensions. This design enables diagnostic evaluation: the benchmark pinpoints which capability fails and why. Crucially, the benchmark is built around usage-like prompts that closely mirror how end users actually describe desired content. As a result, it reduces the common evaluation-deployment mismatch, where strong offline metrics fail to translate into perceived quality in real applications.

9__seed_random.mp4
36__seed_random.mp4
65__seed_random.mp4
84__seed_random.mp4
92__seed_random.mp4
105__seed_random.mp4
172__seed_random.mp4
202__seed_random.mp4
261__seed_random.mp4

Comparison with SOTA

We conducted extensive two-round human evaluations to benchmark our model's performance against leading competitors (Veo 3.1, Kling 2.6, Wan 2.6, Sora 2, and LTX-2). Across all metrics, Alive ranks at or near the top, indicating a well-balanced capability profile rather than a single-metric advantage. Alive performs best on audio prompt following and audio-video synchronization, outperforming the other competitors by a notable margin. This indicates a strong advantage in cross-modal understanding and alignment, particularly in faithfully reflecting audio instructions and maintaining tight timing correspondence between audio events and visual content.

Alive Benchmark Results


Introduction to Alive

Alive is a unified audio-video generation model that excels at text-to-video&audio (T2VA), image-to-video&audio (I2VA), text-to-video (T2V), and text-to-audio (T2A) generation. It supports flexible resolutions and aspect ratios and arbitrary video lengths, and is extensible to character-reference audio-video animation.

Joint Audio-Video Modeling

We propose Alive, a joint generation architecture that seamlessly integrates Audio and Video DiTs via an extended "Dual Stream + Single Stream" paradigm. To resolve temporal granularity mismatches, we introduce UniTemp-RoPE and TA-CrossAttn, which map heterogeneous latents into a shared continuous temporal coordinate system, enforcing physical-time alignment for synchronized audio-visual generation.

Joint Audio-Video Modeling Framework
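
The exact formulation of UniTemp-RoPE and TA-CrossAttn is not spelled out here, but the core idea of a shared continuous temporal coordinate system can be sketched as follows: convert both audio and video token indices into physical time (seconds) and build rotary embeddings from those continuous positions, so tokens at the same instant receive the same phase. The frame rates, sequence lengths, and dimensions below are illustrative assumptions, not the model's actual values.

```python
# Minimal sketch of a shared continuous temporal coordinate for rotary embeddings.
# Audio and video latents run at different rates; both are mapped to seconds first.
import torch

def temporal_rope(positions_sec: torch.Tensor, dim: int, base: float = 10000.0):
    """Rotary cos/sin tables for continuous (float) temporal positions."""
    assert dim % 2 == 0
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = positions_sec[:, None] * inv_freq[None, :]           # (T, dim/2)
    return torch.cos(angles), torch.sin(angles)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    """Rotate feature pairs of x (T, dim) by the given cos/sin tables."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

# Hypothetical rates: video latents at 8 frames/s, audio latents at 25 tokens/s.
video_fps, audio_fps, dim = 8.0, 25.0, 64
video_t = torch.arange(16) / video_fps      # 2 s of video latents, in seconds
audio_t = torch.arange(50) / audio_fps      # 2 s of audio latents, in seconds

v_cos, v_sin = temporal_rope(video_t, dim)
a_cos, a_sin = temporal_rope(audio_t, dim)
video_tokens = apply_rope(torch.randn(16, dim), v_cos, v_sin)
audio_tokens = apply_rope(torch.randn(50, dim), a_cos, a_sin)
# Tokens that fall at the same physical time now share the same rotary phase,
# which is what a temporally-aligned cross-attention (TA-CrossAttn) can exploit.
```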

Model Parameters

| Model    | Model Size | M  | N  | Input Dim. | Output Dim. | Num. of Heads | Head Dim. |
|----------|------------|----|----|------------|-------------|---------------|-----------|
| VideoDiT | 12B        | 16 | 40 | 36         | 16          | 24            | 128       |
| AudioDiT | 2B         | 32 | 32 | 32         | –           | 24            | 64        |
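
The table can also be read as a pair of configurations. The sketch below is hypothetical: it assumes, as in other "Dual Stream + Single Stream" DiTs, that M counts dual-stream blocks and N counts single-stream blocks, and it leaves the unlisted AudioDiT output dimension as None.

```python
# Hypothetical config mirroring the table above; field meanings are assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class DiTConfig:
    name: str
    params: str
    dual_stream_blocks: int      # M (assumed meaning)
    single_stream_blocks: int    # N (assumed meaning)
    input_dim: int
    output_dim: Optional[int]
    num_heads: int
    head_dim: int

    @property
    def hidden_dim(self) -> int:
        return self.num_heads * self.head_dim

VIDEO_DIT = DiTConfig("VideoDiT", "12B", 16, 40, 36, 16, 24, 128)   # hidden 3072
AUDIO_DIT = DiTConfig("AudioDiT", "2B", 32, 32, 32, None, 24, 64)   # hidden 1536
```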

Audio-Video Refiner

The proposed cascaded audio-video (AV) refiner leverages a 480p base model to efficiently enable 1080p audio-video generation without excessive computational cost. On the video side, low-resolution inputs are refined to high-resolution outputs, effectively mitigating generative artifacts. For audio, the approach preserves both fidelity and audio-visual synchronization by inputting clean audio latents into a frozen Audio DiT module, thereby maintaining the quality and audio-video sync established by the base model.

Audio-Video Refiner
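
As a rough illustration of this cascade, the sketch below refines the 480p video latents to 1080p while reusing the clean base audio latents. All module names and APIs (base_model.sample, frozen_audio_dit.encode, video_refiner) are placeholders, not a released implementation.

```python
# Sketch of the cascaded AV refiner: refine video, keep base audio untouched.
import torch
import torch.nn.functional as F

@torch.no_grad()
def refine_to_1080p(base_model, video_refiner, frozen_audio_dit, prompt):
    # 1) Base pass: joint 480p video + audio latents from the base model.
    #    Video latents assumed shaped (B, C, T, H, W).
    video_lat_480p, audio_lat = base_model.sample(prompt)          # placeholder API

    # 2) Audio side: feed the clean audio latents through the frozen Audio DiT so
    #    the refiner is conditioned on the same audio representation the base model
    #    was synced to; the audio itself is left untouched, preserving fidelity
    #    and audio-video synchronization.
    audio_feats = frozen_audio_dit.encode(audio_lat)               # placeholder API

    # 3) Video side: upsample the 480p latents spatially (480p -> 1080p is 2.25x)
    #    and refine them to remove generative artifacts.
    video_lat_hr = F.interpolate(video_lat_480p, scale_factor=(1.0, 2.25, 2.25),
                                 mode="trilinear")
    video_lat_1080p = video_refiner(video_lat_hr, text_cond=prompt,
                                    audio_cond=audio_feats)

    return video_lat_1080p, audio_lat
```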

Comprehensive Audio-Video Data Pipelines

Going beyond conventional visual quality filtering, our work introduces a comprehensive data pipeline for joint audio-visual generation. It performs dual-quality filtering on both audio and video modalities, and employs a joint visual + audio keyword labeling system to associate a single visual object with its diverse range of audio events, enabling a more sophisticated level of audio-visual data balancing. Furthermore, we optimize and correct the Subject-Speech correspondence in multi-person and multi-shot scenarios, significantly enhancing character identity consistency and accuracy.

Comprehensive Audio-Video Data Pipelines
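
The toy sketch below illustrates the dual-quality filtering and the joint visual + audio keyword labeling described above. The clip schema, score names, and thresholds are invented for illustration only.

```python
# Toy sketch: keep a clip only if both modalities pass quality checks, then pair
# each visual object with the audio events it co-occurs with for data balancing.
from collections import Counter

def passes_dual_quality(clip: dict) -> bool:
    return clip["video_quality"] >= 0.7 and clip["audio_quality"] >= 0.7

def joint_keywords(clip: dict) -> list[tuple[str, str]]:
    """Pair every visual object with every audio event it co-occurs with."""
    return [(v, a) for v in clip["visual_objects"] for a in clip["audio_events"]]

clips = [
    {"video_quality": 0.90, "audio_quality": 0.80,
     "visual_objects": ["dog"], "audio_events": ["bark", "pant"]},
    {"video_quality": 0.95, "audio_quality": 0.40,     # dropped: noisy audio
     "visual_objects": ["dog"], "audio_events": ["bark"]},
]

kept = [c for c in clips if passes_dual_quality(c)]
pair_counts = Counter(pair for c in kept for pair in joint_keywords(c))
print(pair_counts)   # Counter({('dog', 'bark'): 1, ('dog', 'pant'): 1})
```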

Role-Playing Animate

We introduce a cross-pair pipeline and a unified-editing-based reference augmentation pipeline to robustly decouple identity from static appearance, effectively mitigating copy-paste bias. Furthermore, we develop a multi-reference conditioning mechanism with a dedicated temporal offset and a dual-conditioning CFG strategy, enabling the model to treat reference images as persistent identity anchors rather than temporal frames, thus achieving superior identity consistency and motion dynamics.

2__seed_random.mp4
3aacef55_sr.mp4
690dd723_sr.mp4
454015df_sr.mp4
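
The multi-reference conditioning with a dedicated temporal offset can be sketched roughly as follows: reference-image tokens are appended to the video token sequence but assigned temporal positions far outside the clip's time range, so positional encoding treats them as persistent identity anchors rather than adjacent frames. The shapes, offset value, and concatenation scheme are assumptions, not the released design.

```python
# Sketch: give reference tokens out-of-range temporal positions (identity anchors).
import torch

def build_tokens_and_times(video_lat, ref_lats, video_fps=8.0, ref_offset_sec=-100.0):
    """
    video_lat: (T, L, D) per-frame video tokens
    ref_lats:  list of (L, D) reference-image tokens
    Returns flattened tokens (N, D) and their temporal positions in seconds (N,).
    """
    T, L, D = video_lat.shape
    video_tokens = video_lat.reshape(T * L, D)
    video_times = (torch.arange(T).float() / video_fps).repeat_interleave(L)

    ref_tokens = torch.cat(ref_lats, dim=0)
    # Each reference gets its own anchor position well before t=0, outside the clip.
    ref_times = torch.cat([
        torch.full((r.shape[0],), ref_offset_sec - i) for i, r in enumerate(ref_lats)
    ])
    return torch.cat([ref_tokens, video_tokens]), torch.cat([ref_times, video_times])

tokens, times = build_tokens_and_times(torch.randn(16, 64, 128),
                                        [torch.randn(64, 128), torch.randn(64, 128)])
```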

Training Recipe

  • The Importance of Audio Training: The initial AudioDiT pre-training quality (e.g., tone authenticity, pronunciation accuracy, emotional consistency) sets the upper bound for audio performance in joint generation. Joint training primarily facilitates audio-visual synchronization and has limited impact on fundamental audio quality; therefore, inadequate audio pre-training cannot be meaningfully improved during subsequent joint training.
  • Audio Sensitivity and Forgetting: The audio branch is highly sensitive to changes in the training data distribution and adapts quickly, often leading to catastrophic forgetting of previously learned robust audio features. To address this, asymmetric learning rates are used to prevent audio quality degradation during joint training.

Training Recipe
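
A minimal sketch of the asymmetric-learning-rate setup via optimizer parameter groups follows; the module names and the exact learning-rate ratio are illustrative assumptions.

```python
# Sketch: smaller learning rate for the audio branch during joint training, so it
# adapts to the joint objective without forgetting its pre-trained audio features.
import torch

model = torch.nn.ModuleDict({
    "video_dit": torch.nn.Linear(128, 128),   # stand-in for the 12B Video DiT
    "audio_dit": torch.nn.Linear(64, 64),     # stand-in for the 2B Audio DiT
})

optimizer = torch.optim.AdamW([
    {"params": model["video_dit"].parameters(), "lr": 1e-4},   # full LR for video
    {"params": model["audio_dit"].parameters(), "lr": 1e-5},   # much smaller for audio
], betas=(0.9, 0.95), weight_decay=0.01)
```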

Inference Optimization

The Audio DiT and Video DiT are each guided by two distinct conditions: the text prompt and the cross-attention signal. The introduction of cross-attention provides a mutual signal that steers the model toward audio-video synchronization. To effectively utilize this secondary cross-attention condition, we adopt a multi-condition control scheme, treating the text prompt (positive c_pos / negative c_neg) and the mutual cross-attention signal (c_mutual) as separate, controllable conditions for guidance.

Inference Optimization
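
The precise guidance combination is not given here, but one common way to combine two controllable conditions in classifier-free guidance looks like the sketch below, where eps stands for a denoiser call, c_pos / c_neg for the positive / negative text prompts, mutual for the cross-modal signal, and the weights are illustrative.

```python
# Illustrative multi-condition guidance; not necessarily the exact scheme used by Alive.
def guided_eps(eps, x, t, c_pos, c_neg, mutual, w_text=7.5, w_mutual=2.0):
    e_uncond      = eps(x, t, text=c_neg, mutual=None)     # negative prompt, no cross-modal signal
    e_text        = eps(x, t, text=c_pos, mutual=None)     # positive prompt only
    e_text_mutual = eps(x, t, text=c_pos, mutual=mutual)   # positive prompt + cross-modal signal
    return (e_uncond
            + w_text   * (e_text - e_uncond)               # steer toward the prompt
            + w_mutual * (e_text_mutual - e_text))         # steer toward A/V sync
```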


Citation

If you find our work useful for your research, please consider citing:

@article{guo2026Alive,
  title={Alive: Animate Your World with Lifelike Audio-Video Generation},
  author={Ying Guo and Qijun Gan and Yifu Zhang and Jinlai Liu and Yifei Hu and Pan Xie and Dongjun Qian and Yu Zhang and Ruiqi Li and Yuqi Zhang and Ruibiao Lu and Xiaofeng Mei and Bo Han and Xiang Yin and Bingyue Peng and Zehuan Yuan},
  journal={arXiv preprint arXiv:2602.08682},
  year={2026}
}
