Official repository for KeyVID - a unified diffusion framework that generates temporally coherent videos conditioned on audio, guided by adaptive keyframe localization.
```bash
# 1. Clone and install
git clone https://github.com/XingruiWang/KeyVID.git
cd KeyVID
pip install -r requirements.txt

# 2. Create directories and download checkpoints
mkdir -p checkpoint/KeyVID/keyframe_generation checkpoint/KeyVID/asva_12_kf_interp
mkdir -p data/AVSync15/videos
# Download checkpoints (coming soon on HuggingFace)
# Place in: checkpoint/KeyVID/keyframe_generation/epoch=859-step=10320.ckpt
#           checkpoint/KeyVID/asva_12_kf_interp/epoch=1479-step=17760.ckpt

# 3. Prepare test data
# Place videos in: data/AVSync15/videos/
# Create: data/AVSync15/test.txt (one video name per line, no extension)

# 4. Set checkpoint path (REQUIRED)
export CHECKPOINT_ROOT=./checkpoint/KeyVID

# 5. Verify setup
bash scripts/check_paths.sh

# 6. Run inference
bash scripts/generation.sh asva_12_kf         # Generate keyframes
bash scripts/generation.sh asva_12_kf_interp  # Interpolate to full video

# 7. (Optional) Run evaluation
bash scripts/avsync15_metric.sh               # Compute metrics
```

Notes:
- If you see a "CHECKPOINT_ROOT is not set" error, export it as in step 4.
- Evaluation requires the AVSync checkpoint:
  `/dockerx/groups/KeyVID_hf_model/avsync/vggss_sync_contrast_12/ckpts/checkpoint-40000`
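The `test.txt` file from step 3 can be generated directly from the contents of the videos directory. The helper below is a convenience sketch, not part of the repo; the set of video extensions is an assumption:

```python
from pathlib import Path

VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv"}  # assumed common extensions

def write_test_list(video_dir: str, out_file: str) -> list[str]:
    """Write one video name per line, extension stripped, as test.txt expects."""
    names = sorted(
        p.stem for p in Path(video_dir).iterdir()
        if p.suffix.lower() in VIDEO_EXTS
    )
    Path(out_file).write_text("\n".join(names) + "\n")
    return names

# Example: write_test_list("data/AVSync15/videos", "data/AVSync15/test.txt")
```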
```
KeyVID/                                      # Project root
├── checkpoint/                              # Model checkpoints (git-ignored)
│   └── KeyVID/                              # Checkpoint root (set via CHECKPOINT_ROOT)
│       ├── keyframe_generation/
│       │   └── epoch=859-step=10320.ckpt   # ⚠️ REQUIRED
│       └── asva_12_kf_interp/
│           └── epoch=1479-step=17760.ckpt  # ⚠️ REQUIRED
│
├── .checkpoints/                            # Downloaded by ImageBind
│
├── data/                                    # Input datasets (git-ignored)
│   └── AVSync15/                            # Dataset
│       ├── videos/                          # ⚠️ REQUIRED
│       └── test.txt                         # ⚠️ REQUIRED
│
├── outputs/                                 # Inference outputs (git-ignored)
├── save_results/                            # Saved evaluation results (git-ignored)
│
├── scripts/                                 # Inference & evaluation scripts
│   ├── generation.sh                        # Main inference script
│   ├── avsync15_metric.sh                   # Evaluation script
│   └── check_paths.sh                       # Setup verification
├── configs/                                 # Configuration files
├── imagebind/                               # ImageBind model code
├── lvdm/                                    # Model code
├── utils/                                   # Utility modules
├── motion_scores/                           # Motion scoring tools
├── main/                                    # Main entry points
├── Dockerfile                               # Docker build config
├── docker-compose.yml                       # Docker Compose config
├── requirements.txt                         # Python dependencies
└── README.md
```
External Checkpoints (for evaluation only):
- AVSync checkpoint: `/dockerx/groups/KeyVID_hf_model/avsync/vggss_sync_contrast_12/ckpts/checkpoint-40000`
  - Used by `scripts/avsync15_metric.sh` for audio-visual synchronization metrics
  - Can be customized via the `AVSYNC_CKPT` environment variable
```bash
# Clone repository
git clone https://github.com/XingruiWang/KeyVID.git
cd KeyVID

# Create environment
conda create -n keyvid python=3.10
conda activate keyvid

# Install PyTorch (CUDA example)
pip install torch torchvision torchaudio

# Or for AMD ROCm:
pip install torch==2.3.0+rocm6.0 torchaudio==2.3.0+rocm6.0 \
    --index-url https://download.pytorch.org/whl/rocm6.0

# Install dependencies
pip install -r requirements.txt

# Install envsubst (for config processing)
# Ubuntu/Debian: sudo apt-get install gettext
# macOS: brew install gettext
```

```bash
# Build and run
docker-compose up -d
docker-compose exec keyvid bash

# Inside container
cd /workspace
bash scripts/generation.sh asva_12_kf
```

Docker GPU Configuration:
- Update `HSA_OVERRIDE_GFX_VERSION` in `docker-compose.yml` for your GPU
- Check your GPU: `rocminfo | grep gfx`
System Requirements:
- GPU: 16GB+ VRAM recommended
- Python 3.10+, PyTorch 2.1+
- FFmpeg, envsubst (gettext)
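The tool requirements above can be sanity-checked from Python before running anything. This is an illustrative sketch, not part of the repo (the repo's own path verification lives in `scripts/check_paths.sh`); `check_environment` is a hypothetical helper name:

```python
import shutil
import sys

def check_environment() -> list[str]:
    """Return a list of setup problems; an empty list means the host looks ready."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append(f"Python 3.10+ required, found {sys.version.split()[0]}")
    for tool in ("ffmpeg", "envsubst"):  # required external tools
        if shutil.which(tool) is None:
            problems.append(f"missing required tool: {tool}")
    try:
        import torch  # PyTorch 2.1+ expected
        if not torch.cuda.is_available():
            problems.append("no CUDA/ROCm device visible to PyTorch")
    except ImportError:
        problems.append("PyTorch is not installed")
    return problems
```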
KeyVID uses a two-stage pipeline:

Stage 1: Keyframe generation

```bash
bash scripts/generation.sh asva_12_kf
```

- Generates 12-frame videos at 6 FPS
- Results: `outputs/repo/DynamiCrafter/save/asva/asva_12_kf_add_idx_add_fps/`

Stage 2: Keyframe interpolation

```bash
bash scripts/generation.sh asva_12_kf_interp
```

- Generates 48-frame videos at 24 FPS
- Uses keyframes from Stage 1
- Results: `outputs/repo/DynamiCrafter/save/asva/asva_12_kf_interp/`

Note: Run Stage 1 before Stage 2.
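Because Stage 2 consumes Stage 1's keyframes, a small wrapper can enforce the ordering and fail fast if Stage 1 produced nothing. This is a sketch, not part of the repo: `run_pipeline` and the injectable `runner` are assumptions for illustration, and the output directory is taken from the Stage 1 results path above:

```python
import subprocess
from pathlib import Path

# Stage 1 results directory, as listed in the pipeline description
KEYFRAME_DIR = "outputs/repo/DynamiCrafter/save/asva/asva_12_kf_add_idx_add_fps"

def run_stage(task: str) -> None:
    """Invoke the repo's generation script for one stage."""
    subprocess.run(["bash", "scripts/generation.sh", task], check=True)

def run_pipeline(runner=run_stage, keyframe_dir: str = KEYFRAME_DIR) -> None:
    """Run Stage 1, verify it produced outputs, then run Stage 2."""
    runner("asva_12_kf")  # Stage 1: 12 keyframes at 6 FPS
    if not any(Path(keyframe_dir).rglob("*")):  # guard: Stage 2 needs Stage 1 outputs
        raise RuntimeError(f"no Stage 1 outputs under {keyframe_dir}")
    runner("asva_12_kf_interp")  # Stage 2: interpolate to 48 frames at 24 FPS
```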
You must set `CHECKPOINT_ROOT` before running:

```bash
# Option 1: Direct export
export CHECKPOINT_ROOT=./checkpoint/KeyVID
bash scripts/generation.sh asva_12_kf

# Option 2: Use config file (recommended)
cp scripts/env.example scripts/.env
# Edit scripts/.env to set CHECKPOINT_ROOT
source scripts/.env
bash scripts/generation.sh asva_12_kf
```

By default, scripts use:
- Data: `./data/AVSync15/`
- Outputs: `./outputs/`

To customize:

```bash
export CHECKPOINT_ROOT=./checkpoint/KeyVID
export AVSYNC15_ROOT=/path/to/data
export DATA_ROOT=/path/to/outputs
bash scripts/generation.sh asva_12_kf
```

To change the GPU count, edit `scripts/generation.sh` line 113:

```bash
for ((i=0; i<N; i++)); do  # Change N to your GPU count
```

Status: Coming soon on HuggingFace.
Required files (place under `checkpoint/KeyVID/`):
- `checkpoint/KeyVID/keyframe_generation/epoch=859-step=10320.ckpt` (~2.5GB)
- `checkpoint/KeyVID/asva_12_kf_interp/epoch=1479-step=17760.ckpt` (~2.5GB)

Then set the checkpoint root:

```bash
export CHECKPOINT_ROOT=./checkpoint/KeyVID
```

For quantitative evaluation metrics (AlignSync, RelSync), you need the AVSync checkpoint:
- Location: `/dockerx/groups/KeyVID_hf_model/avsync/vggss_sync_contrast_12/ckpts/checkpoint-40000`
- Used by: `scripts/avsync15_metric.sh`
- Configure via:

```bash
export AVSYNC_CKPT=/path/to/avsync/checkpoint-40000
```
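The override behavior described above (environment variable with a fallback default) can be mirrored in Python. `resolve_avsync_ckpt` and `DEFAULT_AVSYNC_CKPT` are hypothetical names for illustration, not the repo's actual API:

```python
import os
from pathlib import Path

# Default location from the docs; override with the AVSYNC_CKPT env var
DEFAULT_AVSYNC_CKPT = (
    "/dockerx/groups/KeyVID_hf_model/avsync/vggss_sync_contrast_12/ckpts/checkpoint-40000"
)

def resolve_avsync_ckpt() -> Path:
    """Resolve the AVSync checkpoint path, honoring the AVSYNC_CKPT override."""
    path = Path(os.environ.get("AVSYNC_CKPT", DEFAULT_AVSYNC_CKPT))
    if not path.exists():
        raise FileNotFoundError(
            f"AVSync checkpoint not found: {path} (set AVSYNC_CKPT to your local copy)"
        )
    return path
```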
```bash
bash scripts/check_paths.sh  # Verify paths
```

Reduce the GPU count in `scripts/generation.sh` (line 113).

```bash
# Ubuntu/Debian
sudo apt-get install gettext
# macOS
brew install gettext
```

```bash
# Inside container
rocm-smi
python -c "import torch; print(torch.cuda.is_available())"
```

Evaluate generated videos with quantitative metrics:
```bash
# Set dataset path (optional if using default ./data/AVSync15)
export AVSYNC15_ROOT=./data/AVSync15

# Run evaluation
bash scripts/avsync15_metric.sh
```

Metrics computed:
- FID (FrΓ©chet Inception Distance)
- FVD (FrΓ©chet Video Distance)
- CLIP-Sim (Image-Audio and Image-Text similarity)
- RelSync (Relative audio-visual synchronization)
- AlignSync (Alignment synchronization)
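As a rough intuition for the CLIP-Sim entries above: they reduce to a cosine similarity between per-frame image embeddings and the conditioning (audio or text) embedding. The NumPy sketch below illustrates only that core computation; the actual metric script may embed and aggregate differently:

```python
import numpy as np

def clip_sim(frame_embs: np.ndarray, cond_emb: np.ndarray) -> float:
    """Mean cosine similarity between per-frame embeddings (N, D)
    and a single conditioning embedding (D,)."""
    frames = frame_embs / np.linalg.norm(frame_embs, axis=-1, keepdims=True)
    cond = cond_emb / np.linalg.norm(cond_emb)
    return float((frames @ cond).mean())  # average over the N frames
```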
Requirements:
- AVSync checkpoint: `/dockerx/groups/KeyVID_hf_model/avsync/vggss_sync_contrast_12/ckpts/checkpoint-40000`
- Ground truth videos: `data/AVSync15/videos/`
- Generated videos: `outputs/repo/DynamiCrafter/save/asva/.../samples/`

Custom paths:

```bash
export AVSYNC15_ROOT=/path/to/AVSync15
export AVSYNC_CKPT=/path/to/avsync/checkpoint-40000
bash scripts/avsync15_metric.sh
```

Status: Training code coming soon.
Two-stage training:
- Keyframe Generation: Audio + Image + Frame indices → Keyframes
- Interpolation: Keyframes + Audio → Full video

Configurations are available in `configs/training/`.
```bibtex
@article{wang2025keyvid,
  title={KeyVID: Keyframe-Aware Video Diffusion for Audio-Synchronized Visual Animation},
  author={Wang, Xingrui and Liu, Jiang and Wang, Ze and Yu, Xiaodong and Wu, Jialian and Sun, Ximeng and Su, Yusheng and Yuille, Alan and Liu, Zicheng and Barsoum, Emad},
  journal={arXiv preprint arXiv:2504.09656},
  year={2025}
}
```

See the LICENSE file for details.