Object-Shot Enhanced Grounding Network for Egocentric Video (CVPR 2025)

Official implementation of OSGNet at CVPR 2025, and the champion solution repository for three egocentric video localization tracks of the Ego4D Episodic Memory Challenge at CVPR 2025.

Authors

Yisen Feng¹, Haoyu Zhang¹,², Meng Liu³\*, Weili Guan¹, Liqiang Nie¹\*

¹ Harbin Institute of Technology (Shenzhen) ² Pengcheng Laboratory ³ Shandong Jianzhu University

(\* denotes corresponding authors)

Introduction

This repository is the official implementation of OSGNet (Object-Shot Enhanced Grounding Network), published at CVPR 2025, and the champion solution for three egocentric video localization tracks of the Ego4D Episodic Memory Challenge at CVPR 2025.

This repo supports data pre-processing, training, and evaluation for the following datasets:

  • Ego4D-NLQ
  • Ego4D-GoalStep
  • TACoS

The repository currently provides:

  • training scripts
  • inference / evaluation scripts
  • feature preparation instructions
  • pretrained checkpoints

Highlights

  • Supports egocentric and video grounding benchmarks including Ego4D-NLQ, Ego4D-GoalStep, and TACoS
  • Provides scripts for pretraining, finetuning, and inference
  • Includes instructions for preparing text features, video features, LaViLa captions, and object features
  • Releases checkpoints for multiple settings

Project Structure

.
├── configs/       # Configuration files for different datasets
├── ego4d_data/    # Annotation files and dataset-related metadata
├── install/       # Environment setup scripts
├── libs/          # Core modules, datasets, modeling, and utilities
├── tools/         # Training scripts
├── train.py
├── eval_nlq.py
├── README.md
└── LICENSE

Installation

1. Clone the repository and install dependencies

git clone https://github.com/iLearn-Lab/CVPR25-OSGNet.git
cd CVPR25-OSGNet

Then follow INSTALL.sh to install the dependencies. The recommended PyTorch version is:

torch >= 1.8.0
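INSTALL.sh is authoritative for the exact setup steps. As a quick sanity check of the version requirement above, a small shell helper (a sketch, not part of the repository) can compare a dotted version string against the 1.8.0 minimum:

```shell
# version_ge A B -- succeeds if version A >= version B (uses GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# Example: check an installed torch version string against the recommended minimum.
version_ge "1.13.1" "1.8.0" && echo "torch version OK"
```

You can feed it the output of `python -c "import torch; print(torch.__version__)"`.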

2. Prepare offline data

Required resources include:

  • text features
  • video features
  • LaViLa captions (need to be unzipped)
  • object features
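The dataset sections below list the concrete feature archives; the actual paths come from your config files. A hypothetical layout sketch (the `data/` directory names here are assumptions, not the repository's required structure):

```shell
# Create a working layout for the downloaded resources (names are illustrative).
mkdir -p data/features/text data/features/video data/features/object data/captions

# The LaViLa captions ship as a zip archive and must be extracted before use.
if [ -f lavila.zip ]; then
  unzip -q lavila.zip -d data/captions
fi
```

Point the corresponding entries in your chosen config YAML at wherever you place these files.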

Checkpoints / Models

Download the checkpoints from Hugging Face or from the Baidu Netdisk links below.

Pretrained weights for finetuning (trained with NaQ)

Ego4D-NLQ

| Feature | Setting | Checkpoint (NLQ v1 / NLQ v2) |
| --- | --- | --- |
| InternVideo | Finetuned | 144 |
| EgoVLP | Finetuned | 173 |

GoalStep

| Feature | Setting | GoalStep |
| --- | --- | --- |
| InternVideo | Finetuned | 135 |

TACoS

| Feature | Setting | Checkpoint |
| --- | --- | --- |
| C3D | Scratch | 150 |
| InternVideo | Finetuned | 131 |

Dataset / Benchmark

Pretrain: NaQ

Text Feature

Download features from this Baidu Netdisk link

  • narration feature: narration_clip_token_features
  • narration jsonl: format_unique_pretrain_data_v2.jsonl

Video Feature

The video features are the same as those used for Ego4D-NLQ below.

  • internvideo: em_egovlp+internvideo_visual_features_1.87fps

Config

  • 4 GPUs, total batch size of 16
  • configs/Ego4D-NLQ/v2/ego4d_nlq_v2_multitask_pretrain_2e-4.yaml

Ego4D-NLQ

Feature Download

Text Feature

  • NLQ v1 feature: nlq_v1_clip_token_features
  • NLQ v2 feature: nlq_v2_clip_token_features
  • egovideo: egovideo_token_lmdb

Video Feature

  • egovlp: egovlp_lmdb
  • internvideo: em_egovlp+internvideo_visual_features_1.87fps
  • egovideo: egovideo_all_lmdb

Lavila Caption

  • lavila.zip

Object Feature

  • anno: co-detr/class-score0.6-minnum10-lmdb
  • classname: classname-clip-base/a_photo_of.pt

Config

  • 2 GPUs, total batch size of 8

InternVideo

  • v1: ego4d_nlq_v1_multitask_egovlp_256_finetune_2e-4.yaml
  • v2: ego4d_nlq_v2_multitask_finetune_2e-4.yaml

EgoVideo

  • v2: ego4d_nlq_v2_egovideo_finetune_4e-4.yaml

GoalStep

Feature Download

Text Feature

  • clip_query_lmdb

Video Feature

  • internvideo: internvideo_clip_lmdb (training-set videos are truncated due to memory limitations) and internvideo_lmdb

Lavila Caption

  • lavila.zip

Object Feature

  • anno: co-detr/clip-class-lmdb (after clip)
  • classname: classname-clip-base/a_photo_of.pt (the same as Ego4D-NLQ)

Config

  • 4 GPUs, total batch size of 4
  • finetune: ego4d_goalstep_v2_baseline_2e-4.yaml

TACoS

Feature Download

Text Feature

  • clip: all_clip_token_features
  • glove: glove_clip_token_features

Video Feature

  • c3d: c3d_lmdb
  • internvideo: internvideo_lmdb

Lavila Caption

  • lavila.zip

Object Feature

  • anno: co-detr/class-score0.6-minnum10-lmdb
  • classname: classname-clip-base/a_photo_of.pt (the same as Ego4D-NLQ)

Config

  • 4 GPUs, total batch size of 8
  • finetune: tacos_baseline_1e-4.yaml
  • scratch: tacos_c3d_glove_weight1_5e-5.yaml

Usage

We adopt Distributed Data Parallel (DDP) and fault-tolerant distributed training with torchrun.

Training from scratch

Training and pretraining can be launched by running:

bash tools/train.sh CONFIG_FILE False OUTPUT_PATH CUDA_DEVICE_ID MODE

where:

  • CONFIG_FILE is the config file for model / dataset hyperparameter initialization
  • OUTPUT_PATH is the model output directory name defined by yourself
  • CUDA_DEVICE_ID is the CUDA device id
  • MODE is the running mode (e.g., train)

The checkpoints and other experiment log files will be written into <output_folder>/OUTPUT_PATH, where output_folder is defined in the config file.

Example: TACoS

bash tools/train.sh /home/feng_yi_sen/OSGNet/configs/tacos/tacos_c3d_glove_weight1_5e-5.yaml False objectmambafinetune219 0,1,2,3 train

Finetuning

Training can be launched by running:

bash tools/train.sh CONFIG_FILE RESUME_PATH OUTPUT_PATH CUDA_DEVICE_ID MODE

where RESUME_PATH is the path to the pretrained model weights.

Example: Ego4D-NLQ v2

bash tools/train.sh configs/Ego4D-NLQ/v2/ego4d_nlq_v2_multitask_finetune_2e-4.yaml /root/autodl-tmp/model/GroundNLQ/ckpt/save/model_7_pretrain.pth.tar objectmambafinetune219 0,1 train

Example: GoalStep

For GoalStep, MODE should be not-eval-loss.

bash tools/train.sh configs/goalstep/ego4d_goalstep_v2_baseline_2e-4.yaml /root/autodl-tmp/model/GroundNLQ/ckpt/save/model_7_pretrain.pth.tar objectmambafinetune219 0,1 not-eval-loss

Inference

Once the model is trained, you can use the following command for inference:

python eval_nlq.py CONFIG_FILE CHECKPOINT_PATH -gpu CUDA_DEVICE_ID

where CHECKPOINT_PATH is the path to the saved checkpoint.

Example: Ego4D-NLQ v2

python eval_nlq.py configs/Ego4D-NLQ/v2/ego4d_nlq_v2_multitask_finetune_2e-4.yaml /root/autodl-tmp/model/GroundNLQ/ckpt/ego4d_nlq_v2_multitask_finetune_2e-4_objectmambafinetune144/model_2_26.834358523725836.pth.tar -gpu 1

Citation

If you use our code, please consider citing our papers:

@inproceedings{feng2025object,
  title={Object-shot enhanced grounding network for egocentric video},
  author={Feng, Yisen and Zhang, Haoyu and Liu, Meng and Guan, Weili and Nie, Liqiang},
  booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages={24190--24200},
  year={2025}
}
@article{feng2025osgnet,
  title={OSGNet@ Ego4D Episodic Memory Challenge 2025},
  author={Feng, Yisen and Zhang, Haoyu and Chu, Qiaohui and Liu, Meng and Guan, Weili and Wang, Yaowei and Nie, Liqiang},
  journal={arXiv preprint arXiv:2506.03710},
  year={2025}
}

Acknowledgement

This code is inspired by GroundNLQ.

We use the same video and text features as GroundNLQ. We thank the authors for their awesome open-source contributions.


License

This project is released under the MIT License. See LICENSE for details.
