Official implementation of OSGNet at CVPR 2025, and the champion solution repository for three egocentric video localization tracks of the Ego4D Episodic Memory Challenge at CVPR 2025.
Yisen Feng1, Haoyu Zhang1,2, Meng Liu3*, Weili Guan1, Liqiang Nie1*
1 Harbin Institute of Technology (Shenzhen) 2 Pengcheng Laboratory 3 Shandong Jianzhu University
* denotes corresponding author
- Paper: Object-Shot Enhanced Grounding Network for Egocentric Video
- Technical Report: OSGNet@ Ego4D Episodic Memory Challenge 2025
- Code Repository: iLearn-Lab/CVPR25-OSGNet
- Hugging Face: iLearn-Lab/CVPR25-OSGNet
- Introduction
- Highlights
- Project Structure
- Installation
- Checkpoints / Models
- Dataset / Benchmark
- Usage
- Citation
- Acknowledgement
- License
This repo is the official implementation of OSGNet at CVPR 2025. It is also the champion solution repository for three egocentric video localization tracks of the Ego4D Episodic Memory Challenge at CVPR 2025.
This repo supports data pre-processing, training, and evaluation for the following datasets:
- Ego4D-NLQ
- Ego4D-GoalStep
- TACoS
The repository currently provides:
- training scripts
- inference / evaluation scripts
- feature preparation instructions
- pretrained checkpoints
- Supports egocentric and video grounding benchmarks including Ego4D-NLQ, Ego4D-GoalStep, and TACoS
- Provides scripts for pretraining, finetuning, and inference
- Includes instructions for preparing text features, video features, LaViLa captions, and object features
- Releases checkpoints for multiple settings
.
├── configs/ # Configuration files for different datasets
├── ego4d_data/ # Annotation files and dataset-related metadata
├── install/ # Environment setup scripts
├── libs/ # Core modules, datasets, modeling, and utilities
├── tools/ # Training scripts
├── train.py # Training entry point
├── eval_nlq.py # Evaluation / inference entry point
├── README.md
└── LICENSE
git clone https://github.com/iLearn-Lab/CVPR25-OSGNet.git
cd CVPR25-OSGNet

Follow INSTALL.sh to set up the environment. Recommended PyTorch version:
torch >= 1.8.0
Required resources include:
- text feature
- video feature
- LaViLa captions (need to be unzipped)
- object feature
Download them from Hugging Face or from the Baidu Netdisk links below.
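The LaViLa caption archive ships as a zip (lavila.zip) and must be extracted before use. A minimal extraction sketch, assuming only a local zip path (the helper name is ours, not part of the repo):

```python
import pathlib
import zipfile

def extract_captions(zip_path: str, out_dir: str) -> list[str]:
    """Unzip a caption archive (e.g. lavila.zip) into out_dir.

    Returns the list of extracted member names.
    """
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the target directory if missing
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out)  # extract every member of the archive
        return zf.namelist()
```

For example, `extract_captions("lavila.zip", "data/lavila_captions")` unpacks the archive next to the other features.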
| Feature | Setting | NLQ v1 | NLQ v2 |
|---|---|---|---|
| InternVideo | Finetuned | 144 | |
| EgoVLP | Finetuned | 173 | |
| Feature | Setting | GoalStep |
|---|---|---|
| InternVideo | Finetuned | 135 |
| Feature | Setting | Checkpoint |
|---|---|---|
| C3D | Scratch | 150 |
| InternVideo | Finetuned | 131 |
Pretrain: NaQ
Download features from this Baidu Netdisk link
- narration feature: narration_clip_token_features
- narration jsonl: format_unique_pretrain_data_v2.jsonl

The features are the same as those used for NLQ below.
- internvideo: em_egovlp+internvideo_visual_features_1.87fps
- 4 GPUs, total batch size 16
configs/Ego4D-NLQ/v2/ego4d_nlq_v2_multitask_pretrain_2e-4.yaml
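The batch sizes listed in this README are totals across all GPUs; under DDP each process (one per GPU) receives an even share. A quick illustrative check, not repo code:

```python
def per_gpu_batch(total_batch: int, num_gpus: int) -> int:
    # DDP splits the total batch evenly across processes (one per GPU),
    # so the per-GPU batch is the total divided by the GPU count.
    assert total_batch % num_gpus == 0, "total batch must divide evenly across GPUs"
    return total_batch // num_gpus
```

For the pretraining setting above, `per_gpu_batch(16, 4)` gives 4 samples per GPU.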
- Video Feature & Text Feature: GroundNLQ leverages the extracted egocentric InternVideo and EgoVLP features and CLIP textual token features. Please refer to GroundNLQ.
- Download from Hugging Face
- Download from Baidu Netdisk:
- LaViLa Caption
- Object Feature (anno, classname)
- Video Feature (egovlp)
- Text Feature (NLQ v1 feature)
- NLQ v1 feature: nlq_v1_clip_token_features
- NLQ v2 feature: nlq_v2_clip_token_features
- egovideo: egovideo_token_lmdb
- egovlp: egovlp_lmdb
- internvideo: em_egovlp+internvideo_visual_features_1.87fps
- egovideo: egovideo_all_lmdb
lavila.zip
- anno: co-detr/class-score0.6-minnum10-lmdb
- classname: classname-clip-base/a_photo_of.pt
- 2 GPUs, total batch size 8
InternVideo
- v1: ego4d_nlq_v1_multitask_egovlp_256_finetune_2e-4.yaml
- v2: ego4d_nlq_v2_multitask_finetune_2e-4.yaml
EgoVideo
- v2: ego4d_nlq_v2_egovideo_finetune_4e-4.yaml
- Download from Hugging Face
- Download from Baidu Netdisk:
- Text Feature
- Video Feature (clipped and unclipped)
- LaViLa caption
- Object Feature (clipped and unclipped)
clip_query_lmdb
- internvideo: internvideo_clip_lmdb (due to memory limitations, we truncated the videos in the training set), internvideo_lmdb
lavila.zip
- anno: co-detr/clip-class-lmdb (after clipping)
- classname: classname-clip-base/a_photo_of.pt (the same as for Ego4D-NLQ)
- 4 GPUs, total batch size 4
- finetune: ego4d_goalstep_v2_baseline_2e-4.yaml
- Download from Hugging Face
- Download features from this Baidu Netdisk link.
- clip: all_clip_token_features
- glove: glove_clip_token_features
- c3d: c3d_lmdb
- internvideo: internvideo_lmdb
lavila.zip
- anno: co-detr/class-score0.6-minnum10-lmdb
- classname: classname-clip-base/a_photo_of.pt (the same as for Ego4D-NLQ)
- 4 GPUs, total batch size 8
- finetune: tacos_baseline_1e-4.yaml
- scratch: tacos_c3d_glove_weight1_5e-5.yaml
We adopt distributed data parallel (DDP) and fault-tolerant distributed training with torchrun.
Training and pretraining can be launched by running:
bash tools/train.sh CONFIG_FILE False OUTPUT_PATH CUDA_DEVICE_ID MODE

where:
- CONFIG_FILE is the config file for model / dataset hyperparameter initialization
- OUTPUT_PATH is the model output directory name of your choice
- CUDA_DEVICE_ID is the CUDA device id
- MODE is the running mode
The checkpoints and other experiment log files will be written into <output_folder>/OUTPUT_PATH, where output_folder is defined in the config file.
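The experiment directory layout above can be sketched as a small helper (hypothetical; the repo resolves this internally from the config):

```python
import os

def experiment_dir(output_folder: str, output_path: str) -> str:
    # Checkpoints and logs are written to <output_folder>/OUTPUT_PATH,
    # where output_folder comes from the config file and OUTPUT_PATH
    # is the run name passed to tools/train.sh.
    return os.path.join(output_folder, output_path)
```

For instance, with `output_folder: ckpt` in the config and run name `objectmambafinetune219`, results land in `ckpt/objectmambafinetune219`.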
bash tools/train.sh /home/feng_yi_sen/OSGNet/configs/tacos/tacos_c3d_glove_weight1_5e-5.yaml False objectmambafinetune219 0,1,2,3 train

Finetuning from a pretrained checkpoint can be launched by running:
bash tools/train.sh CONFIG_FILE RESUME_PATH OUTPUT_PATH CUDA_DEVICE_ID MODE

where RESUME_PATH is the path to the pretrained model weights.
bash tools/train.sh configs/Ego4D-NLQ/v2/ego4d_nlq_v2_multitask_finetune_2e-4.yaml /root/autodl-tmp/model/GroundNLQ/ckpt/save/model_7_pretrain.pth.tar objectmambafinetune219 0,1 train

For GoalStep, MODE should be not-eval-loss.
bash tools/train.sh configs/goalstep/ego4d_goalstep_v2_baseline_2e-4.yaml /root/autodl-tmp/model/GroundNLQ/ckpt/save/model_7_pretrain.pth.tar objectmambafinetune219 0,1 not-eval-loss

Once the model is trained, you can use the following command for inference:
python eval_nlq.py CONFIG_FILE CHECKPOINT_PATH -gpu CUDA_DEVICE_ID

where CHECKPOINT_PATH is the path to the saved checkpoint.
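Saved checkpoint filenames appear to encode the epoch and a validation score (e.g. model_2_26.834358523725836.pth.tar). A hypothetical helper for picking the best checkpoint by that convention (the naming scheme is inferred from released checkpoints, not documented by the repo):

```python
def parse_ckpt_name(name: str) -> tuple[int, float]:
    """Parse "model_<epoch>_<score>.pth.tar" into (epoch, score).

    The naming convention is inferred, not part of the official API.
    """
    stem = name[: -len(".pth.tar")]       # drop the extension
    _, epoch, score = stem.split("_", 2)  # "model", epoch, score
    return int(epoch), float(score)

def best_checkpoint(names: list[str]) -> str:
    # Pick the checkpoint whose filename encodes the highest score.
    return max(names, key=lambda n: parse_ckpt_name(n)[1])
```

For example, `parse_ckpt_name("model_2_26.834358523725836.pth.tar")` yields `(2, 26.834358523725836)`.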
python eval_nlq.py configs/Ego4D-NLQ/v2/ego4d_nlq_v2_multitask_finetune_2e-4.yaml /root/autodl-tmp/model/GroundNLQ/ckpt/ego4d_nlq_v2_multitask_finetune_2e-4_objectmambafinetune144/model_2_26.834358523725836.pth.tar -gpu 1

If you use our code, please consider citing our papers:
@inproceedings{feng2025object,
title={Object-shot enhanced grounding network for egocentric video},
author={Feng, Yisen and Zhang, Haoyu and Liu, Meng and Guan, Weili and Nie, Liqiang},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={24190--24200},
year={2025}
}

@article{feng2025osgnet,
title={OSGNet@ Ego4D Episodic Memory Challenge 2025},
author={Feng, Yisen and Zhang, Haoyu and Chu, Qiaohui and Liu, Meng and Guan, Weili and Wang, Yaowei and Nie, Liqiang},
journal={arXiv preprint arXiv:2506.03710},
year={2025}
}

This code is inspired by GroundNLQ.
We use the same video and text features as GroundNLQ. We thank the authors for their awesome open-source contributions.
This project is released under the MIT License. See LICENSE for details.