Official implementation of OSGNet at CVPR 2025, and the champion solution repository for three egocentric video localization tracks of the Ego4D Episodic Memory Challenge at CVPR 2025.
Yisen Feng1, Haoyu Zhang1,2, Meng Liu3*, Weili Guan1, Liqiang Nie1*
1 Harbin Institute of Technology (Shenzhen) 2 Pengcheng Laboratory 3 Shandong Jianzhu University
* denotes corresponding author
- Paper: Object-Shot Enhanced Grounding Network for Egocentric Video
- Technical Report: OSGNet@ Ego4D Episodic Memory Challenge 2025
- Code Repository: iLearn-Lab/CVPR25-OSGNet
- Hugging Face: iLearn-Lab/CVPR25-OSGNet
- Introduction
- Highlights
- Project Structure
- Installation
- Checkpoints / Models
- Dataset / Benchmark
- Usage
- Citation
- Acknowledgement
- License
This repo is the official implementation of OSGNet at CVPR 2025. It is also the champion solution repository for three egocentric video localization tracks of the Ego4D Episodic Memory Challenge at CVPR 2025.
This repo supports data pre-processing, training, and evaluation for the following datasets:
- Ego4D-NLQ
- Ego4D-GoalStep
- TACoS
The repository currently provides:
- training scripts
- inference / evaluation scripts
- feature preparation instructions
- pretrained checkpoints
- Supports egocentric and video grounding benchmarks including Ego4D-NLQ, Ego4D-GoalStep, and TACoS
- Provides scripts for pretraining, finetuning, and inference
- Includes instructions for preparing text features, video features, LaViLa captions, and object features
- Releases checkpoints for multiple settings
.
├── configs/ # Configuration files for different datasets
├── ego4d_data/ # Annotation files and dataset-related metadata
├── install/ # Environment setup scripts
├── libs/ # Core modules, datasets, modeling, and utilities
├── tools/ # Training scripts
├── train.py # Training entry point
├── eval_nlq.py # Evaluation / inference entry point
├── README.md
└── LICENSE
git clone https://github.com/iLearn-Lab/CVPR25-OSGNet.git
cd CVPR25-OSGNet

Follow INSTALL.sh to set up the environment. Recommended PyTorch version:
torch >= 1.8.0
Required resources include:
- text feature
- video feature
- LaViLa captions (need to be unzipped)
- object feature
Download them from Hugging Face or from the Baidu Netdisk links below.
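The LaViLa caption archive ships as a zip (lavila.zip) and must be extracted before use. A minimal extraction sketch, assuming only a local zip path (the helper name is ours, not part of the repo):

```python
import pathlib
import zipfile

def extract_captions(zip_path: str, out_dir: str) -> list[str]:
    """Unzip a caption archive (e.g. lavila.zip) into out_dir.

    Returns the list of extracted member names.
    """
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the target directory if missing
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out)  # extract every member of the archive
        return zf.namelist()
```

For example, `extract_captions("lavila.zip", "data/lavila_captions")` unpacks the archive next to the other features.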
| Feature | Setting | NLQ v1 | NLQ v2 |
|---|---|---|---|
| InternVideo | Finetuned | 144 | |
| EgoVLP | Finetuned | 173 | |
| Feature | Setting | GoalStep |
|---|---|---|
| InternVideo | Finetuned | 135 |
| Feature | Setting | Checkpoint |
|---|---|---|
| C3D | Scratch | 150 |
| InternVideo | Finetuned | 131 |
Pretrain: NaQ
Download features from this Baidu Netdisk link
- narration feature: narration_clip_token_features
- narration jsonl: format_unique_pretrain_data_v2.jsonl

The features are the same as those used for NLQ below.
- internvideo: em_egovlp+internvideo_visual_features_1.87fps
- 4 GPUs, total batch size 16
configs/Ego4D-NLQ/v2/ego4d_nlq_v2_multitask_pretrain_2e-4.yaml
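The batch sizes listed in this README are totals across all GPUs; under DDP each process (one per GPU) receives an even share. A quick illustrative check, not repo code:

```python
def per_gpu_batch(total_batch: int, num_gpus: int) -> int:
    # DDP splits the total batch evenly across processes (one per GPU),
    # so the per-GPU batch is the total divided by the GPU count.
    assert total_batch % num_gpus == 0, "total batch must divide evenly across GPUs"
    return total_batch // num_gpus
```

For the pretraining setting above, `per_gpu_batch(16, 4)` gives 4 samples per GPU.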
- Video Feature & Text Feature: GroundNLQ leverages the extracted egocentric InternVideo and EgoVLP features and CLIP textual token features. Please refer to GroundNLQ.
- Download from Hugging Face
- Download from Baidu Netdisk:
- LaViLa Caption
- Object Feature (anno, classname)
- Video Feature (egovlp)
- Text Feature (NLQ v1 feature)
- NLQ v1 feature: nlq_v1_clip_token_features
- NLQ v2 feature: nlq_v2_clip_token_features
- egovideo: egovideo_token_lmdb
- egovlp: egovlp_lmdb
- internvideo: em_egovlp+internvideo_visual_features_1.87fps
- egovideo: egovideo_all_lmdb
lavila.zip
- anno: co-detr/class-score0.6-minnum10-lmdb
- classname: classname-clip-base/a_photo_of.pt
- 2 GPUs, total batch size 8
InternVideo
- v1: ego4d_nlq_v1_multitask_egovlp_256_finetune_2e-4.yaml
- v2: ego4d_nlq_v2_multitask_finetune_2e-4.yaml
EgoVideo
- v2: ego4d_nlq_v2_egovideo_finetune_4e-4.yaml
- Download from Hugging Face
- Download from Baidu Netdisk:
- Text Feature
- Video Feature (clipped and unclipped)
- LaViLa caption
- Object Feature (clipped and unclipped)
clip_query_lmdb
- internvideo: internvideo_clip_lmdb (due to memory limitations, we truncated the videos in the training set), internvideo_lmdb
lavila.zip
- anno: co-detr/clip-class-lmdb (after clipping)
- classname: classname-clip-base/a_photo_of.pt (the same as for Ego4D-NLQ)
- 4 GPUs, total batch size 4
- finetune: ego4d_goalstep_v2_baseline_2e-4.yaml
- Download from Hugging Face
- Download features from this Baidu Netdisk link.
- clip: all_clip_token_features
- glove: glove_clip_token_features
- c3d: c3d_lmdb
- internvideo: internvideo_lmdb
lavila.zip
- anno: co-detr/class-score0.6-minnum10-lmdb
- classname: classname-clip-base/a_photo_of.pt (the same as for Ego4D-NLQ)
- 4 GPUs, total batch size 8
- finetune: tacos_baseline_1e-4.yaml
- scratch: tacos_c3d_glove_weight1_5e-5.yaml
We adopt distributed data parallel (DDP) and fault-tolerant distributed training with torchrun.
Training and pretraining can be launched by running:
bash tools/train.sh CONFIG_FILE False OUTPUT_PATH CUDA_DEVICE_ID MODE

where:
- CONFIG_FILE is the config file for model / dataset hyperparameter initialization
- OUTPUT_PATH is the model output directory name of your choice
- CUDA_DEVICE_ID is the CUDA device id
- MODE is the running mode
The checkpoints and other experiment log files will be written into <output_folder>/OUTPUT_PATH, where output_folder is defined in the config file.
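The experiment directory layout above can be sketched as a small helper (hypothetical; the repo resolves this internally from the config):

```python
import os

def experiment_dir(output_folder: str, output_path: str) -> str:
    # Checkpoints and logs are written to <output_folder>/OUTPUT_PATH,
    # where output_folder comes from the config file and OUTPUT_PATH
    # is the run name passed to tools/train.sh.
    return os.path.join(output_folder, output_path)
```

For instance, with `output_folder: ckpt` in the config and run name `objectmambafinetune219`, results land in `ckpt/objectmambafinetune219`.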
bash tools/train.sh /home/feng_yi_sen/OSGNet/configs/tacos/tacos_c3d_glove_weight1_5e-5.yaml False objectmambafinetune219 0,1,2,3 train

Finetuning from a pretrained checkpoint can be launched by running:
bash tools/train.sh CONFIG_FILE RESUME_PATH OUTPUT_PATH CUDA_DEVICE_ID MODE

where RESUME_PATH is the path to the pretrained model weights.
bash tools/train.sh configs/Ego4D-NLQ/v2/ego4d_nlq_v2_multitask_finetune_2e-4.yaml /root/autodl-tmp/model/GroundNLQ/ckpt/save/model_7_pretrain.pth.tar objectmambafinetune219 0,1 train

For GoalStep, MODE should be not-eval-loss.
bash tools/train.sh configs/goalstep/ego4d_goalstep_v2_baseline_2e-4.yaml /root/autodl-tmp/model/GroundNLQ/ckpt/save/model_7_pretrain.pth.tar objectmambafinetune219 0,1 not-eval-loss

Once the model is trained, you can use the following command for inference:
python eval_nlq.py CONFIG_FILE CHECKPOINT_PATH -gpu CUDA_DEVICE_ID

where CHECKPOINT_PATH is the path to the saved checkpoint.
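Saved checkpoint filenames appear to encode the epoch and a validation score (e.g. model_2_26.834358523725836.pth.tar). A hypothetical helper for picking the best checkpoint by that convention (the naming scheme is inferred from released checkpoints, not documented by the repo):

```python
def parse_ckpt_name(name: str) -> tuple[int, float]:
    """Parse "model_<epoch>_<score>.pth.tar" into (epoch, score).

    The naming convention is inferred, not part of the official API.
    """
    stem = name[: -len(".pth.tar")]       # drop the extension
    _, epoch, score = stem.split("_", 2)  # "model", epoch, score
    return int(epoch), float(score)

def best_checkpoint(names: list[str]) -> str:
    # Pick the checkpoint whose filename encodes the highest score.
    return max(names, key=lambda n: parse_ckpt_name(n)[1])
```

For example, `parse_ckpt_name("model_2_26.834358523725836.pth.tar")` yields `(2, 26.834358523725836)`.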
python eval_nlq.py configs/Ego4D-NLQ/v2/ego4d_nlq_v2_multitask_finetune_2e-4.yaml /root/autodl-tmp/model/GroundNLQ/ckpt/ego4d_nlq_v2_multitask_finetune_2e-4_objectmambafinetune144/model_2_26.834358523725836.pth.tar -gpu 1

If you use our code, please consider citing our papers:
@inproceedings{feng2025object,
title={Object-shot enhanced grounding network for egocentric video},
author={Feng, Yisen and Zhang, Haoyu and Liu, Meng and Guan, Weili and Nie, Liqiang},
booktitle={Proceedings of the Computer Vision and Pattern Recognition Conference},
pages={24190--24200},
year={2025}
}

@article{feng2025osgnet,
title={OSGNet@ Ego4D Episodic Memory Challenge 2025},
author={Feng, Yisen and Zhang, Haoyu and Chu, Qiaohui and Liu, Meng and Guan, Weili and Wang, Yaowei and Nie, Liqiang},
journal={arXiv preprint arXiv:2506.03710},
year={2025}
}

This code is inspired by GroundNLQ.
We use the same video and text features as GroundNLQ. We thank the authors for their awesome open-source contributions.
This project is released under the MIT License. See LICENSE for details.