This work has been accepted at ICLR 2026.
Large Language Models have demonstrated strong reasoning capabilities by generating step-by-step reasoning in natural language before deriving the final answer. Recently, Geiping et al. (2025) introduced 3.5B-Huginn as an alternative to this paradigm: a depth-recurrent Transformer that increases computational depth per token by reusing a recurrent block in latent space. Despite its performance gains with increasing recurrence, this approach is inadequate for tasks demanding exploration and adaptivity, a limitation arising from its single, chain-like propagation mechanism. To address this, we propose a novel dynamic multi-branch routing approach for Huginn, termed the Mixture-of-Depth-Recurrent (MoDr) Transformer, which enables effective exploration of the solution space by shifting linear latent reasoning into a LoRA-based multi-branch dynamic relay mode with learnable hard-gate routing. We also introduce an auxiliary-loss-free load-balancing strategy to mitigate potential routing collapse. Our empirical results show that MoDr achieves average accuracy improvements of +7.2% and +2.48% over the original Huginn model and its fine-tuned variant, respectively, across various mathematical reasoning benchmarks, and improvements of +21.21% and +1.52% on commonsense reasoning benchmarks.
- Dynamic Multi-Branch Routing: Enables effective exploration of the solution space through LoRA-based multi-branch dynamic relay mode
- Learnable Hard-Gate Routing: Implements a learnable routing mechanism for adaptive branch selection
- Auxiliary-Loss-Free Load Balancing: Mitigates routing collapse without additional auxiliary losses
- Strong Performance: Achieves significant improvements on mathematical and commonsense reasoning benchmarks
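The hard-gate routing and auxiliary-loss-free load balancing described above can be sketched roughly as follows. This is a minimal NumPy illustration of the general idea, not the paper's implementation: all names, shapes, and the sign-based bias update are illustrative assumptions. The key point is that one LoRA branch is selected per latent state via an argmax gate, and a routing bias is adjusted outside the gradient to discourage overloaded branches, so no auxiliary loss term is needed.

```python
import numpy as np

# Illustrative sketch (hypothetical names/shapes, not the MoDr source code).
rng = np.random.default_rng(0)
d_model, rank, n_branches = 16, 4, 4

# Shared recurrent-block weight plus one low-rank (LoRA) delta per branch.
W = rng.standard_normal((d_model, d_model)) * 0.1
lora_A = rng.standard_normal((n_branches, rank, d_model)) * 0.1
lora_B = rng.standard_normal((n_branches, d_model, rank)) * 0.1

# Router producing one logit per branch, plus a load-balancing bias that is
# updated outside the gradient (auxiliary-loss-free style).
router_W = rng.standard_normal((n_branches, d_model)) * 0.1
bias = np.zeros(n_branches)

def route_step(h):
    """Pick one LoRA branch for latent state h with a hard (argmax) gate."""
    logits = router_W @ h + bias           # bias only steers selection
    k = int(np.argmax(logits))
    delta = lora_B[k] @ (lora_A[k] @ h)    # branch-specific low-rank update
    return W @ h + delta, k

# Route a batch of latent states, then rebalance: branches routed more often
# than average get their bias decreased, underused branches get it increased.
counts = np.zeros(n_branches)
h = rng.standard_normal((32, d_model))
for i in range(32):
    _, k = route_step(h[i])
    counts[k] += 1

gamma = 0.01  # bias update speed (hyperparameter, assumed)
bias -= gamma * np.sign(counts - counts.mean())
```

In training, the hard gate would typically need a straight-through or similar estimator to pass gradients to the router; that detail is omitted here for brevity.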
- Clone the repository:

```bash
git clone https://github.com/zhangxjohn/MoDr.git
cd MoDr
```

- Install training dependencies:

```bash
cd train
pip install -r requirements.txt
```

- Install evaluation dependencies:

```bash
cd ../evaluation
pip install -r requirements.txt
```

We provide training scripts for different tasks:
```bash
bash train/scripts/run_modr_math.sh
bash train/scripts/run_modr_cs.sh
bash train/scripts/run_modr_code.sh
```

After training, you can evaluate the model on various benchmarks:
```bash
cd evaluation
python eval_hf.py \
    --model_name_or_path /path/to/checkpoint \
    --origin_model_path /path/to/base/model \
    --data_name "gsm8k" \
    --temperature 0.0001 \
    --start_idx 0 \
    --end_idx -1 \
    --split "test" \
    --max_tokens 1024 \
    --top_p 0.95 \
    --seed 0 \
    --num_steps 16 \
    --num_choices 1 \
    --surround_with_messages
```

MoDr achieves average accuracy improvements of +7.2% and +2.48% over the original Huginn model and its fine-tuned variant, respectively, across various mathematical reasoning benchmarks.
MoDr shows improvements of +21.21% and +1.52% on commonsense reasoning benchmarks compared to the original Huginn model and its fine-tuned variant, respectively.
If you find MoDr useful or relevant to your research, please consider citing our paper:
```bibtex
@inproceedings{zhang2026modr,
  title={MoDr: Mixture-of-Depth-Recurrent Transformers for Test-Time Reasoning},
  author={Zhang, Xiaojing and Wu, Haifeng and He, Gang and Shen, Jiyang and Lyu, Bochen and Zhu, Zhanxing},
  booktitle={ICLR},
  year={2026}
}
```

We thank the Huginn team for their valuable contributions to the depth-recurrent Transformer architecture. The original Huginn implementation can be found at recurrent-pretraining.



