This work has been accepted at ICLR 2026.
Large Language Models have demonstrated strong reasoning capabilities by generating step-by-step reasoning in natural language before deriving the final answer. Recently, Geiping et al. (2025) introduced 3.5B-Huginn as an alternative to this paradigm: a depth-recurrent Transformer that increases computational depth per token by reusing a recurrent block in latent space. Despite its performance gains with increasing recurrence, this approach is inadequate for tasks demanding exploration and adaptivity, a limitation arising from its single, chain-like propagation mechanism. To address this, we propose a novel dynamic multi-branch routing approach for Huginn, termed the Mixture-of-Depth-Recurrent (MoDr) Transformer, which enables effective exploration of the solution space by shifting linear latent reasoning into a LoRA-based multi-branch dynamic relay mode with learnable hard-gate routing. We also introduce an auxiliary-loss-free load-balancing strategy to mitigate potential routing collapse. Our empirical results show that MoDr achieves average accuracy improvements of +7.2% and +2.48% over the original Huginn model and its fine-tuned variant, respectively, across various mathematical reasoning benchmarks, and improvements of +21.21% and +1.52% on commonsense reasoning benchmarks.
- Dynamic Multi-Branch Routing: Enables effective exploration of the solution space through LoRA-based multi-branch dynamic relay mode
- Learnable Hard-Gate Routing: Implements a learnable routing mechanism for adaptive branch selection
- Auxiliary-Loss-Free Load Balancing: Mitigates routing collapse without additional auxiliary losses
- Strong Performance: Achieves significant improvements on mathematical and commonsense reasoning benchmarks
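The hard-gate routing and auxiliary-loss-free load balancing described above can be sketched roughly as follows. This is a minimal NumPy illustration of the general idea, not the paper's implementation: all names, shapes, and the sign-based bias update are illustrative assumptions. The key point is that one LoRA branch is selected per latent state via an argmax gate, and a routing bias is adjusted outside the gradient to discourage overloaded branches, so no auxiliary loss term is needed.

```python
import numpy as np

# Illustrative sketch (hypothetical names/shapes, not the MoDr source code).
rng = np.random.default_rng(0)
d_model, rank, n_branches = 16, 4, 4

# Shared recurrent-block weight plus one low-rank (LoRA) delta per branch.
W = rng.standard_normal((d_model, d_model)) * 0.1
lora_A = rng.standard_normal((n_branches, rank, d_model)) * 0.1
lora_B = rng.standard_normal((n_branches, d_model, rank)) * 0.1

# Router producing one logit per branch, plus a load-balancing bias that is
# updated outside the gradient (auxiliary-loss-free style).
router_W = rng.standard_normal((n_branches, d_model)) * 0.1
bias = np.zeros(n_branches)

def route_step(h):
    """Pick one LoRA branch for latent state h with a hard (argmax) gate."""
    logits = router_W @ h + bias           # bias only steers selection
    k = int(np.argmax(logits))
    delta = lora_B[k] @ (lora_A[k] @ h)    # branch-specific low-rank update
    return W @ h + delta, k

# Route a batch of latent states, then rebalance: branches routed more often
# than average get their bias decreased, underused branches get it increased.
counts = np.zeros(n_branches)
h = rng.standard_normal((32, d_model))
for i in range(32):
    _, k = route_step(h[i])
    counts[k] += 1

gamma = 0.01  # bias update speed (hyperparameter, assumed)
bias -= gamma * np.sign(counts - counts.mean())
```

In training, the hard gate would typically need a straight-through or similar estimator to pass gradients to the router; that detail is omitted here for brevity.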
- Clone the repository:

```bash
git clone https://github.com/zhangxjohn/MoDr.git
cd MoDr
```

- Install training dependencies:

```bash
cd train
pip install -r requirements.txt
```

- Install evaluation dependencies:

```bash
cd ../evaluation
pip install -r requirements.txt
```

We provide training scripts for different tasks:
```bash
bash train/scripts/run_modr_math.sh
bash train/scripts/run_modr_cs.sh
bash train/scripts/run_modr_code.sh
```

After training, you can evaluate the model on various benchmarks:
```bash
cd evaluation
python eval_hf.py \
    --model_name_or_path /path/to/checkpoint \
    --origin_model_path /path/to/base/model \
    --data_name "gsm8k" \
    --temperature 0.0001 \
    --start_idx 0 \
    --end_idx -1 \
    --split "test" \
    --max_tokens 1024 \
    --top_p 0.95 \
    --seed 0 \
    --num_steps 16 \
    --num_choices 1 \
    --surround_with_messages
```

MoDr achieves average accuracy improvements of +7.2% and +2.48% over the original Huginn model and its fine-tuned variant, respectively, across various mathematical reasoning benchmarks.
MoDr shows improvements of +21.21% and +1.52% on commonsense reasoning benchmarks compared to the original Huginn model and its fine-tuned variant, respectively.
If you find MoDr useful or relevant to your research, please consider citing our paper:
```bibtex
@inproceedings{zhang2026modr,
  title={MoDr: Mixture-of-Depth-Recurrent Transformers for Test-Time Reasoning},
  author={Zhang, Xiaojing and Wu, Haifeng and He, Gang and Shen, Jiyang and Lyu, Bochen and Zhu, Zhanxing},
  booktitle={ICLR},
  year={2026}
}
```

We thank the Huginn team for their valuable contributions to the depth-recurrent Transformer architecture. The original Huginn implementation can be found at recurrent-pretraining.



