Questions about token ordering and attention masking in the mixed Transformer #18

@wnn2000

Description

Hi, thank you for your excellent work and for open-sourcing the code!

I have a question about the mixed Transformer design, especially how different tokens interact. Since the model is based on a Qwen2-VL decoder with an additional geometric expert, I’m trying to understand how geometric, visual, and text tokens are arranged in the input sequence. Given that LLM decoders are typically causal, I initially expected geometric tokens to be placed before visual/text tokens so that the semantic expert can benefit from them.

However, after reading the code, I noticed that you seem to apply a special attention masking strategy. For example, within the same geometric or visual split, you use local bidirectional attention, which is different from the standard causal attention in most modern MLLMs. Could you clarify the motivation behind this design, and how the visibility between geometric, visual, and text tokens is defined in practice?
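To make sure I'm reading the masking correctly, here is a minimal sketch (not your actual code; the function name, token-type labels, and span logic are my own assumptions) of the visibility rule as I understand it: tokens inside the same contiguous visual or geometric span attend to each other bidirectionally, while all other visibility stays causal.

```python
# Hypothetical illustration of the mixed masking scheme as I understand it.
# Assumed: contiguous runs of "vis"/"geo" tokens form local bidirectional
# blocks; "txt" tokens (and cross-span attention) remain strictly causal.

def build_mixed_mask(token_types, bidir_types=("vis", "geo")):
    """Return mask[i][j] = True if token i may attend to token j."""
    n = len(token_types)
    # Assign a span id to each token: contiguous same-type runs share an id.
    span = [0] * n
    for i in range(1, n):
        span[i] = span[i - 1] + (token_types[i] != token_types[i - 1])
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j <= i:  # standard causal visibility (past and self)
                mask[i][j] = True
            elif (token_types[i] in bidir_types
                  and span[i] == span[j]):  # local bidirectional block
                mask[i][j] = True
    return mask

types = ["txt", "vis", "vis", "vis", "txt", "geo", "geo"]
m = build_mixed_mask(types)
print(m[1][3])  # True: visual tokens in the same span see each other
print(m[0][1])  # False: a text token still cannot see future tokens
```

Is this roughly the visibility pattern you implement, or does the geometric expert see a different slice of the sequence?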

I also have a small question about model scale. Since the model is based on Qwen2-VL-2B but additionally introduces a DINO encoder and a geometric expert, does this significantly increase the total parameter count (yet you still call the model G2VLM-2B)? And since the dual-encoder design increases the number of input tokens, how do you handle the impact on inference efficiency, given the quadratic complexity of attention?

Thanks again for your great work!
