Questions about token ordering and attention masking in the mixed Transformer #18

@wnn2000

Description

Hi, thank you for your excellent work and for open-sourcing the code!

I have a question about the mixed Transformer design, especially how different tokens interact. Since the model is based on a Qwen2-VL decoder with an additional geometric expert, I’m trying to understand how geometric, visual, and text tokens are arranged in the input sequence. Given that LLM decoders are typically causal, I initially expected geometric tokens to be placed before visual/text tokens so that the semantic expert can benefit from them.

However, after reading the code, I noticed that you seem to apply a special attention masking strategy. For example, within the same geometric or visual split, you use local bidirectional attention, which is different from the standard causal attention in most modern MLLMs. Could you clarify the motivation behind this design, and how the visibility between geometric, visual, and text tokens is defined in practice?
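To make sure I'm reading the masking correctly, here is a minimal sketch (not your actual code; the function name, token-type labels, and span logic are my own assumptions) of the visibility rule as I understand it: tokens inside the same contiguous visual or geometric span attend to each other bidirectionally, while all other visibility stays causal.

```python
# Hypothetical illustration of the mixed masking scheme as I understand it.
# Assumed: contiguous runs of "vis"/"geo" tokens form local bidirectional
# blocks; "txt" tokens (and cross-span attention) remain strictly causal.

def build_mixed_mask(token_types, bidir_types=("vis", "geo")):
    """Return mask[i][j] = True if token i may attend to token j."""
    n = len(token_types)
    # Assign a span id to each token: contiguous same-type runs share an id.
    span = [0] * n
    for i in range(1, n):
        span[i] = span[i - 1] + (token_types[i] != token_types[i - 1])
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if j <= i:  # standard causal visibility (past and self)
                mask[i][j] = True
            elif (token_types[i] in bidir_types
                  and span[i] == span[j]):  # local bidirectional block
                mask[i][j] = True
    return mask

types = ["txt", "vis", "vis", "vis", "txt", "geo", "geo"]
m = build_mixed_mask(types)
print(m[1][3])  # True: visual tokens in the same span see each other
print(m[0][1])  # False: a text token still cannot see future tokens
```

Is this roughly the visibility pattern you implement, or does the geometric expert see a different slice of the sequence?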

I also have a small question about model scale. Since the model is based on Qwen2-VL-2B but additionally introduces a DINO encoder and a geometric expert, does this significantly increase the total parameter count (yet you still call the model G2VLM-2B)? And since the dual-encoder design increases the number of input tokens, how do you handle the impact on inference efficiency, given the quadratic complexity of attention?

Thanks again for your great work!
