@airborne12
Member

Summary

Cherry-pick of #59117 to branch-4.0.

This PR adds support for multiple tokenized indexes on a single column. Users can create several inverted indexes with different analyzers (e.g., chinese, english, standard) on the same text column and choose which analyzer a query uses.

Key changes:

  • Add USING ANALYZER syntax for MATCH predicates
  • Support multiple inverted indexes with different analyzers on the same column
  • Add analyzer key normalization and matching logic in BE
  • Add filterIndexesByAnalyzer method in OlapTable for index selection
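
The selection step in the last two bullets can be sketched roughly as follows. This is not the Doris implementation: `normalize_analyzer_key`, `filter_indexes_by_analyzer`, and the `InvertedIndex` record are hypothetical Python stand-ins, and the normalization rule (trim plus lowercase) is an assumption. The PR only states that analyzer keys are normalized and matched in BE and that `filterIndexesByAnalyzer` lives in `OlapTable`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class InvertedIndex:
    """Hypothetical stand-in for an inverted index definition on a column."""
    name: str
    column: str
    analyzer: str


def normalize_analyzer_key(key: Optional[str]) -> str:
    # Assumption: normalization means trimming whitespace and lowercasing.
    return (key or "").strip().lower()


def filter_indexes_by_analyzer(indexes: List[InvertedIndex], column: str,
                               analyzer: Optional[str]) -> List[InvertedIndex]:
    """Return the indexes on `column` whose analyzer matches `analyzer`.

    Assumption: when no analyzer is requested, every index on the column
    remains a candidate.
    """
    on_column = [ix for ix in indexes if ix.column == column]
    wanted = normalize_analyzer_key(analyzer)
    if not wanted:
        return on_column
    return [ix for ix in on_column
            if normalize_analyzer_key(ix.analyzer) == wanted]


indexes = [
    InvertedIndex("idx_std", "content", "std_analyzer"),
    InvertedIndex("idx_kw", "content", "kw_analyzer"),
]
selected = filter_indexes_by_analyzer(indexes, "content", " STD_Analyzer ")
print([ix.name for ix in selected])
```

An empty result here would correspond to the case where no index matches the requested analyzer, so the caller has to evaluate the predicate some other way.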

Conflicts Resolved

  • MatchPredicate.java - Updated constructor signatures for analyzer support
  • OlapTable.java - Added imports and new filtering methods
  • ExpressionTranslator.java - Updated to use new MatchPredicate constructor

Test plan

  • Regression tests included in original PR
  • Unit tests for analyzer key matcher and normalizer

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test
    • No need to test
  • Behavior changed:

    • No.
    • Yes. Adds new USING ANALYZER syntax for MATCH predicates.
  • Does this need documentation?

    • Yes. (covered by original PR documentation)

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

…mn (apache#59117)

This PR implements the **Multi-Analyzer Inverted Index** feature, which
allows creating multiple inverted indexes with different analyzers on a
single column.

1. **Multiple Indexes on Single Column**: Create multiple inverted
indexes with different analyzers (standard, keyword, chinese, custom) on
the same column
2. **USING ANALYZER Syntax**: Query with a specific analyzer using `MATCH
... USING ANALYZER analyzer_name`
3. **Smart Index Selection**: When the index for the specified analyzer is
not built, the query automatically falls back to the non-index path
(correct results guaranteed)
4. **Analyzer Identity Detection**: Prevents duplicate indexes with the
same analyzer configuration
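
Items 3 and 4 can be illustrated with a small sketch. Everything here is hypothetical and only models the described behavior: `IndexSpec`, `analyzer_identity`, `check_no_duplicates`, and `choose_access_path` are illustrative names, and treating identity as "column plus analyzer name" is an assumption (the real identity presumably also covers the analyzer's configuration).

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class IndexSpec:
    """Hypothetical stand-in for one inverted index definition."""
    name: str
    column: str
    analyzer: str
    built: bool = True  # False while the index data has not been built yet


def analyzer_identity(spec: IndexSpec) -> str:
    # Assumption: two indexes collide when they share column and analyzer.
    return f"{spec.column}:{spec.analyzer}"


def check_no_duplicates(specs: Iterable[IndexSpec]) -> None:
    """Raise ValueError if two indexes share the same analyzer identity."""
    seen: Dict[str, str] = {}
    for spec in specs:
        identity = analyzer_identity(spec)
        if identity in seen:
            raise ValueError(
                f"index {spec.name} duplicates {seen[identity]} ({identity})")
        seen[identity] = spec.name


def choose_access_path(specs: Iterable[IndexSpec], column: str,
                       analyzer: str) -> Optional[str]:
    """Return the name of a built index for the analyzer, or None.

    None models the fallback: the predicate is evaluated without an index.
    """
    for spec in specs:
        if spec.column == column and spec.analyzer == analyzer and spec.built:
            return spec.name
    return None


specs = [
    IndexSpec("idx_std", "content", "std_analyzer"),
    IndexSpec("idx_kw", "content", "kw_analyzer", built=False),
]
check_no_duplicates(specs)
print(choose_access_path(specs, "content", "std_analyzer"))
print(choose_access_path(specs, "content", "kw_analyzer"))
```

In this model the second lookup finds a matching index that is not yet built, so it returns `None` and the query would take the slower non-index path rather than fail.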

Use cases:

- Multi-language search on the same text column
- Precision vs. recall trade-off (exact match vs. fuzzy search)
- Autocomplete with edge_ngram while keeping standard search

```sql
-- Create table with multiple indexes
CREATE TABLE articles (
    id INT,
    content TEXT,
    INDEX idx_std (content) USING INVERTED PROPERTIES("analyzer" = "std_analyzer"),
    INDEX idx_kw (content) USING INVERTED PROPERTIES("analyzer" = "kw_analyzer")
) ...;

-- Query with specific analyzer
SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER std_analyzer;
SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER kw_analyzer;
```

@airborne12 requested a review from yiguolei as a code owner February 2, 2026 01:59
@Thearas
Contributor

Thearas commented Feb 2, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@airborne12
Member Author

run buildall

- Add getAnalyzerIdentity() method to IndexDef (branch-4.0 uses IndexDef, not IndexDefinition)
- Fix MatchPredicate to use setNullableFromNereids() instead of direct field assignment
- Change IndexDefinition.IndexType to IndexDef.IndexType in Index.java
- Remove unrelated ColumnSeqMapping methods from OlapTable (not present in branch-4.0)
@airborne12
Member Author

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 69.59% (238/342) 🎉

| Category | Coverage |
| --- | --- |
| Function Coverage | 52.99% (19033/35917) |
| Line Coverage | 36.12% (177235/490668) |
| Region Coverage | 32.76% (137441/419497) |
| Branch Coverage | 33.62% (59528/177075) |

@yiguolei merged commit 9512b4a into apache:branch-4.0 Feb 2, 2026
24 of 28 checks passed
@airborne12 deleted the pick-59117-to-4.0 branch February 2, 2026 08:24