analyzer: add pluggable tokenizer and basic CJK (code-point) support #5

Open
amirHdev wants to merge 1 commit into wizenheimer:main from amirHdev:feat-cjk-tokenizer-support

Conversation

@amirHdev

Adds a minimal, backwards-compatible path for CJK search support by making tokenization configurable and providing a built-in Unicode code-point tokenizer.

What changed

  • Added TokenizerFunc to AnalyzerConfig for custom tokenization.
  • Added UnicodeCodePointTokenizer for rune/code-point tokenization.
  • Added DefaultCJKConfig():
    • code-point tokenizer
    • MinTokenLength = 1
    • disables English stopword filtering and stemming
  • Added NewInvertedIndexWithAnalyzer(config) to construct an index with custom analyzer behavior.
  • Routed indexing and query-time analysis through the index analyzer config (indexing, BM25/proximity ranking, query builder term/phrase).
  • Added tests covering:
    • tokenizer output for Unicode/CJK text
    • BM25 match for "你好,世界" queried with "好"
    • QueryBuilder term query honoring the index analyzer config
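To make the list above concrete, here is a minimal sketch of what a pluggable tokenizer hook and a code-point tokenizer could look like. It is not the code from this PR: the lowercase names (`tokenizerFunc`, `analyzerConfig`, `codePointTokenize`, `cjkConfig`, `analyze`) are hypothetical stand-ins, the `func(string) []string` signature is an assumption, and so is the choice to measure the minimum token length in runes and to drop punctuation.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// tokenizerFunc is a hypothetical stand-in for the PR's TokenizerFunc;
// the real signature is not shown in the description.
type tokenizerFunc func(text string) []string

// analyzerConfig is a hypothetical, reduced stand-in for AnalyzerConfig.
// The field names are assumptions made for this sketch.
type analyzerConfig struct {
	tokenizer       tokenizerFunc
	minTokenLength  int  // assumed to be counted in runes, not bytes
	removeStopwords bool // placeholder, not acted on in this sketch
	stem            bool // placeholder, not acted on in this sketch
}

// codePointTokenize emits one token per letter or digit code point and
// skips everything else (punctuation, whitespace, symbols).
func codePointTokenize(text string) []string {
	var tokens []string
	for _, r := range text {
		if unicode.IsLetter(r) || unicode.IsNumber(r) {
			tokens = append(tokens, string(unicode.ToLower(r)))
		}
	}
	return tokens
}

// cjkConfig mirrors the settings listed for DefaultCJKConfig:
// code-point tokenizer, minimum length 1, no stopwords, no stemming.
func cjkConfig() analyzerConfig {
	return analyzerConfig{
		tokenizer:       codePointTokenize,
		minTokenLength:  1,
		removeStopwords: false,
		stem:            false,
	}
}

// analyze runs the configured tokenizer and applies the length filter.
func analyze(text string, cfg analyzerConfig) []string {
	var out []string
	for _, tok := range cfg.tokenizer(text) {
		if utf8.RuneCountInString(tok) >= cfg.minTokenLength {
			out = append(out, tok)
		}
	}
	return out
}

func main() {
	fmt.Println(analyze("你好,世界", cjkConfig())) // prints [你 好 世 界]
}
```

Keeping the tokenizer as a plain function value means callers can swap in another strategy without touching the rest of the analysis pipeline.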

Why

The current analyzer accepts Unicode input but emits each run of CJK characters as a single token (e.g. "你好"), so a single-character query such as "好" finds no match. This PR adds a simple tokenizer strategy that enables basic CJK matching without changing the default English behavior.
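The mismatch can be reproduced outside the library with a small sketch. It assumes the default tokenizer behaves roughly like "split on anything that is not a letter or digit"; `wordTokenize` and `contains` are hypothetical helpers written for this illustration, not functions from the codebase.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// wordTokenize approximates a word-level tokenizer: it splits on every
// rune that is neither a letter nor a digit. This is an assumption
// about the default behavior, not the library's actual implementation.
func wordTokenize(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
}

// contains reports whether term is exactly equal to one of the tokens,
// which is how an inverted-index term lookup behaves.
func contains(tokens []string, term string) bool {
	for _, t := range tokens {
		if t == term {
			return true
		}
	}
	return false
}

func main() {
	tokens := wordTokenize("你好,世界")
	fmt.Println(tokens)                 // prints [你好 世界]
	fmt.Println(contains(tokens, "好")) // prints false
}
```

Because CJK text has no spaces between words, the whole run "你好" becomes one index term, and an exact-term lookup for "好" has nothing to hit.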

Notes / limitations

  • This is a basic code-point tokenizer, not a full Chinese/Japanese segmentation solution.
  • It improves correctness for single-character matching and provides an extension point for future tokenizers (e.g. bigram or dictionary-based segmentation).
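As an illustration of the extension point mentioned above, a bigram tokenizer could be plugged in the same way. The sketch below is not part of this PR: `bigramTokenize` is a hypothetical name, it assumes the same `func(string) []string` shape, and it applies bigrams to every script alike, which a real implementation would probably restrict to CJK ranges.

```go
package main

import (
	"fmt"
	"unicode"
)

// bigramTokenize emits overlapping two-rune tokens for each run of
// letters or digits. A run of length one is emitted as a single-rune
// token so isolated characters stay searchable.
func bigramTokenize(text string) []string {
	var tokens []string
	var run []rune

	// flush converts the current run into tokens and resets it.
	flush := func() {
		switch {
		case len(run) == 1:
			tokens = append(tokens, string(run))
		case len(run) > 1:
			for i := 0; i+1 < len(run); i++ {
				tokens = append(tokens, string(run[i:i+2]))
			}
		}
		run = run[:0]
	}

	for _, r := range text {
		if unicode.IsLetter(r) || unicode.IsNumber(r) {
			run = append(run, unicode.ToLower(r))
		} else {
			flush() // punctuation and whitespace end the run
		}
	}
	flush()
	return tokens
}

func main() {
	fmt.Println(bigramTokenize("你好世界"))  // prints [你好 好世 世界]
	fmt.Println(bigramTokenize("你好,世界")) // prints [你好 世界]
}
```

Bigrams trade single-character recall for fewer false positives on multi-character words; a query would need to be tokenized with the same function to match.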

Example

idx := blaze.NewInvertedIndexWithAnalyzer(blaze.DefaultCJKConfig())
idx.Index(1, "你好,世界")
matches := idx.RankBM25("好", 10)
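The example works only if the query "好" is tokenized by the same analyzer that indexed the document. The toy index below sketches that idea in isolation. It is a deliberately simplified stand-in: `toyIndex`, `runeTokens`, and the boolean AND lookup are hypothetical, and there is no BM25 scoring here.

```go
package main

import (
	"fmt"
	"unicode"
)

// tokenizerFunc is an assumed shape for a pluggable tokenizer.
type tokenizerFunc func(string) []string

// runeTokens emits one token per letter or digit code point.
func runeTokens(text string) []string {
	var out []string
	for _, r := range text {
		if unicode.IsLetter(r) || unicode.IsNumber(r) {
			out = append(out, string(r))
		}
	}
	return out
}

// toyIndex stores the tokenizer it was built with, so indexing and
// querying cannot drift apart.
type toyIndex struct {
	tokenize tokenizerFunc
	postings map[string][]int // term -> document IDs in insertion order
}

func newToyIndex(t tokenizerFunc) *toyIndex {
	return &toyIndex{tokenize: t, postings: make(map[string][]int)}
}

// index adds each distinct token of text to the postings for docID.
func (ix *toyIndex) index(docID int, text string) {
	seen := make(map[string]bool)
	for _, tok := range ix.tokenize(text) {
		if !seen[tok] {
			seen[tok] = true
			ix.postings[tok] = append(ix.postings[tok], docID)
		}
	}
}

// hasAll reports whether docID appears in the postings of every term.
func (ix *toyIndex) hasAll(docID int, terms []string) bool {
	for _, t := range terms {
		found := false
		for _, d := range ix.postings[t] {
			if d == docID {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

// lookup tokenizes the query with the index's own tokenizer and
// returns the documents that contain every query token.
func (ix *toyIndex) lookup(query string) []int {
	terms := ix.tokenize(query)
	if len(terms) == 0 {
		return nil
	}
	var result []int
	for _, id := range ix.postings[terms[0]] {
		if ix.hasAll(id, terms[1:]) {
			result = append(result, id)
		}
	}
	return result
}

func main() {
	ix := newToyIndex(runeTokens)
	ix.index(1, "你好,世界")
	ix.index(2, "世界和平")
	fmt.Println(ix.lookup("好"))  // prints [1]
	fmt.Println(ix.lookup("世界")) // prints [1 2]
}
```

Storing the tokenizer on the index, rather than passing it to each call, is one way to guarantee that every query path sees the same analysis as indexing did.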

- add configurable tokenizer hook to AnalyzerConfig
- add Unicode code-point tokenizer and DefaultCJKConfig
- route index/search/query paths through index analyzer config
- add tests covering Chinese single-character query matching

Signed-off-by: Amirhossein Akhlaghpour <m9.akhlaghpoor@gmail.com>