analyzer: add pluggable tokenizer and basic CJK (code-point) support by amirHdev · Pull Request #5 · wizenheimer/blaze

amirHdev · 2026-02-22T15:06:34Z

Adds a minimal, backwards-compatible path for CJK search support by making tokenization configurable and providing a built-in Unicode code-point tokenizer.

What changed

- Added TokenizerFunc to AnalyzerConfig for custom tokenization.
- Added UnicodeCodePointTokenizer for rune/code-point tokenization.
- Added DefaultCJKConfig():
  - code-point tokenizer
  - MinTokenLength = 1
  - disables English stopword filtering and stemming
- Added NewInvertedIndexWithAnalyzer(config) to construct an index with custom analyzer behavior.
- Routed indexing and query-time analysis through the index analyzer config (indexing, BM25/proximity ranking, query builder term/phrase).
- Added tests covering:
  - tokenizer output for Unicode/CJK text
  - BM25 match for "你好，世界" queried with "好"
  - QueryBuilder term query honoring the index analyzer config
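
For readers unfamiliar with the approach, the idea behind a pluggable code-point tokenizer can be sketched roughly as follows. This is a standalone illustration, not the code in this PR: the TokenizerFunc signature shown here and the codePointTokenizer helper are assumptions made for the example, and the real UnicodeCodePointTokenizer may treat punctuation, digits, or case differently.

```go
package main

import (
	"fmt"
	"unicode"
)

// TokenizerFunc is a hypothetical signature for a tokenizer hook;
// the actual type in blaze may differ.
type TokenizerFunc func(text string) []string

// codePointTokenizer is an illustrative stand-in, not the PR's implementation.
// It emits one token per letter or digit rune and drops everything else
// (whitespace, punctuation, symbols).
func codePointTokenizer(text string) []string {
	var tokens []string
	for _, r := range text { // ranging over a string yields runes, not bytes
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			tokens = append(tokens, string(r))
		}
	}
	return tokens
}

func main() {
	// Any function with the matching signature can be plugged in.
	var tokenize TokenizerFunc = codePointTokenizer
	fmt.Println(tokenize("你好，世界")) // [你 好 世 界]
}
```

Because every character becomes its own term, a query for "好" is analyzed to the same single-rune token that was indexed.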

Why

Current behavior handles Unicode characters but tokenizes CJK text as whole words (e.g. "你好"), so single-character queries like "好" do not match. This PR adds a simple tokenizer strategy that enables basic CJK matching without changing default English behavior.
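
The mismatch can be reproduced outside the library with a small sketch. It assumes a tokenizer that splits on any non-letter, non-digit character, which is a guess at the general shape of the default behavior and not a copy of the blaze analyzer; wordTokens and contains are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// wordTokens splits text on every rune that is not a letter or digit,
// so an unbroken run of CJK characters comes out as one token.
// This is an assumed approximation of word-level tokenization.
func wordTokens(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// contains reports whether term appears as an exact token,
// which is how an inverted index looks up a term.
func contains(tokens []string, term string) bool {
	for _, t := range tokens {
		if t == term {
			return true
		}
	}
	return false
}

func main() {
	tokens := wordTokens("你好，世界")
	fmt.Println(tokens)                // [你好 世界]
	fmt.Println(contains(tokens, "好")) // false
}
```

The indexed terms are "你好" and "世界", so an exact-term lookup for "好" finds nothing.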

Notes / limitations

This is a basic code-point tokenizer, not a full Chinese/Japanese segmentation solution.
It improves correctness for single-character matching and provides an extension point for future tokenizers (e.g. bigram or dictionary-based segmentation).
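
As an illustration of what a future tokenizer behind the same extension point might look like, here is a rough sketch of an overlapping-bigram tokenizer. It is not part of this PR; bigramTokenizer is a hypothetical name, and the sketch assumes the hook accepts a func(string) []string.

```go
package main

import (
	"fmt"
	"unicode"
)

// bigramTokenizer is a hypothetical example, not code from this PR.
// It emits overlapping two-rune tokens for each run of letters or digits.
// A run of exactly one rune is emitted as a single-rune token.
// Note: this sketch applies bigrams to all scripts; a practical version
// would likely restrict them to CJK ranges (e.g. unicode.Han).
func bigramTokenizer(text string) []string {
	var tokens []string
	var run []rune

	// flush converts the current run into tokens and resets it.
	flush := func() {
		switch {
		case len(run) == 1:
			tokens = append(tokens, string(run))
		case len(run) > 1:
			for i := 0; i+1 < len(run); i++ {
				tokens = append(tokens, string(run[i:i+2]))
			}
		}
		run = run[:0]
	}

	for _, r := range text {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			run = append(run, r)
		} else {
			flush() // punctuation or whitespace ends the run
		}
	}
	flush()
	return tokens
}

func main() {
	fmt.Println(bigramTokenizer("你好世界，好")) // [你好 好世 世界 好]
}
```

Bigrams tend to give better phrase precision than single code points at the cost of a larger index, and single-character queries only match runs of length one unless the query side falls back to code points.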

Example

idx := blaze.NewInvertedIndexWithAnalyzer(blaze.DefaultCJKConfig())
idx.Index(1, "你好，世界")
matches := idx.RankBM25("好", 10)

- add configurable tokenizer hook to AnalyzerConfig
- add Unicode code-point tokenizer and DefaultCJKConfig
- route index/search/query paths through index analyzer config
- add tests covering Chinese single-character query matching

Signed-off-by: Amirhossein Akhlaghpour <m9.akhlaghpoor@gmail.com>

amirHdev wants to merge 1 commit into wizenheimer:main from amirHdev:feat-cjk-tokenizer-support
