analyzer: add pluggable tokenizer and basic CJK (code-point) support #5
Open
amirHdev wants to merge 1 commit into wizenheimer:main from
- add configurable tokenizer hook to AnalyzerConfig
- add Unicode code-point tokenizer and DefaultCJKConfig
- route index/search/query paths through index analyzer config
- add tests covering Chinese single-character query matching

Signed-off-by: Amirhossein Akhlaghpour <m9.akhlaghpoor@gmail.com>
Adds a minimal, backwards-compatible path for CJK search support by making tokenization configurable and providing a built-in Unicode code-point tokenizer.
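A minimal sketch of what a configurable tokenizer hook of this kind might look like. The struct layout, the `Analyze` function, and the helper names `wordTokenizer` and `codePointTokenizer` are assumptions made for illustration; they are not taken from this PR's diff. The idea is only that a nil hook falls back to the existing word-level behavior, so default configurations are unaffected.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// TokenizerFunc is a hypothetical hook signature: text in, raw tokens out.
type TokenizerFunc func(text string) []string

// AnalyzerConfig is an illustrative stand-in, not the project's actual struct.
type AnalyzerConfig struct {
	MinTokenLength int           // minimum token length, counted in runes
	Tokenizer      TokenizerFunc // nil means "use the default word tokenizer"
}

// wordTokenizer splits on anything that is not a letter or digit,
// so a run of CJK characters stays together as one token.
func wordTokenizer(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// codePointTokenizer emits one token per letter or digit rune.
func codePointTokenizer(text string) []string {
	var tokens []string
	for _, r := range text {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			tokens = append(tokens, string(r))
		}
	}
	return tokens
}

// Analyze runs the configured tokenizer, then lowercases each token
// and drops tokens shorter than MinTokenLength.
func Analyze(text string, cfg AnalyzerConfig) []string {
	tokenize := cfg.Tokenizer
	if tokenize == nil {
		tokenize = wordTokenizer // backwards-compatible default
	}
	var out []string
	for _, tok := range tokenize(text) {
		tok = strings.ToLower(tok)
		if len([]rune(tok)) >= cfg.MinTokenLength {
			out = append(out, tok)
		}
	}
	return out
}

func main() {
	english := AnalyzerConfig{MinTokenLength: 2}
	cjk := AnalyzerConfig{MinTokenLength: 1, Tokenizer: codePointTokenizer}

	fmt.Println(Analyze("Hello, world", english)) // [hello world]
	fmt.Println(Analyze("你好,世界", english))        // [你好 世界]
	fmt.Println(Analyze("你好,世界", cjk))            // [你 好 世 界]
}
```

In this sketch the minimum length has to drop to 1 for the code-point configuration, since every token is a single rune and a higher threshold would filter out all of them.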
What changed
- Added TokenizerFunc to AnalyzerConfig for custom tokenization.
- Added UnicodeCodePointTokenizer for rune/code-point tokenization.
- Added DefaultCJKConfig(): MinTokenLength = 1
- Added NewInvertedIndexWithAnalyzer(config) to construct an index with custom analyzer behavior.
- Added tests:
  - "你好,世界" queried with "好"
  - QueryBuilder term query honoring the index analyzer config
Why
Current behavior handles Unicode characters but tokenizes CJK text as whole words (e.g. "你好"), so single-character queries like "好" do not match. This PR adds a simple tokenizer strategy that enables basic CJK matching without changing default English behavior.
Notes / limitations
Example
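As a rough illustration of the matching behavior described above, here is a self-contained toy inverted index. It does not use this PR's API: `toyIndex`, `byWord`, and `byCodePoint` are hypothetical names, and the only point is to show the same document and query producing a miss under word tokenization and a hit under code-point tokenization.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenizer is a hypothetical function type standing in for the analyzer hook.
type tokenizer func(string) []string

// byWord keeps runs of letters/digits together, so "你好" stays one token.
func byWord(s string) []string {
	return strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// byCodePoint emits each letter/digit rune as its own token.
func byCodePoint(s string) []string {
	var out []string
	for _, w := range byWord(s) {
		for _, r := range w {
			out = append(out, string(r))
		}
	}
	return out
}

// toyIndex maps each token to the set of document IDs containing it.
type toyIndex struct {
	tok      tokenizer
	postings map[string]map[int]bool
	numDocs  int
}

func newToyIndex(tok tokenizer) *toyIndex {
	return &toyIndex{tok: tok, postings: map[string]map[int]bool{}}
}

// Add tokenizes text with the index's tokenizer and returns the new doc ID.
func (ix *toyIndex) Add(text string) int {
	id := ix.numDocs
	ix.numDocs++
	for _, t := range ix.tok(text) {
		if ix.postings[t] == nil {
			ix.postings[t] = map[int]bool{}
		}
		ix.postings[t][id] = true
	}
	return id
}

// Search tokenizes the query with the same tokenizer used at index time
// and returns the IDs of documents that contain every query token.
func (ix *toyIndex) Search(query string) []int {
	terms := ix.tok(query)
	var hits []int
	if len(terms) == 0 {
		return hits
	}
	for id := 0; id < ix.numDocs; id++ {
		all := true
		for _, t := range terms {
			if !ix.postings[t][id] {
				all = false
				break
			}
		}
		if all {
			hits = append(hits, id)
		}
	}
	return hits
}

func main() {
	words := newToyIndex(byWord)
	words.Add("你好,世界")

	runes := newToyIndex(byCodePoint)
	runes.Add("你好,世界")

	// Word tokens are "你好" and "世界", so "好" alone finds nothing.
	fmt.Println(len(words.Search("好"))) // 0
	// Code-point tokens include "好", so the same query matches.
	fmt.Println(len(runes.Search("好"))) // 1
}
```

Because the query is analyzed with the index's own tokenizer, index-time and query-time tokens stay consistent. This toy ignores token order, so a multi-character query matches any document containing all of its characters, adjacent or not.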