
fix: CJK word count and delimiters in recursive chunker#114

Open
vinsew wants to merge 1 commit into garrytan:master from vinsew:fix/cjk-chunker-word-count

Conversation

vinsew commented Apr 14, 2026

Summary

The recursive chunker's countWords() uses text.match(/\S+/g) to count whitespace-separated tokens. CJK text (Chinese/Japanese/Korean) has no spaces between characters, so a whole paragraph of Chinese is counted as 1 "word". The chunker never splits it, the paragraph gets sent as one >8192-token embedding request, and OpenAI returns:

Error embedding: 400 Invalid 'input[N]': maximum input length is 8192 tokens

This is the chunker-side counterpart to #98 (fix: CJK word count in query expansion), which fixed the same root cause in expandQuery().
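The miscount is easy to see in isolation. A minimal sketch of the whitespace-based counter described above (the helper name and sample strings are illustrative, not the exact source):

```typescript
// Whitespace-token counter as described in the summary: one CJK
// paragraph with no spaces collapses to a single "word".
function countWordsWhitespace(text: string): number {
  return text.match(/\S+/g)?.length ?? 0;
}

const english = "the quick brown fox";
const chinese = "这是一个没有空格的中文段落，分块器把整段当成一个词。";

console.log(countWordsWhitespace(english)); // → 4
console.log(countWordsWhitespace(chinese)); // → 1
```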

Reproduction

# Any pure Chinese markdown file >~2KB triggers it
gbrain import path/to/chinese-notes/
gbrain embed --stale
# → Error embedding <slug>: 400 Invalid 'input[N]': maximum input length is 8192 tokens

Real case that surfaced this: a 101 KB Chinese interview transcript produced chunks that each exceeded the OpenAI embedding token limit, so those chunks never got vectorized and vector search on them silently degraded.

Fix

Three small changes in src/core/chunkers/recursive.ts:

  1. countWords() — CJK-aware counting. When CJK characters are present, count non-whitespace characters (1 char ≈ 1 "word"). Mirrors the detection already used in search/expansion.ts (PR #98, fix: CJK word count in query expansion) so semantics stay consistent across the codebase.

  2. DELIMITERS L2 and L3 — CJK full-width punctuation. Chinese/Japanese use 。!? for sentence endings and ;:,、 for clauses. Without these, a CJK paragraph finds no split points even when the size target is exceeded.

  3. splitOnWhitespace() — character-slice fallback. When a CJK paragraph reaches the whitespace-split level but has no whitespace, fall back to character-boundary slicing (text.slice(i, i + target)) so each piece is ≤ target characters.
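Changes 1 and 3 could be sketched as follows (identifiers follow the PR description; the exact source and the precise CJK ranges may differ):

```typescript
// Rough Han / Hiragana+Katakana / Hangul ranges; the constant in the
// repo may cover additional blocks.
const CJK_RE = /[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]/;

// Change 1: when CJK characters are present, count non-whitespace
// characters (1 char ≈ 1 "word") instead of space-separated tokens.
function countWords(text: string): number {
  if (CJK_RE.test(text)) return text.replace(/\s+/g, "").length;
  return text.match(/\S+/g)?.length ?? 0;
}

// Change 3: whitespace split with a character-slice fallback for text
// that contains no whitespace, so each piece is ≤ target characters.
function splitOnWhitespace(text: string, target: number): string[] {
  const pieces = text.split(/\s+/).filter(Boolean);
  if (pieces.length > 1) return pieces;
  const slices: string[] = [];
  for (let i = 0; i < text.length; i += target) {
    slices.push(text.slice(i, i + target));
  }
  return slices;
}
```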

Impact

  • 2 files changed, +61 / -2 lines
  • Zero behavior change for non-CJK text (same regex, same code path)
  • Pure CJK paragraphs now split into ≤ target-char chunks instead of one oversized chunk
  • Unblocks embedding + vector search for CJK users

Test plan

  • 3 new tests in test/chunkers/recursive.test.ts cover:
    • Long Chinese paragraph splits into multiple chunks with per-chunk char count ≤ 1.5× target (greedy-merge tolerance)
    • Japanese (Hiragana) and Korean (Hangul, space-delimited) both split
    • Mixed CJK + English text still splits
  • All 18 chunker tests pass locally (15 existing + 3 new)
  • bun test shows no new regressions (the 4 pre-existing failures in PGLiteEngine are unrelated and exist on master)

Inspired by and stylistically consistent with #98 by @YIING99.

The whitespace-based countWords() treats a whole CJK paragraph as 1
"word" (no spaces between Chinese/Japanese/Korean chars), so chunkText()
never splits it. The paragraph then gets sent as one >8192-token
embedding request and fails with OpenAI's max input length error.

This is the chunker-side counterpart to PR garrytan#98 (fix: CJK word count in
query expansion), which fixed the same root cause in expandQuery.

Changes:
- countWords(): detect CJK and count non-whitespace characters instead
  of space-separated tokens. Mirrors the detection in PR garrytan#98 for
  consistent semantics.
- DELIMITERS L2/L3: add CJK full-width punctuation (。!?;:,、)
  so the chunker can find semantic split points in Chinese/Japanese.
- splitOnWhitespace(): fall back to character-slicing when CJK text
  has no whitespace (otherwise a whole paragraph collapses to 1 piece).

Impact: pure Chinese/Japanese/Korean documents now produce properly
sized chunks (≤ target chars ≈ tokens) instead of one giant chunk
that exceeds OpenAI's 8192-token embedding limit.

Tests: 3 new cases cover long Chinese paragraphs, Japanese/Korean
text, and CJK+English mixed text. All 18 chunker tests pass; no
pre-existing test regressions.
vinsew added a commit to vinsew/gbrain that referenced this pull request Apr 14, 2026
slugifySegment's filter regex /[^a-z0-9.\s_-]/g strips every non-ASCII
character, so a pure-CJK filename (e.g., "品牌圣经.md", "销售论证文档.md")
collapses to an empty string and gets filtered out by slugifyPath's
.filter(Boolean). Both files then collapse to the parent directory slug
(e.g., "inbox") and silently overwrite each other during gbrain import,
which still reports "N imported, 0 errors".

This is the slug-side counterpart to PR garrytan#98 (query expansion) and
PR garrytan#114 (chunker) — same root cause (ASCII-only text handling) in a
third part of the codebase.

Changes:
- slugifySegment(): add CJK ranges (Han, Hiragana, Katakana, Hangul)
  to the character-class whitelist. Mirrors the CJK range constant used
  in the chunker fix for consistent semantics.
- Add .normalize('NFC') after the NFD+accent-strip step so Hangul
  syllables, which NFD decomposes into conjoining Jamo, get recomposed
  before the filter runs. Without this Korean names still collapse to
  empty because Jamo are outside the Hangul Syllables block.
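The described slugify change could look like this sketch (the actual regex, helper body, and CJK ranges in the repo may differ):

```typescript
// Hypothetical shape of slugifySegment with the CJK whitelist and the
// NFC recompose step described above.
const CJK = "\\u4e00-\\u9fff\\u3040-\\u30ff\\uac00-\\ud7af";
const FILTER = new RegExp(`[^a-z0-9.\\s_${CJK}-]`, "g");

function slugifySegment(segment: string): string {
  return segment
    .normalize("NFD")                // decompose accented Latin
    .replace(/[\u0300-\u036f]/g, "") // strip combining accents
    .normalize("NFC")                // recompose Hangul Jamo into syllables
    .toLowerCase()
    .replace(FILTER, "")
    .trim()
    .replace(/\s+/g, "-");
}
```

Without the `.normalize('NFC')` step, `"한국어"` would still collapse to empty: NFD leaves conjoining Jamo (U+1100–U+11FF), which fall outside the Hangul Syllables whitelist range.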

Impact: pure/mixed CJK filenames now produce meaningful, non-colliding
slugs. ASCII-only behavior is unchanged.

Tests: 8 new cases cover Chinese, Japanese (Hiragana + Katakana),
Korean, mixed CJK+ASCII, CJK directories, and the collision-regression
scenario. All 46 slug tests pass. No new regressions in full suite
(the 4 PGLiteEngine failures pre-exist on master).
vinsew added a commit to vinsew/gbrain that referenced this pull request Apr 14, 2026
When a git repository contains files with non-ASCII names (common for
Chinese/Japanese/Korean users, or for files exported from Apple Notes
with spaces + CJK like "2026-04-14 22_38 记录.md"), `git diff
--name-status` wraps those paths in double quotes and octal-escapes
each byte:

    A   "inbox/2026-04-14 22_38 \350\256\260\345\275\225.md"

buildSyncManifest then treats that literal quoted-escaped string as
the path, downstream filesystem lookups fail, and the file is
silently dropped from the sync manifest. The user sees "added: 0"
in the sync result even though git has those files committed, and
`gbrain search` can't find the content. The cron log shows success
because nothing technically errored.

This is the sync-layer counterpart to the same CJK root cause class
fixed in garrytan#98 (query expansion), garrytan#114 (chunker), and garrytan#115 (slugify):
ASCII-only assumptions baked into a fourth part of the codebase.

Reproduction:
    cd some-brain-repo
    echo "# test" > "inbox/测试文件.md"
    git add . && git commit -m test
    gbrain sync --repo .
    # -> "added: 0, chunksCreated: 0"  ← bug
    # -> But git log clearly shows the commit added the file.

Fix:
- Add `-c core.quotepath=false` to the `git()` helper in
  src/commands/sync.ts. This config tells git to emit paths as-is
  (UTF-8) in diff/log output instead of the default double-quoted
  octal-escaped form. The fix is at the call site so all future git
  invocations through this helper are covered, not just `diff`.
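The helper change could look like this sketch (the real git() in src/commands/sync.ts may have a different signature; gitArgs is an illustrative split so the argv is testable on its own):

```typescript
import { execFileSync } from "node:child_process";

// Build the argv for every git invocation; core.quotepath=false makes
// git emit UTF-8 paths verbatim instead of double-quoted octal escapes.
function gitArgs(repo: string, ...args: string[]): string[] {
  return ["-c", "core.quotepath=false", "-C", repo, ...args];
}

// Hypothetical shape of the git() helper described above: all git
// subcommands run through it pick up the quotepath config.
function git(repo: string, ...args: string[]): string {
  return execFileSync("git", gitArgs(repo, ...args), { encoding: "utf8" });
}
```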

Impact:
- 2 files changed, +18 / -1 lines (1-line code fix + comment + tests)
- Zero behavior change for ASCII-only paths
- CJK filenames (with or without spaces) now sync correctly

Test plan:
- [x] 3 new tests in test/sync.test.ts cover pure-CJK paths (Chinese
      + Japanese + Korean), CJK-with-spaces (Apple Notes pattern),
      and CJK rename entries.
- [x] All 35 sync tests pass (32 existing + 3 new).
- [x] Full `bun test` suite: no new regressions (the 4 pre-existing
      PGLiteEngine failures are unrelated and exist on master).

Companion to garrytan#114 (chunker CJK) and garrytan#115 (slugify CJK). Third in
the series; all three can merge independently.
vinsew added a commit to vinsew/gbrain that referenced this pull request Apr 14, 2026
GBrain stores internal cross-page references in slug form (e.g.
`[Alice](./alice)`) because the slug is the canonical identifier in the
DB. That works inside GBrain's own resolution layer.

But when those pages are exported as `.md` files on disk and opened in
standard markdown viewers (Obsidian, VS Code preview, GitHub web view,
typical mkdocs/jekyll renderers), the viewers look for a literal file
at `./alice` — which doesn't exist. The actual file is `./alice.md`.

Result: every internal link in an exported brain is silently broken on
disk. The user clicks `[小龙]` in `龙虾群.md`, sees a 404 / empty page,
and cannot navigate the brain outside of GBrain itself. This defeats
half the value of having the brain stored as portable markdown.

Fix:

Add `normalizeInternalLinks(content)` that runs over each page's
serialized markdown right before `writeFileSync` and rewrites slug-form
internal links to filename-form by appending `.md`:

  [Alice](./alice)            -> [Alice](./alice.md)
  [Alice](alice)              -> [Alice](alice.md)
  [Alice](../people/alice)    -> [Alice](../people/alice.md)
  [小龙](../people/小龙)        -> [小龙](../people/小龙.md)

Conservative: leaves untouched anything that looks external or already
extended:

- URL schemes (http:, https:, mailto:, ftp:, file:, tel:, ...) — skip
- Anchors (#section)                                            — skip
- Empty targets                                                 — skip
- Trailing slash (directory references)                         — skip
- Already has any extension (.md, .png, .pdf, .MD, ...)         — skip
- Preserves query strings and anchors when appending:
  [Section](./alice#bio) -> [Section](./alice.md#bio)
  [Search](./alice?q=t)  -> [Search](./alice.md?q=t)

The DB content stays slug-form (GBrain's internal convention is
unchanged). Only the on-disk export gets the `.md` annotation, so the
exported markdown is viewable as-is by any standard renderer.
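A conservative rewriter matching those rules might look like this sketch (normalizeInternalLinks is named in the PR; the body here is illustrative, not the merged source):

```typescript
// Rewrite slug-form internal links to filename-form, skipping anything
// that looks external, anchored, empty, directory-like, or already
// carrying an extension. Query strings and anchors are preserved.
function normalizeInternalLinks(content: string): string {
  return content.replace(/\]\(([^)]+)\)/g, (whole: string, target: string) => {
    const [path, rest] = splitTarget(target);
    if (
      path === "" ||                       // empty target or pure #anchor
      /^[a-z][a-z0-9+.-]*:/i.test(path) || // URL scheme (http:, mailto:, ...)
      path.endsWith("/") ||                // directory reference
      /\.[A-Za-z0-9]+$/.test(path)         // already has an extension
    ) {
      return whole;
    }
    return `](${path}.md${rest})`;
  });
}

// Split a link target into its path and trailing ?query/#anchor part.
function splitTarget(target: string): [string, string] {
  const m = target.match(/^([^?#]*)([?#].*)?$/)!;
  return [m[1], m[2] ?? ""];
}
```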

Real-world reproduction this fix addresses:

  $ gbrain put 龙虾群 < <(echo '[小龙](./小龙)')
  $ gbrain export --dir /tmp/out
  $ cat /tmp/out/龙虾群.md
  # before this PR: contains [小龙](./小龙)  — clicking 404s
  # after this PR:  contains [小龙](./小龙.md) — clicking opens the file

Impact:
- 2 files changed, +149 / -1 lines (1 line of helper invocation +
  ~40 lines of helper + comment + 26 tests)
- Zero behavior change for external URLs, anchors, or already-extended
  links
- DB content unchanged — only the on-disk export representation gains
  the `.md` annotation
- Existing exports remain valid (re-running export on an already-exported
  brain is idempotent because already-extended links are skipped)

Tests:
- 26 new tests covering: same-dir slug, parent-dir slug, deep nesting,
  CJK slugs, multiple links per line, multi-line markdown, all 6
  external schemes (http/https/mailto/file/ftp/tel), all 4 extension
  cases (md/png/pdf/uppercase), anchor preservation, query preservation,
  empty/trailing-slash/no-link edge cases.
- All 26 tests pass.
- Full suite: 612 pass / no new regressions (4 pre-existing PGLiteEngine
  failures are unrelated and exist on master).

Fifth in a series of practical PRs from a real Chinese-speaking deploy.
Companion to:
- garrytan#114 (chunker CJK)
- garrytan#115 (slugify CJK)
- garrytan#119 (sync git quotepath CJK)
- garrytan#121 (self-contained API keys)

Same theme: GBrain is meaningfully more useful when the markdown export
is a first-class deliverable, not a half-broken side-effect.