Add Unicode replacement and surrogate tolerance flags #227
edsiper wants to merge 3 commits into ibireme:master
Conversation
Add `YYJSON_READ_REPLACE_INVALID_UNICODE` and `YYJSON_READ_ALLOW_INVALID_SURROGATE` flags to enable graceful handling of malformed Unicode in JSON strings.

`YYJSON_READ_REPLACE_INVALID_UNICODE` replaces all invalid Unicode sequences with the U+FFFD replacement character, ensuring the output is always valid UTF-8. It handles invalid `\uXXXX` escapes, unpaired surrogates, invalid UTF-8 bytes, and control characters.

`YYJSON_READ_ALLOW_INVALID_SURROGATE` permits unpaired surrogate code units in `\uXXXX` escapes, encoding them as UTF-8.

Key changes:
- Modified `read_uni_esc()` to accept a flags parameter
- Added replacement logic throughout the string parsing paths
- The replacement flag implicitly enables the Unicode tolerance flags
- Added comprehensive test coverage
- Flags are not supported in incremental parsing mode

This enables robust parsing of real-world JSON with encoding errors while maintaining strict standards compliance by default.

Signed-off-by: Eduardo Silva <edsiper@gmail.com>
Codecov Report

```
@@             Coverage Diff             @@
##           master     #227      +/-   ##
==========================================
- Coverage   98.50%   97.79%   -0.72%
==========================================
  Files           2        2
  Lines        9040     9255     +215
==========================================
+ Hits         8905     9051     +146
- Misses        135      204      +69
```
Thanks for the PR! Replacing invalid chars with U+FFFD makes sense; I already do that on the serialization side. The main issue is during parsing: the unescaping process is performed in-place, so the unescaped character must not be longer than the original escape sequence. Since U+FFFD takes 3 bytes in UTF-8, it cannot safely replace invalid sequences that are only 1–2 bytes. I'd suggest introducing a new string subtype (e.g. `YYJSON_SUBTYPE_UNIERR`) to indicate strings containing invalid Unicode. This would be similar to the existing
Upstream PR: ibireme/yyjson#227 Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <edsiper@gmail.com>
@ibireme thanks for your feedback. I have updated the patch with the suggestions (the majority of the changes were done with Codex).
I thought a bit more about this and I think we should split bad strings into three categories:
The original goal was to turn malformed strings into valid UTF-8 internally. But with the current in-place parsing design, callers still need to check for invalid strings using
So my suggestion is to keep using the existing
What do you think?
Why?