Add Unicode replacement and surrogate tolerance flags #227
edsiper wants to merge 3 commits into ibireme:master
Conversation
Add `YYJSON_READ_REPLACE_INVALID_UNICODE` and `YYJSON_READ_ALLOW_INVALID_SURROGATE` flags to enable graceful handling of malformed Unicode in JSON strings.

`YYJSON_READ_REPLACE_INVALID_UNICODE` replaces all invalid Unicode sequences with the U+FFFD replacement character, ensuring the output is always valid UTF-8. It handles invalid `\uXXXX` escapes, unpaired surrogates, invalid UTF-8 bytes, and control characters.

`YYJSON_READ_ALLOW_INVALID_SURROGATE` permits unpaired surrogate code units in `\uXXXX` escapes, encoding them as UTF-8.

Key changes:
- Modified `read_uni_esc()` to accept a flags parameter
- Added replacement logic throughout the string parsing paths
- The replacement flag implicitly enables the Unicode tolerance flags
- Added comprehensive test coverage
- Flags are not supported in incremental parsing mode

This enables robust parsing of real-world JSON with encoding errors while maintaining strict standards compliance by default.

Signed-off-by: Eduardo Silva <edsiper@gmail.com>
Codecov Report

```
@@             Coverage Diff             @@
##           master     #227      +/-   ##
==========================================
- Coverage   98.50%   97.79%   -0.72%
==========================================
  Files           2        2
  Lines        9040     9255     +215
==========================================
+ Hits         8905     9051     +146
- Misses        135      204      +69
```
Thanks for the PR! Replacing invalid chars with U+FFFD makes sense; I already do that on the serialization side. The main issue is during parsing: the unescaping process is performed in-place, so the unescaped character must not be longer than the original escape sequence. Since U+FFFD takes 3 bytes in UTF-8, it cannot safely replace invalid sequences that are only 1–2 bytes. I'd suggest introducing a new string subtype (e.g. `YYJSON_SUBTYPE_UNIERR`) to indicate strings containing invalid Unicode. This would be similar to the existing
Upstream PR: ibireme/yyjson#227 Signed-off-by: Eduardo Silva <eduardo@chronosphere.io>
Signed-off-by: Eduardo Silva <edsiper@gmail.com>
@ibireme thanks for your feedback. I have updated the patch with the suggestions (the majority of the changes were done with Codex).
I thought a bit more about this and I think we should split bad strings into three categories:
The original goal was to turn malformed strings into valid UTF-8 internally. But with the current in-place parsing design, callers still need to check for invalid strings using
So my suggestion is to keep using the existing
What do you think?
Why?