fix(tx): replace locale-dependent ctype calls with ASCII range checks#154
Conversation
mdst_val__reference() used <ctype.h>'s isdigit() / isupper() on a
const char* directly. Two bugs with that:
1. Locale sensitivity. The ctype classifiers consult the process
locale's character table. A node in en_US.UTF-8 may classify a
byte differently than one in the C locale -- a consensus split on
transaction validity.
2. Undefined behavior. The C standard requires the ctype argument be
representable as unsigned char or equal EOF. Plain char is signed
on most platforms, so bytes with the high bit set get sign-extended
to negative int. Any byte >= 0x80 in the reference field is UB.
Fix: two file-local static inline helpers at the top of tx.c that take
an unsigned char and perform a pure ASCII range check:
static inline int ascii_isdigit(unsigned char c) { return c >= '0' && c <= '9'; }
static inline int ascii_isupper(unsigned char c) { return c >= 'A' && c <= 'Z'; }
mdst_val__reference() now reads reference[j] into an unsigned char c
once per iteration and uses the helpers. <ctype.h> include removed --
no remaining users.
Behavior for valid ASCII input is unchanged; non-ASCII bytes that were
previously UB are now deterministically rejected in all locales.
The helpers carry a TODO marker to promote them to extlib.h if other
units later need locale-free ASCII classification.
Closes #104
Live regression test (against earlier review feedback)Re-validated the change with two live mainnet syncs, captured what real reference fields look like in the wild, and confirmed locale-independence. Test setupBuilt with the F-14 fix active plus a temporary trace (not committed) inside Run 1:
|
| Reference (16 bytes hex) | Frequency | ASCII |
|---|---|---|
00 00 00 ... |
288 | empty |
31 39 38 33 38 33 34 38 32 33 00... |
1 | 1983834823 |
33 34 39 34 34 39 38 38 36 37 00... |
1 | 3494498867 |
39 39 33 36 38 38 38 35 34 39 00... |
2 | 9936888549 |
Total: 292 reference validations, 0 rejections.
Run 2: LC_ALL=en_US.UTF-8
(Different blocks applied because of timing; the network keeps advancing.)
| Reference (16 bytes hex) | Frequency | ASCII |
|---|---|---|
00 00 00 ... |
164 | empty |
4c 50 50 41 59 4d 45 4e 54 00... |
50 | LPPAYMENT |
39 33 34 30 36 37 31 33 34 35 00... |
1 | 9340671345 |
Total: 215 reference validations, 0 rejections.
Findings
- Real mainnet references contain only ASCII digits ('0'-'9'), ASCII uppercase ('A'-'Z'), and zero-fill — exactly what the validator's documented format rules permit.
- The UTF-8 run captured a string reference (
LPPAYMENT) that the C run did not see; this is exactly the kind of input where locale-dependentisupper()could differ from a pure ASCII check. Our locale-independent helper (ascii_isupper) accepts it identically in both locales. - 0 rejections in either run — no real mainnet TX gets rejected by the new validator. No regression for legitimate inputs.
- The validator's accept/reject decisions are now provably locale-independent: any byte either passes the
c >= '0' && c <= '9'/c >= 'A' && c <= 'Z'range check (deterministic, no locale lookup) or it doesn't.
Trace removed before this PR was opened
The hex-dump trace was a temporary instrumentation block in mdst_val__reference(). It was never part of the committed diff. The committed code contains only the structural fix:
- Two locale-free
static inlinehelpers at the top oftx.c - The validator loop reading
reference[j]into anunsigned char conce per iteration and using the helpers <ctype.h>include removed
What I did NOT add
A standalone unit test for mdst_val__reference() would require either (a) exposing the static function, (b) compiling tx.c into the test, or (c) constructing a fully valid TXENTRY and calling tx_val() end-to-end. None of those felt proportionate for a fix this trivial — the live sync against thousands of real references provides stronger evidence than a small synthetic unit test would. Happy to add one later if maintainers want it.
Summary
mdst_val__reference()no longer uses<ctype.h>'s locale-sensitiveisdigit()/isupper()static inlinehelpers (ascii_isdigit,ascii_isupper) perform pure ASCII range checks onunsigned char<ctype.h>include removed (no other users in tx.c)Problem
isdigit()/isupper()applied toconst char *reference:en_US.UTF-8may classify a byte differently than one in the C locale — consensus split on transaction validity.charis signed on most platforms. Any byte >= 0x80 is sign-extended to a negative int when implicitly passed toisdigit(), which is UB per the C standard (argument must be representable asunsigned charor equalEOF).Fix
Marked with
TODO: move this to extlib.h when convenientfor future promotion if other units need the same.Reference-validation loop now reads
reference[j]into anunsigned char conce per iteration and uses the helpers. Compiler inlines to identical machine code.Behavior change
No behavioral change for any valid input.
Test plan
make NO_CUDA=1 mochimobuilds cleanly with default-Werrorisdigit/isupperremain insrc/(git grep confirmed)Closes #104