Skip to content

fix(tx): replace locale-dependent ctype calls with ASCII range checks#154

Merged
adequatelimited merged 1 commit intoaudit-fixesfrom
audit/F-14-locale-isdigit-isupper
Apr 13, 2026
Merged

fix(tx): replace locale-dependent ctype calls with ASCII range checks#154
adequatelimited merged 1 commit intoaudit-fixesfrom
audit/F-14-locale-isdigit-isupper

Conversation

@adequatelimited
Copy link
Copy Markdown
Collaborator

Summary

  • mdst_val__reference() no longer uses <ctype.h>'s locale-sensitive isdigit() / isupper()
  • Two file-local static inline helpers (ascii_isdigit, ascii_isupper) perform pure ASCII range checks on unsigned char
  • <ctype.h> include removed (no other users in tx.c)

Problem

isdigit()/isupper() applied to const char *reference:

  1. Locale sensitive. A node in en_US.UTF-8 may classify a byte differently than one in the C locale — consensus split on transaction validity.
  2. Undefined behavior. Plain char is signed on most platforms. Any byte >= 0x80 is sign-extended to a negative int when implicitly passed to isdigit(), which is UB per the C standard (argument must be representable as unsigned char or equal EOF).

Fix

static inline int ascii_isdigit(unsigned char c) { return c >= '0' && c <= '9'; }
static inline int ascii_isupper(unsigned char c) { return c >= 'A' && c <= 'Z'; }

Marked with TODO: move this to extlib.h when convenient for future promotion if other units need the same.

Reference-validation loop now reads reference[j] into an unsigned char c once per iteration and uses the helpers. Compiler inlines to identical machine code.

Behavior change

Input Before After
ASCII digit/upper accepted accepted (identical)
Non-ASCII bytes (>= 0x80) UB in all locales rejected deterministically
Legitimate TX references accepted accepted (identical)

No behavioral change for any valid input.

Test plan

  • make NO_CUDA=1 mochimo builds cleanly with default -Werror
  • Confirm no other callers of isdigit/isupper remain in src/ (git grep confirmed)
  • Reference-field validation accepts the same valid inputs (unit test could be added later, but the audit scope is narrow)

Closes #104

mdst_val__reference() used <ctype.h>'s isdigit() / isupper() on a
const char* directly. Two bugs with that:

1. Locale sensitivity. The ctype classifiers consult the process
   locale's character table. A node in en_US.UTF-8 may classify a
   byte differently than one in the C locale -- a consensus split on
   transaction validity.

2. Undefined behavior. The C standard requires the ctype argument be
   representable as unsigned char or equal EOF. Plain char is signed
   on most platforms, so bytes with the high bit set get sign-extended
   to negative int. Any byte >= 0x80 in the reference field is UB.

Fix: two file-local static inline helpers at the top of tx.c that take
an unsigned char and perform a pure ASCII range check:

   static inline int ascii_isdigit(unsigned char c) { return c >= '0' && c <= '9'; }
   static inline int ascii_isupper(unsigned char c) { return c >= 'A' && c <= 'Z'; }

mdst_val__reference() now reads reference[j] into an unsigned char c
once per iteration and uses the helpers. <ctype.h> include removed --
no remaining users.

Behavior for valid ASCII input is unchanged; non-ASCII bytes that were
previously UB are now deterministically rejected in all locales.

The helpers carry a TODO marker to promote them to extlib.h if other
units later need locale-free ASCII classification.

Closes #104
@adequatelimited
Copy link
Copy Markdown
Collaborator Author

Live regression test (against earlier review feedback)

Re-validated the change with two live mainnet syncs, captured what real reference fields look like in the wild, and confirmed locale-independence.

Test setup

Built with the F-14 fix active plus a temporary trace (not committed) inside mdst_val__reference() that hex-dumps every reference being validated. PoW bypass active for fast sync.

Run 1: LC_ALL=C

Reference (16 bytes hex) Frequency ASCII
00 00 00 ... 288 empty
31 39 38 33 38 33 34 38 32 33 00... 1 1983834823
33 34 39 34 34 39 38 38 36 37 00... 1 3494498867
39 39 33 36 38 38 38 35 34 39 00... 2 9936888549

Total: 292 reference validations, 0 rejections.

Run 2: LC_ALL=en_US.UTF-8

(Different blocks applied because of timing; the network keeps advancing.)

Reference (16 bytes hex) Frequency ASCII
00 00 00 ... 164 empty
4c 50 50 41 59 4d 45 4e 54 00... 50 LPPAYMENT
39 33 34 30 36 37 31 33 34 35 00... 1 9340671345

Total: 215 reference validations, 0 rejections.

Findings

  • Real mainnet references contain only ASCII digits ('0'-'9'), ASCII uppercase ('A'-'Z'), and zero-fill — exactly what the validator's documented format rules permit.
  • The UTF-8 run captured a string reference (LPPAYMENT) that the C run did not see; this is exactly the kind of input where locale-dependent isupper() could differ from a pure ASCII check. Our locale-independent helper (ascii_isupper) accepts it identically in both locales.
  • 0 rejections in either run — no real mainnet TX gets rejected by the new validator. No regression for legitimate inputs.
  • The validator's accept/reject decisions are now provably locale-independent: any byte either passes the c >= '0' && c <= '9' / c >= 'A' && c <= 'Z' range check (deterministic, no locale lookup) or it doesn't.

Trace removed before this PR was opened

The hex-dump trace was a temporary instrumentation block in mdst_val__reference(). It was never part of the committed diff. The committed code contains only the structural fix:

  • Two locale-free static inline helpers at the top of tx.c
  • The validator loop reading reference[j] into an unsigned char c once per iteration and using the helpers
  • <ctype.h> include removed

What I did NOT add

A standalone unit test for mdst_val__reference() would require either (a) exposing the static function, (b) compiling tx.c into the test, or (c) constructing a fully valid TXENTRY and calling tx_val() end-to-end. None of those felt proportionate for a fix this trivial — the live sync against thousands of real references provides stronger evidence than a small synthetic unit test would. Happy to add one later if maintainers want it.

@adequatelimited adequatelimited merged commit 50a2c0a into audit-fixes Apr 13, 2026
4 of 5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant