pmarreck/validate

validate

Data silently rots.

If you aren't actively validating, you likely already have corrupt files that are being quietly re-copied to the cloud or your NAS as "good" backups. Family photos, legal documents, old projects, and cherished media are exactly the kind of files that get silently damaged and then preserved in that damaged state.

Drive failures are obvious. Silent sector failures, copy errors, and transmission errors are not. That's why validate exists: deterministic, byte-level validation across a wide range of file formats (100+, see FORMAT_VERIFICATIONS.md).

Why some formats resist corruption detection

Not all formats are equally detectable. Some formats include checksums (PNG, FLAC, ZIP) that make corruption trivially provable — a single flipped bit anywhere in the file will be caught. Others have no integrity mechanism at all (WAV, TIFF, raw images) and can only be validated structurally.
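As a rough illustration of why checksummed formats make single-bit damage provable, the sketch below flips every bit of a buffer in turn and checks whether a stored CRC-32 still matches. It uses Python's standard zlib.crc32 (the same CRC-32 that PNG chunks use); the payload and the helper name flip_bit are made up for the example and are not part of validate.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with exactly one bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


# Illustrative payload; a real container would checksum its own chunk data.
payload = b"family photo pixel data " * 64
stored_crc = zlib.crc32(payload)

# Try every possible single-bit flip and count the ones the CRC misses.
undetected = sum(
    1
    for i in range(len(payload) * 8)
    if zlib.crc32(flip_bit(payload, i)) == stored_crc
)
print(undetected)  # 0: CRC-32 catches every single-bit error

# A size-only "structural" check, by contrast, passes for every flip.
damaged = flip_bit(payload, 1000)
print(len(damaged) == len(payload))  # True
```

A format without any stored checksum has nothing like stored_crc to compare against, which is why only structural checks remain for it.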

The most insidious case is entropy-coded formats like HEIC, JPEG, and H.264 video. These formats use arithmetic or Huffman coding where every possible bit pattern decodes to a valid output. A corrupted HEIC file doesn't crash the decoder — it silently produces a slightly wrong image. There are no invalid bitstream states for the decoder to catch, because the encoding is designed to use the entire code space efficiently. This is the fundamental tradeoff of high-efficiency compression: the same property that makes it compress well (no wasted bit patterns) also makes it corruption-opaque.
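The "no wasted bit patterns" property can be sketched with a toy prefix code. This is only a model, not JPEG's Huffman tables or HEIC's CABAC: the four-symbol code table and the function names are invented for illustration. Because the code is complete, the decoder has no error path, so a flipped bit simply yields a different message.

```python
# Toy complete prefix code: the codeword lengths satisfy Kraft equality
# (1/2 + 1/4 + 1/8 + 1/8 = 1), so no bit pattern is left unused.
CODE = {"0": "A", "10": "B", "110": "C", "111": "D"}
ENCODE = {symbol: bits for bits, symbol in CODE.items()}


def encode(symbols: str) -> str:
    """Concatenate the codeword for each symbol."""
    return "".join(ENCODE[s] for s in symbols)


def decode(bits: str) -> str:
    """Decode greedily; every bit string maps to some symbol sequence."""
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in CODE:
            out.append(CODE[current])
            current = ""
    # A trailing partial codeword is dropped, as if it were padding.
    return "".join(out)


def flip(bits: str, index: int) -> str:
    """Invert the bit at `index` in a string of '0'/'1' characters."""
    flipped = "1" if bits[index] == "0" else "0"
    return bits[:index] + flipped + bits[index + 1:]


original = encode("ABACADAB")
for i in range(len(original)):
    # Never raises: the damaged stream is just a different valid message.
    print(i, decode(flip(original, i)))
```

Every damaged stream decodes without complaint, so the decoder alone cannot tell which of the outputs was the intended one.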

HEIC is arguably the worst case here because it is the default photo format on every iPhone. Billions of photos worldwide are stored in a format where a single bit flip in the CABAC-encoded data is mathematically undetectable without the original file to compare against. Even a full decode — parsing every arithmetic-coded symbol — cannot distinguish corruption from valid data, because corruption simply produces a different valid decode.

validate reports these realities honestly: formats are classified as "fully validated" only when every byte is covered by a checksum, decompression, or decode that would fail on corruption. Formats where corruption can hide in opaque payload data are reported as "structural" validation depth, regardless of how much parsing we perform. See FORMAT_VERIFICATIONS.md for measured detection rates per format.

When the decoder lies — VP8 error concealment

Some decoders go further than just "decode any bit pattern": they actively hide corruption from the caller. libvpx's VP8 decoder is a textbook case. Feed it a frame with mangled coefficients and it returns VPX_CODEC_OK, transparently patching up the damage via built-in error concealment. The caller — your video player, your transcoder, your backup-checker — is told everything is fine.

The damage is detectable, but only if you ask. libvpx exposes a runtime control, VP8D_GET_FRAME_CORRUPTED, that surfaces the internal flag the decoder set when it had to conceal something. Without that explicit query, every concealed frame validates as clean.

Without that query, validate's VP8 sniper detection sat at 0%. With it, the same sample hits 88% sniper / 90% shotgun. The bytes were always damaged; the decoder just declined to mention it.

Your codec has error concealment. Your validator should not.

This is exactly the silent-corruption pattern validate exists to catch — the reader said OK, but the bytes were not OK.

Components

  • Zig library (core validation)
  • C FFI (stable enough for integration, but not yet 1.0)
  • C CLI wrapper: validate

Status

The C FFI mirrors the current Zig validation API for ease of integration. It is expected to evolve before a 1.0 release.

Build

./build

The build script runs ./test first. When DEBUG is unset or 0, dependencies build in ReleaseFast and ./build defaults to -Doptimize=ReleaseFast.

CLI

# Validate files or directories
validate <path> [path ...]
validate ~/Photos/vacation/

# Read paths from stdin (pipe, -, --stdin, or @stdin)
find . -name '*.jpg' | validate --json
validate --ndjson - < paths.txt

# JSON output for scripting
validate --json file.png          # JSON array
validate --ndjson file1 file2     # One JSON object per line

# Platform info
validate --about
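For scripting against the --ndjson mode, a small consumer might look like the sketch below. The README does not document the fields of each JSON object, so the function takes the key to tally as a parameter, and the "path" and "status" fields in the sample records are purely hypothetical placeholders rather than validate's actual schema.

```python
import json
from collections import Counter


def tally_ndjson(lines, key):
    """Count the values of `key` across NDJSON records, skipping blank lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        counts[str(record.get(key, "<missing>"))] += 1
    return counts


# Hypothetical records: these field names are assumptions for the example.
sample = [
    '{"path": "a.png", "status": "ok"}',
    '{"path": "b.jpg", "status": "corrupt"}',
    "",
    '{"path": "c.flac", "status": "ok"}',
]
print(tally_ndjson(sample, "status"))
```

In a pipeline, the lines argument could be sys.stdin, with validate --ndjson piped into the script; inspect one real record first to find the key names that actually appear.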

Options

| Flag | Description |
| --- | --- |
| --json | Output results as a JSON array |
| --ndjson | Output one JSON object per line (newline-delimited) |
| --jobs N | Number of parallel workers (0 = auto, default) |
| -j N | Alias for --jobs |
| --about | Print version and platform info |
| --lang CODE | Set output language (e.g., en, de, ja) |
| --no-color | Disable colored output |
| --color | Force colored output (even when piping) |
| --simple-progress | Use simple ASCII progress instead of TUI |
| --shuffle | Shuffle file order |
| --append | Append to output files instead of overwriting |
| -, --stdin, @stdin | Read file paths from stdin (newline-delimited) |

--jobs 0 (default) uses all available cores (logical CPU count).

The following variables adjust behavior:

  • MAX_FILES limits the number of files scanned when validating a directory.
  • MAX_VIDEO_SIZE limits deep video validation to files under N MB (unset = no limit).
  • MEM_TELEMETRY=1 logs per-file RSS memory samples (use MEM_TELEMETRY_PATH to log to a file, MEM_TELEMETRY_EVERY=N to sample every N files).
  • UNKNOWN_OUT=/path writes UNKNOWN entries to that path instead of stdout (supports /dev/null, /dev/fd/1, /dev/fd/2).
  • ZIP_TELEMETRY=1 logs slow ZIP entry validation details to stderr (adjust threshold with ZIP_SLOW_SECONDS).
  • PDF_TELEMETRY=1 logs slow PDF deep-validation breakdowns to stderr (adjust threshold with PDF_SLOW_SECONDS).
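When driving validate from another program, the flags and variables above can be assembled in one place. The sketch below assumes the uppercase settings are read from the process environment and that validate is on PATH; the helper name build_invocation and its parameters are hypothetical. It only builds the argument list and environment, so it can be inspected without running anything.

```python
import os


def build_invocation(paths, jobs=0, max_video_mb=None, unknown_out=None,
                     base_env=None):
    """Return (argv, env) for a validate run with NDJSON output."""
    # Options go before paths, matching the CLI examples in this README.
    argv = ["validate", "--ndjson", "--jobs", str(jobs), *paths]
    # Start from the caller's environment unless one is supplied explicitly.
    env = dict(os.environ if base_env is None else base_env)
    if max_video_mb is not None:
        env["MAX_VIDEO_SIZE"] = str(max_video_mb)  # megabytes, per the text above
    if unknown_out is not None:
        env["UNKNOWN_OUT"] = unknown_out
    return argv, env


argv, env = build_invocation(
    ["photos/"], jobs=4, max_video_mb=500, unknown_out="/dev/null", base_env={}
)
print(argv)
print(env)
```

The resulting pair can be handed to subprocess.run(argv, env=env) once the binary is installed.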

Tests

./test

Windows Tests

./test-windows

On Linux (x86_64): Uses Wine from the Nix flake devShell (automatically provided).

On macOS: Uses CrossOver. Requires a bottle named windows-dev-test (or set CROSSOVER_BOTTLE).

About

A full binary file format validator for over 200 file types (originally over 100), written in Zig with frontier AI assistance.
