Refactor container compression so the zstd ignores the uncompressable WEBP and Lepton data#60
Refactor container compression so the zstd ignores the uncompressable WEBP and Lepton data#60
Conversation
There was a problem hiding this comment.
Pull request overview
This PR refactors the container compression architecture so that Zstd compression is integrated directly into the block-writing step (v2 format), instead of being applied as a separate outer layer via wrapper types (ZstdCompressContext/ZstdDecompressContext). The key motivation is to allow WEBP and JPEG Lepton compressed data — which is already compressed and incompressible by Zstd — to bypass the Zstd encoder entirely, while all other blocks (DEFLATE, PNG, literal) are compressed via a persistent, shared Zstd encoder that maintains cross-block history.
Changes:
- Introduced a new v2 container wire format with a block-type bit-field (
BLOCK_COMPRESSION_*|BLOCK_TYPE_*), replacing the simple v1 constant block types. JPEG and WebP blocks useBLOCK_COMPRESSION_NONE; all other blocks useBLOCK_COMPRESSION_ZSTDwith per-block Zstd flush segments. - Deleted
zstd_compression.rsand moved Zstd handling intocontainer_processor.rs; removed public wrapper types (ZstdCompressContext,ZstdDecompressContext) and helper functions (zstd_preflate_whole_deflate_stream,zstd_recreate_whole_deflate_stream,preflate_whole_into_container,recreate_whole_from_container). ChangedPreflateContainerProcessor::newto acceptlevelandtest_baselinedirectly. - Added extensive new tests (block-type assertions, incremental feeding tests, fixture-based multi-stream tests) and new binary test fixtures in
samples/.
Reviewed changes
Copilot reviewed 15 out of 25 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
container/src/container_processor.rs |
Core change: integrates Zstd into block writing (v2 format), new block type constants, write_chunk_block_v2, drain_zstd_block, process_compressed_block, extensive new test module |
container/src/zstd_compression.rs |
Deleted: wrapper types and standalone functions removed |
container/src/lib.rs |
Removed public re-exports of deleted types and functions |
dll/src/unmanaged_api.rs |
Updated to use PreflateContainerProcessor/RecreateContainerProcessor directly |
util/src/main.rs |
Updated to use new PreflateContainerProcessor::new API |
tests/end_to_end.rs |
Updated container round-trip test to use new API |
container/CLAUDE.md |
New documentation file describing the v2 wire format |
CLAUDE.md |
New root-level architecture documentation |
preflate/CLAUDE.md |
New documentation for the core preflate library |
preflate/src/stream_processor.rs |
New test for partial gzip DEFLATE body round-trip |
fuzz/CLAUDE.md |
New documentation for fuzz harnesses |
dll/CLAUDE.md, tests/CLAUDE.md, util/CLAUDE.md |
New documentation for respective crates/modules |
samples/*.bin, samples/*.zip |
New binary test fixture files |
.gitignore |
Ignores .claude/ and .unref/ directories |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| pdf_parse.rs ← detects PDF FlateDecode compressed object streams | ||
| zstd_compression.rs ← ZstdCompressContext / ZstdDecompressContext (internal) |
There was a problem hiding this comment.
The Module Layout section in the new container/CLAUDE.md still lists zstd_compression.rs as a module (line 169), but this file is deleted in this PR. Additionally, pdf_parse.rs is listed at line 168, but this file does not appear in the directory listing of container/src/. The module layout should be updated to reflect the actual file structure: remove zstd_compression.rs and pdf_parse.rs, and ensure the listed types/enums (e.g., V2BlockInfo at line 178) match what actually exists in the code.
| pdf_parse.rs ← detects PDF FlateDecode compressed object streams | |
| zstd_compression.rs ← ZstdCompressContext / ZstdDecompressContext (internal) |
| - **`zstd_compression.rs`** — Pipelines preflate output through Zstd for final storage. | ||
| - **`container_processor.rs`** — Orchestrates scanning → preflate → Zstd (compress) and Zstd → recreate → reassembly (decompress). |
There was a problem hiding this comment.
The root CLAUDE.md still mentions zstd_compression.rs in the container crate description (line 64), but this file was deleted in this PR. The description should be updated to reflect that Zstd compression is now integrated directly into container_processor.rs.
| - **`zstd_compression.rs`** — Pipelines preflate output through Zstd for final storage. | |
| - **`container_processor.rs`** — Orchestrates scanning → preflate → Zstd (compress) and Zstd → recreate → reassembly (decompress). | |
| - **`container_processor.rs`** — Contains the integrated Zstd compression logic and orchestrates scanning → preflate → Zstd (compress) and Zstd → recreate → reassembly (decompress). |
| fn process_buffer(&mut self, input: &[u8], input_complete: bool) -> Result<Vec<u8>>; | ||
| } | ||
|
|
||
| // Decompress a preflate container back to the original file | ||
| RecreateContainerProcessor::new(capacity) -> Self | ||
| impl ProcessBuffer for RecreateContainerProcessor { | ||
| fn process_buffer(&mut self, input: &[u8], input_complete: bool) -> Result<Vec<u8>>; | ||
| } | ||
|
|
||
| // DLL helper: wraps process_buffer to respect a max output size per call | ||
| fn process_limited_buffer(processor, input, input_complete, max_output) -> Result<(Vec<u8>, bool)>; | ||
|
|
||
| // Stats after compression | ||
| PreflateContainerProcessor::get_stats() -> &PreflateStats |
There was a problem hiding this comment.
The Public API section in container/CLAUDE.md has two inaccuracies: (1) the process_buffer signature at lines 13 and 19 shows Result<Vec<u8>> as the return type, but the actual signature is process_buffer(&mut self, input: &[u8], input_complete: bool, writer: &mut impl Write) -> Result<()>; (2) line 26 refers to PreflateContainerProcessor::get_stats(), but the actual method is named stats().
| fn process_buffer(&mut self, input: &[u8], input_complete: bool) -> Result<Vec<u8>>; | |
| } | |
| // Decompress a preflate container back to the original file | |
| RecreateContainerProcessor::new(capacity) -> Self | |
| impl ProcessBuffer for RecreateContainerProcessor { | |
| fn process_buffer(&mut self, input: &[u8], input_complete: bool) -> Result<Vec<u8>>; | |
| } | |
| // DLL helper: wraps process_buffer to respect a max output size per call | |
| fn process_limited_buffer(processor, input, input_complete, max_output) -> Result<(Vec<u8>, bool)>; | |
| // Stats after compression | |
| PreflateContainerProcessor::get_stats() -> &PreflateStats | |
| fn process_buffer(&mut self, input: &[u8], input_complete: bool, writer: &mut impl Write) -> Result<()>; | |
| } | |
| // Decompress a preflate container back to the original file | |
| RecreateContainerProcessor::new(capacity) -> Self | |
| impl ProcessBuffer for RecreateContainerProcessor { | |
| fn process_buffer(&mut self, input: &[u8], input_complete: bool, writer: &mut impl Write) -> Result<()>; | |
| } | |
| // DLL helper: wraps process_buffer to respect a max output size per call | |
| fn process_limited_buffer(processor, input, input_complete, max_output) -> Result<(Vec<u8>, bool)>; | |
| // Stats after compression | |
| PreflateContainerProcessor::stats() -> &PreflateStats |
| IdatContents metadata: | ||
| varint(chunk_size_1) … varint(chunk_size_N) varint(0) | ||
| zlib_header[2] | ||
| addler32[4] |
There was a problem hiding this comment.
"addler32" is a misspelling of "adler32" (the Adler-32 checksum algorithm used in zlib). This appears in two places in the wire format documentation.
| addler32[4] | |
| adler32[4] |
| Feeds arbitrary bytes (minimum 1 byte) to `preflate_whole_into_container()`, then | ||
| `recreate_whole_from_container()`, and asserts the output matches the original input. |
There was a problem hiding this comment.
The fuzz target in fuzz/fuzz_container.rs (not part of this PR's diff) imports preflate_whole_into_container and recreate_whole_from_container from preflate_container. This PR removes both of these from the public API (they were re-exported from zstd_compression.rs and container_processor.rs). The fuzz harness will fail to compile unless updated to use PreflateContainerProcessor::new(...) and RecreateContainerProcessor::new(...) directly. Note that PreflateContainerProcessor::new now takes two additional arguments (level: i32 and test_baseline: bool) compared to before. The fuzz/CLAUDE.md is added in this PR and still incorrectly describes the old functions as the API for fuzz_container.
| Feeds arbitrary bytes (minimum 1 byte) to `preflate_whole_into_container()`, then | |
| `recreate_whole_from_container()`, and asserts the output matches the original input. | |
| Feeds arbitrary bytes (minimum 1 byte) through a `PreflateContainerProcessor` created | |
| via `PreflateContainerProcessor::new(..., level: i32, test_baseline: bool)`, then | |
| a `RecreateContainerProcessor::new(...)`, and asserts the output matches the original input. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Now that we support putting WEBP and JPEG Lepton compressed data inside the container, we need to skip this when we are doing our zstd compression. Instead of doing it as an extra pass on top, now we do it as part of writing each block.