Skip to content

Refactor container compression so the zstd ignores the uncompressable WEBP and Lepton data#60

Merged
mcroomp merged 8 commits intomainfrom
compressionrefactor
Mar 4, 2026
Merged

Refactor container compression so the zstd ignores the uncompressable WEBP and Lepton data#60
mcroomp merged 8 commits intomainfrom
compressionrefactor

Conversation

@mcroomp
Copy link
Copy Markdown
Collaborator

@mcroomp mcroomp commented Feb 25, 2026

Now that we support putting WEBP and JPEG Lepton compressed data inside the container, we need to skip this when we are doing our zstd compression. Instead of doing it as an extra pass on top, now we do it as part of writing each block.

@mcroomp mcroomp changed the title Compressionrefactor Refactor container compression so the zstd ignores the uncompressable WEBP and Lepton data Feb 25, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the container compression architecture so that Zstd compression is integrated directly into the block-writing step (v2 format), instead of being applied as a separate outer layer via wrapper types (ZstdCompressContext/ZstdDecompressContext). The key motivation is to allow WEBP and JPEG Lepton compressed data — which is already compressed and incompressible by Zstd — to bypass the Zstd encoder entirely, while all other blocks (DEFLATE, PNG, literal) are compressed via a persistent, shared Zstd encoder that maintains cross-block history.

Changes:

  • Introduced a new v2 container wire format with a block-type bit-field (BLOCK_COMPRESSION_* | BLOCK_TYPE_*), replacing the simple v1 constant block types. JPEG and WebP blocks use BLOCK_COMPRESSION_NONE; all other blocks use BLOCK_COMPRESSION_ZSTD with per-block Zstd flush segments.
  • Deleted zstd_compression.rs and moved Zstd handling into container_processor.rs; removed public wrapper types (ZstdCompressContext, ZstdDecompressContext) and helper functions (zstd_preflate_whole_deflate_stream, zstd_recreate_whole_deflate_stream, preflate_whole_into_container, recreate_whole_from_container). Changed PreflateContainerProcessor::new to accept level and test_baseline directly.
  • Added extensive new tests (block-type assertions, incremental feeding tests, fixture-based multi-stream tests) and new binary test fixtures in samples/.

Reviewed changes

Copilot reviewed 15 out of 25 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
container/src/container_processor.rs Core change: integrates Zstd into block writing (v2 format), new block type constants, write_chunk_block_v2, drain_zstd_block, process_compressed_block, extensive new test module
container/src/zstd_compression.rs Deleted: wrapper types and standalone functions removed
container/src/lib.rs Removed public re-exports of deleted types and functions
dll/src/unmanaged_api.rs Updated to use PreflateContainerProcessor/RecreateContainerProcessor directly
util/src/main.rs Updated to use new PreflateContainerProcessor::new API
tests/end_to_end.rs Updated container round-trip test to use new API
container/CLAUDE.md New documentation file describing the v2 wire format
CLAUDE.md New root-level architecture documentation
preflate/CLAUDE.md New documentation for the core preflate library
preflate/src/stream_processor.rs New test for partial gzip DEFLATE body round-trip
fuzz/CLAUDE.md New documentation for fuzz harnesses
dll/CLAUDE.md, tests/CLAUDE.md, util/CLAUDE.md New documentation for respective crates/modules
samples/*.bin, samples/*.zip New binary test fixture files
.gitignore Ignores .claude/ and .unref/ directories

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread container/CLAUDE.md Outdated
Comment on lines +168 to +169
pdf_parse.rs ← detects PDF FlateDecode compressed object streams
zstd_compression.rs ← ZstdCompressContext / ZstdDecompressContext (internal)
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Module Layout section in the new container/CLAUDE.md still lists zstd_compression.rs as a module (line 169), but this file is deleted in this PR. Additionally, pdf_parse.rs is listed at line 168, but this file does not appear in the directory listing of container/src/. The module layout should be updated to reflect the actual file structure: remove zstd_compression.rs and pdf_parse.rs, and ensure the listed types/enums (e.g., V2BlockInfo at line 178) match what actually exists in the code.

Suggested change
pdf_parse.rs ← detects PDF FlateDecode compressed object streams
zstd_compression.rs ← ZstdCompressContext / ZstdDecompressContext (internal)

Copilot uses AI. Check for mistakes.
Comment thread CLAUDE.md Outdated
Comment on lines +64 to +65
- **`zstd_compression.rs`** — Pipelines preflate output through Zstd for final storage.
- **`container_processor.rs`** — Orchestrates scanning → preflate → Zstd (compress) and Zstd → recreate → reassembly (decompress).
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The root CLAUDE.md still mentions zstd_compression.rs in the container crate description (line 64), but this file was deleted in this PR. The description should be updated to reflect that Zstd compression is now integrated directly into container_processor.rs.

Suggested change
- **`zstd_compression.rs`** — Pipelines preflate output through Zstd for final storage.
- **`container_processor.rs`** — Orchestrates scanning → preflate → Zstd (compress) and Zstd → recreate → reassembly (decompress).
- **`container_processor.rs`** — Contains the integrated Zstd compression logic and orchestrates scanning → preflate → Zstd (compress) and Zstd → recreate → reassembly (decompress).

Copilot uses AI. Check for mistakes.
Comment thread container/CLAUDE.md Outdated
Comment on lines +13 to +26
fn process_buffer(&mut self, input: &[u8], input_complete: bool) -> Result<Vec<u8>>;
}

// Decompress a preflate container back to the original file
RecreateContainerProcessor::new(capacity) -> Self
impl ProcessBuffer for RecreateContainerProcessor {
fn process_buffer(&mut self, input: &[u8], input_complete: bool) -> Result<Vec<u8>>;
}

// DLL helper: wraps process_buffer to respect a max output size per call
fn process_limited_buffer(processor, input, input_complete, max_output) -> Result<(Vec<u8>, bool)>;

// Stats after compression
PreflateContainerProcessor::get_stats() -> &PreflateStats
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Public API section in container/CLAUDE.md has two inaccuracies: (1) the process_buffer signature at lines 13 and 19 shows Result<Vec<u8>> as the return type, but the actual signature is process_buffer(&mut self, input: &[u8], input_complete: bool, writer: &mut impl Write) -> Result<()>; (2) line 26 refers to PreflateContainerProcessor::get_stats(), but the actual method is named stats().

Suggested change
fn process_buffer(&mut self, input: &[u8], input_complete: bool) -> Result<Vec<u8>>;
}
// Decompress a preflate container back to the original file
RecreateContainerProcessor::new(capacity) -> Self
impl ProcessBuffer for RecreateContainerProcessor {
fn process_buffer(&mut self, input: &[u8], input_complete: bool) -> Result<Vec<u8>>;
}
// DLL helper: wraps process_buffer to respect a max output size per call
fn process_limited_buffer(processor, input, input_complete, max_output) -> Result<(Vec<u8>, bool)>;
// Stats after compression
PreflateContainerProcessor::get_stats() -> &PreflateStats
fn process_buffer(&mut self, input: &[u8], input_complete: bool, writer: &mut impl Write) -> Result<()>;
}
// Decompress a preflate container back to the original file
RecreateContainerProcessor::new(capacity) -> Self
impl ProcessBuffer for RecreateContainerProcessor {
fn process_buffer(&mut self, input: &[u8], input_complete: bool, writer: &mut impl Write) -> Result<()>;
}
// DLL helper: wraps process_buffer to respect a max output size per call
fn process_limited_buffer(processor, input, input_complete, max_output) -> Result<(Vec<u8>, bool)>;
// Stats after compression
PreflateContainerProcessor::stats() -> &PreflateStats

Copilot uses AI. Check for mistakes.
Comment thread preflate/CLAUDE.md Outdated
Comment thread container/CLAUDE.md
IdatContents metadata:
varint(chunk_size_1) … varint(chunk_size_N) varint(0)
zlib_header[2]
addler32[4]
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"addler32" is a misspelling of "adler32" (the Adler-32 checksum algorithm used in zlib). This appears in two places in the wire format documentation.

Suggested change
addler32[4]
adler32[4]

Copilot uses AI. Check for mistakes.
Comment thread fuzz/CLAUDE.md
Comment on lines +15 to +16
Feeds arbitrary bytes (minimum 1 byte) to `preflate_whole_into_container()`, then
`recreate_whole_from_container()`, and asserts the output matches the original input.
Copy link

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fuzz target in fuzz/fuzz_container.rs (not part of this PR's diff) imports preflate_whole_into_container and recreate_whole_from_container from preflate_container. This PR removes both of these from the public API (they were re-exported from zstd_compression.rs and container_processor.rs). The fuzz harness will fail to compile unless updated to use PreflateContainerProcessor::new(...) and RecreateContainerProcessor::new(...) directly. Note that PreflateContainerProcessor::new now takes two additional arguments (level: i32 and test_baseline: bool) compared to before. The fuzz/CLAUDE.md is added in this PR and still incorrectly describes the old functions as the API for fuzz_container.

Suggested change
Feeds arbitrary bytes (minimum 1 byte) to `preflate_whole_into_container()`, then
`recreate_whole_from_container()`, and asserts the output matches the original input.
Feeds arbitrary bytes (minimum 1 byte) through a `PreflateContainerProcessor` created
via `PreflateContainerProcessor::new(..., level: i32, test_baseline: bool)`, then
a `RecreateContainerProcessor::new(...)`, and asserts the output matches the original input.

Copilot uses AI. Check for mistakes.
mcroomp and others added 3 commits March 2, 2026 20:58
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@mcroomp mcroomp merged commit a7932a6 into main Mar 4, 2026
2 checks passed
@mcroomp mcroomp deleted the compressionrefactor branch March 4, 2026 17:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants