Refactor container compression so the zstd ignores the uncompressable WEBP and Lepton data by mcroomp · Pull Request #60 · microsoft/preflate-rs

mcroomp · 2026-02-25T15:38:24Z

Now that we support putting WEBP and JPEG Lepton compressed data inside the container, we need to skip this when we are doing our zstd compression. Instead of doing it as an extra pass on top, now we do it as part of writing each block.

Copilot

Pull request overview

This PR refactors the container compression architecture so that Zstd compression is integrated directly into the block-writing step (v2 format), instead of being applied as a separate outer layer via wrapper types (ZstdCompressContext/ZstdDecompressContext). The key motivation is to allow WEBP and JPEG Lepton compressed data — which is already compressed and incompressible by Zstd — to bypass the Zstd encoder entirely, while all other blocks (DEFLATE, PNG, literal) are compressed via a persistent, shared Zstd encoder that maintains cross-block history.

Changes:

Introduced a new v2 container wire format with a block-type bit-field (BLOCK_COMPRESSION_* | BLOCK_TYPE_*), replacing the simple v1 constant block types. JPEG and WebP blocks use BLOCK_COMPRESSION_NONE; all other blocks use BLOCK_COMPRESSION_ZSTD with per-block Zstd flush segments.
Deleted zstd_compression.rs and moved Zstd handling into container_processor.rs; removed public wrapper types (ZstdCompressContext, ZstdDecompressContext) and helper functions (zstd_preflate_whole_deflate_stream, zstd_recreate_whole_deflate_stream, preflate_whole_into_container, recreate_whole_from_container). Changed PreflateContainerProcessor::new to accept level and test_baseline directly.
Added extensive new tests (block-type assertions, incremental feeding tests, fixture-based multi-stream tests) and new binary test fixtures in samples/.

Reviewed changes

Copilot reviewed 15 out of 25 changed files in this pull request and generated 6 comments.

Show a summary per file

File	Description
`container/src/container_processor.rs`	Core change: integrates Zstd into block writing (v2 format), new block type constants, `write_chunk_block_v2`, `drain_zstd_block`, `process_compressed_block`, extensive new test module
`container/src/zstd_compression.rs`	Deleted: wrapper types and standalone functions removed
`container/src/lib.rs`	Removed public re-exports of deleted types and functions
`dll/src/unmanaged_api.rs`	Updated to use `PreflateContainerProcessor`/`RecreateContainerProcessor` directly
`util/src/main.rs`	Updated to use new `PreflateContainerProcessor::new` API
`tests/end_to_end.rs`	Updated container round-trip test to use new API
`container/CLAUDE.md`	New documentation file describing the v2 wire format
`CLAUDE.md`	New root-level architecture documentation
`preflate/CLAUDE.md`	New documentation for the core preflate library
`preflate/src/stream_processor.rs`	New test for partial gzip DEFLATE body round-trip
`fuzz/CLAUDE.md`	New documentation for fuzz harnesses
`dll/CLAUDE.md`, `tests/CLAUDE.md`, `util/CLAUDE.md`	New documentation for respective crates/modules
`samples/.bin`, `samples/.zip`	New binary test fixture files
`.gitignore`	Ignores `.claude/` and `.unref/` directories

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-02T19:07:56Z

+  pdf_parse.rs            ← detects PDF FlateDecode compressed object streams
+  zstd_compression.rs     ← ZstdCompressContext / ZstdDecompressContext (internal)


The Module Layout section in the new container/CLAUDE.md still lists zstd_compression.rs as a module (line 169), but this file is deleted in this PR. Additionally, pdf_parse.rs is listed at line 168, but this file does not appear in the directory listing of container/src/. The module layout should be updated to reflect the actual file structure: remove zstd_compression.rs and pdf_parse.rs, and ensure the listed types/enums (e.g., V2BlockInfo at line 178) match what actually exists in the code.

Suggested change

pdf_parse.rs ← detects PDF FlateDecode compressed object streams

zstd_compression.rs ← ZstdCompressContext / ZstdDecompressContext (internal)

Copilot · 2026-03-02T19:07:56Z

+- **`zstd_compression.rs`** — Pipelines preflate output through Zstd for final storage.
+- **`container_processor.rs`** — Orchestrates scanning → preflate → Zstd (compress) and Zstd → recreate → reassembly (decompress).


The root CLAUDE.md still mentions zstd_compression.rs in the container crate description (line 64), but this file was deleted in this PR. The description should be updated to reflect that Zstd compression is now integrated directly into container_processor.rs.

Suggested change

- **`zstd_compression.rs`** — Pipelines preflate output through Zstd for final storage.

- **`container_processor.rs`** — Orchestrates scanning → preflate → Zstd (compress) and Zstd → recreate → reassembly (decompress).

- **`container_processor.rs`** — Contains the integrated Zstd compression logic and orchestrates scanning → preflate → Zstd (compress) and Zstd → recreate → reassembly (decompress).

Copilot · 2026-03-02T19:07:57Z

+    fn process_buffer(&mut self, input: &[u8], input_complete: bool) -> Result<Vec<u8>>;
+}
+
+// Decompress a preflate container back to the original file
+RecreateContainerProcessor::new(capacity) -> Self
+impl ProcessBuffer for RecreateContainerProcessor {
+    fn process_buffer(&mut self, input: &[u8], input_complete: bool) -> Result<Vec<u8>>;
+}
+
+// DLL helper: wraps process_buffer to respect a max output size per call
+fn process_limited_buffer(processor, input, input_complete, max_output) -> Result<(Vec<u8>, bool)>;
+
+// Stats after compression
+PreflateContainerProcessor::get_stats() -> &PreflateStats


The Public API section in container/CLAUDE.md has two inaccuracies: (1) the process_buffer signature at lines 13 and 19 shows Result<Vec<u8>> as the return type, but the actual signature is process_buffer(&mut self, input: &[u8], input_complete: bool, writer: &mut impl Write) -> Result<()>; (2) line 26 refers to PreflateContainerProcessor::get_stats(), but the actual method is named stats().

Suggested change

fn process_buffer(&mut self, input: &[u8], input_complete: bool) -> Result<Vec<u8>>;

}

// Decompress a preflate container back to the original file

RecreateContainerProcessor::new(capacity) -> Self

impl ProcessBuffer for RecreateContainerProcessor {

fn process_buffer(&mut self, input: &[u8], input_complete: bool) -> Result<Vec<u8>>;

}

// DLL helper: wraps process_buffer to respect a max output size per call

fn process_limited_buffer(processor, input, input_complete, max_output) -> Result<(Vec<u8>, bool)>;

// Stats after compression

PreflateContainerProcessor::get_stats() -> &PreflateStats

fn process_buffer(&mut self, input: &[u8], input_complete: bool, writer: &mut impl Write) -> Result<()>;

}

// Decompress a preflate container back to the original file

RecreateContainerProcessor::new(capacity) -> Self

impl ProcessBuffer for RecreateContainerProcessor {

fn process_buffer(&mut self, input: &[u8], input_complete: bool, writer: &mut impl Write) -> Result<()>;

}

// DLL helper: wraps process_buffer to respect a max output size per call

fn process_limited_buffer(processor, input, input_complete, max_output) -> Result<(Vec<u8>, bool)>;

// Stats after compression

PreflateContainerProcessor::stats() -> &PreflateStats

Copilot · 2026-03-02T19:07:57Z

+IdatContents metadata:
+  varint(chunk_size_1) … varint(chunk_size_N) varint(0)
+  zlib_header[2]
+  addler32[4]


"addler32" is a misspelling of "adler32" (the Adler-32 checksum algorithm used in zlib). This appears in two places in the wire format documentation.

Suggested change

addler32[4]

adler32[4]

Copilot · 2026-03-02T19:07:57Z

+Feeds arbitrary bytes (minimum 1 byte) to `preflate_whole_into_container()`, then
+`recreate_whole_from_container()`, and asserts the output matches the original input.


The fuzz target in fuzz/fuzz_container.rs (not part of this PR's diff) imports preflate_whole_into_container and recreate_whole_from_container from preflate_container. This PR removes both of these from the public API (they were re-exported from zstd_compression.rs and container_processor.rs). The fuzz harness will fail to compile unless updated to use PreflateContainerProcessor::new(...) and RecreateContainerProcessor::new(...) directly. Note that PreflateContainerProcessor::new now takes two additional arguments (level: i32 and test_baseline: bool) compared to before. The fuzz/CLAUDE.md is added in this PR and still incorrectly describes the old functions as the API for fuzz_container.

Suggested change

Feeds arbitrary bytes (minimum 1 byte) to `preflate_whole_into_container()`, then

`recreate_whole_from_container()`, and asserts the output matches the original input.

Feeds arbitrary bytes (minimum 1 byte) through a `PreflateContainerProcessor` created

via `PreflateContainerProcessor::new(..., level: i32, test_baseline: bool)`, then

a `RecreateContainerProcessor::new(...)`, and asserts the output matches the original input.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

mcroomp added 3 commits February 24, 2026 23:32

v1 of refactor

9db650c

work and tests

82c7b15

update with claude help

f2394a6

mcroomp changed the title ~~Compressionrefactor~~ Refactor container compression so the zstd ignores the uncompressable WEBP and Lepton data Feb 25, 2026

mcroomp added 2 commits February 25, 2026 16:39

remove pdf parse

e2332ad

fix formatting

18701bc

mcroomp requested review from MarioszMSFT and Copilot March 2, 2026 19:01

Copilot started reviewing on behalf of mcroomp March 2, 2026 19:01 View session

Copilot AI reviewed Mar 2, 2026

View reviewed changes

mcroomp and others added 3 commits March 2, 2026 20:58

update claude.md

7800a3f

refactored out container reader / writer

cbdd8b6

Apply suggestion from @Copilot

0804476

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

MarioszMSFT approved these changes Mar 4, 2026

View reviewed changes

mcroomp merged commit a7932a6 into main Mar 4, 2026
2 checks passed

mcroomp deleted the compressionrefactor branch March 4, 2026 17:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor container compression so the zstd ignores the uncompressable WEBP and Lepton data#60

Refactor container compression so the zstd ignores the uncompressable WEBP and Lepton data#60
mcroomp merged 8 commits intomainfrom
compressionrefactor

mcroomp commented Feb 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 2, 2026

Uh oh!

Copilot AI Mar 2, 2026

Uh oh!

Copilot AI Mar 2, 2026

Uh oh!

Uh oh!

Copilot AI Mar 2, 2026

Uh oh!

Copilot AI Mar 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		pdf_parse.rs ← detects PDF FlateDecode compressed object streams
		zstd_compression.rs ← ZstdCompressContext / ZstdDecompressContext (internal)

		- `zstd_compression.rs` — Pipelines preflate output through Zstd for final storage.
		- `container_processor.rs` — Orchestrates scanning → preflate → Zstd (compress) and Zstd → recreate → reassembly (decompress).

	- `zstd_compression.rs` — Pipelines preflate output through Zstd for final storage.
	- `container_processor.rs` — Orchestrates scanning → preflate → Zstd (compress) and Zstd → recreate → reassembly (decompress).
	- `container_processor.rs` — Contains the integrated Zstd compression logic and orchestrates scanning → preflate → Zstd (compress) and Zstd → recreate → reassembly (decompress).

		Feeds arbitrary bytes (minimum 1 byte) to `preflate_whole_into_container()`, then
		`recreate_whole_from_container()`, and asserts the output matches the original input.

-Feeds arbitrary bytes (minimum 1 byte) to `preflate_whole_into_container()`, then
-`recreate_whole_from_container()`, and asserts the output matches the original input.
+Feeds arbitrary bytes (minimum 1 byte) through a `PreflateContainerProcessor` created
+via `PreflateContainerProcessor::new(..., level: i32, test_baseline: bool)`, then
+a `RecreateContainerProcessor::new(...)`, and asserts the output matches the original input.

Conversation

mcroomp commented Feb 25, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 2, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants