[Feature Request] Support ONNX Q/DQ Autotuning with Subgraph Mode #1015
Hale423 wants to merge 10 commits into NVIDIA:main
Conversation
Signed-off-by: Will Guo <willg@nvidia.com>
- Add export_profile_path support; append --exportProfile/--profilingVerbosity when requested
- Skip adding --separateProfileRun if already present in user trtexec args
- On trtexec 'Unknown option' error, strip profiling flags and retry once without them
- Set _profile_unsupported so later runs use total-latency comparison only
- Extract _exec_and_log for shared run-and-log logic

Made-with: Cursor
cjluo-nv
left a comment
This PR introduces 16k+ lines of changes. Please consider sharing a design and get design review.
Thanks for the feedback. Sharing this design, please kindly take a look.

Design: ONNX Q/DQ Autotuning for TensorRT

Design review document for PR #1015

1. Background

TensorRT performance for quantized ONNX models depends not only on whether Q/DQ nodes exist, but also on where they are inserted. In practice:
This branch introduces an ONNX Q/DQ autotuning system that searches for better Q/DQ placement using actual TensorRT latency measurements. The design intentionally supports two workflows:
2. Goals
3. Non-goals
4. Scope Relative to
Resolve add/add conflicts across 13 files by taking main's refactored codebase (autotuner_base.py split, benchmark.py extraction, op_types reorganization) and re-applying subgraph-mode additions:

- __init__.py: add subgraph_autotuning_workflow export
- __main__.py: add --workflow subgraph/region, --graph_json, --incremental_validation CLI arguments
- workflows.py: add strip_shape_args, extra_run_args, export_profile_path parameters to benchmark_onnx_model
- benchmark.py: add profiling support (export_profile_path, strip_shape_args) with trtexec fallback on unsupported flags
- subgraph_workflow.py: migrate BOOL_OUTPUT_OPS to use get_bool_ops/get_comparison_ops/get_value_check_ops from op_types

Files using only base code (region_pattern, region_search, 5 test files, common, insertion_points, autotuner) resolved by accepting main's version.

Made-with: Cursor
📝 Walkthrough

A comprehensive subgraph-based quantize/dequantize (Q/DQ) placement autotuning system for ONNX models has been added, including documentation, CLI integration, TensorRT benchmarking infrastructure, graph fusion analysis, PyTorch region discovery, and a multi-phase optimization workflow.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as CLI / User
    participant WF as Subgraph Workflow
    participant GA as Graph Analysis<br/>(Fusion Groups)
    participant SG as Subgraph<br/>Extraction
    participant BM as Benchmarking<br/>(TensorRT)
    participant FS as FileSystem<br/>(Cache/Output)
    CLI->>WF: subgraph_autotuning_workflow(model_path, ...)
    rect rgba(100, 200, 150, 0.5)
        note over WF: Phase 1: Setup
        WF->>GA: parse graph.json + create_fusion_groups
        GA->>GA: Map TRT layers to ONNX nodes
        GA-->>WF: FusionGroup list
    end
    rect rgba(100, 150, 200, 0.5)
        note over WF: Phase 2: Per-Subgraph Profiling
        loop For each FusionGroup
            WF->>WF: generate_heuristic_schemes
            loop For each scheme
                WF->>SG: extract_subgraph_by_nodes
                SG-->>WF: subgraph bytes
                WF->>WF: insert_qdq_on_graph
                WF->>BM: benchmark_onnx_model
                BM-->>WF: latency_ms
            end
            WF->>WF: Select best_scheme
        end
        WF->>FS: Cache phase2 results
    end
    rect rgba(200, 150, 100, 0.5)
        note over WF: Phase 3: Full-Model Validation
        WF->>BM: Baseline FP16 benchmark full model
        BM-->>WF: baseline_latency_ms
        loop Incremental Validation (if enabled)
            WF->>WF: Apply candidate QDQ schemes
            WF->>BM: Benchmark full model per candidate
            BM-->>WF: Test latency_ms
            WF->>WF: Keep if latency improves
        end
        WF->>FS: Save optimized_final.onnx
        WF->>FS: Cache phase3 progress
    end
    WF-->>CLI: Return optimized_model_path
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 inconclusive)
✅ Passed checks (3 passed)
✨ Finishing Touches

🧪 Generate unit tests (beta)
📝 Coding Plan
Actionable comments posted: 16
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/onnx/quantization/autotune/benchmark.py (1)
34-34: ⚠️ Potential issue | 🟡 Minor

`# nosec` comments require explicit justification and approval. As per coding guidelines, use of `# nosec` comments to bypass Bandit security checks is not allowed. If this security-sensitive pattern is genuinely necessary, the PR must be reviewed and approved by @NVIDIA/modelopt-setup-codeowners with explicit justification in the PR description.

The subprocess import itself is safe when used properly (with list arguments, not shell=True), but the `# nosec B404` comment should be documented or removed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` at line 34, Remove the inline "# nosec B404" on the subprocess import (import subprocess) unless you add an explicit justification and approval from `@NVIDIA/modelopt-setup-codeowners` in the PR description; if the pattern is required, keep the comment but add the justification text to the PR and ensure all uses of subprocess in this module (functions/methods that call subprocess.run, subprocess.Popen, etc.) follow safe patterns (pass args as a list, avoid shell=True) and reference that approval in the PR description.
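The safe-subprocess pattern the reviewer describes (argv as a list, no shell=True) can be sketched as follows. `run_trtexec` and its parameters are hypothetical names for illustration, not the actual API of benchmark.py; `echo` stands in for the real trtexec binary.

```python
import subprocess


def run_trtexec(trtexec_path, onnx_path, extra_args=None):
    """Invoke a CLI tool with argv as a list; no shell ever parses the arguments."""
    cmd = [trtexec_path, f"--onnx={onnx_path}", *(extra_args or [])]
    # Because shell=True is not used, metacharacters in onnx_path are inert.
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


# With `echo` standing in for trtexec, the ';' is passed as literal text,
# never interpreted as a command separator.
print(run_trtexec("echo", "model.onnx; rm -rf /"))
```

The same call with `shell=True` and a concatenated string would hand the `;` to the shell, which is exactly what Bandit's B404/B603 family of checks guards against.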
🧹 Nitpick comments (6)
examples/qdq_placement/set_batch_size.py (1)
62-65: Consider handling large models (>2GB) for model verification.

The direct call to `onnx.checker.check_model(output_path)` may fail for models larger than 2GB due to protobuf size limits. The codebase has a utility in `modelopt/onnx/utils.py` (see the `check_model` function) that handles this case by using external data storage.

Suggested approach

```diff
+import tempfile
+import os
+
 def set_batch_size(model_path: str, batch_size: int, output_path: str) -> None:
     # ... existing code ...
     # Verify the saved model
     print("Verifying model...")
-    onnx.checker.check_model(output_path)
+    # Handle large models that exceed protobuf 2GB limit
+    saved_model = onnx.load(output_path)
+    if saved_model.ByteSize() > (2 * (1024**3)):
+        # For large models, check_model needs the file path with external data
+        onnx.checker.check_model(output_path)
+    else:
+        onnx.checker.check_model(saved_model)
     print("✓ Model saved and verified successfully!")
```

Alternatively, consider importing and using `modelopt.onnx.utils.check_model` for consistency with the rest of the codebase.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/qdq_placement/set_batch_size.py` around lines 62 - 65, Replace the direct call to onnx.checker.check_model(output_path) with the project's robust checker that handles >2GB models by importing and calling modelopt.onnx.utils.check_model; locate the verification block around the print("Verifying model...") and change the call to use modelopt.onnx.utils.check_model(output_path) (or import the function as check_model and call check_model(output_path)) so external-data large protobuf models are supported.

modelopt/onnx/quantization/autotune/qdq_utils.py (2)
60-60: Simplify redundant condition.

The check `node.input and len(node.input) > 0` is redundant since `node.input` being truthy already implies it's non-empty.

Suggested simplification

```diff
-    if node.input and len(node.input) > 0:
+    if node.input:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/qdq_utils.py` at line 60, The condition checking node inputs is redundant: replace the `if node.input and len(node.input) > 0:` guard with a simpler truthy check `if node.input:` in the function where this appears (look for the code handling `node.input` in qdq_utils.py) so the branch behavior remains identical but the expression is simplified.
1-1: Update copyright year for consistency.

The copyright year is 2024, but other new files in this PR use 2026. Consider updating for consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/qdq_utils.py` at line 1, Update the SPDX header year in modelopt/onnx/quantization/autotune/qdq_utils.py to match the rest of the PR (change "2024" to "2026"); locate the SPDX comment at the top of the file and modify the year in the copyright line so it is consistent across files.

tests/unit/onnx/quantization/autotune/test_config.py (1)
12-13: Remove unnecessary sys.path manipulation.

The `sys.path.insert` is unnecessary if the package is properly installed in the test environment. This pattern can cause import issues and is not recommended.

Suggested fix

```diff
-# Add parent directory to path
-sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-
 from modelopt.onnx.quantization.autotune.common import Config
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/autotune/test_config.py` around lines 12 - 13, Remove the ad-hoc sys.path manipulation in the test file: delete the sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) line in tests/unit/onnx/quantization/autotune/test_config.py and rely on the package being installed in the test environment (or use test runner configuration / PYTHONPATH) so imports resolve cleanly; do not add alternative path hacks in this file.

modelopt/onnx/quantization/autotune/subgraph_extractor.py (1)
170-181: Consider using `collections.deque` for BFS queue.

Using `list.pop(0)` is O(n) per operation. For potentially large graphs, using `collections.deque` with `popleft()` provides O(1) performance.

♻️ Suggested improvement

```diff
+from collections import deque
+
 def _find_reachable_graph_inputs(
     graph: gs.Graph,
     target_nodes: List[gs.Node],
 ) -> List[str]:
     """BFS backward from target_nodes to find graph inputs that feed into them."""
     graph_input_names = {t.name for t in graph.inputs}
     visited = set()
-    queue = []
+    queue = deque()
     result = []
     for node in target_nodes:
         for inp in node.inputs:
             if isinstance(inp, gs.Variable) and inp.name not in visited:
                 visited.add(inp.name)
                 queue.append(inp)
     while queue:
-        tensor = queue.pop(0)
+        tensor = queue.popleft()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/subgraph_extractor.py` around lines 170 - 181, The BFS loop in subgraph_extractor.py uses a Python list and pop(0), causing O(n) dequeuing; replace the list-based queue with collections.deque: import deque, initialize queue = deque(...) where the queue is created, and change queue.pop(0) to queue.popleft(), keeping queue.append(...) for enqueueing; update any code that constructs the initial queue and ensure imports include collections.deque so the BFS in the function that contains this loop uses O(1) dequeue operations.

examples/qdq_placement/README.md (1)
152-162: Add language specifier to fenced code blocks.

The directory structure code blocks should have a language specifier for consistency. Consider using `text` or `plaintext` for directory listings.

📝 Suggested fix

````diff
-```
+```text
 resnet50_results/
 ├── optimized_final.onnx   # Optimized model

-```
+```text
 <output_dir>/
 ├── optimized_final.onnx   # Incrementally validated model (if --incremental-validation)
````
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/qdq_placement/README.md` around lines 152 - 162, Update the fenced code blocks that show directory listings so they include a language specifier (use ```text) — specifically change the blocks containing the "resnet50_results/" tree and the block containing the "<output_dir>/" tree to start with ```text instead of ``` so the directory listings render consistently; locate the blocks by the directory headers "resnet50_results/" and "<output_dir>/" in the README and replace their opening fences accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/source/reference/2_qdq_placement.rst`:
- Line 828: The docs claim "Default threshold: 1.01" but the actual default is
defined as performance_threshold: float = 1.02 on the Config class in common.py;
update the documentation string in 2_qdq_placement.rst to state "Default
threshold: 1.02 (2% improvement minimum)" so the docs match the implementation
(or alternatively change Config.performance_threshold to 1.01 if you intend the
docs to be canonical).
In `@examples/qdq_placement/set_batch_size.py`:
- Around line 1-10: Add the required NVIDIA Apache 2.0 license header to the top
of the examples/qdq_placement/set_batch_size.py script (above or immediately
after the existing shebang and before the module docstring) so the file includes
the full license boilerplate mandated for examples/**/*.py; ensure the header
explicitly names "NVIDIA CORPORATION" and the Apache 2.0 terms and leave the
rest of the script (including the module docstring and usage comments)
unchanged.
In `@modelopt/onnx/quantization/autotune/fusion_grouping.py`:
- Around line 1-6: Add the required NVIDIA Apache 2.0 license header to the top
of the module modelopt/onnx/quantization/autotune/fusion_grouping.py by
inserting the standard multi-line NVIDIA Apache 2.0 header (including copyright
line, SPDX identifier and license text reference) as used across other files in
modelopt; ensure the header appears before the existing module docstring so
classes/functions in this file (e.g., the fusion grouping logic in
fusion_grouping.py) are properly licensed.
In `@modelopt/onnx/quantization/autotune/subgraph_extractor.py`:
- Around line 1-6: Add the NVIDIA Apache 2.0 license header to the top of the
module file modelopt/onnx/quantization/autotune/subgraph_extractor.py by
inserting the standard multi-line license comment before the existing module
docstring; ensure the header includes the SPDX identifier and full NVIDIA Apache
2.0 text (or the project-standard short header) so the file begins with the
required license block followed by the current docstring and existing code in
subgraph_extractor.py.
In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py`:
- Around line 1-8: This file is missing the NVIDIA Apache 2.0 license header;
prepend the standard NVIDIA Apache-2.0 license block to the very top of
subgraph_workflow.py (above the existing module docstring) including the
copyright line and license text/notice, ensuring the SPDX identifier
(Apache-2.0) or full license header is present and that the existing file
docstring remains unchanged below the header.
- Around line 99-105: The loops that mutate the protobuf repeated field use
nonstandard pop()/add() calls on dim_proto; replace popping with explicit
deletion (e.g., use del dim_proto[-1] in the shrink loop) and call
dim_proto.add() when growing but capture the returned element if you need to
initialize it (e.g., new_dim = dim_proto.add()); keep the subsequent loop that
sets dim_proto[i].ClearField("dim_param") and dim_proto[i].dim_value = d, but
ensure dim_proto has been resized using del and dim_proto.add() rather than
pop()/add() without using the returned message.
- Around line 39-42: QUANT_DTYPES currently sets "fp8" to np.int8 when the FP8
dtype isn't present, causing silent fallback; change this by removing the silent
fallback from the module-level QUANT_DTYPES and instead detect/import ml_dtypes
(or np.float8 variant) conditionally at runtime where quantization is requested
(e.g., inside the function(s) that handle quantization/autotune and any call
sites referencing QUANT_DTYPES), and when a user requests "fp8" but the FP8
dtype isn't available emit a clear user-facing warning or raise a descriptive
error; specifically, update references to QUANT_DTYPES to validate availability
at usage time, attempt a conditional import of ml_dtypes (or check hasattr(np,
"float8_e4m3fn")) there, and log/warn if FP8 cannot be supported rather than
silently substituting int8.
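The "validate at usage time" behavior this prompt asks for could look like the sketch below. `resolve_quant_dtype` is a hypothetical helper, and the `hasattr` / `ml_dtypes` probes are assumptions about how FP8 availability might be detected; the point is that an unavailable FP8 dtype fails loudly instead of silently becoming int8.

```python
import numpy as np


def resolve_quant_dtype(name: str):
    """Resolve a quantization dtype at the point of use, never at import time."""
    if name == "int8":
        return np.int8
    if name == "fp8":
        # Some numpy builds may expose a float8 dtype directly.
        if hasattr(np, "float8_e4m3fn"):
            return np.float8_e4m3fn
        try:
            import ml_dtypes  # optional dependency providing float8 dtypes
            return ml_dtypes.float8_e4m3fn
        except ImportError:
            # Fail loudly rather than silently substituting int8.
            raise RuntimeError(
                "fp8 requested but no FP8 dtype is available; "
                "install ml_dtypes or a numpy build with float8 support"
            )
    raise ValueError(f"unknown quant dtype: {name}")
```

Call sites that previously indexed a module-level QUANT_DTYPES table would instead call this at quantization time, so the error surfaces only when a user actually requests fp8.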
In `@modelopt/onnx/quantization/autotune/tensorrt_utils.py`:
- Around line 1-41: The license header at the top of the module (the
module-level docstring and SPDX header in tensorrt_utils.py) uses an unusual
year range "1993-2025"; update the copyright/SPDX header to the project's
standard single-year format (e.g., "Copyright (c) 2026 NVIDIA CORPORATION &
AFFILIATES") and adjust the SPDX-License-Identifier line if needed so the header
matches other files; edit the top-of-file docstring/SPDX block where the current
year range appears to replace it with the standard format.
- Around line 987-1000: Fix _save_timing_cache: the condition and method call
are wrong and the finally cleans up an undefined name. Change the inverted check
so you only call combine when self._timing_cache is NOT None, correct the typo
`combline` to `combine` for the timing cache object returned by
config.create_timing_cache, and in the finally block remove the undefined
`builder` deletion (either delete only `config` or explicitly reference
`self.builder` if you intended to delete it); ensure you still serialize and
write timing_cache_data to self.timing_cache_file when a timing cache exists.
In `@modelopt/onnx/quantization/autotune/torch_region_builder.py`:
- Around line 311-318: The _build_id_to_region_map function currently uses a
mutable default dict for id_to_region_map which can be shared across calls;
change the signature to default id_to_region_map to None and inside
_build_id_to_region_map initialize id_to_region_map = {} when it's None, then
proceed to assign id_to_region_map[region.id] = region and recursively call
self._build_id_to_region_map(child, id_to_region_map) so each top-level call
gets a fresh map; keep the return type dict[int, Region] and behavior otherwise
unchanged.
- Around line 16-24: The module docstring in torch_region_builder.py duplicates
the license/copyright header (the "SPDX-FileCopyrightText" lines and
years)—remove the duplicate license block from the top of the module docstring
so only the canonical file header remains; keep the descriptive docstring text
("Torch Region Builder - Hierarchical Region Discovery...") and ensure
functions/classes in this module (e.g., any top-level descriptions used by
torch_region_builder.py) are unaffected by the removal.
- Around line 752-754: The parameter only_quantizable is being ignored because
the code unconditionally sets only_quantizable = True; fix by either removing
that assignment so the function respects the incoming only_quantizable
parameter, or remove the only_quantizable parameter from the function signature
if it's not needed; locate the assignment near the logger.info(f"Loading model:
{onnx_path}") call in torch_region_builder.py and update the function that
accepts only_quantizable to either use the passed value or eliminate the unused
parameter and adjust all callers accordingly.
- Around line 320-331: The _build_tensor_to_regions_map function uses a mutable
default argument (dict = {}), which can be shared across calls; change the
signature to accept tensor_to_regions_map: Optional[dict[str, set[int]]] = None
and inside the function initialize tensor_to_regions_map = {} if None, mirroring
the fix used in _build_id_to_region_map; keep the recursive behavior and return
the map as before so each top-level call gets a fresh dict.
- Line 333: The _merge_neighboring_regions function currently uses a mutable
default argument for to_remove (set()); change the signature to accept None
(e.g., to_remove: Optional[set[int]] = None or to_remove: set[int] | None =
None) and inside the body initialize it to an empty set when None (e.g., if
to_remove is None: to_remove = set()), updating any type imports as needed and
leaving the rest of the logic in _merge_neighboring_regions unchanged.
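The None-default pattern recommended for all three helpers above (`_build_id_to_region_map`, `_build_tensor_to_regions_map`, `_merge_neighboring_regions`) can be shown in isolation. `build_id_map` below is a hypothetical stand-in, not the module's actual function; the key point is that a literal `{}` default is evaluated once at `def` time and shared across calls, while a None default gives each top-level call a fresh map.

```python
def build_id_map(region, id_map=None):
    """Recursively map region id -> name; fresh dict per top-level call."""
    if id_map is None:
        id_map = {}
    id_map[region["id"]] = region["name"]
    for child in region.get("children", []):
        build_id_map(child, id_map)  # recursion shares the same dict
    return id_map


tree = {"id": 1, "name": "root", "children": [{"id": 2, "name": "leaf"}]}
first = build_id_map(tree)
second = build_id_map(tree)
assert first == {1: "root", 2: "leaf"}
assert first is not second  # no state leaks between calls
```

With `def build_id_map(region, id_map={})` instead, the second call would mutate the same dict the first call returned, which is exactly the bug class the review flags.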
In `@tests/unit/onnx/quantization/autotune/test_config.py`:
- Around line 71-82: The test test_performance_threshold_validation is
incomplete: it only asserts valid values for Config.performance_threshold but
doesn't check that invalid values are rejected; either add validation in Config
or update the test to assert the expected failure. If Config should validate,
implement a check in Config.__init__ or the performance_threshold setter to
raise ValueError for values < 1.0 and then update
test_performance_threshold_validation to include a with
self.assertRaises(ValueError): Config(performance_threshold=0.9) case; otherwise
remove the misleading comment and the "Invalid values" note from
test_performance_threshold_validation.
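If the "Config should validate" branch of this suggestion is taken, the test could look like the hedged sketch below. The `Config` class here is a minimal stand-in for the real one in common.py (whose actual fields and validation are assumptions for this example), showing only the `assertRaises` shape the comment asks for.

```python
import unittest


class Config:
    """Hypothetical stand-in: rejects performance thresholds below 1.0."""

    def __init__(self, performance_threshold: float = 1.02):
        if performance_threshold < 1.0:
            raise ValueError("performance_threshold must be >= 1.0")
        self.performance_threshold = performance_threshold


class TestPerformanceThreshold(unittest.TestCase):
    def test_valid_values(self):
        # Values >= 1.0 are accepted as-is.
        self.assertEqual(Config(performance_threshold=1.05).performance_threshold, 1.05)

    def test_invalid_values_rejected(self):
        # Values below 1.0 must raise, not be silently stored.
        with self.assertRaises(ValueError):
            Config(performance_threshold=0.9)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestPerformanceThreshold)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```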
- Around line 1-6: Add the required NVIDIA Apache 2.0 license header to the top
of tests/unit/onnx/quantization/autotune/test_config.py: insert the standard
NVIDIA Apache-2.0 license block (including copyright notice/SPDX identifier and
license text) immediately after the existing shebang (#!/usr/bin/env python3)
and before the module docstring so the file complies with the repository's
license/header convention.
---
Outside diff comments:
In `@modelopt/onnx/quantization/autotune/benchmark.py`:
- Line 34: Remove the inline "# nosec B404" on the subprocess import (import
subprocess) unless you add an explicit justification and approval from
`@NVIDIA/modelopt-setup-codeowners` in the PR description; if the pattern is
required, keep the comment but add the justification text to the PR and ensure
all uses of subprocess in this module (functions/methods that call
subprocess.run, subprocess.Popen, etc.) follow safe patterns (pass args as a
list, avoid shell=True) and reference that approval in the PR description.
---
Nitpick comments:
In `@examples/qdq_placement/README.md`:
- Around line 152-162: Update the fenced code blocks that show directory
listings so they include a language specifier (use ```text) — specifically
change the blocks containing the "resnet50_results/" tree and the block
containing the "<output_dir>/" tree to start with ```text instead of ``` so the
directory listings render consistently; locate the blocks by the directory
headers "resnet50_results/" and "<output_dir>/" in the README and replace their
opening fences accordingly.
In `@examples/qdq_placement/set_batch_size.py`:
- Around line 62-65: Replace the direct call to
onnx.checker.check_model(output_path) with the project's robust checker that
handles >2GB models by importing and calling modelopt.onnx.utils.check_model;
locate the verification block around the print("Verifying model...") and change
the call to use modelopt.onnx.utils.check_model(output_path) (or import the
function as check_model and call check_model(output_path)) so external-data
large protobuf models are supported.
In `@modelopt/onnx/quantization/autotune/qdq_utils.py`:
- Line 60: The condition checking node inputs is redundant: replace the `if
node.input and len(node.input) > 0:` guard with a simpler truthy check `if
node.input:` in the function where this appears (look for the code handling
`node.input` in qdq_utils.py) so the branch behavior remains identical but the
expression is simplified.
- Line 1: Update the SPDX header year in
modelopt/onnx/quantization/autotune/qdq_utils.py to match the rest of the PR
(change "2024" to "2026"); locate the SPDX comment at the top of the file and
modify the year in the copyright line so it is consistent across files.
In `@modelopt/onnx/quantization/autotune/subgraph_extractor.py`:
- Around line 170-181: The BFS loop in subgraph_extractor.py uses a Python list
and pop(0), causing O(n) dequeuing; replace the list-based queue with
collections.deque: import deque, initialize queue = deque(...) where the queue
is created, and change queue.pop(0) to queue.popleft(), keeping
queue.append(...) for enqueueing; update any code that constructs the initial
queue and ensure imports include collections.deque so the BFS in the function
that contains this loop uses O(1) dequeue operations.
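To make the deque suggestion concrete outside the module's own types, here is a self-contained backward BFS over a toy predecessor map; `reachable_graph_inputs` and the node names are hypothetical illustrations, not the module's API. `deque.popleft()` is O(1), whereas `list.pop(0)` shifts every remaining element on each dequeue.

```python
from collections import deque


def reachable_graph_inputs(predecessors, graph_inputs, start_nodes):
    """Backward BFS: which graph inputs eventually feed into start_nodes?"""
    visited = set(start_nodes)
    queue = deque(start_nodes)  # O(1) popleft; a plain list would be O(n)
    found = []
    while queue:
        item = queue.popleft()
        if item in graph_inputs:
            found.append(item)
        for pred in predecessors.get(item, []):
            if pred not in visited:
                visited.add(pred)
                queue.append(pred)
    return found


# Toy graph: input_a -> conv -> relu and input_b -> add both feed matmul.
preds = {"matmul": ["relu", "add"], "relu": ["conv"],
         "conv": ["input_a"], "add": ["input_b"]}
print(reachable_graph_inputs(preds, {"input_a", "input_b"}, ["matmul"]))
# → ['input_b', 'input_a']
```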
In `@tests/unit/onnx/quantization/autotune/test_config.py`:
- Around line 12-13: Remove the ad-hoc sys.path manipulation in the test file:
delete the sys.path.insert(0,
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) line in
tests/unit/onnx/quantization/autotune/test_config.py and rely on the package
being installed in the test environment (or use test runner configuration /
PYTHONPATH) so imports resolve cleanly; do not add alternative path hacks in
this file.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 48cf674c-cea7-4964-b1ed-c9c27e9d6e6e
📒 Files selected for processing (15)

- docs/source/guides/9_qdq_placement.rst
- docs/source/reference/2_qdq_placement.rst
- examples/qdq_placement/README.md
- examples/qdq_placement/set_batch_size.py
- modelopt/onnx/quantization/autotune/__init__.py
- modelopt/onnx/quantization/autotune/__main__.py
- modelopt/onnx/quantization/autotune/benchmark.py
- modelopt/onnx/quantization/autotune/fusion_grouping.py
- modelopt/onnx/quantization/autotune/qdq_utils.py
- modelopt/onnx/quantization/autotune/subgraph_extractor.py
- modelopt/onnx/quantization/autotune/subgraph_workflow.py
- modelopt/onnx/quantization/autotune/tensorrt_utils.py
- modelopt/onnx/quantization/autotune/torch_region_builder.py
- modelopt/onnx/quantization/autotune/workflows.py
- tests/unit/onnx/quantization/autotune/test_config.py
```
if speedup >= config.performance_threshold:
    accept_scheme()

Default threshold: 1.01 (1% improvement minimum)
```
Documentation inconsistency with actual default value.
The documentation states "Default threshold: 1.01 (1% improvement minimum)" but the Config class in common.py defines performance_threshold: float = 1.02 (2% improvement). This should be corrected to match the actual implementation.
Suggested fix

```diff
-Default threshold: 1.01 (1% improvement minimum)
+Default threshold: 1.02 (2% improvement minimum)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```
Default threshold: 1.02 (2% improvement minimum)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/source/reference/2_qdq_placement.rst` at line 828, The docs claim
"Default threshold: 1.01" but the actual default is defined as
performance_threshold: float = 1.02 on the Config class in common.py; update the
documentation string in 2_qdq_placement.rst to state "Default threshold: 1.02
(2% improvement minimum)" so the docs match the implementation (or alternatively
change Config.performance_threshold to 1.01 if you intend the docs to be
canonical).
```python
#!/usr/bin/env python3
"""
Script to set a fixed batch size for ONNX models.

This script modifies an ONNX model with dynamic batch size to use a fixed batch size,
which is often beneficial for TensorRT performance benchmarking.

Usage:
    python set_batch_size.py resnet50_Opset17.onnx --batch-size 128 --output resnet50.bs128.onnx
"""
```
Missing required NVIDIA Apache 2.0 license header.
Per coding guidelines, all new Python files in examples/**/*.py require the NVIDIA Apache 2.0 license header.
Add license header
#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
"""
Script to set a fixed batch size for ONNX models.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qdq_placement/set_batch_size.py` around lines 1 - 10, Add the
required NVIDIA Apache 2.0 license header to the top of the
examples/qdq_placement/set_batch_size.py script (above or immediately after the
existing shebang and before the module docstring) so the file includes the full
license boilerplate mandated for examples/**/*.py; ensure the header explicitly
names "NVIDIA CORPORATION" and the Apache 2.0 terms and leave the rest of the
script (including the module docstring and usage comments) unchanged.
```python
"""Fusion-aware grouping of ONNX nodes based on TensorRT graph.json.

Parses TensorRT's exported layer information (graph.json) to understand
how ONNX operations are fused into TRT layers. Creates FusionGroups that
map back to ONNX node names, enabling subgraph-level QDQ optimization.
"""
```
Missing required NVIDIA Apache 2.0 license header.
Per coding guidelines, all new Python files in modelopt/**/*.py require the NVIDIA Apache 2.0 license header.
Add license header
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
"""Fusion-aware grouping of ONNX nodes based on TensorRT graph.json.

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Fusion-aware grouping of ONNX nodes based on TensorRT graph.json.

Parses TensorRT's exported layer information (graph.json) to understand
how ONNX operations are fused into TRT layers. Creates FusionGroups that
map back to ONNX node names, enabling subgraph-level QDQ optimization.
"""
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/quantization/autotune/fusion_grouping.py` around lines 1 - 6,
Add the required NVIDIA Apache 2.0 license header to the top of the module
modelopt/onnx/quantization/autotune/fusion_grouping.py by inserting the standard
multi-line NVIDIA Apache 2.0 header (including copyright line, SPDX identifier
and license text reference) as used across other files in modelopt; ensure the
header appears before the existing module docstring so classes/functions in this
file (e.g., the fusion grouping logic in fusion_grouping.py) are properly
licensed.
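Where several files need the same fix, a small checker helps. The following is a hypothetical helper (not part of this PR; the function name and exact check are illustrative) that verifies a file starts with the required SPDX lines, allowing an optional shebang first:

```python
from pathlib import Path

def has_nvidia_spdx_header(path: Path) -> bool:
    """Return True if the file begins with the SPDX header (shebang allowed first)."""
    lines = path.read_text(encoding="utf-8").splitlines()
    if lines and lines[0].startswith("#!"):
        lines = lines[1:]  # a shebang may legitimately precede the header
    return (
        len(lines) >= 2
        and lines[0].startswith("# SPDX-FileCopyrightText:")
        and lines[1] == "# SPDX-License-Identifier: Apache-2.0"
    )
```

Running this over `modelopt/**/*.py` in CI would catch the missing headers flagged in this review before merge.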
| """Extract standalone ONNX subgraphs from a full model using onnx_graphsurgeon. | ||
|
|
||
| Given boundary tensor names (inputs/outputs), marks them as the subgraph's | ||
| I/O, cleans up unreferenced nodes, runs shape inference, and serializes | ||
| the result to bytes for direct TensorRT consumption. | ||
| """ |
Missing NVIDIA Apache 2.0 license header.
As per coding guidelines, all new Python files require the NVIDIA Apache 2.0 license header at the top of the file.
📄 Add license header
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
"""Extract standalone ONNX subgraphs from a full model using onnx_graphsurgeon.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| """Extract standalone ONNX subgraphs from a full model using onnx_graphsurgeon. | |
| Given boundary tensor names (inputs/outputs), marks them as the subgraph's | |
| I/O, cleans up unreferenced nodes, runs shape inference, and serializes | |
| the result to bytes for direct TensorRT consumption. | |
| """ | |
| # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # SPDX-License-Identifier: Apache-2.0 | |
| # | |
| # Licensed under the Apache License, Version 2.0 (the "License"); | |
| # you may not use this file except in compliance with the License. | |
| # You may obtain a copy of the License at | |
| # | |
| # http://www.apache.org/licenses/LICENSE-2.0 | |
| # | |
| # Unless required by applicable law or agreed to in writing, software | |
| # distributed under the License is distributed on an "AS IS" BASIS, | |
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |
| # See the License for the specific language governing permissions and | |
| # limitations under the License. | |
| """Extract standalone ONNX subgraphs from a full model using onnx_graphsurgeon. | |
| Given boundary tensor names (inputs/outputs), marks them as the subgraph's | |
| I/O, cleans up unreferenced nodes, runs shape inference, and serializes | |
| the result to bytes for direct TensorRT consumption. | |
| """ |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/quantization/autotune/subgraph_extractor.py` around lines 1 -
6, Add the NVIDIA Apache 2.0 license header to the top of the module file
modelopt/onnx/quantization/autotune/subgraph_extractor.py by inserting the
standard multi-line license comment before the existing module docstring; ensure
the header includes the SPDX identifier and full NVIDIA Apache 2.0 text (or the
project-standard short header) so the file begins with the required license
block followed by the current docstring and existing code in
subgraph_extractor.py.
| """Subgraph-based QDQ autotune workflow. | ||
|
|
||
| Uses fusion-aware subgraph extraction and heuristic QDQ schemes to optimize | ||
| Q/DQ placement. Reduces autotune time from ~25 hours to ~30 minutes by: | ||
| 1. Grouping ONNX nodes by TRT fusion boundaries (graph.json) | ||
| 2. Profiling isolated subgraphs instead of full model | ||
| 3. Using domain-informed heuristic schemes instead of random mutation | ||
| """ |
Missing NVIDIA Apache 2.0 license header.
As per coding guidelines, all new Python files require the NVIDIA Apache 2.0 license header at the top of the file.
📄 Add license header
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
"""Subgraph-based QDQ autotune workflow.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| """Subgraph-based QDQ autotune workflow. | |
| Uses fusion-aware subgraph extraction and heuristic QDQ schemes to optimize | |
| Q/DQ placement. Reduces autotune time from ~25 hours to ~30 minutes by: | |
| 1. Grouping ONNX nodes by TRT fusion boundaries (graph.json) | |
| 2. Profiling isolated subgraphs instead of full model | |
| 3. Using domain-informed heuristic schemes instead of random mutation | |
| """ | |
| # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # SPDX-License-Identifier: Apache-2.0 | |
| # | |
| # Licensed under the Apache License, Version 2.0 (the "License"); | |
| # you may not use this file except in compliance with the License. | |
| # You may obtain a copy of the License at | |
| # | |
| # http://www.apache.org/licenses/LICENSE-2.0 | |
| # | |
| # Unless required by applicable law or agreed to in writing, software | |
| # distributed under the License is distributed on an "AS IS" BASIS, | |
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |
| # See the License for the specific language governing permissions and | |
| # limitations under the License. | |
| """Subgraph-based QDQ autotune workflow. | |
| Uses fusion-aware subgraph extraction and heuristic QDQ schemes to optimize | |
| Q/DQ placement. Reduces autotune time from ~25 hours to ~30 minutes by: | |
| 1. Grouping ONNX nodes by TRT fusion boundaries (graph.json) | |
| 2. Profiling isolated subgraphs instead of full model | |
| 3. Using domain-informed heuristic schemes instead of random mutation | |
| """ |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py` around lines 1 - 8,
This file is missing the NVIDIA Apache 2.0 license header; prepend the standard
NVIDIA Apache-2.0 license block to the very top of subgraph_workflow.py (above
the existing module docstring) including the copyright line and license
text/notice, ensuring the SPDX identifier (Apache-2.0) or full license header is
present and that the existing file docstring remains unchanged below the header.
def _build_tensor_to_regions_map(
    self, region: Region, tensor_to_regions_map: dict[str, set[int]] = {}
) -> dict[str, set[int]]:
    """Build a map from tensor names to regions."""
    for input in region.inputs:
        if input not in tensor_to_regions_map:
            tensor_to_regions_map[input] = set()
        tensor_to_regions_map[input].add(region.id)

    for child in region.get_children():
        self._build_tensor_to_regions_map(child, tensor_to_regions_map)
    return tensor_to_regions_map
Mutable default argument anti-pattern.
Same issue as _build_id_to_region_map - using mutable default dict = {}.
Proposed fix
def _build_tensor_to_regions_map(
- self, region: Region, tensor_to_regions_map: dict[str, set[int]] = {}
+ self, region: Region, tensor_to_regions_map: dict[str, set[int]] | None = None
) -> dict[str, set[int]]:
"""Build a map from tensor names to regions."""
+ if tensor_to_regions_map is None:
+ tensor_to_regions_map = {}
 for input in region.inputs:

Affected file: `modelopt/onnx/quantization/autotune/torch_region_builder.py`, lines 320-331.
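A minimal demo (illustrative names, not code from this PR) of why the mutable default flagged above is dangerous: the default dict is created once, at function definition time, and is shared by every call that omits the argument.

```python
def build_map(key, out={}):  # buggy: one shared default dict
    out.setdefault(key, set()).add(0)
    return out

def build_map_fixed(key, out=None):  # fixed: fresh dict per call
    if out is None:
        out = {}
    out.setdefault(key, set()).add(0)
    return out

a = build_map("x")
b = build_map("y")
print(a is b, sorted(b))   # True ['x', 'y'] -- state leaked across calls

c = build_map_fixed("x")
d = build_map_fixed("y")
print(c is d, sorted(d))   # False ['y'] -- independent results
```

In the recursive `_build_tensor_to_regions_map`, this leak would make a second top-level call start from the previous call's accumulated map.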
    self._build_tensor_to_regions_map(child, tensor_to_regions_map)
    return tensor_to_regions_map


def _merge_neighboring_regions(self, region: Region, to_remove: set[int] = set()) -> None:
Mutable default argument anti-pattern.
Same issue - using mutable default set = set().
Proposed fix
-def _merge_neighboring_regions(self, region: Region, to_remove: set[int] = set()) -> None:
+def _merge_neighboring_regions(self, region: Region, to_remove: set[int] | None = None) -> None:
+ if to_remove is None:
+ to_remove = set()
 self._compute_all_boundaries(region)

Affected file: `modelopt/onnx/quantization/autotune/torch_region_builder.py`, line 333.
| """ | ||
| only_quantizable = True | ||
| logger.info(f"Loading model: {onnx_path}") |
Function parameter immediately overwritten.
The only_quantizable parameter is accepted but immediately overwritten to True on line 753, making the parameter useless. Either remove the parameter or respect its value.
Option 1: Remove the override to respect the parameter
def inspect_torch_regions(
onnx_path: str,
include_all_regions: bool = False,
only_quantizable: bool = False,
) -> list[Region]:
- only_quantizable = True
logger.info(f"Loading model: {onnx_path}")Option 2: Remove the unused parameter
def inspect_torch_regions(
onnx_path: str,
include_all_regions: bool = False,
- only_quantizable: bool = False,
) -> list[Region]:
+ only_quantizable = True
logger.info(f"Loading model: {onnx_path}")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| """ | |
| only_quantizable = True | |
| logger.info(f"Loading model: {onnx_path}") | |
| def inspect_torch_regions( | |
| onnx_path: str, | |
| include_all_regions: bool = False, | |
| only_quantizable: bool = False, | |
| ) -> list[Region]: | |
| logger.info(f"Loading model: {onnx_path}") |
| """ | |
| only_quantizable = True | |
| logger.info(f"Loading model: {onnx_path}") | |
| def inspect_torch_regions( | |
| onnx_path: str, | |
| include_all_regions: bool = False, | |
| ) -> list[Region]: | |
| only_quantizable = True | |
| logger.info(f"Loading model: {onnx_path}") |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/quantization/autotune/torch_region_builder.py` around lines 752
- 754, The parameter only_quantizable is being ignored because the code
unconditionally sets only_quantizable = True; fix by either removing that
assignment so the function respects the incoming only_quantizable parameter, or
remove the only_quantizable parameter from the function signature if it's not
needed; locate the assignment near the logger.info(f"Loading model:
{onnx_path}") call in torch_region_builder.py and update the function that
accepts only_quantizable to either use the passed value or eliminate the unused
parameter and adjust all callers accordingly.
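A tiny illustration (hypothetical function, not the PR's code) of the dead-parameter bug: the caller's value is accepted, then silently discarded, so the keyword argument has no effect.

```python
def inspect_regions(onnx_path, only_quantizable=False):
    only_quantizable = True  # overrides whatever the caller passed
    return only_quantizable

print(inspect_regions("model.onnx", only_quantizable=False))  # True, surprisingly
```

Either fix from the review removes the surprise; linters (e.g. Ruff's ARG rules) can also flag arguments that are only ever reassigned.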
#!/usr/bin/env python3
"""
Tests for the Config class in the autotuner.

Tests configuration parameter validation and defaults.
"""
Missing required NVIDIA Apache 2.0 license header.
Per coding guidelines, all new Python files require the NVIDIA Apache 2.0 license header.
Add license header
 #!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """
 Tests for the Config class in the autotuner.

Note: the header goes after the shebang so that `#!/usr/bin/env python3` stays on line 1.

Affected file: `tests/unit/onnx/quantization/autotune/test_config.py`, lines 1-6.
def test_performance_threshold_validation(self):
    """Test that performance_threshold must be >= 1.0."""
    # Valid values
    config1 = Config(performance_threshold=1.0)
    self.assertEqual(config1.performance_threshold, 1.0)

    config2 = Config(performance_threshold=1.5)
    self.assertEqual(config2.performance_threshold, 1.5)

    # Invalid values should not be accepted
    # Note: This test assumes validation exists, if not we should add it
    print("✓ Config performance_threshold validation")
Incomplete validation test - invalid values not actually tested.
The test test_performance_threshold_validation only tests valid values but the comment on line 80-81 indicates invalid values should be rejected. Either implement the validation in Config and test it, or remove the misleading comment.
Option 1: If validation exists, add the test
def test_performance_threshold_validation(self):
"""Test that performance_threshold must be >= 1.0."""
# Valid values
config1 = Config(performance_threshold=1.0)
self.assertEqual(config1.performance_threshold, 1.0)
config2 = Config(performance_threshold=1.5)
self.assertEqual(config2.performance_threshold, 1.5)
# Invalid values should raise ValueError
with self.assertRaises(ValueError):
Config(performance_threshold=0.9)
print("✓ Config performance_threshold validation")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/onnx/quantization/autotune/test_config.py` around lines 71 - 82,
The test test_performance_threshold_validation is incomplete: it only asserts
valid values for Config.performance_threshold but doesn't check that invalid
values are rejected; either add validation in Config or update the test to
assert the expected failure. If Config should validate, implement a check in
Config.__init__ or the performance_threshold setter to raise ValueError for
values < 1.0 and then update test_performance_threshold_validation to include a
with self.assertRaises(ValueError): Config(performance_threshold=0.9) case;
otherwise remove the misleading comment and the "Invalid values" note from
test_performance_threshold_validation.
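A sketch of what Option 1 could look like on the Config side. The real Config class surely has more fields and may not be a dataclass; this only shows where a `>= 1.0` check could live so that the `assertRaises(ValueError)` case becomes meaningful.

```python
from dataclasses import dataclass

@dataclass
class Config:
    performance_threshold: float = 1.0

    def __post_init__(self):
        # Reject thresholds below 1.0 at construction time
        if self.performance_threshold < 1.0:
            raise ValueError(
                f"performance_threshold must be >= 1.0, "
                f"got {self.performance_threshold}"
            )

Config(performance_threshold=1.5)      # accepted
try:
    Config(performance_threshold=0.9)  # rejected
except ValueError as e:
    print(e)
```

Validating in `__post_init__` keeps the check next to the field definition, so every construction path (CLI, tests, API) hits it.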
@Hale423 please find my comments below:

Thanks
Pull Request: ONNX Q/DQ Autotuning with Subgraph Mode
Branch: `dev-wahao-autotune-subgraph-profile` → `main`
Type: Feature
Summary
This PR adds automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models using TensorRT performance measurements. It introduces two workflow modes; the subgraph mode uses fusion-aware grouping from TensorRT `graph.json` and profiles isolated subgraphs for much faster tuning on large or dynamic-shape models (~30 min vs ~25 h in practice).

Subgraph mode is the main addition over a baseline "auto QDQ placement" implementation: it uses TRT fusion info, optional per-layer timing, incremental full-model validation, and cache/resume.
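The fusion-aware grouping described above can be sketched roughly as follows. This assumes a simplified `graph.json` layout (a `Layers` list whose entries record the fused ONNX node names under an `Origin` key) — the actual TensorRT layer-info schema and the PR's parser may differ.

```python
import json

def fusion_groups(graph_json_text: str) -> list[dict]:
    """Group ONNX node names by the TRT layer they were fused into."""
    layers = json.loads(graph_json_text).get("Layers", [])
    groups = []
    for layer in layers:
        onnx_nodes = layer.get("Origin", [])
        if onnx_nodes:  # keep only layers traceable back to ONNX nodes
            groups.append({"trt_layer": layer["Name"], "onnx_nodes": onnx_nodes})
    return groups

sample = json.dumps({"Layers": [
    {"Name": "conv1 + relu1", "Origin": ["conv1", "relu1"]},  # fused pair
    {"Name": "pool1", "Origin": ["pool1"]},
]})
print(fusion_groups(sample))
```

In the real workflow these groups define the fusion boundaries at which subgraphs are extracted and profiled, so Q/DQ placement never splits a TRT fusion.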
What’s New (vs main)
- `modelopt.onnx.quantization.autotune` package: region discovery, scheme generation, TensorRT benchmarking (Python API + optional trtexec), pattern cache, QDQ baseline import.
- `--mode subgraph`: fusion-aware grouping from TensorRT `graph.json`; per-subgraph QDQ scheme profiling; optional per-layer timing when trtexec supports it (with fallback to total latency).
- `fusion_grouping.py`: parse TRT `graph.json`, build fusion groups, infer shapes for extracted subgraphs. If `--graph-json` is omitted, runs trtexec once to generate `graph.json` (FP16 build with `--exportLayerInfo`).
- Incremental validation: produces `optimized_raw.onnx` (all qualifying QDQ) and `optimized_final.onnx` (validated). Default: on (`--incremental-validation`); use `--no-incremental-validation` to disable.
- Cache/resume: `autotune_cache.json` for Phase 2 (subgraph profiling) and Phase 3 (incremental validation). Re-running the same command resumes from the last checkpoint.
- `--use-trtexec` plus `--trtexec-args` for benchmarking with dynamic shapes (e.g. `--optShapes`) and custom options (e.g. `--useCudaGraph`, `--stronglyTyped`). trtexec profiling flags are optional; on "Unknown option" the code strips them and retries (fallback to total latency).
- `examples/qdq_placement/`: README (Quick Start, region vs subgraph, output layout, subgraph best practices) and `set_batch_size.py` for fixed-batch ResNet50.

Key Files
- `modelopt/onnx/quantization/autotune/__main__.py`: `--mode`, `--graph-json`, `--incremental-validation`, `--use-trtexec`, `--trtexec-args`, etc.
- `modelopt/onnx/quantization/autotune/subgraph_workflow.py`
- `modelopt/onnx/quantization/autotune/fusion_grouping.py`: parse `graph.json`, create fusion groups, `generate_graph_json()` (trtexec FP16 build when no graph is provided).
- `modelopt/onnx/quantization/autotune/subgraph_extractor.py`
- `modelopt/onnx/quantization/autotune/tensorrt_utils.py`: `export_profile_path`, profiling-flag dedup and "Unknown option" retry without profiling.
- `modelopt/onnx/quantization/autotune/workflows.py`: `benchmark_onnx_model()`; passes through `export_profile_path` when using trtexec.
- `modelopt/onnx/quantization/autotune/autotuner.py`
- `modelopt/onnx/quantization/autotune/region_*.py`
- `examples/qdq_placement/README.md`
- `examples/qdq_placement/set_batch_size.py`

How to Test
Region mode (no trtexec):
Subgraph mode with trtexec (FP8, optional graph.json):
Resume: kill the subgraph run mid-way, then re-run the same command; it should resume from `autotune_cache.json`.

Checklist
- Subgraph mode runs with `--use-trtexec` (with or without `--graph-json`).
- Without `--graph-json`, one trtexec FP16 build runs and produces `*.fp16.graph.json` in the output dir.
- `examples/qdq_placement/README.md` matches behavior (region vs subgraph, outputs, best practices).

Documentation
- `examples/qdq_placement/README.md`: Quick Start, subgraph best practices, output layout, optional graph generation.
- `docs/source/guides/9_qdq_placement.rst` and `docs/source/reference/2_qdq_placement.rst`: confirm they align with the CLI and behavior above when submitting.

Notes
- trtexec builds that reject `--exportProfile`/`--profilingVerbosity` are handled by retrying without those flags and using total latency for scheme selection.

Summary by CodeRabbit
Release Notes
New Features
- `--workflow`, `--graph_json`, and `--incremental_validation` options.

Documentation
Tests