[Feature Request] Support ONNX Q/DQ Autotuning with Subgraph Mode #1015
Hale423 wants to merge 10 commits into NVIDIA:main
Conversation
Signed-off-by: Will Guo <willg@nvidia.com>
- Add export_profile_path support; append --exportProfile/--profilingVerbosity when requested
- Skip adding --separateProfileRun if already present in user trtexec args
- On trtexec 'Unknown option' error, strip profiling flags and retry once without them
- Set _profile_unsupported so later runs use total-latency comparison only
- Extract _exec_and_log for shared run-and-log logic

Made-with: Cursor
cjluo-nv
left a comment
This PR introduces 16k+ lines of changes. Please consider sharing a design and get design review.
Thanks for the feedback. Sharing this design, please kindly take a look.

Design: ONNX Q/DQ Autotuning for TensorRT

Design review document for PR #1015

1. Background

TensorRT performance for quantized ONNX models depends not only on whether Q/DQ nodes exist, but also on where they are inserted. In practice:
This branch introduces an ONNX Q/DQ autotuning system that searches for better Q/DQ placement using actual TensorRT latency measurements. The design intentionally supports two workflows:
2. Goals
3. Non-goals
4. Scope Relative to
Resolve add/add conflicts across 13 files by taking main's refactored codebase (autotuner_base.py split, benchmark.py extraction, op_types reorganization) and re-applying subgraph-mode additions:

- __init__.py: add subgraph_autotuning_workflow export
- __main__.py: add --workflow subgraph/region, --graph_json, --incremental_validation CLI arguments
- workflows.py: add strip_shape_args, extra_run_args, export_profile_path parameters to benchmark_onnx_model
- benchmark.py: add profiling support (export_profile_path, strip_shape_args) with trtexec fallback on unsupported flags
- subgraph_workflow.py: migrate BOOL_OUTPUT_OPS to use get_bool_ops/get_comparison_ops/get_value_check_ops from op_types

Files using only base code (region_pattern, region_search, 5 test files, common, insertion_points, autotuner) resolved by accepting main's version.

Made-with: Cursor
📝 Walkthrough

A comprehensive subgraph-based quantize/dequantize (Q/DQ) placement autotuning system for ONNX models has been added, including documentation, CLI integration, TensorRT benchmarking infrastructure, graph fusion analysis, PyTorch region discovery, and a multi-phase optimization workflow.

Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant CLI as CLI / User
    participant WF as Subgraph Workflow
    participant GA as Graph Analysis<br/>(Fusion Groups)
    participant SG as Subgraph<br/>Extraction
    participant BM as Benchmarking<br/>(TensorRT)
    participant FS as FileSystem<br/>(Cache/Output)
    CLI->>WF: subgraph_autotuning_workflow(model_path, ...)
    rect rgba(100, 200, 150, 0.5)
        note over WF: Phase 1: Setup
        WF->>GA: parse graph.json + create_fusion_groups
        GA->>GA: Map TRT layers to ONNX nodes
        GA-->>WF: FusionGroup list
    end
    rect rgba(100, 150, 200, 0.5)
        note over WF: Phase 2: Per-Subgraph Profiling
        loop For each FusionGroup
            WF->>WF: generate_heuristic_schemes
            loop For each scheme
                WF->>SG: extract_subgraph_by_nodes
                SG-->>WF: subgraph bytes
                WF->>WF: insert_qdq_on_graph
                WF->>BM: benchmark_onnx_model
                BM-->>WF: latency_ms
            end
            WF->>WF: Select best_scheme
        end
        WF->>FS: Cache phase2 results
    end
    rect rgba(200, 150, 100, 0.5)
        note over WF: Phase 3: Full-Model Validation
        WF->>BM: Baseline FP16 benchmark full model
        BM-->>WF: baseline_latency_ms
        loop Incremental Validation (if enabled)
            WF->>WF: Apply candidate QDQ schemes
            WF->>BM: Benchmark full model per candidate
            BM-->>WF: Test latency_ms
            WF->>WF: Keep if latency improves
        end
        WF->>FS: Save optimized_final.onnx
        WF->>FS: Cache phase3 progress
    end
    WF-->>CLI: Return optimized_model_path
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes

Important: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 inconclusive)
✅ Passed checks (3 passed)
✨ Finishing Touches

🧪 Generate unit tests (beta)
📝 Coding Plan
Actionable comments posted: 16
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/onnx/quantization/autotune/benchmark.py (1)
34-34: ⚠️ Potential issue | 🟡 Minor

`# nosec` comments require explicit justification and approval. As per coding guidelines, use of `# nosec` comments to bypass Bandit security checks is not allowed. If this security-sensitive pattern is genuinely necessary, the PR must be reviewed and approved by @NVIDIA/modelopt-setup-codeowners with explicit justification in the PR description.

The subprocess import itself is safe when used properly (with list arguments, not shell=True), but the `# nosec B404` comment should be documented or removed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/benchmark.py` at line 34, Remove the inline "# nosec B404" on the subprocess import (import subprocess) unless you add an explicit justification and approval from `@NVIDIA/modelopt-setup-codeowners` in the PR description; if the pattern is required, keep the comment but add the justification text to the PR and ensure all uses of subprocess in this module (functions/methods that call subprocess.run, subprocess.Popen, etc.) follow safe patterns (pass args as a list, avoid shell=True) and reference that approval in the PR description.
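The safe-subprocess pattern the reviewer describes (argv as a list, no shell=True) can be sketched as follows. `run_trtexec` and its parameters are hypothetical names for illustration, not the actual API of benchmark.py; `echo` stands in for the real trtexec binary.

```python
import subprocess


def run_trtexec(trtexec_path, onnx_path, extra_args=None):
    """Invoke a CLI tool with argv as a list; no shell ever parses the arguments."""
    cmd = [trtexec_path, f"--onnx={onnx_path}", *(extra_args or [])]
    # Because shell=True is not used, metacharacters in onnx_path are inert.
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


# With `echo` standing in for trtexec, the ';' is passed as literal text,
# never interpreted as a command separator.
print(run_trtexec("echo", "model.onnx; rm -rf /"))
```

The same call with `shell=True` and a concatenated string would hand the `;` to the shell, which is exactly what Bandit's B404/B603 family of checks guards against.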
🧹 Nitpick comments (6)
examples/qdq_placement/set_batch_size.py (1)
62-65: Consider handling large models (>2GB) for model verification.

The direct call to `onnx.checker.check_model(output_path)` may fail for models larger than 2GB due to protobuf size limits. The codebase has a utility in `modelopt/onnx/utils.py` (see the `check_model` function) that handles this case by using external data storage.

Suggested approach

```diff
+import tempfile
+import os
+
 def set_batch_size(model_path: str, batch_size: int, output_path: str) -> None:
     # ... existing code ...
     # Verify the saved model
     print("Verifying model...")
-    onnx.checker.check_model(output_path)
+    # Handle large models that exceed protobuf 2GB limit
+    saved_model = onnx.load(output_path)
+    if saved_model.ByteSize() > (2 * (1024**3)):
+        # For large models, check_model needs the file path with external data
+        onnx.checker.check_model(output_path)
+    else:
+        onnx.checker.check_model(saved_model)
     print("✓ Model saved and verified successfully!")
```

Alternatively, consider importing and using `modelopt.onnx.utils.check_model` for consistency with the rest of the codebase.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/qdq_placement/set_batch_size.py` around lines 62 - 65, Replace the direct call to onnx.checker.check_model(output_path) with the project's robust checker that handles >2GB models by importing and calling modelopt.onnx.utils.check_model; locate the verification block around the print("Verifying model...") and change the call to use modelopt.onnx.utils.check_model(output_path) (or import the function as check_model and call check_model(output_path)) so external-data large protobuf models are supported.

modelopt/onnx/quantization/autotune/qdq_utils.py (2)
60-60: Simplify redundant condition.

The check `node.input and len(node.input) > 0` is redundant since `node.input` being truthy already implies it's non-empty.

Suggested simplification

```diff
-    if node.input and len(node.input) > 0:
+    if node.input:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/qdq_utils.py` at line 60, The condition checking node inputs is redundant: replace the `if node.input and len(node.input) > 0:` guard with a simpler truthy check `if node.input:` in the function where this appears (look for the code handling `node.input` in qdq_utils.py) so the branch behavior remains identical but the expression is simplified.
1-1: Update copyright year for consistency.

The copyright year is 2024, but other new files in this PR use 2026. Consider updating for consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/qdq_utils.py` at line 1, Update the SPDX header year in modelopt/onnx/quantization/autotune/qdq_utils.py to match the rest of the PR (change "2024" to "2026"); locate the SPDX comment at the top of the file and modify the year in the copyright line so it is consistent across files.

tests/unit/onnx/quantization/autotune/test_config.py (1)
12-13: Remove unnecessary sys.path manipulation.

The `sys.path.insert` is unnecessary if the package is properly installed in the test environment. This pattern can cause import issues and is not recommended.

Suggested fix

```diff
-# Add parent directory to path
-sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-
 from modelopt.onnx.quantization.autotune.common import Config
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/onnx/quantization/autotune/test_config.py` around lines 12 - 13, Remove the ad-hoc sys.path manipulation in the test file: delete the sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) line in tests/unit/onnx/quantization/autotune/test_config.py and rely on the package being installed in the test environment (or use test runner configuration / PYTHONPATH) so imports resolve cleanly; do not add alternative path hacks in this file.

modelopt/onnx/quantization/autotune/subgraph_extractor.py (1)
170-181: Consider using `collections.deque` for BFS queue.

Using `list.pop(0)` is O(n) per operation. For potentially large graphs, using `collections.deque` with `popleft()` provides O(1) performance.

♻️ Suggested improvement

```diff
+from collections import deque
+
 def _find_reachable_graph_inputs(
     graph: gs.Graph,
     target_nodes: List[gs.Node],
 ) -> List[str]:
     """BFS backward from target_nodes to find graph inputs that feed into them."""
     graph_input_names = {t.name for t in graph.inputs}
     visited = set()
-    queue = []
+    queue = deque()
     result = []
     for node in target_nodes:
         for inp in node.inputs:
             if isinstance(inp, gs.Variable) and inp.name not in visited:
                 visited.add(inp.name)
                 queue.append(inp)
     while queue:
-        tensor = queue.pop(0)
+        tensor = queue.popleft()
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/onnx/quantization/autotune/subgraph_extractor.py` around lines 170 - 181, The BFS loop in subgraph_extractor.py uses a Python list and pop(0), causing O(n) dequeuing; replace the list-based queue with collections.deque: import deque, initialize queue = deque(...) where the queue is created, and change queue.pop(0) to queue.popleft(), keeping queue.append(...) for enqueueing; update any code that constructs the initial queue and ensure imports include collections.deque so the BFS in the function that contains this loop uses O(1) dequeue operations.

examples/qdq_placement/README.md (1)
152-162: Add language specifier to fenced code blocks.

The directory structure code blocks should have a language specifier for consistency. Consider using `text` or `plaintext` for directory listings.

📝 Suggested fix

````diff
-```
+```text
 resnet50_results/
 ├── optimized_final.onnx   # Optimized model

-```
+```text
 <output_dir>/
 ├── optimized_final.onnx   # Incrementally validated model (if --incremental-validation)
````
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/qdq_placement/README.md` around lines 152 - 162, Update the fenced code blocks that show directory listings so they include a language specifier (use ```text) — specifically change the blocks containing the "resnet50_results/" tree and the block containing the "<output_dir>/" tree to start with ```text instead of ``` so the directory listings render consistently; locate the blocks by the directory headers "resnet50_results/" and "<output_dir>/" in the README and replace their opening fences accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/source/reference/2_qdq_placement.rst`:
- Line 828: The docs claim "Default threshold: 1.01" but the actual default is
defined as performance_threshold: float = 1.02 on the Config class in common.py;
update the documentation string in 2_qdq_placement.rst to state "Default
threshold: 1.02 (2% improvement minimum)" so the docs match the implementation
(or alternatively change Config.performance_threshold to 1.01 if you intend the
docs to be canonical).
In `@examples/qdq_placement/set_batch_size.py`:
- Around line 1-10: Add the required NVIDIA Apache 2.0 license header to the top
of the examples/qdq_placement/set_batch_size.py script (above or immediately
after the existing shebang and before the module docstring) so the file includes
the full license boilerplate mandated for examples/**/*.py; ensure the header
explicitly names "NVIDIA CORPORATION" and the Apache 2.0 terms and leave the
rest of the script (including the module docstring and usage comments)
unchanged.
In `@modelopt/onnx/quantization/autotune/fusion_grouping.py`:
- Around line 1-6: Add the required NVIDIA Apache 2.0 license header to the top
of the module modelopt/onnx/quantization/autotune/fusion_grouping.py by
inserting the standard multi-line NVIDIA Apache 2.0 header (including copyright
line, SPDX identifier and license text reference) as used across other files in
modelopt; ensure the header appears before the existing module docstring so
classes/functions in this file (e.g., the fusion grouping logic in
fusion_grouping.py) are properly licensed.
In `@modelopt/onnx/quantization/autotune/subgraph_extractor.py`:
- Around line 1-6: Add the NVIDIA Apache 2.0 license header to the top of the
module file modelopt/onnx/quantization/autotune/subgraph_extractor.py by
inserting the standard multi-line license comment before the existing module
docstring; ensure the header includes the SPDX identifier and full NVIDIA Apache
2.0 text (or the project-standard short header) so the file begins with the
required license block followed by the current docstring and existing code in
subgraph_extractor.py.
In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py`:
- Around line 1-8: This file is missing the NVIDIA Apache 2.0 license header;
prepend the standard NVIDIA Apache-2.0 license block to the very top of
subgraph_workflow.py (above the existing module docstring) including the
copyright line and license text/notice, ensuring the SPDX identifier
(Apache-2.0) or full license header is present and that the existing file
docstring remains unchanged below the header.
- Around line 99-105: The loops that mutate the protobuf repeated field use
nonstandard pop()/add() calls on dim_proto; replace popping with explicit
deletion (e.g., use del dim_proto[-1] in the shrink loop) and call
dim_proto.add() when growing but capture the returned element if you need to
initialize it (e.g., new_dim = dim_proto.add()); keep the subsequent loop that
sets dim_proto[i].ClearField("dim_param") and dim_proto[i].dim_value = d, but
ensure dim_proto has been resized using del and dim_proto.add() rather than
pop()/add() without using the returned message.
- Around line 39-42: QUANT_DTYPES currently sets "fp8" to np.int8 when the FP8
dtype isn't present, causing silent fallback; change this by removing the silent
fallback from the module-level QUANT_DTYPES and instead detect/import ml_dtypes
(or np.float8 variant) conditionally at runtime where quantization is requested
(e.g., inside the function(s) that handle quantization/autotune and any call
sites referencing QUANT_DTYPES), and when a user requests "fp8" but the FP8
dtype isn't available emit a clear user-facing warning or raise a descriptive
error; specifically, update references to QUANT_DTYPES to validate availability
at usage time, attempt a conditional import of ml_dtypes (or check hasattr(np,
"float8_e4m3fn")) there, and log/warn if FP8 cannot be supported rather than
silently substituting int8.
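The "validate at usage time" behavior this prompt asks for could look like the sketch below. `resolve_quant_dtype` is a hypothetical helper, and the `hasattr` / `ml_dtypes` probes are assumptions about how FP8 availability might be detected; the point is that an unavailable FP8 dtype fails loudly instead of silently becoming int8.

```python
import numpy as np


def resolve_quant_dtype(name: str):
    """Resolve a quantization dtype at the point of use, never at import time."""
    if name == "int8":
        return np.int8
    if name == "fp8":
        # Some numpy builds may expose a float8 dtype directly.
        if hasattr(np, "float8_e4m3fn"):
            return np.float8_e4m3fn
        try:
            import ml_dtypes  # optional dependency providing float8 dtypes
            return ml_dtypes.float8_e4m3fn
        except ImportError:
            # Fail loudly rather than silently substituting int8.
            raise RuntimeError(
                "fp8 requested but no FP8 dtype is available; "
                "install ml_dtypes or a numpy build with float8 support"
            )
    raise ValueError(f"unknown quant dtype: {name}")
```

Call sites that previously indexed a module-level QUANT_DTYPES table would instead call this at quantization time, so the error surfaces only when a user actually requests fp8.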
In `@modelopt/onnx/quantization/autotune/tensorrt_utils.py`:
- Around line 1-41: The license header at the top of the module (the
module-level docstring and SPDX header in tensorrt_utils.py) uses an unusual
year range "1993-2025"; update the copyright/SPDX header to the project's
standard single-year format (e.g., "Copyright (c) 2026 NVIDIA CORPORATION &
AFFILIATES") and adjust the SPDX-License-Identifier line if needed so the header
matches other files; edit the top-of-file docstring/SPDX block where the current
year range appears to replace it with the standard format.
- Around line 987-1000: Fix _save_timing_cache: the condition and method call
are wrong and the finally cleans up an undefined name. Change the inverted check
so you only call combine when self._timing_cache is NOT None, correct the typo
`combline` to `combine` for the timing cache object returned by
config.create_timing_cache, and in the finally block remove the undefined
`builder` deletion (either delete only `config` or explicitly reference
`self.builder` if you intended to delete it); ensure you still serialize and
write timing_cache_data to self.timing_cache_file when a timing cache exists.
In `@modelopt/onnx/quantization/autotune/torch_region_builder.py`:
- Around line 311-318: The _build_id_to_region_map function currently uses a
mutable default dict for id_to_region_map which can be shared across calls;
change the signature to default id_to_region_map to None and inside
_build_id_to_region_map initialize id_to_region_map = {} when it's None, then
proceed to assign id_to_region_map[region.id] = region and recursively call
self._build_id_to_region_map(child, id_to_region_map) so each top-level call
gets a fresh map; keep the return type dict[int, Region] and behavior otherwise
unchanged.
- Around line 16-24: The module docstring in torch_region_builder.py duplicates
the license/copyright header (the "SPDX-FileCopyrightText" lines and
years)—remove the duplicate license block from the top of the module docstring
so only the canonical file header remains; keep the descriptive docstring text
("Torch Region Builder - Hierarchical Region Discovery...") and ensure
functions/classes in this module (e.g., any top-level descriptions used by
torch_region_builder.py) are unaffected by the removal.
- Around line 752-754: The parameter only_quantizable is being ignored because
the code unconditionally sets only_quantizable = True; fix by either removing
that assignment so the function respects the incoming only_quantizable
parameter, or remove the only_quantizable parameter from the function signature
if it's not needed; locate the assignment near the logger.info(f"Loading model:
{onnx_path}") call in torch_region_builder.py and update the function that
accepts only_quantizable to either use the passed value or eliminate the unused
parameter and adjust all callers accordingly.
- Around line 320-331: The _build_tensor_to_regions_map function uses a mutable
default argument (dict = {}), which can be shared across calls; change the
signature to accept tensor_to_regions_map: Optional[dict[str, set[int]]] = None
and inside the function initialize tensor_to_regions_map = {} if None, mirroring
the fix used in _build_id_to_region_map; keep the recursive behavior and return
the map as before so each top-level call gets a fresh dict.
- Line 333: The _merge_neighboring_regions function currently uses a mutable
default argument for to_remove (set()); change the signature to accept None
(e.g., to_remove: Optional[set[int]] = None or to_remove: set[int] | None =
None) and inside the body initialize it to an empty set when None (e.g., if
to_remove is None: to_remove = set()), updating any type imports as needed and
leaving the rest of the logic in _merge_neighboring_regions unchanged.
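The None-default pattern recommended for all three helpers above (`_build_id_to_region_map`, `_build_tensor_to_regions_map`, `_merge_neighboring_regions`) can be shown in isolation. `build_id_map` below is a hypothetical stand-in, not the module's actual function; the key point is that a literal `{}` default is evaluated once at `def` time and shared across calls, while a None default gives each top-level call a fresh map.

```python
def build_id_map(region, id_map=None):
    """Recursively map region id -> name; fresh dict per top-level call."""
    if id_map is None:
        id_map = {}
    id_map[region["id"]] = region["name"]
    for child in region.get("children", []):
        build_id_map(child, id_map)  # recursion shares the same dict
    return id_map


tree = {"id": 1, "name": "root", "children": [{"id": 2, "name": "leaf"}]}
first = build_id_map(tree)
second = build_id_map(tree)
assert first == {1: "root", 2: "leaf"}
assert first is not second  # no state leaks between calls
```

With `def build_id_map(region, id_map={})` instead, the second call would mutate the same dict the first call returned, which is exactly the bug class the review flags.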
In `@tests/unit/onnx/quantization/autotune/test_config.py`:
- Around line 71-82: The test test_performance_threshold_validation is
incomplete: it only asserts valid values for Config.performance_threshold but
doesn't check that invalid values are rejected; either add validation in Config
or update the test to assert the expected failure. If Config should validate,
implement a check in Config.__init__ or the performance_threshold setter to
raise ValueError for values < 1.0 and then update
test_performance_threshold_validation to include a with
self.assertRaises(ValueError): Config(performance_threshold=0.9) case; otherwise
remove the misleading comment and the "Invalid values" note from
test_performance_threshold_validation.
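If the "Config should validate" branch of this suggestion is taken, the test could look like the hedged sketch below. The `Config` class here is a minimal stand-in for the real one in common.py (whose actual fields and validation are assumptions for this example), showing only the `assertRaises` shape the comment asks for.

```python
import unittest


class Config:
    """Hypothetical stand-in: rejects performance thresholds below 1.0."""

    def __init__(self, performance_threshold: float = 1.02):
        if performance_threshold < 1.0:
            raise ValueError("performance_threshold must be >= 1.0")
        self.performance_threshold = performance_threshold


class TestPerformanceThreshold(unittest.TestCase):
    def test_valid_values(self):
        # Values >= 1.0 are accepted as-is.
        self.assertEqual(Config(performance_threshold=1.05).performance_threshold, 1.05)

    def test_invalid_values_rejected(self):
        # Values below 1.0 must raise, not be silently stored.
        with self.assertRaises(ValueError):
            Config(performance_threshold=0.9)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestPerformanceThreshold)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```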
- Around line 1-6: Add the required NVIDIA Apache 2.0 license header to the top
of tests/unit/onnx/quantization/autotune/test_config.py: insert the standard
NVIDIA Apache-2.0 license block (including copyright notice/SPDX identifier and
license text) immediately after the existing shebang (#!/usr/bin/env python3)
and before the module docstring so the file complies with the repository's
license/header convention.
---
Outside diff comments:
In `@modelopt/onnx/quantization/autotune/benchmark.py`:
- Line 34: Remove the inline "# nosec B404" on the subprocess import (import
subprocess) unless you add an explicit justification and approval from
`@NVIDIA/modelopt-setup-codeowners` in the PR description; if the pattern is
required, keep the comment but add the justification text to the PR and ensure
all uses of subprocess in this module (functions/methods that call
subprocess.run, subprocess.Popen, etc.) follow safe patterns (pass args as a
list, avoid shell=True) and reference that approval in the PR description.
---
Nitpick comments:
In `@examples/qdq_placement/README.md`:
- Around line 152-162: Update the fenced code blocks that show directory
listings so they include a language specifier (use ```text) — specifically
change the blocks containing the "resnet50_results/" tree and the block
containing the "<output_dir>/" tree to start with ```text instead of ``` so the
directory listings render consistently; locate the blocks by the directory
headers "resnet50_results/" and "<output_dir>/" in the README and replace their
opening fences accordingly.
In `@examples/qdq_placement/set_batch_size.py`:
- Around line 62-65: Replace the direct call to
onnx.checker.check_model(output_path) with the project's robust checker that
handles >2GB models by importing and calling modelopt.onnx.utils.check_model;
locate the verification block around the print("Verifying model...") and change
the call to use modelopt.onnx.utils.check_model(output_path) (or import the
function as check_model and call check_model(output_path)) so external-data
large protobuf models are supported.
In `@modelopt/onnx/quantization/autotune/qdq_utils.py`:
- Line 60: The condition checking node inputs is redundant: replace the `if
node.input and len(node.input) > 0:` guard with a simpler truthy check `if
node.input:` in the function where this appears (look for the code handling
`node.input` in qdq_utils.py) so the branch behavior remains identical but the
expression is simplified.
- Line 1: Update the SPDX header year in
modelopt/onnx/quantization/autotune/qdq_utils.py to match the rest of the PR
(change "2024" to "2026"); locate the SPDX comment at the top of the file and
modify the year in the copyright line so it is consistent across files.
In `@modelopt/onnx/quantization/autotune/subgraph_extractor.py`:
- Around line 170-181: The BFS loop in subgraph_extractor.py uses a Python list
and pop(0), causing O(n) dequeuing; replace the list-based queue with
collections.deque: import deque, initialize queue = deque(...) where the queue
is created, and change queue.pop(0) to queue.popleft(), keeping
queue.append(...) for enqueueing; update any code that constructs the initial
queue and ensure imports include collections.deque so the BFS in the function
that contains this loop uses O(1) dequeue operations.
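To make the deque suggestion concrete outside the module's own types, here is a self-contained backward BFS over a toy predecessor map; `reachable_graph_inputs` and the node names are hypothetical illustrations, not the module's API. `deque.popleft()` is O(1), whereas `list.pop(0)` shifts every remaining element on each dequeue.

```python
from collections import deque


def reachable_graph_inputs(predecessors, graph_inputs, start_nodes):
    """Backward BFS: which graph inputs eventually feed into start_nodes?"""
    visited = set(start_nodes)
    queue = deque(start_nodes)  # O(1) popleft; a plain list would be O(n)
    found = []
    while queue:
        item = queue.popleft()
        if item in graph_inputs:
            found.append(item)
        for pred in predecessors.get(item, []):
            if pred not in visited:
                visited.add(pred)
                queue.append(pred)
    return found


# Toy graph: input_a -> conv -> relu and input_b -> add both feed matmul.
preds = {"matmul": ["relu", "add"], "relu": ["conv"],
         "conv": ["input_a"], "add": ["input_b"]}
print(reachable_graph_inputs(preds, {"input_a", "input_b"}, ["matmul"]))
# → ['input_b', 'input_a']
```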
In `@tests/unit/onnx/quantization/autotune/test_config.py`:
- Around line 12-13: Remove the ad-hoc sys.path manipulation in the test file:
delete the sys.path.insert(0,
os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) line in
tests/unit/onnx/quantization/autotune/test_config.py and rely on the package
being installed in the test environment (or use test runner configuration /
PYTHONPATH) so imports resolve cleanly; do not add alternative path hacks in
this file.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 48cf674c-cea7-4964-b1ed-c9c27e9d6e6e
📒 Files selected for processing (15)

- docs/source/guides/9_qdq_placement.rst
- docs/source/reference/2_qdq_placement.rst
- examples/qdq_placement/README.md
- examples/qdq_placement/set_batch_size.py
- modelopt/onnx/quantization/autotune/__init__.py
- modelopt/onnx/quantization/autotune/__main__.py
- modelopt/onnx/quantization/autotune/benchmark.py
- modelopt/onnx/quantization/autotune/fusion_grouping.py
- modelopt/onnx/quantization/autotune/qdq_utils.py
- modelopt/onnx/quantization/autotune/subgraph_extractor.py
- modelopt/onnx/quantization/autotune/subgraph_workflow.py
- modelopt/onnx/quantization/autotune/tensorrt_utils.py
- modelopt/onnx/quantization/autotune/torch_region_builder.py
- modelopt/onnx/quantization/autotune/workflows.py
- tests/unit/onnx/quantization/autotune/test_config.py
```
if speedup >= config.performance_threshold:
    accept_scheme()

Default threshold: 1.01 (1% improvement minimum)
```
Documentation inconsistency with actual default value.
The documentation states "Default threshold: 1.01 (1% improvement minimum)" but the Config class in common.py defines performance_threshold: float = 1.02 (2% improvement). This should be corrected to match the actual implementation.
Suggested fix

```diff
-Default threshold: 1.01 (1% improvement minimum)
+Default threshold: 1.02 (2% improvement minimum)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```
Default threshold: 1.02 (2% improvement minimum)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/source/reference/2_qdq_placement.rst` at line 828, The docs claim
"Default threshold: 1.01" but the actual default is defined as
performance_threshold: float = 1.02 on the Config class in common.py; update the
documentation string in 2_qdq_placement.rst to state "Default threshold: 1.02
(2% improvement minimum)" so the docs match the implementation (or alternatively
change Config.performance_threshold to 1.01 if you intend the docs to be
canonical).
```python
#!/usr/bin/env python3
"""
Script to set a fixed batch size for ONNX models.

This script modifies an ONNX model with dynamic batch size to use a fixed batch size,
which is often beneficial for TensorRT performance benchmarking.

Usage:
    python set_batch_size.py resnet50_Opset17.onnx --batch-size 128 --output resnet50.bs128.onnx
"""
```
Missing required NVIDIA Apache 2.0 license header.
Per coding guidelines, all new Python files in examples/**/*.py require the NVIDIA Apache 2.0 license header.
Add license header
#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
"""
Script to set a fixed batch size for ONNX models.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/qdq_placement/set_batch_size.py` around lines 1 - 10, Add the
required NVIDIA Apache 2.0 license header to the top of the
examples/qdq_placement/set_batch_size.py script (above or immediately after the
existing shebang and before the module docstring) so the file includes the full
license boilerplate mandated for examples/**/*.py; ensure the header explicitly
names "NVIDIA CORPORATION" and the Apache 2.0 terms and leave the rest of the
script (including the module docstring and usage comments) unchanged.
```python
"""Fusion-aware grouping of ONNX nodes based on TensorRT graph.json.

Parses TensorRT's exported layer information (graph.json) to understand
how ONNX operations are fused into TRT layers. Creates FusionGroups that
map back to ONNX node names, enabling subgraph-level QDQ optimization.
"""
```
Missing required NVIDIA Apache 2.0 license header.
Per coding guidelines, all new Python files in modelopt/**/*.py require the NVIDIA Apache 2.0 license header.
Add license header
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
"""Fusion-aware grouping of ONNX nodes based on TensorRT graph.json.

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Fusion-aware grouping of ONNX nodes based on TensorRT graph.json.

Parses TensorRT's exported layer information (graph.json) to understand
how ONNX operations are fused into TRT layers. Creates FusionGroups that
map back to ONNX node names, enabling subgraph-level QDQ optimization.
"""
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/quantization/autotune/fusion_grouping.py` around lines 1 - 6,
Add the required NVIDIA Apache 2.0 license header to the top of the module
modelopt/onnx/quantization/autotune/fusion_grouping.py by inserting the standard
multi-line NVIDIA Apache 2.0 header (including copyright line, SPDX identifier
and license text reference) as used across other files in modelopt; ensure the
header appears before the existing module docstring so classes/functions in this
file (e.g., the fusion grouping logic in fusion_grouping.py) are properly
licensed.
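Where several files need the same fix, a small checker helps. The following is a hypothetical helper (not part of this PR; the function name and exact check are illustrative) that verifies a file starts with the required SPDX lines, allowing an optional shebang first:

```python
from pathlib import Path

def has_nvidia_spdx_header(path: Path) -> bool:
    """Return True if the file begins with the SPDX header (shebang allowed first)."""
    lines = path.read_text(encoding="utf-8").splitlines()
    if lines and lines[0].startswith("#!"):
        lines = lines[1:]  # a shebang may legitimately precede the header
    return (
        len(lines) >= 2
        and lines[0].startswith("# SPDX-FileCopyrightText:")
        and lines[1] == "# SPDX-License-Identifier: Apache-2.0"
    )
```

Running this over `modelopt/**/*.py` in CI would catch the missing headers flagged in this review before merge.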
| """Extract standalone ONNX subgraphs from a full model using onnx_graphsurgeon. | ||
|
|
||
| Given boundary tensor names (inputs/outputs), marks them as the subgraph's | ||
| I/O, cleans up unreferenced nodes, runs shape inference, and serializes | ||
| the result to bytes for direct TensorRT consumption. | ||
| """ |
Missing NVIDIA Apache 2.0 license header.
As per coding guidelines, all new Python files require the NVIDIA Apache 2.0 license header at the top of the file.
📄 Add license header
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
"""Extract standalone ONNX subgraphs from a full model using onnx_graphsurgeon.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| """Extract standalone ONNX subgraphs from a full model using onnx_graphsurgeon. | |
| Given boundary tensor names (inputs/outputs), marks them as the subgraph's | |
| I/O, cleans up unreferenced nodes, runs shape inference, and serializes | |
| the result to bytes for direct TensorRT consumption. | |
| """ | |
| # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # SPDX-License-Identifier: Apache-2.0 | |
| # | |
| # Licensed under the Apache License, Version 2.0 (the "License"); | |
| # you may not use this file except in compliance with the License. | |
| # You may obtain a copy of the License at | |
| # | |
| # http://www.apache.org/licenses/LICENSE-2.0 | |
| # | |
| # Unless required by applicable law or agreed to in writing, software | |
| # distributed under the License is distributed on an "AS IS" BASIS, | |
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |
| # See the License for the specific language governing permissions and | |
| # limitations under the License. | |
| """Extract standalone ONNX subgraphs from a full model using onnx_graphsurgeon. | |
| Given boundary tensor names (inputs/outputs), marks them as the subgraph's | |
| I/O, cleans up unreferenced nodes, runs shape inference, and serializes | |
| the result to bytes for direct TensorRT consumption. | |
| """ |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/quantization/autotune/subgraph_extractor.py` around lines 1 -
6, Add the NVIDIA Apache 2.0 license header to the top of the module file
modelopt/onnx/quantization/autotune/subgraph_extractor.py by inserting the
standard multi-line license comment before the existing module docstring; ensure
the header includes the SPDX identifier and full NVIDIA Apache 2.0 text (or the
project-standard short header) so the file begins with the required license
block followed by the current docstring and existing code in
subgraph_extractor.py.
| """Subgraph-based QDQ autotune workflow. | ||
|
|
||
| Uses fusion-aware subgraph extraction and heuristic QDQ schemes to optimize | ||
| Q/DQ placement. Reduces autotune time from ~25 hours to ~30 minutes by: | ||
| 1. Grouping ONNX nodes by TRT fusion boundaries (graph.json) | ||
| 2. Profiling isolated subgraphs instead of full model | ||
| 3. Using domain-informed heuristic schemes instead of random mutation | ||
| """ |
Missing NVIDIA Apache 2.0 license header.
As per coding guidelines, all new Python files require the NVIDIA Apache 2.0 license header at the top of the file.
📄 Add license header
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
"""Subgraph-based QDQ autotune workflow.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| """Subgraph-based QDQ autotune workflow. | |
| Uses fusion-aware subgraph extraction and heuristic QDQ schemes to optimize | |
| Q/DQ placement. Reduces autotune time from ~25 hours to ~30 minutes by: | |
| 1. Grouping ONNX nodes by TRT fusion boundaries (graph.json) | |
| 2. Profiling isolated subgraphs instead of full model | |
| 3. Using domain-informed heuristic schemes instead of random mutation | |
| """ | |
| # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. | |
| # SPDX-License-Identifier: Apache-2.0 | |
| # | |
| # Licensed under the Apache License, Version 2.0 (the "License"); | |
| # you may not use this file except in compliance with the License. | |
| # You may obtain a copy of the License at | |
| # | |
| # http://www.apache.org/licenses/LICENSE-2.0 | |
| # | |
| # Unless required by applicable law or agreed to in writing, software | |
| # distributed under the License is distributed on an "AS IS" BASIS, | |
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |
| # See the License for the specific language governing permissions and | |
| # limitations under the License. | |
| """Subgraph-based QDQ autotune workflow. | |
| Uses fusion-aware subgraph extraction and heuristic QDQ schemes to optimize | |
| Q/DQ placement. Reduces autotune time from ~25 hours to ~30 minutes by: | |
| 1. Grouping ONNX nodes by TRT fusion boundaries (graph.json) | |
| 2. Profiling isolated subgraphs instead of full model | |
| 3. Using domain-informed heuristic schemes instead of random mutation | |
| """ |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py` around lines 1 - 8,
This file is missing the NVIDIA Apache 2.0 license header; prepend the standard
NVIDIA Apache-2.0 license block to the very top of subgraph_workflow.py (above
the existing module docstring) including the copyright line and license
text/notice, ensuring the SPDX identifier (Apache-2.0) or full license header is
present and that the existing file docstring remains unchanged below the header.
def _build_tensor_to_regions_map(
    self, region: Region, tensor_to_regions_map: dict[str, set[int]] = {}
) -> dict[str, set[int]]:
    """Build a map from tensor names to regions."""
    for input in region.inputs:
        if input not in tensor_to_regions_map:
            tensor_to_regions_map[input] = set()
        tensor_to_regions_map[input].add(region.id)

    for child in region.get_children():
        self._build_tensor_to_regions_map(child, tensor_to_regions_map)
    return tensor_to_regions_map
Mutable default argument anti-pattern.
Same issue as _build_id_to_region_map - using mutable default dict = {}.
Proposed fix
def _build_tensor_to_regions_map(
- self, region: Region, tensor_to_regions_map: dict[str, set[int]] = {}
+ self, region: Region, tensor_to_regions_map: dict[str, set[int]] | None = None
) -> dict[str, set[int]]:
"""Build a map from tensor names to regions."""
+ if tensor_to_regions_map is None:
+ tensor_to_regions_map = {}
 for input in region.inputs:

Affected file: `modelopt/onnx/quantization/autotune/torch_region_builder.py`, lines 320-331.
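A minimal demo (illustrative names, not code from this PR) of why the mutable default flagged above is dangerous: the default dict is created once, at function definition time, and is shared by every call that omits the argument.

```python
def build_map(key, out={}):  # buggy: one shared default dict
    out.setdefault(key, set()).add(0)
    return out

def build_map_fixed(key, out=None):  # fixed: fresh dict per call
    if out is None:
        out = {}
    out.setdefault(key, set()).add(0)
    return out

a = build_map("x")
b = build_map("y")
print(a is b, sorted(b))   # True ['x', 'y'] -- state leaked across calls

c = build_map_fixed("x")
d = build_map_fixed("y")
print(c is d, sorted(d))   # False ['y'] -- independent results
```

In the recursive `_build_tensor_to_regions_map`, this leak would make a second top-level call start from the previous call's accumulated map.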
    self._build_tensor_to_regions_map(child, tensor_to_regions_map)
    return tensor_to_regions_map


def _merge_neighboring_regions(self, region: Region, to_remove: set[int] = set()) -> None:
Mutable default argument anti-pattern.
Same issue - using mutable default set = set().
Proposed fix
-def _merge_neighboring_regions(self, region: Region, to_remove: set[int] = set()) -> None:
+def _merge_neighboring_regions(self, region: Region, to_remove: set[int] | None = None) -> None:
+ if to_remove is None:
+ to_remove = set()
 self._compute_all_boundaries(region)

Affected file: `modelopt/onnx/quantization/autotune/torch_region_builder.py`, line 333.
| """ | ||
| only_quantizable = True | ||
| logger.info(f"Loading model: {onnx_path}") |
Function parameter immediately overwritten.
The only_quantizable parameter is accepted but immediately overwritten to True on line 753, making the parameter useless. Either remove the parameter or respect its value.
Option 1: Remove the override to respect the parameter
def inspect_torch_regions(
onnx_path: str,
include_all_regions: bool = False,
only_quantizable: bool = False,
) -> list[Region]:
- only_quantizable = True
logger.info(f"Loading model: {onnx_path}")Option 2: Remove the unused parameter
def inspect_torch_regions(
onnx_path: str,
include_all_regions: bool = False,
- only_quantizable: bool = False,
) -> list[Region]:
+ only_quantizable = True
logger.info(f"Loading model: {onnx_path}")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| """ | |
| only_quantizable = True | |
| logger.info(f"Loading model: {onnx_path}") | |
| def inspect_torch_regions( | |
| onnx_path: str, | |
| include_all_regions: bool = False, | |
| only_quantizable: bool = False, | |
| ) -> list[Region]: | |
| logger.info(f"Loading model: {onnx_path}") |
| """ | |
| only_quantizable = True | |
| logger.info(f"Loading model: {onnx_path}") | |
| def inspect_torch_regions( | |
| onnx_path: str, | |
| include_all_regions: bool = False, | |
| ) -> list[Region]: | |
| only_quantizable = True | |
| logger.info(f"Loading model: {onnx_path}") |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelopt/onnx/quantization/autotune/torch_region_builder.py` around lines 752
- 754, The parameter only_quantizable is being ignored because the code
unconditionally sets only_quantizable = True; fix by either removing that
assignment so the function respects the incoming only_quantizable parameter, or
remove the only_quantizable parameter from the function signature if it's not
needed; locate the assignment near the logger.info(f"Loading model:
{onnx_path}") call in torch_region_builder.py and update the function that
accepts only_quantizable to either use the passed value or eliminate the unused
parameter and adjust all callers accordingly.
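A tiny illustration (hypothetical function, not the PR's code) of the dead-parameter bug: the caller's value is accepted, then silently discarded, so the keyword argument has no effect.

```python
def inspect_regions(onnx_path, only_quantizable=False):
    only_quantizable = True  # overrides whatever the caller passed
    return only_quantizable

print(inspect_regions("model.onnx", only_quantizable=False))  # True, surprisingly
```

Either fix from the review removes the surprise; linters (e.g. Ruff's ARG rules) can also flag arguments that are only ever reassigned.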
#!/usr/bin/env python3
"""
Tests for the Config class in the autotuner.

Tests configuration parameter validation and defaults.
"""
Missing required NVIDIA Apache 2.0 license header.
Per coding guidelines, all new Python files require the NVIDIA Apache 2.0 license header.
Add license header
 #!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """
 Tests for the Config class in the autotuner.

Note: the header goes after the shebang so that `#!/usr/bin/env python3` stays on line 1.

Affected file: `tests/unit/onnx/quantization/autotune/test_config.py`, lines 1-6.
def test_performance_threshold_validation(self):
    """Test that performance_threshold must be >= 1.0."""
    # Valid values
    config1 = Config(performance_threshold=1.0)
    self.assertEqual(config1.performance_threshold, 1.0)

    config2 = Config(performance_threshold=1.5)
    self.assertEqual(config2.performance_threshold, 1.5)

    # Invalid values should not be accepted
    # Note: This test assumes validation exists, if not we should add it
    print("✓ Config performance_threshold validation")
Incomplete validation test - invalid values not actually tested.
The test test_performance_threshold_validation only tests valid values but the comment on line 80-81 indicates invalid values should be rejected. Either implement the validation in Config and test it, or remove the misleading comment.
Option 1: If validation exists, add the test
def test_performance_threshold_validation(self):
"""Test that performance_threshold must be >= 1.0."""
# Valid values
config1 = Config(performance_threshold=1.0)
self.assertEqual(config1.performance_threshold, 1.0)
config2 = Config(performance_threshold=1.5)
self.assertEqual(config2.performance_threshold, 1.5)
# Invalid values should raise ValueError
with self.assertRaises(ValueError):
Config(performance_threshold=0.9)
print("✓ Config performance_threshold validation")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/unit/onnx/quantization/autotune/test_config.py` around lines 71 - 82,
The test test_performance_threshold_validation is incomplete: it only asserts
valid values for Config.performance_threshold but doesn't check that invalid
values are rejected; either add validation in Config or update the test to
assert the expected failure. If Config should validate, implement a check in
Config.__init__ or the performance_threshold setter to raise ValueError for
values < 1.0 and then update test_performance_threshold_validation to include a
with self.assertRaises(ValueError): Config(performance_threshold=0.9) case;
otherwise remove the misleading comment and the "Invalid values" note from
test_performance_threshold_validation.
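A sketch of what Option 1 could look like on the Config side. The real Config class surely has more fields and may not be a dataclass; this only shows where a `>= 1.0` check could live so that the `assertRaises(ValueError)` case becomes meaningful.

```python
from dataclasses import dataclass

@dataclass
class Config:
    performance_threshold: float = 1.0

    def __post_init__(self):
        # Reject thresholds below 1.0 at construction time
        if self.performance_threshold < 1.0:
            raise ValueError(
                f"performance_threshold must be >= 1.0, "
                f"got {self.performance_threshold}"
            )

Config(performance_threshold=1.5)      # accepted
try:
    Config(performance_threshold=0.9)  # rejected
except ValueError as e:
    print(e)
```

Validating in `__post_init__` keeps the check next to the field definition, so every construction path (CLI, tests, API) hits it.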
@Hale423 please find my comments below:

Thanks
Pull Request: ONNX Q/DQ Autotuning with Subgraph Mode
Branch: `dev-wahao-autotune-subgraph-profile` → `main`
Type: Feature
Summary
This PR adds automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models using TensorRT performance measurements. It introduces two workflow modes; the subgraph mode uses fusion-aware grouping from TensorRT `graph.json` and profiles isolated subgraphs for much faster tuning on large or dynamic-shape models (~30 min vs ~25 h in practice).

Subgraph mode is the main addition over a baseline "auto QDQ placement" implementation: it uses TRT fusion info, optional per-layer timing, incremental full-model validation, and cache/resume.
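The fusion-aware grouping described above can be sketched roughly as follows. This assumes a simplified `graph.json` layout (a `Layers` list whose entries record the fused ONNX node names under an `Origin` key) — the actual TensorRT layer-info schema and the PR's parser may differ.

```python
import json

def fusion_groups(graph_json_text: str) -> list[dict]:
    """Group ONNX node names by the TRT layer they were fused into."""
    layers = json.loads(graph_json_text).get("Layers", [])
    groups = []
    for layer in layers:
        onnx_nodes = layer.get("Origin", [])
        if onnx_nodes:  # keep only layers traceable back to ONNX nodes
            groups.append({"trt_layer": layer["Name"], "onnx_nodes": onnx_nodes})
    return groups

sample = json.dumps({"Layers": [
    {"Name": "conv1 + relu1", "Origin": ["conv1", "relu1"]},  # fused pair
    {"Name": "pool1", "Origin": ["pool1"]},
]})
print(fusion_groups(sample))
```

In the real workflow these groups define the fusion boundaries at which subgraphs are extracted and profiled, so Q/DQ placement never splits a TRT fusion.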
What’s New (vs main)
- `modelopt.onnx.quantization.autotune` package: region discovery, scheme generation, TensorRT benchmarking (Python API + optional trtexec), pattern cache, QDQ baseline import.
- `--mode subgraph`: fusion-aware grouping from TensorRT `graph.json`; per-subgraph QDQ scheme profiling; optional per-layer timing when trtexec supports it (with fallback to total latency).
- `fusion_grouping.py`: parse TRT `graph.json`, build fusion groups, infer shapes for extracted subgraphs. If `--graph-json` is omitted, runs trtexec once to generate `graph.json` (FP16 build with `--exportLayerInfo`).
- Incremental validation: produces `optimized_raw.onnx` (all qualifying QDQ) and `optimized_final.onnx` (validated). Default: on (`--incremental-validation`); use `--no-incremental-validation` to disable.
- Cache/resume: `autotune_cache.json` for Phase 2 (subgraph profiling) and Phase 3 (incremental validation). Re-running the same command resumes from the last checkpoint.
- `--use-trtexec` plus `--trtexec-args` for benchmarking with dynamic shapes (e.g. `--optShapes`) and custom options (e.g. `--useCudaGraph`, `--stronglyTyped`). trtexec profiling flags are optional; on "Unknown option" the code strips them and retries (fallback to total latency).
- `examples/qdq_placement/`: README (Quick Start, region vs subgraph, output layout, subgraph best practices) and `set_batch_size.py` for fixed-batch ResNet50.

Key Files
- `modelopt/onnx/quantization/autotune/__main__.py`: `--mode`, `--graph-json`, `--incremental-validation`, `--use-trtexec`, `--trtexec-args`, etc.
- `modelopt/onnx/quantization/autotune/subgraph_workflow.py`
- `modelopt/onnx/quantization/autotune/fusion_grouping.py`: parse `graph.json`, create fusion groups, `generate_graph_json()` (trtexec FP16 build when no graph is provided).
- `modelopt/onnx/quantization/autotune/subgraph_extractor.py`
- `modelopt/onnx/quantization/autotune/tensorrt_utils.py`: `export_profile_path`, profiling-flag dedup and "Unknown option" retry without profiling.
- `modelopt/onnx/quantization/autotune/workflows.py`: `benchmark_onnx_model()`; passes through `export_profile_path` when using trtexec.
- `modelopt/onnx/quantization/autotune/autotuner.py`
- `modelopt/onnx/quantization/autotune/region_*.py`
- `examples/qdq_placement/README.md`
- `examples/qdq_placement/set_batch_size.py`

How to Test
Region mode (no trtexec):
Subgraph mode with trtexec (FP8, optional graph.json):
Resume: kill the subgraph run mid-way, then re-run the same command; it should resume from `autotune_cache.json`.

Checklist
- Subgraph mode runs with `--use-trtexec` (with or without `--graph-json`).
- Without `--graph-json`, one trtexec FP16 build runs and produces `*.fp16.graph.json` in the output dir.
- `examples/qdq_placement/README.md` matches behavior (region vs subgraph, outputs, best practices).

Documentation
- `examples/qdq_placement/README.md`: Quick Start, subgraph best practices, output layout, optional graph generation.
- `docs/source/guides/9_qdq_placement.rst` and `docs/source/reference/2_qdq_placement.rst`: confirm they align with the CLI and behavior above when submitting.

Notes
- trtexec builds that reject `--exportProfile`/`--profilingVerbosity` are handled by retrying without those flags and using total latency for scheme selection.

Summary by CodeRabbit
Release Notes
New Features
- `--workflow`, `--graph_json`, and `--incremental_validation` options.

Documentation
Tests