
[Feature Request] Support ONNX Q/DQ Autotuning with Subgraph Mode#1015

Open
Hale423 wants to merge 10 commits into NVIDIA:main from Hale423:dev-wahao-autotune-subgraph-profile

Conversation


@Hale423 Hale423 commented Mar 10, 2026

Pull Request: ONNX Q/DQ Autotuning with Subgraph Mode

Branch: dev-wahao-autotune-subgraph-profile → main
Type: Feature


Summary

This PR adds automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models using TensorRT performance measurements. It introduces two workflow modes:

  1. Region mode (default) – Pattern-based region analysis with random mutation; optimizes by region pattern and reuses schemes across similar regions.
  2. Subgraph mode – Fusion-aware grouping derived from TensorRT’s layer/fusion boundaries (graph.json); profiles isolated subgraphs for much faster tuning on large or dynamic-shape models (~30 min vs ~25 h in practice).

Subgraph mode is the main addition over a baseline “auto QDQ placement” implementation: it uses TRT fusion info, optional per-layer timing, incremental full-model validation, and cache/resume.


What’s New (vs main)

  • Autotune package – New modelopt.onnx.quantization.autotune package: region discovery, scheme generation, TensorRT benchmarking (Python API + optional trtexec), pattern cache, QDQ baseline import.
  • Subgraph workflow – --mode subgraph: fusion-aware grouping from TensorRT graph.json; per-subgraph QDQ scheme profiling; optional per-layer timing when trtexec supports it (with fallback to total latency).
  • Fusion grouping – fusion_grouping.py: parse TRT graph.json, build fusion groups, infer shapes for extracted subgraphs. If --graph-json is omitted, runs trtexec once to generate graph.json (FP16 build with --exportLayerInfo).
  • Incremental validation – Optional per-group full-model check: apply QDQ groups one by one and keep only those that improve latency; saves optimized_raw.onnx (all qualifying QDQ) and optimized_final.onnx (validated). Default: on (--incremental-validation); use --no-incremental-validation to disable.
  • Cache / resume – Subgraph mode writes progress to autotune_cache.json for Phase 2 (subgraph profiling) and Phase 3 (incremental validation). Re-running the same command resumes from the last checkpoint.
  • trtexec path – --use-trtexec plus --trtexec-args for benchmarking with dynamic shapes (e.g. --optShapes) and custom options (e.g. --useCudaGraph, --stronglyTyped). trtexec profiling flags are optional; on “Unknown option” the code strips them and retries (fallback to total latency).
  • Example – New examples/qdq_placement/: README (Quick Start, region vs subgraph, output layout, subgraph best practices) and set_batch_size.py for fixed-batch ResNet50.

Key Files

  • modelopt/onnx/quantization/autotune/__main__.py – CLI: --mode, --graph-json, --incremental-validation, --use-trtexec, --trtexec-args, etc.
  • modelopt/onnx/quantization/autotune/subgraph_workflow.py – Subgraph pipeline: Phase 1 (fusion grouping), Phase 2 (subgraph profiling), Phase 3 (full-model + incremental validation), cache I/O.
  • modelopt/onnx/quantization/autotune/fusion_grouping.py – Parse graph.json, create fusion groups, generate_graph_json() (trtexec FP16 build when no graph is provided).
  • modelopt/onnx/quantization/autotune/subgraph_extractor.py – Extract subgraph ONNX from the full model given group inputs/outputs and shapes.
  • modelopt/onnx/quantization/autotune/tensorrt_utils.py – Trtexec benchmark runner: optional export_profile_path, profiling-flag dedup and “Unknown option” retry without profiling.
  • modelopt/onnx/quantization/autotune/workflows.py – Dispatcher and benchmark_onnx_model(); passes through export_profile_path when using trtexec.
  • modelopt/onnx/quantization/autotune/autotuner.py – Region-mode autotuner (pattern discovery, scheme generation, state/pattern cache).
  • modelopt/onnx/quantization/autotune/region_*.py – Region search and pattern handling for region mode.
  • examples/qdq_placement/README.md – User-facing example: prerequisites, Quick Start (region + subgraph), output layout, subgraph best practices, optional graph.json behavior.
  • examples/qdq_placement/set_batch_size.py – ResNet50 fixed-batch script for reproducible benchmarking.

How to Test

Region mode (no trtexec):

cd examples/qdq_placement
curl -L -o resnet50_Opset17.onnx https://github.com/onnx/models/raw/main/Computer_Vision/resnet50_Opset17_torch_hub/resnet50_Opset17.onnx
python3 set_batch_size.py resnet50_Opset17.onnx --batch-size 128 --output resnet50.bs128.onnx
python3 -m modelopt.onnx.quantization.autotune --model resnet50.bs128.onnx --output ./resnet50_results --quant-type int8 --schemes-per-region 20
# Expect: ./resnet50_results/optimized_final.onnx and logs under ./resnet50_results/logs/

Subgraph mode with trtexec (FP8, optional graph.json):

python3 -m modelopt.onnx.quantization.autotune \
  --model resnet50.bs128.onnx \
  --output ./resnet50_subgraph \
  --mode subgraph \
  --quant-type fp8 \
  --use-trtexec \
  --warmup-runs 5 \
  --timing-runs 20 \
  --incremental-validation \
  --trtexec-args "--stronglyTyped" \
  --schemes-per-region 30
# If --graph-json is omitted, first run will trigger trtexec to generate graph.json under output dir.
# Expect: optimized_raw.onnx, optimized_final.onnx, autotune_cache.json, logs/, subgraphs/

Resume: Kill the subgraph run mid-way, then re-run the same command; it should resume from autotune_cache.json.


Checklist

  • CI / unit tests pass (if applicable).
  • Region mode runs end-to-end (ResNet50 or equivalent).
  • Subgraph mode runs end-to-end with --use-trtexec (with or without --graph-json).
  • Without --graph-json, one trtexec FP16 build runs and produces *.fp16.graph.json in the output dir.
  • Interrupted subgraph run resumes after re-run.
  • examples/qdq_placement/README.md matches behavior (region vs subgraph, outputs, best practices).

Documentation

  • Example: examples/qdq_placement/README.md – Quick Start, subgraph best practices, output layout, optional graph generation.
  • Guides / API: This branch may add or update docs/source/guides/9_qdq_placement.rst and docs/source/reference/2_qdq_placement.rst; confirm they align with the CLI and behavior above when submitting.

Notes

  • main does not currently contain the autotune package; this PR adds it in full (region + subgraph).
  • Subgraph mode is recommended for large or dynamic-shape models; region mode remains the default for compatibility and smaller models.
  • trtexec versions that do not support --exportProfile / --profilingVerbosity are handled by retrying without those flags and using total latency for scheme selection.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added subgraph-based autotuning workflow for QDQ placement optimization alongside the existing region-based approach.
    • Introduced TensorRT and Python API benchmarking utilities for performance measurement.
    • Added PyTorch-specific hierarchical region discovery for optimized model analysis.
    • Enhanced incremental validation and crash recovery capabilities.
    • Expanded CLI with --workflow, --graph_json, and --incremental_validation options.
  • Documentation

    • Added comprehensive guides on automated Q/DQ placement optimization and architecture.
    • Added examples and usage documentation for subgraph autotuning workflows.
  • Tests

    • Added configuration validation unit tests.

willg-nv and others added 9 commits December 8, 2025 04:51
Signed-off-by: Will Guo <willg@nvidia.com>
- Add export_profile_path support; append --exportProfile/--profilingVerbosity when requested
- Skip adding --separateProfileRun if already present in user trtexec args
- On trtexec 'Unknown option' error, strip profiling flags and retry once without them
- Set _profile_unsupported so later runs use total-latency comparison only
- Extract _exec_and_log for shared run-and-log logic

Made-with: Cursor
@Hale423 Hale423 requested review from a team as code owners March 10, 2026 02:14
@Hale423 Hale423 requested a review from cjluo-nv March 10, 2026 02:14
@copy-pr-bot

copy-pr-bot bot commented Mar 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cjluo-nv cjluo-nv requested review from ajrasane and gcunhase March 10, 2026 06:38
Collaborator

@cjluo-nv cjluo-nv left a comment


This PR introduces 16k+ lines of changes. Please consider sharing a design and getting a design review.

@Hale423
Author

Hale423 commented Mar 10, 2026

Thanks for the feedback.

I am sharing the design below; please take a look.


Design: ONNX Q/DQ Autotuning for TensorRT

Design review document for PR #1015
Branch: dev-wahao-autotune-subgraph-profile
Target: main

1. Background

TensorRT performance for quantized ONNX models depends not only on whether Q/DQ nodes exist, but also on where they are inserted. In practice:

  • Quantizing too aggressively can break fusion opportunities or introduce reformat overhead.
  • Quantizing too conservatively leaves performance on the table.
  • The best placement strategy is model-specific and often difficult to determine manually.

This branch introduces an ONNX Q/DQ autotuning system that searches for better Q/DQ placement using actual TensorRT latency measurements.

The design intentionally supports two workflows:

  1. Region mode for structure-aware, pattern-based optimization across the whole model.
  2. Subgraph mode for faster, fusion-aware tuning on large or dynamic-shape models.

2. Goals

  • Provide an end-to-end ONNX Q/DQ autotuning workflow that optimizes for TensorRT latency.
  • Reuse optimization results across structurally similar parts of the same model.
  • Support warm-start from:
    • Previously learned pattern caches
    • Existing QDQ baseline ONNX models
  • Support large and dynamic-shape models via a trtexec-based path.
  • Provide resumable workflows for long-running tuning jobs.
  • Expose the feature through a simple CLI and user-facing example.

3. Non-goals

  • This system does not perform accuracy validation or calibration quality evaluation.
  • This system does not try to search all possible Q/DQ placements exhaustively.
  • This system does not replace standard PTQ/QAT flows; it only optimizes placement.
  • This system does not target backends other than TensorRT.

4. Scope Relative to main

main does not contain the ONNX Q/DQ autotuning package. This branch adds:

  • A new modelopt.onnx.quantization.autotune package
  • CLI support for region and subgraph autotuning workflows
  • TensorRT benchmarking utilities (Python API and trtexec)
  • Region discovery, pattern matching, insertion-point modeling, scheme generation, state persistence, and warm-start support
  • A subgraph workflow based on TensorRT fusion boundaries
  • User documentation and examples

5. User Experience

5.1 Region Mode

Region mode is the default workflow:

python3 -m modelopt.onnx.quantization.autotune \
    --model model.onnx \
    --output ./results \
    --quant-type int8 \
    --schemes-per-region 30

User-visible properties:

  • Automatically discovers optimization regions
  • Groups structurally equivalent regions into patterns
  • Profiles multiple insertion schemes per pattern
  • Saves:
    • baseline.onnx
    • optimized_final.onnx
    • autotuner_state.yaml
    • autotuner_state_pattern_cache.yaml

5.2 Subgraph Mode

Subgraph mode is intended for large and/or dynamic-shape models:

python3 -m modelopt.onnx.quantization.autotune \
    --model model.onnx \
    --output ./results_subgraph \
    --mode subgraph \
    --quant-type fp8 \
    --use-trtexec \
    --incremental-validation \
    --trtexec-args "--stronglyTyped --optShapes=input:1x3x224x224"

User-visible properties:

  • Uses TensorRT layer/fusion information from graph.json
  • If --graph-json is not provided, runs trtexec once to generate it
  • Profiles isolated subgraphs
  • Saves:
    • optimized_raw.onnx
    • optimized_final.onnx
    • autotune_cache.json
    • logs/
    • subgraphs/

6. High-level Architecture

The implementation is organized around four layers.

6.1 Core Modeling Layer

Files:

  • common.py
  • insertion_points.py
  • region_pattern.py
  • qdq_utils.py

Core abstractions:

  • Region: hierarchical optimization unit in the graph
  • RegionPattern: deterministic structural signature used to group equivalent regions
  • InsertionScheme: a candidate set of Q/DQ insertion points
  • PatternSchemes: all candidate schemes for one pattern plus measurements
  • PatternCache: reusable best schemes across runs
  • Insertion point types:
    • NodeInputInsertionPoint
    • ChildRegionInputInsertionPoint
    • RegionOutputInsertionPoint
    • ResolvedInsertionPoint

Key design choice:

  • Schemes are represented using pattern-relative insertion points, not absolute node IDs.
    This allows one measured scheme to be reused for every region instance with the same structure.
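A minimal sketch of this design choice (the class and field names below are illustrative simplifications, not the actual modelopt definitions): an insertion point addresses a node by its position inside the pattern rather than by a global node ID, so one measured scheme resolves against every region instance with the same structure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeInputInsertionPoint:
    node_index: int   # node position in the region's topological order
    input_index: int  # which input of that node receives a Q/DQ pair

def resolve(points, region_nodes):
    """Map pattern-relative points onto one concrete region instance."""
    return [(region_nodes[p.node_index], p.input_index) for p in points]

# The same scheme resolves against two structurally identical regions:
scheme = [NodeInputInsertionPoint(0, 0)]
block_a = ["/layer_0/MatMul", "/layer_0/Add"]
block_b = ["/layer_1/MatMul", "/layer_1/Add"]
print(resolve(scheme, block_a))  # [('/layer_0/MatMul', 0)]
print(resolve(scheme, block_b))  # [('/layer_1/MatMul', 0)]
```

Because `scheme` never mentions a concrete node name, the measurement made on one block transfers to every other block with the same pattern.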

6.2 Region-mode Optimization Layer

Files:

  • autotuner.py
  • torch_region_builder.py
  • region_search.py
  • workflows.py

Responsibilities:

  • Discover optimization regions from the ONNX graph
  • Build structural patterns
  • Generate candidate insertion schemes
  • Export test models with current/best schemes applied
  • Record measurements and update best-known schemes
  • Persist and restore session state

Current implementation detail:

  • The default region discovery path in QDQAutotuner._search_regions() uses TorchRegionBuilder.
  • TorchRegionBuilder builds hierarchical regions from PyTorch-style ONNX node names.
  • region_search.py provides a more general region-search foundation and algorithms, but this branch’s default autotuner path is centered on the torch-name-based hierarchy.

6.3 Subgraph-mode Optimization Layer

Files:

  • subgraph_workflow.py
  • fusion_grouping.py
  • subgraph_extractor.py

Responsibilities:

  • Obtain TensorRT fusion boundaries from graph.json
  • Convert fused TRT layers into ONNX-level optimization groups
  • Extract standalone ONNX subgraphs per group
  • Test heuristic QDQ schemes per group
  • Merge accepted schemes back into the full model
  • Optionally validate improvements incrementally on the full model
  • Cache progress for resume

6.4 Benchmarking Layer

Files:

  • tensorrt_utils.py
  • workflows.py

Responsibilities:

  • Benchmark ONNX models using either:
    • TensorRT Python API
    • trtexec
  • Reuse timing caches and persistent benchmark instances
  • Support plugin libraries
  • Support dynamic-shape arguments
  • Export per-layer profiles when trtexec supports them
  • Fall back gracefully when profiling flags are unsupported

7. Region Mode Design

7.1 Why Region Mode Exists

Region mode is intended to solve the general case:

  • The model may not have TensorRT fusion metadata available yet
  • Similar subgraphs may appear repeatedly across the model
  • A single structural pattern may benefit from the same placement policy everywhere

This workflow optimizes patterns, not isolated instances.

7.2 Region Discovery

The branch uses TorchRegionBuilder as the default discovery mechanism because many production ONNX models are exported from PyTorch and preserve hierarchical naming such as:

/encoder/layer_0/self_attn/q_proj/MatMul

This naming convention enables:

  • Stable hierarchical grouping
  • More semantically meaningful optimization units
  • Better scheme reuse across repeated blocks (e.g. transformer layers)

TorchRegionBuilder:

  • Builds a path trie from node names
  • Creates composite and leaf regions
  • Merges small neighboring regions
  • Probes epilogues and computes region boundaries
  • Can filter out non-quantizable regions
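The path-trie step can be sketched as follows (a hypothetical simplification of what TorchRegionBuilder does, not its actual code): PyTorch-exported node names are split on "/" and inserted into a nested dict, so each subtree becomes a candidate hierarchical region.

```python
def build_path_trie(node_names):
    """Build a nested-dict trie from PyTorch-style ONNX node names."""
    trie = {}
    for name in node_names:
        node = trie
        for part in (p for p in name.split("/") if p):
            node = node.setdefault(part, {})
    return trie

names = [
    "/encoder/layer_0/self_attn/q_proj/MatMul",
    "/encoder/layer_0/self_attn/k_proj/MatMul",
    "/encoder/layer_1/self_attn/q_proj/MatMul",
]
trie = build_path_trie(names)
# Each transformer layer becomes its own subtree (candidate region):
print(sorted(trie["encoder"].keys()))  # ['layer_0', 'layer_1']
```

Repeated blocks such as `layer_0` and `layer_1` naturally land in sibling subtrees, which is what enables scheme reuse across them.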

7.3 Pattern-based Scheme Reuse

Once regions are discovered:

  1. A RegionPattern is built from region topology and op types.
  2. Regions with identical signatures share a single PatternSchemes object.
  3. Candidate schemes are generated and measured for that pattern.
  4. The best scheme is reused for all matching regions.

This reduces duplicated work and makes the search tractable.
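The grouping step above can be illustrated with a toy signature (the real RegionPattern also encodes topology, so treat `signature` here as a deliberately crude stand-in): regions with an identical signature share one schemes object and are profiled once.

```python
from collections import defaultdict

def signature(region_ops):
    # Illustrative only: a real signature also captures edge structure.
    return tuple(region_ops)

def group_by_pattern(regions):
    groups = defaultdict(list)
    for name, ops in regions:
        groups[signature(ops)].append(name)
    return groups

regions = [
    ("layer_0.attn", ["MatMul", "Softmax", "MatMul"]),
    ("layer_1.attn", ["MatMul", "Softmax", "MatMul"]),
    ("layer_0.mlp", ["MatMul", "Gelu", "MatMul"]),
]
groups = group_by_pattern(regions)
# Two attention blocks collapse into one pattern, so their schemes are
# measured once and reused for both instances.
print(len(groups))  # 2
```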

7.4 Warm Start

The region workflow supports two warm-start mechanisms:

  • Pattern cache: load prior best schemes by pattern signature
  • QDQ baseline import: inspect an already-quantized model and infer pattern-level insertion points from it

This allows transfer learning across:

  • Different runs of the same model
  • Similar models
  • User-provided manually tuned quantized baselines

7.5 State and Resume

Region mode saves:

  • Baseline latency
  • Profiled schemes and latencies
  • Best scheme selection
  • Region discovery results

The state file is autotuner_state.yaml, allowing restart from the last profiled region.

8. Subgraph Mode Design

8.1 Why Subgraph Mode Exists

Region mode can become expensive for large models because each candidate scheme requires building and profiling increasingly quantized versions of the full model.

Subgraph mode addresses this by:

  • Aligning optimization units with TensorRT fusion groups
  • Profiling isolated subgraphs instead of the whole model
  • Using a small, heuristic scheme set instead of a broad mutation search

This makes the workflow significantly faster on large and dynamic-shape models.

8.2 Phase 1: Fusion-aware Grouping

Subgraph mode begins by obtaining TensorRT layer information:

  • If --graph-json is provided, use it directly.
  • Otherwise, call generate_graph_json() in fusion_grouping.py.

generate_graph_json() runs trtexec once in FP16 mode and writes:

  • <model>.fp16.graph.json
  • <model>.fp16.build.log

The exported layer info is then parsed to create fusion groups that map TensorRT fused layers back to ONNX nodes.
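The parse step can be sketched roughly as below. This is a hedged illustration, not the branch's parser: trtexec layer-info JSON contains a "Layers" list, and fused layer names commonly join the original ONNX node names with " + ", but the exact schema varies across TensorRT versions.

```python
import json

def fusion_groups_from_graph_json(text):
    """Split each (possibly fused) TRT layer name into ONNX node names."""
    doc = json.loads(text)
    groups = []
    for layer in doc.get("Layers", []):
        name = layer["Name"] if isinstance(layer, dict) else str(layer)
        groups.append([part.strip() for part in name.split(" + ")])
    return groups

sample = json.dumps({
    "Layers": [
        {"Name": "/conv1/Conv + /relu/Relu"},  # one fused TRT layer
        {"Name": "/fc/MatMul"},                # unfused layer
    ]
})
print(fusion_groups_from_graph_json(sample))
# [['/conv1/Conv', '/relu/Relu'], ['/fc/MatMul']]
```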

8.3 Phase 2: Subgraph Profiling

For each quantizable fusion group:

  1. Resolve boundary tensors
  2. Extract a standalone ONNX subgraph
  3. Build a small heuristic set of schemes, for example:
    • baseline
    • full QDQ
    • weight-only
    • activation-only
  4. Benchmark each scheme
  5. Choose the best scheme for that group

Dynamic-shape handling:

  • The workflow parses --minShapes/--optShapes/--maxShapes from --trtexec-args
  • Infers intermediate tensor shapes
  • Constructs subgraph-specific shape arguments for extracted subgraphs
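The shape-argument parsing can be sketched as follows (an illustrative helper, not the branch's implementation; trtexec shape specs of the form `name:1x3x224x224`, comma-separated, are assumed):

```python
import shlex

def parse_shape_args(trtexec_args):
    """Extract --minShapes/--optShapes/--maxShapes maps from a free-form
    trtexec argument string."""
    shapes = {}
    for token in shlex.split(trtexec_args):
        for key in ("--minShapes=", "--optShapes=", "--maxShapes="):
            if token.startswith(key):
                parsed = {}
                for entry in token[len(key):].split(","):
                    name, dims = entry.rsplit(":", 1)
                    parsed[name] = [int(d) for d in dims.split("x")]
                shapes[key[2:-1]] = parsed  # strip leading "--" and "="
    return shapes

args = "--stronglyTyped --optShapes=input:1x3x224x224"
print(parse_shape_args(args))
# {'optShapes': {'input': [1, 3, 224, 224]}}
```

Once intermediate tensor shapes are inferred, the same `name:dims` syntax can be re-emitted for each extracted subgraph's boundary tensors.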

8.4 Per-layer Comparison

A major issue with isolated subgraph benchmarking is that total subgraph latency can be dominated by artifacts such as:

  • Reformat layers
  • Transposes
  • Layout adaptation overhead

To reduce this noise, subgraph mode attempts to use trtexec per-layer profile export:

  • --exportProfile
  • --profilingVerbosity=detailed
  • --separateProfileRun

If supported, the workflow:

  • Parses the per-layer profile JSON
  • Filters out reformat-style overhead
  • Compares the time spent on compute layers

If the local trtexec version does not support those flags, the code falls back to total subgraph latency.
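The filtering idea can be sketched like this. It is a hedged illustration: trtexec --exportProfile emits per-layer timing records, but field names and the exact marker strings for reformat layers vary by version, so `"name"`, `"averageMs"`, and the marker list below are assumptions.

```python
import json

OVERHEAD_MARKERS = ("reformat", "transpose", "shuffle")  # illustrative

def compute_time_ms(profile_text):
    """Sum per-layer time, skipping layout/reformat overhead records."""
    total = 0.0
    for rec in json.loads(profile_text):
        name = rec.get("name", "")
        if not name:  # skip non-layer records such as {"count": N}
            continue
        if any(m in name.lower() for m in OVERHEAD_MARKERS):
            continue
        total += rec.get("averageMs", 0.0)
    return total

profile = json.dumps([
    {"count": 20},
    {"name": "/conv1/Conv", "averageMs": 0.40},
    {"name": "Reformatting CopyNode for Input", "averageMs": 0.25},
])
# Only the compute layer counts toward the comparison:
print(compute_time_ms(profile))  # 0.4
```

Comparing schemes on this filtered total avoids rewarding or penalizing a scheme for boundary artifacts that would not exist in the full model.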

8.5 Phase 3: Full-model Merge and Incremental Validation

The subgraph workflow distinguishes two outputs:

  • optimized_raw.onnx: all qualifying groups applied together
  • optimized_final.onnx: final validated model after incremental validation

Incremental validation exists because multiple individually beneficial subgraph changes can still regress full-model latency when combined.

When enabled:

  1. Start from the FP16 baseline full model
  2. Apply candidate groups one at a time
  3. Benchmark the full model after each change
  4. Keep the change only if latency improves

This design preserves:

  • A raw “all accepted by local heuristic” model
  • A validated “accepted by full-model measurement” model
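The validation loop above can be sketched in a few lines (`benchmark` stands in for the real full-model TensorRT measurement; this is a toy illustration of the keep/reject rule, not the branch's code):

```python
def incremental_validation(baseline_ms, candidates, benchmark):
    """Apply candidate groups one at a time; keep only strict improvements."""
    kept, current = [], baseline_ms
    for group in candidates:
        trial = kept + [group]
        latency = benchmark(trial)
        if latency < current:
            kept, current = trial, latency
        # otherwise the group is rejected and 'kept' is unchanged
    return kept, current

# Toy cost table: group "a" helps, "b" regresses when combined, "c" helps.
costs = {(): 10.0, ("a",): 9.0, ("a", "b"): 9.5, ("a", "c"): 8.7}
bench = lambda groups: costs[tuple(groups)]
kept, final = incremental_validation(10.0, ["a", "b", "c"], bench)
print(kept, final)  # ['a', 'c'] 8.7
```

Note that this is a greedy order-dependent pass: "b" is rejected because it regresses on top of "a", even though it might have helped in another combination, which is exactly the trade-off accepted for tractability.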

8.6 Cache and Resume

Subgraph mode uses autotune_cache.json in the output directory.

It stores:

  • Phase 2 group profiling results
  • Phase 3 validation progress
  • Current latency and keep/reject decisions

This enables:

  • Resuming long runs after interruption
  • Restarting directly at incremental validation if Phase 2 is already complete
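The checkpoint mechanism can be sketched as below (a hypothetical simplification: function names and the cache schema are illustrative, though the real cache file is also JSON):

```python
import json, os, tempfile

def load_cache(path):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"phase2": {}}

def profile_groups(groups, profile_fn, cache_path):
    """Profile each group, checkpointing after every result so a
    re-run skips anything already in the cache."""
    cache = load_cache(cache_path)
    for gid in groups:
        if gid in cache["phase2"]:
            continue  # resume: already profiled in a previous run
        cache["phase2"][gid] = profile_fn(gid)
        with open(cache_path, "w") as f:
            json.dump(cache, f)  # checkpoint after every group
    return cache

path = os.path.join(tempfile.mkdtemp(), "autotune_cache.json")
profile_groups(["g0", "g1"], lambda g: {"best": "full_qdq"}, path)

# A second run finds both groups cached and re-profiles nothing:
calls = []
profile_groups(["g0", "g1"], lambda g: calls.append(g), path)
print(calls)  # []
```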

9. Benchmarking Design

9.1 Python API Benchmark

The Python API path is used for standard latency measurement when trtexec is not required.

Design decisions:

  • Reuse a global benchmark instance
  • Reuse TensorRT builder/runtime state
  • Reuse timing cache across measurements

This reduces repeated benchmark setup overhead.
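The reuse pattern can be sketched without TensorRT itself; `Benchmark`, `get_benchmark`, and `measure` below are illustrative stand-ins for the real builder/runtime state and timing cache, showing only the single-shared-instance design:

```python
from functools import lru_cache

class Benchmark:
    def __init__(self):
        self.timing_cache = {}  # stands in for the TRT timing cache
        self.builds = 0

    def measure(self, model_key):
        if model_key not in self.timing_cache:
            self.builds += 1    # expensive engine build happens once
            self.timing_cache[model_key] = 1.23
        return self.timing_cache[model_key]

@lru_cache(maxsize=1)
def get_benchmark():
    """Create the benchmark at most once per process."""
    return Benchmark()

# All call sites share one instance, so setup cost is paid once:
get_benchmark().measure("m")
get_benchmark().measure("m")
print(get_benchmark().builds)  # 1
```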

9.2 trtexec Benchmark

trtexec is needed for:

  • Dynamic-shape benchmarking with custom shape flags
  • Generating TensorRT graph.json
  • Optional per-layer profile export

The trtexec path accepts a free-form --trtexec-args string so users can pass deployment-specific settings.

9.3 Profiling-flag Compatibility Fallback

Different TensorRT/trtexec versions do not always support the same profiling flags.

The branch therefore implements the following behavior:

  1. Attempt to run with per-layer profiling flags if a profile is requested
  2. If trtexec fails with Unknown option, remove profiling-related flags
  3. Retry once without profiling
  4. Mark profiling as unsupported for subsequent runs

This keeps subgraph mode functional even on older or restricted trtexec versions.
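The fallback can be sketched as below. The runner is injected so the sketch is testable without a real trtexec binary; the flag names match the ones listed above, but the detection string and control flow are an assumed simplification of the branch's logic.

```python
PROFILING_FLAGS = ("--exportProfile", "--profilingVerbosity",
                   "--separateProfileRun")

def run_with_fallback(cmd, run_cmd):
    """Run trtexec; on an 'Unknown option' failure, strip profiling
    flags and retry once. Returns (result, profiling_supported)."""
    result = run_cmd(cmd)
    if result.returncode != 0 and "Unknown option" in result.stderr:
        stripped = [a for a in cmd if not a.startswith(PROFILING_FLAGS)]
        return run_cmd(stripped), False  # mark profiling unsupported
    return result, True

class FakeResult:
    def __init__(self, rc, err=""):
        self.returncode, self.stderr = rc, err

def fake_trtexec(cmd):
    # Simulates an older trtexec that rejects profiling flags.
    if any(a.startswith(PROFILING_FLAGS) for a in cmd):
        return FakeResult(1, "Unknown option: --exportProfile")
    return FakeResult(0)

cmd = ["trtexec", "--onnx=m.onnx", "--exportProfile=p.json"]
result, profiled = run_with_fallback(cmd, fake_trtexec)
print(result.returncode, profiled)  # 0 False
```

Recording the `False` flag once lets all subsequent runs skip the doomed first attempt and go straight to total-latency comparison.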

10. CLI Design

The CLI is defined in __main__.py.

Important options:

  • --mode {region,subgraph}
  • --graph-json
  • --incremental-validation / --no-incremental-validation
  • --use-trtexec
  • --trtexec-args
  • --pattern-cache
  • --qdq-baseline
  • --state-file
  • --plugin-libraries
  • --schemes-per-region

Design rationale:

  • Keep region mode as the default for backward compatibility
  • Make subgraph mode opt-in because it depends on TRT fusion metadata and trtexec
  • Keep advanced deployment settings in --trtexec-args instead of introducing many narrowly scoped flags

11. Data and Output Model

11.1 Region Mode Outputs

  • baseline.onnx
  • optimized_final.onnx
  • autotuner_state.yaml
  • autotuner_state_pattern_cache.yaml
  • logs/
  • region_models/

11.2 Subgraph Mode Outputs

  • optimized_raw.onnx
  • optimized_final.onnx
  • autotune_cache.json
  • logs/
  • subgraphs/
  • optional generated <model>.fp16.graph.json

12. Key Trade-offs

12.1 Region Mode vs Subgraph Mode

Region mode:

  • Pros:
    • General
    • Pattern-reuse friendly
    • Does not require TRT fusion metadata up front
  • Cons:
    • Can be slow on large models
    • More expensive search loop

Subgraph mode:

  • Pros:
    • Much faster
    • Aligned with actual TRT fusion boundaries
    • Better fit for large and dynamic-shape models
  • Cons:
    • Requires TRT graph export or prior graph
    • More heuristic
    • Local subgraph wins may not transfer cleanly to full-model wins

12.2 Why Keep Both Raw and Final Models

We save both optimized_raw.onnx and optimized_final.onnx because they serve different purposes:

  • optimized_raw.onnx preserves all locally accepted group changes
  • optimized_final.onnx preserves only changes that survive full-model validation

This avoids hiding useful intermediate output from advanced users while still giving a safer default final artifact.

12.3 Why Pattern-relative Insertion Points

Absolute node IDs would make learned schemes brittle and model-instance-specific.

Pattern-relative insertion points:

  • Generalize across repeated structures
  • Enable pattern cache reuse
  • Make imported QDQ baselines transferable

13. Limitations and Risks

  • Region-mode default discovery currently relies on PyTorch-style ONNX naming for best results.
  • The autotuner optimizes for latency only; it does not enforce accuracy constraints.
  • Subgraph heuristics do not explore the full combinatorial space.
  • Per-layer profiling depends on local trtexec support.
  • Full-model validation still depends on benchmark stability; small latency differences can be noisy.

14. Alternatives Considered

14.1 Full-model-only Search

Rejected because it is too slow for large models and dynamic-shape deployments.

14.2 Subgraph-only Design

Rejected because it gives up the benefits of pattern reuse and a backend-agnostic structural view of the graph.

14.3 Absolute-node Placement Search

Rejected because it would prevent scheme reuse across equivalent structures and make state transfer less useful.

15. Validation Plan

Recommended design-review validation:

  1. Basic region-mode run

    • Run on ResNet50 fixed-batch example
    • Confirm optimized_final.onnx and state files are produced
  2. Subgraph mode without --graph-json

    • Confirm trtexec first generates *.fp16.graph.json
    • Confirm the workflow proceeds into Phase 2 and Phase 3
  3. Resume behavior

    • Interrupt a subgraph run mid-way
    • Re-run with the same output directory
    • Confirm resume from autotune_cache.json
  4. Profiling fallback

    • Run in an environment where trtexec does not support profiling flags
    • Confirm automatic retry without those flags
  5. Warm-start path

    • Run once to produce a pattern cache
    • Re-run using --pattern-cache
    • Confirm seeding behavior and faster convergence

16. Rollout and Documentation

This branch already includes:

  • CLI integration
  • Example under examples/qdq_placement/
  • User documentation updates
  • API documentation updates

Suggested rollout:

  1. Complete design review
  2. Land PR behind the existing CLI mode separation
  3. Gather feedback on:
    • default region discovery behavior
    • subgraph heuristic quality
    • benchmark noise and validation thresholds

17. Open Questions for Review

  • Should region mode continue to default to TorchRegionBuilder, or should a more general search path be reinstated as the primary default?
  • Should subgraph mode require --use-trtexec, or should it remain flexible while documenting that trtexec is the recommended path?
  • Should the per-group acceptance threshold remain latency-only, or should we expose stronger hysteresis / noise margins for incremental validation?
  • Should we eventually add accuracy hooks so this becomes a joint latency + quality optimization flow?

Resolve add/add conflicts across 13 files by taking main's refactored
codebase (autotuner_base.py split, benchmark.py extraction, op_types
reorganization) and re-applying subgraph-mode additions:

- __init__.py: add subgraph_autotuning_workflow export
- __main__.py: add --workflow subgraph/region, --graph_json,
  --incremental_validation CLI arguments
- workflows.py: add strip_shape_args, extra_run_args,
  export_profile_path parameters to benchmark_onnx_model
- benchmark.py: add profiling support (export_profile_path,
  strip_shape_args) with trtexec fallback on unsupported flags
- subgraph_workflow.py: migrate BOOL_OUTPUT_OPS to use
  get_bool_ops/get_comparison_ops/get_value_check_ops from op_types

Files using only base code (region_pattern, region_search, 5 test files,
common, insertion_points, autotuner) resolved by accepting main's version.

Made-with: Cursor
@coderabbitai
Contributor

coderabbitai bot commented Mar 13, 2026

📝 Walkthrough

A comprehensive subgraph-based quantize/dequantize (Q/DQ) placement autotuning system for ONNX models has been added, including documentation, CLI integration, TensorRT benchmarking infrastructure, graph fusion analysis, PyTorch region discovery, and a multi-phase optimization workflow.

Changes

  • Documentation – docs/source/guides/9_qdq_placement.rst, docs/source/reference/2_qdq_placement.rst, examples/qdq_placement/README.md: comprehensive user guides and API reference documentation for the Q/DQ placement optimizer, covering workflows, CLI usage, Python API, configuration, best practices, and deployment via TensorRT.
  • Example utilities – examples/qdq_placement/set_batch_size.py: CLI utility script to fix batch dimensions in ONNX models with shape inference propagation.
  • Module initialization & CLI – modelopt/onnx/quantization/autotune/__init__.py, modelopt/onnx/quantization/autotune/__main__.py: exports the new subgraph_autotuning_workflow function and adds CLI flags (--workflow, --graph_json, --incremental_validation) to route between region and subgraph workflow modes.
  • Benchmarking infrastructure – modelopt/onnx/quantization/autotune/tensorrt_utils.py, modelopt/onnx/quantization/autotune/benchmark.py: abstract Benchmark base class, TrtExecBenchmark extension with profiling/shape-stripping support, and TensorRTPyBenchmark for Python API-based TensorRT engine benchmarking with dynamic shape handling.
  • Graph analysis & processing – modelopt/onnx/quantization/autotune/fusion_grouping.py, modelopt/onnx/quantization/autotune/subgraph_extractor.py, modelopt/onnx/quantization/autotune/qdq_utils.py: graph.json parsing for TRT fusion groups, subgraph extraction utilities via onnx_graphsurgeon, and quantized tensor discovery helpers.
  • Region discovery – modelopt/onnx/quantization/autotune/torch_region_builder.py: PyTorch-exported ONNX model analysis with hierarchical region construction, quantizability detection, epilogue fusion, and region search/inspection APIs.
  • Subgraph workflow implementation – modelopt/onnx/quantization/autotune/subgraph_workflow.py: multi-phase subgraph-based QDQ autotuning: fusion group discovery, per-subgraph heuristic scheme profiling, baseline benchmarking, incremental validation, caching, and optimized model export with latency-driven scheme selection.
  • Workflow integration – modelopt/onnx/quantization/autotune/workflows.py: extended benchmark_onnx_model signature to support shape stripping, extra profiling arguments, and per-layer profiling export.
  • Tests – tests/unit/onnx/quantization/autotune/test_config.py: unit tests validating Config class defaults, custom assignments, and parameter constraints.

Sequence Diagram

sequenceDiagram
    participant CLI as CLI / User
    participant WF as Subgraph Workflow
    participant GA as Graph Analysis<br/>(Fusion Groups)
    participant SG as Subgraph<br/>Extraction
    participant BM as Benchmarking<br/>(TensorRT)
    participant FS as FileSystem<br/>(Cache/Output)

    CLI->>WF: subgraph_autotuning_workflow(model_path, ...)
    
    rect rgba(100, 200, 150, 0.5)
    note over WF: Phase 1: Setup
    WF->>GA: parse graph.json + create_fusion_groups
    GA->>GA: Map TRT layers to ONNX nodes
    GA-->>WF: FusionGroup list
    end

    rect rgba(100, 150, 200, 0.5)
    note over WF: Phase 2: Per-Subgraph Profiling
    loop For each FusionGroup
        WF->>WF: generate_heuristic_schemes
        loop For each scheme
            WF->>SG: extract_subgraph_by_nodes
            SG-->>WF: subgraph bytes
            WF->>WF: insert_qdq_on_graph
            WF->>BM: benchmark_onnx_model
            BM-->>WF: latency_ms
        end
        WF->>WF: Select best_scheme
    end
    WF->>FS: Cache phase2 results
    end

    rect rgba(200, 150, 100, 0.5)
    note over WF: Phase 3: Full-Model Validation
    WF->>BM: Baseline FP16 benchmark full model
    BM-->>WF: baseline_latency_ms
    loop Incremental Validation (if enabled)
        WF->>WF: Apply candidate QDQ schemes
        WF->>BM: Benchmark full model per candidate
        BM-->>WF: Test latency_ms
        WF->>WF: Keep if latency improves
    end
    WF->>FS: Save optimized_final.onnx
    WF->>FS: Cache phase3 progress
    end

    WF-->>CLI: Return optimized_model_path
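The Phase 2 selection logic in the diagram can be sketched in plain Python. The callback names below (schemes_for, extract_subgraph, insert_qdq, benchmark) are placeholders standing in for the workflow's actual helpers, not the real API:

```python
def profile_fusion_groups(fusion_groups, schemes_for, extract_subgraph, insert_qdq, benchmark):
    """Phase 2 sketch: pick the lowest-latency QDQ scheme for each fusion group."""
    best = {}
    for group in fusion_groups:
        best_scheme, best_latency = None, float("inf")
        for scheme in schemes_for(group):
            sub = extract_subgraph(group)        # isolated subgraph (e.g., serialized bytes)
            candidate = insert_qdq(sub, scheme)  # apply the candidate Q/DQ placement
            latency = benchmark(candidate)       # TensorRT latency_ms for the subgraph
            if latency < best_latency:
                best_scheme, best_latency = scheme, latency
        best[group] = (best_scheme, best_latency)
    return best
```

Phase 3 then replays the winning schemes against the full model, keeping only those that beat the FP16 baseline.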

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Security Anti-Patterns | ❓ Inconclusive | No verification output was provided above to convert into the required JSON format. | Please provide the verification output that needs to be converted into JSON format. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main feature: support for a new subgraph mode in ONNX Q/DQ autotuning, which is the primary objective of this PR. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 90.00%, which is sufficient. The required threshold is 80.00%. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 16

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/onnx/quantization/autotune/benchmark.py (1)

34-34: ⚠️ Potential issue | 🟡 Minor

# nosec comments require explicit justification and approval.

As per coding guidelines, use of # nosec comments to bypass Bandit security checks is not allowed. If this security-sensitive pattern is genuinely necessary, the PR must be reviewed and approved by @NVIDIA/modelopt-setup-codeowners with explicit justification in the PR description.

The subprocess import itself is safe when used properly (with list arguments, not shell=True), but the # nosec B404 comment should be documented or removed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/autotune/benchmark.py` at line 34, Remove the
inline "# nosec B404" on the subprocess import (import subprocess) unless you
add an explicit justification and approval from
`@NVIDIA/modelopt-setup-codeowners` in the PR description; if the pattern is
required, keep the comment but add the justification text to the PR and ensure
all uses of subprocess in this module (functions/methods that call
subprocess.run, subprocess.Popen, etc.) follow safe patterns (pass args as a
list, avoid shell=True) and reference that approval in the PR description.
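For reference, the safe pattern the comment describes is to pass arguments as a list and avoid shell=True, so no string is ever interpreted by a shell. A standalone illustration (not the module's actual code):

```python
import subprocess
import sys


def run_safely(code: str) -> str:
    """Run a Python snippet in a subprocess using list-form arguments (no shell)."""
    result = subprocess.run(
        [sys.executable, "-c", code],  # args as a list: no shell injection surface
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()
```

With shell=True and string interpolation, an attacker-controlled argument could inject shell metacharacters; the list form sidesteps that entirely.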
🧹 Nitpick comments (6)
examples/qdq_placement/set_batch_size.py (1)

62-65: Consider handling large models (>2GB) for model verification.

The direct call to onnx.checker.check_model(output_path) may fail for models larger than 2GB due to protobuf size limits. The codebase has a utility in modelopt/onnx/utils.py (see check_model function) that handles this case by using external data storage.

Suggested approach
+import tempfile
+import os
+
 def set_batch_size(model_path: str, batch_size: int, output_path: str) -> None:
     # ... existing code ...
     
     # Verify the saved model
     print("Verifying model...")
-    onnx.checker.check_model(output_path)
+    # Handle large models that exceed protobuf 2GB limit
+    saved_model = onnx.load(output_path)
+    if saved_model.ByteSize() > (2 * (1024**3)):
+        # For large models, check_model needs the file path with external data
+        onnx.checker.check_model(output_path)
+    else:
+        onnx.checker.check_model(saved_model)
     print("✓ Model saved and verified successfully!")

Alternatively, consider importing and using modelopt.onnx.utils.check_model for consistency with the rest of the codebase.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/qdq_placement/set_batch_size.py` around lines 62 - 65, Replace the
direct call to onnx.checker.check_model(output_path) with the project's robust
checker that handles >2GB models by importing and calling
modelopt.onnx.utils.check_model; locate the verification block around the
print("Verifying model...") and change the call to use
modelopt.onnx.utils.check_model(output_path) (or import the function as
check_model and call check_model(output_path)) so external-data large protobuf
models are supported.
modelopt/onnx/quantization/autotune/qdq_utils.py (2)

60-60: Simplify redundant condition.

The check node.input and len(node.input) > 0 is redundant since node.input being truthy already implies it's non-empty.

Suggested simplification
-            if node.input and len(node.input) > 0:
+            if node.input:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/autotune/qdq_utils.py` at line 60, The condition
checking node inputs is redundant: replace the `if node.input and
len(node.input) > 0:` guard with a simpler truthy check `if node.input:` in the
function where this appears (look for the code handling `node.input` in
qdq_utils.py) so the branch behavior remains identical but the expression is
simplified.

1-1: Update copyright year for consistency.

The copyright year is 2024, but other new files in this PR use 2026. Consider updating for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/autotune/qdq_utils.py` at line 1, Update the SPDX
header year in modelopt/onnx/quantization/autotune/qdq_utils.py to match the
rest of the PR (change "2024" to "2026"); locate the SPDX comment at the top of
the file and modify the year in the copyright line so it is consistent across
files.
tests/unit/onnx/quantization/autotune/test_config.py (1)

12-13: Remove unnecessary sys.path manipulation.

The sys.path.insert is unnecessary if the package is properly installed in the test environment. This pattern can cause import issues and is not recommended.

Suggested fix
-# Add parent directory to path
-sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
-
 from modelopt.onnx.quantization.autotune.common import Config
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/onnx/quantization/autotune/test_config.py` around lines 12 - 13,
Remove the ad-hoc sys.path manipulation in the test file: delete the
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
line in tests/unit/onnx/quantization/autotune/test_config.py and rely on the
package being installed in the test environment (or use test runner
configuration / PYTHONPATH) so imports resolve cleanly; do not add alternative
path hacks in this file.
modelopt/onnx/quantization/autotune/subgraph_extractor.py (1)

170-181: Consider using collections.deque for BFS queue.

Using list.pop(0) is O(n) per operation. For potentially large graphs, using collections.deque with popleft() provides O(1) performance.

♻️ Suggested improvement
+from collections import deque
+
 def _find_reachable_graph_inputs(
     graph: gs.Graph,
     target_nodes: List[gs.Node],
 ) -> List[str]:
     """BFS backward from target_nodes to find graph inputs that feed into them."""
     graph_input_names = {t.name for t in graph.inputs}
     visited = set()
-    queue = []
+    queue = deque()
     result = []

     for node in target_nodes:
         for inp in node.inputs:
             if isinstance(inp, gs.Variable) and inp.name not in visited:
                 visited.add(inp.name)
                 queue.append(inp)

     while queue:
-        tensor = queue.pop(0)
+        tensor = queue.popleft()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/autotune/subgraph_extractor.py` around lines 170 -
181, The BFS loop in subgraph_extractor.py uses a Python list and pop(0),
causing O(n) dequeuing; replace the list-based queue with collections.deque:
import deque, initialize queue = deque(...) where the queue is created, and
change queue.pop(0) to queue.popleft(), keeping queue.append(...) for
enqueueing; update any code that constructs the initial queue and ensure imports
include collections.deque so the BFS in the function that contains this loop
uses O(1) dequeue operations.
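The comment's point generalizes: `list.pop(0)` shifts every remaining element on each dequeue, while `collections.deque.popleft()` is O(1). A minimal BFS over a toy adjacency map (not the extractor's actual graph types):

```python
from collections import deque


def bfs_reachable(adjacency, starts):
    """Return the set of nodes reachable from `starts` via an O(1)-dequeue BFS."""
    visited = set(starts)
    queue = deque(starts)
    while queue:
        node = queue.popleft()  # O(1), unlike list.pop(0)
        for nxt in adjacency.get(node, ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return visited
```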
examples/qdq_placement/README.md (1)

152-162: Add language specifier to fenced code blocks.

The directory structure code blocks should have a language specifier for consistency. Consider using text or plaintext for directory listings.

📝 Suggested fix
-```
+```text
 resnet50_results/
 ├── optimized_final.onnx              # Optimized model
-```
+```text
 <output_dir>/
 ├── optimized_final.onnx              # Incrementally validated model (if --incremental-validation)

Also applies to: 166-175

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/qdq_placement/README.md` around lines 152 - 162, Update the fenced
code blocks that show directory listings so they include a language specifier
(use ```text) — specifically change the blocks containing the
"resnet50_results/" tree and the block containing the "<output_dir>/" tree to
start with ```text instead of ``` so the directory listings render consistently;
locate the blocks by the directory headers "resnet50_results/" and
"<output_dir>/" in the README and replace their opening fences accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/source/reference/2_qdq_placement.rst`:
- Line 828: The docs claim "Default threshold: 1.01" but the actual default is
defined as performance_threshold: float = 1.02 on the Config class in common.py;
update the documentation string in 2_qdq_placement.rst to state "Default
threshold: 1.02 (2% improvement minimum)" so the docs match the implementation
(or alternatively change Config.performance_threshold to 1.01 if you intend the
docs to be canonical).

In `@examples/qdq_placement/set_batch_size.py`:
- Around line 1-10: Add the required NVIDIA Apache 2.0 license header to the top
of the examples/qdq_placement/set_batch_size.py script (above or immediately
after the existing shebang and before the module docstring) so the file includes
the full license boilerplate mandated for examples/**/*.py; ensure the header
explicitly names "NVIDIA CORPORATION" and the Apache 2.0 terms and leave the
rest of the script (including the module docstring and usage comments)
unchanged.

In `@modelopt/onnx/quantization/autotune/fusion_grouping.py`:
- Around line 1-6: Add the required NVIDIA Apache 2.0 license header to the top
of the module modelopt/onnx/quantization/autotune/fusion_grouping.py by
inserting the standard multi-line NVIDIA Apache 2.0 header (including copyright
line, SPDX identifier and license text reference) as used across other files in
modelopt; ensure the header appears before the existing module docstring so
classes/functions in this file (e.g., the fusion grouping logic in
fusion_grouping.py) are properly licensed.

In `@modelopt/onnx/quantization/autotune/subgraph_extractor.py`:
- Around line 1-6: Add the NVIDIA Apache 2.0 license header to the top of the
module file modelopt/onnx/quantization/autotune/subgraph_extractor.py by
inserting the standard multi-line license comment before the existing module
docstring; ensure the header includes the SPDX identifier and full NVIDIA Apache
2.0 text (or the project-standard short header) so the file begins with the
required license block followed by the current docstring and existing code in
subgraph_extractor.py.

In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py`:
- Around line 1-8: This file is missing the NVIDIA Apache 2.0 license header;
prepend the standard NVIDIA Apache-2.0 license block to the very top of
subgraph_workflow.py (above the existing module docstring) including the
copyright line and license text/notice, ensuring the SPDX identifier
(Apache-2.0) or full license header is present and that the existing file
docstring remains unchanged below the header.
- Around line 99-105: The loops that mutate the protobuf repeated field use
nonstandard pop()/add() calls on dim_proto; replace popping with explicit
deletion (e.g., use del dim_proto[-1] in the shrink loop) and call
dim_proto.add() when growing but capture the returned element if you need to
initialize it (e.g., new_dim = dim_proto.add()); keep the subsequent loop that
sets dim_proto[i].ClearField("dim_param") and dim_proto[i].dim_value = d, but
ensure dim_proto has been resized using del and dim_proto.add() rather than
pop()/add() without using the returned message.
- Around line 39-42: QUANT_DTYPES currently sets "fp8" to np.int8 when the FP8
dtype isn't present, causing silent fallback; change this by removing the silent
fallback from the module-level QUANT_DTYPES and instead detect/import ml_dtypes
(or np.float8 variant) conditionally at runtime where quantization is requested
(e.g., inside the function(s) that handle quantization/autotune and any call
sites referencing QUANT_DTYPES), and when a user requests "fp8" but the FP8
dtype isn't available emit a clear user-facing warning or raise a descriptive
error; specifically, update references to QUANT_DTYPES to validate availability
at usage time, attempt a conditional import of ml_dtypes (or check hasattr(np,
"float8_e4m3fn")) there, and log/warn if FP8 cannot be supported rather than
silently substituting int8.
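The usage-time validation the comment asks for might look like the sketch below. The function name and the fallback policy are hypothetical, and the check uses `importlib.util.find_spec` to probe for `ml_dtypes` without importing it:

```python
import importlib.util
import warnings


def resolve_quant_dtype(name: str) -> str:
    """Validate a quantization dtype at usage time rather than at import time."""
    if name == "fp8" and importlib.util.find_spec("ml_dtypes") is None:
        # Hypothetical policy: warn loudly instead of substituting int8 silently.
        warnings.warn(
            "fp8 was requested but ml_dtypes is not installed; falling back to int8"
        )
        return "int8"
    return name
```

Raising a descriptive error instead of warning is equally valid; the key is that the user learns FP8 is unsupported rather than getting int8 behavior unannounced.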

In `@modelopt/onnx/quantization/autotune/tensorrt_utils.py`:
- Around line 1-41: The license header at the top of the module (the
module-level docstring and SPDX header in tensorrt_utils.py) uses an unusual
year range "1993-2025"; update the copyright/SPDX header to the project's
standard single-year format (e.g., "Copyright (c) 2026 NVIDIA CORPORATION &
AFFILIATES") and adjust the SPDX-License-Identifier line if needed so the header
matches other files; edit the top-of-file docstring/SPDX block where the current
year range appears to replace it with the standard format.
- Around line 987-1000: Fix _save_timing_cache: the condition and method call
are wrong and the finally cleans up an undefined name. Change the inverted check
so you only call combine when self._timing_cache is NOT None, correct the typo
`combline` to `combine` for the timing cache object returned by
config.create_timing_cache, and in the finally block remove the undefined
`builder` deletion (either delete only `config` or explicitly reference
`self.builder` if you intended to delete it); ensure you still serialize and
write timing_cache_data to self.timing_cache_file when a timing cache exists.

In `@modelopt/onnx/quantization/autotune/torch_region_builder.py`:
- Around line 311-318: The _build_id_to_region_map function currently uses a
mutable default dict for id_to_region_map which can be shared across calls;
change the signature to default id_to_region_map to None and inside
_build_id_to_region_map initialize id_to_region_map = {} when it's None, then
proceed to assign id_to_region_map[region.id] = region and recursively call
self._build_id_to_region_map(child, id_to_region_map) so each top-level call
gets a fresh map; keep the return type dict[int, Region] and behavior otherwise
unchanged.
- Around line 16-24: The module docstring in torch_region_builder.py duplicates
the license/copyright header (the "SPDX-FileCopyrightText" lines and
years)—remove the duplicate license block from the top of the module docstring
so only the canonical file header remains; keep the descriptive docstring text
("Torch Region Builder - Hierarchical Region Discovery...") and ensure
functions/classes in this module (e.g., any top-level descriptions used by
torch_region_builder.py) are unaffected by the removal.
- Around line 752-754: The parameter only_quantizable is being ignored because
the code unconditionally sets only_quantizable = True; fix by either removing
that assignment so the function respects the incoming only_quantizable
parameter, or remove the only_quantizable parameter from the function signature
if it's not needed; locate the assignment near the logger.info(f"Loading model:
{onnx_path}") call in torch_region_builder.py and update the function that
accepts only_quantizable to either use the passed value or eliminate the unused
parameter and adjust all callers accordingly.
- Around line 320-331: The _build_tensor_to_regions_map function uses a mutable
default argument (dict = {}), which can be shared across calls; change the
signature to accept tensor_to_regions_map: Optional[dict[str, set[int]]] = None
and inside the function initialize tensor_to_regions_map = {} if None, mirroring
the fix used in _build_id_to_region_map; keep the recursive behavior and return
the map as before so each top-level call gets a fresh dict.
- Line 333: The _merge_neighboring_regions function currently uses a mutable
default argument for to_remove (set()); change the signature to accept None
(e.g., to_remove: Optional[set[int]] = None or to_remove: set[int] | None =
None) and inside the body initialize it to an empty set when None (e.g., if
to_remove is None: to_remove = set()), updating any type imports as needed and
leaving the rest of the logic in _merge_neighboring_regions unchanged.
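The three findings above are all instances of the same Python pitfall: a mutable default is evaluated once at definition time and shared across every call. A standalone illustration of the bug and the None-default fix (toy functions, not the builder's real signatures):

```python
def collect_bad(item, acc=[]):  # BUG: the same list object is reused across calls
    acc.append(item)
    return acc


def collect_good(item, acc=None):  # FIX: a fresh container per top-level call
    if acc is None:
        acc = []
    acc.append(item)
    return acc
```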

In `@tests/unit/onnx/quantization/autotune/test_config.py`:
- Around line 71-82: The test test_performance_threshold_validation is
incomplete: it only asserts valid values for Config.performance_threshold but
doesn't check that invalid values are rejected; either add validation in Config
or update the test to assert the expected failure. If Config should validate,
implement a check in Config.__init__ or the performance_threshold setter to
raise ValueError for values < 1.0 and then update
test_performance_threshold_validation to include a with
self.assertRaises(ValueError): Config(performance_threshold=0.9) case; otherwise
remove the misleading comment and the "Invalid values" note from
test_performance_threshold_validation.
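If Config is meant to validate, the check could live in a `__post_init__` hook, as in this hypothetical dataclass (not the actual class in common.py):

```python
from dataclasses import dataclass


@dataclass
class Config:
    performance_threshold: float = 1.02  # minimum speedup required to accept a scheme

    def __post_init__(self):
        # A threshold below 1.0 would accept regressions, so reject it outright.
        if self.performance_threshold < 1.0:
            raise ValueError(
                f"performance_threshold must be >= 1.0, got {self.performance_threshold}"
            )
```

The matching test case would then wrap `Config(performance_threshold=0.9)` in `assertRaises(ValueError)`.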
- Around line 1-6: Add the required NVIDIA Apache 2.0 license header to the top
of tests/unit/onnx/quantization/autotune/test_config.py: insert the standard
NVIDIA Apache-2.0 license block (including copyright notice/SPDX identifier and
license text) immediately after the existing shebang (#!/usr/bin/env python3)
and before the module docstring so the file complies with the repository's
license/header convention.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 48cf674c-cea7-4964-b1ed-c9c27e9d6e6e

📥 Commits

Reviewing files that changed from the base of the PR and between bc87981 and 48eed5d.

📒 Files selected for processing (15)
  • docs/source/guides/9_qdq_placement.rst
  • docs/source/reference/2_qdq_placement.rst
  • examples/qdq_placement/README.md
  • examples/qdq_placement/set_batch_size.py
  • modelopt/onnx/quantization/autotune/__init__.py
  • modelopt/onnx/quantization/autotune/__main__.py
  • modelopt/onnx/quantization/autotune/benchmark.py
  • modelopt/onnx/quantization/autotune/fusion_grouping.py
  • modelopt/onnx/quantization/autotune/qdq_utils.py
  • modelopt/onnx/quantization/autotune/subgraph_extractor.py
  • modelopt/onnx/quantization/autotune/subgraph_workflow.py
  • modelopt/onnx/quantization/autotune/tensorrt_utils.py
  • modelopt/onnx/quantization/autotune/torch_region_builder.py
  • modelopt/onnx/quantization/autotune/workflows.py
  • tests/unit/onnx/quantization/autotune/test_config.py

if speedup >= config.performance_threshold:
    accept_scheme()

Default threshold: 1.01 (1% improvement minimum)

⚠️ Potential issue | 🟡 Minor

Documentation inconsistency with actual default value.

The documentation states "Default threshold: 1.01 (1% improvement minimum)" but the Config class in common.py defines performance_threshold: float = 1.02 (2% improvement). This should be corrected to match the actual implementation.

Suggested fix
-Default threshold: 1.01 (1% improvement minimum)
+Default threshold: 1.02 (2% improvement minimum)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/source/reference/2_qdq_placement.rst` at line 828, The docs claim
"Default threshold: 1.01" but the actual default is defined as
performance_threshold: float = 1.02 on the Config class in common.py; update the
documentation string in 2_qdq_placement.rst to state "Default threshold: 1.02
(2% improvement minimum)" so the docs match the implementation (or alternatively
change Config.performance_threshold to 1.01 if you intend the docs to be
canonical).

Comment on lines +1 to +10
#!/usr/bin/env python3
"""
Script to set a fixed batch size for ONNX models.

This script modifies an ONNX model with dynamic batch size to use a fixed batch size,
which is often beneficial for TensorRT performance benchmarking.

Usage:
python set_batch_size.py resnet50_Opset17.onnx --batch-size 128 --output resnet50.bs128.onnx
"""

⚠️ Potential issue | 🟠 Major

Missing required NVIDIA Apache 2.0 license header.

Per coding guidelines, all new Python files in examples/**/*.py require the NVIDIA Apache 2.0 license header.

Add license header
 #!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """
 Script to set a fixed batch size for ONNX models.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/qdq_placement/set_batch_size.py` around lines 1 - 10, Add the
required NVIDIA Apache 2.0 license header to the top of the
examples/qdq_placement/set_batch_size.py script (above or immediately after the
existing shebang and before the module docstring) so the file includes the full
license boilerplate mandated for examples/**/*.py; ensure the header explicitly
names "NVIDIA CORPORATION" and the Apache 2.0 terms and leave the rest of the
script (including the module docstring and usage comments) unchanged.

Comment on lines +1 to +6
"""Fusion-aware grouping of ONNX nodes based on TensorRT graph.json.

Parses TensorRT's exported layer information (graph.json) to understand
how ONNX operations are fused into TRT layers. Creates FusionGroups that
map back to ONNX node names, enabling subgraph-level QDQ optimization.
"""

⚠️ Potential issue | 🟠 Major

Missing required NVIDIA Apache 2.0 license header.

Per coding guidelines, all new Python files in modelopt/**/*.py require the NVIDIA Apache 2.0 license header.

Add license header
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """Fusion-aware grouping of ONNX nodes based on TensorRT graph.json.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/autotune/fusion_grouping.py` around lines 1 - 6,
Add the required NVIDIA Apache 2.0 license header to the top of the module
modelopt/onnx/quantization/autotune/fusion_grouping.py by inserting the standard
multi-line NVIDIA Apache 2.0 header (including copyright line, SPDX identifier
and license text reference) as used across other files in modelopt; ensure the
header appears before the existing module docstring so classes/functions in this
file (e.g., the fusion grouping logic in fusion_grouping.py) are properly
licensed.

Comment on lines +1 to +6
"""Extract standalone ONNX subgraphs from a full model using onnx_graphsurgeon.

Given boundary tensor names (inputs/outputs), marks them as the subgraph's
I/O, cleans up unreferenced nodes, runs shape inference, and serializes
the result to bytes for direct TensorRT consumption.
"""

⚠️ Potential issue | 🟠 Major

Missing NVIDIA Apache 2.0 license header.

As per coding guidelines, all new Python files require the NVIDIA Apache 2.0 license header at the top of the file.

📄 Add license header
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """Extract standalone ONNX subgraphs from a full model using onnx_graphsurgeon.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/autotune/subgraph_extractor.py` around lines 1 -
6, Add the NVIDIA Apache 2.0 license header to the top of the module file
modelopt/onnx/quantization/autotune/subgraph_extractor.py by inserting the
standard multi-line license comment before the existing module docstring; ensure
the header includes the SPDX identifier and full NVIDIA Apache 2.0 text (or the
project-standard short header) so the file begins with the required license
block followed by the current docstring and existing code in
subgraph_extractor.py.

Comment on lines +1 to +8
"""Subgraph-based QDQ autotune workflow.

Uses fusion-aware subgraph extraction and heuristic QDQ schemes to optimize
Q/DQ placement. Reduces autotune time from ~25 hours to ~30 minutes by:
1. Grouping ONNX nodes by TRT fusion boundaries (graph.json)
2. Profiling isolated subgraphs instead of full model
3. Using domain-informed heuristic schemes instead of random mutation
"""

⚠️ Potential issue | 🟠 Major

Missing NVIDIA Apache 2.0 license header.

As per coding guidelines, all new Python files require the NVIDIA Apache 2.0 license header at the top of the file.

📄 Add license header
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 """Subgraph-based QDQ autotune workflow.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/autotune/subgraph_workflow.py` around lines 1 - 8,
This file is missing the NVIDIA Apache 2.0 license header; prepend the standard
NVIDIA Apache-2.0 license block to the very top of subgraph_workflow.py (above
the existing module docstring) including the copyright line and license
text/notice, ensuring the SPDX identifier (Apache-2.0) or full license header is
present and that the existing file docstring remains unchanged below the header.

Comment on lines +320 to +331
def _build_tensor_to_regions_map(
self, region: Region, tensor_to_regions_map: dict[str, set[int]] = {}
) -> dict[str, set[int]]:
"""Build a map from tensor names to regions."""
for input in region.inputs:
if input not in tensor_to_regions_map:
tensor_to_regions_map[input] = set()
tensor_to_regions_map[input].add(region.id)

for child in region.get_children():
self._build_tensor_to_regions_map(child, tensor_to_regions_map)
return tensor_to_regions_map

⚠️ Potential issue | 🟠 Major

Mutable default argument anti-pattern.

Same issue as _build_id_to_region_map - using mutable default dict = {}.

Proposed fix
 def _build_tensor_to_regions_map(
-    self, region: Region, tensor_to_regions_map: dict[str, set[int]] = {}
+    self, region: Region, tensor_to_regions_map: dict[str, set[int]] | None = None
 ) -> dict[str, set[int]]:
     """Build a map from tensor names to regions."""
+    if tensor_to_regions_map is None:
+        tensor_to_regions_map = {}
     for input in region.inputs:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/autotune/torch_region_builder.py` around lines 320
- 331, The _build_tensor_to_regions_map function uses a mutable default argument
(dict = {}), which can be shared across calls; change the signature to accept
tensor_to_regions_map: Optional[dict[str, set[int]]] = None and inside the
function initialize tensor_to_regions_map = {} if None, mirroring the fix used
in _build_id_to_region_map; keep the recursive behavior and return the map as
before so each top-level call gets a fresh dict.

self._build_tensor_to_regions_map(child, tensor_to_regions_map)
return tensor_to_regions_map

def _merge_neighboring_regions(self, region: Region, to_remove: set[int] = set()) -> None:

⚠️ Potential issue | 🟠 Major

Mutable default argument anti-pattern.

Same issue - using mutable default set = set().

Proposed fix
-def _merge_neighboring_regions(self, region: Region, to_remove: set[int] = set()) -> None:
+def _merge_neighboring_regions(self, region: Region, to_remove: set[int] | None = None) -> None:
+    if to_remove is None:
+        to_remove = set()
     self._compute_all_boundaries(region)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/autotune/torch_region_builder.py` at line 333, The
_merge_neighboring_regions function currently uses a mutable default argument
for to_remove (set()); change the signature to accept None (e.g., to_remove:
Optional[set[int]] = None or to_remove: set[int] | None = None) and inside the
body initialize it to an empty set when None (e.g., if to_remove is None:
to_remove = set()), updating any type imports as needed and leaving the rest of
the logic in _merge_neighboring_regions unchanged.
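Both findings above are the same Python pitfall: default argument values are evaluated once, when the `def` statement runs, so a `{}` or `set()` default is shared by every call that omits the argument. A minimal, self-contained sketch (the `build_map_*` names are illustrative, not from this PR):

```python
def build_map_buggy(items, acc={}):  # BUG: the default dict is created once, at def time
    for item in items:
        acc[item] = acc.get(item, 0) + 1
    return acc


def build_map_fixed(items, acc=None):
    if acc is None:  # fresh dict for every top-level call
        acc = {}
    for item in items:
        acc[item] = acc.get(item, 0) + 1
    return acc


first = build_map_buggy(["a"])
second = build_map_buggy(["b"])
print(first is second)  # True: both calls returned the same shared dict
print(second)           # {'a': 1, 'b': 1} -- state leaked across calls

print(build_map_fixed(["a"]))  # {'a': 1}
print(build_map_fixed(["b"]))  # {'b': 1}
```

For the recursive helpers in this PR the bug is latent (the map is passed explicitly on recursion), but any second top-level call relying on the default would see stale entries from the first, which is why the `None`-sentinel fix is the standard idiom.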

Comment on lines +752 to +754
"""
only_quantizable = True
logger.info(f"Loading model: {onnx_path}")

⚠️ Potential issue | 🟡 Minor

Function parameter immediately overwritten.

The only_quantizable parameter is accepted but immediately overwritten to True on line 753, making the parameter useless. Either remove the parameter or respect its value.

Option 1: Remove the override to respect the parameter
 def inspect_torch_regions(
     onnx_path: str,
     include_all_regions: bool = False,
     only_quantizable: bool = False,
 ) -> list[Region]:
-    only_quantizable = True
     logger.info(f"Loading model: {onnx_path}")
Option 2: Remove the unused parameter
 def inspect_torch_regions(
     onnx_path: str,
     include_all_regions: bool = False,
-    only_quantizable: bool = False,
 ) -> list[Region]:
+    only_quantizable = True
     logger.info(f"Loading model: {onnx_path}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/onnx/quantization/autotune/torch_region_builder.py` around lines 752
- 754, The parameter only_quantizable is being ignored because the code
unconditionally sets only_quantizable = True; fix by either removing that
assignment so the function respects the incoming only_quantizable parameter, or
remove the only_quantizable parameter from the function signature if it's not
needed; locate the assignment near the logger.info(f"Loading model:
{onnx_path}") call in torch_region_builder.py and update the function that
accepts only_quantizable to either use the passed value or eliminate the unused
parameter and adjust all callers accordingly.

Comment on lines +1 to +6
#!/usr/bin/env python3
"""
Tests for the Config class in the autotuner.

Tests configuration parameter validation and defaults.
"""

⚠️ Potential issue | 🟠 Major

Missing required NVIDIA Apache 2.0 license header.

Per coding guidelines, all new Python files require the NVIDIA Apache 2.0 license header.

Add license header
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 #!/usr/bin/env python3
 """
 Tests for the Config class in the autotuner.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/onnx/quantization/autotune/test_config.py` around lines 1 - 6, Add
the required NVIDIA Apache 2.0 license header to the top of
tests/unit/onnx/quantization/autotune/test_config.py: insert the standard NVIDIA
Apache-2.0 license block (including copyright notice/SPDX identifier and license
text) immediately after the existing shebang (#!/usr/bin/env python3) and before
the module docstring so the file complies with the repository's license/header
convention.

Comment on lines +71 to +82
def test_performance_threshold_validation(self):
"""Test that performance_threshold must be >= 1.0."""
# Valid values
config1 = Config(performance_threshold=1.0)
self.assertEqual(config1.performance_threshold, 1.0)

config2 = Config(performance_threshold=1.5)
self.assertEqual(config2.performance_threshold, 1.5)

# Invalid values should not be accepted
# Note: This test assumes validation exists, if not we should add it
print("✓ Config performance_threshold validation")

⚠️ Potential issue | 🟡 Minor

Incomplete validation test - invalid values not actually tested.

The test test_performance_threshold_validation only tests valid values but the comment on line 80-81 indicates invalid values should be rejected. Either implement the validation in Config and test it, or remove the misleading comment.

Option 1: If validation exists, add the test
def test_performance_threshold_validation(self):
    """Test that performance_threshold must be >= 1.0."""
    # Valid values
    config1 = Config(performance_threshold=1.0)
    self.assertEqual(config1.performance_threshold, 1.0)

    config2 = Config(performance_threshold=1.5)
    self.assertEqual(config2.performance_threshold, 1.5)

    # Invalid values should raise ValueError
    with self.assertRaises(ValueError):
        Config(performance_threshold=0.9)

    print("✓ Config performance_threshold validation")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/onnx/quantization/autotune/test_config.py` around lines 71 - 82,
The test test_performance_threshold_validation is incomplete: it only asserts
valid values for Config.performance_threshold but doesn't check that invalid
values are rejected; either add validation in Config or update the test to
assert the expected failure. If Config should validate, implement a check in
Config.__init__ or the performance_threshold setter to raise ValueError for
values < 1.0 and then update test_performance_threshold_validation to include a
with self.assertRaises(ValueError): Config(performance_threshold=0.9) case;
otherwise remove the misleading comment and the "Invalid values" note from
test_performance_threshold_validation.
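If validation is added to `Config`, the usual dataclass idiom is a `__post_init__` check. The `Config` below is a minimal hypothetical stand-in with only the one field, not the real class from this PR:

```python
from dataclasses import dataclass


@dataclass
class Config:
    performance_threshold: float = 1.0

    def __post_init__(self):
        # A threshold below 1.0 would accept QDQ schemes slower than the baseline.
        if self.performance_threshold < 1.0:
            raise ValueError(
                f"performance_threshold must be >= 1.0, got {self.performance_threshold}"
            )


Config(performance_threshold=1.5)  # accepted
try:
    Config(performance_threshold=0.9)
except ValueError as exc:
    print(exc)  # performance_threshold must be >= 1.0, got 0.9
```

With a check like this in place, the test's `with self.assertRaises(ValueError): Config(performance_threshold=0.9)` case would pass.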

@gcunhase (Contributor)

@Hale423 please find my comments below:

1. Please rebase this PR and ensure that the DCO is valid.
2. Please update this PR's description. It currently says "This PR adds automated Q/DQ (Quantize/Dequantize) placement optimization for ONNX models using TensorRT performance measurements." However, Autotune was already introduced previously, so the correct description is that this PR expands on Autotune, correct?
3. Is this PR related to #963 (Draft: Integrate TorchRegionBuilder to AutoQDQ)?

Thanks
