
Quantization output using different ONNX Execution Providers #1032

@adaber


Hi,

Firstly, thank you for creating this library.

I have 3 questions.

1) What is the difference in the output quantized ONNX model when using different ONNX Execution Providers?

I train models in PyTorch and convert them to ONNX. I then create TRT engines (timing caches) with ONNX Runtime using the TensorRT Execution Provider.

So I am not sure why I would use the TRT Execution Provider for quantization, given that I will use that same provider to create a timing cache after quantizing (the result of quantization is just a quantized ONNX model).

Another way of asking:

1) What is the difference among these three pipelines?

pth -> onnx -> quantized onnx (CPU EP) -> timing cache (ONNX Runtime, TRT EP)
pth -> onnx -> quantized onnx (CUDA EP) -> timing cache (ONNX Runtime, TRT EP)
pth -> onnx -> quantized onnx (TRT EP) -> timing cache (ONNX Runtime, TRT EP)
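To make the distinction concrete, the only thing that differs among the three pipelines is which execution provider runs the model during quantization; the timing-cache step is always driven by the TRT EP afterwards. A minimal sketch of how I understand the provider selection (the provider names and the `trt_timing_cache_enable` / `trt_timing_cache_path` options are taken from ONNX Runtime's TensorRT EP documentation; the session call is illustrative, not code from any specific library):

```python
# EP used while producing the quantized ONNX model (the three variants above).
# Each list is a priority-ordered `providers` argument for ONNX Runtime.
quant_providers = {
    "cpu":  ["CPUExecutionProvider"],
    "cuda": ["CUDAExecutionProvider", "CPUExecutionProvider"],
    "trt":  ["TensorrtExecutionProvider", "CUDAExecutionProvider",
             "CPUExecutionProvider"],
}

# Regardless of which EP quantized the model, the timing cache is built
# later by the TensorRT EP when the quantized model is first loaded.
trt_ep_options = {
    "trt_timing_cache_enable": True,        # persist layer-timing results
    "trt_timing_cache_path": "./trt_cache",  # reused across sessions
}

# Illustrative session creation (requires onnxruntime-gpu with the TRT EP):
# import onnxruntime as ort
# sess = ort.InferenceSession(
#     "model_quant.onnx",
#     providers=[("TensorrtExecutionProvider", trt_ep_options),
#                "CUDAExecutionProvider", "CPUExecutionProvider"])
```

In other words, my question is whether the choice of `quant_providers` variant changes the numerics of the quantized model itself, or only the speed of the quantization run.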

2) Does quantizing using the TRT EP mean that such quantized ONNX models should only (or preferably) be used on the same GPU family the quantization was performed on?

Hope I managed to explain what I am confused about.

Thanks!

Edit: One more question. 3) Does the library support only the RTX 40 and 50 series? (I use an RTX 3090, an RTX 6000 Ada, and several other RTX 30-series GPUs.)
