
Quantization output using different ONNX Execution Providers #1032

@adaber


Hi,

Firstly, thank you for creating this library.

I have 3 questions.

1) What is the difference in the output quantized ONNX model when using different ONNX Execution Providers?

I train models in PyTorch and convert them to ONNX. I then create TRT engines (timing caches) with ONNX Runtime using the TensorRT Execution Provider.

So I am not sure why I would use the TRT Execution Provider for quantization, given that I will use that same provider to create a timing cache after quantizing (the result of quantization is just a quantized ONNX model).

Another way of asking:

1) What is the difference among these three pipelines?

pth -> onnx -> quantized onnx (CPU EP) -> timing cache (ONNX Runtime, TRT EP)
pth -> onnx -> quantized onnx (CUDA EP) -> timing cache (ONNX Runtime, TRT EP)
pth -> onnx -> quantized onnx (TRT EP) -> timing cache (ONNX Runtime, TRT EP)
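To make the distinction concrete, the only thing that differs among the three pipelines is which execution provider runs the model during quantization; the timing-cache step is always driven by the TRT EP afterwards. A minimal sketch of how I understand the provider selection (the provider names and the `trt_timing_cache_enable` / `trt_timing_cache_path` options are taken from ONNX Runtime's TensorRT EP documentation; the session call is illustrative, not code from any specific library):

```python
# EP used while producing the quantized ONNX model (the three variants above).
# Each list is a priority-ordered `providers` argument for ONNX Runtime.
quant_providers = {
    "cpu":  ["CPUExecutionProvider"],
    "cuda": ["CUDAExecutionProvider", "CPUExecutionProvider"],
    "trt":  ["TensorrtExecutionProvider", "CUDAExecutionProvider",
             "CPUExecutionProvider"],
}

# Regardless of which EP quantized the model, the timing cache is built
# later by the TensorRT EP when the quantized model is first loaded.
trt_ep_options = {
    "trt_timing_cache_enable": True,        # persist layer-timing results
    "trt_timing_cache_path": "./trt_cache",  # reused across sessions
}

# Illustrative session creation (requires onnxruntime-gpu with the TRT EP):
# import onnxruntime as ort
# sess = ort.InferenceSession(
#     "model_quant.onnx",
#     providers=[("TensorrtExecutionProvider", trt_ep_options),
#                "CUDAExecutionProvider", "CPUExecutionProvider"])
```

In other words, my question is whether the choice of `quant_providers` variant changes the numerics of the quantized model itself, or only the speed of the quantization run.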

2) Does quantizing using the TRT EP mean that such quantized ONNX models should only (or preferably) be used on the same GPU family the quantization was performed on?

Hope I managed to explain what I am confused about.

Thanks!

Edit: One more question. 3) Does the library support only the RTX 40 and 50 series? (I use an RTX 3090, an RTX 6000 Ada, and several other RTX 30-series GPUs.)
