Hi,
Firstly, thank you for creating this library.
I have 3 questions.
1) What difference does the choice of ONNX Runtime Execution Provider make to the output quantized ONNX model?
I train models in PyTorch and export them to ONNX. I then build TRT engines (with timing caches) through ONNX Runtime using the TRT Execution Provider.
So I am not sure why I would use the TRT Execution Provider for quantization, given that I will use that same provider to create a timing cache after quantization (the result of quantization is just a quantized ONNX model).
Put another way: what is the difference among these three pipelines?
pth -> onnx -> quantized onnx (CPU) -> timing cache (ONNXRuntime, TRT EP)
pth -> onnx -> quantized onnx (CUDA EP) -> timing cache (ONNXRuntime, TRT EP)
pth -> onnx -> quantized onnx (TRT EP) -> timing cache (ONNXRuntime, TRT EP)
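For concreteness, here is a minimal sketch of the final stage shared by all three pipelines (loading the quantized ONNX model through ONNX Runtime's TRT EP so it builds an engine and timing cache). This is my own illustration, not from any library's docs: the cache directory and model path are placeholders, and it assumes `onnxruntime-gpu` with the TensorRT EP available.

```python
# Hypothetical sketch: run a quantized ONNX model through ONNX Runtime's
# TensorRT Execution Provider so the first inference builds a TRT engine
# and populates a timing cache. Paths are placeholders.

TRT_EP_OPTIONS = {
    # Persist the TRT timing cache so later engine builds on the same
    # GPU (family) are faster.
    "trt_timing_cache_enable": True,
    "trt_timing_cache_path": "./trt_cache",
    # Optionally cache the built engine itself as well.
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "./trt_cache",
}

# TensorRT first; CUDA and CPU as fallbacks for nodes TRT cannot take.
PROVIDERS = [
    ("TensorrtExecutionProvider", TRT_EP_OPTIONS),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]


def build_session(quantized_model_path: str):
    """Create the ORT session; the first run triggers the TRT engine
    build and writes the timing cache to trt_timing_cache_path."""
    import onnxruntime as ort  # requires onnxruntime-gpu
    return ort.InferenceSession(quantized_model_path, providers=PROVIDERS)
```

Note this stage is identical in all three pipelines; only the provider used during the *quantization* step before it differs.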
2) Does quantizing with the TRT EP mean that the resulting quantized ONNX model should only (or preferably) be used on the same GPU family the quantization was performed on?
Hope I managed to explain what I am confused about.
Thanks!
Edit: one more question. 3) Does the library support only the RTX 40 and 50 series? (I use an RTX 3090, an RTX 6000 Ada, and some other RTX 30-series GPUs.)