Speeding up T5 inference 🚀

You can quantize the model. As mentioned in the PyTorch docs:

> PyTorch supports INT8 quantization compared to typical FP32 models allowing for a 4x reduction in the model size and a 4x reduction in memory bandwidth requirements. Hardware support for INT8 computations is typically 2 to 4 times faster compared to FP32 compute.
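As a minimal sketch (assuming the `transformers` and `sentencepiece` packages are installed, and using `t5-small` only as an example checkpoint), dynamic quantization can be applied to a T5 model like this:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Example checkpoint; any T5 variant should work the same way.
model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Dynamic quantization stores the Linear layers' weights in INT8;
# activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("translate English to German: Hello, world!", return_tensors="pt")
outputs = quantized_model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

This is CPU-oriented: dynamic quantization mainly helps CPU inference, so benchmark it on your own workload.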

This notebook shows benchmarks for running quantized models with ONNX.
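If you go the ONNX route, ONNX Runtime has its own dynamic quantization step that runs on an already-exported graph. A hedged sketch (the file name `t5_encoder.onnx` is hypothetical; exporting T5's encoder/decoder graphs is what the notebook covers in detail):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="t5_encoder.onnx",        # path to the exported FP32 graph (example name)
    model_output="t5_encoder_int8.onnx",  # where to write the INT8 graph
    weight_type=QuantType.QInt8,          # store weights as signed INT8
)
```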