Speeding up T5 inference 🚀

You can quantize the model. As mentioned in the PyTorch docs:

> PyTorch supports INT8 quantization compared to typical FP32 models allowing for a 4x reduction in the model size and a 4x reduction in memory bandwidth requirements. Hardware support for INT8 computations is typically 2 to 4 times faster compared to FP32 compute.
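As a minimal sketch (assuming the `transformers` and `sentencepiece` packages are installed, and using `t5-small` only as an example checkpoint), dynamic quantization can be applied to a T5 model like this:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Example checkpoint; any T5 variant should work the same way.
model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

# Dynamic quantization stores the Linear layers' weights in INT8;
# activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

inputs = tokenizer("translate English to German: Hello, world!", return_tensors="pt")
outputs = quantized_model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

This is CPU-oriented: dynamic quantization mainly helps CPU inference, so benchmark it on your own workload.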

This notebook shows benchmarks for running quantized models with ONNX.
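If you go the ONNX route, ONNX Runtime has its own dynamic quantization step that runs on an already-exported graph. A hedged sketch (the file name `t5_encoder.onnx` is hypothetical; exporting T5's encoder/decoder graphs is what the notebook covers in detail):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="t5_encoder.onnx",        # path to the exported FP32 graph (example name)
    model_output="t5_encoder_int8.onnx",  # where to write the INT8 graph
    weight_type=QuantType.QInt8,          # store weights as signed INT8
)
```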