You can quantize the model. As mentioned in the PyTorch documentation:
"PyTorch supports INT8 quantization compared to typical FP32 models allowing for a 4x reduction in the model size and a 4x reduction in memory bandwidth requirements. Hardware support for INT8 computations is typically 2 to 4 times faster compared to FP32 compute."
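For reference, here is a minimal sketch of post-training dynamic quantization in PyTorch, assuming a toy model built from `nn.Linear` layers (the model below is a placeholder, not the one benchmarked in this notebook):

```python
import torch
import torch.nn as nn

# Placeholder FP32 model for illustration only.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Convert Linear weights to INT8; activations are quantized dynamically at runtime.
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```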
This notebook shows benchmarks of running quantized models with ONNX.
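As a rough sketch of how an exported ONNX model could be quantized before benchmarking, assuming an FP32 model saved at `model.onnx` (the file paths are placeholders):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamically quantize the exported FP32 model, storing weights as INT8.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_quantized.onnx",
    weight_type=QuantType.QInt8,
)
```

The quantized file can then be loaded with `onnxruntime.InferenceSession("model_quantized.onnx")` and timed against the FP32 model for the comparisons shown below.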