
I am currently testing the inference latency of a U-Net network converted with TensorFlow Lite. I am testing three neural networks with the same architecture on a segmentation problem (on my laptop, running Windows):

  1. First model: TensorFlow model (without optimization, created with the Keras interface).
  2. Second model: TensorFlow model optimized with TFLite (converted with the Python TFLite API, without quantization). It is the first model, converted.
  3. Third model: TensorFlow model optimized with TFLite and quantized (converted with the Python TFLite API and quantized with tensorflow.lite.Optimize.DEFAULT). It is also the first model, converted.
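
For reference, the two conversions described above can be sketched as follows. The tiny Keras model here is only a stand-in for the actual U-Net; any Keras model converts the same way:

```python
import tensorflow as tf

# Stand-in for the actual U-Net; the conversion calls are identical.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8, 8, 1)),
    tf.keras.layers.Conv2D(4, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(1, 1, activation="sigmoid"),
])

# Second model: plain TFLite conversion, no quantization.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Third model: conversion with default optimizations, which applies
# dynamic-range quantization (weights stored as int8).
converter_q = tf.lite.TFLiteConverter.from_keras_model(model)
converter_q.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model_quant = converter_q.convert()

print(len(tflite_model), len(tflite_model_quant))
```

Both calls return the serialized flatbuffer as bytes, ready to be written to a .tflite file or loaded directly into an interpreter.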

Indeed, the second model (optimized with TFLite) improves on the time performance of the first model (plain TF model) by a factor of 3 (three times faster). However, the third model (TFLite + quantization) has the worst latency: it is even slower than the first model (plain TF model).
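
The latency comparison can be reproduced with the Python interpreter along these lines. The toy model and the timing loop are illustrative; swap in the real U-Net for a meaningful measurement:

```python
import time
import numpy as np
import tensorflow as tf

# Toy stand-in model; replace with the actual U-Net for real numbers.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8, 8, 1)),
    tf.keras.layers.Conv2D(4, 3, padding="same", activation="relu"),
])

def make_tflite(quantize):
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    if quantize:
        converter.optimizations = [tf.lite.Optimize.DEFAULT]
    return converter.convert()

def mean_latency_ms(tflite_bytes, runs=50):
    interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    x = np.random.rand(*inp["shape"]).astype(np.float32)
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(runs):
        interpreter.set_tensor(inp["index"], x)
        interpreter.invoke()
    return (time.perf_counter() - start) / runs * 1e3

float_ms = mean_latency_ms(make_tflite(quantize=False))
quant_ms = mean_latency_ms(make_tflite(quantize=True))
print(f"float: {float_ms:.3f} ms, quantized: {quant_ms:.3f} ms")
```

Which variant comes out ahead depends on the hardware and the kernels available for it, which is exactly the question here.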

Why is the quantized model the slowest?

Have you checked which hardware is actually used during inference for each model? - Alex K.
It only uses the CPU - IgnacioGaBo

1 Answer


It depends on which kernels your model is running.

Generally, TFLite is optimized for running on mobile devices, so in your quantized-on-desktop case it may be falling back to a slow reference implementation for some op(s).

One way to investigate further is to run the TFLite benchmark tool with --enable_op_profiling=true.

It will run your model with dummy data, profile the ops, and then show you a per-op summary of where the time is spent.

If something looks off, you can file a GitHub issue with details and steps to reproduce, and the team can debug the performance issue.