I am currently testing the inference latency of a U-Net converted with TensorFlow Lite. I am comparing three networks with the same architecture on a segmentation problem (running on my laptop under Windows):
- First model: the original TensorFlow model, built with the Keras interface and without any optimization.
- Second model: the first model converted to TFLite with the Python TFLite API, without quantization.
- Third model: the first model converted to TFLite and quantized with tensorflow.lite.Optimize.DEFAULT.
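For reference, this is roughly how the second and third models are produced. The tiny Sequential model below is just a hypothetical stand-in for the U-Net; any tf.keras.Model converts the same way:

```python
import tensorflow as tf

# Hypothetical stand-in for the U-Net; any tf.keras.Model converts the same way.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 3)),
    tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(1, 1, activation="sigmoid"),
])

# Second model: plain TFLite conversion (no quantization).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

# Third model: same conversion plus dynamic-range quantization
# (weights stored as int8, activations stay float).
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_quant_model = converter.convert()

print(len(tflite_model), len(tflite_quant_model))
```

With Optimize.DEFAULT alone (no representative dataset), the converter applies dynamic-range quantization, so the quantized flatbuffer is smaller, but the runtime still dequantizes weights or requantizes activations on the fly at inference time.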
As expected, the second model (TFLite without quantization) is about three times faster than the first (plain TF) model. However, the third model (TFLite with quantization) is the slowest of the three, even slower than the first model.
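This is a minimal sketch of how I measure the latency, assuming a tiny stand-in model instead of the real U-Net (the benchmark function and timing parameters are my own, not from any official tool):

```python
import time

import numpy as np
import tensorflow as tf

def benchmark(tflite_bytes, runs=20):
    """Average single-inference latency of a TFLite flatbuffer, in seconds."""
    interpreter = tf.lite.Interpreter(model_content=tflite_bytes)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    x = np.random.rand(*inp["shape"]).astype(inp["dtype"])
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(runs):
        interpreter.invoke()
    return (time.perf_counter() - start) / runs

# Tiny stand-in model (the real network is a U-Net).
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
float_model = converter.convert()

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quant_model = converter.convert()

t_float = benchmark(float_model)
t_quant = benchmark(quant_model)
print(f"float: {t_float * 1e3:.2f} ms, quantized: {t_quant * 1e3:.2f} ms")
```

Timings are averaged over several invocations after a warm-up call, so interpreter setup and tensor allocation are excluded from the measurement.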
Why is the quantized model the slowest?