The TensorFlow documentation for dynamic range quantization states:

At inference, weights are converted from 8-bits of precision to floating point and computed using floating-point kernels. This conversion is done once and cached to reduce latency.

Also, in dynamic range quantization the activations are always stored in float32; they are converted to 8-bit integers for processing and back to floating point once processing is done.

I am confused: if the weights are converted back to float32 at inference time, where does the quantization actually happen?


1 Answer

Quoting https://www.tensorflow.org/lite/performance/post_training_quant:

In addition, TFLite supports on the fly quantization and dequantization of activations to allow for:

- Using quantized kernels for faster implementation when available.
- Mixing of floating-point kernels with quantized kernels for different parts of the graph.

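So it depends on the kernel. If the kernel has an optimized path that supports quantized computation, the float activations are quantized on the fly to 8-bit integers and combined with the 8-bit weights, and the result is converted back to float.

Otherwise, the activations are kept in float and the weights are dequantized to float for inference. In both cases the weights are stored in the model file as 8-bit values, which is where the quantization (and the size reduction) comes from.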
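
To make the two paths concrete, here is a minimal sketch in plain Python. It assumes a symmetric per-tensor int8 scheme, and the helper names (`quantize`, `dequantize`, `dot_float_fallback`, `dot_hybrid`) are made up for illustration; they are not TFLite functions, and the real kernels are more involved.

```python
# Illustrative sketch only: symmetric per-tensor int8 quantization.
# All helper names are hypothetical, not TFLite APIs.

def quantize(values, num_bits=8):
    """Map floats to signed integers using one symmetric scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [qi * scale for qi in q]


def dot_float_fallback(q_weights, w_scale, activations):
    """No quantized kernel: dequantize the weights, compute in float."""
    weights = dequantize(q_weights, w_scale)
    return sum(w * a for w, a in zip(weights, activations))


def dot_hybrid(q_weights, w_scale, activations):
    """Quantized kernel: quantize activations on the fly, accumulate
    in integers, then rescale the result back to float."""
    q_acts, a_scale = quantize(activations)
    acc = sum(qw * qa for qw, qa in zip(q_weights, q_acts))
    return acc * w_scale * a_scale


weights = [0.5, -1.0, 0.25, 0.75]
activations = [1.0, 2.0, -0.5, 0.1]

# Weights are quantized once, when the model is converted.
q_weights, w_scale = quantize(weights)

reference = sum(w * a for w, a in zip(weights, activations))
fallback = dot_float_fallback(q_weights, w_scale, activations)
hybrid = dot_hybrid(q_weights, w_scale, activations)

print(reference, fallback, hybrid)
```

Both results should land close to the full-float reference. The difference between the paths is only where the arithmetic happens; the stored weights are 8-bit integers either way.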