TensorFlow Model is still floating point after Post-training quantization

Question

After applying post-training quantization, my custom CNN model shrank to 1/4 of its original size (from 56.1 MB to 14 MB). I put the image to be predicted (100x100x3) into a ByteBuffer of 100x100x3 = 30,000 bytes. However, I got the following error during inference:

java.lang.IllegalArgumentException: Cannot convert between a TensorFlowLite buffer with 120000 bytes and a ByteBuffer with 30000 bytes.
        at org.tensorflow.lite.Tensor.throwExceptionIfTypeIsIncompatible(Tensor.java:221)
        at org.tensorflow.lite.Tensor.setTo(Tensor.java:93)
        at org.tensorflow.lite.NativeInterpreterWrapper.run(NativeInterpreterWrapper.java:136)
        at org.tensorflow.lite.Interpreter.runForMultipleInputsOutputs(Interpreter.java:216)
        at org.tensorflow.lite.Interpreter.run(Interpreter.java:195)
        at gov.nih.nlm.malaria_screener.imageProcessing.TFClassifier_Lite.recongnize(TFClassifier_Lite.java:102)
        at gov.nih.nlm.malaria_screener.imageProcessing.TFClassifier_Lite.process_by_batch(TFClassifier_Lite.java:145)
        at gov.nih.nlm.malaria_screener.Cells.runCells(Cells.java:269)
        at gov.nih.nlm.malaria_screener.CameraActivity.ProcessThinSmearImage(CameraActivity.java:1020)
        at gov.nih.nlm.malaria_screener.CameraActivity.access$600(CameraActivity.java:75)
        at gov.nih.nlm.malaria_screener.CameraActivity$8.run(CameraActivity.java:810)
        at java.lang.Thread.run(Thread.java:762)

The input image size to the model is 100x100x3, and I'm currently predicting one image at a time, so I'm allocating a ByteBuffer of 100x100x3 = 30,000 bytes. However, the log above says the TensorFlow Lite buffer has 120,000 bytes. This makes me suspect that the converted tflite model is still in float format. Is this expected behavior? How can I get a quantized model that takes input images in 8-bit precision, as in the example from the official TensorFlow repository?

In the example code, the ByteBuffer used as input for tflite.run() is in 8-bit precision for the quantized model.

But I also read in the Google docs: "At inference, weights are converted from 8-bits of precision to floating-point and computed using floating point kernels." These two statements seem to contradict each other.

private static final int BATCH_SIZE = 1;

private static final int DIM_IMG_SIZE = 100;

private static final int DIM_PIXEL_SIZE = 3;

private static final int BYTE_NUM = 1;

imgData = ByteBuffer.allocateDirect(BYTE_NUM * BATCH_SIZE * DIM_IMG_SIZE * DIM_IMG_SIZE * DIM_PIXEL_SIZE);
imgData.order(ByteOrder.nativeOrder());

... ...

int pixel = 0;

for (int i = 0; i < DIM_IMG_SIZE; ++i) {
    for (int j = 0; j < DIM_IMG_SIZE; ++j) {
        final int val = intValues[pixel++];

        imgData.put((byte) ((val >> 16) & 0xFF));
        imgData.put((byte) ((val >> 8) & 0xFF));
        imgData.put((byte) (val & 0xFF));

        // imgData.putFloat(((val >> 16) & 0xFF) / 255.0f);
        // imgData.putFloat(((val >> 8) & 0xFF) / 255.0f);
        // imgData.putFloat((val & 0xFF) / 255.0f);
    }
}

... ...

tfLite.run(imgData, labelProb);

Post-training quantization code:

import tensorflow as tf
import sys
import os

saved_model_dir = '/home/yuh5/Downloads/malaria_thinsmear.h5.pb'

input_arrays = ["input_2"]

output_arrays = ["output_node0"]

converter = tf.contrib.lite.TocoConverter.from_frozen_graph(saved_model_dir, input_arrays, output_arrays)

converter.post_training_quantize = True

tflite_model = converter.convert()

# Write the converted model; the with-block closes the file handle.
with open("thinSmear_100.tflite", "wb") as f:
    f.write(tflite_model)

rednuht rednuht · Accepted Answer · 2018-10-24T14:33:46

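Post-training quantization does not change the format of the input or output layers, so the model still expects float32 input: 100x100x3 values at 4 bytes each is exactly the 120,000 bytes in the error message. You can run your model with data in the same format as used for training.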
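A minimal sketch of what that could look like on the Android side, reusing the constant names from the question. The class name FloatInputSketch and the helper toFloatBuffer are hypothetical, and the division by 255 is an assumption taken from the commented-out lines in the question; use whatever scaling the model was actually trained with.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class FloatInputSketch {
    static final int BATCH_SIZE = 1;
    static final int DIM_IMG_SIZE = 100;
    static final int DIM_PIXEL_SIZE = 3;
    static final int BYTE_NUM = 4; // float32 input: 4 bytes per channel value

    /** Packs ARGB pixels into a float32 buffer (hypothetical helper). */
    static ByteBuffer toFloatBuffer(int[] intValues) {
        if (intValues.length != DIM_IMG_SIZE * DIM_IMG_SIZE) {
            throw new IllegalArgumentException("expected 100x100 pixels");
        }
        ByteBuffer imgData = ByteBuffer.allocateDirect(
                BYTE_NUM * BATCH_SIZE * DIM_IMG_SIZE * DIM_IMG_SIZE * DIM_PIXEL_SIZE);
        imgData.order(ByteOrder.nativeOrder());
        for (int val : intValues) {
            // Assumed preprocessing: scale each channel to [0, 1].
            imgData.putFloat(((val >> 16) & 0xFF) / 255.0f); // red
            imgData.putFloat(((val >> 8) & 0xFF) / 255.0f);  // green
            imgData.putFloat((val & 0xFF) / 255.0f);         // blue
        }
        imgData.rewind();
        return imgData;
    }

    public static void main(String[] args) {
        int[] intValues = new int[DIM_IMG_SIZE * DIM_IMG_SIZE]; // placeholder pixels
        ByteBuffer imgData = toFloatBuffer(intValues);
        System.out.println(imgData.capacity()); // prints 120000
    }
}
```

The resulting buffer has the 120,000 bytes the interpreter asks for and can be passed to tfLite.run(imgData, labelProb) in place of the 30,000-byte one.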

You may look into quantization-aware training to generate fully-quantized models, but I have no experience with it.

As for the sentence "At inference, weights are converted from 8-bits of precision to floating-point and computed using floating point kernels": this means that the weights are "de-quantized" to floating-point values in memory and computed with FP instructions, instead of performing integer operations.
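To illustrate the idea only, here is a small sketch of de-quantizing stored 8-bit weights back to floats before computing with them. The affine mapping real = scale * (q - zeroPoint), the class name DequantizeSketch, and the scale and zero-point values are all assumptions made for the example; the thread does not say which scheme the converter actually applies.

```java
class DequantizeSketch {
    /** Expands 8-bit stored weights to float32 (illustrative affine scheme). */
    static float[] dequantize(byte[] stored, float scale, int zeroPoint) {
        float[] weights = new float[stored.length];
        for (int i = 0; i < stored.length; i++) {
            weights[i] = scale * (stored[i] - zeroPoint);
        }
        return weights;
    }

    public static void main(String[] args) {
        byte[] stored = {-128, 0, 64, 127}; // 1 byte per weight in the file
        float[] weights = dequantize(stored, 0.5f, 0); // hypothetical scale and zero point
        for (float w : weights) {
            System.out.println(w); // -64.0, 0.0, 32.0, 63.5
        }
        // Storage cost: 1 byte per weight on disk versus 4 bytes as float32.
        System.out.println(stored.length + " bytes stored, "
                + weights.length * Float.BYTES + " bytes in memory");
    }
}
```

Storing one byte per weight instead of four is consistent with the file shrinking to about a quarter of its size, while the arithmetic, and therefore the input and output tensors, stay in floating point.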
