I'm currently working to translate a model from TensorFlow to TensorFlow Lite. I have converted the model from a regular TF 1.x session into a .tflite file by first creating a checkpoint and a saved weightless graph (.pbtxt) and then freezing the model into a .pb with graph weights using the freeze_graph() function, and then finally running the tflite_convert command on my frozen model file. There was no quantization during this process - floats were preserved. After that, I put the model into Android Studio and ran it on my Motorola Moto G7 Power with an Adreno 506 GPU.

There are, however, two major differences between my original model and the one that TF Lite is running. My inference accuracy is roughly 2% lower on TF Lite (I'm currently looking into this), and GPU compute is much slower than CPU compute on my phone. On my computer, running on the GPU considerably sped up inference; on my phone, however, GPU inference is ~30x slower than CPU inference.

I have seen that there may be issues with the fact that my phone is running Android 9, which supports only ~30 NNAPI operations whereas Android 10 supports ~100 (video, at 6:19), but I haven't seen anything like that regarding the GPU delegate. My model is just a simple MLP with a hidden layer size of 200, and I can't see its simple operations being unsupported and causing lag from switching between CPU and GPU.

The model takes an input array of size [N]x[384] and outputs an array of size [N]x[1], with N being the number of 384-element inputs I feed in at a time. N is between 400 and 800 for all of the input arrays I have fed in so far, but I have also tried larger N to see whether the slowdown was due to the creation of a delegate kernel when GPU inference runs. For large N, the GPU inference time approaches the CPU time, which makes me think the GPU delegate is simply falling back on my phone's CPU for its computations.

Here are some examples of CPU/GPU timing compared to the size of N:

N = 500
CPU: 21ms
GPU: 601ms

N = 5,000
CPU: 454ms
GPU: 1004ms

N = 10,000
CPU: 949ms
GPU: 1490ms

Note how the GPU time is a roughly constant 540 to 580 ms slower than the CPU time, making me think that this fixed overhead is spent on the delegate kernel creation, after which the inference ends up running entirely on the CPU.
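
Working the gap out from the timings listed above makes the fixed overhead easier to see. This is only a small arithmetic sketch: the class name `TimingGap` and the helper `gapMs` are made up for illustration, and the numbers are copied straight from the three measurements in this question.

```java
public class TimingGap {
    // Measurements copied from the question: {N, CPU ms, GPU ms}.
    static final long[][] RUNS = {
        {500, 21, 601},
        {5_000, 454, 1004},
        {10_000, 949, 1490},
    };

    // Extra time the GPU path took compared with the CPU path.
    static long gapMs(long cpuMs, long gpuMs) {
        return gpuMs - cpuMs;
    }

    public static void main(String[] args) {
        for (long[] run : RUNS) {
            System.out.printf("N=%d gap=%d ms%n", run[0], gapMs(run[1], run[2]));
        }
        // Prints gaps of 580, 550 and 541 ms for N = 500, 5000 and 10000.
    }
}
```

The gap stays within about 40 ms while N grows 20-fold, which is what a one-off cost plus CPU-speed inference would look like.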

I am creating my GPU delegate using this code:

    GpuDelegate delegate = new GpuDelegate();
    Interpreter.Options options = (new Interpreter.Options()).addDelegate(delegate);

And creating and running my interpreter with this code:

    Interpreter tfliteGPU = new Interpreter(loadedFile, options);
    tfliteGPU.run(inputArray, outputArray);

And I am using TF Lite Nightly 0.0.0 for TF Lite GPU and TF Lite Base:

    implementation 'org.tensorflow:tensorflow-lite:0.0.0-nightly'
    implementation 'org.tensorflow:tensorflow-lite-gpu:0.0.0-nightly'

Why might this be happening? Any help is appreciated!

I think it is the new Interpreter(loadedFile, options); call that accounts for the processing time. - Wei Liu

1 Answer

Check your model with the benchmark app to see whether all of its ops are actually running on the GPU. The issue might be that the model is falling back to the CPU, and that fallback is what is costing the processing time.
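
Before reaching for the benchmark app, it may also help to time interpreter construction and inference separately, since the comment above suspects the constructor. The following is a minimal sketch, not TF Lite API: `PhaseTimer`, `Timed` and `time` are hypothetical helper names, and the two lambdas in `main` are plain stand-ins so the example runs on a desktop JVM without Android.

```java
import java.util.function.Supplier;

public class PhaseTimer {
    /** A result paired with the wall-clock time it took to produce. */
    public static final class Timed<T> {
        public final T value;
        public final long nanos;

        Timed(T value, long nanos) {
            this.value = value;
            this.nanos = nanos;
        }

        public double millis() {
            return nanos / 1_000_000.0;
        }
    }

    /** Runs the given work once and records how long it took. */
    public static <T> Timed<T> time(Supplier<T> work) {
        long start = System.nanoTime();
        T value = work.get();
        return new Timed<>(value, System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Stand-in for: new Interpreter(loadedFile, options)
        Timed<float[]> setup = time(() -> new float[384]);
        // Stand-in for: tfliteGPU.run(inputArray, outputArray)
        Timed<Integer> run = time(() -> setup.value.length);

        System.out.printf("setup: %.3f ms, run: %.3f ms%n",
                setup.millis(), run.millis());
    }
}
```

In the app, the first lambda would wrap `new Interpreter(loadedFile, options)` and the second would wrap `tfliteGPU.run(inputArray, outputArray)` (returning `null`, since `run` returns nothing). If the roughly half-second overhead shows up in the first measurement rather than the second, that points at delegate setup rather than at per-inference cost.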