I serve my TF model with TensorFlow Serving 2.1.0 through Docker and run a stress test with JMeter. There is a problem: TPS reaches about 4400 when testing with a single data item, but only about 1700 with multiple data items read from a txt file. The model is a BiLSTM that I trained without any cache settings. All experiments run on the local server rather than over the network.
Metrics:
In the single-data task, I send HTTP requests with identical data, with no interval, using 30 request threads for 10 minutes:
- TPS: 4491
- CPU occupied: 2100%
- 99% latency line (ms): 17
- error rate: 0
In the multiple-data task, requests draw different data items from the txt file:
- TPS: 1711
- CPU occupied: 2300%
- 99% latency line (ms): 42
- error rate: 0
Hardware:
- CPU cores: 12
- logical processors: 24
- Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
Is there a cache in TensorFlow Serving?
Why is the TPS with single-data testing three times higher than with varied-data testing in this stress test?