0
votes

I'm computing the incremental mean of my input data (an array of 6 elements, so I'll end up with 6 means).

This is the code I run every time a new input array is available (I also update the number of samples, etc.):

computing_mean: for (int i = 0; i < 6; i++) {
    temp_mean[i] = temp_mean[i] + (input[i] - temp_mean[i]) / number_of_samples;
    // Possible optimization?
    // temp_mean[i] = temp_mean[i] + divide(input[i] - temp_mean[i], number_of_samples);
}

where all the data in the code are arrays or scalars of the following type:

typedef ap_fixed <36,24,AP_RND_CONV,AP_SAT> decimalNumber;

From my synthesis report, this loop has a latency of 324 and an iteration latency of 54, caused mainly by the division operation.

Are there any ways I can speed up the division? I tried hls_math and its divide function, but it doesn't seem to work with my data type.

EDIT 1: I'm including my performance profiler output from Vivado HLS. I'll add self-contained reproducible code in a later edit. As you can see, the majority of the time is spent in SDIV.

ANSI C? Is this some legacy codebase? Vectorizing is just taking advantage of SIMD instructions, it's not a function of C itself. - tadman
If you're asking "is there a magic wand I can wave over this code and make it faster" the answer is no. If you're asking "is there a way to make this code faster by reworking it" the answer is long and complicated, but yes. The difficult, but relatively simplest approach is to see if you can vectorize this with SIMD instructions. The more difficult but possibly better approach is to see if you can dump all of this onto a GPU and write it as a kernel function. - tadman
You're asking for a magic wand. It does not exist. Why do you think this particular line is the problem? Do you have a profiler report? - tadman
That might seem like a lot of time, but it's really not. Division is computationally more expensive, it's true, but it won't be cripplingly expensive in most cases. If you have a performance problem it's probably to do with one of: cache misses, data layout issues, poor structure design, or a fundamentally ineffective algorithm. - tadman
Other than trigonometric functions like sin() and cos(), or things like sqrt(), division will always be the most painful. There are occasions where you can skip division and do bit-shifting instead, if you're working with integer data and the number you're dividing by is a power of 2, but that's about it. - tadman

1 Answer

2
votes

Other than trigonometric functions like sin() (FSIN = ~50-170 cycles) and cos() (FCOS = ~50-120 cycles), or things like sqrt() (FSQRT = ~22 cycles), division will always be the most painful.

FDIV is 15 cycles. FADD and FMUL are both 5.

There are occasions where you can skip division and do bit-shifting instead, if you're working with integer data and the number you're dividing by is a power of 2, but that's about it.

You can look up the approximate CPU cycle cost of any given instruction in tables like this. FDIV is an example of an expensive one.

That being said, one thing you could try is to compute the division factor in advance, then apply it using multiplication instead:

double inverse_n = 1.0 / number_of_samples;

temp_mean[i] = temp_mean[i] + (input[i] - temp_mean[i]) * inverse_n;

I'm not sure that's saving a whole lot, but if you really do need to shave off cycles, it's worth a shot.