Facing problem in implementing multiplication of 64 bit variables using ARM Neon intrinsics

Question

I want to use an intrinsic similar to the one shown below in my code.

   int32x2_t vmla_s32 (int32x2_t a, int32x2_t b, int32x2_t c)

The only difference is that my data is 64-bit, i.e. I need int64x2_t vectors. I went through all the relevant intrinsics in the ARM references but didn't find a suitable one. Should I use float data types and then cast the result to int64 (as shown below)? Is that the only option left?

   float64x2_t vmlaq_f64 (float64x2_t a, float64x2_t b, float64x2_t c)

What is the actual result you are trying to achieve? Are you trying to multiply two 64-bit integers and get a 128-bit result? Or are the low 64 bits enough? The 64-bit floating-point format has 53-bit significands. So, when you ask about using that instead, it suggests you do not need full 64-bit multiplications; it suggests that a 53-bit by 53-bit multiply producing the high 53 bits would be enough, which contradicts the statement “I need 64x2_t vectors.” So you need to make the problem clear. — Eric Postpischil
Actually, I am getting overflow in the output (a 32-bit type) when I pass in my input data, which is 32-bit. So I started searching for Neon intrinsics that can multiply 64-bit values (after casting all the 32-bit variables to 64-bit), just to make sure that overflow doesn't happen. — rkc
The information in that comment tells us that 32-bit products are insufficient. It does not tell us how many bits are sufficient in the product. You must specify the problem completely and clearly. What are the minimum and maximum possible values of the first operand, the second operand, and the product? Is an exact product needed, or would a product rounded in the low bits suffice? (Note that the maximum value of the product is not necessarily the product of the maximum values of the operands, if the operands are not independent of each other.) — Eric Postpischil
If you just need a few extra bits, e.g., a 34-bit product, the solution may be very different than if you need a full 64-bit product. And if you only need an approximate result, that may also change the solution. — Eric Postpischil
@EricPostpischil The values keep varying in my case, so I can't predict the maximum and minimum values. I have no idea whether adding a few more bits will solve the problem. However, I would like to know how to multiply 34-bit values, or values of any other size. — rkc
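
For the case described in the comments (32-bit inputs whose products overflow a 32-bit output), one possible route is a widening multiply-accumulate rather than a full 64-bit by 64-bit multiply. The sketch below is only an illustration under that assumption: it uses the existing intrinsic vmlal_s32, which takes a 64-bit accumulator and two 32-bit operand vectors. The helper name mla_widen_2x32 and the scalar fallback for non-Neon builds are hypothetical and not from the thread.

```cpp
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: acc[i] += (int64)b[i] * (int64)c[i] for two lanes.
// The 32-bit inputs are widened before multiplying, so each product is
// exact in 64 bits.
inline void mla_widen_2x32(std::int64_t acc[2],
                           const std::int32_t b[2],
                           const std::int32_t c[2])
{
#if defined(__ARM_NEON)
    // vmlal_s32: int64x2_t result = acc + widen(b) * widen(c)
    const int64x2_t sum = vmlal_s32(vld1q_s64(acc), vld1_s32(b), vld1_s32(c));
    vst1q_s64(acc, sum);
#else
    // Portable fallback for builds without Neon. It assumes the running
    // sum itself stays within the int64_t range.
    for (int i = 0; i < 2; ++i)
        acc[i] += static_cast<std::int64_t>(b[i]) * c[i];
#endif
}
```

A single 32-bit by 32-bit product always fits in 64 bits; whether a long running sum of such products still fits depends on the data, which is the question raised in the comments.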

keith · Accepted Answer · 2021-06-27T12:11:06

For anyone who stumbles across this question (like I did) and wants to know how to implement a two-lane 64-bit multiply on the Neon type int64x2_t using C++ intrinsics: the following is a polyfill for a vmulq_s64 operation, which isn't available on, say, the Apple M1.

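#include <arm_neon.h>

// Low 64 bits of a * b in each lane, built from 32-bit multiplies.
inline int64x2_t arm_vmulq_s64(const int64x2_t& a, const int64x2_t& b)
{
   // Low 32 bits of each 64-bit lane.
   const auto ac = vmovn_u64(vreinterpretq_u64_s64(a));
   const auto pr = vmovn_u64(vreinterpretq_u64_s64(b));

   // Cross products a_hi * b_lo and a_lo * b_hi, each truncated to 32 bits.
   const auto hi = vmulq_u32(vreinterpretq_u32_s64(b), vrev64q_u32(vreinterpretq_u32_s64(a)));

   // Sum each pair of cross products, shift into the upper half, then add a_lo * b_lo.
   const auto lo = vmlal_u32(vshlq_n_u64(vpaddlq_u32(hi), 32), ac, pr);
   return vreinterpretq_s64_u64(lo);
}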

Getting to a vmlaq_s64 equivalent, which I think is what the OP wants, requires combining this with one extra addition (e.g. vaddq_s64).
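
The arithmetic behind the routine above can be checked with a scalar model of a single lane. This is a rough sketch rather than a Neon implementation: the names mul64_lo and mla64 are hypothetical, and the model assumes the goal is the low 64 bits of the product (the a_hi * b_hi term only affects bits 64 and above, so it is dropped).

```cpp
#include <cstdint>

// Scalar model of one 64-bit lane: low 64 bits of a * b, using only
// 32-bit multiplies, step by step like the vector routine.
constexpr std::uint64_t mul64_lo(std::uint64_t a, std::uint64_t b)
{
    const std::uint32_t a_lo = static_cast<std::uint32_t>(a);
    const std::uint32_t a_hi = static_cast<std::uint32_t>(a >> 32);
    const std::uint32_t b_lo = static_cast<std::uint32_t>(b);
    const std::uint32_t b_hi = static_cast<std::uint32_t>(b >> 32);

    // Cross terms, each truncated to 32 bits as a 32-bit lane multiply would.
    const std::uint32_t cross0 = a_hi * b_lo;
    const std::uint32_t cross1 = a_lo * b_hi;

    // Widening pairwise add, then shift into the upper half (wraps mod 2^64).
    const std::uint64_t high = (static_cast<std::uint64_t>(cross0) + cross1) << 32;

    // Widening 32x32 -> 64 multiply of the low halves, accumulated on top.
    return high + static_cast<std::uint64_t>(a_lo) * b_lo;
}

// Hypothetical multiply-accumulate for one lane: acc + a * b (mod 2^64).
// Signed values are handled through their two's-complement bit patterns.
constexpr std::int64_t mla64(std::int64_t acc, std::int64_t a, std::int64_t b)
{
    return static_cast<std::int64_t>(
        static_cast<std::uint64_t>(acc) +
        mul64_lo(static_cast<std::uint64_t>(a), static_cast<std::uint64_t>(b)));
}
```

Because the low 64 bits of a product are the same for signed and unsigned two's-complement operands, the model can work entirely in unsigned arithmetic and convert back only at the end.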
