0
votes

Which is faster: four vld1 loads or a single vld4? Obviously the loaded data is not the same, but if I have the choice, which is better, or are they the same?

pld [in]
vld1.u8 { d0 }, [in]!
vld1.u8 { d1 }, [in]!
vld1.u8 { d2 }, [in]!
vld1.u8 { d3 }, [in]!

vs.

pld [in]
vld4.u8 { d0, d1, d2, d3 }, [in]!
1
You realise vld1 can still take a list of up to 4 consecutive registers, right? – Notlikethat
No, I didn't know :-( I feel stupid. I'm reading the documentation thoroughly now... – gregoiregentil

1 Answer

2
votes

vld1.u8 {d0, d1, d2, d3}, [in]! will generally be at least as fast as vld4.u8 with the same register list. This is because vld4 may have to permute (de-interleave) the data after loading it, to get each element into the right register.

Even if vld4 does have to do more work, that extra cost may be hidden behind other factors, so it's not necessarily a big deal.
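
To make the permutation concrete, here is a rough scalar model in plain C of where each of the 32 loaded bytes ends up. It is only a sketch: this is not NEON code and says nothing about timing, and the names regs4, model_vld1_x4 and model_vld4 are made up for illustration. It assumes the four-register forms shown above, with d0..d3 modelled as four 8-byte arrays.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of four 8-byte D registers (d0..d3). */
typedef struct { uint8_t d[4][8]; } regs4;

/* Models vld1.u8 {d0, d1, d2, d3}, [in]:
 * 32 bytes land in the registers in plain memory order. */
static regs4 model_vld1_x4(const uint8_t *in) {
    assert(in != NULL);
    regs4 r;
    for (size_t i = 0; i < 32; ++i)
        r.d[i / 8][i % 8] = in[i];   /* bytes 0..7 -> d0, 8..15 -> d1, ... */
    return r;
}

/* Models vld4.u8 {d0, d1, d2, d3}, [in]:
 * memory is treated as eight 4-byte structures and de-interleaved,
 * so byte i goes to register i % 4, lane i / 4. */
static regs4 model_vld4(const uint8_t *in) {
    assert(in != NULL);
    regs4 r;
    for (size_t i = 0; i < 32; ++i)
        r.d[i % 4][i / 4] = in[i];   /* bytes 0,4,8,... -> d0; 1,5,9,... -> d1 */
    return r;
}
```

With a buffer holding 0, 1, 2, ..., 31, the vld1 model leaves d0 holding bytes 0..7, while the vld4 model leaves d0 holding 0, 4, 8, ..., 28. That shuffle is the extra work vld4 may have to do, so if you don't need de-interleaved data, the multi-register vld1 is the natural choice.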