Consider the following two code snippets. The first:
    void run_new(const float* src, float* dst,
                 size_t IH, size_t IW, size_t OH, size_t OW,
                 size_t N) {
        rep(n, N) {
            const float* src_ptr = src + IW * IH * n;
            float* outptr = dst;
            const float* r0 = src_ptr;
            const float* r1 = src_ptr + IW;
            float32x4_t k0123 = vdupq_n_f32(3.f);
            rep(h, OH) {
                size_t width = OW >> 2;
                asm volatile(
                    "dup v21.4s, %4.s[0] \n"
                    "dup v22.4s, %4.s[1] \n"
                    "dup v23.4s, %4.s[2] \n"
                    "dup v24.4s, %4.s[3] \n"
                    "mov x3, xzr \n"
                    "0: \n"
                    "ldr q0, [%1] \n"
                    "ld1 {v1.4s, v2.4s}, [%2], #32 \n"
                    "add x3, x3, #0x1 \n"
                    "cmp %0, x3 \n"
                    "ld1 {v3.4s, v4.4s}, [%3], #32 \n"
                    "fmla v0.4s, v1.4s, v21.4s \n" // src[i] * k[i]
                    "fmla v0.4s, v2.4s, v22.4s \n"
                    "fmla v0.4s, v3.4s, v23.4s \n"
                    "fmla v0.4s, v4.4s, v24.4s \n"
                    "str q0, [%1], #16 \n"
                    "bne 0b \n"
                    : "+r"(width), "+r"(outptr), "+r"(r0), "+r"(r1)
                    : "w"(k0123)
                    : "cc", "memory", "x3", "v0", "v1", "v2", "v3", "v4",
                      "v21", "v22", "v23", "v24");
            }
        }
    }
The second code snippet:
    void run_origin(const float* src, float* dst,
                    size_t IH, size_t IW, size_t OH, size_t OW,
                    size_t N) {
        rep(n, N) {
            const float* src_ptr = src + IW * IH * n;
            float* outptr = dst;
            const float* r0 = src_ptr;
            const float* r1 = src_ptr + IW;
            float32x4_t k0123 = vdupq_n_f32(3.f);
            rep(h, OH) {
                size_t width = OW >> 2;
                asm volatile(
                    "dup v21.4s, %4.s[0] \n"
                    "dup v22.4s, %4.s[1] \n"
                    "dup v23.4s, %4.s[2] \n"
                    "dup v24.4s, %4.s[3] \n"
                    "mov x3, xzr \n"
                    "mov x4, xzr \n"
                    "0: \n"
                    "add x19, %2, x4 \n"
                    "ldr q0, [%1] \n" // load dst 0, 1, 2, 3
                    "ld1 {v1.4s, v2.4s}, [x19]\n" // 1, 2, 4, 6
                    "add x3, x3, #0x1 \n"
                    "cmp %0, x3 \n"
                    "add x19, %3, x4 \n"
                    "ld1 {v3.4s, v4.4s}, [x19]\n"
                    "fmla v0.4s, v1.4s, v21.4s \n" // src[i] * k[i]
                    "fmla v0.4s, v2.4s, v22.4s \n"
                    "fmla v0.4s, v3.4s, v23.4s \n"
                    "fmla v0.4s, v4.4s, v24.4s \n"
                    "add x4, x4, #0x20 \n"
                    "str q0, [%1], #16 \n"
                    "bne 0b \n"
                    "add %2, %2, x4 \n"
                    "add %3, %3, x4 \n"
                    : "+r"(width), "+r"(outptr), "+r"(r0), "+r"(r1)
                    : "w"(k0123)
                    : "cc", "memory", "x3", "x4", "x19", "v0", "v1", "v2", "v3", "v4",
                      "v21", "v22", "v23", "v24");
            }
        }
    }
The complete code is in "Test performance of arm neon assembly".
I tested the performance of these two snippets on a xiaomi5s, a xiaomi6, and a redmi. The detailed results are:
N: 12 IH: 224 IW: 224 OH: 112 OW: 112
- perf origin: 325.35058 mflops --- new: 4275.63483 mflops --- speedup: 13.14162 xiaomi5s
- perf origin: 3082.00078 mflops --- new: 3063.45047 mflops --- speedup: 0.99398 xiaomi6
- perf origin: 1761.05058 mflops --- new: 1814.37185 mflops --- speedup: 1.03028 redmi
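The listings do not say how the mflops figures are derived. One plausible accounting, shown here purely as an assumption, counts each 4-lane fmla as 8 floating-point operations (4 multiplies and 4 adds), which gives 32 per inner-loop iteration. The helper names kernel_flops and to_mflops are hypothetical and are not taken from the benchmark.

```cpp
#include <cstddef>

// Floating-point operations for one full run, ASSUMING 4 fmla x 4 lanes x 2
// operations per inner-loop iteration and OW/4 iterations per output row.
constexpr double kernel_flops(std::size_t N, std::size_t OH, std::size_t OW) {
    return static_cast<double>(N) * OH * (OW >> 2) * 32.0;
}

// Convert an operation count and an elapsed time in seconds to MFLOPS.
constexpr double to_mflops(double flops, double seconds) {
    return flops / seconds / 1e6;
}
```

Under this assumption, one run with N: 12, OH: 112, OW: 112 is 1204224 operations. The speedup column is simply new divided by origin, for example 4275.63483 / 325.35058 ≈ 13.14 for the xiaomi5s.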
The following tests were all run on the xiaomi5s.
N: 12 IH:48-256 IW: 224
- N: 12 IH: 48 IW: 224 OH: 24 OW: 112 perf origin: 3721.16633 mflops --- new: 4935.31729 mflops --- speedup: 1.32628
- N: 12 IH: 80 IW: 224 OH: 40 OW: 112 perf origin: 1185.58378 mflops --- new: 3852.38266 mflops --- speedup: 3.24936
- N: 12 IH: 112 IW: 224 OH: 56 OW: 112 perf origin: 1021.83468 mflops --- new: 3503.70672 mflops --- speedup: 3.42884
- N: 12 IH: 144 IW: 224 OH: 72 OW: 112 perf origin: 797.61461 mflops --- new: 4167.12780 mflops --- speedup: 5.22449
- N: 12 IH: 176 IW: 224 OH: 88 OW: 112 perf origin: 465.55073 mflops --- new: 4084.54206 mflops --- speedup: 8.77357
- N: 12 IH: 208 IW: 224 OH: 104 OW: 112 perf origin: 373.99237 mflops --- new: 4255.78687 mflops --- speedup: 11.37934
- N: 12 IH: 240 IW: 224 OH: 120 OW: 112 perf origin: 341.57406 mflops --- new: 4290.58840 mflops --- speedup: 12.56122
N: 12 IH:224 IW: 48-256
- N: 12 IH: 224 IW: 48 OH: 112 OW: 24 perf origin: 3660.35916 mflops --- new: 4729.61877 mflops --- speedup: 1.29212
- N: 12 IH: 224 IW: 80 OH: 112 OW: 40 perf origin: 2918.48755 mflops --- new: 4748.17285 mflops --- speedup: 1.62693
- N: 12 IH: 224 IW: 112 OH: 112 OW: 56 perf origin: 951.03852 mflops --- new: 4051.84318 mflops --- speedup: 4.26044
- N: 12 IH: 224 IW: 144 OH: 112 OW: 72 perf origin: 1186.74405 mflops --- new: 4160.18572 mflops --- speedup: 3.50555
- N: 12 IH: 224 IW: 176 OH: 112 OW: 88 perf origin: 533.47286 mflops --- new: 4199.36622 mflops --- speedup: 7.87175
- N: 12 IH: 224 IW: 208 OH: 112 OW: 104 perf origin: 447.30682 mflops --- new: 4092.22256 mflops --- speedup: 9.14858
- N: 12 IH: 224 IW: 240 OH: 112 OW: 120 perf origin: 442.58206 mflops --- new: 4200.13672 mflops --- speedup: 9.49007
N: 2-12 IH: 224 IW: 224
- N: 2 IH: 224 IW: 224 OH: 112 OW: 112 perf origin: 3794.45684 mflops --- new: 5236.48508 mflops --- speedup: 1.38004
- N: 3 IH: 224 IW: 224 OH: 112 OW: 112 perf origin: 3790.20521 mflops --- new: 5150.30622 mflops --- speedup: 1.35885
- N: 4 IH: 224 IW: 224 OH: 112 OW: 112 perf origin: 2117.55521 mflops --- new: 4329.34274 mflops --- speedup: 2.04450
- N: 5 IH: 224 IW: 224 OH: 112 OW: 112 perf origin: 1290.43541 mflops --- new: 3915.65607 mflops --- speedup: 3.03437
- N: 6 IH: 224 IW: 224 OH: 112 OW: 112 perf origin: 1038.86926 mflops --- new: 3747.69392 mflops --- speedup: 3.60747
- N: 7 IH: 224 IW: 224 OH: 112 OW: 112 perf origin: 845.26878 mflops --- new: 4025.81237 mflops --- speedup: 4.76276
- N: 8 IH: 224 IW: 224 OH: 112 OW: 112 perf origin: 658.23150 mflops --- new: 3971.62335 mflops --- speedup: 6.03378
- N: 9 IH: 224 IW: 224 OH: 112 OW: 112 perf origin: 527.99489 mflops --- new: 4163.94501 mflops --- speedup: 7.88634
- N: 10 IH: 224 IW: 224 OH: 112 OW: 112 perf origin: 416.75353 mflops --- new: 4119.03296 mflops --- speedup: 9.88362
- N: 11 IH: 224 IW: 224 OH: 112 OW: 112 perf origin: 378.38875 mflops --- new: 4203.33717 mflops --- speedup: 11.10852
- N: 12 IH: 224 IW: 224 OH: 112 OW: 112 perf origin: 350.36924 mflops --- new: 4202.19842 mflops --- speedup: 11.99363
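Reading the sweeps above, run_origin appears to slow down as the input grows, while run_new stays roughly flat. A quick way to put numbers on that is to compute the input size in bytes for each configuration. The helper below is a hypothetical illustration (input_bytes is not part of the benchmark), and it assumes the input is a dense N * IH * IW array of float, which is what the src + IW * IH * n indexing suggests.

```cpp
#include <cstddef>

// Bytes occupied by a dense N x IH x IW float input (ASSUMED layout, based on
// the `src + IW * IH * n` indexing in the kernels).
constexpr std::size_t input_bytes(std::size_t N, std::size_t IH, std::size_t IW) {
    return N * IH * IW * sizeof(float);
}
```

With 4-byte floats, input_bytes(3, 224, 224) is 602112 bytes (588 KiB) and input_bytes(4, 224, 224) is 802816 bytes (784 KiB), which brackets the point where the run_origin numbers in the N sweep start to drop. Whether this reflects a cache or prefetch effect on the Snapdragon 821 is only a guess and would need to be confirmed, for example with hardware performance counters.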
I am confused by the results on the xiaomi5s: why is run_origin (the second snippet above) so slow there? My guess is that the NEON pipeline stalls when a load has to wait for a general-purpose register, for example when ld1 {v3.4s, v4.4s}, [x19] waits for x19, which is computed by add x19, %3, x4. But I am not sure.
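To separate the addressing question from the arithmetic, it can help to have a portable scalar model of the two loops. The sketch below is not the benchmark code: row_post_increment, row_base_plus_offset, and step are hypothetical names, the fmla lanes are modelled with plain float arithmetic, and the four broadcast coefficients are passed as an array. It mirrors only how each variant forms its load addresses (bumping r0/r1 after every load versus keeping them fixed and adding a running offset, as x4 does), so the two can be checked to produce identical output.

```cpp
#include <cassert>
#include <cstddef>

// Scalar model of one inner-loop step: out[0..3] accumulates four 4-lane
// vectors (two from row a, two from row b), each scaled by one coefficient,
// in the same order as the four fmla instructions.
static void step(float* out, const float* a, const float* b, const float k[4]) {
    for (int lane = 0; lane < 4; ++lane) {
        out[lane] += a[lane] * k[0];      // v1 * v21
        out[lane] += a[4 + lane] * k[1];  // v2 * v22
        out[lane] += b[lane] * k[2];      // v3 * v23
        out[lane] += b[4 + lane] * k[3];  // v4 * v24
    }
}

// run_new style: r0/r1 advance by 8 floats (32 bytes) after each load,
// like `ld1 {...}, [%2], #32`.
void row_post_increment(float*& out, const float*& r0, const float*& r1,
                        std::size_t width, const float k[4]) {
    assert(out && r0 && r1);
    for (std::size_t i = 0; i < width; ++i) {
        step(out, r0, r1, k);
        r0 += 8;
        r1 += 8;
        out += 4;
    }
}

// run_origin style: r0/r1 stay fixed inside the loop, every load address is
// base + offset, and the bases are bumped once after the loop.
void row_base_plus_offset(float*& out, const float*& r0, const float*& r1,
                          std::size_t width, const float k[4]) {
    assert(out && r0 && r1);
    std::size_t offset = 0;  // plays the role of x4, counted in floats here
    for (std::size_t i = 0; i < width; ++i) {
        step(out, r0 + offset, r1 + offset, k);
        offset += 8;
        out += 4;
    }
    r0 += offset;
    r1 += offset;
}
```

Both variants perform the same arithmetic in the same order and leave the pointers in the same place, which suggests that any timing gap between the real kernels comes from how addresses are generated or how memory is accessed rather than from the fmla chain itself.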
Additional details:
- xiaomi5s cpu: Qualcomm Snapdragon 821
- xiaomi6 cpu: Qualcomm Snapdragon 835
- redmi cpu: MediaTek Helio X20
Compile options (clang version 5.0.0): clang++ -std=c++11 -Ofast
- I changed ldr q0, [%2] to ld1 v0.4s, [%2], but the result is the same; run_origin may be a little faster, by about 1%-3%.
N: 12 IH: 224 IW: 224 OH: 112 OW: 112
perf origin: 342.96631 mflops --- asm: 4288.51646 mflops --- speedup: 12.50419
- I changed fmla v0.4s, v1.4s, v21.4s to smlsl2 v0.2d, v1.4s, v21.4s, but the result is the same.
N: 12 IH: 224 IW: 224 OH: 112 OW: 112
perf origin: 348.03699 mflops --- asm: 4245.18804 mflops --- speedup: 12.19752
- I changed fmla v0.4s, v1.4s, v21.4s to fadd v0.4s, v1.4s, v21.4s, and the origin code gets faster.
N: 12 IH: 224 IW: 224 OH: 112 OW: 112
perf origin: 743.95433 mflops --- asm: 4756.65769 mflops --- speedup: 6.39375
I changed fmla v0.4s, v1.4s, v21.4s to smlsl2 v0.2d, v1.4s, v21.4s, but the result is the same, so it's not a problem with the fused multiply implementation. – Sethbrin
I changed fmla to fadd, and the origin code gets about twice as fast. – Sethbrin