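【Posted】: 2016-01-30 20:09:41
【Question】:
I am developing a native Android application that should run on devices with ARMv7 processors. For various reasons, I need to do some heavy computation on vectors (short and/or float). I implemented a few assembly functions using NEON instructions to speed up the computation. I got a speedup factor of 1.5, which is not bad. I would like to know whether I can improve these functions so they run even faster.
So the question is: what changes can I make to improve these functions?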
// Add two float vectors.
// The result could be written to src1 instead of dst.
void add_float_vector_with_neon3(float* dst, float* src1, float* src2, int count)
{
    asm volatile (
        "1:                                 \n"
        "vld1.32    {q0}, [%[src1]]!        \n"  // load 4 floats from src1, advance src1
        "vld1.32    {q1}, [%[src2]]!        \n"  // load 4 floats from src2, advance src2
        "vadd.f32   q0, q0, q1              \n"  // q0 = q0 + q1
        "subs       %[count], %[count], #4  \n"  // 4 elements processed; sets the flags
        "vst1.32    {q0}, [%[dst]]!         \n"  // store 4 results, advance dst
        "bgt        1b                      \n"  // loop while count > 0
        : [dst] "+r" (dst)
        : [src1] "r" (src1), [src2] "r" (src2), [count] "r" (count)
        : "memory", "q0", "q1"
    );
}
// Multiply a float vector by a scalar.
// The result could be written to src1 instead of dst.
void mul_float_vector_by_scalar_with_neon3(float* dst, float* src1, float scalar, int count)
{
    asm volatile (
        "vdup.32    q1, %[scalar]           \n"  // broadcast the scalar to all 4 lanes of q1
        "2:                                 \n"
        "vld1.32    {q0}, [%[src1]]!        \n"  // load 4 floats from src1, advance src1
        "vmul.f32   q0, q0, q1              \n"  // q0 = q0 * scalar
        "subs       %[count], %[count], #4  \n"  // 4 elements processed; sets the flags
        "vst1.32    {q0}, [%[dst]]!         \n"  // store 4 results, advance dst
        "bgt        2b                      \n"  // loop while count > 0
        : [dst] "+r" (dst)
        : [src1] "r" (src1), [scalar] "r" (scalar), [count] "r" (count)
        : "memory", "q0", "q1"
    );
}
// Add two short vectors -> value-range limits are not a concern here.
// The result should be written to a dst different from src1 and src2.
void add_short_vector_with_neon3(short* dst, short* src1, short* src2, int count)
{
    asm volatile (
        "3:                                 \n"
        "vld1.16    {q0}, [%[src1]]!        \n"  // load 8 shorts from src1, advance src1
        "vld1.16    {q1}, [%[src2]]!        \n"  // load 8 shorts from src2, advance src2
        "vadd.i16   q0, q0, q1              \n"  // q0 = q0 + q1 (wraps on overflow)
        "subs       %[count], %[count], #8  \n"  // 8 elements processed; sets the flags
        "vst1.16    {q0}, [%[dst]]!         \n"  // store 8 results, advance dst
        "bgt        3b                      \n"  // loop while count > 0
        : [dst] "+r" (dst)
        : [src1] "r" (src1), [src2] "r" (src2), [count] "r" (count)
        : "memory", "q0", "q1"
    );
}
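// Multiply a short vector by a float vector and put the result back into a short vector.
// The result should be written to a dst different from src1.
void mul_short_vector_by_float_vector_with_neon3(short* dst, short* src1, float* src2, int count)
{
    asm volatile (
        "4:                                 \n"
        "vld1.16        {d0}, [%[src1]]!        \n"  // load 4 shorts from src1, advance src1
        "vld1.32        {q1}, [%[src2]]!        \n"  // load 4 floats from src2, advance src2
        "vmovl.s16      q0, d0                  \n"  // widen 4 x s16 to 4 x s32
        "vcvt.f32.s32   q0, q0                  \n"  // convert s32 to float
        "vmul.f32       q0, q0, q1              \n"  // q0 = q0 * q1
        "vcvt.s32.f32   q0, q0                  \n"  // convert back to s32 (rounds toward zero)
        "vmovn.s32      d0, q0                  \n"  // narrow to 4 x s16 (no saturation)
        "subs           %[count], %[count], #4  \n"  // 4 elements processed; sets the flags
        "vst1.16        {d0}, [%[dst]]!         \n"  // store 4 results, advance dst
        "bgt            4b                      \n"  // loop while count > 0
        : [dst] "+r" (dst)
        : [src1] "r" (src1), [src2] "r" (src2), [count] "r" (count)
        : "memory", "d0", "q0", "q1"
    );
}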
Thanks in advance!
【Comments】:
-
Well, this is assembly, not intrinsics.
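For comparison, the float addition from the question might look roughly like this when written with NEON intrinsics instead of inline assembly. This is only a sketch: the function name `add_float_vector_intrinsics` is hypothetical, the scalar tail loop for counts that are not a multiple of 4 is an addition the original does not have, and on non-NEON targets the code falls back to plain C so that it still compiles.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical intrinsics counterpart of add_float_vector_with_neon3. */
void add_float_vector_intrinsics(float *dst, const float *src1,
                                 const float *src2, int count)
{
    int i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Process 4 floats per iteration with 128-bit NEON registers. */
    for (; i + 4 <= count; i += 4) {
        float32x4_t a = vld1q_f32(src1 + i);
        float32x4_t b = vld1q_f32(src2 + i);
        vst1q_f32(dst + i, vaddq_f32(a, b));
    }
#endif
    /* Scalar loop: handles the tail, or everything on non-NEON targets. */
    for (; i < count; ++i) {
        dst[i] = src1[i] + src2[i];
    }
}
```

With intrinsics the compiler takes care of register allocation and instruction scheduling, so no constraint or clobber lists are needed.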
-
Thanks, I have edited the post.
-
I found a lot of useful tips at software.intel.com/en-us/blogs/2012/12/12/…
-
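The first rule of thumb is not to use the result of a load immediately after the load, because the load takes time and may stall the next instruction. So you always want to interleave instructions, or software-pipeline them.
-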
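The software-pipelining advice in the preceding comment can be sketched as follows for the multiply-by-scalar case: the load for the next group of 4 floats is issued before the arithmetic on the group loaded earlier. This is a minimal sketch with assumptions: the function name `mul_float_vector_by_scalar_pipelined` is hypothetical, it uses intrinsics rather than the question's inline assembly, and the compiler is free to reorder intrinsics, so the source order is only a hint. Whether it actually helps has to be measured on the target device.

```c
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical name; multiplies count floats from src by scalar. */
void mul_float_vector_by_scalar_pipelined(float *dst, const float *src,
                                          float scalar, int count)
{
    int i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    if (count >= 8) {
        /* Prologue: load the first group before entering the loop. */
        float32x4_t cur = vld1q_f32(src);
        for (; i + 8 <= count; i += 4) {
            /* Start loading the next group first ... */
            float32x4_t next = vld1q_f32(src + i + 4);
            /* ... then work on the group that was loaded earlier. */
            vst1q_f32(dst + i, vmulq_n_f32(cur, scalar));
            cur = next;
        }
        /* Epilogue: finish the last group that was loaded. */
        vst1q_f32(dst + i, vmulq_n_f32(cur, scalar));
        i += 4;
    }
#endif
    /* Scalar loop: handles the tail, or everything on non-NEON targets. */
    for (; i < count; ++i) {
        dst[i] = src[i] * scalar;
    }
}
```

In hand-written assembly the same idea applies: unroll the loop and place the loads for the next iteration between the arithmetic and store instructions of the current one.
-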
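I don't speak ARM. At all. But this code does worry me a bit because of this line in the docs: Warning: Do not modify the contents of input-only operands (except for inputs tied to outputs). To clarify this restriction: does the register holding "count" have exactly the same value on exit from the asm as it had on entry? If the answer is no, you are violating the rule. If gcc tries to reuse a register that it "knows" contains a particular value, only to find that you have misled it, bad things can happen.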
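One way to address this concern is to declare every register that the asm modifies as a read-write operand ("+r") and to list the condition flags as clobbered, since `subs` changes them. The sketch below applies that to the float addition from the question. The function name `add_float_vector_fixed_constraints` is hypothetical, and the plain C fallback for non-ARM targets is an addition so the code compiles elsewhere. Like the original, the asm path assumes that count is a positive multiple of 4.

```c
/* Hypothetical corrected variant of add_float_vector_with_neon3. */
void add_float_vector_fixed_constraints(float *dst, const float *src1,
                                        const float *src2, int count)
{
#if defined(__arm__) && (defined(__ARM_NEON) || defined(__ARM_NEON__))
    __asm__ volatile (
        "1:                                 \n"
        "vld1.32    {q0}, [%[src1]]!        \n"  /* post-increment changes src1 */
        "vld1.32    {q1}, [%[src2]]!        \n"  /* post-increment changes src2 */
        "vadd.f32   q0, q0, q1              \n"
        "subs       %[count], %[count], #4  \n"  /* changes count and the flags */
        "vst1.32    {q0}, [%[dst]]!         \n"  /* post-increment changes dst */
        "bgt        1b                      \n"
        /* All four registers are modified, so all are "+r" outputs. */
        : [dst] "+r" (dst), [src1] "+r" (src1),
          [src2] "+r" (src2), [count] "+r" (count)
        : /* no input-only operands */
        : "memory", "cc", "q0", "q1"
    );
#else
    /* Portable fallback for targets without 32-bit ARM NEON. */
    for (int i = 0; i < count; ++i) {
        dst[i] = src1[i] + src2[i];
    }
#endif
}
```

Because the operands are local copies of the arguments, modifying them inside the asm does not affect the caller; the "+r" constraints only tell the compiler that it can no longer rely on their previous values.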
Tags: android c native inline-assembly neon