Haskell math performance on multiply-add operation: answers

【Question title】: Haskell math performance on multiply-add operation
【Posted】: 2011-03-08 03:07:42
【Question】:

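I'm writing a game in Haskell, and my current pass at the UI involves a lot of procedural generation of geometry. I am currently focused on identifying the performance of one particular operation (C-ish pseudocode):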

Vec4f multiplier, addend;
Vec4f vecList[];
for (int i = 0; i < count; i++)
    vecList[i] = vecList[i] * multiplier + addend;

That is, a bog-standard multiply-add of four floats, the kind of thing ripe for SIMD optimization.

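The result is going to an OpenGL vertex buffer, so it has to get dumped into a flat C array eventually. For the same reason, the calculations should probably be done on C 'float' types.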

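I've looked for either a library or a native idiomatic solution to do this sort of thing quickly in Haskell, but every solution I've come up with seems to hover around 2% of the performance (that is, 50x slower) of C compiled by GCC with the right flags. Granted, I started with Haskell a couple of weeks ago, so my experience is limited, which is why I'm coming to you guys. Can any of you offer suggestions for a faster Haskell implementation, or pointers to documentation on how to write high-performance Haskell code?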

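First, the most recent Haskell solution (clocks in at about 12 seconds). I tried the bang-patterns stuff from this SO post, but it didn't make a difference AFAICT. Replacing 'multAdd' with '(\i v -> v * 4)' brought execution time down to 1.9 seconds, so the bitwise stuff (and the consequent challenges to automatic optimization) doesn't seem to be too much at fault.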

{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -O2 -fvia-C -optc-O3 -fexcess-precision -optc-march=native #-}

import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign.C.Types
import Data.Bits

repCount = 10000
arraySize = 20000

a = fromList $ [0.2::CFloat,  0.1, 0.6, 1.0]
m = fromList $ [0.99::CFloat, 0.7, 0.8, 0.6]

multAdd :: Int -> CFloat -> CFloat
multAdd !i !v = v * (m ! (i .&. 3)) + (a ! (i .&. 3))

multList :: Int -> Vector CFloat -> Vector CFloat
multList !count !src
    | count <= 0    = src
    | otherwise     = multList (count-1) $ V.imap multAdd src

main = do
    print $ Data.Vector.Storable.sum $ multList repCount $ 
        Data.Vector.Storable.replicate (arraySize*4) (0::CFloat)

Here's what I have in C. The code here has a few #ifdefs that prevent it from being compiled straight up; scroll down for the test driver.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef float v4fs __attribute__ ((vector_size (16)));
typedef struct { float x, y, z, w; } Vector4;

void setv4(v4fs *v, float x, float y, float z, float w) {
    float *a = (float*) v;
    a[0] = x;
    a[1] = y;
    a[2] = z;
    a[3] = w;
}

float sumv4(v4fs *v) {
    float *a = (float*) v;
    return a[0] + a[1] + a[2] + a[3];
}

void vecmult(v4fs *MAYBE_RESTRICT s, v4fs *MAYBE_RESTRICT d, v4fs a, v4fs m) {
    for (int j = 0; j < N; j++) {
        d[j] = s[j] * m + a;
    }
}

void scamult(float *MAYBE_RESTRICT s, float *MAYBE_RESTRICT d,
             Vector4 a, Vector4 m) {
    for (int j = 0; j < (N*4); j+=4) {
        d[j+0] = s[j+0] * m.x + a.x;
        d[j+1] = s[j+1] * m.y + a.y;
        d[j+2] = s[j+2] * m.z + a.z;
        d[j+3] = s[j+3] * m.w + a.w;
    }
}

int main () {
    v4fs a, m;
    v4fs *s, *d;

    setv4(&a, 0.2, 0.1, 0.6, 1.0);
    setv4(&m, 0.99, 0.7, 0.8, 0.6);

    s = calloc(N, sizeof(v4fs));
    d = s;

    double start = clock();
    for (int i = 0; i < M; i++) {

#ifdef COPY
        d = malloc(N * sizeof(v4fs));
#endif

#ifdef VECTOR
        vecmult(s, d, a, m);
#else
        Vector4 aa = *(Vector4*)(&a);
        Vector4 mm = *(Vector4*)(&m);
        scamult((float*)s, (float*)d, aa, mm);
#endif

#ifdef COPY
        free(s);
        s = d;
#endif
    }
    double end = clock();

    float sum = 0;
    for (int j = 0; j < N; j++) {
        sum += sumv4(s+j);
    }
    printf("%-50s %2.5f %f\n\n", NAME,
            (end - start) / (double) CLOCKS_PER_SEC, sum);
}

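This script compiles and runs the tests with a number of gcc flag combinations. On my system the best performance came from cmath-64-native-O3-restrict-vector-nocopy, which took 0.22 seconds.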

import System.Process
import System.Exit (ExitCode(..))  -- ExitSuccess; GHC.IOBase is gone from current base

cBase = ("cmath", "gcc mult.c -ggdb --std=c99 -DM=10000 -DN=20000")
cOptions = [
            [("32", "-m32"), ("64", "-m64")],
            [("generic", ""), ("native", "-march=native -msse4")],
            [("O1", "-O1"), ("O2", "-O2"), ("O3", "-O3")],
            [("restrict", "-DMAYBE_RESTRICT=__restrict__"),
                ("norestrict", "-DMAYBE_RESTRICT=")],
            [("vector", "-DVECTOR"), ("scalar", "")],
            [("copy", "-DCOPY"), ("nocopy", "")]
           ]

-- Fold over the Cartesian product of the double list. Probably a Prelude function
-- or two that does this, but hey. The 'perm' referred to permutations until I realized
-- that this wasn't actually doing permutations. '
permfold :: (a -> a -> a) -> a -> [[a]] -> [a]
permfold f z [] = [z]
permfold f z (x:xs) = concat $ map (\a -> (permfold f (f z a) xs)) x

prepCmd :: (String, String) -> (String, String) -> (String, String)
prepCmd (name, cmd) (namea, cmda) =
    (name ++ "-" ++ namea, cmd ++ " " ++ cmda)

runCCmd name compileCmd = do
    res <- system (compileCmd ++ " -DNAME=\\\"" ++ name ++ "\\\" -o " ++ name)
    if res == ExitSuccess
        then do system ("./" ++ name)
                return ()
        else    putStrLn $ name ++ " did not compile"

main = do
    mapM_ (uncurry runCCmd) $ permfold prepCmd cBase cOptions
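
The comment in the script guesses that a Prelude function could replace permfold. One candidate, offered as a sketch rather than a drop-in change, is `sequence`: on a list of lists it returns every way of picking one element from each inner list, in the same order in which permfold visits them. The name `variants` is hypothetical; `permfold` and `prepCmd` are repeated from the script so that the block compiles on its own.

```haskell
-- permfold and prepCmd as defined in the script above.
permfold :: (a -> a -> a) -> a -> [[a]] -> [a]
permfold _ z []       = [z]
permfold f z (x : xs) = concatMap (\a -> permfold f (f z a) xs) x

prepCmd :: (String, String) -> (String, String) -> (String, String)
prepCmd (name, cmd) (namea, cmda) =
    (name ++ "-" ++ namea, cmd ++ " " ++ cmda)

-- Hypothetical replacement: sequence enumerates one choice per option group,
-- and foldl combines each list of choices left to right, starting from z.
variants :: (a -> a -> a) -> a -> [[a]] -> [a]
variants f z = map (foldl f z) . sequence
```

With the six option groups in the script (2, 2, 3, 2, 2 and 2 alternatives), either version yields 96 compile commands.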

【Comments】:

Rewriting this to use more idiomatic types roughly halves the running time (hpaste.org/fastcgi/hpaste.fcgi/view?id=26551#a26551), but I am forwarding this to Roman to consider.

Tags: performance math haskell simd

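【Answer 1】:

Roman Leshchinskiy responds:

Actually, the core looks mostly OK to me. Using unsafeIndex instead of (!) makes the program more than twice as fast (see my other answer). The program below is much faster, though (and cleaner, IMO). I suspect the remaining difference between this and the C program is due to GHC's general suckiness when it comes to floating point. The HEAD produces the best results with the NCG and -msse2.

First, define a new Vec4 data type: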

{-# LANGUAGE BangPatterns #-}

import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign
import Foreign.C.Types

-- Define a 4 element vector type
data Vec4 = Vec4 {-# UNPACK #-} !CFloat
                 {-# UNPACK #-} !CFloat
                 {-# UNPACK #-} !CFloat
                 {-# UNPACK #-} !CFloat

Ensure we can store it in an array:

instance Storable Vec4 where
  sizeOf _ = sizeOf (undefined :: CFloat) * 4
  alignment _ = alignment (undefined :: CFloat)

  {-# INLINE peek #-}
  peek p = do
             a <- peekElemOff q 0
             b <- peekElemOff q 1
             c <- peekElemOff q 2
             d <- peekElemOff q 3
             return (Vec4 a b c d)
    where
      q = castPtr p
  {-# INLINE poke #-}
  poke p (Vec4 a b c d) = do
             pokeElemOff q 0 a
             pokeElemOff q 1 b
             pokeElemOff q 2 c
             pokeElemOff q 3 d
    where
      q = castPtr p

Values and methods on this type:

a = Vec4 0.2 0.1 0.6 1.0
m = Vec4 0.99 0.7 0.8 0.6

add :: Vec4 -> Vec4 -> Vec4
{-# INLINE add #-}
add (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a+a') (b+b') (c+c') (d+d')

mult :: Vec4 -> Vec4 -> Vec4
{-# INLINE mult #-}
mult (Vec4 a b c d) (Vec4 a' b' c' d') = Vec4 (a*a') (b*b') (c*c') (d*d')

vsum :: Vec4 -> CFloat
{-# INLINE vsum #-}
vsum (Vec4 a b c d) = a+b+c+d

multList :: Int -> Vector Vec4 -> Vector Vec4
multList !count !src
    | count <= 0    = src
    | otherwise     = multList (count-1) $ V.map (\v -> add (mult v m) a) src

main = do
    print $ Data.Vector.Storable.sum
          $ Data.Vector.Storable.map vsum
          $ multList repCount
          $ Data.Vector.Storable.replicate arraySize (Vec4 0 0 0 0)

repCount, arraySize :: Int
repCount = 10000
arraySize = 20000

With ghc 6.12.1, -O2 -fasm:

1.752

With ghc HEAD (June 26), -O2 -fasm -msse2:

1.708

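This looks like the most idiomatic way to write a Vec4 array, and it gets the best performance (11x faster than your original). (And this might become a benchmark for GHC's LLVM backend.)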
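
Since the result has to end up in a flat C float array for OpenGL, it may be worth checking that the Storable instance really lays out the four components contiguously. The following is a minimal sketch using only base. The Vec4 type and its instance are repeated from above (without the pragmas) so that the block compiles alone, and `flatten` is a hypothetical helper, not part of any library.

```haskell
import Foreign
import Foreign.C.Types (CFloat)

-- Same shape as the Vec4 above, repeated so this block stands alone.
data Vec4 = Vec4 !CFloat !CFloat !CFloat !CFloat
  deriving (Eq, Show)

instance Storable Vec4 where
  sizeOf _    = 4 * sizeOf (undefined :: CFloat)
  alignment _ = alignment (undefined :: CFloat)
  peek p = do
    let q = castPtr p :: Ptr CFloat
    a <- peekElemOff q 0
    b <- peekElemOff q 1
    c <- peekElemOff q 2
    d <- peekElemOff q 3
    return (Vec4 a b c d)
  poke p (Vec4 a b c d) = do
    let q = castPtr p :: Ptr CFloat
    pokeElemOff q 0 a
    pokeElemOff q 1 b
    pokeElemOff q 2 c
    pokeElemOff q 3 d

-- Hypothetical helper: write the Vec4 values into a temporary C array, then
-- read the same memory back as a flat list of CFloat, the way a C consumer
-- of the buffer would see it.
flatten :: [Vec4] -> IO [CFloat]
flatten vs =
  withArrayLen vs $ \n p ->
    peekArray (4 * n) (castPtr p :: Ptr CFloat)
```

In GHCi, `flatten [Vec4 1 2 3 4, Vec4 5 6 7 8]` should return the eight components in order, which is the layout a vertex buffer of four-float elements expects.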

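【Comments】:

I looked at this with the LLVM backend. With their best settings, -fasm and -fvia-C both ran in about 1.5 seconds on my laptop. -fllvm ran in about 1.2 seconds. The scalar C code ran in about 0.7 seconds and the vector version in about 0.27 seconds.

【Answer 2】:

Well, this is better. 3.5 seconds instead of 14 seconds.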

{-# LANGUAGE BangPatterns #-}
{-

-- multiply-add of four floats,
Vec4f multiplier, addend;
Vec4f vecList[];
for (int i = 0; i < count; i++)
    vecList[i] = vecList[i] * multiplier + addend;

-}

import qualified Data.Vector.Storable as V
import Data.Vector.Storable (Vector)
import Data.Bits

repCount, arraySize :: Int
repCount = 10000
arraySize = 20000

a, m :: Vector Float
a = V.fromList [0.2,  0.1, 0.6, 1.0]
m = V.fromList [0.99, 0.7, 0.8, 0.6]

multAdd :: Int -> Float -> Float
multAdd i v = v * (m `V.unsafeIndex` (i .&. 3)) + (a `V.unsafeIndex` (i .&. 3))

go :: Int -> Vector Float -> Vector Float
go n s
    | n <= 0    = s
    | otherwise = go (n-1) (f s)
  where
    f = V.imap multAdd

main = print . V.sum $ go repCount v
  where
    v :: Vector Float
    v = V.replicate (arraySize * 4) 0
            -- ^ a flattened Vec4f []

Which is better than before:

$ ghc -O2 --make A.hs
[1 of 1] Compiling Main             ( A.hs, A.o )
Linking A ...

$ time ./A
516748.13
./A  3.58s user 0.01s system 99% cpu 3.593 total

And multAdd compiles quite well:

        case readFloatOffAddr#
               rb_aVn
               (word2Int#
                  (and# (int2Word# sc1_s1Yx) __word 3))
               realWorld#
        of _ { (# s25_X1Tb, x4_X1Te #) ->
        case readFloatOffAddr#
               rb11_X118
               (word2Int#
                  (and# (int2Word# sc1_s1Yx) __word 3))
               realWorld#
        of _ { (# s26_X1WO, x5_X20B #) ->
        case writeFloatOffAddr#
               @ RealWorld
               a17_s1Oe
               sc3_s1Yz
               (plusFloat#
                  (timesFloat# x3_X1Qz x4_X1Te) x5_X20B)

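However, you're doing 4-element-at-a-time multiplies in the C code, so we'll need to do that directly, rather than faking it by looping and masking. GCC is probably unrolling the loop, too.

So to get identical performance, we'd need the vector multiply (a bit hard, possibly via the LLVM backend) and to unroll the loop (possibly fusing it). I'll defer to Roman here to see if there are other obvious things.

One idea might be to actually use a Vector Vec4, rather than flattening it.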
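
To make that last suggestion concrete, the sketch below writes the same multiply-add in both layouts so that they can be checked against each other. It uses plain lists for clarity, not speed, and every name in it (`multAddFlat`, `multAddVec`, `toVec4s`, `fromVec4s`) is hypothetical and not part of the vector library.

```haskell
import Data.Bits ((.&.))

-- Flattened layout, as in the code above: i .&. 3 selects which of the four
-- multiplier/addend components applies to element i.
multAddFlat :: [Float] -> [Float] -> [Float] -> [Float]
multAddFlat m a xs = zipWith step [0 ..] xs
  where
    step :: Int -> Float -> Float
    step i v = v * (m !! (i .&. 3)) + (a !! (i .&. 3))

-- Grouped layout: one four-component value per element.
data Vec4 = Vec4 !Float !Float !Float !Float
  deriving (Eq, Show)

multAddVec :: Vec4 -> Vec4 -> Vec4 -> Vec4
multAddVec (Vec4 mx my mz mw) (Vec4 ax ay az aw) (Vec4 x y z w) =
  Vec4 (x * mx + ax) (y * my + ay) (z * mz + az) (w * mw + aw)

-- Conversions between the two layouts; a trailing partial group is dropped.
toVec4s :: [Float] -> [Vec4]
toVec4s (x : y : z : w : rest) = Vec4 x y z w : toVec4s rest
toVec4s _                      = []

fromVec4s :: [Vec4] -> [Float]
fromVec4s vs = concat [[x, y, z, w] | Vec4 x y z w <- vs]
```

In the flat version every element recomputes its lane and looks up m and a by index. In the grouped version the four components are matched once per element, so no index arithmetic is left in the inner step.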

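【Comments】:

It's useful to try the same code with multAdd i v = v. On my system it runs in about 75% of the time, which tells you how long the traversal takes compared with the multAdd operation itself.
The Haskell version's performance is still a lot lower than I'd like for this particular application, but that's why there's an FFI. Thanks for the help.
I'm continuing to look at this, and I suspect imap may not be doing the right job. I will let you know if we can work out what is going on.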