Runtime performance degradation for C FFI callback when pthreads are enabled

【Question title】: Runtime performance degradation for C FFI Callback when pthreads are enabled
【Posted】: 2012-02-12 17:25:05
【Question】:

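I am curious about the behavior of the GHC runtime with the threaded option when C code called through the FFI calls back into a Haskell function. I wrote code to measure the overhead of a basic function callback (below). While callback overhead has been discussed before, I am curious about the sharp increase in total time I observed when multi-threading is enabled in the C code, even though the total number of calls into Haskell stays the same. In my test, I called the Haskell function f 5M times in two scenarios (GHC 7.0.4, RHEL, 12-core box, runtime options given after the code):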

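Single thread in the C create_threads function: calls f 5M times - total time 1.32s
5 threads in the C create_threads function: each thread calls f 1M times, so the total is still 5M - total time 7.79s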

Code below. The Haskell code is set up for the single-threaded C callback; its comments explain how to change it for the 5-thread test:

t.hs:

{-# LANGUAGE BangPatterns #-}
import qualified Data.Vector.Storable as SV
import Control.Monad (mapM, mapM_)
import Foreign.Ptr (Ptr, FunPtr, freeHaskellFunPtr)
import Foreign.C.Types (CInt)

f :: CInt -> ()
f x = ()

-- "wrapper" import is a converter for converting a Haskell function to a foreign function pointer
foreign import ccall "wrapper"
  wrap :: (CInt -> ()) -> IO (FunPtr (CInt -> ()))

foreign import ccall safe "mt.h create_threads"
  createThreads :: Ptr (FunPtr (CInt -> ())) -> Ptr CInt -> CInt -> IO()

main = do
  -- set threads=[1..5], l=1000000 for multi-threaded FFI callback testing
  let threads = [1..1]
      l = 5000000
      vl = SV.replicate (length threads) (fromIntegral l) -- make a vector of l
  lf <- mapM (\x -> wrap f ) threads -- wrap f into a funPtr and create a list
  let vf = SV.fromList lf -- create vector of FunPtr to f
  -- pass vector of function pointer to f, and vector of l to create_threads
  -- create_threads will spawn threads (equal to length of threads list)
  -- each pthread will call back f l times - then we can check the overhead
  SV.unsafeWith vf $ \x ->
    SV.unsafeWith vl $ \y -> createThreads x y (fromIntegral $ SV.length vl)
  SV.mapM_ freeHaskellFunPtr vf

mt.h:

#include <pthread.h>
#include <stdio.h>

typedef void(*FunctionPtr)(int);

/** Struct for passing argument to thread
**
**/
typedef struct threadArgs{
   int  threadId;
   FunctionPtr fn;
   int length;
} threadArgs;


/* This is our thread function.  It is like main(), but for a thread*/
void *threadFunc(void *arg);
void create_threads(FunctionPtr*,int*,int);

mt.c:

#include "mt.h"


/* This is our thread function.  It is like main(), but for a thread*/
void *threadFunc(void *arg)
{
  FunctionPtr fn;
  threadArgs args = *(threadArgs*) arg;
  int id = args.threadId;
  int length = args.length;
  fn = args.fn;
  int i;
  for (i=0; i < length; i++){
    fn(i); //call haskell function
  }
  return NULL; /* non-void thread function must return; result is ignored by pthread_join */
}

void create_threads(FunctionPtr* fp, int* length, int numThreads )
{
  pthread_t pth[numThreads];  // this is our thread identifier
  threadArgs args[numThreads];
  int t;
  for (t=0; t < numThreads;){
    args[t].threadId = t;
    args[t].fn = *(fp + t);
    args[t].length = *(length + t);
    pthread_create(&pth[t],NULL,threadFunc,&args[t]);
    t++;
  }

  for (t=0; t < numThreads;t++){
    pthread_join(pth[t],NULL);
  }
  printf("All threads terminated\n");
}

Compile (GHC 7.0.4; gcc 4.4.3, in case it is used by ghc):

 $ ghc -O2 t.hs mt.c -lpthread -threaded -rtsopts -optc-O2

Running with 1 thread in create_threads (the code above does that). I switched off parallel GC for this test:

$ ./t +RTS -s -N5 -g1
  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    1.04s  (  1.05s elapsed)
  GC    time    0.28s  (  0.28s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    1.32s  (  1.34s elapsed)

  %GC time      21.1%  (21.2% elapsed)

Running with 5 threads (see the first comment in the main function of t.hs above for how to edit it for 5 threads):

$ ./t +RTS -s -N5 -g1
  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    7.42s  (  2.27s elapsed)
  GC    time    0.36s  (  0.37s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    7.79s  (  2.63s elapsed)

  %GC time       4.7%  (13.9% elapsed)
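
A note on reading the two reports: in `+RTS -s` output the first figure on each line is CPU time summed over all OS threads, and the figure in parentheses is wall-clock time. The 5-thread run therefore used 7.79s of CPU but finished in 2.63s of wall-clock time, against 1.34s for the single-thread run. The sketch below shows the same distinction in plain C. It is only an illustration: `now_seconds`, `spin` and `measure` are hypothetical names, and the busy loop merely stands in for callback work.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical helper: current reading of the given clock, in seconds. */
static double now_seconds(clockid_t clk)
{
  struct timespec ts;
  clock_gettime(clk, &ts);
  return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Busy loop standing in for the work one pthread does. */
static void *spin(void *arg)
{
  volatile unsigned long n = *(unsigned long *)arg;
  while (n > 0) n--;
  return NULL;
}

/* Runs num_threads busy workers. Reports CPU seconds (summed over all
   threads of the process) and wall-clock seconds for the whole run. */
void measure(int num_threads, unsigned long iters, double *cpu, double *elapsed)
{
  pthread_t pth[num_threads];
  double cpu0 = now_seconds(CLOCK_PROCESS_CPUTIME_ID);
  double wall0 = now_seconds(CLOCK_MONOTONIC);
  int t;
  for (t = 0; t < num_threads; t++) {
    int rc = pthread_create(&pth[t], NULL, spin, &iters);
    assert(rc == 0);
    (void)rc;
  }
  for (t = 0; t < num_threads; t++) {
    pthread_join(pth[t], NULL);
  }
  *cpu = now_seconds(CLOCK_PROCESS_CPUTIME_ID) - cpu0;
  *elapsed = now_seconds(CLOCK_MONOTONIC) - wall0;
}
```

On a multi-core machine, `measure` with several threads would be expected to report more CPU seconds than elapsed seconds, which is the same pattern as the MUT line above.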

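I would appreciate any insight into why multiple pthreads in create_threads cause this performance degradation. I first suspected parallel GC, but I turned it off for the tests above. With the same runtime options, the MUT time also goes up sharply with multiple pthreads, so it is not just GC.

Also, are there any improvements in GHC 7.4.1 for this kind of scenario?

I don't plan to call back into Haskell from the FFI that often, but understanding the above issue helps when designing how Haskell interacts with a multi-threaded C library.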
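
One way to separate the cost of the pthreads themselves from the cost of calling back into Haskell would be to run the same loop against a plain C callback. The following is a minimal sketch of such a baseline, not part of the original test: `c_noop`, `baselineArgs`, `baseline_worker` and `run_baseline` are hypothetical names, and the struct only mirrors `threadArgs` from mt.h.

```c
#include <assert.h>
#include <pthread.h>

typedef void (*FunctionPtr)(int);

/* Hypothetical argument struct, mirroring threadArgs in mt.h. */
typedef struct {
  FunctionPtr fn;
  int length;
  long calls;            /* filled in by the worker when it finishes */
} baselineArgs;

/* C stand-in for the Haskell f: takes an int and does nothing. */
static void c_noop(int x) { (void)x; }

static void *baseline_worker(void *arg)
{
  baselineArgs *a = arg;
  long made = 0;         /* local counter avoids sharing a cache line */
  int i;
  for (i = 0; i < a->length; i++) {
    a->fn(i);
    made++;
  }
  a->calls = made;
  return NULL;
}

/* Spawns num_threads pthreads, each calling c_noop calls_per_thread
   times, and returns the total number of calls made. */
long run_baseline(int num_threads, int calls_per_thread)
{
  pthread_t pth[num_threads];
  baselineArgs args[num_threads];
  long total = 0;
  int t;
  for (t = 0; t < num_threads; t++) {
    args[t].fn = c_noop;
    args[t].length = calls_per_thread;
    args[t].calls = 0;
    int rc = pthread_create(&pth[t], NULL, baseline_worker, &args[t]);
    assert(rc == 0);
    (void)rc;
  }
  for (t = 0; t < num_threads; t++) {
    pthread_join(pth[t], NULL);
    total += args[t].calls;
  }
  return total;
}
```

Timing `run_baseline(1, 5000000)` against `run_baseline(5, 1000000)` would show how much of the difference, if any, comes from thread creation and joining alone rather than from the Haskell callback path.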

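【Comments】:

I get a much smaller slowdown on 7.2.2: 1.42s total (1.42s elapsed) for a single thread versus 2.58s total (1.86s elapsed) for four threads (I only have 2 physical cores with 4 hardware threads, so asking for 5 threads seemed pointless). So it may be even better in 7.4.1.
@DanielFischer, thanks for the pointer about 7.2.2 performance. Maybe I should download and compile 7.4.1RC on RHEL to see how it performs. It is quite a time-consuming effort, though.
I believe they provide prebuilt binaries for the release candidates too, so I imagine it wouldn't be that time-consuming. Or do the vanilla binaries not work on RHEL?
@DanielFischer, the vanilla binaries don't work on RHEL5 because its glibc is older than the version the binaries were built against.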

Tags: haskell concurrency ffi

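【Answer 1】:

I believe the key question here is: how does the GHC runtime schedule C callbacks into Haskell? Although I don't know for certain, I suspect that all C callbacks are handled by the Haskell thread that originally made the foreign call, at least up to ghc-7.2.1 (which I'm using).

This would explain the large slowdown you (and I) see when moving from 1 thread to 5 threads. If all five threads call back into the same Haskell thread, there will be significant contention on that Haskell thread to complete all the callbacks.

To test this, I modified your code so that Haskell forks a new thread before each call to create_threads, and create_threads spawns only one thread per call. If I'm correct, each OS thread then has a dedicated Haskell thread to do its work, so there should be much less contention. Although this still takes almost twice as long as the single-thread version, it is significantly faster than the original multi-threaded version, which lends some evidence to this theory. The difference is much smaller if I turn off thread migration with +RTS -qm.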
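
The answer does not include its modified code. As a rough sketch of what the C side of such a change might look like, the function below spawns exactly one pthread per call and waits for it. All names here (`create_one_thread`, `oneThreadArgs`, `one_thread_func`, `count_call`) are hypothetical, and `count_call` is only a C stand-in for the Haskell callback so that the function can be exercised without the Haskell program.

```c
#include <assert.h>
#include <stddef.h>
#include <pthread.h>

typedef void (*FunctionPtr)(int);

/* Hypothetical, trimmed-down version of threadArgs from mt.h. */
typedef struct {
  FunctionPtr fn;
  int length;
} oneThreadArgs;

static void *one_thread_func(void *arg)
{
  oneThreadArgs *a = arg;
  int i;
  for (i = 0; i < a->length; i++) {
    a->fn(i);   /* in the real program this would call back into Haskell */
  }
  return NULL;
}

/* Hypothetical variant of create_threads: spawns a single pthread that
   calls fn length times, then waits for it. Returns 0 on success. */
int create_one_thread(FunctionPtr fn, int length)
{
  pthread_t pth;
  oneThreadArgs args;
  int rc;
  assert(fn != NULL && length >= 0);
  args.fn = fn;
  args.length = length;
  rc = pthread_create(&pth, NULL, one_thread_func, &args);
  if (rc != 0) return rc;
  return pthread_join(pth, NULL);
}

/* C stand-in for the Haskell callback, used only to exercise the code. */
static int calls_made = 0;
static void count_call(int x) { (void)x; calls_made++; }
```

On the Haskell side, the idea described above would be to fork one Haskell thread per pthread and have each of them make its own `safe` foreign call to a function of this shape, then wait for all of them to finish. That part is not shown here.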

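Since Daniel Fischer reports different results for ghc-7.2.2, I would expect that version to have changed how Haskell schedules callbacks. Perhaps somebody on the ghc-users list can provide more information; I don't see anything likely in the release notes for 7.2.2 or 7.4.1.

【Comments】:

Thanks for the feedback. Your theory looks plausible. There does seem to be some kind of contention going on. I also suspected that callbacks are single-threaded, and what you describe fits the observations. I emailed the ghc-users list yesterday as well.
I verified your observations in my own tests. If I map each pthread to its own Haskell thread for the callbacks (on 7.0.4), the runtime scales well. Marking your solution as the answer.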