分析 Haskell 程序答案

【问题标题】：Profiling a Haskell program分析 Haskell 程序
【发布时间】：2012-07-20 14:57:41
【问题描述】：

我有一段代码使用sequence 从概率分布中重复采样。从道德上讲，它会这样做：

sampleMean :: MonadRandom m => Int -> m Float -> m Float
sampleMean n dist = do
  xs <- sequence (replicate n dist)
  return (sum xs)

除了它有点复杂。我感兴趣的实际代码是函数likelihoodWeightingthis Github repo。

我注意到运行时间与n 呈非线性关系。特别是，一旦n 超过某个值，它就会达到内存限制，并且运行时间会爆炸。我不确定，但我认为这是因为 sequence 正在建立一长串 thunk，直到调用 sum 才得到评估。

一旦我超过了大约 100,000 个样本，程序就会慢下来。我想对此进行优化（我的感觉是 1000 万个样本应该不成问题）所以我决定对其进行分析 - 但我在理解分析器的输出时遇到了一点麻烦。

分析

我在文件main.hs 中创建了一个简短的可执行文件，它使用 100,000 个样本运行我的函数。这是做的输出

$ ghc -O2 -rtsopts main.hs
$ ./main +RTS -s

我注意到的第一件事 - 它分配了近 1.5 GB 的堆，并将 60% 的时间用于垃圾收集。这通常表示过于懒惰吗？

 1,377,538,232 bytes allocated in the heap
 1,195,050,032 bytes copied during GC
   169,411,368 bytes maximum residency (12 sample(s))
     7,360,232 bytes maximum slop
           423 MB total memory in use (0 MB lost due to fragmentation)

Generation 0:  2574 collections,     0 parallel,  2.40s,  2.43s elapsed
Generation 1:    12 collections,     0 parallel,  1.07s,  1.28s elapsed

INIT  time    0.00s  (  0.00s elapsed)
MUT   time    1.92s  (  1.94s elapsed)
GC    time    3.47s  (  3.70s elapsed)
RP    time    0.00s  (  0.00s elapsed)
PROF  time    0.23s  (  0.23s elapsed)
EXIT  time    0.00s  (  0.00s elapsed)
Total time    5.63s  (  5.87s elapsed)

%GC time      61.8%  (63.1% elapsed)

Alloc rate    716,368,278 bytes per MUT second

Productivity  34.2% of total user, 32.7% of total elapsed

以下是结果

$ ./main +RTS -p

第一次运行时，发现有一个函数被重复调用，结果我可以记忆它，这使事情加速了 2 倍。它没有解决空间泄漏，但是。

COST CENTRE           MODULE                no. entries  %time %alloc   %time %alloc

MAIN                  MAIN                    1        0   0.0    0.0   100.0  100.0
 main                 Main                  434        4   0.0    0.0   100.0  100.0
  likelihoodWeighting AI.Probability.Bayes  445        1   0.0    0.3   100.0  100.0
   distributionLW     AI.Probability.Bayes  448        1   0.0    2.6     0.0    2.6
   getSampleLW        AI.Probability.Bayes  446   100000  20.0   50.4   100.0   97.1
    bnProb            AI.Probability.Bayes  458   400000   0.0    0.0     0.0    0.0
    bnCond            AI.Probability.Bayes  457   400000   6.7    0.8     6.7    0.8
    bnVals            AI.Probability.Bayes  455   400000  20.0    6.3    26.7    7.1
     bnParents        AI.Probability.Bayes  456   400000   6.7    0.8     6.7    0.8
    bnSubRef          AI.Probability.Bayes  454   800000  13.3   13.5    13.3   13.5
    weightedSample    AI.Probability.Bayes  447   100000  26.7   23.9    33.3   25.3
     bnProb           AI.Probability.Bayes  453   100000   0.0    0.0     0.0    0.0
     bnCond           AI.Probability.Bayes  452   100000   0.0    0.2     0.0    0.2
     bnVals           AI.Probability.Bayes  450   100000   0.0    0.3     6.7    0.5
      bnParents       AI.Probability.Bayes  451   100000   6.7    0.2     6.7    0.2
     bnSubRef         AI.Probability.Bayes  449   200000   0.0    0.7     0.0    0.7

这是一个堆配置文件。我不知道为什么它声称运行时间是 1.8 秒 - 这次运行大约需要 6 秒。

谁能帮我解释分析器的输出 - 即确定瓶颈在哪里，并就如何加快速度提供建议？

【问题讨论】：

尝试使用replicateM n 而不是sequence . replicate n
运行时间没有变化 - 可能并不奇怪，因为 replicateM n 是 defined as sequence . replicate n。
length xs 总是n，不是吗？因此，将(length xs) 替换为n，然后xs 不必随时完全驻留在内存中。
@dave4420 也可能触发列表融合优化，因为 sum 是列表使用者，replicateM 是列表生成器。这也将消除对额外严格性的需求。
@PeterHall 您将同一个生成器传递给对randomR 的每次调用，因此它每次都会生成相同的“随机”数字。您可以尝试类似（我没有测试过）main = do { a <- sampleMean 10000 (randomRIO (0.0,1.0)); print a }

标签： performance haskell strictness

【解决方案1】：

通过将 JohnL's suggestion 的使用 foldM 合并到 likelihoodWeighting 中已经实现了巨大的改进。这将内存使用量减少了大约十倍，并将 GC 时间显着降低到几乎或实际上可以忽略不计。

使用当前源运行分析会产生

probabilityIO              AI.Util.Util          26.1   42.4    413 290400000
weightedSample.go          AI.Probability.Bayes  16.1   19.1    255 131200080
bnParents                  AI.Probability.Bayes  10.8    1.2    171   8000384
bnVals                     AI.Probability.Bayes  10.4    7.8    164  53603072
bnCond                     AI.Probability.Bayes   7.9    1.2    125   8000384
ndSubRef                   AI.Util.Array          4.8    9.2     76  63204112
bnSubRef                   AI.Probability.Bayes   4.7    8.1     75  55203072
likelihoodWeighting.func   AI.Probability.Bayes   3.3    2.8     53  19195128
%!                         AI.Util.Util           3.3    0.5     53   3200000
bnProb                     AI.Probability.Bayes   2.5    0.0     40        16
bnProb.p                   AI.Probability.Bayes   2.5    3.5     40  24001152
likelihoodWeighting        AI.Probability.Bayes   2.5    2.9     39  20000264
likelihoodWeighting.func.x AI.Probability.Bayes   2.3    0.2     37   1600000

-s 报告的内存使用量为 13MB，最大驻留量约为 5MB。这已经不算太糟糕了。

不过，我们还有一些地方可以改进。首先，一个相对较小的事情，在宏伟的计划中，AI.UTIl.Array.ndSubRef：

ndSubRef :: [Int] -> Int
ndSubRef ns = sum $ zipWith (*) (reverse ns) (map (2^) [0..])

反转列表，将(2^) 映射到另一个列表是低效的，更好的是

ndSubRef = L.foldl' (\a d -> 2*a + d) 0

它不需要将整个列表保留在内存中（可能不是什么大问题，因为列表会很短），因为它可以反转它，并且不需要分配第二个列表。分配的减少是显着的，大约 10%，而且这部分的运行速度明显加快，

ndSubRef                   AI.Util.Array          1.7    1.3     24   8000384

在修改后的profile中运行，但由于只占用整体时间的一小部分，所以整体影响很小。 weightedSample 和 likelihoodWeighting 可能有更大的鱼要炸。

让我们在 weightedSample 中添加一些严格性，看看这会如何改变事情：

weightedSample :: Ord e => BayesNet e -> [(e,Bool)] -> IO (Map e Bool, Prob)
weightedSample bn fixed =
    go 1.0 (M.fromList fixed) (bnVars bn)
    where
        go w assignment []     = return (assignment, w)
        go w assignment (v:vs) = if v `elem` vars
            then
                let w' = w * bnProb bn assignment (v, fixed %! v)
                in go w' assignment vs
            else do
                let p = bnProb bn assignment (v,True)
                x <- probabilityIO p
                go w (M.insert v x assignment) vs

        vars = map fst fixed

go 的权重参数从不强制，赋值参数也不强制，因此它们可以建立 thunk。让我们启用{-# LANGUAGE BangPatterns #-} 并强制更新立即生效，同时评估p，然后将其传递给probabilityIO：

go w assignment (v:vs) = if v `elem` vars
    then
        let !w' = w * bnProb bn assignment (v, fixed %! v)
        in go w' assignment vs
    else do
        let !p = bnProb bn assignment (v,True)
        x <- probabilityIO p
        let !assignment' = M.insert v x assignment
        go w assignment' vs

这带来了分配的进一步减少 (~9%) 和小幅加速 (~%13%)，但总内存使用量和最大驻留没有太大变化。

我没有看到其他明显的变化，所以让我们看看likelihoodWeighting：

func m _ = do
    (a, w) <- weightedSample bn fixed
    let x = a ! e
    return $! x `seq` w `seq` M.adjust (+w) x m

在最后一行，首先，w 现在已经在weightedSample 中求值了，所以我们这里不需要seq 它，需要x 键来求值更新后的地图，所以@ 987654346@ing 这也没有必要。那条线上的坏事是M.adjust。 adjust 无法强制更新函数的结果，因此会在地图的值中构建 thunk。您可以通过查找修改后的值并强制执行对 thunk 的评估，但 Data.Map 在这里提供了一种更方便的方法，因为保证存在更新地图的键 insertWith'：

func !m _ = do
    (a, w) <- weightedSample bn fixed
    let x = a ! e
    return (M.insertWith' (+) x w m)

（注意：GHC 在m 上使用爆炸模式比在return $! ... 上优化更好）。这略微减少了总分配并且不会显着改变运行时间，但对使用的总内存和最大驻留时间有很大影响：

 934,566,488 bytes allocated in the heap
   1,441,744 bytes copied during GC
      68,112 bytes maximum residency (1 sample(s))
      23,272 bytes maximum slop
           1 MB total memory in use (0 MB lost due to fragmentation)

最大的运行时间改进是避免randomIO，使用的StdGen非常慢。

我很惊讶bn* 函数花费了多少时间，但没有发现任何明显的低效率。

【讨论】：

谢谢丹尼尔，这真的很有用。我已经做出了您建议的更改，我将研究您摆脱 randomIO 的想法。我觉得我现在离编写一个“真正的”Haskell 程序更近了……

【解决方案2】：

我无法消化这些个人资料，但我之前被踢过，因为 Hackage 上的 MonadRandom 是严格。创建 MonadRandom 的惰性版本让我的记忆问题消失了。

我的同事尚未获得发布代码的许可，但我已将Control.Monad.LazyRandom 在线发送至pastebin。或者，如果您想查看一些解释完全惰性随机搜索的摘录，包括无限的随机计算列表，请查看 Experience Report: Haskell in Computational Biology。

【讨论】：

+1 用于提出不那么严格的实现，而不是常规建议使其更严格
谢谢，这是一个有趣的想法。 Daniel Fischer 建议我摆脱对randomIO 的调用，因此我认为让我的采样函数返回Rand a，然后使用evalRandIO 调用它们是可行的方法。
@Chris：我已经把我们的论文放到网上了；它有几个使用Rand a 计算的示例。我们只在顶层调用一次evalRand，或者在 fork 并行计算列表时调用一次。

【解决方案3】：

我整理了一个非常基本的示例，发布在此处：http://hpaste.org/71919。我不确定它是否像您的示例一样......只是一个似乎有效的非常小的东西。

使用 -prof 和 -fprof-auto 编译并运行 100000 次迭代会产生以下分析输出的头部（请原谅我的行号）：

  8 COST CENTRE                   MODULE               %time %alloc
  9 
 10 sample                        AI.Util.ProbDist      31.5   36.6
 11 bnParents                     AI.Probability.Bayes  23.2    0.0
 12 bnRank                        AI.Probability.Bayes  10.7   23.7
 13 weightedSample.go             AI.Probability.Bayes   9.6   13.4
 14 bnVars                        AI.Probability.Bayes   8.6   16.2
 15 likelihoodWeighting           AI.Probability.Bayes   3.8    4.2
 16 likelihoodWeighting.getSample AI.Probability.Bayes   2.1    0.7
 17 sample.cumulative             AI.Util.ProbDist       1.7    2.1
 18 bnCond                        AI.Probability.Bayes   1.6    0.0
 19 bnRank.ps                     AI.Probability.Bayes   1.1    0.0

以下是汇总统计数据：

1,433,944,752 bytes allocated in the heap
 1,016,435,800 bytes copied during GC
   176,719,648 bytes maximum residency (11 sample(s))
     1,900,232 bytes maximum slop
           400 MB total memory in use (0 MB lost due to fragmentation)

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    1.40s  (  1.41s elapsed)
GC      time    1.08s  (  1.24s elapsed)
Total   time    2.47s  (  2.65s elapsed)

%GC     time      43.6%  (46.8% elapsed)

Alloc rate    1,026,674,336 bytes per MUT second

Productivity  56.4% of total user, 52.6% of total elapsed

请注意，分析器将其指向sample。我通过使用$! 在该函数中强制使用return，以下是之后的一些汇总统计数据：

1,776,908,816 bytes allocated in the heap
  165,232,656 bytes copied during GC
   34,963,136 bytes maximum residency (7 sample(s))
      483,192 bytes maximum slop
           68 MB total memory in use (0 MB lost due to fragmentation)

INIT    time    0.00s  (  0.00s elapsed)
MUT     time    2.42s  (  2.44s elapsed)
GC      time    0.21s  (  0.23s elapsed)
Total   time    2.63s  (  2.68s elapsed)

%GC     time       7.9%  (8.8% elapsed)

Alloc rate    733,248,745 bytes per MUT second

Productivity  92.1% of total user, 90.4% of total elapsed

在 GC 方面效率更高，但当时并没有太大变化。您也许可以继续以这种配置文件/调整方式进行迭代，以针对您的瓶颈并获得更好的性能。

【讨论】：

非常感谢！坦率地说，我印象深刻的是，您阅读了足够多的糟糕代码来弄清楚如何组合示例......
快速问题 - 我正在使用 -prof -auto-all 编译我的测试脚本，但如果我添加明确的 {-# SCC #-} 注释，我只会得到成本中心的细分。你是如何自动获得它们的？我尝试使用-prof -fprof-auto 进行编译，但 GCH 无法识别第二个标志。
您必须使用较旧的 GHC - 我使用的是 7.4.1，它支持 these profiling options。 7.0.2支持these；看起来-auto-all 忽略了嵌套函数。
我一定是在做一些愚蠢的事情——我正在用cabal install -p 安装库，然后用ghc -rtsopts -prof -fprof-auto --fforce-recomp -O2 --make main.hs 编译我的测试文件，但我仍然没有得到详细的配置文件信息。有什么明显的我做错了吗？
嗯。看起来不错，而且这种方法确实在我身上吐出了the desired output。不知道那里发生了什么？

【解决方案4】：

我认为您的初步诊断是正确的，而且我从未见过在记忆效应开始后有用的分析报告。

问题是您要遍历列表两次，一次用于sequence，另一次用于sum。在 Haskell 中，大型列表的多个列表遍历对性能非常非常不利。解决方法一般是使用某种类型的折叠，比如foldM。你的sampleMean 函数可以写成

{-# LANGUAGE BangPatterns #-}

sampleMean2 :: MonadRandom m => Int -> m Float -> m Float
sampleMean2 n dist = foldM (\(!a) mb -> liftM (+a) mb) 0 $ replicate n dist

例如，只遍历列表一次。

你也可以用likelihoodWeighting 做同样的事情。为了防止重击，确保折叠函数中的累加器具有适当的严格性很重要。

【讨论】：

您的示例没有进行类型检查。 step 函数应该看起来更像liftM . (+)。它也可以更严格一些；使用randomRIO 进行测试往往会溢出较大数字的堆栈，可能类似于\a mb -> mb >>= \b -> return $! a + b?
谢谢，我会重写 likelihoodWeighting 并折叠。
@Vitus - 谢谢，已修复。我通常更喜欢使用 bang 模式来增加严格性，但你的建议可能是相同的。