Haskell：涉及大型列表的计算中的意外时间复杂度答案

【问题标题】：Haskell: unexpected time-complexity in the computation involving large listsHaskell：涉及大型列表的计算中的意外时间复杂度
【发布时间】：2012-11-08 16:44:36
【问题描述】：

我正在处理作为中间结果列表 A=[B] 的计算，它是长度为 L 的 K 个列表的列表。计算 B 的元素的时间复杂度由参数控制M 并且理论上在 M 中是线性的。理论上我希望计算 A 的时间复杂度为 O(K*L*M)。但是，事实并非如此，我不明白为什么？

这是一个简单的完整草图程序，展示了我已经解释过的问题

import System.Random (randoms, mkStdGen)
import Control.Parallel.Strategies (parMap, rdeepseq)
import Control.DeepSeq (NFData)
import Data.List (transpose)

type Point = (Double, Double)

fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b 
         | otherwise = if a < b then a 
                       else let q = a / b in b * (q - fromIntegral (floor q))

standardMap :: Double -> Point -> Point
standardMap k (q, p) = (fmod (q + p) (2 * pi), fmod (p + k * sin(q)) (2 * pi))

trajectory :: (Point -> Point) -> Point -> [Point] 
trajectory map initial = initial : (trajectory map $ map initial)

justEvery :: Int -> [a] -> [a]
justEvery n (x:xs) = x : (justEvery n $ drop (n-1) xs)
justEvery _ []     = []

subTrace :: Int -> Int -> [a] -> [a]
subTrace n m = take (n + 1) . justEvery m

ensemble :: Int -> [Point]
ensemble n = let qs = randoms (mkStdGen 42)
                 ps = randoms (mkStdGen 21)
             in take n $ zip qs ps 

ensembleTrace :: NFData a => (Point -> [Point]) -> (Point -> a) -> 
                              Int -> Int -> [Point] -> [[a]]
ensembleTrace orbitGen observable n m = 
    parMap rdeepseq ((map observable . subTrace n m) . orbitGen)

main = let k = 100
           l = 100
           m = 100
           orbitGen = trajectory (standardMap 7)
           observable (p, q) = p^2 - q^2
           initials = ensemble k
           mean xs = (sum xs) / (fromIntegral $ length xs)
           result =   (map mean) 
                    $ transpose 
                    $ ensembleTrace orbitGen observable l m 
                    $ initials
       in mapM_ print result

我正在编译

$ ghc -O2 stdmap.hs -threaded

并与

一起运行

$ ./stdmap +RTS -N4 > /dev/null

在intel Q6600、Linux 3.6.3-1-ARCH、GHC 7.6.1 上得到以下结果对于不同的参数集K、L、M（程序代码中的k、l、m）

(K=200,L=200,N=200)   -> real    0m0.774s
                         user    0m2.856s
                         sys     0m0.147s

(K=2000,L=200,M=200)  -> real    0m7.409s
                         user    0m28.102s
                         sys     0m1.080s

(K=200,L=2000,M=200)  -> real    0m7.326s
                         user    0m27.932s
                         sys     0m1.020s

(K=200,L=200,M=2000)  -> real    0m10.581s
                         user    0m38.564s
                         sys     0m3.376s

(K=20000,L=200,M=200) -> real    4m22.156s
                         user    7m30.007s
                         sys     0m40.321s

(K=200,L=20000,M=200) -> real    1m16.222s
                         user    4m45.891s
                         sys     0m15.812s

(K=200,L=200,M=20000) -> real    8m15.060s
                         user    23m10.909s
                         sys     9m24.450s

我不太明白这种纯粹的缩放问题可能出在哪里。如果我理解正确，列表是惰性的，不应该构造，因为它们是在头尾方向上消耗的？从测量中可以看出，过多的实时消耗和过多的系统时间消耗之间存在相关性，因为多余的将在系统帐户上。但是如果有一些内存管理浪费时间，这仍然应该在 K、L、M 中线性缩放。

救命！

编辑

我根据 Daniel Fisher 给出的建议对代码进行了更改，确实解决了 M 的不良缩放问题。正如所指出的，通过在轨迹中强制进行严格的评估，我们避免了大型 thunk 的构建。我理解这背后的性能改进，但我仍然不理解原始代码的不良缩放，因为（如果我理解正确的话）thunk 构造的时空复杂度应该在 M 中是线性的？

此外，我仍然无法理解关于 K（集合的大小）的不良缩放。我使用改进的代码对 K=8000 和 K=16000 进行了两次额外的测量，保持 L=200，M=200。放大到 K=8000 符合预期，但对于 K=16000，它已经不正常了。问题似乎出在overflowedSPARKS 的数量上，K=8000 为 0，K=16000 为 7802。这可能反映在糟糕的并发性上，我将其量化为商Q = (MUT cpu time) / (MUT real time)，理想情况下它等于 CPU 的数量。但是，对于 K = 8000，Q ~ 4，对于 K = 16000，Q ~ 2。请帮助我了解此问题的根源和可能的解决方案。

K = 8000:

$ ghc -O2 stmap.hs -threaded -XBangPatterns
$ ./stmap +RTS -s -N4 > /dev/null

56,905,405,184 bytes allocated in the heap
 503,501,680 bytes copied during GC
  53,781,168 bytes maximum residency (15 sample(s))
   6,289,112 bytes maximum slop
         151 MB total memory in use (0 MB lost due to fragmentation)

                                Tot time (elapsed)  Avg pause  Max pause
Gen  0     27893 colls, 27893 par    7.85s    1.99s     0.0001s    0.0089s
Gen  1        15 colls,    14 par    1.20s    0.30s     0.0202s    0.0558s

Parallel GC work balance: 23.49% (serial 0%, perfect 100%)

TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)

SPARKS: 8000 (8000 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT    time    0.00s  (  0.00s elapsed)
MUT     time   95.90s  ( 24.28s elapsed)
GC      time    9.04s  (  2.29s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time  104.95s  ( 26.58s elapsed)

Alloc rate    593,366,811 bytes per MUT second

Productivity  91.4% of total user, 360.9% of total elapsed

gc_alloc_block_sync: 315819

和

K = 16000:

$ ghc -O2 stmap.hs -threaded -XBangPatterns
$ ./stmap +RTS -s -N4 > /dev/null

113,809,786,848 bytes allocated in the heap
 1,156,991,152 bytes copied during GC
  114,778,896 bytes maximum residency (18 sample(s))
    11,124,592 bytes maximum slop
      300 MB total memory in use (0 MB lost due to fragmentation)

                                Tot time (elapsed)  Avg pause  Max pause
Gen  0     135521 colls, 135521 par   22.83s    6.59s     0.0000s    0.0190s
Gen  1        18 colls,    17 par    2.72s    0.73s     0.0405s    0.1692s

Parallel GC work balance: 18.05% (serial 0%, perfect 100%)

TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)

SPARKS: 16000 (8198 converted, 7802 overflowed, 0 dud, 0 GC'd, 0 fizzled)

INIT    time    0.00s  (  0.00s elapsed)
MUT     time  221.77s  (139.78s elapsed)
GC      time   25.56s  (  7.32s elapsed)
EXIT    time    0.00s  (  0.00s elapsed)
Total   time  247.34s  (147.10s elapsed)

Alloc rate    513,176,874 bytes per MUT second

Productivity  89.7% of total user, 150.8% of total elapsed

gc_alloc_block_sync: 814824

【问题讨论】：

你还没有在ensembleTrace orbitGen observable n m中定义n。
a) trajectory = iterate, b) m 中的不良缩放是因为当您在评估的任何两个之间跳过许多元素时，您会在 trajectory 中构建巨大的 thunk。

标签： performance list haskell ghc time-complexity

【解决方案1】：

M. A. D.关于fmod的观点很好，但没必要喊C，我们可以更好地留在Haskell土地上（ticket链接线程是同时固定的）。麻烦在

fmod :: Double -> Double -> Double
fmod a b | a < 0     = b - fmod (abs a) b 
         | otherwise = if a < b then a 
                       else let q = a / b in b * (q - fromIntegral (floor q))

是类型默认导致floor :: Double -> Integer（因此fromIntegral :: Integer -> Double）被调用。现在Integer是比较复杂的类型，运算比较慢，Integer到Double的转换也比较复杂。原始代码（带有参数k = l = 200 和m = 5000）产生了统计数据

./nstdmap +RTS -s -N2 > /dev/null
  60,601,075,392 bytes allocated in the heap
  36,832,004,184 bytes copied during GC
       2,435,272 bytes maximum residency (13741 sample(s))
         887,768 bytes maximum slop
               9 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     46734 colls, 46734 par   41.66s   20.87s     0.0004s    0.0058s
  Gen  1     13741 colls, 13740 par   23.18s   11.62s     0.0008s    0.0041s

  Parallel GC work balance: 60.58% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 200 (200 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   34.99s  ( 17.60s elapsed)
  GC      time   64.85s  ( 32.49s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   99.84s  ( 50.08s elapsed)

  Alloc rate    1,732,048,869 bytes per MUT second

  Productivity  35.0% of total user, 69.9% of total elapsed

在我的机器上（-N2 因为我只有两个物理内核）。只需更改代码以使用类型签名floor q :: Int 即可将其降低到

./nstdmap +RTS -s -N2 > /dev/null
  52,105,495,488 bytes allocated in the heap
  29,957,007,208 bytes copied during GC
       2,440,568 bytes maximum residency (10481 sample(s))
         893,224 bytes maximum slop
               8 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     36979 colls, 36979 par   32.96s   16.51s     0.0004s    0.0066s
  Gen  1     10481 colls, 10480 par   16.65s    8.34s     0.0008s    0.0018s

  Parallel GC work balance: 68.64% (serial 0%, perfect 100%)

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)

  SPARKS: 200 (200 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.01s  (  0.01s elapsed)
  MUT     time   29.78s  ( 14.94s elapsed)
  GC      time   49.61s  ( 24.85s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   79.40s  ( 39.80s elapsed)

  Alloc rate    1,749,864,775 bytes per MUT second

  Productivity  37.5% of total user, 74.8% of total elapsed

经过时间减少约 20%，MUT 时间减少 13%。不错。如果我们查看通过优化获得的floor 的代码，我们就会明白原因：

floorDoubleInt :: Double -> Int
floorDoubleInt (D# x) =
    case double2Int# x of
      n | x <## int2Double# n   -> I# (n -# 1#)
        | otherwise             -> I# n

floorDoubleInteger :: Double -> Integer
floorDoubleInteger (D# x) =
    case decodeDoubleInteger x of
      (# m, e #)
        | e <# 0#   ->
          case negateInt# e of
            s | s ># 52#    -> if m < 0 then (-1) else 0
              | otherwise   ->
                case TO64 m of
                  n -> FROM64 (n `uncheckedIShiftRA64#` s)
        | otherwise -> shiftLInteger m e

floor :: Double -> Int 只是使用机器转换，而floor :: Double -> Integer 需要一个昂贵的decodeDoubleInteger 和更多的分支。但是在这里调用floor的地方，我们知道所有涉及的Doubles都是非负的，因此floor和truncate是一样的，直接映射到机器转换double2Int#，所以让我们试试这个而不是floor:

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   29.29s  ( 14.70s elapsed)
  GC      time   49.17s  ( 24.62s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   78.45s  ( 39.32s elapsed)

一个非常小的减少（可以预料，fmod 并不是真正的瓶颈）。为了比较，呼唤C：

  INIT    time    0.01s  (  0.01s elapsed)
  MUT     time   31.46s  ( 15.78s elapsed)
  GC      time   54.05s  ( 27.06s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   85.52s  ( 42.85s elapsed)

有点慢（不出所料，您可以在调用 C 的时间内执行许多 primops）。

但这不是大鱼游泳的地方。不好的是，只选择轨迹中的每个m-th 元素会导致大量的重击，从而导致大量分配，并且在时机成熟时需要很长时间来评估。因此，让我们消除泄漏并使轨迹严格：

{-# LANGUAGE BangPatterns #-}

trajectory :: (Point -> Point) -> Point -> [Point] 
trajectory map !initial@(!a,!b) = initial : (trajectory map $ map initial)

这大大减少了分配和 GC 时间，因此也减少了 MUT 时间：

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   21.83s  ( 10.95s elapsed)
  GC      time    0.72s  (  0.36s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   22.55s  ( 11.31s elapsed)

与原fmod，

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   18.26s  (  9.18s elapsed)
  GC      time    0.58s  (  0.29s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   18.84s  (  9.47s elapsed)

floor q :: Int，truncate q :: Int 的测量精度相同（truncate 的分配数字略低）。

问题似乎在于溢出的 SPARKS 的数量，对于 K=8000 为 0，对于 K=16000 为 7802。这可能反映在糟糕的并发性上

是的（尽管据我所知，这里更正确的术语是并行性而不是并发性），有一个火花池，当它已满时，不会安排任何进一步的火花在接下来有时间的任何线程中进行评估当轮到它时，计算是在没有并行性的情况下从父线程完成的。在这种情况下，这意味着在初始并行阶段之后，计算会退回到顺序。

火花池的大小显然约为 8K (2^13)。

如果您通过顶部观察 CPU 负载，您会看到它在一段时间后从 (close to 100%)*(number of cores) 下降到一个低得多的值（对我来说，-N2 约为 100%，@987654360 约为 130% @)。

解决方法是避免火花过多，让每个火花做更多的工作。快速而肮脏的修改

ensembleTrace orbitGen observable n m =
    withStrategy (parListChunk 25 rdeepseq) . map ((map observable . subTrace n m) . orbitGen)

我在 -N2 的情况下恢复到 200%，几乎整个运行过程和良好的生产力，

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   57.42s  ( 29.02s elapsed)
  GC      time    5.34s  (  2.69s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time   62.76s  ( 31.71s elapsed)

  Alloc rate    1,982,155,167 bytes per MUT second

  Productivity  91.5% of total user, 181.1% of total elapsed

使用-N4 也很好（甚至在挂钟上稍微快一点 - 不多，因为所有线程的工作基本相同，而且我只有 2 个物理内核），

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time   99.17s  ( 26.31s elapsed)
  GC      time   16.18s  (  4.80s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time  115.36s  ( 31.12s elapsed)

  Alloc rate    1,147,619,609 bytes per MUT second

  Productivity  86.0% of total user, 318.7% of total elapsed

因为现在火花池没有溢出。

正确的解决方法是使块的大小成为根据轨迹和可用核心数计算的参数，以便火花数不超过池大小。

【讨论】：

我想知道如果像 floor、ceil 等函数在同一类型中给出结果不是更好。 floor :: RealFrac a => a -> a
感谢您提出的改进建议。我不知道 BangPatterns 把戏，我正在努力寻找最非侵入性的方法来使trajectory 更严格。但是我仍然有未解决的问题。请查看问题更新。
@BenjaminBatistic 详细阐述了火花池并添加了治疗方法。
@DanielFischer 谢谢，parListChunk 解决了这个问题。在我的情况下，如果parListChunk n _ 中的n 对应于块的数量而不是块中的元素数量会更好，因为块是并行执行的。一般情况下不是更好吗？我仍然不明白原始代码中关于 M 的块创建的时空复杂性？
@BenjaminBatistic 我认为parListChunk 将块的长度作为参数而不是块的数量，因为通常这意味着首先必须遍历列表以找出每个块的长度成为。这可能需要评估列表元素，挫败并行化的尝试，当然对于无限列表根本不起作用。在您的情况下，您知道列表的长度，因此您可以从中计算块的长度 - 编码不太方便，但可以工作™。

【解决方案2】：

在做了一些快速分析后，我发现这些是连环犯罪者：

ghc --make -O2 MainNonOpt.hs -threaded -prof -auto-all -caf-all -fforce-recomp
./MainNonOpt +RTS -N4 -p > /dev/null

>>>
COST CENTRE MODULE  %time %alloc
fmod        Main     46.3   33.3
standardMap Main     28.5    0.0
trajectory  Main     23.8   66.6

考虑到 fmod 主要是一个数值函数，它的大量分配令人惊讶。所以下一步就是对 fmod 进行注解，看看问题出在哪里：

fmod :: Double -> Double -> Double
fmod a b | a < 0     = {-# SCC "negbranch" #-} b - fmod (abs a) b 
         | otherwise = {-# SCC "posbranch" #-} if a < b then a 
                       else let q = {-# SCC "division" #-} a / b in {-# SCC "expression" #-} b * (q - {-# SCC "floor" #-} fromIntegral (floor q))

这给了我们：

ghc --make -O2 MainNonOpt.hs -threaded -prof -caf-all -fforce-recomp
./MainNonOpt +RTS -N4 -p > /dev/null

COST CENTRE MODULE  %time %alloc

MAIN        MAIN     61.5   70.0
posbranch   Main     16.6    0.0
floor       Main     14.9   30.0
expression  Main      4.5    0.0
negbranch   Main      1.9    0.0

所以floor 是导致问题的原因。环顾四周后发现，Prelude 没有尽其所能实现一些 Double RealFrac 功能（请参阅here），可能会导致一些装箱/拆箱。

因此，按照链接中的建议，我使用了 floor 的修改版本，这也使得对 fromIntegral 的调用变得不必要：

floor' :: Double -> Double
floor' x = c_floor x
{-# INLINE floor' #-} 

foreign import ccall unsafe "math.h floor" 
   c_floor :: Double -> Double 


fmod :: Double -> Double -> Double
fmod a b | a < 0     = {-# SCC "negbranch" #-} b - fmod (abs a) b
         | otherwise = {-# SCC "posbranch" #-} if a < b then a
                       else let q = {-# SCC "division" #-} a / b in {-# SCC "expression" #-} b * (q - ({-# SCC "floor" #-} floor' q))

编辑：正如 Daniel Fisher 指出的那样，无需内联 C 代码来提高性能。类似的 Haskell 函数已经存在。无论如何我都会留下答案，以供进一步参考。

这确实有所作为。在我的机器上，对于 k=l=200，M=5000 这里是未优化和优化版本的数字：

未优化：

real    0m20.635s
user    1m17.321s
sys     0m4.980s

优化：

real    0m14.858s
user    0m55.271s
sys     0m3.815s

trajectory 函数可能存在类似问题，您可以像上面使用的那样使用分析来查明问题。

可以在 Real World Haskell 的 this chapter 中找到在 Haskell 中进行分析的一个很好的起点。

【讨论】：