Haskell 中的堆溢出答案

【问题标题】：Heap overflow in HaskellHaskell 中的堆溢出
【发布时间】：2015-12-07 20:17:22
【问题描述】：

在足够大的数据集上运行以下简短的 Haskell 程序时，我收到了 Heap exhausted 消息。例如，程序在大约 900k 行的 20 Mb 输入文件上失败（堆溢出）。堆大小（通过-with-rtsopts）设置为1 Gb。如果longestCommonSubstrB 被定义为更简单的东西，它运行正常，例如commonPrefix。我需要处理大约 100 Mb 的文件。

我使用以下命令行（GHC 7.8.3）编译了程序：

ghc -Wall -O2 -prof -fprof-auto "-with-rtsopts=-M512M -p -s -h -i0.1" SampleB.hs

如果能帮助我使这个东西在合理的空间中运行（按输入文件大小的顺序），我将不胜感激，但我尤其希望找到瓶颈在哪里以及在哪里以及如何强制执行的思考过程严格。

我的猜测是，以某种方式强制 longestCommonSubstrB 函数严格评估会解决问题，但我不知道该怎么做。

{-# LANGUAGE BangPatterns #-}
module Main where 
import System.Environment (getArgs)
import qualified Data.ByteString.Lazy.Char8 as B
import Data.List (maximumBy, sort)
import Data.Function (on)
import Data.Char (isSpace)

-- | Returns a list of lexicon items, i.e. [[w1,w2,w3]]
readLexicon :: FilePath -> IO [[B.ByteString]]
readLexicon filename = do 
    text <- B.readFile filename
    return $ map (B.split '\t' . stripR) . B.lines $ text
    where
        stripR = B.reverse . B.dropWhile isSpace . B.reverse

transformOne :: [B.ByteString] -> B.ByteString
transformOne (w1:w2:w3:[]) = 
    B.intercalate (B.pack "|") [w1, longestCommonSubstrB w2 w1, w3]
transformOne a = error $ "transformOne: unexpected tuple " ++ show a

longestCommonSubstrB :: B.ByteString -> B.ByteString -> B.ByteString
longestCommonSubstrB xs ys = maximumBy (compare `on` B.length) . concat $ 
    [f xs' ys | xs' <- B.tails xs] ++
    [f xs ys' | ys' <- tail $ B.tails ys]
  where f xs' ys' = scanl g B.empty $ B.zip xs' ys'
        g z (x, y) = if x == y 
                then z `B.snoc` x
                else B.empty

main :: IO ()
main = do
    (input:output:_) <- getArgs
    lexicon <- readLexicon input
    let flattened = B.unlines . sort . map transformOne $ lexicon
    B.writeFile output flattened

这是测试数据集的配置文件输出（100k 行，堆大小设置为 1 GB，即generateSample.exe 100000，生成的文件大小为 2.38 MB）：

随时间变化的堆配置文件：

执行统计：

   3,505,737,588 bytes allocated in the heap
     785,283,180 bytes copied during GC
      62,390,372 bytes maximum residency (44 sample(s))
         216,592 bytes maximum slop
              96 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0      6697 colls,     0 par    1.05s    1.03s     0.0002s    0.0013s
  Gen  1        44 colls,     0 par    4.14s    3.99s     0.0906s    0.1935s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    7.80s  (  9.17s elapsed)
  GC      time    3.75s  (  3.67s elapsed)
  RP      time    0.00s  (  0.00s elapsed)
  PROF    time    1.44s  (  1.35s elapsed)
  EXIT    time    0.02s  (  0.00s elapsed)
  Total   time   13.02s  ( 12.85s elapsed)

  %GC     time      28.8%  (28.6% elapsed)

  Alloc rate    449,633,678 bytes per MUT second

  Productivity  60.1% of total user, 60.9% of total elapsed

时间和分配分析报告：

       SampleB.exe +RTS -M1G -p -s -h -i0.1 -RTS sample.txt sample_out.txt

    total time  =        3.97 secs   (3967 ticks @ 1000 us, 1 processor)
    total alloc = 2,321,595,564 bytes  (excludes profiling overheads)

COST CENTRE            MODULE  %time %alloc

longestCommonSubstrB   Main     43.3   33.1
longestCommonSubstrB.f Main     21.5   43.6
main.flattened         Main     17.5    5.1
main                   Main      6.6    5.8
longestCommonSubstrB.g Main      5.0    5.8
readLexicon            Main      2.5    2.8
transformOne           Main      1.8    1.7
readLexicon.stripR     Main      1.8    1.9


                                                                            individual     inherited
COST CENTRE                  MODULE                       no.     entries  %time %alloc   %time %alloc

MAIN                         MAIN                          45           0    0.1    0.0   100.0  100.0
 main                        Main                          91           0    6.6    5.8    99.9  100.0
  main.flattened             Main                          93           1   17.5    5.1    89.1   89.4
   transformOne              Main                          95      100000    1.8    1.7    71.6   84.3
    longestCommonSubstrB     Main                         100      100000   43.3   33.1    69.8   82.5
     longestCommonSubstrB.f  Main                         101     1400000   21.5   43.6    26.5   49.5
      longestCommonSubstrB.g Main                         104     4200000    5.0    5.8     5.0    5.8
  readLexicon                Main                          92           1    2.5    2.8     4.2    4.8
   readLexicon.stripR        Main                          98           0    1.8    1.9     1.8    1.9
 CAF                         GHC.IO.Encoding.CodePage      80           0    0.0    0.0     0.0    0.0
 CAF                         GHC.IO.Encoding               74           0    0.0    0.0     0.0    0.0
 CAF                         GHC.IO.FD                     70           0    0.0    0.0     0.0    0.0
 CAF                         GHC.IO.Handle.FD              66           0    0.0    0.0     0.0    0.0
 CAF                         System.Environment            65           0    0.0    0.0     0.0    0.0
 CAF                         Data.ByteString.Lazy.Char8    54           0    0.0    0.0     0.0    0.0
 CAF                         Main                          52           0    0.0    0.0     0.0    0.0
  transformOne               Main                          99           0    0.0    0.0     0.0    0.0
  readLexicon                Main                          96           0    0.0    0.0     0.0    0.0
   readLexicon.stripR        Main                          97           1    0.0    0.0     0.0    0.0
  main                       Main                          90           1    0.0    0.0     0.0    0.0

更新：以下程序可用于生成示例数据。它需要一个参数，即生成的数据集中的行数。生成的数据将保存到sample.txt 文件中。当我用它生成 900k 行数据集时（通过运行generateSample.exe 900000），生成的数据集使上述程序因堆溢出而失败（堆大小设置为 1 GB）。生成的数据集约为 20 MB。

module Main where 
import System.Environment (getArgs)
import Data.List (intercalate, permutations)

generate :: Int -> [(String,String,String)]
generate n = take n $ zip3 (f "banana") (f "ruanaba") (f "kikiriki")
    where
        f = cycle . permutations

main :: IO ()
main = do
    (n:_) <- getArgs
    let flattened = unlines . map f $ generate (read n :: Int)
    writeFile "sample.txt" flattened 
    where
        f (w1,w2,w3) = intercalate "\t" [w1, w2, w3]

【问题讨论】：

好吧，sort 不能在恒定空间中运行：它需要在产生任何输出之前消耗（并保留）其全部输入。
虽然这次我认为 GHC 与问题无关，但您应该始终在问题文本中包含 GHC 版本以及分析器报告。
@dfeuer GHC 版本 7.8.3
@ReidBarton 我知道，我的意思是说，在我看来，像longestCommonSubstrB 这样的函数的空间要求应该是空间有限的，即不依赖于输入文件的大小，而只依赖于关于其参数的大小。当然，懒惰是有影响的，但也应该有办法控制它。
为什么您认为这里不是这种情况？您显示的分析输出是关于程序中每个函数分配的内存总量。它没有说明这些分配的生命周期或任何时候堆上存在多少数据（“空间要求”）。

标签： haskell heap-memory overflow lazy-evaluation strict

【解决方案1】：

在我看来，您实现了一个幼稚的最长公共子字符串，具有可怕的空间复杂度（至少 O(n^2)）。严格与它无关。

您需要实现一个动态编程算法。您可以在 string-similarity 包中找到灵感，或者在 Diff 包中的 lcs 函数中找到灵感。

【讨论】：