Haskell：优化图处理算法答案

【问题标题】：Haskell: Optimising Graph processing algorithmHaskell：优化图处理算法
【发布时间】：2014-06-24 18:12:49
【问题描述】：

这是 this post 的后续版本，代码现在基于 Structuring Depth-First Search Algorithms in Haskell to do depth first search，由 King 和 Launchbury 在 1990 年代编写。那篇论文提出了一种生成和修剪策略，但使用了一个带有 State Monad 的可变数组（我怀疑有些语法已经被弃用了）。作者暗示可以使用一个集合来记住访问过的节点，作为额外 O(log n) 的成本。我尝试使用集合来实现（我们现在拥有比 1990 年代更好的机器！），使用现代 State Monad 语法，并使用向量而不是数组（我读到这通常更好）。

和以前一样，我的代码在小型数据集上运行，但无法返回我需要分析的 5m 边图，我正在寻找仅提示关于大规模运行的弱点.我所知道的是代码在内存中运行舒适，所以这不是问题，但我是否无意中滑到了 O(n2)？（相比之下，本文在 Data.Graph 库中的官方实现（我最近也从中借了一些代码）使用可变数组，但在大数据集上失败了......堆栈溢出！！！）

所以现在我有一个向量数据存储，其中 IntSet 状态未完成，而一个带有 ST Monad 数组“官方”的数组崩溃了！ Haskell 应该能做得比这更好？

import Data.Vector (Vector)
import qualified Data.IntSet as IS
import qualified Data.Vector as V
import qualified Data.ByteString.Char8 as BS
import Control.Monad.State

type Vertex   = Int
type Table a  = Vector a
type Graph    = Table [Vertex]
type Edge     = (Vertex, Vertex)
data Tree a   = Node a (Forest a) deriving (Show,Eq)
type Forest a = [Tree a]
-- ghc -O2 -threaded --make
-- +RTS -Nx
generate :: Graph -> Vertex -> Tree Vertex
generate g v = Node v $ map (generate g) (g V.! v)

chop :: Forest Vertex -> State IS.IntSet (Forest Vertex)
chop [] = return []
chop (Node x ts:us) = do
    visited <- contains x
    if visited then
        chop us
    else do
        include x
        x1 <- chop ts
        x2 <- chop us
        return (Node x x1:x2)

prune :: Forest Vertex -> State IS.IntSet (Forest Vertex)
prune vs = chop vs

main = do
    --edges <- V.fromList `fmap` getEdges "testdata.txt"
    edges <- V.fromList `fmap` getEdges "SCC.txt"
    let 
        -- calculate size of five largest SCC
        maxIndex = fst $ V.last edges
        gr = buildG maxIndex edges
        sccRes = scc gr
        big5 = take 5 sccRes
        big5' = map (\l -> length $ postorder l) big5
    putStrLn $ show $ big5'

contains :: Vertex -> State IS.IntSet Bool
contains v = state $ \visited -> (v `IS.member` visited, visited)

include :: Vertex -> State IS.IntSet ()
include v = state $ \visited -> ((), IS.insert v visited)


getEdges :: String -> IO [Edge]
getEdges path = do
    lines <- (map BS.words . BS.lines) `fmap` BS.readFile path
    let pairs = (map . map) (maybe (error "can't read Int") fst . BS.readInt) lines
    return [(a, b) | [a, b] <- pairs] 

vertices :: Graph -> [Vertex]
vertices gr = [1.. (V.length gr - 1)]

edges :: Graph -> [Edge]
edges g = [(u,v) | u <- vertices g, v <- g V.! u]

-- accumulate :: (a -> b -> a)  -> Vector a-> Vector (Int, b)--> Vector a
-- accumulating function f
-- initial vector (of length m)
-- vector of index/value pairs (of length n)
buildG :: Int -> Table Edge -> Graph
buildG maxIndex edges = graph' where
    graph    = V.replicate (maxIndex + 1) []
    --graph'   = V.accumulate (\existing new -> new:existing) graph edges
    -- flip f takes its (first) two arguments in the reverse order of f
    graph'   = V.accumulate (flip (:)) graph edges

mapT :: Ord a => (Vertex -> a -> b) -> Table a -> Table b
mapT = V.imap

outDegree :: Graph -> Table Int
outDegree g = mapT numEdges g
    where numEdges v es = length es

indegree :: Graph -> Table Int
indegree g = outDegree $ transposeG g

transposeG :: Graph -> Graph
transposeG g = buildG (V.length g - 1) (reverseE g)

reverseE :: Graph -> Table Edge
reverseE g = V.fromList [(w, v) | (v,w) <- edges g]

-- --------------------------------------------------------------

postorder :: Tree a -> [a]
postorder (Node a ts) = postorderF ts ++ [a]

postorderF :: Forest a -> [a]
postorderF ts = concat (map postorder ts)

postOrd :: Graph -> [Vertex]
postOrd g = postorderF (dff g)

dfs :: Graph -> [Vertex] -> Forest Vertex
dfs g vs = map (generate g) vs

dfs' :: Graph -> [Vertex] -> Forest Vertex
dfs' g vs = fst $ runState (prune d) $ IS.fromList []
    where d = dfs g vs

dff :: Graph -> Forest Vertex
dff g = dfs' g $ reverse (vertices g)

scc :: Graph -> Forest Vertex
scc g = dfs' g $ reverse $ postOrd (transposeG g)

【问题讨论】：

我认为您的问题可以使用更能反映内容的标题。
堆分析可能会很有帮助。 prune 看起来可能有 thunk 泄漏。你的Table Edge 可能是一个未装箱的向量，这应该会给你一些额外的提升。
@jberryman 问题似乎不是内存——它似乎稳定在可用容量的 75% 左右。但是所有 4 个核心都在 100% 上，10 分钟后它还没有完成
您应该包含一些示例数据。我不知道你期望它运行多快，或者它可以运行多快，但如果我知道它现在有多快，我可以尝试改进你的代码。
@user2407038 这段代码至少在 10 分钟内没有完成，尽管它不像官方（基于数组的）实现那样抱怨堆栈溢出。 Andras 在gist.github.com/AndrasKovacs/582808b6b5cc67bc36a2 上发布了一个要点，他声称在 GHC 7.8.2 上完成但在 7.6.3 上溢出，其中包括我尚未学习的语言元素

标签： algorithm haskell optimization

【解决方案1】：

一些小的可能改进：

改变

type Edge = (Vertex, Vertex)

到

data Edge = Edge {-# UNPACK #-} !Vertex {-# UNPACK #-} !Vertex

将每条边的内存使用量从 7 个字重用到 3 个字，并提高缓存局部性。减少内存压力几乎总是可以提高运行时间。正如@jberryman 提到的，可以为Table Edge 使用未装箱的向量（那么您不需要上述自定义数据类型）。

generate :: Graph -> Vertex -> Tree Vertex
generate g v = Node v $ map (generate g) (g V.! v)

如果您确定索引在范围内，则可以使用来自 vector 的不安全索引函数而不是 .!。

contains :: Vertex -> State IS.IntSet Bool
contains v = state $ \visited -> (v `IS.member` visited, visited)

改用get 和put $! 的组合。

include :: Vertex -> State IS.IntSet ()
include v = state $ \visited -> ((), IS.insert v visited)

请改用modify'。

您在程序中使用了很多列表。链表不是内存/缓存效率最高的数据结构。看看你是否可以转换你的代码以使用更多的向量。

【讨论】：