Parse string with lex in Haskell

【Question title】: Parse string with lex in Haskell
【Posted】: 2014-01-16 12:07:40
【Question】:

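I am following the Gentle introduction to Haskell tutorial, and the code presented there seems to be broken. I need to understand whether that is really the case, or whether my understanding of the concept is wrong.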

I am implementing a parser for a custom type:

data Tree a = Leaf a | Branch (Tree a) (Tree a)

A printing function, for convenience:

showsTree              :: Show a => Tree a -> String -> String
showsTree (Leaf x)     = shows x
showsTree (Branch l r) = ('<':) . showsTree l . ('|':) . showsTree r . ('>':)

instance Show a => Show (Tree a) where 
    showsPrec _ x = showsTree x

This parser works fine, but breaks as soon as the input contains spaces:

readsTree         :: (Read a) => String -> [(Tree a, String)]
readsTree ('<':s) =  [(Branch l r, u) | (l, '|':t) <- readsTree s,
                                        (r, '>':u) <- readsTree t ]
readsTree s       =  [(Leaf x, t)     | (x,t)      <- reads s]

This one is said to be the better solution, but it does not work when the spaces are missing:

readsTree_lex    :: (Read a) => String -> [(Tree a, String)]
readsTree_lex s  = [(Branch l r, x) | ("<", t) <- lex s,
                                   (l, u)   <- readsTree_lex t,
                                   ("|", v) <- lex u,
                                   (r, w)   <- readsTree_lex v,
                                   (">", x) <- lex w ]
                ++
                [(Leaf x, t)     | (x, t)   <- reads s ]

Next I pick one of the parsers to use with read:

instance Read a => Read (Tree a) where
    readsPrec _ s = readsTree s

Then I load it in ghci using Leksah debug mode (which I guess is irrelevant) and try to parse two strings:

    read "<1|<2|3>>"   :: Tree Int -- succeeds with readsTree
    read "<1| <2|3> >" :: Tree Int -- succeeds with readsTree_lex

When lex reaches the |<2... part of the first string, it splits it as ("|<", _). That does not match the ("|", v) <- lex u part of the parser, so the parse fails.

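Two questions arise:

How do I define a parser that truly ignores spaces rather than requiring them?
How do I define the rules by which lex splits the literals it encounters?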

The second question is asked more out of curiosity, since defining my own lexer seems more correct than defining rules for an existing one.

【Comments】:

Tags: parsing haskell

【Answer 1】:

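lex splits its input into Haskell lexemes, skipping whitespace.

This means that, since Haskell permits |< as a lexeme, lex will not split it into two lexemes, because that is not how it would be parsed in Haskell.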
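The behaviour described above can be checked directly in ghci. The sketch below feeds a few fragments of the question's inputs to the Prelude's lex; the name lexSamples is only an illustrative helper, not something from the tutorial.

```haskell
-- Each input is paired with the result of the Prelude's lex.
-- lexSamples is a hypothetical name used only for this illustration.
lexSamples :: [(String, [(String, String)])]
lexSamples = [ (s, lex s) | s <- ["|<2|3>>", "| <2|3> >", ">>", "   >"] ]

-- lex "|<2|3>>"   gives [("|<", "2|3>>")]    "|<" stays a single lexeme
-- lex "| <2|3> >" gives [("|", " <2|3> >")]  the space ends the lexeme
-- lex ">>"        gives [(">>", "")]         two closing brackets fuse as well
-- lex "   >"      gives [(">", "")]          leading whitespace is skipped
```

Any run of adjacent symbol characters is returned as one lexeme, which is why the compact form of the tree syntax never yields a lone "|" or ">" at those positions.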

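You can only use lex in your parser if you are using the same (or similar) syntactic rules as Haskell.

If you want to ignore all whitespace (as opposed to treating any run of whitespace as equivalent to a single space), it is simpler and more efficient to first run filter (not . isSpace) over the input; isSpace comes from Data.Char.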
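A minimal sketch of that suggestion, assuming the character-level readsTree from the question is reused unchanged. The wrapper name readsTreeNoSpace is hypothetical, and Eq and Show are derived here only to keep the example short (the question defines its own Show instance).

```haskell
import Data.Char (isSpace)

data Tree a = Leaf a | Branch (Tree a) (Tree a)
  deriving (Eq, Show)  -- derived for brevity; an assumption of this sketch

-- The character-level parser from the question.
readsTree :: Read a => ReadS (Tree a)
readsTree ('<':s) = [ (Branch l r, u) | (l, '|':t) <- readsTree s
                                      , (r, '>':u) <- readsTree t ]
readsTree s       = [ (Leaf x, t) | (x, t) <- reads s ]

-- Hypothetical wrapper: strip every whitespace character, then parse.
readsTreeNoSpace :: Read a => ReadS (Tree a)
readsTreeNoSpace = readsTree . filter (not . isSpace)
```

With this, readsTreeNoSpace "<1| <2|3> >" and readsTreeNoSpace "<1|<2|3>>" both go through the same path. One caveat of this design: the filter also removes whitespace inside leaf values, so it only suits leaf types whose textual form never needs a space (for a Tree Int, "1 2" would be read as 12).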

【Comments】:

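That is what I was unsure about. I still don't understand why this is given as an example in the book. Maybe the lex implementation has changed over the last ten years.
It does say that it follows Haskell's lexical rules; perhaps they assumed you would not give it "<3|<4,5>>", since Haskell needs a space between | and < to tell them apart from a potential |< operator. This should be made very explicit in the text, though, I agree.
I have (seemingly) found the answer in the book's code base. The code base was not included in my (translated) edition, which is why I had not noticed it before. Thank you for your effort.
Using filter seems to be the best solution, since neither of my variants can strip junk from the input.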

【Answer 2】:

The answer to this question seems to be a small gap between the text of Gentle introduction to Haskell and its code samples, plus an error in the sample code.

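There should also be one more lexer, but the code base has no working example (that satisfies my needs), so I wrote one. Please point out any flaws in it: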

lexAll :: ReadS String
lexAll s = case lex s of
            [("",_)] -> []                                  -- nothing to parse.
            [(c, r)] -> if length c == 1 then [(c, r)]      -- we will try to match
                           else [(c, r), ([head s], tail s)]-- not only as it was 
            any_else -> any_else                            -- parsed but also split
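One possible flaw, offered as an observation rather than a definitive review: lex skips leading whitespace, but [head s] and tail s are taken from the unlexed input. For an input such as " |<2|3>>" the extra alternative is therefore (" ", "|<2|3>>") rather than ("|", "<2|3>>"), so "<1 |<2|3>>" would still fail to parse. A sketch of a variant that splits the first character off the lexeme itself follows; the name lexAllWs is hypothetical.

```haskell
-- Hypothetical variant of lexAll. When lex returns a lexeme of two or more
-- characters, it also offers the lexeme's first character on its own, with
-- the rest of the lexeme pushed back in front of the remaining input.
-- Splitting the lexeme (not the raw input) means that any whitespace lex
-- skipped is not mistaken for the first character.
lexAllWs :: ReadS String
lexAllWs s = case lex s of
  [("", _)]                 -> []                         -- nothing to parse
  [(c@(h : rest@(_ : _)), r)] -> [(c, r), ([h], rest ++ r)] -- whole lexeme, then its first char
  other                     -> other                      -- single-char lexeme, failure, or ambiguity
```

For inputs without leading whitespace this is intended to behave like lexAll; it differs only when whitespace precedes a multi-character lexeme.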

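The author says:

Finally, the complete reader. This is not sensitive to white space as were the previous versions. When you derive the Show class for a data type the reader generated automatically is similar to this in style.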

However, lexAll should be used instead of lex (this seems to be the error mentioned above):

readsTree' :: (Read a) => ReadS (Tree a)
readsTree' s = [(Branch l r, x) | ("<", t) <- lexAll s,
                                  (l, u)   <- readsTree' t,
                                  ("|", v) <- lexAll u,
                                  (r, w)   <- readsTree' v,
                                  (">", x) <- lexAll w ]
                ++
                [(Leaf x, t)     | (x, t)   <- reads s]
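To try the pieces together, the sketch below assembles the answer's lexer and reader into one file and wires them into a Read instance. It assumes the Tree type from the question; Eq and Show are derived only to keep the example short, take 1 and drop 1 stand in for head and tail, and the names compact and spaced are illustrative.

```haskell
data Tree a = Leaf a | Branch (Tree a) (Tree a)
  deriving (Eq, Show)  -- derived for brevity; the question writes its own Show

-- lexAll as in the answer, with take/drop standing in for head/tail.
lexAll :: ReadS String
lexAll s = case lex s of
  [("", _)] -> []                                   -- nothing to parse
  [(c, r)]  -> if length c == 1
                 then [(c, r)]                      -- single-character lexeme
                 else [(c, r), (take 1 s, drop 1 s)] -- also offer the first input char
  other     -> other

readsTree' :: Read a => ReadS (Tree a)
readsTree' s = [ (Branch l r, x) | ("<", t) <- lexAll s
                                 , (l, u)   <- readsTree' t
                                 , ("|", v) <- lexAll u
                                 , (r, w)   <- readsTree' v
                                 , (">", x) <- lexAll w ]
            ++ [ (Leaf x, t) | (x, t) <- reads s ]

instance Read a => Read (Tree a) where
  readsPrec _ = readsTree'

-- The two inputs from the question, now read by the same parser.
compact, spaced :: Tree Int
compact = read "<1|<2|3>>"
spaced  = read "<1| <2|3> >"
```

Both values should come out as Branch (Leaf 1) (Branch (Leaf 2) (Leaf 3)), which is the behaviour the question was after: the reader accepts the input with or without the spaces.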

【Comments】: