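Unescaping HTML entities (including named ones): answers

[Question title]: Unescaping HTML entities (including named ones)
[Posted]: 2011-07-27 13:47:02
[Question description]:

This question is similar to the earlier Stack Overflow question "Remove html character entities in a string". However, the accepted answer there does not handle named HTML entities such as &auml; for the character ä, so it cannot unescape all HTML.

I have some legacy HTML that uses named HTML entities for non-ASCII characters, i.e. &ouml; instead of ö, &auml; instead of ä, and so on. A full list of all named HTML entities is available on Wikipedia.

I would like to unescape these HTML entities into their character equivalents in a fast and efficient way.

I have code that does this in Python 3 using regular expressions: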

import re
import html.entities

s = re.sub(r'&(\w+?);', lambda m: chr(html.entities.name2codepoint[m.group(1)]), s)

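However, regular expressions do not seem to be very popular, fast, or easy to use in Haskell.

Text.HTML.TagSoup.Entity (tagsoup) has a useful table and functions for mapping named entities to code points. Using this and the regex-tdfa package, I have made a very slow equivalent in Haskell: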

{-# LANGUAGE OverloadedStrings #-}
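-- Qualified imports avoid name clashes with Prelude (reverse, concat, null, ...)
import qualified Data.ByteString.Lazy.Char8 as L
import qualified Data.ByteString.Lazy.UTF8 as UTF8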
import Text.HTML.TagSoup.Entity (lookupEntity)
import Text.Regex.TDFA ((=~~))

unescapeEntites :: L.ByteString -> L.ByteString
unescapeEntites = regexReplaceBy "&#?[[:alnum:]]+;" $ lookupMatch
 where
  lookupMatch m =
    case lookupEntity (L.unpack . L.tail . L.init $ m) of
      Nothing -> m
      Just x -> UTF8.fromString [x]

-- regex replace taken from http://mutelight.org/articles/generating-a-permalink-slug-in-haskell
regexReplaceBy :: L.ByteString -> (L.ByteString -> L.ByteString) -> L.ByteString -> L.ByteString
regexReplaceBy regex f text = go text []
 where
  go str res =
    if L.null str
      then L.concat . reverse $ res
      else
        case (str =~~ regex) :: Maybe (L.ByteString, L.ByteString, L.ByteString) of
          Nothing -> L.concat . reverse $ (str : res)
          Just (bef, match , aft) -> go aft (f match : bef : res)

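The unescapeEntites function runs several orders of magnitude slower than the Python version above. The Python code converts about 130 MB in 7 seconds, whereas my Haskell version runs for several minutes.

I am looking for a better solution, primarily in terms of speed. If possible, I would also like to avoid regular expressions (speed and avoiding regular expressions seem to go hand in hand in Haskell anyway).

[Discussion]:

It is not clear what your actual question is. Are you looking for a better solution? Do you need help improving the current version?
Sorry if the question was unclear. Yes, I want a better solution, because mine is 1. too slow and 2. uses regular expressions, which do not seem very idiomatic in Haskell (judging by how little information there is about them). My solution is mainly meant as a "this is what I have so far" starting point. I am open to radical ideas.
How are you reading the file? If I make main = Data.ByteString.interact unescapeEntites and run time cat big.txt | ./regex >>/dev/null, I get 30 seconds for a 143M big.txt (which is all the entities listed in TagSoup, interspersed with lots of "a"s). All that indirection is still clunky, but it is not minutes.

Tags: html string haskell

[Solution 1]:

Here is my version. It uses String (rather than ByteString).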

import Text.HTML.TagSoup.Entity (lookupEntity)

unescapeEntities :: String -> String
unescapeEntities [] = []
unescapeEntities ('&':xs) = 
  let (b, a) = break (== ';') xs in
  case (lookupEntity b, a) of
    (Just c, ';':as) ->  c  : unescapeEntities as    
    _                -> '&' : unescapeEntities xs
unescapeEntities (x:xs) = x : unescapeEntities xs

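I would guess it is faster, since it does not use expensive regex operations, but I have not tested it. If you need it to be faster still, you can adapt it to ByteString or Data.Text.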
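
As a rough illustration of the Data.Text adaptation mentioned above, here is a minimal sketch of the same scan-and-lookup idea. The names unescapeWith and demoLookup are hypothetical and not part of any library. The lookup is passed in as a parameter so the sketch does not depend on a particular tagsoup version; demoLookup is only a tiny stand-in table, and in real use one would presumably wrap lookupEntity from Text.HTML.TagSoup.Entity instead. It has not been benchmarked.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Builder as B

-- | Replace every "&name;" for which the lookup succeeds and copy
-- everything else through unchanged. (Hypothetical helper.)
unescapeWith :: (T.Text -> Maybe T.Text) -> T.Text -> T.Text
unescapeWith look = TL.toStrict . B.toLazyText . go
  where
    go t =
      -- Copy the run of plain text before the next '&' in one step.
      let (plain, rest) = T.break (== '&') t
      in B.fromText plain <> case T.uncons rest of
           Nothing -> mempty                      -- no '&' left: done
           Just (_, afterAmp) ->
             let (name, tl) = T.break (== ';') afterAmp
             in case (look name, T.uncons tl) of
                  -- Known entity terminated by ';': emit its replacement.
                  (Just r, Just (_, afterSemi)) -> B.fromText r <> go afterSemi
                  -- Otherwise keep the '&' literally and continue after it.
                  _ -> B.singleton '&' <> go afterAmp

-- | Stand-in table for illustration only (an assumption, not tagsoup's table).
demoLookup :: T.Text -> Maybe T.Text
demoLookup name = lookup name [("auml", "ä"), ("ouml", "ö"), ("amp", "&")]
```

For example, unescapeWith demoLookup applied to "K&auml;se &amp; Br&ouml;tchen" should give "Käse & Brötchen", while unknown names such as &unknown; are left untouched. Building the result with a Builder avoids repeated concatenation; note that an input with many stray '&' characters and no ';' makes each search for ';' scan to the end of the input.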


[Solution 2]:

You can install the web-encodings package, take the source of its decodeHtml function, and add the characters you need (this works for me). This is all you need:

import Data.Maybe
import qualified Web.Encodings.StringLike as SL
import Web.Encodings.StringLike (StringLike)
import Data.Char (ord)

-- | Decode HTML-encoded content into plain content.
--
-- Note: this does not support all HTML entities available. It also swallows
-- all failures.
decodeHtml :: StringLike s => s -> s
decodeHtml s = case SL.uncons s of
    Nothing -> SL.empty
    Just ('&', xs) -> fromMaybe ('&' `SL.cons` decodeHtml xs) $ do
        (before, after) <- SL.breakCharMaybe ';' xs
        c <- case SL.unpack before of -- these are small enough that unpack is OK
            "lt" -> return '<'
            "gt" -> return '>'
            "amp" -> return '&'
            "quot" -> return '"'
            '#' : 'x' : hex -> readHexChar hex
            '#' : 'X' : hex -> readHexChar hex
            '#' : dec -> readDecChar dec
            _ -> Nothing -- just to shut up a warning
        return $ c `SL.cons` decodeHtml after
    Just (x, xs) -> x `SL.cons` decodeHtml xs

readHexChar :: String -> Maybe Char
readHexChar s = helper 0 s where
    helper i "" = return $ toEnum i
    helper i (c:cs) = do
        c' <- hexVal c
        helper (i * 16 + c') cs

hexVal :: Char -> Maybe Int
hexVal c
    | '0' <= c && c <= '9' = Just $ ord c - ord '0'
    | 'A' <= c && c <= 'F' = Just $ ord c - ord 'A' + 10
    | 'a' <= c && c <= 'f' = Just $ ord c - ord 'a' + 10
    | otherwise = Nothing

readDecChar :: String -> Maybe Char
readDecChar s = do
    case reads s of
        (i, _):_ -> Just $ toEnum (i :: Int)
        _ -> Nothing

I have not tested the performance, though. Still, it may be a nice example if you want to do this without regular expressions as well.
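
One detail worth checking if you extend the code above: toEnum on Char raises an error for values beyond U+10FFFF, so a reference such as &#99999999999; could crash the decoder rather than being skipped. Below is a minimal sketch of a stricter numeric decoder that uses only base. The names decodeNumeric and safeChr are hypothetical and not part of web-encodings, and rejecting out-of-range values with Nothing is a design choice made for this sketch, not something the package prescribes.

```haskell
import Data.Char (chr, isDigit)
import Numeric (readHex)

-- | Decode the body of a numeric character reference, i.e. the text
-- between '&' and ';' (for example "#228" or "#xE4"). (Hypothetical helper.)
decodeNumeric :: String -> Maybe Char
decodeNumeric ('#' : x : hex)
  | x `elem` "xX" = case readHex hex of
      [(n, "")] -> safeChr n   -- every character was a hex digit
      _         -> Nothing     -- empty or trailing garbage
decodeNumeric ('#' : dec)
  | not (null dec), all isDigit dec = safeChr (read dec)
decodeNumeric _ = Nothing

-- | Convert a code point to a Char only if it is within the Unicode range.
-- Integer is used so that very long digit strings cannot overflow.
safeChr :: Integer -> Maybe Char
safeChr n
  | n >= 0 && n <= 0x10FFFF = Just (chr (fromInteger n))
  | otherwise               = Nothing
```

In decodeHtml, the three '#' cases could then be replaced by a single fallback case that calls decodeNumeric on the unpacked name, with named entities handled before it.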
