How to find/replace text in HTML while preserving HTML tags/structure: answers

【Question title】: How to find/replace text in html while preserving html tags/structure
【Posted】: 2010-12-23 18:47:59
【Question】:

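I use regular expressions to transform the text I want, but I want to preserve the HTML tags. For example, if I want to replace "stack overflow" with "stack underflow", this should work as expected: if the input is stack <sometag>overflow</sometag>, I must get stack <sometag>underflow</sometag> (i.e. the string replacement is done, but the tags are still there...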

【Comments】:

Read this and repent before it is too late: stackoverflow.com/questions/1732348/…

Tags: python html html-parsing

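【Solution 1】:

When dealing with HTML, use a DOM library, not regular expressions:

lxml: a parser, document, and HTML serializer. It can also use BeautifulSoup and html5lib for parsing.
BeautifulSoup: a parser, document, and HTML serializer.
html5lib: a parser. It has a serializer.
ElementTree: a document object and XML serializer.
cElementTree: a document object implemented as a C extension.
HTMLParser: a parser.
Genshi: includes a parser, document, and HTML serializer.
xml.dom.minidom: a document model built into the standard library, which html5lib can parse to.

Stolen from http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/.

Of these, I recommend lxml, html5lib, and BeautifulSoup.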
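
To illustrate the DOM approach, here is a minimal sketch that is not part of the original answer. It uses xml.dom.minidom from the list above only because it ships with Python, and it assumes the input is a well-formed XML fragment; real-world HTML would need one of the recommended parsers instead. The helper name replace_in_text_nodes is hypothetical.

```python
from xml.dom import minidom


def replace_in_text_nodes(node, old, new):
    """Recursively replace `old` with `new` inside text nodes only."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            # Only character data is touched; element names and attributes are not.
            child.data = child.data.replace(old, new)
        else:
            replace_in_text_nodes(child, old, new)


# Assumption: the fragment is well-formed, so an XML parser accepts it.
doc = minidom.parseString("<p>stack <sometag>overflow</sometag></p>")
replace_in_text_nodes(doc.documentElement, "overflow", "underflow")
result = doc.documentElement.toxml()
print(result)
```

Because the replacement runs per text node, it cannot see a phrase that is split across several nodes; it only rewrites text that sits entirely inside one node.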

【Discussion】:

【Solution 2】:

Beautiful Soup or HTMLParser is your answer.
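
A rough sketch of what the HTMLParser route could look like, assuming Python 3's standard html.parser module. The class name TextReplacer and the function replace_text are hypothetical, and the code only handles a match that lies inside a single run of text; it is an illustration, not the answerer's code.

```python
from html.parser import HTMLParser


class TextReplacer(HTMLParser):
    """Re-emit markup unchanged and rewrite only the text between tags."""

    def __init__(self, old, new):
        # convert_charrefs=False keeps entities such as &amp; as separate events.
        super().__init__(convert_charrefs=False)
        self.old, self.new = old, new
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # original tag source

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data.replace(self.old, self.new))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def replace_text(html, old, new):
    parser = TextReplacer(old, new)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


result = replace_text(
    '<p class="overflow">stack <b>overflow</b></p>', "overflow", "underflow"
)
print(result)
```

The attribute value class="overflow" is left alone because attributes never reach handle_data. Note that the contents of script and style elements are also delivered to handle_data, so a real implementation would probably want to skip them.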

【Discussion】:

【Solution 3】:

Note that arbitrary replacements cannot be performed unambiguously. Consider the following examples:

1)

HTML:

A<tag>B</tag>

Pattern -> replacement:

AB -> AXB

Possible results:

AX<tag>B</tag>
A<tag>XB</tag>

2)

HTML:

A<tag>A</tag>A

Pattern -> replacement:

A+ -> WXYZ

Possible results:

W<tag />XYZ
W<tag>X</tag>YZ
W<tag>XY</tag>Z
W<tag>XYZ</tag>
WX<tag />YZ
WX<tag>Y</tag>Z
WX<tag>YZ</tag>
WXY<tag />Z
WXY<tag>Z</tag>
WXYZ

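Which algorithm suits your case depends heavily on the nature of the possible search patterns and on the rules you want for resolving the ambiguity.

【Discussion】:

【Solution 4】:

Use the HTML parser provided by lxml or BeautifulSoup. Another option is to use an XSLT transformation (XSLT in Jython).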
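
To make the ambiguity in example 2 concrete, the following sketch (my own illustration, with hypothetical function names) enumerates every way of cutting the replacement WXYZ into three consecutive pieces: text before the tag, text inside it, and text after it. It assumes the first piece must be non-empty, which is the pattern the ten results listed above follow.

```python
def split_candidates(replacement):
    """Return every (before, inside, after) split with a non-empty first piece."""
    n = len(replacement)
    return [
        (replacement[:i], replacement[i:j], replacement[j:])
        for i in range(1, n + 1)      # end of the piece before the tag
        for j in range(i, n + 1)      # end of the piece inside the tag
    ]


def render(before, inside, after, tag="tag"):
    """Serialize one split; an empty element is written as <tag />."""
    middle = f"<{tag}>{inside}</{tag}>" if inside else f"<{tag} />"
    return before + middle + after


candidates = [render(*parts) for parts in split_candidates("WXYZ")]
for candidate in candidates:
    print(candidate)
print(len(candidates), "candidates")
```

The sketch always keeps the element, so the split that leaves nothing inside or after the tag is written as WXYZ<tag />, where the list above shows plain WXYZ. Whether an emptied element should be kept or dropped is one more rule that has to be chosen.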

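【Discussion】:

【Solution 5】:

I don't think the DOM / HTML parser library suggestions posted so far address the specific problem in the given example: overflow should be replaced with underflow only when it is preceded by stack in the rendered document, whether or not there are tags between them. Such a library is a necessary part of the solution, though.

Assuming that tags never appear in the middle of a word, one solution is to:

1. Process the DOM: tokenize all text nodes and insert a unique identifier at the beginning of each token (e.g. word).
2. Render the document as plain text.
3. Search and replace in the plain text with regular expressions that use groups to match, preserve, and mark the unique identifier at the beginning of each token.
4. Extract all tokens with a marked unique identifier from the plain text.
5. Process the DOM by removing the unique identifiers and replacing each token whose unique identifier was marked with the corresponding changed token.
6. Render the processed DOM back to HTML.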

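Example:

In 1., the HTML DOM

stack <sometag>overflow</sometag>

becomes the DOM

#1;stack <sometag>#2;overflow</sometag>

In 2., the plain text is generated:

#1;stack #2;overflow

The regular expression needed in 3. is #(\d+);stack\s+#(\d+);overflow\b with the replacement #\1;stack %\2;underflow. Note that only the second word is marked, by changing the # in its unique identifier to %, because the first word is unchanged.

In 4., the word underflow with the unique identifier number 2 is extracted from the resulting plain text, because it was marked by changing the # to %.

In 5., all #(\d+); identifiers are removed from the text nodes of the DOM, and their numbers are looked up among the extracted words. Number 1 is not found, so #1;stack is simply replaced with stack. Number 2 is found together with the changed word underflow, so #2;overflow is replaced with underflow.

Finally, in 6., the DOM is rendered back to the HTML document stack <sometag>underflow</sometag>.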
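
The six steps can be sketched in Python roughly as follows. This is an illustration under several assumptions rather than the answerer's own code: a flat list of segments produced by the standard html.parser module stands in for a real DOM, tokens are whitespace-separated words, the text itself never contains #n; or %n; sequences, and entities are treated as markup. The names Segmenter and replace_across_tags are hypothetical.

```python
import re
from html.parser import HTMLParser


class Segmenter(HTMLParser):
    """Split HTML into ("markup", raw) and ("text", data) segments."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.segments = []

    def handle_starttag(self, tag, attrs):
        self.segments.append(("markup", self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        self.segments.append(("markup", self.get_starttag_text()))

    def handle_endtag(self, tag):
        self.segments.append(("markup", f"</{tag}>"))

    def handle_entityref(self, name):
        self.segments.append(("markup", f"&{name};"))

    def handle_charref(self, name):
        self.segments.append(("markup", f"&#{name};"))

    def handle_data(self, data):
        self.segments.append(("text", data))


def replace_across_tags(html, pattern, replacement):
    # Step 1: parse, then prefix every word in every text segment with "#n;".
    parser = Segmenter()
    parser.feed(html)
    parser.close()
    counter = 0

    def mark(match):
        nonlocal counter
        counter += 1
        return f"#{counter};{match.group(0)}"

    segments = [
        (kind, re.sub(r"\S+", mark, value) if kind == "text" else value)
        for kind, value in parser.segments
    ]

    # Step 2: render the plain text by dropping all markup.
    plain = "".join(value for kind, value in segments if kind == "text")

    # Step 3: regex replace; the replacement marks changed words with "%n;".
    plain = re.sub(pattern, replacement, plain)

    # Step 4: collect the changed words, keyed by identifier number.
    changed = dict(re.findall(r"%(\d+);(\S+)", plain))

    # Step 5: strip identifiers, swapping in a changed word where one exists.
    def unmark(match):
        return changed.get(match.group(1), match.group(2))

    # Step 6: serialize the segments back to HTML.
    return "".join(
        re.sub(r"#(\d+);(\S+)", unmark, value) if kind == "text" else value
        for kind, value in segments
    )


result = replace_across_tags(
    "stack <sometag>overflow</sometag>",
    r"#(\d+);stack\s+#(\d+);overflow\b",
    r"#\1;stack %\2;underflow",
)
print(result)
```

The pattern and replacement passed in are the ones from step 3 of the example. A replacement that changes the number of words would need extra rules for deciding which node receives the added or removed tokens.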

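【Discussion】:

【Solution 6】:

Try something fun. It sort of works. My friends loved it when I attached this script to a textarea and let them "translate" the content. I guess you could really use it for anything. Hmm. If you are going to use it, check the code over a few times; it works, but I am very new to all of this. I think it has been two or three weeks since I started learning PHP.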


<?php

$html = ('<div style="border: groove 2px;"><p>Dear so and so, after reviewing your application I. . .</p><p>More of the same...</p><p>sincerely,</p><p>Important Dude</p></div>');

$oldWords = array('important', 'sincerely');

$newWords = array('arrogant', 'ya sure');

// function for oldWords
function regex_oldWords_word_list(&$item1, $key)
{

    $item1 = "/>([^<>]+)?\b$item1(tionally|istic|tion|ance|ence|less|ally|able|ness|ing|ity|ful|ant|est|ist|ic|al|ed|er|et|ly|y|s|d|'s|'d|'ve|'ll)?\b([^<>]+)?/";

}

// function for newWords
function format_newWords_results(&$item1, $key)
{

    $item1 = ">$1<span style=\"color: red;\"><em> $item1$2</em></span>$3";

}

// apply regex to oldWords
array_walk($oldWords, 'regex_oldWords_word_list');

// apply formatting to newWords
array_walk($newWords, 'format_newWords_results');

//HTML is not always as perfect as we want it
$poo = array('/  /', '/>([a-zA-Z\']+)/', '/’/', '/;([a-zA-Z\']+)/', '/"([a-zA-Z\']+)/', '/([a-zA-Z\']+)</', '/\.\.+/', '/\. \.+/');

$unpoo = array(' ', '> $1', '\'', ';  $1', '"  $1', '$1  <', '. crap taco.', '. crap taco with cheese.');

//and maybe things will go back to normal sort of
$repoo = array('/>  /', '/;  /', '/"  /', '/  </');

$muck = array('> ', ';', '"',' <');

//before
echo ($html);

//I don't know what was happening on the free host but I had to keep stripping slashes
//This is where the work is done anyway.
$html = stripslashes(preg_replace($repoo , $muck , (ucwords(preg_replace($oldWords , $newWords , (preg_replace($poo , $unpoo , (stripslashes(strtolower(stripslashes($html)))))))))));

//after
echo ('<hr/> ' . $html);

//now if only there were a way to keep it out of the area between
//<style>here</style> and <script>here</script> and tell it that english isn't math.

?>

【Discussion】: