What is the best way to match html text that is outside of an html tag in Go? Answers

【Question title】: What is the best way to match html text that is outside of an html tag in Go?
【Posted】: 2020-04-07 21:47:47
【Question description】:

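I have a bunch of html to parse, and I need to remove certain <a> tags if they contain certain text. Normally I would use Goquery, but the text I am searching for is often not inside the html tag itself. For example, this html: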

<html><body>
This is the start.            
<a href="http://example.com/path">We don't want to match this text.</a>
<a href="http://www.example.com/another/path" style="font-family:Arial, Helvetica, 'sans-serif'; color:#838383;font-size:12px; line-height:14px"></a> match this text.<a href="blah">We also don't want to match this text</a>
</body></html>

I am using this regex, but it fails and matches text I don't want to match:

(?is)<a[^>]+href=["'](?P<link>.*?)["']*.?> match this text\.

https://regex101.com/r/iEXpqc/1

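【Comments】:

. matches any character. Really, you should still consider an HTML parser. If you want to use regex, consider a workaround that uses negated character classes; see an example.
Yes, I thought so too, but I couldn't figure out how to do it with Goquery. The posted example matches the wrong text, by the way.
Yes, and in any case it is not very clear what the matching criteria are.
Have you considered an XPath package? XPath can be a bit scary, but it does support looking at text nodes.
See: stackoverflow.com/questions/1732348/…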
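
As a rough illustration of the negated-character-class workaround suggested in the comments, here is a minimal Go sketch. It assumes the target is an empty <a> element followed directly by the literal text "match this text."; the names anchorBeforeText, findLink, and sampleHTML are made up for this example. Replacing `.*?` with `[^"']*` and `[^>]*` keeps the match from running across tag boundaries, though a regex is still fragile on real-world HTML.

```go
package main

import (
	"fmt"
	"regexp"
)

// sampleHTML is a shortened copy of the html from the question.
const sampleHTML = `<html><body>
This is the start.
<a href="http://example.com/path">We don't want to match this text.</a>
<a href="http://www.example.com/another/path" style="color:#838383"></a> match this text.<a href="blah">We also don't want to match this text</a>
</body></html>`

// anchorBeforeText matches an <a> element with no content of its own that is
// followed (after optional whitespace) by the literal text "match this text.".
// The negated classes [^"'] and [^>] cannot cross a quote or a tag boundary,
// so the match cannot start in one tag and end in another.
var anchorBeforeText = regexp.MustCompile(
	`(?is)<a\s[^>]*?href=["'](?P<link>[^"']*)["'][^>]*>\s*</a>\s*match this text\.`)

// findLink returns the href of the first anchor matched by anchorBeforeText.
func findLink(doc string) (string, bool) {
	m := anchorBeforeText.FindStringSubmatch(doc)
	if m == nil {
		return "", false
	}
	return m[1], true // m[1] is the "link" capture group
}

func main() {
	link, ok := findLink(sampleHTML)
	fmt.Println(link, ok)
}
```

To delete the matched anchor instead of reading its link, the same pattern could be passed to ReplaceAllString, keeping the trailing text in a second capture group.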

Tags: html regex go

【Solution 1】:

Like this, using xpath (not Go, but the logic can be re-implemented):

xmlstarlet ed -d '//a[contains(text(), "want to match")]' file.html

Output

<?xml version="1.0"?>
<html>
  <body>
This is the start.  

<a href="http://www.example.com/another/path" style="font-family:Arial, Helvetica, 'sans-serif'; color:#838383;font-size:12px; line-height:14px"/> match this text.
</body>
</html>

Note

Add the -L switch if you want to edit the file in place
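
The logic of the xmlstarlet command can be approximated in Go with the standard library's encoding/xml decoder in non-strict mode. The sketch below is an illustration under assumptions, not a port of xmlstarlet: removeAnchors is a hypothetical helper, it assumes the markup is well-formed enough for encoding/xml to tokenize, and it checks all text inside each <a> rather than only the first text node as the XPath contains(text(), ...) does. It cuts matching elements out of the original string by byte offset, so the rest of the markup is left untouched.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// removeAnchors returns doc with every <a> element whose text content
// contains needle removed. Hypothetical helper for illustration only.
func removeAnchors(doc, needle string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	// Settings recommended by the encoding/xml docs for parsing HTML.
	dec.Strict = false
	dec.AutoClose = xml.HTMLAutoClose
	dec.Entity = xml.HTMLEntity

	var out, text strings.Builder
	kept := 0   // doc[:kept] has already been copied or dropped
	start := -1 // byte offset of the currently open <a>, or -1 if none

	for {
		before := int(dec.InputOffset()) // where the next token begins
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if start < 0 && strings.EqualFold(t.Name.Local, "a") {
				start = before
				text.Reset()
			}
		case xml.CharData:
			if start >= 0 {
				text.Write(t) // collect text inside the open <a>
			}
		case xml.EndElement:
			if start >= 0 && strings.EqualFold(t.Name.Local, "a") {
				end := int(dec.InputOffset()) // just past </a>
				if strings.Contains(text.String(), needle) {
					out.WriteString(doc[kept:start]) // keep what came before
					kept = end                       // skip the element itself
				}
				start = -1
			}
		}
	}
	out.WriteString(doc[kept:])
	return out.String(), nil
}

func main() {
	doc := `<body><a href="a">We don't want to match this text.</a>
<a href="b"></a> match this text.</body>`
	cleaned, err := removeAnchors(doc, "want to match")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(cleaned)
}
```

For messy real-world HTML, a dedicated HTML parser is likely a safer choice than encoding/xml.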

【Discussion】: