Why is XPath returning text outside the HTML tags?

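【Question title】: Why is XPath returning text outside the HTML tags?
【Posted】: 2017-01-04 08:33:37
【Question】:

I am working with a document that has some text outside the <html> tag. When I read the data in the body, it also returns text that is not inside the html tag at all.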

page_text = Nokogiri::HTML(open(file_path)).xpath("//body").text
p page_text

Output:

"WARC/1.0\nWARC-Type: response\nWARC-Date: 2012-02-11T04:48:01Z\nWARC-TREC-ID: clueweb12-0000tw-13-04988\nWARC-IP-Address: 184.85.26.15\nWARC-Payload-Digest: sha1:PNCB5NNAA766RLLISZ6ODV3FJZBCATKR\nWARC-Target-URI: http://www.allchocolate.com/health/basics/\nWARC-Record-ID: \nContent-Type: application/http; msgtype=response\nContent-Length: 14577\n\n\n\n\n sample document\n\n\n hello world\n\n"

Document:

WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-11T04:48:01Z
WARC-TREC-ID: clueweb12-0000tw-13-04988
WARC-IP-Address: 184.85.26.15
WARC-Payload-Digest: sha1:PNCB5NNAA766RLLISZ6ODV3FJZBCATKR
WARC-Target-URI: http://www.allchocolate.com/health/basics/
WARC-Record-ID: <urn:uuid:ff32c863-5066-4f51-802a-f31d4af074d5>
Content-Type: application/http; msgtype=response
Content-Length: 14577

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <title>sample document</title>
</head>
<body>
    hello world
</body>
</html>

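【Comments】:

Please read "minimal reproducible example". Don't use images to show us your expected output. Links rot and then break, and when that happens your question stops making sense. Instead, copy/paste the information into your question and format it properly for readability. You need to strip the non-HTML from the source before passing it to Nokogiri. It doesn't know what the header information is, so you're just confusing it.
@theTinMan Thanks for the suggestion, I have edited my question.

Tags: html ruby parsing xpath nokogiri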

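【Answer 1】:

Evidently the leading text is a problem, but the trailing text is not. XML is a highly structured language, and applying an XML parser to HTML means, at the very least, that you need valid HTML. If you don't have valid HTML, you get whatever Nokogiri spits out.

It looks to me like Nokogiri wraps the whole thing in a default root node and then returns all the text nodes inside it, essentially ignoring the //body xpath. Interestingly, if you wrap the text in a div and search for the xpath //div, there is no problem, so that might suggest a solution.

Nokogiri seems to think //body is the same as the root node. Ah! Maybe Nokogiri uses <body> as the root node. Nope: the xpath /body//body doesn't work.

Response to comment:

You could use a regex to search for the <body> tag and then insert a div tag. But searching HTML with a simple regex is a fragile solution, and it won't work in all cases.

By the way, you can see how Nokogiri treats text outside of tags by parsing a document that contains only the text hello world, and then printing out all the nodes that Nokogiri finds: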

require 'nokogiri'

nodes = Nokogiri::HTML(open('html.html')).xpath('//*')

nodes.each do |node|
  puts node.name
end

--output:--
html
body
p

So Nokogiri wraps the text in three tags.

Or, even better, you can parse the document and print it out as html:

require 'nokogiri'

doc = Nokogiri::HTML(open('./html.html'))
puts doc.to_html

--output:--
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html><body><p>WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-11T04:48:01Z
WARC-TREC-ID: clueweb12-0000tw-13-04988
WARC-IP-Address: 184.85.26.15
WARC-Payload-Digest: sha1:PNCB5NNAA766RLLISZ6ODV3FJZBCATKR
WARC-Target-URI: http://www.allchocolate.com/health/basics/
WARC-Record-ID: <uuid:ff32c863-5066-4f51-802a-f31d4af074d5>
Content-Type: application/http; msgtype=response
Content-Length: 14577




    <title>sample document</title>


    hello world


</uuid:ff32c863-5066-4f51-802a-f31d4af074d5></p></body></html>

That means you can get hello world like this:

require 'nokogiri'

doc = Nokogiri::HTML(open('./html.html'))
title = doc.at_xpath('//title')
puts title.next.text.strip

--output:--
hello world

Another way is to strip off the non-html content before parsing with Nokogiri:

require 'nokogiri'

infile = File.open('html.html')
non_html = infile.gets("\n\n")  # Read the header block, up to and including the first blank line
html = infile.gets(nil)         # Slurp the rest of the file

doc = Nokogiri::HTML(html)
puts doc.at_xpath('//body').text.strip

--output:--
hello world

This assumes there is always a blank line separating the non-html content from the html content.
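One caveat with that assumption: if the file uses \r\n line endings, the "\n\n" separator never matches. A minimal sketch of a more tolerant split follows. The helper name split_header_and_body and the sample record are made up for illustration, and the body would still need to be handed to Nokogiri::HTML afterwards.

```ruby
# Hypothetical helper: split a record into [header, body] at the first blank line.
# Accepts both LF and CRLF line endings; returns ["", raw] when there is no blank line.
def split_header_and_body(raw)
  header, body = raw.split(/\r?\n\r?\n/, 2)
  body.nil? ? ["", raw] : [header, body]
end

# Made-up sample record with CRLF line endings.
raw = "WARC/1.0\r\nContent-Length: 28\r\n\r\n<html><body>hi</body></html>"
header, body = split_header_and_body(raw)
puts body
```

Returning the whole input as the body when no blank line exists keeps the caller from silently losing the markup.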

【Comments】:

So what is the solution?
See the addition to my answer.

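【Answer 2】:

Nokogiri is trying to parse the contents of the file as an HTML document, but it is not a valid document. It is a text document that happens to contain an HTML document. Of course Nokogiri doesn't know this, and it can't identify the HTML part on its own, so it tries to parse the whole thing. Since that isn't valid HTML, parsing produces errors.

While parsing, Nokogiri does its best to fix these errors, but that doesn't work in this case and leads to the strange output you are seeing here.

In particular, when Nokogiri sees the text before the HTML, it assumes that text should be part of the body of the HTML document. So it creates html and body elements, injects them into the document, and then adds the text as a child of that body.

Later it sees the actual <body> tag, but since it knows it already has a body element, and there can only be one, it ignores the tag.

You need to make sure you only provide valid HTML (or as close to valid as possible, since error correction can fix small problems). You will probably need to preprocess the file in some way to remove the extra text at the beginning.

【Comments】: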

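【Answer 3】:

First, @7stud's answer says you can split the file on \n\n, but in my document collection there is not always a \n\n right before the actual html code.

So, using the same idea, I came up with another workaround: use a regex to remove all the text before the opening html tag, then pass the result to Nokogiri for parsing.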

file = File.read(file_path)
# Drop everything before the first <html tag (/i ignores case, /m lets . match newlines)
file = file.sub(/.*?(?=<html)/im,"")
page = Nokogiri::HTML(file)

Now everything works fine.
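A small sketch of why cutting at the tag can succeed where the blank-line split does not. The sample record, with two header blocks before the markup, is invented for illustration and is not taken from the collection.

```ruby
# Made-up record with two header blocks (WARC, then HTTP) before the markup.
raw = "WARC/1.0\nWARC-Type: response\n\nHTTP/1.1 200 OK\nContent-Type: text/html\n\n<HTML>\n<body>hello world</body>\n</HTML>\n"

# Splitting on the first blank line leaves the HTTP headers in place.
after_split = raw.split("\n\n", 2).last

# Cutting at the first <html tag removes both header blocks.
after_regex = raw.sub(/.*?(?=<html)/im, "")

puts after_split.start_with?("HTTP/1.1")  # the split kept the HTTP headers
puts after_regex.start_with?("<HTML>")    # the regex cut right up to the tag
```

Note that this regex also drops any DOCTYPE line that precedes the html tag, and it leaves the string unchanged when no html tag is present.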

【Comments】:

【Answer 4】:

Preprocessing the content before passing it to Nokogiri is simple:

require 'nokogiri'

text = '
WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-11T04:48:01Z
WARC-TREC-ID: clueweb12-0000tw-13-04988
WARC-IP-Address: 184.85.26.15
WARC-Payload-Digest: sha1:PNCB5NNAA766RLLISZ6ODV3FJZBCATKR
WARC-Target-URI: http://www.allchocolate.com/health/basics/
WARC-Record-ID: <urn:uuid:ff32c863-5066-4f51-802a-f31d4af074d5>
Content-Type: application/http; msgtype=response
Content-Length: 14577

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <title>sample document</title>
</head>
<body>
    hello world
</body>
</html>
'

doc = Nokogiri::HTML(text[/<!DOCTYPE.+/m])
doc.to_html # => "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n    <title>sample document</title>\n</head>\n<body>\n    hello world\n</body>\n</html>\n"

The trick is:

text[/<!DOCTYPE.+/m]

It tells Ruby to look through the text and return everything from <!DOCTYPE to the end of the string, which is valid HTML.
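That slice is case-sensitive and returns nil when the text contains no <!DOCTYPE. A hedged sketch of a variant with a fallback to the opening html tag is below. The helper name html_part is hypothetical, and what to do with a nil result is left to the caller.

```ruby
# Hypothetical helper: return the markup portion of a record, or nil if none is found.
def html_part(text)
  # Prefer the DOCTYPE; fall back to the opening html tag.
  # /i ignores case, /m lets . match newlines so the slice runs to the end of the string.
  text[/<!DOCTYPE.+/mi] || text[/<html.+/mi]
end

puts html_part("Content-Length: 10\n\n<!doctype html>\n<html></html>")
p html_part("headers only, no markup")
```

Checking for nil before calling Nokogiri::HTML makes records without any markup easy to skip or log.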

【Comments】: