How to select only leaf nodes with Nokogiri? Answers

【Question title】: How to select only leaf nodes with Nokogiri?
【Posted】: 2013-07-26 20:00:12
【Question】:

I'm looking for some advice on how to accomplish this. I'm trying for a solution that uses only XPath:

An HTML example:

<div>
  <div>
    <div>text div (leaf)</div>
    <p>text paragraph (leaf)</p>
  </div>
</div>
<p>text paragraph 2 (leaf)</p>

Code:

doc = Nokogiri::HTML.fragment("- the html above -")
result = doc.xpath("*[not(child::*)]")


[#<Nokogiri::XML::Element:0x3febf50f9328 name="p" children=[#<Nokogiri::XML::Text:0x3febf519b718 "text paragraph 2 (leaf)">]>]

But this XPath only gives me the last "p". What I want is a flattening behavior that returns only the leaf nodes.

Here are some related answers on Stack Overflow:

How to select all leaf nodes using XPath expression?

XPath - Get node with no child of specific type

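Thanks

【Comments】:

What values do you want?
All the nodes with (leaf) in their text
@Luccas: Do you want only the text, or do you want the containing elements? That is, do you want text paragraph (leaf) or <p>text paragraph (leaf)</p>? If you only want the text, do you want all the text nodes separately, or do you just want all the text concatenated into one string?
Your attempt failed because you used xpath('*…') instead of xpath('.//*…'); see this bug report and this one.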

Tags: ruby xpath nokogiri

【Solution 1】:

You can find all element nodes that have no child elements with:

//*[not(*)]

Example:

require 'nokogiri'

doc = Nokogiri::HTML.parse <<-end
<div>
  <div>
    <div>text div (leaf)</div>
    <p>text paragraph (leaf)</p>
  </div>
</div>
<p>text paragraph 2 (leaf)</p>
end

puts doc.xpath('//*[not(*)]').length
#=> 3

doc.xpath('//*[not(*)]').each do |e|
    puts e.text
end
#=> "text div (leaf)"
#=> "text paragraph (leaf)"
#=> "text paragraph 2 (leaf)"

【Comments】:

【Solution 2】:

The problem with your code is this statement:

doc = Nokogiri::HTML.fragment("- the html above -")

Look here:

require 'nokogiri'

html = <<END_OF_HTML
<div>
  <div>
    <div>text div (leaf)</div>
    <p>text paragraph (leaf)</p>
  </div>
</div>
<p>text paragraph 2 (leaf)</p>
END_OF_HTML


doc = Nokogiri::HTML(html)
#doc = Nokogiri::HTML.fragment(html)
results = doc.xpath("//*[not(child::*)]")
results.each {|result| puts result}

--output:--
<div>text div (leaf)</div>
<p>text paragraph (leaf)</p>
<p>text paragraph 2 (leaf)</p>

If I run this instead:

doc = Nokogiri::HTML.fragment(html)
results = doc.xpath("//*[not(child::*)]")
results.each {|result| puts result}

I get no output.

【Comments】:

See github.com/sparklemotion/nokogiri/issues/213 and github.com/sparklemotion/nokogiri/issues/572

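【Solution 3】:

In XPath, text is a node in its own right. So, going by your comment, you only want to select the tag contents, not the tags containing that content. With an element-based expression you would also capture a <br/>, if there is one.

I guess you are looking for all elements (tags) that do not contain other elements (which is not exactly what you asked for). In that case you are fine with @Justin Ko's answer and the XPath expression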

//*[not(*)]

If you really want to find all leaf nodes, you cannot use the * selector; you need node() instead:

//node()[not(node())]

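A node can be an element, but also a text node, a comment, a processing instruction, an attribute, or even an XML document (though the latter cannot occur inside other elements).

If you really only want the text nodes, go with //text() as suggested by @Priti. That does, in a sense, select the nodes you asked for (going by what you highlighted, not by the definition of a leaf node).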

【Comments】: