How to avoid creating non-significant whitespace text nodes when creating a `Nokogiri::XML` or `Nokogiri::HTML` object: answers

【Question title】: How to avoid creating non-significant white space text nodes when creating a `Nokogiri::XML` or `Nokogiri::HTML` object
【Posted】: 2014-02-02 14:16:06
【Question】:

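When parsing indented XML, non-significant whitespace text nodes are created from the whitespace between closing and opening tags. For example, from the following XML: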

<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>

whose string representation is as follows,

 "<note>\n  <to>Tove</to>\n  <from>Jani</from>\n  <heading>Reminder</heading>\n  <body>Don't forget me this weekend!</body>\n</note>\n"

the following Document is created:

#(Document:0x3fc07e4540d8 {
  name = "document",
  children = [
    #(Element:0x3fc07ec8629c {
      name = "note",
      children = [
        #(Text "\n  "),
        #(Element:0x3fc07ec8089c {
          name = "to",
          children = [ #(Text "Tove")]
          }),
        #(Text "\n  "),
        #(Element:0x3fc07e8d8064 {
          name = "from",
          children = [ #(Text "Jani")]
          }),
        #(Text "\n  "),
        #(Element:0x3fc07e8d588c {
          name = "heading",
          children = [ #(Text "Reminder")]
          }),
        #(Text "\n  "),
        #(Element:0x3fc07e8cf590 {
          name = "body",
          children = [ #(Text "Don't forget me this weekend!")]
          }),
        #(Text "\n")]
      })]
  })

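This contains many whitespace-only nodes of type Nokogiri::XML::Text.

I would like to count the children of each node in a Nokogiri XML Document, and access the first or last child, excluding non-significant whitespace. I would rather not have these nodes parsed at all, or otherwise be able to distinguish them from significant text nodes such as those inside the element <to>, like "Tove". Here is an rspec for what I am looking for: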

require 'nokogiri'
require_relative 'spec_helper'

xml_text = <<XML
<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
</note>
XML

xml = Nokogiri::XML(xml_text)

def significant_nodes(node)
  return 0
end

describe "Stackoverflow Question" do
  it "should return the number of significant nodes in nokogiri." do
    expect(significant_nodes(xml.css('note'))).to eq 4
  end
end

I would like to know how to write the significant_nodes function.

If I change the XML to:

<note>
  <to>Tove</to>
  <from>Jani</from>
  <heading>Reminder</heading>
  <body>Don't forget me this weekend!</body>
  <footer></footer>
</note>

then when I create the Document I would still like the footer to be represented; using config.noblanks is not an option.

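【Comments】:

Tove is placed inside the tag to, so you first find that tag and then get its text: doc.css( 'to' ).text
amolnpujari.wordpress.com/2012/03/31/reading_huge_xml-rb I also found ox to be 5 times faster than nokogiri when reading large XML. I also wrote a wrapper that lets you search large XML with ox and iterate over a specified element. gist.github.com/amolpujari/5966431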

Tags: xml-parsing html-parsing nokogiri

【Solution 1】:

You can use the NOBLANKS option when parsing the XML string. Consider the following example:

require 'nokogiri'

string = "<foo>\n  <bar>bar</bar>\n</foo>"
puts string
# <foo>
#   <bar>bar</bar>
# </foo>

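# Default parse: whitespace between tags becomes Text nodes
document_with_blanks = Nokogiri::XML.parse(string)

# Same input, but with the NOBLANKS parse option enabled
document_without_blanks = Nokogiri::XML.parse(string) do |config|
  config.noblanks
end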

document_with_blanks.root.children.each { |child| p child }
#<Nokogiri::XML::Text:0x3ffa4e153dac "\n  ">
#<Nokogiri::XML::Element:0x3fdce3f78488 name="bar" children=[#<Nokogiri::XML::Text:0x3fdce3f781f4 "bar">]>
#<Nokogiri::XML::Text:0x3ffa4e15335c "\n">

document_without_blanks.root.children.each { |child| p child }
#<Nokogiri::XML::Element:0x3f81bef42034 name="bar" children=[#<Nokogiri::XML::Text:0x3f81bef43ee8 "bar">]>

NOBLANKS should not remove empty nodes:

doc = Nokogiri.XML('<foo><bar></bar></foo>') do |config|
  config.noblanks
end

doc.root.children.each { |child| p child }
#<Nokogiri::XML::Element:0x3fad0fafbfa8 name="bar">

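As the OP pointed out, the documentation on parser options on the Nokogiri website (and on the libxml website) is rather cryptic. The following spec describes the behaviour of the NOBLANKS option: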

require 'rspec/autorun'
require 'nokogiri'

def parse_xml(xml_string)
  Nokogiri.XML(xml_string) { |config| config.noblanks }
end

describe "Nokogiri NOBLANKS parser option" do

  it "removes whitespace nodes if they have siblings" do
    doc = parse_xml("<root>\n <child></child></root>")
    expect(doc.root.children.size).to eq(1)
    expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Node)
  end

  it "doesn't remove whitespaces nodes if they have no siblings" do
    doc = parse_xml("<root>\n </root>")
    expect(doc.root.children.size).to eq(1)
    expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Text)
  end

  it "doesn't remove empty nodes" do
    doc = parse_xml('<root><child></child></root>')
    expect(doc.root.children.size).to eq(1)
    expect(doc.root.children.first).to be_kind_of(Nokogiri::XML::Node)
  end

end

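【Comments】:

Awesome! Thanks a lot. Edit: not the correct answer, for the reason below.
Sorry, not the correct answer. The reason is that if I then add an empty tag such as <empty></empty>, it will not be parsed and represented. Unfortunately, empty nodes need to be included.
@CodingMo Actually the NOBLANKS option should keep empty nodes. Can you post the code that strips the footer node in your example?
You are right. It does seem to include empty nodes. However, if you follow this link, it does say that it removes empty nodes. nokogiri.org/Nokogiri/XML/ParseOptions.html
@CodingMo, I just discovered Nokogiri's noblanks config option, but now I find that it does not ignore every Text node that contains only whitespace, as I had hoped. So finding your post was timely. However, there is another twist; see my post (feel free to add it to yours or adapt it). Also, the "strict" config option is the default, so most people will probably want config.strict.noblanks.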

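【Solution 2】:

You can write queries that return only element nodes and ignore text nodes. In XPath, * returns only elements, so the query could look like this (searching the whole document):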

doc.xpath('//note/*')

or, if you want to use CSS:

doc.css('note > *')

If you want to implement your significant_nodes method, you need to make the query relative to the node passed in:

def significant_nodes(node)
  node.xpath('./*').size
end

I don't know how to make relative queries with CSS, so you may need to stick with XPath.

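【Comments】:

The problem with .xpath('./*') is that if you run it on an element whose text nodes contain significant text, those text nodes will not be represented. So if we take `#(Element:0x3fc07e8d8064 { name = "from", children = [ #(Text "Jani")]})` and run .xpath('./*') on it, it will not return the text node with "Jani" in it.
@CodingMo Then don't use it on nodes like that :-)
That's a fair point, and this is a good answer; I will find it useful in the future!
@CodingMo You can use an XPath query like '//note/node()[self::* or self::text()[normalize-space()]]' to get elements and non-whitespace text nodes, although in this particular example that is almost the same as using the noblanks option.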

【Solution 3】:

Nokogiri's noblanks config option does not remove every whitespace-only Text node, even when those nodes have siblings:

require 'rspec/autorun'
require 'nokogiri'

# Same helper as in Solution 1: parse with the NOBLANKS option enabled
def parse_xml(xml_string)
  Nokogiri.XML(xml_string) { |config| config.noblanks }
end

describe "Nokogiri NOBLANKS parser option" do

  it "doesn't remove whitespace Text nodes if they're surrounded by non-whitespace Text node siblings" do
    doc = parse_xml("<root>1 <two></two> \n <three></three> \n <four></four> 5</root>")
    children = doc.root.children

    expect(children.size).to_not eq(5)
    expect(children.size).to eq(7)  # because the two newline Text nodes are not ignored
    expect(children.first).to be_kind_of(Nokogiri::XML::Node)
  end
end

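I'm not sure why Nokogiri was programmed to work this way. I think it would be better to either ignore all whitespace-only text nodes or not ignore any text nodes.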

【Comments】: