How do I pretty-print HTML with Nokogiri? Answers

【Question title】: How do I pretty-print HTML with Nokogiri?
【Posted】: 2010-12-26 07:27:16
【Question description】:

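I wrote a web crawler in Ruby and I'm using Nokogiri::HTML to parse the pages. I need to print a page out, and while messing around in IRB I noticed a pretty_print method. However, it takes a parameter and I can't figure out what it wants.

My crawler caches the HTML of web pages and writes it to a file on my local machine. I'd like to "pretty print" the HTML so that it looks nice and is properly formatted when I do so.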

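【Question comments】:

What are you trying to print? The HTML content (tags and all) or selected items? Each calls for a different approach, so clarifying would help a lot with the answers.

Tags: html ruby nokogiri pretty-print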

【Solution 1】:

Even simpler, and it works well:

puts Nokogiri::HTML(File.read('terms.fr.html')).to_xhtml

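【Comments】:

【Solution 2】:

I know I'm very late in answering this question, but I'll leave an answer anyway. I tried all of the steps above, and they do work to a certain extent.

Nokogiri does format the HTML, but it pays no attention to where opening and closing tags fall, so properly indented output is out of the picture.

I found a gem called htmlbeautifier that works like a charm. I hope others who are still looking for an answer will find this valuable.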

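【Comments】:

【Solution 3】:

By "pretty printing" an HTML page, I presume you mean that you want to reformat the HTML structure with proper indentation. Nokogiri doesn't support this; the pretty_print method is for the "pp" library, and its output is useful only for debugging.

There are several projects that understand HTML well enough to reformat it without destroying whitespace that is actually significant (the famous one is HTML Tidy), but by Googling I found this post titled "Pretty printing XHTML with Nokogiri and XSLT".

It comes down to this: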

xsl = Nokogiri::XSLT(File.open("pretty_print.xsl"))
html = Nokogiri(File.open("source.html"))
puts xsl.apply_to(html).to_s

Of course, it requires you to download the linked XSL file to your filesystem. I tried it quickly on my machine and it works like a charm.

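【Comments】:

FWIW, the linked stylesheet can cause whitespace to be added to the rendered HTML (e.g., <p><span>pre</span>fix</p> becomes "pre fix").

【Solution 4】:

The answer by @mislav is somewhat wrong. Nokogiri does support pretty-printing if you:

Parse the document as XML
Instruct Nokogiri to ignore whitespace-only nodes ("blanks") during parsing
Use to_xhtml or to_xml to specify pretty-printing parameters

In action: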

html = '<section>
<h1>Main Section 1</h1><p>Intro</p>
<section>
<h2>Subhead 1.1</h2><p>Meat</p><p>MOAR MEAT</p>
</section><section>
<h2>Subhead 1.2</h2><p>Meat</p>
</section></section>'

require 'nokogiri'
doc = Nokogiri::XML(html,&:noblanks)
puts doc
#=> <section>
#=>   <h1>Main Section 1</h1>
#=>   <p>Intro</p>
#=>   <section>
#=>     <h2>Subhead 1.1</h2>
#=>     <p>Meat</p>
#=>     <p>MOAR MEAT</p>
#=>   </section>
#=>   <section>
#=>     <h2>Subhead 1.2</h2>
#=>     <p>Meat</p>
#=>   </section>
#=> </section>

puts doc.to_xhtml( indent:3, indent_text:"." )
#=> <section>
#=> ...<h1>Main Section 1</h1>
#=> ...<p>Intro</p>
#=> ...<section>
#=> ......<h2>Subhead 1.1</h2>
#=> ......<p>Meat</p>
#=> ......<p>MOAR MEAT</p>
#=> ...</section>
#=> ...<section>
#=> ......<h2>Subhead 1.2</h2>
#=> ......<p>Meat</p>
#=> ...</section>
#=> </section>

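【Comments】:

It seems that it doesn't break chains of tags onto separate lines, but writes them one after another. The problem shows up in a document that originally had one tag per line, after applying stackoverflow.com/questions/2696537: the code tags somehow get joined into one chain, which makes to_xhtml a bit useless (
@Nakilon Did you parse the XML using the &:noblanks option?
Yes, see pastebin.com/raw.php?i=tKSSVjaG and remove the "if false" to see how the change_language URLs get joined. (Something is wrong with my browser or SO: I can't write your username with @, it just disappears, lol)
Could you please point me to the link/source in the official documentation where you found &:noblanks?
@ArupRakshit It is Ruby shorthand for Nokogiri.XML(…){|config| config.noblanks }. The Nokogiri.XML() method is documented as a shortcut for Nokogiri::XML::Document.parse. The block passed to the method is shorthand for passing parse options.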

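【Solution 5】:

My solution was to add a print method to the actual Nokogiri objects. After you run the code in the snippet below, you should be able to write node.print, and it will pretty-print the contents. No xslt required :-)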

Nokogiri::XML::Node.class_eval do
  # Print every Node by default (will be overridden by CharacterData)
  define_method :should_print? do
    true
  end

  # Duplicate this node, replace the contents of the duplicated node with a
  # newline. With this content substitution, the #to_s method conveniently
  # returns a string with the opening tag (e.g. `<a href="foo">`) on the first
  # line and the closing tag on the second (e.g. `</a>`, provided that the
  # current node is not a self-closing tag).
  #
  # Now, print the open tag preceded by the correct amount of indentation, then
  # recursively print this node's children (with extra indentation), and then
  # print the close tag (if there is a closing tag)
  define_method :print do |indent=0|
    duplicate = self.dup
    duplicate.content = "\n"
    open_tag, close_tag = duplicate.to_s.split("\n")

    puts (" " * indent) + open_tag
    self.children.select(&:should_print?).each { |child| child.print(indent + 2) }
    puts (" " * indent) + close_tag if close_tag
  end
end

Nokogiri::XML::CharacterData.class_eval do
  # Only print CharacterData if there's non-whitespace content
  define_method :should_print? do
    content =~ /\S+/
  end

  # Replace every run of consecutive whitespace characters by a single space;
  # precede the output by a certain amount of indentation; print this text.
  define_method :print do |indent=0|
    # gsub (not sub) so that every whitespace run is collapsed, not just the first
    puts (" " * indent) + to_s.strip.gsub(/\s+/, ' ')
  end
end

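【Comments】:

Do you have an example of how to use it? I tried this and got "TypeError: no implicit conversion of nil into String", so maybe I'm calling it on the wrong object.
After some more experimenting, I got this to work: doc = Nokogiri::HTML(html_source); doc.elements.each {|elem| elem.print }. Thanks.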

【Solution 6】:

This worked for me:

 pretty_html = Nokogiri::HTML(html).to_xhtml(indent: 3)

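I tried the REXML version above, but it corrupted some of my documents. And I hate to bring xslt into a new project. Both feel outdated. :)

【Comments】:

This is nice, but it adds <body> and <html> tags if they are missing. In my case I don't need them at all.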

【Solution 7】:

You can try REXML:

require "rexml/document"

doc = REXML::Document.new(xml)
doc.write($stdout, 2)
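The snippet above leaves xml undefined and prints straight to $stdout. For the caching use case in the question, a self-contained sketch that collects the output in a String might look like this. It assumes well-formed XML input (REXML is not an HTML parser) and uses REXML::Formatters::Pretty; the sample data is made up.

```ruby
require 'rexml/document'
require 'rexml/formatters/pretty'

# Illustrative, well-formed input.
xml = '<root><item id="1">a</item><item id="2">b</item></root>'
doc = REXML::Document.new(xml)

# Indent by two spaces per level; compact keeps short text nodes
# on the same line as their enclosing tags.
formatter = REXML::Formatters::Pretty.new(2)
formatter.compact = true

pretty = +''                  # unfrozen String used as the output buffer
formatter.write(doc, pretty)

puts pretty
```

The pretty formatter rewrites whitespace inside text nodes, which may be related to the corrupted output reported for real-world HTML in another answer.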

【Comments】:

【Solution 8】:

Why not try the pp method?

require 'pp'
pp some_var

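【Comments】:

While Nokogiri implements methods that help with "pretty printing", the output is only useful to developers. It seems to me that Jarsen wants to display nicely formatted HTML source.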