Grabbing text between all tags in Nokogiri? Answers

【Question title】: Grabbing text between all tags in Nokogiri?
【Posted】: 2013-01-07 05:02:10
【Question】:

What is the most efficient way to grab all of the text between HTML tags?

<div>
<a> hi </a>
....

A bunch of text surrounded by HTML tags.

【Comments】:

Also take a look at github.com/rgrove/sanitize

Tags: ruby nokogiri

【Answer 1】:

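require "nokogiri"

doc = Nokogiri::HTML(your_html)  # your_html: the HTML source as a String
doc.xpath("//text()").to_s       # concatenates every text node in the document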

【Comments】:

【Answer 2】:

Use the SAX parser. It is much faster than the XPath option.

require "nokogiri"

some_html = <<-HTML
<html>
  <head>
    <title>Title!</title>
  </head>
  <body>
    This is the body!
  </body>
</html>
HTML

class TextHandler < Nokogiri::XML::SAX::Document
  def initialize
    @chunks = []
  end

  attr_reader :chunks

  def cdata_block(string)
    characters(string)
  end

  def characters(string)
    @chunks << string.strip if string.strip != ""
  end
end
th = TextHandler.new
parser = Nokogiri::HTML::SAX::Parser.new(th)
parser.parse(some_html)
puts th.chunks.inspect

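【Comments】:

How would I change this to get only the text between the body tags?
Set a flag: start capturing characters only after you have seen the opening body tag, and stop capturing once the body tag closes.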

【Answer 3】:

Just do:

doc = Nokogiri::HTML(your_html)
doc.xpath("//text()").text

【Comments】:

【Answer 4】:

Here is how to get all of the text in the question div on this page:

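require 'nokogiri'
require 'open-uri'

# URI.open is needed on Ruby 3.0+, where the bare open no longer accepts URLs
doc = Nokogiri::HTML(URI.open("http://stackoverflow.com/questions/1512850/grabbing-text-between-all-tags-in-nokogiri"))
# .text returns the text content; .to_s would return the div's HTML markup
puts doc.css("#question").text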

【Comments】: