【Question Title】: Nokogiri parsing table with no html element
【Posted】: 2016-09-25 03:03:53
【Question】:

I have code that fetches a URL and parses its "li" elements into an array. However, I run into problems when trying to parse anything that is not inside a "b" tag.

Code:

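require 'nokogiri'
require 'open-uri'
require 'csv'
require 'pp'

url = '(some URL)'
page = Nokogiri::HTML(URI.open(url))  # URI.open: plain open(url) no longer fetches URLs as of Ruby 3.0
csv = CSV.open("/tmp/output.csv", 'w')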

page.search('//li[not(@id) and not(@class)]').each do |row|
  arr = []
  row.search('b').each do |cell|
    arr << cell.text
  end
  csv << arr
  pp arr
end

HTML:

<li><b>The Company Name</b><br>
The Street<br>
The City, 
The State 
The Zipcode<br><br>
</li>

I want to parse all of the elements, so that the output looks like this:

["The Company Name", "The Street", "The City", "The State", "The Zip Code"],
["The Company Name", "The Street", "The City", "The State", "The Zip Code"],
["The Company Name", "The Street", "The City", "The State", "The Zip Code"]

【Comments】:

    Tags: html ruby parsing csv nokogiri


    【Solution 1】:
    require 'nokogiri'
    
    def main
      output = []
      page = File.open("parse.html") {|f| Nokogiri::HTML(f)}
      page.search("//li[not(@id) and not (@class)]").each do |row|
        arr = []
        result = row.text
        result.each_line { |l|
          if l.strip.length > 0
            arr << l.strip
          end
        }
        output << arr
      end
      print output
    end
    
    if __FILE__ == $PROGRAM_NAME
      main()
    end
    

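    【Comments】:

    • This returns one giant array that looks like ["Street Name", "City", "State", "Zip", "Other Street Name", "Other City", "Other State", "Other Zip" ]
    【Solution 2】:

    I ended up finding a solution to my own question, so in case anyone is interested: I just changed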

    row.search('b').each do |cell|
    

    to:

    row.search('text()').each do |cell|
    

    I also changed

    arr << cell.text
    

    to:

    arr << cell.text.gsub("\n", '').gsub("\r", '') 
    

    in order to remove every \n and \r present in the output.

    【Comments】:

      【Solution 3】:

      Based on your HTML, I would do it this way:

      require 'nokogiri'
      
      doc = Nokogiri::HTML(<<EOT)
      <ol>
      <li><b>The Company Name</b><br>
      The Street<br>
      The City, 
      The State 
      The Zipcode<br><br>
      </li>
      <li><b>The Company Name</b><br>
      The Street<br>
      The City, 
      The State 
      The Zipcode<br><br>
      </li>
      </ol>
      EOT
      
      doc.search('li').map{ |li|
        text = li.text.split("\n").map(&:strip)
      }
      # => [["The Company Name",
      #      "The Street",
      #      "The City,",
      #      "The State",
      #      "The Zipcode"],
      #     ["The Company Name",
      #      "The Street",
      #      "The City,",
      #      "The State",
      #      "The Zipcode"]]
      

      【Comments】:
