【Question title】:How to scrape data from list of URLs and save data to CSV with nokogiri
【Posted】:2023-03-19 16:53:01
【Question】:

I have a file called bontyurls.csv that looks like this:

http://bontrager.com/model/11383
http://bontrager.com/model/01740
http://bontrager.com/model/09595

I want my script to read that file and then write an output file, bonty_test_urls_results.csv, that looks like this:

url,model_names
http://bontrager.com/model/11383,"Road TLR Conversion Kit"
http://bontrager.com/model/01740,"404 File Not Found"
http://bontrager.com/model/09595,"RXL Road"

Here is what I have so far:

# based on code from here: http://www.andrewsturges.com/2011/09/how-to-harvest-web-data-using-ruby-and.html

require 'nokogiri'
require 'open-uri'
require 'csv'

@urls = Array.new
@model_names = Array.new

urls = CSV.read("bontyurls.csv")
(0..urls.length - 1).each do |index|
  puts urls[index][0]
  doc = Nokogiri::HTML(open(urls[index][0]))
  doc.xpath('//h1').each do |model_name|
    @model_name << model_name.content
  end
end

# write results to file  
CSV.open("bonty_test_urls_results.csv", "wb") do |row|
  row << ["url", "model_names"]
  (0..@urls.length - 1).each do |index|
    row << [
      @urls[index], 
      @model_names[index]]
  end
end

The code doesn't work. I get this error:

$ ruby bonty_test_urls.rb 
http://bontrager.com/model/00310
bonty_test_urls.rb:15:in `block (2 levels) in <main>': undefined method `<<' for nil:NilClass (NoMethodError)
    from /home/simon/.rvm/gems/ruby-1.9.3-p194/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:239:in `block in each'
    from /home/simon/.rvm/gems/ruby-1.9.3-p194/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:238:in `upto'
    from /home/simon/.rvm/gems/ruby-1.9.3-p194/gems/nokogiri-1.5.5/lib/nokogiri/xml/node_set.rb:238:in `each'
    from bonty_test_urls.rb:14:in `block in <main>'
    from bonty_test_urls.rb:11:in `each'
    from bonty_test_urls.rb:11:in `<main>'

Here is some code that at least returns the model name. I just can't get it to work in the larger script:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(open("http://bontrager.com/model/09124"))
doc.xpath('//h1').each do |node|
  puts node.text
end

Also, I haven't figured out how to handle URLs that return a 404.

【Comments】:

    Tags: ruby csv nokogiri


    【Answer 1】:

    This is how I'd do it:

    require 'csv'
    require 'nokogiri'
    require 'open-uri'
    
    # Emit the header row automatically before the first data row.
    CSV_OPTIONS = {
      :write_headers => true,
      :headers => %w[url model_names]
    }
    
    CSV.open('bonty_test_urls_results.csv', 'wb', CSV_OPTIONS) do |csv|
      # The input has one URL per line, so read it as plain text.
      File.foreach('bontyurls.csv') do |url|
        url.chomp!
        begin
          doc = Nokogiri.HTML(open(url))
          h1 = doc.at('h1').text.strip
          # Fall back to the page title when the <h1> is empty.
          h1 = doc.at('title').text.strip.sub(/^Bontrager: /i, '') if h1.empty?
          csv << [url, h1]
        rescue OpenURI::HTTPError => e
          # A 404 (or any other HTTP error) is recorded as the row's value.
          csv << [url, e.message]
        end
      end
    end
    

    This generates a CSV file like:

    url,model_names
    http://bontrager.com/model/11383,Road TLR Conversion Kit (Model #11383)
    http://bontrager.com/model/01740,404 File Not Found
    http://bontrager.com/model/09595,RXL Road (Model #09595)
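
For readers on a newer Ruby, here is a rough sketch of the same rescue-per-URL pattern, not the answer author's code. It rests on a few assumptions: Ruby 3.0+, where open-uri no longer patches `open` for URLs (use `URI.open`) and an options hash has to be passed as keywords (for example `**CSV_OPTIONS`). The helper names `pick_model_name` and `write_results` are hypothetical, and the fetch step is passed in as a block so that the demonstration runs offline against stand-in data instead of the real site.

```ruby
require 'csv'
require 'open-uri' # stdlib; defines OpenURI::HTTPError and URI.open
require 'stringio'

# Hypothetical helper: choose the model name from the <h1> text, falling back
# to the <title> text (minus the "Bontrager: " prefix) when the <h1> is
# missing or blank. Either argument may be nil.
def pick_model_name(h1_text, title_text)
  name = h1_text.to_s.strip
  name = title_text.to_s.strip.sub(/\ABontrager: /i, '') if name.empty?
  name
end

# Hypothetical helper: write one CSV row per URL to `io`. The block receives
# a URL and returns its model name; an HTTP error becomes the row's value.
def write_results(urls, io)
  csv = CSV.new(io, write_headers: true, headers: %w[url model_names])
  urls.each do |url|
    begin
      csv << [url, yield(url)]
    rescue OpenURI::HTTPError => e
      csv << [url, e.message]
    end
  end
  io
end

# Offline demonstration: stand-in page data, no network access.
fake_pages = {
  'http://bontrager.com/model/11383' => ['Road TLR Conversion Kit', nil],
  'http://bontrager.com/model/09595' => ['  ', 'Bontrager: RXL Road']
}
urls = fake_pages.keys.insert(1, 'http://bontrager.com/model/01740')

out = write_results(urls, StringIO.new) do |url|
  # A URL missing from the stand-in data simulates a 404 response.
  h1, title = fake_pages.fetch(url) { raise OpenURI::HTTPError.new('404 File Not Found', nil) }
  pick_model_name(h1, title)
end
puts out.string

# With Nokogiri installed and network access, the block could instead be:
#   doc = Nokogiri::HTML(URI.open(url))
#   pick_model_name(doc.at('h1')&.text, doc.at('title')&.text)
```

The nil handling matters because `doc.at` returns nil when nothing matches, so calling `.text` on the result directly would raise NoMethodError on a page without an `<h1>`.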
    

    【Comments】:

    • @pguardiario, it's wb in the OP's question. CSV data sometimes contains 8-bit characters, so I kept that setting.
    【Answer 2】:

    You declared @model_names but try to push onto @model_name, which is why it is nil.
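
To illustrate, here is a minimal sketch using stand-in data rather than real page fetches. An instance variable that was never assigned reads as nil instead of raising, so the typo only surfaces when `<<` is called on it. The sketch also fills `@urls`, which the question's script never does; as a result its output loop over `(0..@urls.length - 1)` would write only the header row even after the typo is fixed. The `sample` hash is hypothetical.

```ruby
@model_names = []

# Typo: @model_name (singular) was never assigned, so it evaluates to nil
# and nil does not respond to <<.
begin
  @model_name << 'RXL Road'
rescue NoMethodError => e
  puts e.message
end

# Corrected: push onto the array that was actually initialised, and record
# the URL in the same step so both arrays stay the same length.
@urls = []
sample = { 'http://bontrager.com/model/09595' => 'RXL Road' } # stand-in data
sample.each do |url, name|
  @urls << url
  @model_names << name
end

# Each URL is now paired with its model name, ready to be written as a row.
p @urls.zip(@model_names)
```

Collecting `[url, name]` pairs in a single array, or writing each row as soon as it is fetched, avoids the two arrays drifting apart when a page has no `<h1>` or more than one.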

    【Comments】:
