【Question title】: Iconv::IllegalSequence when using WWW::Mechanize
【Posted】: 2009-02-25 14:22:24
【Question】:

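I'm trying to do some web scraping, but the WWW::Mechanize gem doesn't seem to like the page's encoding and crashes.
The POST request results in a 302 redirect (which Mechanize follows, so far so good), and it is the resulting page that seems to trigger the crash. I've googled a lot but haven't found anything on how to fix this. Any ideas?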

Code:

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new

agent.user_agent_alias = 'Mac Safari'
answer = agent.post('https://www.budget.de/de/reservierung/privatkunden/step1/schnellbuchung',
{"Country" => "Deutschland",
"Abholstation" => "Aalen",
"Abgabestation" => "Aalen",
"Abholdatum" => "26.02.2009",
"Abholzeit_stunde" => "13",
"Abholzeit_minute" => "30",
"Abgabedatum" => "28.02.2009",
"Abgabezeit_stunde" => "13",
"Abgabezeit_minute" => "30",
"CountryID" => "DE",
"AbholstationID"=>"AA1",
"AbgabestationID"=>"AA1"
}
)
puts answer.body

Error:

D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/util.rb:29:in `iconv': "\204nderungen vorbe"... (Iconv::IllegalSequence)
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/util.rb:29:in `to_native_charset'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/response_header_handler.rb:29:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:30:in `pass'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/handler.rb:6:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/response_body_parser.rb:35:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:30:in `pass'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/handler.rb:6:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain/pre_connect_hook.rb:14:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize/chain.rb:25:in `handle'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:494:in `fetch_page'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:545:in `fetch_page'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:403:in `post_form'
from D:/Ruby/lib/ruby/gems/1.8/gems/mechanize-0.9.1/lib/www/mechanize.rb:322:in `post'
from test.rb:7

【Comments】:

    Tags: ruby screen-scraping iconv mechanize-ruby


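    【Solution 1】:

    The page is definitely UTF-8, but Mechanize uses NKF (part of Ruby's standard library) to guess the encoding, and for some reason it comes up with Shift JIS. The quickest fix is to override Mechanize's encoding map so that, when it uses Iconv to convert the body to UTF-8, it also passes UTF-8 as the source encoding. You can do that like this: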

    WWW::Mechanize::Util::CODE_DIC[:SJIS] = "UTF-8"
    

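    Put this after the line that requires the Mechanize library. You may want to restore the original value afterward or, better still, track down the root cause and submit a patch if necessary.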
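One way to make sure the override gets undone is to wrap it in a block that restores the old entry. The following is only a sketch: `STAND_IN_CODE_DIC` is a stand-in hash whose contents are assumed (the real `WWW::Mechanize::Util::CODE_DIC` is not loaded here), and `with_charset_override` is a hypothetical helper, not part of Mechanize.

```ruby
# Stand-in for WWW::Mechanize::Util::CODE_DIC. Its contents are an
# assumption made for illustration; the real table may differ.
STAND_IN_CODE_DIC = { :SJIS => "SHIFT_JIS", :UTF8 => "UTF-8" }

# Hypothetical helper: temporarily remap one entry of the table, run the
# block, then put the original state back even if the block raises.
def with_charset_override(table, key, value)
  had_key  = table.key?(key)
  original = table[key]
  table[key] = value
  yield
ensure
  if had_key
    table[key] = original
  else
    table.delete(key)
  end
end

seen = nil
with_charset_override(STAND_IN_CODE_DIC, :SJIS, "UTF-8") do
  # While the block runs, a lookup for :SJIS yields the overridden value.
  seen = STAND_IN_CODE_DIC[:SJIS]
end
```

With the real library one would presumably pass `WWW::Mechanize::Util::CODE_DIC` as the table and issue the `agent.post` call inside the block.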

    Note: I tracked this down by following the backtrace into the Mechanize library. The to_native_charset method calls detect_charset, and that is where the problem lies.
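To illustrate why a wrong guess is fatal, here is a minimal sketch that reproduces the same kind of failure without Mechanize or Iconv. It assumes Ruby 1.9 or later and uses the built-in `String#encode` in place of Iconv; `convert_to_utf8` is a hypothetical helper, and the sample text is taken from the error message above, assuming it begins with "Änderungen".

```ruby
# Raw bytes of "Änderungen vorbehalten" in UTF-8 (Ä is 0xC3 0x84).
RAW_BODY = "\xC3\x84nderungen vorbehalten".b.freeze

# Hypothetical helper: treat the raw bytes as `source_encoding` and try to
# convert them to UTF-8. Returns the converted string, or nil when the
# bytes are not valid or not convertible in that encoding.
def convert_to_utf8(bytes, source_encoding)
  str = bytes.dup.force_encoding(source_encoding)
  return nil unless str.valid_encoding?
  str.encode("UTF-8")
rescue EncodingError
  nil
end

as_sjis = convert_to_utf8(RAW_BODY, "Shift_JIS") # the misdetected source
as_utf8 = convert_to_utf8(RAW_BODY, "UTF-8")     # the actual source
```

Read as Shift JIS, the byte 0x84 followed by `n` does not map to any character, so the conversion fails and `as_sjis` is nil, whereas `as_utf8` holds the intact text. That matches the `"\204nderungen"` sequence reported in the backtrace.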

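    【Discussion】:

      【Solution 2】:

      In my case, the get method returned a Mechanize::File, which does not apply any encoding handling at all.
      I was able to fix it by converting manually with Iconv, but that only works if you already know the encoding.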

      result = @agent.get uri
      # Mechanize::File instead of Mechanize::Page is returned 
      # so we have to convert manually
      result = Iconv.conv("utf-8", "iso-8859-1", result.body)
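Iconv was removed from Ruby's standard library in Ruby 2.0, so on newer versions the same manual conversion can be sketched with the built-in `String#encode`. This is an assumption about how one might port the snippet, not part of the original answer; `body_to_utf8` is a hypothetical helper, and the byte string stands in for `result.body`.

```ruby
# Hypothetical helper: reinterpret the raw bytes as `source_encoding`
# and transcode them to UTF-8. As with Iconv.conv, the caller has to
# know the source encoding in advance.
def body_to_utf8(raw_body, source_encoding)
  raw_body.dup.force_encoding(source_encoding).encode("UTF-8")
end

# Stand-in for result.body: "Änderungen" encoded as ISO-8859-1,
# where Ä is the single byte 0xC4.
latin1_body = "\xC4nderungen".b
utf8_body   = body_to_utf8(latin1_body, "ISO-8859-1")
```

The argument order differs from `Iconv.conv`, which takes the target encoding first and the source encoding second.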
      

      【Discussion】:
