【Question Title】: UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' when using urllib.request in Python 3
【Posted】: 2014-05-09 04:36:14
【Question】:

I am writing a script that goes through a list of links and parses information from each page.

It works for most sites, but some of them choke with: "UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)"

It fails inside client.py, which is part of urllib in Python 3.

The exact link is: http://finance.yahoo.com/news/cafés-growth-faster-than-fast-food-peers-144512056.html

There are many similar posts here, but none of the answers seem to work for me.

My code is:

from urllib import request
from urllib.error import HTTPError, URLError
import socket

def __request(link, debug=0):
    try:
        # timeout is long because I was getting lots of timeouts
        html = request.urlopen(link, timeout=35).read()
        unicode_html = html.decode('utf-8', 'ignore')
    # NOTE: the except HTTPError must come first, otherwise
    # except URLError would also catch an HTTPError.
    except HTTPError as e:
        if debug:
            print("The server couldn't fulfill the request for " + link)
            print('Error code: ', e.code)
        return ''
    except URLError as e:
        if isinstance(e.reason, socket.timeout):
            print('timeout')
        return ''
    else:
        return unicode_html

This is called with:

link = 'http://finance.yahoo.com/news/cafés-growth-faster-than-fast-food-peers-144512056.html'
page = __request(link)

The traceback is:

Traceback (most recent call last):
  File "<string>", line 250, in run_nodebug
  File "C:\reader\get_news.py", line 276, in <module>
    main()
  File "C:\reader\get_news.py", line 255, in main
    body = get_article_body(item['link'],debug=0)
  File "C:\reader\get_news.py", line 155, in get_article_body
    page = __request('na',url)
  File "C:\reader\get_news.py", line 50, in __request
    html = request.urlopen(link, timeout=35).read()
  File "C:\Python33\Lib\urllib\request.py", line 156, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Python33\Lib\urllib\request.py", line 469, in open
    response = self._open(req, data)
  File "C:\Python33\Lib\urllib\request.py", line 487, in _open
    '_open', req)
  File "C:\Python33\Lib\urllib\request.py", line 447, in _call_chain
    result = func(*args)
  File "C:\Python33\Lib\urllib\request.py", line 1268, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "C:\Python33\Lib\urllib\request.py", line 1248, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "C:\Python33\Lib\http\client.py", line 1061, in request
    self._send_request(method, url, body, headers)
  File "C:\Python33\Lib\http\client.py", line 1089, in _send_request
    self.putrequest(method, url, **skips)
  File "C:\Python33\Lib\http\client.py", line 953, in putrequest
    self._output(request.encode('ascii'))
UnicodeEncodeError: 'ascii' codec can't encode character '\xe9' in position 13: ordinal not in range(128)
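The failure can be reproduced without any networking: http.client builds the request line ("GET /path HTTP/1.1") and encodes it as ASCII, and 'é' (U+00E9, hence the '\xe9' in the message) has no ASCII representation. A minimal sketch of that step, with the path taken from the failing URL:

```python
# Reproduces the encode step http.client.putrequest performs on the
# request line; 'é' sits at index 13, matching the traceback.
request_line = 'GET /news/cafés-growth-faster-than-fast-food-peers HTTP/1.1'
try:
    request_line.encode('ascii')
except UnicodeEncodeError as exc:
    print(exc)  # 'ascii' codec can't encode character '\xe9' in position 13 ...
```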

Any help is appreciated; this is driving me crazy, and I think I have tried every combination of x.decode and the like.

(I could just ignore the offending characters, if that is possible.)

【Question Comments】:

  • Use Kenneth Reitz's requests library. I cannot recommend it highly enough. It will make all of this code much simpler and will almost certainly fix this problem as well.
  • @JackGibbs: requests does indeed handle URLs with non-ASCII characters in them by explicitly re-quoting the URL.

Tags: python exception-handling web-scraping beautifulsoup utf8-decode


【Solution 1】:

Use the percent-encoded URL:

link = 'http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html'

I got the percent-encoded URL above by pointing my browser at

http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html

and then copy-pasting the encoded URL the browser produced back into a text editor. You can, however, generate the percent-encoded URL programmatically with:

from urllib import parse

link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'

scheme, netloc, path, query, fragment = parse.urlsplit(link)
path = parse.quote(path)
link = parse.urlunsplit((scheme, netloc, path, query, fragment))

which yields

http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html
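For reuse, the steps above can be wrapped in a small helper. The name quote_url and the query-string handling are my own additions, not part of the original answer, and the sketch assumes the query uses standard '=' and '&' separators:

```python
from urllib import parse

def quote_url(link):
    """Percent-encode non-ASCII characters in a URL's path and query.

    Hypothetical helper wrapping the urlsplit/quote/urlunsplit steps
    shown above.
    """
    scheme, netloc, path, query, fragment = parse.urlsplit(link)
    path = parse.quote(path)               # '/' is kept unescaped by default
    query = parse.quote(query, safe='=&')  # keep the query-string separators
    return parse.urlunsplit((scheme, netloc, path, query, fragment))

link = 'http://finance.yahoo.com/news/cafés-growing-faster-than-fast-food-peers-144512056.html'
print(quote_url(link))
# http://finance.yahoo.com/news/caf%C3%A9s-growing-faster-than-fast-food-peers-144512056.html
```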

【Discussion】:

  • If it's part of the URL path, use parse.quote() rather than parse.quote_plus() (which is for x-www-form-urlencoded data).
  • Thanks, this works. I wasn't sure whether other parts of the URL might have the same problem, so I split it up and then rebuilt it: url_tuple = parse.urlsplit(link); encoded_link = "%s://%s%s?%s%s" % (url_tuple[0], url_tuple[1], parse.quote(url_tuple[2]), url_tuple[3], parse.quote(url_tuple[4]))
  • Glad your solution works. But use parse.urlunsplit to rebuild the URL; that is what it is for.
【Solution 2】:

Your URL contains characters that cannot be represented in ASCII.

You have to make sure all characters are properly URL-encoded, e.g. with urllib.parse.quote_plus; it will represent any non-ASCII characters with UTF-8 URL-encoded escapes.
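In case it matters for other URLs: quote is meant for the path part of a URL (spaces become %20 and '/' is left alone), while quote_plus is meant for x-www-form-urlencoded query data (spaces become '+' and '/' is escaped too). Both emit UTF-8 percent-escapes for non-ASCII characters. A quick comparison:

```python
from urllib.parse import quote, quote_plus

print(quote('/news/cafés here'))       # /news/caf%C3%A9s%20here
print(quote_plus('/news/cafés here'))  # %2Fnews%2Fcaf%C3%A9s+here
```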

【Discussion】:
