【Question Title】: Python scraping: Error 54 'Connection reset by peer'
【Posted】: 2020-08-05 11:39:35
【Question Description】:

I wrote a simple script that fetches HTML from several websites. It ran without any problems until yesterday, when it suddenly started throwing this exception:

Traceback (most recent call last):
  File "crowling.py", line 45, in <module>
    result = requests.get(url)
  File "/Users/gen/.pyenv/versions/3.7.1/lib/python3.7/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/gen/.pyenv/versions/3.7.1/lib/python3.7/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/gen/.pyenv/versions/3.7.1/lib/python3.7/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/gen/.pyenv/versions/3.7.1/lib/python3.7/site-packages/requests/sessions.py", line 685, in send
    r.content
  File "/Users/gen/.pyenv/versions/3.7.1/lib/python3.7/site-packages/requests/models.py", line 829, in content
    self._content = b''.join(self.iter_content(CONTENT_CHUNK_SIZE)) or b''
  File "/Users/gen/.pyenv/versions/3.7.1/lib/python3.7/site-packages/requests/models.py", line 754, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ("Connection broken: ConnectionResetError(54, 'Connection reset by peer')", ConnectionResetError(54, 'Connection reset by peer'))

The main part of the script is this:

c = 0
#urls is the list of urls as strings
for url in urls:
    result = requests.get(url)
    c += 1
    with open('htmls/p{}.html'.format(c),'w',encoding='UTF-8') as f:
        f.write(result.text)

The list urls is generated by other code of mine, and I have checked that the URLs are correct. Moreover, the exception does not occur at a fixed point: sometimes the script stops while fetching the 20th HTML file, sometimes it gets to the 80th before stopping. Since the exception appeared without any change to the code, I suspect it is caused by the Internet connection. Still, I would like the script to run reliably. What could be causing this error?

【Question Discussion】:

  • Looking at the exception stack trace — could some of these URLs contain Unicode characters?
  • Can you post some sample URLs that you are requesting?

标签: python web-scraping python-requests urllib3


【Solution 1】:

If you are sure the URLs are correct and this is an intermittent connection problem, you can retry the connection after a failure:

import time
import requests
from requests.exceptions import ChunkedEncodingError

c = 0
# urls is the list of urls as strings
for url in urls:
    trycnt = 3  # max try count
    while trycnt > 0:
        try:
            result = requests.get(url)
            c += 1
            with open('htmls/p{}.html'.format(c), 'w', encoding='UTF-8') as f:
                f.write(result.text)
            break  # success, go to next URL
        except ChunkedEncodingError as ex:
            trycnt -= 1  # one attempt used up
            if trycnt <= 0:
                print("Failed to retrieve: " + url + "\n" + str(ex))  # done retrying
            else:
                time.sleep(0.5)  # wait half a second, then retry
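The retry-in-a-loop pattern above can also be factored into a small reusable helper, which keeps the download loop itself readable. This is only a sketch; `fetch_with_retry` is a hypothetical name, not part of requests:

```python
import time

def fetch_with_retry(fetch, url, attempts=3, delay=0.5):
    """Call fetch(url); on exception, wait `delay` seconds and retry.

    Re-raises the last exception after `attempts` failed tries.
    """
    last_exc = None
    for i in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:
            last_exc = exc
            if i < attempts - 1:  # don't sleep after the final failure
                time.sleep(delay)
    raise last_exc

# Usage with the script above would look like:
#   result = fetch_with_retry(requests.get, url)
```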

【Discussion】:

  • Somehow my script now runs fine on another Linux server, but the idea of retrying in an except clause is great. Thank you very much.
  • Please accept the answer so this post is removed from the "unanswered" list. Thanks.
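As an alternative to a hand-rolled retry loop, requests can retry at the transport level via urllib3's Retry mounted on an HTTPAdapter. A sketch (note: this retries connect/read errors and the listed status codes before the response is returned; an error that occurs mid-download can still surface as ChunkedEncodingError, so the explicit except-and-retry above remains useful):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry failed connects/reads up to 3 times, with exponential backoff,
# and also retry on these transient server status codes.
retries = Retry(total=3, backoff_factor=0.5,
                status_forcelist=[500, 502, 503, 504])

session = requests.Session()
session.mount('http://', HTTPAdapter(max_retries=retries))
session.mount('https://', HTTPAdapter(max_retries=retries))

# session.get(url) now retries transparently at the transport level.
```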