【Title】: Max retries exceeded with url caused by NewConnectionError
【Posted】: 2016-11-17 00:48:01
【Description】:

I am trying to scrape a product site that lists over 2000 products in a single category across many pages, collecting details such as the product name. The crawler runs for a while and then breaks, raising the error below on random links. Here is the traceback:

Traceback (most recent call last):
  File "crawler1.py", line 103, in <module>
    crawler(25)
  File "crawler1.py", line 35, in crawler
    get_single_data(href)
  File "crawler1.py", line 57, in get_single_data
    source_code = requests.get(item_url, timeout=335)
  File "/Library/Python/2.7/site-packages/requests/api.py", line 71, in get
    return request('get', url, params=params, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/api.py", line 57, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 475, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Python/2.7/site-packages/requests/sessions.py", line 585, in send
    r = adapter.send(request, **kwargs)
  File "/Library/Python/2.7/site-packages/requests/adapters.py", line 467, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.therealreal.com', port=443): Max retries exceeded with url: /products/women/handbags/handle-bags/chanel-lax-handle-bag-4 (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x10d8de190>: Failed to establish a new connection: [Errno 60] Operation timed out',))

I catch every error I can think of and have added sleep() delays everywhere I could. Is there a way to avoid this error so that I can extract all 2000 products' data in one run? Any workaround would be appreciated. Please help.

Here is the code:

import socket
from time import sleep

import requests
from bs4 import BeautifulSoup

try:
    source_code = requests.get(item_url, timeout=335)
    sleep(.3)
except requests.exceptions.ReadTimeout:
    print("1")
    sleep(30)
    source_code = requests.get(item_url, timeout=335)
except requests.exceptions.Timeout:
    print("2")
    sleep(30)
    source_code = requests.get(item_url, timeout=335)
except ConnectionError:
    print("3")
    sleep(30)
    source_code = requests.get(item_url, timeout=335)
except socket.error:
    sleep(30)
    source_code = requests.get(item_url, timeout=335)
plain_text = source_code.text
temp = BeautifulSoup(plain_text)

P.S. You can ignore the timeout value; I have tried many different values, including no timeout at all, and none of them helped. What is going wrong?

【Comments】:

  • What if you increase the request timeout? "...it breaks over time..." could be caused by the number of requests the server has to handle; the server gets busier -> responses take longer -> requests time out
  • @dm295 Technically, leaving timeout as None should help, but even so, I did increase it to 600 and it still failed with the same error after about 700 products. How can the server be too busy when I am telling it to wait that long?
  • @dcorelibran Did you find a solution to your problem?

Tags: python python-2.7 web-scraping beautifulsoup python-requests


【Solution 1】:

Try catching requests.exceptions.ConnectionError.
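
A minimal sketch of what that could look like, adapted to the snippet in the question (the helper name get_with_retries and the retry/backoff values are illustrative, not from the original post). It also avoids a second problem visible in the question's code: each except block calls requests.get again outside any try, so a second consecutive failure escapes uncaught. On Python 2.7 there is no built-in ConnectionError, so the exception should be spelled with its full requests.exceptions path:

import socket
import time

import requests

def get_with_retries(url, timeout=30, max_attempts=5, backoff=5):
    """Fetch url, retrying on connection-level failures with a growing delay."""
    for attempt in range(1, max_attempts + 1):
        try:
            return requests.get(url, timeout=timeout)
        except (requests.exceptions.ConnectionError,
                requests.exceptions.Timeout,
                socket.error) as exc:
            if attempt == max_attempts:
                raise  # out of attempts; let the caller handle it
            print("attempt %d failed (%s); retrying" % (attempt, exc))
            time.sleep(backoff * attempt)  # wait a little longer each time

# usage, in place of the try/except ladder from the question:
# source_code = get_with_retries(item_url)
# plain_text = source_code.text

An alternative is to let the transport retry for you by mounting requests.adapters.HTTPAdapter(max_retries=...) on a requests.Session. Either way, pacing the crawler with a short delay between requests (as the existing sleep(.3) already does) may help if the server is dropping connections under load.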

【Discussion】:
