【Title】: requests.post script for webscraping not working
【Posted】: 2018-03-02 18:37:28
【Question】:

I am trying to scrape some data from https://ocfs.ny.gov/main/childcare/ccfs_template.asp without a cap on the number of records per page. Developer tools show a POST request going to "https://apps.netforge.ny.gov/dcfs/Search/SearchHTP/1.1" when Search is clicked (a space needs to be entered into the Name field before running the Search).

I want to download all of the data into one file. My code uses requests.post, but I am not sure I am using it correctly. The error I get is shown below my code. I would appreciate some guidance on how to modify it; I am fairly new to Python.

Here is the code:

import requests, csv

dataArg={'Criteria.ModalityCode':'', 'Criteria.CountyID':'', 'Criteria.SchoolDistrict':'', 'Criteria.ZipCode':'', 'Criteria.FacilityName':'+', 'Criteria.RegistrationID':'', 'Criteria.MedicationOnly':'false', 'Criteria.NonTraditionalHoursOnly':'false', 'Criteria.ShowOpenOnly':'true', 'Criteria.ShowOpenOnly':'false', 'Paging.PageSize':''}
dataCsv = requests.post('https://apps.netforge.ny.gov/dcfs/Search/Search HTP/1.1',data=dataArg)

openFile = open('nydata', 'wb')
for chunk in dataCsv.iter_content(1000000):
    openFile.write(chunk)

open_csv = open('nydata')
csv_reader = csv.reader(open_csv)
list_data = list(csv_reader)

The error:

Traceback (most recent call last):
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 601, in urlopen
    chunked=chunked)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 346, in _make_request
    self._validate_conn(conn)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 850, in _validate_conn
    conn.connect()
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connection.py", line 326, in connect
    ssl_context=context)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\util\ssl_.py", line 329, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 407, in wrap_socket
    _context=self, _session=session)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 814, in __init__
    self.do_handshake()
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 1068, in do_handshake
    self._sslobj.do_handshake()
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\ssl.py", line 689, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\adapters.py", line 440, in send
    timeout=timeout
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\connectionpool.py", line 639, in urlopen
    _stacktrace=sys.exc_info()[2])
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\urllib3\util\retry.py", line 388, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='apps.netforge.ny.gov', port=443): Max retries exceeded with url: /dcfs/Search/Search%20HTP/1.1 (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)'),))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\NY.py", line 19, in <module>
    dataCsv = requests.post('https://apps.netforge.ny.gov/dcfs/Search/Search HTP/1.1',data=dataArg)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\api.py", line 112, in post
    return request('post', url, data=data, json=json, **kwargs)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 618, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\Karun\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\adapters.py", line 506, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='apps.netforge.ny.gov', port=443): Max retries exceeded with url: /dcfs/Search/Search%20HTP/1.1 (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:777)'),))

【Comments】:

  • Is there something wrong with how this question was posted? Curious why no one has responded to it, given how many Python and web-scraping experts are on this site. Would appreciate some guidance on the question above.
  • No one has answered your post because your code is not reproducible. The link https://apps.netforge.ny.gov/dcfs/Search/Search HTP/1.1 returns a Runtime Error. Either the link is wrong or the site requires login access. Apart from opening and reading a file, nothing in your code suggests anything related to web scraping.
  • Adding verify=False to the .post() method seems to work. However, I am no expert in this area, so I cannot explain why.
  • @KeyurPotdar: That has now reduced the error to a warning, but the resulting file has no data. Here is the modified command: "dataCsv = requests.post('apps.netforge.ny.gov/dcfs/Search/Search HTP/1.1', data=dataArg, verify=False)". dataArg is defined in the code given in the question above. What else do I need to do?

Tags: python python-3.x web-scraping python-requests


【Solution 1】:

First of all, the POST request goes to https://apps.netforge.ny.gov/dcfs/Search/Search, not https://apps.netforge.ny.gov/dcfs/Search/Search HTP/1.1.

Regarding SSL cert verification, the documentation says:

Requests verifies SSL certificates for HTTPS requests, just like a web browser. SSL verification is enabled by default, and Requests will throw an SSLError if it is unable to verify the certificate.

So you can set verify=False to get around this. Note, however, that you should not do this in production code.
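Rather than turning verification off, you can also keep it on and pass `verify` an explicit CA bundle path. A minimal sketch (certifi is the package that provides the bundle Requests uses by default; any site-specific bundle path would be an assumption):

```python
import os
import certifi  # CA bundle that Requests uses by default

# requests.post() accepts verify=<path to a CA bundle>. Passing certifi's
# bundle explicitly is equivalent to the default; passing a bundle that
# contains the server's certificate chain fixes CERTIFICATE_VERIFY_FAILED
# without disabling verification:
#   requests.post(url, data=data, verify=certifi.where())
bundle = certifi.where()
print(os.path.exists(bundle), bundle.endswith(".pem"))  # → True True
```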

Finally, this code will get you the page:

data = {
    'Criteria.ModalityCode': '',
    'Criteria.CountyID': '',
    'Criteria.SchoolDistrict': '',
    'Criteria.ZipCode': '',
    'Criteria.FacilityName': '+',
    'Criteria.RegistrationID': '',
    'Criteria.MedicationOnly': 'false',
    'Criteria.NonTraditionalHoursOnly': 'false',
    'Criteria.ShowOpenOnly': 'false',
    'Paging.PageSize': ''
}

dataCsv = requests.post('https://apps.netforge.ny.gov/dcfs/Search/Search', data=data, verify=False)
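To land the response in a file, the question's iter_content() loop can be wrapped up like this. A hedged sketch, with the file name and the fake sample chunks as assumptions for the offline demo:

```python
import csv

def save_stream(chunks, path):
    """Write an iterable of byte chunks to `path`, mirroring the
    response.iter_content() loop from the question, but closing the
    file via a context manager."""
    with open(path, "wb") as fh:
        for chunk in chunks:
            fh.write(chunk)

# With a live response it would look like (not executed here):
#   resp = requests.post(URL, data=data, verify=False, stream=True)
#   save_stream(resp.iter_content(1_000_000), "nydata.csv")

# Self-contained demo with fake CSV chunks:
save_stream([b"name,county\n", b"ABC Daycare,Albany\n"], "nydata.csv")
with open("nydata.csv", newline="") as fh:
    rows = list(csv.reader(fh))
print(rows)  # → [['name', 'county'], ['ABC Daycare', 'Albany']]
```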

【Discussion】:

  • Thanks for the help; it works now. I had to give Paging.PageSize a value, otherwise no data was returned. Two follow-ups: 1. Why shouldn't "verify=False" be used in production code? What are the consequences of doing so? 2. Why do the developer tools show "HTTP/1.1" after "Search/Search" as part of the URL (apps.netforge.ny.gov/dcfs/Search/Search)? What is the significance of "HTTP/1.1"? How would anyone know to ignore or include it?
  • Actually, it doesn't show that to me. It shows the URL I used in the code. Are you looking at the file named Search?
  • Never mind. I must have picked up "HTTP/1.1" from somewhere in the source when the plain URL didn't seem to work for me. Lastly, why shouldn't we use "verify=False" in production code (as you warned above)? I mean, what are the consequences of doing so?