[Posted]: 2018-11-12 20:54:55
[Problem description]:
I am trying to scrape data from this webpage. It is a big job: there are roughly 600 next-page links, and the scraper crashed after 300 pages (8 hours) with a "Connection reset by peer" error.
Rather than restarting from scratch after every crash, I want the scraper to resume from page 300, reach the last page without erroring, and let me append the two output data files. My code (below) works when I step through the pagination consecutively (starting from page 1), but it does not work if I open page 1 and try to post straight to page 300. I get "AttributeError: 'NoneType' object has no attribute 'find'" on the line that sets the page_no variable, which suggests the request never actually reached page 300. Any ideas what the problem is and how to fix it?
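One mitigation I could try for the resets themselves is a requests Session that retries transient failures with backoff, instead of letting a single reset kill an 8-hour run. A sketch (note that POST has to be listed explicitly in allowed_methods, which needs urllib3 >= 1.26):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient failures (connection resets, 5xx responses) up to 5 times
# with exponential backoff; POST is not retried by default, so allow it.
retry = Retry(
    total=5,
    backoff_factor=2,
    status_forcelist=[500, 502, 503, 504],
    allowed_methods=frozenset({'GET', 'POST'}),
)
session = requests.Session()
session.mount('http://', HTTPAdapter(max_retries=retry))
# session.get / session.post would then replace the bare requests calls.
```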
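To see what the server actually returned when this happens, I could guard the table lookup instead of letting it crash: the AttributeError means soup.find() returned None, i.e. the response did not contain the results grid at all. A hypothetical guard (using html.parser here so it runs without lxml):

```python
from bs4 import BeautifulSoup

def get_results_table(html):
    """Return the results grid, or fail loudly if the response lacks it.

    Hypothetical helper: if the postback was rejected (e.g. the server sent
    back the search form or an error page instead of the grid), soup.find()
    returns None and the later .find('tr', ...) raises AttributeError.
    Checking explicitly makes the real cause visible.
    """
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', {'id': 'ctl00_ContentPlaceHolder1_grdevents'})
    if table is None:
        raise RuntimeError('results grid missing from response; '
                           'the postback was probably rejected')
    return table
```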
import requests
from bs4 import BeautifulSoup

# Open search page (getFormData, defined elsewhere, extracts the hidden
# ASP.NET form fields from the response)
url = 'http://forestsclearance.nic.in/'
r = requests.get(url + 'Online_Status.aspx')
VIEWSTATE, GENERATOR, VALIDATION = getFormData(r.content)

cookies = {
    'ASP.NET_SessionId': 'kaqs1jzegnfn4zxpwio4jthl',
    'countrytabs': '0',
    'countrytabs1': '0',
    'acopendivids': 'Omfc,Email,Campa,support,livestat,commitee,Links',
    'acgroupswithpersist': 'nada',
}
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8',
}

# Click the search box
r = requests.post(
    url + 'Online_Status.aspx',
    headers=headers,
    cookies=cookies,
    data={
        'ctl00$ScriptManager1': 'ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$Button1',
        '__EVENTARGUMENT': '',
        '__EVENTTARGET': '',
        '__VIEWSTATE': VIEWSTATE,
        '__VIEWSTATEGENERATOR': GENERATOR,
        '__VIEWSTATEENCRYPTED': '',
        '__EVENTVALIDATION': VALIDATION,
        'ctl00$ContentPlaceHolder1$ddlyear': '-All Years-',
        'ctl00$ContentPlaceHolder1$ddl1': 'Select',
        'ctl00$ContentPlaceHolder1$ddl3': 'Select',
        'ctl00$ContentPlaceHolder1$ddlcategory': '-Select All-',
        'ctl00$ContentPlaceHolder1$DropDownList1': '-Select All-',
        'ctl00$ContentPlaceHolder1$txtsearch': '',
        'ctl00$ContentPlaceHolder1$HiddenField1': '',
        'ctl00$ContentPlaceHolder1$HiddenField2': '',
        '__ASYNCPOST': 'false',
        'ctl00$ContentPlaceHolder1$Button1': 'SEARCH',
    },
)
VIEWSTATE, GENERATOR, VALIDATION = getFormData(r.content)

# Post to page 300
lastPage = 563
for page in range(300, lastPage + 1):
    r = requests.post(
        url + 'Online_Status.aspx',
        cookies=cookies,
        data={
            'ctl00$ScriptManager1': 'ctl00$ContentPlaceHolder1$UpdatePanel1|ctl00$ContentPlaceHolder1$Button1',
            'ctl00$ContentPlaceHolder1$RadioButtonList1': 'New',
            '__EVENTARGUMENT': 'Page${}'.format(page),
            '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$grdevents',
            '__VIEWSTATE': VIEWSTATE,
            '__VIEWSTATEGENERATOR': GENERATOR,
            '__VIEWSTATEENCRYPTED': '',
            '__EVENTVALIDATION': VALIDATION,
            'ctl00$ContentPlaceHolder1$ddlyear': '-All Years-',
            'ctl00$ContentPlaceHolder1$ddl1': 'Select',
            'ctl00$ContentPlaceHolder1$ddl3': 'Select',
            'ctl00$ContentPlaceHolder1$ddlcategory': '-Select All-',
            'ctl00$ContentPlaceHolder1$DropDownList1': '-Select All-',
            'ctl00$ContentPlaceHolder1$txtsearch': '',
            'ctl00$ContentPlaceHolder1$HiddenField1': '',
            'ctl00$ContentPlaceHolder1$HiddenField2': '',
            '__ASYNCPOST': 'false',
        },
    )
    # Scrape data
    soup = BeautifulSoup(r.content, 'lxml')
    table = soup.find('table', {'id': 'ctl00_ContentPlaceHolder1_grdevents'})
    page_no = int(table.find('tr', {'class': 'pagi'}).span.text)
    rows = table.findAll('tr')
    for row in rows[1:len(rows) - 2]:
        pass  # My scraping code goes here...
    # Get form data for the next page's POST request
    VIEWSTATE, GENERATOR, VALIDATION = getFormData(r.content)
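For reference, getFormData is not shown above; it just pulls the three hidden ASP.NET fields out of a response, roughly along these lines (a sketch assuming the standard hidden-input markup; the real helper may differ):

```python
from bs4 import BeautifulSoup

def getFormData(html):
    """Extract the hidden ASP.NET form fields needed for the next postback.

    Sketch of the helper referenced above (not shown in the original post);
    assumes the usual <input type="hidden" id="__VIEWSTATE" value="..."> markup.
    """
    soup = BeautifulSoup(html, 'html.parser')
    viewstate = soup.find('input', {'id': '__VIEWSTATE'})['value']
    generator = soup.find('input', {'id': '__VIEWSTATEGENERATOR'})['value']
    validation = soup.find('input', {'id': '__EVENTVALIDATION'})['value']
    return viewstate, generator, validation
```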
[Discussion]:
Tags: python asp.net web-scraping python-requests