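[Posted]: 2019-09-20 17:48:47
[Question]:
I have the following code, but the response it returns (200 OK) is just the first page, with the dropdowns in their default state. Note that the dropdowns are dynamic and progressive: each one is populated only after the previous selection, until the final Search button appears. Can someone point out what is wrong with my code?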
import json
import pprint

import requests
from bs4 import BeautifulSoup, SoupStrainer

def process(ghatno):
    home_url = 'http://igrmaharashtra.gov.in/eASR/eASRCommon.aspx?hDistName=Nashik'
    post_url = 'http://igrmaharashtra.gov.in/eASR/eASRCommon.aspx?hDistName=Nashik'
    print "Please wait...getting details of :" + ghatno
    with requests.Session() as session:
        # GET the page first to pick up the cookies and the hidden ASP.NET fields
        r = session.get(url=post_url)
        cookies = r.cookies
        pprint.pprint(r.headers)
        gethead = r.headers
        soup = BeautifulSoup(r.text, 'html.parser')
        viewstate = soup.select('input[name="__VIEWSTATE"]')[0]['value']
        csrftoken = soup.select('input[name="__CSRFTOKEN"]')[0]['value']
        eventvalidation = soup.select('input[name="__EVENTVALIDATION"]')[0]['value']
        viewgen = soup.select('input[name="__VIEWSTATEGENERATOR"]')[0]['value']
        # Form payload: hidden fields plus the values for the final search step
        data = {
            '__CSRFTOKEN': csrftoken,
            '__EVENTARGUMENT': '',
            '__EVENTTARGET': '',
            '__LASTFOCUS': '',
            '__SCROLLPOSITION': '0',
            '__SCROLLPOSITIONY': '0',
            '__EVENTVALIDATION': eventvalidation,
            '__VIEWSTATE': viewstate,
            '__VIEWSTATEGENERATOR': viewgen,
            'ctl00$ContentPlaceHolder5$ddlLanguage': 'en-US',
            'ctl00$ContentPlaceHolder5$btnSearchCommonSr': 'Search',
            'ctl00$ContentPlaceHolder5$ddlTaluka': '2',
            'ctl00$ContentPlaceHolder5$ddlVillage': '25',
            'ctl00$ContentPlaceHolder5$ddlYear': '20192020',
            'ctl00$ContentPlaceHolder5$grpSurveyLocation': 'rdbSurveyNo',
            'ctl00$ContentPlaceHolder5$txtCommonSurvey': 363
        }
        headers = {
            'Host': 'igrmaharashtra.gov.in',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:60.0) Gecko/20100101 Firefox/60.0',
            'Referer': 'http://igrmaharashtra.gov.in/eASR/eASRCommon.aspx?hDistName=Nashik',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.5',
        }
        # POST the form (payload serialized with json.dumps, sent with requests.post rather than the session)
        r = requests.post(url=post_url, data=json.dumps(data), cookies=cookies, headers=headers)
        soup = BeautifulSoup(r.text, 'html.parser')
        table = SoupStrainer('tr')
        soup = BeautifulSoup(soup.get_text(), 'html.parser', parse_only=table)
        print(soup.get_text())
        pprint.pprint(r.headers)
        print r.text
        getpost = r.headers
        getpostrequest = r.request.headers
        getresponsebody = r.request.body
        # Dump the request/response details to a file for inspection
        f = open('/var/www/html/nashik/hiren.txt', 'w')
        f.write(str(gethead))
        f.write(str(getpostrequest))
        f.write(str(getresponsebody))
        f.write(str(getpost))
        f.close()
The response I get is as follows:
Response headers - (GET request)
{'Content-Length': '5994', 'X-AspNet-Version': '4.0.30319', 'Set-Cookie': 'ASP.NET_SessionId=24wwh11lwvzy5gf0xlzi1we4; path=/; HttpOnly, __CSRFCOOKIE=d7b10286-fc9f-4ed2-863d-304737df8758; path=/; HttpOnly', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding', 'X-Powered-By': 'ASP.NET', 'Server': 'Microsoft-IIS/8.0', 'Cache-Control': 'private', 'Date': 'Thu, 02 May 2019 08:21:48 GMT', 'Content-Type': 'text/html; charset=utf-8'}
Request headers - (POST request)
{'Content-Length': '3726', 'Accept-Language': 'en-US,en;q=0.5', 'Accept-Encoding': 'gzip, deflate', 'Host': 'igrmaharashtra.gov.in', 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:60.0) Gecko/20100101 Firefox/60.0', 'Connection': 'keep-alive', 'Referer': 'http://igrmaharashtra.gov.in/eASR/eASRCommon.aspx?hDistName=Nashik', 'Cookie': '__CSRFCOOKIE=d7b10286-fc9f-4ed2-863d-304737df8758; ASP.NET_SessionId=24wwh11lwvzy5gf0xlzi1we4', 'Content-Type': 'application/x-www-form-urlencoded'}
Response headers - (POST request)
{'Content-Length': '7834', 'X-AspNet-Version': '4.0.30319', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding', 'X-Powered-By': 'ASP.NET', 'Server': 'Microsoft-IIS/8.0', 'Cache-Control': 'private', 'Date': 'Fri, 03 May 2019 10:21:45 GMT', 'Content-Type': 'text/html; charset=utf-8'}
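**The response is the default page, with the dropdowns unselected**
It shows नाशिक and "- - Select Taluka - -" INSTEAD of option value "2", i.e. इगतपुरी. Once option "2" is selected, I expect value "25" to be selected in the next dropdown, and then the final survey number "363" to produce the result.
Note that I have also tried the Mechanize browser, without success!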
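One detail in the code above that may be worth checking is that the payload is passed through `json.dumps` before being posted. An ASP.NET Web Forms page generally reads postback fields from a form-encoded body, and `requests` form-encodes `data=` only when it is given a dict; a string is sent as-is. The following is a minimal sketch in Python 3 (the question's code is Python 2), using only the standard library and a shortened, illustrative subset of the question's fields, to show how differently the two bodies look to a form parser. It is not a confirmed diagnosis of this particular site.

```python
import json
from urllib.parse import parse_qs, urlencode

# Illustrative subset of the fields from the question's payload.
fields = {
    "ctl00$ContentPlaceHolder5$ddlTaluka": "2",
    "ctl00$ContentPlaceHolder5$ddlVillage": "25",
    "ctl00$ContentPlaceHolder5$txtCommonSurvey": 363,
}

# What a string body built with json.dumps looks like.
json_body = json.dumps(fields)

# What a form-encoded body looks like (the shape produced for data=<dict>).
form_body = urlencode(fields)

# parse_qs stands in here for a server-side form parser.
fields_seen_in_json_body = parse_qs(json_body)
fields_seen_in_form_body = parse_qs(form_body)

print("JSON body:", json_body)
print("Form body:", form_body)
print("Fields recovered from JSON body:", fields_seen_in_json_body)
print("Fields recovered from form body:", fields_seen_in_form_body)
```

If the server really does parse the body as a form, none of the field names would be recognized in the JSON variant, which would be consistent with getting the default page back.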
[Comments]:
-
What is your code supposed to do? Could you provide a test ghatno value along with the expected output?
-
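@QHarr, the number is 363. I believe the dropdowns are loaded dynamically by POST requests: selecting a value in the first dropdown triggers a POST request, then you select a value in the second dropdown, and so on.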
-
So the POST request is responsible for populating the next dropdown?
-
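Yes, so to get the result I need to post the form 3 times. In the code I pass the values for the last POST. I now get a response, but it only reflects the state of the first POST: the first dropdown is at its selection state, and the remaining dropdowns are not visible in the response.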
-
@DaVinci007 Could you post the correct answer in the answer section below?
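The three-POST flow described in the comments above could be sketched roughly as follows, in Python 3. This is only a sketch under several assumptions: that the Taluka and Village dropdowns use ASP.NET's usual auto-postback mechanism (the control name sent in `__EVENTTARGET`), that they are posted in that order, that the year dropdown does not need its own postback, and that each response carries fresh hidden fields which must be sent back with the next POST. The control names come from the question; `hidden_fields`, `post_form`, and `run_search` are hypothetical helper names, and `session` is expected to behave like a `requests.Session`.

```python
from html.parser import HTMLParser

# Control-name prefix copied from the question's payload.
PREFIX = "ctl00$ContentPlaceHolder5$"


class HiddenInputParser(HTMLParser):
    """Collects name -> value for every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def hidden_fields(html):
    """Return the hidden form fields (__VIEWSTATE and friends) found in html."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


def post_form(session, url, html, selections, event_target=""):
    """POST the form once, reusing the hidden fields of the latest page."""
    payload = hidden_fields(html)   # fresh tokens from the previous response
    payload.update(selections)      # every visible value chosen so far
    payload["__EVENTTARGET"] = event_target
    payload["__EVENTARGUMENT"] = ""
    # Passing a dict as data= lets the session form-encode the body.
    return session.post(url, data=payload).text


def run_search(session, url, taluka, village, year, survey_no):
    """Assumed flow: POST for Taluka, POST for Village, then POST the search."""
    html = session.get(url).text
    selections = {PREFIX + "ddlLanguage": "en-US", PREFIX + "ddlYear": year}

    # POSTs 1 and 2: one simulated dropdown change each (assumed order).
    for control, value in (("ddlTaluka", taluka), ("ddlVillage", village)):
        selections[PREFIX + control] = value
        html = post_form(session, url, html, selections, PREFIX + control)

    # POST 3: the search itself, triggered by the button rather than an event target.
    selections[PREFIX + "grpSurveyLocation"] = "rdbSurveyNo"
    selections[PREFIX + "txtCommonSurvey"] = survey_no
    selections[PREFIX + "btnSearchCommonSr"] = "Search"
    return post_form(session, url, html, selections)
```

With the `requests` package installed, this might be called as `run_search(requests.Session(), post_url, "2", "25", "20192020", "363")`, keeping all requests on one session so the cookies from the initial GET are reused. Whether the site accepts this exact sequence has not been verified here.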
Tags: python web-scraping beautifulsoup python-requests urllib