[Posted]: 2018-07-18 00:19:16
[Question]:
For the past month or so I have been trying to read several pages from an aspx website. I have no problem locating all the required items on the site, but the solutions I have tried still do not work. I read somewhere that all of the header details must be present, so I added them. I also read somewhere that __EVENTTARGET must be set to tell aspx which button was pressed, so I tried a few different things (see below). I also read that a session should be established to handle cookies, so I implemented that as well. At this point my code snippet produces exactly the same information I get when I analyze the POST request with the web developer tools (the print lines have been commented out), yet this code always returns the first page. Does anyone know what is missing from this code to make it work? I should also point out that Selenium and mechanize are not really options for this project.
import requests
from bs4 import BeautifulSoup
import time
import collections
import json

def SPAIN_STK_LIST(numpage):
    payload = collections.OrderedDict()
    header = {'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
              'Accept-Encoding' : 'gzip, deflate',
              'Accept-language' : 'en-US,en;q=0.9',
              'Cache-Control' : 'max-age=0',
              'Connection' : 'keep-alive',
              'Content-Type': 'text/html; charset=utf-8',
              'Host' : 'www.bolsamadrid.es',
              'Origin' : 'null',
              'Upgrade-Insecure-Requests' : '1',
              'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebkit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
              }
    for i in range(0, numpage):
        ses = requests.session()
        if(i == 0):
            req = ses.get("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx", headers = header)
        else:
            req = ses.post("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx", headers = header, data = payload)
        # print(req.request.body)
        # print(req.request.headers)
        # print(req.request.url)
        page = req.text
        soup = BeautifulSoup(page, "lxml")
        # find __VIEWSTATE and __EVENTVALIDATION for the next page
        viewstate = soup.select("#__VIEWSTATE")[0]['value']
        # print("VIEWSTATE: ", viewstate)
        eventval = soup.select("#__EVENTVALIDATION")[0]['value']
        # print("EVENTVALIDATION: ", eventval)
        header = {'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
                  'Accept-Encoding' : 'gzip, deflate',
                  'Accept-language' : 'en-US,en;q=0.9',
                  'Cache-Control' : 'max-age=0',
                  'Connection' : 'keep-alive',
                  'Content-Type': 'application/x-www-form-urlencoded',
                  'Host' : 'www.bolsamadrid.es',
                  'Origin' : 'null',
                  'Upgrade-Insecure-Requests' : '1',
                  'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebkit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'
                  }
        target = "ct100$Contenido$GoPag{:=>2}"
        payload = collections.OrderedDict()
        payload['__EVENTTARGET'] = ""
        #payload['__EVENTTARGET'] = "GoPag"
        #payload['__EVENTTARGET'] = "ct100$Contenido$GoPag"
        #payload['__EVENTTARGET'] = target.format(i + 1)
        payload['__EVENTARGUMENT'] = ""
        payload['__VIEWSTATE'] = viewstate
        payload['__VIEWSTATEGENERATOR'] = "65A1DED9"
        payload['__EVENTVALIDATION'] = eventval
        payload['ct100$Contenido$GoPag'] = i + 1
        table = soup.find("table", {"id" : "ctl00_Contenido_tblEmisoras"})
        for row in table.findAll("tr")[1:]:
            cells = row.findAll("td")
            print(cells[0].find("a").get_text().replace(",","").replace("S.A.", ""))
        time.sleep(1)

SPAIN_STK_LIST(6)
Note that the first header's Content-Type is set to "text/html" because it is the initial request, while every subsequent request is made with a Content-Type of "application/x-www-form-urlencoded". Any pointers on what I should try next would be greatly appreciated. E.
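As a side note on the postback mechanics described above: rather than hand-typing each hidden field name (note the code above writes ct100 in some places and ctl00 in others), one defensive pattern is to collect every hidden input by its name attribute straight from the parsed page and only override the fields you need. The sketch below is a minimal, offline illustration of that idea; the sample HTML and the build_postback_payload helper are hypothetical, and the pager field name is assumed to match the ctl00_Contenido_tblEmisoras prefix seen in the question's own table id.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the hidden fields an ASP.NET WebForms page emits.
SAMPLE_HTML = """
<form>
<input type="hidden" id="__VIEWSTATE" name="__VIEWSTATE" value="dDwtMTIz" />
<input type="hidden" id="__VIEWSTATEGENERATOR" name="__VIEWSTATEGENERATOR" value="65A1DED9" />
<input type="hidden" id="__EVENTVALIDATION" name="__EVENTVALIDATION" value="aBcDeF" />
</form>
"""

def build_postback_payload(html, pager_field, page_number):
    """Copy every hidden input's name/value pair from the live HTML, then
    set the pager box to the requested page. Reading the names from the
    page itself avoids transcription typos such as ct100 vs ctl00."""
    soup = BeautifulSoup(html, "html.parser")
    payload = {inp["name"]: inp.get("value", "")
               for inp in soup.find_all("input", type="hidden")}
    payload["__EVENTTARGET"] = ""    # empty when the pager box itself posts back
    payload["__EVENTARGUMENT"] = ""
    payload[pager_field] = str(page_number)
    return payload

payload = build_postback_payload(SAMPLE_HTML, "ctl00$Contenido$GoPag", 2)
print(payload["__VIEWSTATE"])             # value carried over from the page
print(payload["ctl00$Contenido$GoPag"])   # the page we ask for
```

The same helper would be called on each response in the loop, so the __VIEWSTATE and __EVENTVALIDATION posted back are always the ones the server just issued.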
[Discussion]:
Tags: python asp.net python-3.x web-scraping python-requests