【Posted】: 2018-01-25 02:22:40
【Problem description】:
import json

import requests
from bs4 import BeautifulSoup

url = 'http://estadistico.ut.com.sv/OperacionDiaria.aspx'

# Use a session so cookies from the initial GET are sent with the POST.
s = requests.Session()
pagereq = s.get(url)
soup = BeautifulSoup(pagereq.content, 'lxml')

# Hidden ASP.NET state fields that have to be echoed back to the server.
viewstategenerator = soup.find("input", attrs={'id': '__VIEWSTATEGENERATOR'})['value']
viewstate = soup.find("input", attrs={'id': '__VIEWSTATE'})['value']
eventvalidation = soup.find("input", attrs={'id': '__EVENTVALIDATION'})['value']

eventtarget = 'ASPxDashboardViewer1'
DXCss = '1_33,1_4,1_9,1_5,15_2,15_4'
DXScript = '1_232,1_134,1_225,1_169,1_187,15_1,1_183,1_182,1_140,1_147,1_148,1_142,1_141,1_143,1_144,1_145,1_146,15_0,15_6,15_7'

# Export argument copied from the browser's POST request. JSON null/true are
# written as Python None/True so that json.dumps() reproduces them.
# The '+' characters in the text values may be URL-encoded spaces from the capture.
eventargument = {"Task":"Export","ExportInfo":{"Mode":"SingleItem","GroupName":"pivotDashboardItem1","FileName":"Generación+por+tipo+de+tecnología+(MWh)","ClientState":{"clientSize":{"width":509,"height":385},"titleHeight":48,"itemsState":[{"name":"pivotDashboardItem1","headerHeight":34,"position":{"left":11,"top":146},"width":227,"height":108,"virtualSize":None,"scroll":{"horizontal":True,"vertical":True}}]},"Format":"Excel","DocumentOptions":{"paperKind":"Letter","pageLayout":"Portrait","scaleMode":"AutoFitWithinOnePage","scaleFactor":1,"autoFitPageCount":1,"showTitle":True,"title":"Operación+Diaria","imageFormatOptions":{"format":"Png","resolution":96},"excelFormatOptions":{"format":"Csv","csvValueSeparator":","},"commonOptions":{"filterStatePresentation":"None","includeCaption":True,"caption":"Generación+por+tipo+de+tecnología+(MWh)"},"pivotOptions":{"printHeadersOnEveryPage":True},"gridOptions":{"fitToPageWidth":True,"printHeadersOnEveryPage":True},"chartOptions":{"automaticPageLayout":True,"sizeMode":"Zoom"},"pieOptions":{"autoArrangeContent":True},"gaugeOptions":{"autoArrangeContent":True},"cardOptions":{"autoArrangeContent":True},"mapOptions":{"automaticPageLayout":True,"sizeMode":"Zoom"},"rangeFilterOptions":{"automaticPageLayout":True,"sizeMode":"Stretch"},"imageOptions":{},"fileName":"Generación+por+tipo+de+tecnología+(MWh)"},"ItemType":"PIVOT"},"Context":"BwAHAAIkY2NkNWRiYzItYzIwNS00MDIyLTkzZjUtYWQ0NzVhYTM5Y2E3Ag9PcGVyYWNpb25EaWFyaWECAAIAAAAAAMByQA==","RequestMarker":1,"ClientState":{}}

postdata = {'__EVENTTARGET': eventtarget,
            # Send the argument as a single JSON string; requests does not
            # serialize a dict value to JSON on its own.
            '__EVENTARGUMENT': json.dumps(eventargument, separators=(',', ':')),
            '__VIEWSTATE': viewstate,
            '__VIEWSTATEGENERATOR': viewstategenerator,
            '__EVENTVALIDATION': eventvalidation,
            'DXScript': DXScript,
            'DXCss': DXCss
            }
datareq = s.post(url, data=postdata)
print(datareq.text)
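I am trying to scrape data from this .aspx web page (the URL in the code above). The page loads its data dynamically through JavaScript, so it cannot be scraped directly with requests/BeautifulSoup.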
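Looking at the network traffic, I can see that when you click an element's export button (Exportar a), choose an export type (Excel, CSV), and confirm, a POST request is sent to the page. It returns a base64-encoded string containing the data I need. As far as I can tell, there is no way to issue a GET request for the file directly, because it is only generated on request.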
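Once the POST succeeds, the decoding step could be sketched as follows. This assumes the response body is nothing but the base64 text of a UTF-8 CSV file, which I have not verified for this server. `decode_export` and the sample payload are made up for illustration.

```python
import base64
import csv
import io


def decode_export(payload, encoding="utf-8"):
    """Decode a base64 string and parse the result as CSV rows.

    Assumption: the payload is plain base64 of CSV text. The real
    response may wrap it in other markup that has to be stripped first.
    """
    raw = base64.b64decode(payload)
    text = raw.decode(encoding)
    return list(csv.reader(io.StringIO(text)))


# Hypothetical payload standing in for the server response.
sample = base64.b64encode("Tipo,MWh\nHidro,120.5\n".encode("utf-8")).decode("ascii")
rows = decode_export(sample)
print(rows)
```

If the server uses a different text encoding, the `encoding` argument would need to change accordingly.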
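What I want to do is replicate the POST request that triggers the CSV response. So I first scrape __VIEWSTATE, __VIEWSTATEGENERATOR, and __EVENTVALIDATION from the page. __EVENTTARGET, DXCss, and DXScript appear to be constant. __EVENTARGUMENT is copied directly from the POST request.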
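Rather than looking up each state field by id, the hidden inputs could be collected in one pass. The sketch below uses only the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages. `HiddenFieldParser`, `hidden_fields`, and the page fragment are hypothetical, and the real page's values are much longer.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in the HTML text."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields


# Made-up page fragment for illustration only.
sample_html = """
<form method="post" action="./OperacionDiaria.aspx">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="DEADBEEF" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="visible" value="ignored" />
</form>
"""
fields = hidden_fields(sample_html)
print(sorted(fields))
```

The resulting dict could then be updated with __EVENTTARGET and __EVENTARGUMENT before posting, so that any hidden field I have overlooked is sent back as well.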
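My code returns a server application error. I think the problem is either (a) a wrong __EVENTARGUMENT (perhaps it is partly dynamic rather than fixed?), (b) my not really understanding how .aspx pages work, or (c) that what I want to do is not possible with these tools.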
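Hypothesis (a) can be partly checked offline. The captured __EVENTARGUMENT appears to be a JSON string, so whatever Python value is posted has to serialize to the same text. A minimal sketch with a made-up fragment (not the full argument) compares quoted `'true'`/`'null'` with Python's `True`/`None`:

```python
import json

# Made-up fragment shaped like part of the export argument.
quoted = {"showTitle": "true", "virtualSize": "null"}
native = {"showTitle": True, "virtualSize": None}

# Compact separators give JSON without spaces, like the captured request text.
quoted_json = json.dumps(quoted, separators=(",", ":"))
native_json = json.dumps(native, separators=(",", ":"))

print(quoted_json)  # {"showTitle":"true","virtualSize":"null"}
print(native_json)  # {"showTitle":true,"virtualSize":null}
```

Only the second form matches the JSON literals seen in the browser's request, so quoting those values in Python is one plausible source of the mismatch. Whether it is the cause of the server error is not confirmed.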
I did consider using Selenium to trigger the data export, but I could not see a way to capture the server response.
【Discussion】:
Tags: python asp.net web-scraping