【Question Title】: Scraping through pages of aspx website - only gets page 1
【Posted】: 2018-07-18 00:19:16
【Question Description】:

For the past month or so I have been trying to read several pages from an aspx website. I have no problem finding all the required items on the site, but the solutions I have tried still do not work. I read somewhere that all the header details must be present, so I added them. I also read somewhere that __EVENTTARGET must be set to tell the aspx page which button was pressed, so I tried a few different things (see below). I also read that a session should be established to handle cookies, so I implemented that too. So far, the code snippet below produces exactly the same information I get when I analyze the POST request with the browser's web developer tools (the print lines have been commented out), yet this code always gives me the first page. Does anyone know what is missing from this code to make it work? I should also point out that Selenium and Mechanize are not really options for this project.

import requests
from bs4 import BeautifulSoup
import time
import collections
import json

def SPAIN_STK_LIST(numpage):
    payload = collections.OrderedDict()
    header = {'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
          'Accept-Encoding' : 'gzip, deflate',
          'Accept-language' : 'en-US,en;q=0.9',
          'Cache-Control' : 'max-age=0',
          'Connection' : 'keep-alive',
          'Content-Type': 'text/html; charset=utf-8',
          'Host' : 'www.bolsamadrid.es',
          'Origin' : 'null',
          'Upgrade-Insecure-Requests' : '1',
          'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebkit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36' 
          }
    for i in range(0, numpage):
        ses = requests.session()
        if(i == 0):
            req = ses.get("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx", headers = header)
        else:
            req = ses.post("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx", headers = header, data = payload)
#            print(req.request.body)
#            print(req.request.headers)
#            print(req.request.url)
        page = req.text
        soup = BeautifulSoup(page, "lxml")
        # find __VIEWSTATE and __EVENTVALIDATION for the next page
        viewstate = soup.select("#__VIEWSTATE")[0]['value']
#        print("VIEWSTATE: ", viewstate)
        eventval = soup.select("#__EVENTVALIDATION")[0]['value']
#        print("EVENTVALIDATION: ", eventval)
        header = {'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
              'Accept-Encoding' : 'gzip, deflate',
              'Accept-language' : 'en-US,en;q=0.9',
              'Cache-Control' : 'max-age=0',
              'Connection' : 'keep-alive',
              'Content-Type': 'application/x-www-form-urlencoded',
              'Host' : 'www.bolsamadrid.es',
              'Origin' : 'null',
              'Upgrade-Insecure-Requests' : '1',
              'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebkit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36' 
              }
        target = "ct100$Contenido$GoPag{:=>2}"
        payload = collections.OrderedDict()
        payload['__EVENTTARGET'] = ""
        #payload['__EVENTTARGET'] = "GoPag"
        #payload['__EVENTTARGET'] = "ct100$Contenido$GoPag"
        #payload['__EVENTTARGET'] = target.format(i + 1)
        payload['__EVENTARGUMENT'] = ""
        payload['__VIEWSTATE'] = viewstate
        payload['__VIEWSTATEGENERATOR'] = "65A1DED9"
        payload['__EVENTVALIDATION'] = eventval
        payload['ct100$Contenido$GoPag'] = i + 1
        table = soup.find("table", {"id" : "ctl00_Contenido_tblEmisoras"})
        for row in table.findAll("tr")[1:]:
            cells = row.findAll("td")
            print(cells[0].find("a").get_text().replace(",","").replace("S.A.", ""))
        time.sleep(1)


SPAIN_STK_LIST(6)

Note that the first header's Content-Type is set to "text/html" because it is the first request, but any subsequent requests are made with a Content-Type of "application/x-www-form-urlencoded". Any pointers on what I should try next would be greatly appreciated. E.
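As an aside, the manual Content-Type juggling described above should not be necessary: when a dict is passed to requests via `data=`, the library form-encodes the body and sets the Content-Type header itself. A minimal sketch (using a prepared request so nothing is actually sent):

```python
import requests

# When `data=` is a dict, requests encodes the body as a form and sets
# Content-Type to application/x-www-form-urlencoded automatically, so
# hard-coding that header by hand is redundant.
req = requests.Request(
    "POST",
    "http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx",
    data={"__EVENTTARGET": "", "__EVENTARGUMENT": ""},
)
prepared = req.prepare()
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
```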

【Question Discussion】:

    Tags: python asp.net python-3.x web-scraping python-requests


    【Solution 1】:

    The easiest way is shown below. Why hard-code __EVENTTARGET, __VIEWSTATE and the rest? Let the script handle them:

    import requests
    from bs4 import BeautifulSoup
    
    url = "http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx"
    
    res = requests.get(url,headers = {"User-Agent":"Mozilla/5.0"})
    soup = BeautifulSoup(res.text,"lxml")
    
    for page in range(7):
        formdata = {}
        for item in soup.select("#aspnetForm input"):
            if "ctl00$Contenido$GoPag" in item.get("name"):
                formdata[item.get("name")] = page
            else:
                formdata[item.get("name")] = item.get("value")
    
        req = requests.post(url,data=formdata)
        soup = BeautifulSoup(req.text,"lxml")
        for items in soup.select("#ctl00_Contenido_tblEmisoras tr")[1:]:
            data = [item.get_text(strip=True) for item in items.select("td")]
            print(data)
    

    This assumes the table data you are after is spread across multiple pages.
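    The key move here (re-harvesting every input in the form on each iteration, so the server-issued __VIEWSTATE and __EVENTVALIDATION are always fresh) generalizes to other WebForms pages. A hypothetical standalone helper, shown with a stubbed page rather than a live request:

```python
from bs4 import BeautifulSoup

def collect_form_fields(html, form_selector="#aspnetForm"):
    """Harvest every named <input> in the form into a dict, carrying the
    server-issued hidden fields (e.g. __VIEWSTATE, __EVENTVALIDATION)
    into the next POST. Illustrative helper, not part of the answer."""
    soup = BeautifulSoup(html, "html.parser")
    fields = {}
    for item in soup.select(form_selector + " input"):
        name = item.get("name")
        if name:
            fields[name] = item.get("value", "")
    return fields

# Stubbed page standing in for a real response:
html = '''<form id="aspnetForm">
  <input name="__VIEWSTATE" value="abc"/>
  <input name="__EVENTVALIDATION" value="xyz"/>
</form>'''
print(collect_form_fields(html))
# → {'__VIEWSTATE': 'abc', '__EVENTVALIDATION': 'xyz'}
```

    The returned dict can then be updated with the page-number field before POSTing, exactly as the loop above does.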

    【Discussion】:

    • OMG! OMG! OMG! Thank you! Thank you! Thank you! I have probably read every article there is on aspx with Python, and I saw something similar in my reading, but I did not think it was the right way to solve my problem. I will have to study how it works, because it is so simple (compared to my approach), yet it works perfectly and is extremely fast. Thanks again! Needless to say, I picked this as the answer.
    【Solution 2】:

    You need to set your payload before making the request:

    import requests
    from bs4 import BeautifulSoup
    import time
    import collections
    import json
    
    def SPAIN_STK_LIST(numpage):
        payload = collections.OrderedDict()
        header = {'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
              'Accept-Encoding' : 'gzip, deflate',
              'Accept-language' : 'en-US,en;q=0.9',
              'Cache-Control' : 'max-age=0',
              'Connection' : 'keep-alive',
              'Content-Type': 'text/html; charset=utf-8',
              'Host' : 'www.bolsamadrid.es',
              'Origin' : 'null',
              'Upgrade-Insecure-Requests' : '1',
              'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebkit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36' 
              }
    
        ses = requests.session()
    
        viewstate = ""
        eventval = ""
    
        for i in range(0, numpage):
    
            if(i == 0):
                req = ses.get("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx", headers = header)
    
                page = req.text
                soup = BeautifulSoup(page, "lxml")
                # find __VIEWSTATE and __EVENTVALIDATION for the next page
                viewstate = soup.select("#__VIEWSTATE")[0]['value']
            #        print("VIEWSTATE: ", viewstate)
                eventval = soup.select("#__EVENTVALIDATION")[0]['value']
    
            else:
    
                header = {'Accept' : 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
                      'Accept-Encoding' : 'gzip, deflate',
                      'Accept-language' : 'en-US,en;q=0.9',
                      'Cache-Control' : 'max-age=0',
                      'Connection' : 'keep-alive',
                      'Content-Type': 'application/x-www-form-urlencoded',
                      'Host' : 'www.bolsamadrid.es',
                      'Origin' : 'null',
                      'Upgrade-Insecure-Requests' : '1',
                      'User-Agent' : 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebkit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36' 
                      }
                target = "ct100$Contenido$GoPag{:=>2}"
                payload = collections.OrderedDict()
                payload['__EVENTTARGET'] = "ctl00$Contenido$SiguientesArr"
                #payload['__EVENTTARGET'] = "GoPag"
                #payload['__EVENTTARGET'] = "ct100$Contenido$GoPag"
                #payload['__EVENTTARGET'] = target.format(i + 1)
                payload['__EVENTARGUMENT'] = ""
                payload['__VIEWSTATE'] = viewstate
                payload['__VIEWSTATEGENERATOR'] = "65A1DED9"
                payload['__EVENTVALIDATION'] = eventval
                # payload['ct100$Contenido$GoPag'] = i + 1
                payload['ct100$Contenido$GoPag'] = ""
    
                req = ses.post("http://www.bolsamadrid.es/ing/aspx/Empresas/Empresas.aspx", headers = header, data = payload)
    
                page = req.text
                soup = BeautifulSoup(page, "lxml")
                # find __VIEWSTATE and __EVENTVALIDATION for the next page
                viewstate = soup.select("#__VIEWSTATE")[0]['value']
            #        print("VIEWSTATE: ", viewstate)
                eventval = soup.select("#__EVENTVALIDATION")[0]['value']
    
        #        print(req.request.body)
        #        print(req.request.headers)
        #        print(req.request.url)
    
        #        print("EVENTVALIDATION: ", eventval)
    
                table = soup.find("table", {"id" : "ctl00_Contenido_tblEmisoras"})
                for row in table.findAll("tr")[1:]:
                    cells = row.findAll("td")
                    print( cells[0].find("a").get_text().replace(",","").replace("S.A.", "").encode('utf-8') )
                time.sleep(1)
    
    
    SPAIN_STK_LIST(6)
    

    【Discussion】:
