【Question Title】: Scraping .aspx site after click
【Posted】: 2019-08-21 00:36:55
【Question】:

I'm trying to scrape the scheduling data for my squadron from: https://www.cnatra.navy.mil/scheds/schedule_data.aspx?sq=vt-9

I've already figured out how to extract the data with BeautifulSoup:

import urllib2  # Python 2 only; on Python 3 use urllib.request instead
import bs4 as bs

url = 'https://www.cnatra.navy.mil/scheds/schedule_data.aspx?sq=vt-9'
html = urllib2.urlopen(url).read()
soup = bs.BeautifulSoup(html, 'lxml')
table = soup.find('table')
print(table.text)

However, the table is hidden behind selecting a date (if it isn't today's) and pressing the "View Schedule" button.

How can I modify my code to "press" the "View Schedule" button so I can then scrape the data? Bonus points if the code can also select a date!

I tried:

from selenium import webdriver

driver = webdriver.Chrome("/users/base/Downloads/chromedriver")
driver.get("https://www.cnatra.navy.mil/scheds/schedule_data.aspx?sq=vt-9")
button = driver.find_element_by_id('btnViewSched')
button.click()

This successfully opens Chrome and "clicks" the button, but since the address doesn't change, I can't scrape anything from it.

【Question Discussion】:

    Tags: python asp.net selenium beautifulsoup screen-scraping


    【Solution 1】:

    You can get the schedule with pure Selenium:

    from selenium import webdriver
    
    driver = webdriver.Chrome('chromedriver.exe')
    driver.get("https://www.cnatra.navy.mil/scheds/schedule_data.aspx?sq=vt-9")
    button = driver.find_element_by_id('btnViewSched')  # Selenium 3 API; Selenium 4 uses driver.find_element(By.ID, 'btnViewSched')
    button.click()
    print(driver.find_element_by_id('dgEvents').text)
    

    Output:

    TYPE VT Brief EDT RTB Instructor Student Event Hrs Remarks Location
    Flight VT-9 07:45 09:45 11:15 JARVIS, GRANT M [LT] LENNOX, KEVIN I [ENS] BI4101 1.5 2 HR BRIEF MASS BRIEF  
    Flight VT-9 07:45 09:45 11:15 MOYNAHAN, WILLIAM P [CDR] FINNERAN, MATTHEW P [1stLt] BI4101 1.5 2 HR BRIEF MASS BRIEF  
    Flight VT-9 07:45 12:15 13:45 JARVIS, GRANT M [LT] TAYLOR, ADAM R [1stLt] BI4101 1.5 2 HR BRIEF MASS BRIEF @ 0745 W/ JARVIS MEI OPS  
    Flight VT-9 07:45 12:15 13:45 MOYNAHAN, WILLIAM P [CDR] LOW, TRENTON G [ENS] BI4101 1.5 2 HR BRIEF MASS BRIEF @ 0745 W/ MOYNAHAN MEI OPS  
    Watch VT-9   00:00 14:00 ANDERSON, LAURA [LT]   ODO (ON CALL) 14.0    
    Watch VT-9   00:00 14:00 ANDERSON, LAURA [LT]   ODO (ON CALL) 14.0    
    Watch VT-9   00:00 23:59 ANDERSON, LAURA [LT]   ODO (ON CALL) 24.0    
    Watch VT-9   00:00 23:59 ANDERSON, LAURA [LT]   ODO (ON CALL) 24.0    
    Watch VT-9   07:00 19:00   STUY, JOHN [LTJG] DAY IWO 12.0    
    Watch VT-9   19:00 07:00   STRACHAN, ALLYSON [LTJG] IWO 12.0    
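
    If you'd rather keep using BeautifulSoup for the parsing, the URL not changing after the click doesn't matter: the post-click HTML is available as `driver.page_source`. A minimal sketch of that hand-off (a stub HTML string stands in for `driver.page_source` here so the snippet runs without a browser):

```python
from bs4 import BeautifulSoup

# In the real script this would be:  html = driver.page_source
# Stub standing in for the post-click page so the sketch runs without a browser:
html = ('<table id="dgEvents">'
        '<tr><td>TYPE</td><td>VT</td></tr>'
        '<tr><td>Flight</td><td>VT-9</td></tr>'
        '</table>')

soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', id='dgEvents')
# Each row becomes a list of stripped cell texts
for row in table.find_all('tr'):
    print([td.get_text(strip=True) for td in row.find_all('td')])
```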
    

    【Discussion】:

    • @exos What is a "fake chrome"?
    【Solution 2】:

    As I read your question, you need to use Selenium to scrape an .aspx page that requires input.

    Reading this article will help you: scrape data for .aspx page with selenium

    【Discussion】:

      【Solution 3】:

      When you click "View Schedule", the request goes to the same URL but with the form data btnViewSched=View Schedule plus the ASP.NET state tokens. Here is code that collects the table data as a list of dicts:

      import requests
      from bs4 import BeautifulSoup
      
      headers = {
          'Connection': 'keep-alive',
          'Cache-Control': 'max-age=0',
          'Upgrade-Insecure-Requests': '1',
          'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) '
                        'Chrome/73.0.3683.86 Safari/537.36',
          'DNT': '1',
          'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,'
                    'application/signed-exchange;v=b3',
          'Accept-Encoding': 'gzip, deflate, br',
          'Accept-Language': 'ru,en-US;q=0.9,en;q=0.8,tr;q=0.7',
      }
      response = requests.get('https://www.cnatra.navy.mil/scheds/schedule_data.aspx?sq=vt-9', headers=headers)
      assert response.ok
      
      page = BeautifulSoup(response.text, "lxml")
      # get __VIEWSTATE, __EVENTVALIDATION and __VIEWSTATEGENERATOR for further requests
      __VIEWSTATE = page.find("input", attrs={"id": "__VIEWSTATE"}).attrs["value"]
      __EVENTVALIDATION = page.find("input", attrs={"id": "__EVENTVALIDATION"}).attrs["value"]
      __VIEWSTATEGENERATOR = page.find("input", attrs={"id": "__VIEWSTATEGENERATOR"}).attrs["value"]
      
      # View Schedule click set here
      data = {
        '__EVENTTARGET': '',
        '__EVENTARGUMENT': '',
        '__VIEWSTATE': __VIEWSTATE,
        '__VIEWSTATEGENERATOR': __VIEWSTATEGENERATOR,
        '__EVENTVALIDATION': __EVENTVALIDATION,
        'btnViewSched': 'View Schedule',
        'txtNameSearch': ''
      }
      # request with params
      response = requests.post('https://www.cnatra.navy.mil/scheds/schedule_data.aspx?sq=vt-9', headers=headers, data=data)
      assert response.ok
      
      page = BeautifulSoup(response.text, "lxml")
      # get table headers to map as a keys in result
      table_headers = [td.text.strip() for td in page.select("#dgEvents tr:first-child td")]
      # get all rows, without table headers
      table_rows = page.select("#dgEvents tr:not(:first-child)")
      
      result = []
      for row in table_rows:
          table_columns = row.find_all("td")
      
          # use map with results for row and add all columns as map (key:value)
          row_result = {}
          for i in range(0, len(table_headers)):
              row_result[table_headers[i]] = table_columns[i].text.strip()
      
          # add row_result to result list
          result.append(row_result)
      
      for r in result:
          print(r)
      
      print("the end")
      

      Sample output:

      {'TYPE': 'Flight', 'VT': 'VT-9', 'Brief': '07:45', 'EDT': '09:45', 'RTB': '11:15', 'Instructor': 'JARVIS, GRANT M [LT]', 'Student': 'LENNOX, KEVIN I [ENS]', 'Event': 'BI4101', 'Hrs': '1.5', 'Remarks': '2 HR BRIEF MASS BRIEF', 'Location': ''}
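
      For the bonus ask (selecting a date), the same postback should carry the schedule-date field along with the button. The field name `txtSchedDate` below is hypothetical (the payload in this answer doesn't include it); inspect the form in the browser's dev tools to find the date input's real `name` attribute. A sketch of extending the payload:

```python
def build_payload(viewstate, viewstategenerator, eventvalidation, date_str):
    """Build the ASP.NET postback payload for the schedule page.

    'txtSchedDate' is a hypothetical field name -- check the date input's
    actual name attribute in the browser's dev tools before relying on it.
    """
    return {
        '__EVENTTARGET': '',
        '__EVENTARGUMENT': '',
        '__VIEWSTATE': viewstate,
        '__VIEWSTATEGENERATOR': viewstategenerator,
        '__EVENTVALIDATION': eventvalidation,
        'btnViewSched': 'View Schedule',
        'txtNameSearch': '',
        'txtSchedDate': date_str,  # hypothetical name; verify in dev tools
    }

# Tokens would come from the GET response as in the code above
payload = build_payload('<vs>', '<vsg>', '<ev>', '08/21/2019')
print(payload['txtSchedDate'])
```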

      【Discussion】:
