Scrape table data from .jsp page using Selenium

【Question title】: Scrape table data from .jsp page using Selenium
【Posted】: 2020-01-23 02:54:02
【Question】:

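I am trying to scrape a table from a .jsp page (details below). The table loads only after the input data (train number and journey station) has been entered.

For your trial, the train number can be 56913 and the journey station can be SBC (it changes automatically to "KSR Bengaluru" once entered).

With the script below I can get the table to render, but I cannot extract it (the result prints as an empty list). I need the complete table. Can anyone help me work out how to extract it?

I am very new to web scraping, so if I have made a basic mistake, please nudge me in the right direction.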

import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.firefox.options import Options
from selenium.webdriver import Firefox
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

from bs4 import BeautifulSoup
import soupsieve as sv
import requests
# Activate the following line if you do not want to see the Firefox window.
# Better deactivate it for debugging.
# os.environ['MOZ_HEADLESS'] = '1'

url = 'https://enquiry.indianrail.gov.in/ntes/trainOnMapBh.jsp'

opts = Options()
driver = Firefox(firefox_binary=r"C:\Program Files (x86)\Mozilla Firefox\firefox.exe", options=opts)
driver.get(url)
WebDriverWait(driver, 20)

train_field = driver.find_element_by_id("trnSrchTxt")
train_field.send_keys("56913")
time.sleep(2)
actions = ActionChains(driver)
actions.send_keys('SBC',Keys.ENTER)
actions.perform()

WebDriverWait(driver, 1)
result_table = driver.find_elements_by_id("mapTrnSch")
print(result_table)

Update: in addition to @MadRay's answer, the code below also fetches the data (I am not sure how robust it is).

import os
import time
from bs4 import BeautifulSoup
from selenium.webdriver.support.ui import WebDriverWait
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver import Firefox
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
import re

os.environ['MOZ_HEADLESS'] = '1'
opts = Options()
driver = Firefox(firefox_binary=r"C:\Program Files (x86)\Mozilla Firefox\firefox.exe", options=opts)
driver.get('https://enquiry.indianrail.gov.in/ntes/trainOnMapBh.jsp')
WebDriverWait(driver, 20)

train_field = driver.find_element_by_id("trnSrchTxt")
train_field.send_keys("11302")
time.sleep(2)
actions = ActionChains(driver)
actions.send_keys('SBC',Keys.ENTER)
actions.perform()
time.sleep(2)
res = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()

soup = BeautifulSoup(res, 'lxml')
table_rows =soup.find_all('table')[3].find_all('tr')
rows=[]
for tr in table_rows:
    td = tr.find_all('td')
    rows.append([i.text for i in td])
delaydata = rows[3:]
import pandas as pd
df = pd.DataFrame(delaydata, columns = ['StopNo','Station',1,'SchArr','SchDep','ETA_ATA','Arr_Delay','ETD_ATD','DepDelay','Distance','PF'])
df
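
One fragile spot in the update above is the fixed slicing (`find_all('table')[3]` and `rows[3:]`): if the page layout shifts, or a row has a different number of cells, the `pd.DataFrame(..., columns=[...])` call will raise. Below is a minimal sketch of a more defensive filter. It assumes every data row has exactly as many cells as the column list (11 in the code above). `clean_rows` is a hypothetical helper name, not part of any library, and the sample values are made up.

```python
def clean_rows(rows, n_cols):
    """Keep only rows with exactly n_cols cells, stripping whitespace from each cell."""
    cleaned = []
    for row in rows:
        cells = [cell.strip() for cell in row]
        # Header, spacer, and malformed rows have a different cell count; skip them.
        if len(cells) == n_cols:
            cleaned.append(cells)
    return cleaned


# Example with made-up cell values (not real data from the site).
sample = [
    [],                          # e.g. a <tr> that contained only <th> cells
    ["Stop", "Station"],         # a short, header-like row
    [" 1 ", "AAA\n", "10:00"],   # a well-formed 3-column row
]
print(clean_rows(sample, n_cols=3))
```

If the three rows skipped by `rows[3:]` do have a different cell count than the data rows, `delaydata = clean_rows(rows, 11)` could stand in for that slice; that is an assumption about the page and should be checked against the real HTML.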

【Comments on the question】:

Tags: python selenium selenium-webdriver web-scraping

【Solution 1】:

You have to search for the results by class name, not by id:

results = driver.find_elements_by_class_name("mapTrnSch")

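The rest of your code works fine.

Important note: you will get two results. The first is the table header, the second is the table content.

Here is an example I wrote without WebDriverWait and ActionChains: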

import time

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

url = 'https://enquiry.indianrail.gov.in/ntes/trainOnMapBh.jsp'

driver = webdriver.Firefox(firefox_binary=r"C:\Program Files (x86)\Mozilla Firefox\firefox.exe")
driver.get(url)
time.sleep(5)

# Send search data
driver.find_element_by_id("trnSrchTxt").send_keys("56913")  # Train
time.sleep(5)
driver.find_element_by_id("jrnyStn").send_keys('SBC')  # Journey
time.sleep(5)
driver.find_element_by_id("searchTrainInMapBtn").click()  # Submit button (seems like we do not need to click on it, but let's click for sure)
time.sleep(5)

# Get results
results = driver.find_elements_by_class_name("mapTrnSch")
print(results[0].text)  # 1st result: table headers
print(results[1].text)  # 2nd result: table content
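
The fixed `time.sleep(5)` calls can be replaced by polling until the results are actually there. Note that constructing `WebDriverWait(driver, 20)` on its own, as in the question, does not pause anything; the waiting only happens when `.until(...)` is called with a condition. Below is a minimal, library-independent sketch of the same polling idea. `wait_for` is a hypothetical helper, and the condition shown in the comments (at least two `mapTrnSch` elements present) is an assumption about the page, not something verified here.

```python
import time


def wait_for(condition, timeout=20.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        value = condition()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Intended use with the driver from the example above (not executed here):
#
#   def two_tables():
#       found = driver.find_elements_by_class_name("mapTrnSch")
#       return found if len(found) >= 2 else None
#
#   results = wait_for(two_tables)
#
# Selenium's own helper expresses the same wait as
#   WebDriverWait(driver, 20).until(lambda d: two_tables())
```

Polling returns as soon as the table appears and fails loudly if it never does, whereas a fixed sleep either wastes time or is too short on a slow connection.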

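【Comments】:

Thanks a lot. I got the result I expected from your answer. I have also worked out another way to fetch the data and convert it into a DataFrame. If you have time, could you take a look at that code and let me know whether it has any pitfalls (for example, if I want to fetch data for more trains in a loop by changing the train number)? I have updated the code in the question.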
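
On the looping question, two things in the updated code stand out: `driver.quit()` is called after a single train, so reusing one browser means moving it after the loop, and the train number field would probably need `clear()` before the next number is typed. Below is a rough sketch of the loop itself, written so that one failing train does not abort the rest. `scrape_trains` and `fetch_rows` are hypothetical names; `fetch_rows` stands for whatever function wraps the Selenium steps for one train, and the stand-in fetcher returns made-up data so the sketch runs without a browser.

```python
def scrape_trains(train_numbers, fetch_rows):
    """Fetch rows for each train; collect failures instead of aborting the loop.

    fetch_rows is a hypothetical callable: train number -> list of rows.
    """
    results, errors = {}, {}
    for number in train_numbers:
        try:
            rows = fetch_rows(number)
        except Exception as exc:  # one bad train should not stop the rest
            errors[number] = repr(exc)
            continue
        if rows:
            results[number] = rows
        else:
            errors[number] = "no rows found"
    return results, errors


# Stand-in fetcher with made-up data, used only to exercise the loop.
def fake_fetch(number):
    if number == "00000":
        raise ValueError("unknown train")
    return [[number, "AAA", "10:00"]]


results, errors = scrape_trains(["56913", "00000"], fake_fetch)
print(results)
print(errors)
```

Keeping the errors in a separate dictionary makes it easy to retry only the trains that failed.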