【Question Title】: I cannot scrape a table from a website with usual web scraping tools
【Posted】: 2021-12-20 05:49:00
【Question】:

I am trying to scrape a table from a website with Python, but for some reason every method I know has failed. There is a table at https://www.nbc4i.com/news/state-news/535-new-cases-of-covid-19-reported-in-ohio-schools-in-past-week/ spanning 45 pages. I tried scraping it with: requests, requests-html (rendering the page), BeautifulSoup, and selenium. Here is one of my attempts; I won't paste everything I tried, since the approaches are similar and just use different Python libraries:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
page = session.get('https://www.nbc4i.com/news/state-news/535-new-cases-of-covid-19-reported-in-ohio-schools-in-past-week/')
page.html.render(timeout=120)
soup = BeautifulSoup(page.html.html, 'lxml')  # also tried page.text, 'html.parser', and all permutations
table = soup.find_all(id='table')

My table variable here ends up as an empty list, which it shouldn't be. I also tried selenium to locate the table or any other web element inside it, both by XPath and by class, but all of those failed to find the table or any part of it. I have scraped many similar sites with these methods and never had a problem before. Any ideas?

【Comments】:

  • The table is in an iframe. You have to switch to the iframe before you can query the table's contents.
  • Yes, I can see it now, thanks. This is me learning HTML through web scraping, and I'm missing some of the basics...

标签: python selenium web-scraping beautifulsoup python-requests-html


【Solution 1】:

You can see that the results table sits inside an iframe. You can extract the information directly from the iframe's source:

https://flo.uri.sh/visualisation/3894531/embed?auto=1
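If you want to locate that embed URL programmatically instead of inspecting the page by hand, one option is to search the article's HTML for an iframe `src`. This is a minimal stdlib sketch over a hypothetical markup snippet; on the live page the embed may only be built by JavaScript, in which case you would need the rendered DOM first:

```python
import re

# Hypothetical fragment of the article markup; the real page may only
# contain this iframe after JavaScript has run.
html = ('<div class="ns-block-custom-html"><div>'
        '<iframe src="https://flo.uri.sh/visualisation/3894531/embed?auto=1">'
        '</iframe></div></div>')

match = re.search(r'<iframe[^>]*\bsrc="([^"]+)"', html)
if match:
    print(match.group(1))  # the embed URL to scrape directly
```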

The code below should save the results to a .csv file:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

def get_rows(driver):
    """
    Return the rows from the currently displayed page of the table.

    Returns:
        dict mapping column name to a list of cell values
    """
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//div[@class='tr body-row']")))
    rows = driver.find_elements(By.XPATH, "//div[@class='tr body-row']")
    table_info= {
        'Rank': [],
        'County':[],
        'School/District':[],
        'Type':[],
        'Total cases':[],
        'Student cases':[],
        'Staff cases':[]
    }
    
    for row in rows:
        cols = row.find_elements(By.CLASS_NAME, 'td')
        for col, key in enumerate(table_info):
            table_info[key].append(cols[col].text)

    return table_info

# path to chrome driver
driver = webdriver.Chrome(r"D:\chromedriver\94\chromedriver.exe")

driver.get("https://flo.uri.sh/visualisation/3894531/embed?auto=1")


df = pd.DataFrame.from_dict(get_rows(driver))

for _ in range(44):  # 45 pages in total; the first page was read above
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//button[@class="pagination-btn next"]'))).click()
    df = pd.concat([df, pd.DataFrame.from_dict(get_rows(driver))])

print(df)
df.to_csv('COVID-19_cases_reported_in_Ohio_schools.csv', index=False)
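A side note on the pagination loop above: calling pd.concat inside the loop re-copies the accumulated frame on every page. Collecting the per-page dicts in a list and concatenating once at the end does the same job in a single pass. A small sketch with dummy rows (not the scraped data):

```python
import pandas as pd

# Dummy per-page results standing in for get_rows(driver) output.
pages = [
    {'Rank': ['1'], 'County': ['Delaware']},
    {'Rank': ['2'], 'County': ['Franklin']},
]

# One concat at the end instead of one per iteration.
frames = [pd.DataFrame.from_dict(page) for page in pages]
df = pd.concat(frames, ignore_index=True)
print(len(df))  # prints 2
```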

【Comments】:

    【Solution 2】:

    The table content is inside an iframe, so you need to switch into the iframe first. See the API docs.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager
    
    url = 'https://www.nbc4i.com/news/state-news/535-new-cases-of-covid-19-reported-in-ohio-schools-in-past-week/'
    s = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=s)
    try:
        driver.get(url)
        driver.implicitly_wait(5)
        driver.switch_to.frame(driver.find_element(By.XPATH,
                '//div[@class="ns-block-custom-html"]/div/iframe'))
        # table content is now in the driver context
        while True:
            table = driver.find_element(By.ID, "table")
            for elt in table.find_elements(By.CLASS_NAME, "body-row"):
                items = [td.text for td in elt.find_elements(By.CLASS_NAME, "td")]
                # add code to append each of row of data to CSV file, database, etc.
                print(items)
            next_btn = driver.find_element(By.CLASS_NAME, 'next')        
            if 'disabled' in next_btn.get_attribute('class'):
                # no more > done with pagination
                break
            next_btn.click() # click next button for next set of items
    finally:
        driver.quit()
    

    Output:

    ['1', 'Delaware', 'Olentangy Local', 'Public District', '38', '31', '7']
    ...
    ['446', 'Muskingum', 'West Muskingum Local', 'Public District', '1', '1', '0']
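The loop above leaves the CSV step as a comment. One minimal way to fill it in with only the standard library (hypothetical file name) is to collect each `items` list and write them all at the end:

```python
import csv

header = ['Rank', 'County', 'School/District', 'Type',
          'Total cases', 'Student cases', 'Staff cases']
# Rows as produced by the scraping loop; two sample rows shown here.
rows = [
    ['1', 'Delaware', 'Olentangy Local', 'Public District', '38', '31', '7'],
    ['446', 'Muskingum', 'West Muskingum Local', 'Public District', '1', '1', '0'],
]

# newline='' prevents blank lines between rows on Windows.
with open('ohio_school_cases.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)
```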
    

    【Comments】:
