【Question Title】: I cannot scrape a table from a website with usual web scraping tools
【Posted】: 2021-12-20 05:49:00
【Question】:

I am trying to scrape a table from a website with Python, but for some reason every method I know has failed. There is a table at https://www.nbc4i.com/news/state-news/535-new-cases-of-covid-19-reported-in-ohio-schools-in-past-week/ spanning 45 pages. I tried scraping it with: requests, requests-html (rendering the page), BeautifulSoup, and selenium. Here is one of my attempts; I won't paste everything I tried, since the approaches are similar and just use different Python libraries:

from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
page = session.get('https://www.nbc4i.com/news/state-news/535-new-cases-of-covid-19-reported-in-ohio-schools-in-past-week/')
page.html.render(timeout=120)
soup = BeautifulSoup(page.html.html, 'lxml')  # also tried page.text, 'html.parser', and all permutations
table = soup.find_all(id='table')

My table variable here ends up as an empty list, which it shouldn't be. I also tried selenium to locate the table or any other web element inside it, both by XPath and by class, but all of those failed to find the table or any part of it. I have scraped many similar sites with these methods and never had a problem before. Any ideas?

【Comments】:

  • The table is in an iframe. You have to switch to the iframe before you can query the table's contents.
  • Yes, I can see it now, thanks. This is me learning HTML through web scraping, and I'm missing some of the basics...

标签: python selenium web-scraping beautifulsoup python-requests-html


【Solution 1】:

You can see that the results table sits inside an iframe. You can extract the information directly from the iframe's source:

https://flo.uri.sh/visualisation/3894531/embed?auto=1
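If you want to locate that embed URL programmatically instead of inspecting the page by hand, one option is to search the article's HTML for an iframe `src`. This is a minimal stdlib sketch over a hypothetical markup snippet; on the live page the embed may only be built by JavaScript, in which case you would need the rendered DOM first:

```python
import re

# Hypothetical fragment of the article markup; the real page may only
# contain this iframe after JavaScript has run.
html = ('<div class="ns-block-custom-html"><div>'
        '<iframe src="https://flo.uri.sh/visualisation/3894531/embed?auto=1">'
        '</iframe></div></div>')

match = re.search(r'<iframe[^>]*\bsrc="([^"]+)"', html)
if match:
    print(match.group(1))  # the embed URL to scrape directly
```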

The code below should save the results to a .csv file:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd

def get_rows(driver):
    """
    Return the rows from the currently displayed page of the table.

    Returns:
        dict mapping column name to a list of cell values
    """
    WebDriverWait(driver, 20).until(EC.presence_of_element_located((By.XPATH, "//div[@class='tr body-row']")))
    rows = driver.find_elements(By.XPATH, "//div[@class='tr body-row']")
    table_info= {
        'Rank': [],
        'County':[],
        'School/District':[],
        'Type':[],
        'Total cases':[],
        'Student cases':[],
        'Staff cases':[]
    }
    
    for row in rows:
        cols = row.find_elements(By.CLASS_NAME, 'td')
        for col, key in enumerate(table_info):
            table_info[key].append(cols[col].text)

    return table_info

# path to chrome driver
driver = webdriver.Chrome(r"D:\chromedriver\94\chromedriver.exe")

driver.get("https://flo.uri.sh/visualisation/3894531/embed?auto=1")


df = pd.DataFrame.from_dict(get_rows(driver))

for _ in range(44):  # 45 pages in total; the first page was read above
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//button[@class="pagination-btn next"]'))).click()
    df = pd.concat([df, pd.DataFrame.from_dict(get_rows(driver))])

print(df)
df.to_csv('COVID-19_cases_reported_in_Ohio_schools.csv', index=False)
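A side note on the pagination loop above: calling pd.concat inside the loop re-copies the accumulated frame on every page. Collecting the per-page dicts in a list and concatenating once at the end does the same job in a single pass. A small sketch with dummy rows (not the scraped data):

```python
import pandas as pd

# Dummy per-page results standing in for get_rows(driver) output.
pages = [
    {'Rank': ['1'], 'County': ['Delaware']},
    {'Rank': ['2'], 'County': ['Franklin']},
]

# One concat at the end instead of one per iteration.
frames = [pd.DataFrame.from_dict(page) for page in pages]
df = pd.concat(frames, ignore_index=True)
print(len(df))  # prints 2
```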

【Comments】:

    【Solution 2】:

    The table content is inside an iframe, so you need to switch into the iframe first. See the API docs.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager
    
    url = 'https://www.nbc4i.com/news/state-news/535-new-cases-of-covid-19-reported-in-ohio-schools-in-past-week/'
    s = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=s)
    try:
        driver.get(url)
        driver.implicitly_wait(5)
        driver.switch_to.frame(driver.find_element(By.XPATH,
                '//div[@class="ns-block-custom-html"]/div/iframe'))
        # table content is now in the driver context
        while True:
            table = driver.find_element(By.ID, "table")
            for elt in table.find_elements(By.CLASS_NAME, "body-row"):
                items = [td.text for td in elt.find_elements(By.CLASS_NAME, "td")]
                # add code to append each of row of data to CSV file, database, etc.
                print(items)
            next_btn = driver.find_element(By.CLASS_NAME, 'next')        
            if 'disabled' in next_btn.get_attribute('class'):
                # no more > done with pagination
                break
            next_btn.click() # click next button for next set of items
    finally:
        driver.quit()
    

    Output:

    ['1', 'Delaware', 'Olentangy Local', 'Public District', '38', '31', '7']
    ...
    ['446', 'Muskingum', 'West Muskingum Local', 'Public District', '1', '1', '0']
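The loop above leaves the CSV step as a comment. One minimal way to fill it in with only the standard library (hypothetical file name) is to collect each `items` list and write them all at the end:

```python
import csv

header = ['Rank', 'County', 'School/District', 'Type',
          'Total cases', 'Student cases', 'Staff cases']
# Rows as produced by the scraping loop; two sample rows shown here.
rows = [
    ['1', 'Delaware', 'Olentangy Local', 'Public District', '38', '31', '7'],
    ['446', 'Muskingum', 'West Muskingum Local', 'Public District', '1', '1', '0'],
]

# newline='' prevents blank lines between rows on Windows.
with open('ohio_school_cases.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)
```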
    

    【Comments】:
