Scraping & parsing multiple HTML tables with the same URL (beautifulsoup & selenium): answers

【Question title】: Scraping & parsing multiple HTML tables with the same URL (beautifulsoup & selenium)
【Posted】: 2018-11-12 20:25:12
【Question description】:

I am trying to scrape the complete HTML table from this site: https://www.iscc-system.org/certificates/all-certificates/

My code is as follows:

from selenium import webdriver
import time
import pandas as pd

url = 'https://www.iscc-system.org/certificates/all-certificates/'
browser = webdriver.Chrome('/home/giuseppe/bin/chromedriver')
browser.get(url)

csvfile = open('Scrape_certificates', 'a')     

dfs = pd.read_html('https://www.iscc-system.org/certificates/all-certificates/', header=0)

for i in range(1,10):
    for df in dfs:
        df.to_csv(csvfile, header=False)
    link_next_page = browser.find_element_by_id('table_1_next')
    link_next_page.click()
    time.sleep(4)
    dfs = pd.read_html(browser.current_url)

csvfile.close()

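The code above only takes the first 10 pages of the full table as an example. The problem is that the output is always the same first table, repeated 10 times. Clicking the "next" button does update the table (at least as far as I can see in the browser), but I cannot get the new data from the following pages. I always get the same data from the first table.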

【Question discussion】:

Tags: python-3.x web-scraping html-table beautifulsoup selenium-chromedriver

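【Solution 1】:

First, you are passing the URL to pandas rather than the page source. That makes pandas fetch a fresh copy of the page instead of reading the source that Selenium has rendered, so you get the first page every time. Second, you want to restrict the read to the table with id="table_1". Try this: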

from selenium import webdriver
import time
import pandas as pd

url = 'https://www.iscc-system.org/certificates/all-certificates/'
browser = webdriver.Chrome('/home/giuseppe/bin/chromedriver')
browser.get(url)

csvfile = open('Scrape_certificates', 'a')

for i in range(1,10):
    # Parse the HTML that Selenium currently holds, keeping only the table with id="table_1"
    dfs = pd.read_html(browser.page_source, attrs = {'id': 'table_1'})
    for df in dfs:
        df.to_csv(csvfile, header=False)
    # Click "next", then give the table time to reload before the next read
    link_next_page = browser.find_element_by_id('table_1_next')
    link_next_page.click()
    time.sleep(4)

csvfile.close()
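
The effect of the attrs filter can be checked without a browser. The sketch below is only an illustration: the HTML string is an invented stand-in for browser.page_source, and the helper name read_table_by_id is hypothetical. It wraps the string in StringIO because recent pandas releases warn when a literal HTML string is passed to read_html, and it assumes an HTML parser that pandas can use (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Invented sample markup standing in for browser.page_source:
# two tables, only one of which has id="table_1".
PAGE_SOURCE = """
<html><body>
<table id="other">
  <thead><tr><th>x</th></tr></thead>
  <tbody><tr><td>0</td></tr></tbody>
</table>
<table id="table_1">
  <thead><tr><th>Certificate</th><th>Holder</th></tr></thead>
  <tbody>
    <tr><td>C-001</td><td>Alpha</td></tr>
    <tr><td>C-002</td><td>Beta</td></tr>
  </tbody>
</table>
</body></html>
"""


def read_table_by_id(page_source, table_id):
    """Parse only the <table> whose id attribute equals table_id."""
    # read_html returns a list of DataFrames, one per matching table.
    tables = pd.read_html(StringIO(page_source), attrs={"id": table_id})
    return tables[0]


df = read_table_by_id(PAGE_SOURCE, "table_1")
print(df)
```

Because the string is parsed locally, nothing is re-downloaded. That is the same reason reading browser.page_source reflects the page Selenium is currently showing.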

You will need to remove or filter out row 10 from each result, since it is the navigation row.
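
One possible way to do that filtering is sketched below. It assumes that "row 10" means the row with index label 10, that is, the row following ten data rows labelled 0 to 9. The helper name drop_navigation_row and the sample data are made up for illustration.

```python
import pandas as pd


def drop_navigation_row(df, nav_index=10):
    """Return a copy of df without the row labelled nav_index, if it exists."""
    # errors="ignore" keeps this safe for pages that have no such row.
    return df.drop(index=nav_index, errors="ignore").reset_index(drop=True)


# Made-up page: ten data rows (labels 0-9) plus a navigation row at label 10.
rows = [{"Certificate": f"C-{i:03d}"} for i in range(10)]
rows.append({"Certificate": "Previous 1 2 3 Next"})
page = pd.DataFrame(rows)

clean = drop_navigation_row(page)
print(len(page), len(clean))
```

In the loop above this would be applied to each df before df.to_csv is called. If the navigation row turns out to sit at a different position, the nav_index argument would need adjusting.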

【Discussion】: