【Title】: Looping through URLs when webscraping tables in Selenium?
【Posted】: 2021-02-25 05:55:43
【Question】:

I am trying to scrape tables from the Humane Society Legislative Fund. The following code successfully pulls the data from one of the pages:

import time
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

browser = webdriver.Chrome(ChromeDriverManager().install())


browser.get('https://hslf.org/scorecards/2007-senate-midterm')
time.sleep(10)


html = browser.page_source

humane_sc_tables = pd.read_html(html)
humane_sc_data = humane_sc_tables[0]

I now need to loop through multiple URLs and export the results for each page to a CSV file.

import time
import pandas as pd
from selenium import webdriver
from selenium.common import exceptions
from webdriver_manager.chrome import ChromeDriverManager

# browser = webdriver.Chrome(ChromeDriverManager().install())

URL_list = ["https://hslf.org/scorecards/2007-senate-midterm",
            "https://hslf.org/scorecards/2008-senate-final",
            "https://hslf.org/scorecards/2008-house-final",
            "https://hslf.org/scorecards/2009-senate-midterm",
            "https://hslf.org/scorecards/2009-house-midterm",
            "https://hslf.org/scorecards/2010-house-final",
            "https://hslf.org/scorecards/2010-senate-final",
            "https://hslf.org/scorecards/2011-house-midterm",
            "https://hslf.org/scorecards/2011-senate-midterm",
            "https://hslf.org/scorecards/2012-house-final",
            "https://hslf.org/scorecards/2012-senate-final",
            "https://hslf.org/scorecards/2013-house-midterm",
            "https://hslf.org/scorecards/2013-senate-midterm",
            "https://hslf.org/scorecards/2014-house-final",
            "https://hslf.org/scorecards/2014-senate-final",
            "https://hslf.org/scorecards/2015-house-midterm",
            "https://hslf.org/scorecards/2015-senate-midterm",
            "https://hslf.org/scorecards/2016-house-final",
            "https://hslf.org/scorecards/2016-senate-final",
            "https://hslf.org/scorecards/2017-house-midterm",
            "https://hslf.org/scorecards/2017-senate-midterm",
            "https://hslf.org/scorecards/2018-house-final",
            "https://hslf.org/scorecards/2018-senate-final"]

for url in URL_list:
    browser = webdriver.Chrome(ChromeDriverManager().install())
    time.sleep(5)

    print("Current session is {}".format(browser.session_id))
    browser.quit()
    try:
        browser.get(url)
    except exceptions.InvalidSessionIdException as e:
        print(e.message)

    html = browser.page_source
    humane_sc_tables = pd.read_html(html)
    humane_sc_data = humane_sc_tables[0]
    humane_sc_data = humane_sc_data.drop(humane_sc_data.columns[[0,5,7]], axis = 1)
    browser.close()
    humane_sc_data.to_csv(f'humane_scores{url}.csv')

However, I get the following error:

MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=55494): Max retries exceeded with url: /session/7e430735b2d015147dc20049f3b78b10/url (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 61] Connection refused'))

Please advise.

【Question comments】:

  • A quick Google search of that error message returns plenty of results; have you worked through those already?
  • Why do you call browser.quit() before calling the .get() method? Are you sure that isn't the root cause?

Tags: python selenium-webdriver


【Solution 1】:

Your call to browser.quit() below:

print("Current session is {}".format(browser.session_id))
browser.quit()
try:
    browser.get(url)
except exceptions.InvalidSessionIdException as e:
    print(e.message)

quit() is a WebDriver command that calls the driver.dispose method, which in turn closes all browser windows and terminates the WebDriver session.

So you are quitting the browser instance before issuing the .get() request that would retrieve the content you want. Try moving that line to the end of the loop so that a new session is created on the next iteration.

【Discussion】:

    【Solution 2】:

    Got it working. See the code below:

    import time
    import pprint
    import pandas as pd
    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager
    
    '''
    
    Note:
    The link https://hslf.org/scorecards/2007-house-midterm doesn't work 
    and is therefore excluded
    
    '''
    URL_list = ["https://hslf.org/scorecards/2007-senate-midterm",
                "https://hslf.org/scorecards/2008-senate-final",
                "https://hslf.org/scorecards/2008-house-final",
                "https://hslf.org/scorecards/2009-senate-midterm",
                "https://hslf.org/scorecards/2009-house-midterm",
                "https://hslf.org/scorecards/2010-house-final",
                "https://hslf.org/scorecards/2010-senate-final",
                "https://hslf.org/scorecards/2011-house-midterm",
                "https://hslf.org/scorecards/2011-senate-midterm",
                "https://hslf.org/scorecards/2012-house-final",
                "https://hslf.org/scorecards/2012-senate-final",
                "https://hslf.org/scorecards/2013-house-midterm",
                "https://hslf.org/scorecards/2013-senate-midterm",
                "https://hslf.org/scorecards/2014-house-final",
                "https://hslf.org/scorecards/2014-senate-final",
                "https://hslf.org/scorecards/2015-house-midterm",
                "https://hslf.org/scorecards/2015-senate-midterm",
                "https://hslf.org/scorecards/2016-house-final",
                "https://hslf.org/scorecards/2016-senate-final",
                "https://hslf.org/scorecards/2017-house-midterm",
                "https://hslf.org/scorecards/2017-senate-midterm",
                "https://hslf.org/scorecards/2018-house-final",
                "https://hslf.org/scorecards/2018-senate-final"]
    
    for url in URL_list:
        browser = webdriver.Chrome(ChromeDriverManager().install())
        browser.get(url)
        time.sleep(10)
        
        html = browser.page_source
        tables = pd.read_html(html)
        tables = pd.concat(tables)
        
        data = tables.iloc[:, [0,2]]
        
        browser.close()
        browser.quit()
        
        filename = url[28:].replace("/","_")
        data.to_csv(filename+'.csv', index=False)
    

    【Discussion】:
