【Title】: Looping through URLs when webscraping tables in Selenium?
【Posted】: 2021-02-25 05:55:43
【Question】:

I am trying to scrape tables from the Humane Society Legislative Fund. The following code successfully pulls the data from one of the pages:

import time
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

browser = webdriver.Chrome(ChromeDriverManager().install())


browser.get('https://hslf.org/scorecards/2007-senate-midterm')
time.sleep(10)


html = browser.page_source

humane_sc_tables = pd.read_html(html)
humane_sc_data = humane_sc_tables[0]

I now need to loop through multiple URLs and export the results for each page to a CSV file.

import time
import pandas as pd
from selenium import webdriver
from selenium.common import exceptions
from webdriver_manager.chrome import ChromeDriverManager

# browser = webdriver.Chrome(ChromeDriverManager().install())

URL_list = ["https://hslf.org/scorecards/2007-senate-midterm",
            "https://hslf.org/scorecards/2008-senate-final",
            "https://hslf.org/scorecards/2008-house-final",
            "https://hslf.org/scorecards/2009-senate-midterm",
            "https://hslf.org/scorecards/2009-house-midterm",
            "https://hslf.org/scorecards/2010-house-final",
            "https://hslf.org/scorecards/2010-senate-final",
            "https://hslf.org/scorecards/2011-house-midterm",
            "https://hslf.org/scorecards/2011-senate-midterm",
            "https://hslf.org/scorecards/2012-house-final",
            "https://hslf.org/scorecards/2012-senate-final",
            "https://hslf.org/scorecards/2013-house-midterm",
            "https://hslf.org/scorecards/2013-senate-midterm",
            "https://hslf.org/scorecards/2014-house-final",
            "https://hslf.org/scorecards/2014-senate-final",
            "https://hslf.org/scorecards/2015-house-midterm",
            "https://hslf.org/scorecards/2015-senate-midterm",
            "https://hslf.org/scorecards/2016-house-final",
            "https://hslf.org/scorecards/2016-senate-final",
            "https://hslf.org/scorecards/2017-house-midterm",
            "https://hslf.org/scorecards/2017-senate-midterm",
            "https://hslf.org/scorecards/2018-house-final",
            "https://hslf.org/scorecards/2018-senate-final"]

for url in URL_list:
    browser = webdriver.Chrome(ChromeDriverManager().install())
    time.sleep(5)

    print("Current session is {}".format(browser.session_id))
    browser.quit()
    try:
        browser.get(url)
    except exceptions.InvalidSessionIdException as e:
        print(e.message)

    html = browser.page_source
    humane_sc_tables = pd.read_html(html)
    humane_sc_data = humane_sc_tables[0]
    humane_sc_data = humane_sc_data.drop(humane_sc_data.columns[[0,5,7]], axis = 1)
    browser.close()
    humane_sc_data.to_csv(f'humane_scores{url}.csv')

However, I get the following error:

MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=55494): Max retries exceeded with url: /session/7e430735b2d015147dc20049f3b78b10/url (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 61] Connection refused'))

Please advise.

【Question comments】:

  • A quick Google search of that error message returns plenty of results; have you worked through those already?
  • Why do you call browser.quit() before calling the .get() method? Are you sure that isn't the root cause?

Tags: python selenium-webdriver


【Solution 1】:

Your call to browser.quit() below:

print("Current session is {}".format(browser.session_id))
browser.quit()
try:
    browser.get(url)
except exceptions.InvalidSessionIdException as e:
    print(e.message)

quit() is a WebDriver command that calls the driver.dispose method, which in turn closes all browser windows and terminates the WebDriver session.

So you are quitting the browser instance before issuing the .get() request that would retrieve the content you want. Try moving that line to the end of the loop so that a new session is created on the next iteration.

【Discussion】:

    【Solution 2】:

    Got it working. See the code below:

    import time
    import pprint
    import pandas as pd
    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager
    
    '''
    
    Note:
    The link https://hslf.org/scorecards/2007-house-midterm doesn't work 
    and is therefore excluded
    
    '''
    URL_list = ["https://hslf.org/scorecards/2007-senate-midterm",
                "https://hslf.org/scorecards/2008-senate-final",
                "https://hslf.org/scorecards/2008-house-final",
                "https://hslf.org/scorecards/2009-senate-midterm",
                "https://hslf.org/scorecards/2009-house-midterm",
                "https://hslf.org/scorecards/2010-house-final",
                "https://hslf.org/scorecards/2010-senate-final",
                "https://hslf.org/scorecards/2011-house-midterm",
                "https://hslf.org/scorecards/2011-senate-midterm",
                "https://hslf.org/scorecards/2012-house-final",
                "https://hslf.org/scorecards/2012-senate-final",
                "https://hslf.org/scorecards/2013-house-midterm",
                "https://hslf.org/scorecards/2013-senate-midterm",
                "https://hslf.org/scorecards/2014-house-final",
                "https://hslf.org/scorecards/2014-senate-final",
                "https://hslf.org/scorecards/2015-house-midterm",
                "https://hslf.org/scorecards/2015-senate-midterm",
                "https://hslf.org/scorecards/2016-house-final",
                "https://hslf.org/scorecards/2016-senate-final",
                "https://hslf.org/scorecards/2017-house-midterm",
                "https://hslf.org/scorecards/2017-senate-midterm",
                "https://hslf.org/scorecards/2018-house-final",
                "https://hslf.org/scorecards/2018-senate-final"]
    
    for url in URL_list:
        browser = webdriver.Chrome(ChromeDriverManager().install())
        browser.get(url)
        time.sleep(10)
        
        html = browser.page_source
        tables = pd.read_html(html)
        tables = pd.concat(tables)
        
        data = tables.iloc[:, [0,2]]
        
        browser.close()
        browser.quit()
        
        filename = url[28:].replace("/","_")
        data.to_csv(filename+'.csv', index=False)
    

    【Discussion】:
