[Posted]: 2021-02-25 05:55:43
[Problem description]:
I am trying to scrape tables from the Humane Society Legislative Fund. The following code successfully pulls the data from one of the pages:
import time
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
browser = webdriver.Chrome(ChromeDriverManager().install())
browser.get('https://hslf.org/scorecards/2007-senate-midterm')
time.sleep(10)
html = browser.page_source
humane_sc_tables = pd.read_html(html)
humane_sc_data = humane_sc_tables[0]
I now need to loop over multiple URLs and export each page's results to a CSV file.
import time
import pandas as pd
from selenium import webdriver
from selenium.common import exceptions
from webdriver_manager.chrome import ChromeDriverManager
# browser = webdriver.Chrome(ChromeDriverManager().install())
URL_list = ["https://hslf.org/scorecards/2007-senate-midterm",
"https://hslf.org/scorecards/2008-senate-final",
"https://hslf.org/scorecards/2008-house-final",
"https://hslf.org/scorecards/2009-senate-midterm",
"https://hslf.org/scorecards/2009-house-midterm",
"https://hslf.org/scorecards/2010-house-final",
"https://hslf.org/scorecards/2010-senate-final",
"https://hslf.org/scorecards/2011-house-midterm",
"https://hslf.org/scorecards/2011-senate-midterm",
"https://hslf.org/scorecards/2012-house-final",
"https://hslf.org/scorecards/2012-senate-final",
"https://hslf.org/scorecards/2013-house-midterm",
"https://hslf.org/scorecards/2013-senate-midterm",
"https://hslf.org/scorecards/2014-house-final",
"https://hslf.org/scorecards/2014-senate-final",
"https://hslf.org/scorecards/2015-house-midterm",
"https://hslf.org/scorecards/2015-senate-midterm",
"https://hslf.org/scorecards/2016-house-final",
"https://hslf.org/scorecards/2016-senate-final",
"https://hslf.org/scorecards/2017-house-midterm",
"https://hslf.org/scorecards/2017-senate-midterm",
"https://hslf.org/scorecards/2018-house-final",
"https://hslf.org/scorecards/2018-senate-final"]
for url in URL_list:
    browser = webdriver.Chrome(ChromeDriverManager().install())
    time.sleep(5)
    print("Current session is {}".format(browser.session_id))
    browser.quit()
    try:
        browser.get(url)
    except exceptions.InvalidSessionIdException as e:
        print(e.message)
    html = browser.page_source
    humane_sc_tables = pd.read_html(html)
    humane_sc_data = humane_sc_tables[0]
    humane_sc_data = humane_sc_data.drop(humane_sc_data.columns[[0,5,7]], axis = 1)
    browser.close()
    humane_sc_data.to_csv(f'humane_scores{url}.csv')
However, I get the following error:
MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=55494): Max retries exceeded with url: /session/7e430735b2d015147dc20049f3b78b10/url (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 61] Connection refused'))
Please advise.
[Comments]:
-
A quick Google search for that error message turns up plenty of results; have you already worked through them?
-
Why do you need browser.quit() before calling the .get() method? Are you sure that isn't the root cause?
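The second comment points at the likely root cause: the loop calls browser.quit() before browser.get(url), so the driver's local server is already gone and the connection is refused, which urllib3 surfaces as MaxRetryError. A separate problem is the output name f'humane_scores{url}.csv': the full URL contains slashes, which are invalid in a filename. A minimal sketch of a reordered loop, assuming the same URL_list as above (the slug_for helper and the filename scheme are illustrative choices, not from the original post):

```python
import pandas as pd

def slug_for(url):
    """Return the last path segment of a URL (e.g. '2007-senate-midterm'),
    used as a filesystem-safe CSV suffix; the full URL contains slashes,
    which are not valid in a filename."""
    return url.rstrip('/').split('/')[-1]

def scrape_scorecards(url_list):
    # Imported here so slug_for() can be used without Selenium installed.
    import time
    from selenium import webdriver
    from webdriver_manager.chrome import ChromeDriverManager

    # One browser session is reused for every URL; quit() runs once,
    # after the loop, never before browser.get().
    browser = webdriver.Chrome(ChromeDriverManager().install())
    try:
        for url in url_list:
            browser.get(url)
            time.sleep(10)  # let the JavaScript-rendered table load
            tables = pd.read_html(browser.page_source)
            data = tables[0].drop(tables[0].columns[[0, 5, 7]], axis=1)
            data.to_csv(f'humane_scores_{slug_for(url)}.csv')
    finally:
        browser.quit()
```

The try/finally ensures the browser is shut down even if one page fails to parse, instead of leaving an orphaned chromedriver process behind.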