【Posted】: 2013-12-22 22:27:01
【Question】:
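I know this question has come up a billion times, and that what I want to do is not what Selenium is intended for, but I don't know what else could accomplish it. I have read the existing answers and a lot of documentation as best I can, but I could use some pointers.
I am trying to download some files from CDC Compressed Mortality, which requires, one at a time, 1) pressing "I Agree", 2) navigating a bunch of menus, checkboxes, and drop-down boxes, and 3) pressing "Send" and waiting for the file to start downloading automatically.
The web page has some very troublesome limitations, which led me to look for a way to automate this.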
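- Exporting the result data set with the "Send" button is inconsistent with some settings and omits data points; that is, in some cases the generated file does not reflect the settings for suppressed/omitted values
- The page limits the number of data rows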
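I found that when I export the data one state at a time, the two points above are no longer a problem, but this is super labor-intensive and not much fun. I should note that I have no experience with Python (or really with programming), but the documentation seems good enough for me to get it working to some extent. This is what I want to do: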
- Navigate to the page and press "I Agree"
- Select a state
- Fill in some options
- Click "Send"
- Wait for the file to finish downloading
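Because the Firefox profile is set up to skip the download dialog, files start downloading automatically. I can tell whether a file has finished downloading by finding the newest file and waiting for its .part extension to disappear.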
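The download check described above can be sketched with the standard library alone. This is only a rough sketch, not the script I ran: the names `list_files` and `wait_for_download`, the timeout, and the polling interval are hypothetical, and it assumes Firefox keeps a `.part` file in the download folder until the download is complete.

```python
import os
import time


def list_files(directory):
    """Return the set of regular file names currently in `directory`."""
    return set(
        name for name in os.listdir(directory)
        if os.path.isfile(os.path.join(directory, name))
    )


def wait_for_download(directory, before, timeout=120.0, poll=0.5):
    """Wait until a new, fully written file appears in `directory`.

    `before` is the set of file names seen before the download was triggered.
    Returns the full path of the new file, or None if `timeout` seconds pass.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        new_files = list_files(directory) - before
        # Assumption: a new "*.part" file means a download is still running.
        in_progress = [n for n in new_files if n.endswith(".part")]
        finished = [n for n in new_files if not n.endswith(".part")]
        if finished and not in_progress:
            return os.path.join(directory, sorted(finished)[0])
        time.sleep(poll)
    return None
```

The idea would be to call `before = list_files(savedir)` just before clicking "Send" and then `wait_for_download(savedir, before)` afterwards. Comparing against a snapshot taken per download, rather than against the newest file seen once at startup, avoids mistaking the previous state's file for the new one, and the timeout keeps the loop from spinning forever if nothing ever arrives.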
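The code below runs until it tries to select 12 Florida, and then everything stops. Firefox freezes and no file starts downloading. Repeating the same steps manually works without any problem.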
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os, unittest, time, re
basedir = os.getcwd()
savedir = os.path.join(basedir, 'download')
# Check download status
def checkdownload():
    os.chdir(savedir)
    files = filter(os.path.isfile, os.listdir(os.getcwd()))
    files = [os.path.join(os.getcwd(), f) for f in files] # add path to each file
    files.sort(key=lambda x: os.path.getmtime(x))
    if not files:
        newest_file = "no"
    else:
        newest_file = files[-1]
    os.chdir(basedir)
    return newest_file
# Set user profile
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList",2)
fp.set_preference("browser.download.manager.showWhenStarting",False)
fp.set_preference("browser.download.dir",basedir+'\\download')
fp.set_preference("browser.helperApps.neverAsk.saveToDisk","text/plain")
# Before anything downloads
previousnew = checkdownload()
# Create a new instance of the Firefox driver
b = webdriver.Firefox(firefox_profile=fp)
b.get("http://wonder.cdc.gov/cmf-icd9.html")
b.implicitly_wait(1)
### Find states
b.find_element_by_xpath("/html/body/div/form/div[3]/div/center/input").click() # Press 'I agree'
# print [o.text for o in Select(b.find_element_by_id("SD16.V1")).options]
# Make a list of all the states available
options = Select(b.find_element_by_id("codes-D16.V9")).options
optionsList = []
for option in options:
    optionsList.append(option.get_attribute("value"))
    if option.get_attribute("value") == "*All*":
        optionsList.remove(option.get_attribute("value")) # Remove the *All* option
# Loop over states individually
for optionValue in optionsList:
    print "\nRunning on %s" % optionValue
    b.get("http://wonder.cdc.gov/cmf-icd9.html")
    b.implicitly_wait(1)
    b.find_element_by_xpath("/html/body/div/form/div[3]/div/center/input").click() # Press 'I agree'
    print "Add Selections"
    # 1. Table layout, id = SB_1 ... SB_5
    Select(b.find_element_by_id("SB_1")).select_by_visible_text("Age Group")
    Select(b.find_element_by_id("SB_2")).select_by_visible_text("Race")
    Select(b.find_element_by_id("SB_3")).select_by_visible_text("Gender")
    Select(b.find_element_by_id("SB_4")).select_by_visible_text("County")
    Select(b.find_element_by_id("SB_5")).select_by_visible_text("Year")
    # 2. Location, id = codes-D16.V9
    Select(b.find_element_by_id("codes-D16.V9")).deselect_by_index(0) # remove *All* option
    Select(b.find_element_by_id("codes-D16.V9")).select_by_value(optionValue) # selection
    # Age Group, id = SD16.V5
    Select(b.find_element_by_id("SD16.V5")).deselect_by_index(0) # remove *All* option
    Select(b.find_element_by_id("SD16.V5")).select_by_value('20-24')
    Select(b.find_element_by_id("SD16.V5")).select_by_value('25-34')
    Select(b.find_element_by_id("SD16.V5")).select_by_value('35-44')
    Select(b.find_element_by_id("SD16.V5")).select_by_value('45-54')
    Select(b.find_element_by_id("SD16.V5")).select_by_value('55-64')
    # Gender, id = SD16.V7
    # Race, id = SD16.V8
    # Hisp, Does not exist in this file
    # Year, id = SD16.V1
    yr = 1997, 1998
    Select(b.find_element_by_id("SD16.V1")).deselect_by_index(0) # remove *All* option
    select = Select(b.find_element_by_id("SD16.V1"))
    for o in yr:
        select.select_by_value("%s" % o)
    # ICD-9 Codes, id = codes-D16.V2
    # Rate per, id = SO_rate_per
    # Other options
    b.find_element_by_id("export-option").click()
    b.find_element_by_id("CO_show_totals").click()
    b.find_element_by_id("CO_show_zeros").click()
    b.find_element_by_id("CO_show_suppressed").click()
    # Submit
    print "Submit"
    b.find_element_by_xpath("/html/body/div/form/table/tbody/tr/td/div[2]/div[2]/center/input[1]").click()
    # Check if file has begun downloading
    print "Waiting for new file"
    new = checkdownload()
    while previousnew == new:
        print "... waiting"
        new = checkdownload()
        continue
    print "Waiting for download to finish"
    # New file found, wait until it doesn't have .part extension
    new = checkdownload()
    while os.path.splitext(new)[1] == ".part":
        print "... downloading"
        new = checkdownload()
        continue
    print "Downloaded"
    continue
b.quit()
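I can't figure out why this happens, because no error is produced. Any ideas about what I'm doing wrong?
P.S. I realize my code is horrible, and an honest answer would be "you are doing everything wrong". Still, I really don't know why this simple script behaves this way.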
【Comments】:
- Not a real answer, but consider using PhantomJS as your driver. That takes the "controlling Firefox" part out of the equation.
Tags: python firefox selenium selenium-webdriver