Selenium to navigate forms and waiting for a file to finish downloading

【Question title】: Selenium to navigate forms and waiting for a file to finish downloading
【Posted】: 2013-12-22 22:27:01
【Question】:

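I know this question has come up a billion times, and that what I want to do is not what Selenium was intended for, but I don't know what else could do the job. I have read the existing answers and a lot of documentation as best I can, but I could use some pointers.

I am trying to download some files from CDC Compressed Mortality, which requires me to 1) press "I Agree", 2) work through a bunch of menus, checkboxes and dropdowns, and 3) press "Send" and wait for the file to start downloading automatically.

The web page has some very bothersome limitations, which is what led me to look for a way to automate this.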

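Exporting the resulting dataset with the "Send" button is inconsistent with some settings and omits data points, i.e. in some cases the generated file does not reflect the settings for suppressed/omitted values
The page limits the number of data rows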

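I found that by exporting the data one state at a time, the two points above are no longer a problem, but doing so by hand is extremely labor-intensive and not much fun. I should note that I have no experience with Python (or with programming in general), but the documentation seemed good enough for me to get it working to some degree. This is what I want to do: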

Navigate to the page, press "I Agree"
Select a state
Fill in some options
Click Send
Wait for the file to finish downloading

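Because the Firefox profile is set up to skip the download dialog, the file starts downloading automatically. I can tell whether a file has finished downloading by finding the newest file and waiting for its .part extension to disappear.

The code runs fine until it tries to select 12 Florida, and then everything stops. Firefox freezes and no file starts downloading. Repeating the same steps by hand works without any problem.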

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import Select, WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import os, unittest, time, re

basedir = os.getcwd()
savedir = os.path.join(basedir, 'download')

# Check download status
def checkdownload():
    os.chdir(savedir)
    files = filter(os.path.isfile, os.listdir(os.getcwd()))
    files = [os.path.join(os.getcwd(), f) for f in files] # add path to each file
    files.sort(key=lambda x: os.path.getmtime(x))
    if not files :
        newest_file = "no"
    else :
        newest_file = files[-1]
    os.chdir(basedir)
    return newest_file



# Set user profile
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList",2)
fp.set_preference("browser.download.manager.showWhenStarting",False)
fp.set_preference("browser.download.dir",basedir+'\\download')
fp.set_preference("browser.helperApps.neverAsk.saveToDisk","text/plain")

# Before anything downloads
previousnew = checkdownload()

# Create a new instance of the Firefox driver
b = webdriver.Firefox(firefox_profile=fp)
b.get("http://wonder.cdc.gov/cmf-icd9.html")
b.implicitly_wait(1)

### Find states
b.find_element_by_xpath("/html/body/div/form/div[3]/div/center/input").click() # Press 'I agree'

# print [o.text for o in Select(b.find_element_by_id("SD16.V1")).options]

# Make a list of all the states available
options = Select(b.find_element_by_id("codes-D16.V9")).options
optionsList = []

for option in options: 
    optionsList.append(option.get_attribute("value"))
    if option.get_attribute("value") == "*All*":
        optionsList.remove(option.get_attribute("value")) # Remove the *All* option


# Loop over states individually
for optionValue in optionsList:
    print "\nRunning on %s" % optionValue

    b.get("http://wonder.cdc.gov/cmf-icd9.html")
    b.implicitly_wait(1)

    b.find_element_by_xpath("/html/body/div/form/div[3]/div/center/input").click() # Press 'I agree'

    print "Add Selections"

    # 1. Table layout, id = SB_1 ... SB_5
    Select(b.find_element_by_id("SB_1")).select_by_visible_text("Age Group")
    Select(b.find_element_by_id("SB_2")).select_by_visible_text("Race")
    Select(b.find_element_by_id("SB_3")).select_by_visible_text("Gender")
    Select(b.find_element_by_id("SB_4")).select_by_visible_text("County")
    Select(b.find_element_by_id("SB_5")).select_by_visible_text("Year")

    # 2. Location, id = codes-D16.V9
    Select(b.find_element_by_id("codes-D16.V9")).deselect_by_index(0) # remove *All* option
    Select(b.find_element_by_id("codes-D16.V9")).select_by_value(optionValue) # selection

    # Age Group, id = SD16.V5
    Select(b.find_element_by_id("SD16.V5")).deselect_by_index(0) # remove *All* option
    Select(b.find_element_by_id("SD16.V5")).select_by_value('20-24')
    Select(b.find_element_by_id("SD16.V5")).select_by_value('25-34')
    Select(b.find_element_by_id("SD16.V5")).select_by_value('35-44')
    Select(b.find_element_by_id("SD16.V5")).select_by_value('45-54')
    Select(b.find_element_by_id("SD16.V5")).select_by_value('55-64')

    # Gender, id = SD16.V7
    # Race, id = SD16.V8
    # Hisp, Does not exist in this file

    # Year, id = SD16.V1
    yr = 1997, 1998
    Select(b.find_element_by_id("SD16.V1")).deselect_by_index(0) # remove *All* option
    select = Select(b.find_element_by_id("SD16.V1"))
    for o in yr:
        select.select_by_value("%s" % o)

    # ICD-9 Codes, id = codes-D16.V2
    # Rate per, id = SO_rate_per

    # Other options
    b.find_element_by_id("export-option").click()
    b.find_element_by_id("CO_show_totals").click()
    b.find_element_by_id("CO_show_zeros").click()
    b.find_element_by_id("CO_show_suppressed").click()

    # Submit
    print "Submit"
    b.find_element_by_xpath("/html/body/div/form/table/tbody/tr/td/div[2]/div[2]/center/input[1]").click()

    # Check if file has begun downloading
    print "Waiting for new file"
    new = checkdownload()
    while previousnew == new:
        print "... waiting"
        new = checkdownload()
        continue

    print "Waiting for download to finish"
    # New file found, wait until it doesn't have .part extension
    new = checkdownload()
    while os.path.splitext(new)[1] == ".part":
        print "... downloading"
        new = checkdownload()
        continue

    print "Downloaded"

    continue


b.quit()

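I can't work out why this happens, because no error is produced. Any ideas about what I am doing wrong?

PS. I realize my code is horrible and that an honest answer may be "you are doing everything wrong". Still, I really don't understand why this simple script behaves this way.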

【Comments】:

Not really an answer, but consider using PhantomJS as your driver. That takes the "controlling Firefox" part out of the equation.

Tags: python firefox selenium selenium-webdriver

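【Solution 1】:

I ran your code. The first run failed because '\\' is hard-coded as the path separator, but I assume you are on Windows.

After fixing that, the second run failed because of a race condition, which is probably your actual problem. Look at these lines: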
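As a minimal sketch of a portable alternative to the hard-coded separator: the question's script already computes `savedir` with `os.path.join`, so the same value could be reused for the Firefox preference instead of `basedir+'\\download'`. The commented-out `set_preference` line is taken from the question and is not executed here.

```python
import os

basedir = os.getcwd()

# os.path.join inserts the separator that is correct for the current platform
savedir = os.path.join(basedir, "download")
print(savedir)

# In the question's script, the preference could then reuse savedir:
# fp.set_preference("browser.download.dir", savedir)
```

This keeps a single definition of the download directory, so the folder Firefox writes to and the folder the script polls cannot drift apart.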

os.chdir(savedir)
files = filter(os.path.isfile, os.listdir(os.getcwd()))
files = [os.path.join(os.getcwd(), f) for f in files] # add path to each file
files.sort(key=lambda x: os.path.getmtime(x))

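You are working with files that you know will "disappear" (the .part files you are checking for). If that happens between listdir and getmtime, getmtime raises an exception because the file no longer exists, and the script exits without closing Firefox (which is why it appears to "hang"). This probably happens because the files are small and download quickly.

Any operation on a file can fail if that file gets deleted, so you need try/except blocks: checking for existence first does not help, because the file may disappear or be renamed right after the check. This will probably require a loop, however, rather than the nice list comprehension and sort.

Here is a possible implementation of the function: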
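The race can be reproduced without a browser. The sketch below is only an illustration: `safe_mtime` and the file name `report.txt.part` are made up for the example, and the `os.remove` call stands in for Firefox renaming the .part file once the download finishes.

```python
import os
import tempfile


def safe_mtime(path):
    """Return the modification time of path, or None if the file has vanished."""
    try:
        return os.path.getmtime(path)
    except OSError:
        return None


tmpdir = tempfile.mkdtemp()
part = os.path.join(tmpdir, "report.txt.part")  # hypothetical file name
open(part, "w").close()

names = os.listdir(tmpdir)  # the listing still contains the .part file
os.remove(part)             # stand-in for Firefox finishing the download

# Without the try/except in safe_mtime, this line would raise and end the script
results = [safe_mtime(os.path.join(tmpdir, name)) for name in names]
print(results)  # prints [None]

os.rmdir(tmpdir)
```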

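def checkdownload():
    """Return the path of the most recently modified file in savedir, or ""."""
    max_mtime = 0
    newest_file = ""
    for filename in os.listdir(savedir):
        # Join first: os.listdir() returns bare names, which os.path.isfile()
        # would otherwise resolve against the current working directory
        path = os.path.join(savedir, filename)
        try:
            if not os.path.isfile(path):
                continue
            mtime = os.path.getmtime(path)
            if mtime > max_mtime:
                newest_file = path
                max_mtime = mtime
        except OSError:
            pass  # File was probably just moved/deleted
    return newest_file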

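chdir is not necessary and probably not a good idea either; just refer to the directory you are working with directly.
Since you only want the most recently updated file, there is no need to sort the whole list.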
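The two `while` loops in the question poll as fast as they can and never give up. One possible replacement, sketched below, sleeps between checks and stops after a timeout. The names `newest_file` and `wait_for_download`, the default timeout and the demo file name are all assumptions made for this example. It also assumes a download is finished once the newest file differs from the previous one and no .part file is left in the directory.

```python
import os
import tempfile
import time


def newest_file(directory):
    """Return the most recently modified file in directory, or "" if none."""
    best_path, best_mtime = "", 0
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        try:
            if os.path.isfile(path) and os.path.getmtime(path) > best_mtime:
                best_path, best_mtime = path, os.path.getmtime(path)
        except OSError:
            pass  # file vanished between listing and stat
    return best_path


def wait_for_download(directory, previous, timeout=60.0, poll=0.5):
    """Return the new file's path, or None if nothing finished within timeout."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        current = newest_file(directory)
        still_partial = any(n.endswith(".part") for n in os.listdir(directory))
        if current and current != previous and not still_partial:
            return current
        time.sleep(poll)  # avoid spinning the CPU while waiting
    return None


# Small demo on a temporary directory instead of a real browser download
demo_dir = tempfile.mkdtemp()
finished = os.path.join(demo_dir, "export.txt")  # hypothetical file name
open(finished, "w").close()
print(wait_for_download(demo_dir, previous=""))
```

In the question's loop, `previous` would be the value returned before clicking Send, and a `None` result could be used to report which state failed and call `b.quit()` instead of leaving Firefox open.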

【Comments】: