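Downloading files from Filetype fields? Answers

【Question title】: Downloading files from Filetype fields?
【Posted】: 2012-12-21 03:42:26
【Question description】:

I am looking for a way to download files from different pages and store them in a specific folder on my local machine. I am using Python 2.7.

Please see the field below:

EDIT

Here is the HTML content: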

<input type="hidden" name="supplier.orgProfiles(1152444).location.locationPurposes().extendedAttributes(Upload_RFI_Form).value.filename" value="Screenshot.docx">

<a style="display:inline; position:relative;" href="

                                      /aems/file/filegetrevision.do?fileEntityId=8120070&cs=LU31NT9us5P9Pvkb1BrtdwaCrEraskiCJcY6E2ucP5s.xyz">
                                Screenshot.docx
                             </a>

One possibility I just tried: if I prepend, say, https://xyz.test.com to the href from the HTML content, I get a URL like the one below.

https://xyz.test.com/aems/file/filegetrevision.do?fileEntityId=8120070&cs=LU31NT9us5P9Pvkb1BrtdwaCrEraskiCJcY6E2ucP5s.xyz

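Pasting that URL into the browser and pressing Enter gives me the option to download the file, as shown in the screenshot. But can we find, from a script, values such as aems/file/filegetrevision.do?fileEntityId=8120070&cs=LU31NT9us5P9Pvkb1BrtdwaCrEraskiCJcY6E2ucP5s.xyz that are present in the page?

Code: what I have tried so far

I am just stuck on how to download that file. I construct the URL with this script: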

for a in soup.find_all('a', {"style": "display:inline; position:relative;"}, href=True):
    href = a['href'].strip()
    # href already starts with "/", so the base must not end with one
    href = "https://xyz.test.com" + href
    print(href)

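Please help me!

If you need more information from me, let me know and I will be happy to share it.

Thanks in advance!

【Question discussion】:

What do you mean by different pages? Where are these pages rendered from?
@Amyth I am working with a third-party URL. I use Selenium to navigate from page to page within the site, look for any downloadable files there and, if found, download them into a specific folder. I have 10000 such files to download.
Can you post the full HTML?
This is the HTML that contains the download link... that is why I posted only this much!

Tags: python selenium python-2.7 beautifulsoup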

【Solution 1】:

As @JohnZwinck suggested, you can use urllib.urlretrieve, and use the re module to build a list of the links on a given page and download each file. Here is an example.

#!/usr/bin/python

"""
This script would scrape and download files using the anchor links.
"""


#Imports

import os, re, sys
import urllib, urllib2

#Config
base_url = "http://www.google.com/"
destination_directory = "downloads"


def _usage():
    """
    This method simply prints out the Usage information.
    """

    print "USAGE: %s <url>" %sys.argv[0]


def _create_url_list(url):
    """
    This method would create a list of downloads, using the anchor links
    found on the URL passed.
    """

    raw_data = urllib2.urlopen(url).read()
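    # re.DOTALL lets the href value span several lines, as it does in the question's HTML.
    raw_list = re.findall('<a style="display:inline; position:relative;" href="(.+?)"', raw_data, re.DOTALL)
    # Strip the whitespace around each href and avoid a double slash when joining.
    url_list = [base_url.rstrip('/') + '/' + x.strip().lstrip('/') for x in raw_list]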
    return url_list


def _get_file_name(url):
    """
    This method will return the filename extracted from a passed URL
    """

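    # The last path segment is used as the local file name. For URLs such as
    # .../filegetrevision.do?fileEntityId=... this is the script name plus the
    # query string, not the document name shown in the link text.
    return url.split('/')[-1]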


def _download_file(url, filename):
    """
    Given a URL and a filename, this method will save a file locally to the
    destination_directory path.
    """
    if not os.path.exists(destination_directory):
        print 'Directory [%s] does not exist, Creating directory...' % destination_directory
        os.makedirs(destination_directory)
    try:
        print 'Downloading File [%s]' % (filename)
        urllib.urlretrieve(url, os.path.join(destination_directory, filename))
    except IOError as error:
        # Connection and file errors surface as IOError in Python 2
        print 'Error Downloading File [%s]: %s' % (filename, error)


def _download_all(main_url):
    """
    Given a URL list, this method will download each file in the destination
    directory.
    """

    url_list = _create_url_list(main_url)
    for url in url_list:
        _download_file(url, _get_file_name(url))


def main(argv):
    """
    This is the script's launcher method.
    """

    if len(argv) != 1:
        _usage()
        sys.exit(1)
    _download_all(argv[0])
    print 'Finished Downloading.'


if __name__ == '__main__':
    main(sys.argv[1:])

You can change base_url and destination_directory as needed and save the script as download.py. Then use it from the terminal like this:

python download.py http://www.example.com/?page=1
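The regular expression in the script above depends on the exact attribute order and spacing of the anchor tag, and the file name it derives from the URL would be the filegetrevision.do query string rather than Screenshot.docx. As a rough alternative, here is a sketch that uses only the standard library HTML parser to collect each link together with its visible text. It is written for Python 3 (in Python 2.7 the modules are HTMLParser and urlparse), and the names DownloadLinkParser, find_downloads and SAMPLE_HTML are made up for illustration; the sample markup is the snippet from the question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Style attribute that marks the download anchors in the question's HTML.
TARGET_STYLE = "display:inline; position:relative;"


class DownloadLinkParser(HTMLParser):
    """Collect (absolute_url, link_text) pairs for the styled download anchors."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # href of the anchor currently being read
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("style") == TARGET_STYLE and attrs.get("href"):
            # The href in the question is padded with newlines and spaces.
            self._href = attrs["href"].strip()
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            url = urljoin(self.base_url, self._href)
            self.links.append((url, "".join(self._text).strip()))
            self._href = None


def find_downloads(html, base_url):
    """Return a list of (absolute_url, link_text) tuples found in html."""
    parser = DownloadLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


SAMPLE_HTML = """<a style="display:inline; position:relative;" href="

    /aems/file/filegetrevision.do?fileEntityId=8120070&cs=LU31NT9us5P9Pvkb1BrtdwaCrEraskiCJcY6E2ucP5s.xyz">
    Screenshot.docx
</a>"""

for url, name in find_downloads(SAMPLE_HTML, "https://xyz.test.com"):
    print(name, url)
```

The link text can then serve as the local file name when saving, assuming the site always shows the real file name there, as it does in the snippet from the question.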

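【Discussion】:

Thank you very much, sir... I saw your 4 C's... my only C is cgrt. :) Why do you use both urllib and urllib2?
Because according to docs.python.org/2/library/urllib.html, urllib.urlopen has been replaced by urllib2.urlopen in Python 3. So it is just to make sure the script works on p3 as well.
+1 That cleared up my confusion. Delhi helps Mumbai :) :)
The way you write code, I am really fida over it: perfect comments, beautiful design... you really proved that coding is your passion. You deserve the 4 C's.
Cheers, glad it helped.

【Solution 2】:

We have no way of knowing what service your first image comes from, but we will assume it is on some kind of website, possibly one internal to your company.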

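The simplest thing you can try is to use urllib.urlretrieve to "get" the file by its URL. You may be able to do this if you can right-click the link on that page, copy the URL, and paste it into your code.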
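A minimal sketch of that idea, assuming Python 3, where the function lives at urllib.request.urlretrieve (in Python 2.7 it is urllib.urlretrieve). The helper name download_to_folder is hypothetical, and the demonstration copies a local file through a file:// URL so that it runs without network access; with a real site you would pass the copied https:// URL instead.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def download_to_folder(url, folder, filename):
    """Save the resource at url as folder/filename and return the saved path."""
    os.makedirs(folder, exist_ok=True)   # create the target folder if needed
    target = os.path.join(folder, filename)
    urlretrieve(url, target)
    return target


# Demonstration with a file:// URL standing in for the real download link.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.txt"
    source.write_text("example content")
    saved = download_to_folder(source.as_uri(), os.path.join(tmp, "downloads"), "Screenshot.docx")
    print(os.path.basename(saved), os.path.getsize(saved))
```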

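However, this may not work, for example if complex authentication is required before that page can be accessed. You may need to write Python code that actually performs the login (as if a user were controlling it and typing the password). If you get that far, you should post it as a separate question.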
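If the login is already done in a Selenium session, one option that may work is to copy the browser's cookies into a plain HTTP request. The sketch below assumes Python 3 and a list of cookie dicts shaped like the one Selenium's driver.get_cookies() returns (each with 'name' and 'value' keys). The helper names and the sample cookie values are invented for illustration, and whether the server accepts cookies alone is an assumption that has to be checked against the real site.

```python
from urllib.request import Request


def cookie_header(cookies):
    """Build a Cookie header value from dicts with 'name' and 'value' keys."""
    return "; ".join("%s=%s" % (c["name"], c["value"]) for c in cookies)


def authenticated_request(url, cookies):
    """Return a Request object that carries the given cookies."""
    return Request(url, headers={"Cookie": cookie_header(cookies)})


# Stand-in for driver.get_cookies(); these values are made up.
sample_cookies = [
    {"name": "JSESSIONID", "value": "abc123"},
    {"name": "lang", "value": "en"},
]
request = authenticated_request(
    "https://xyz.test.com/aems/file/filegetrevision.do?fileEntityId=8120070",
    sample_cookies,
)
print(request.get_header("Cookie"))
```

The request can then be passed to urllib.request.urlopen and the response body written to a file opened in binary mode.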

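【Discussion】:

Yes, I can log in to that page using Selenium WebDriver. The only place I am stuck is how to handle such a prompt and download such a file.
You don't need to "use" the prompt; that is a feature of the browser for when a human is using it. You are automating this, so I don't think the prompt should be involved. Perhaps your use of Selenium WebDriver is actually holding you back: try my approach with urllib and see whether it is easier. If there is no authentication, it certainly should be.
urllib.urlretrieve(url[, filename[, reporthook[, data]]]) Here I can only supply the URL, but what about the rest of the values? They are all dynamic in nature. The files are to be downloaded from fields present on that page. The page contains different types of fields as well as such Filetype fields.
Oh, well, you can use BeautifulSoup (crummy.com/software/BeautifulSoup) to scrape the page and get the list of files. And I don't see why you need to know the file type; the extension seems to imply it anyway, so you don't need it once you have downloaded the file.
+1 for the suggestion. I am already using BS4 to extract values from the web page. But how can the files be downloaded? Please guide me. If you like, I can give you the content of the HTML page!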