urllib download excel file from php link

【Question title】: urllib download excel file from php link
【Posted】: 2016-02-23 13:47:27
【Question】:

I am trying to download a list of xls files from a URL using urllib.urlretrieve (Python 2.7). I can get the file, but there is a <script> tag at the top of it that makes it unreadable in Excel.

Here is what I have:

import urllib

files= ['a','b', 'c', 'd', 'e', 'f']

url = 'http://www.thewebsite.com/data/dl_xls.php?bid='

for f in files:
    urllib.urlretrieve(url + f, f + '.xls')

This downloads an xls file with the following at the top: <script>parent.parent.location.href = '../../../../a';</script> which makes it unreadable in Excel.

If I remove that script tag from the xls, the file opens correctly in Excel.

Edit - here is my Python solution:

import re
import urllib

files = ['a', 'b', 'c', 'd', 'e', 'f']

url = 'http://www.thewebsite.com/data/dl_xls.php?bid='

for f in files:
    input_xls = f + '_in.xls'
    urllib.urlretrieve(url + f, input_xls)
    # Both files are closed automatically when the with block exits.
    with open(input_xls, "rb") as i, open(f + '_out.xls', "wb") as output:
        # flags must be a keyword: the 4th positional argument of re.sub is count.
        output.write(re.sub('<script>.*?</script>', "", i.read(), flags=re.I))

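【Comments】:

Are you sure http://www.thewebsite.com/data/dl_xls.php?bid=a works?
I can only get the file in the browser with that URL after I have logged in to the site. If I am not logged in, there is a script tag at the top of the xls. Once the script tag is removed, the file works fine.
You may be able to parse the content with BeautifulSoup before writing it out, call something like soup.find('script').extract(), and then save the actual Excel file.
Thanks, I was using extract with bs; I just wanted to make sure I wasn't missing anything in urllib.urlretrieve.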

Tags: python excel python-2.7 urllib

【Answer 1】:

Try building a regular expression that matches the script tag and removes it, i.e.

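import re
# Pass flags by keyword: the 4th positional argument of re.sub is count, not flags.
content = re.sub('<script>.*?</script>', "", content, flags=re.I)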

This replaces any script tag in content with an empty string.
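As a rough, self-contained sketch of that idea, the snippet below strips script tags from raw bytes and shows why the flag should be passed by keyword. The sample bytes and the helper name strip_script_tags are made up for illustration; the real download would come from the URL in the question, and the pattern assumes the tag looks like the one quoted there.

```python
import re

# Hypothetical sample: the injected tag (upper-cased here) followed by stand-in file bytes.
SAMPLE = b"<SCRIPT>parent.parent.location.href = '../../../../a';</SCRIPT>\nid\tname\n1\tfoo\n"

# Non-greedy, case-insensitive, and DOTALL so a tag spanning several lines still matches.
SCRIPT_TAG = re.compile(br"<script\b[^>]*>.*?</script>", re.I | re.S)


def strip_script_tags(data):
    """Return data with every <script>...</script> block removed."""
    return SCRIPT_TAG.sub(b"", data)


cleaned = strip_script_tags(SAMPLE)

# Pitfall: the 4th positional argument of re.sub is count, not flags.
# re.I passed positionally is read as count=2, so matching stays
# case-sensitive and the upper-case tag is left in place.
unchanged = re.sub(b"<script>.*</script>", b"", SAMPLE, re.I)
```

Compiling the pattern once keeps the flags attached to it, which sidesteps the count/flags mix-up entirely. Working on bytes avoids decoding a file whose encoding is unknown.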

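【Comments】:

I'd also suggest you use the requests library; it is much simpler than urllib.
Thanks, this worked, though I ended up sticking with urllib. I'll update my original question with the solution.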