Python - downloading a CSV file from a web page with a link

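【Question title】: Python - downloading a CSV file from a web page with a link
【Posted】: 2014-06-21 03:58:17
【Question】:

I'm trying to download a CSV file from this page with a Python script.

But when I try to access the CSV file directly through its link in the browser, an agreement form is displayed. I have to agree to this form before I'm allowed to download the file.

The exact URL of the CSV file can't be retrieved. It's a value sent to the back-end database, which fetches the file, e.g. PERIOD_ID=2013-0: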

https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0

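I tried urllib2.urlopen() followed by read(), but that returns the HTML content of the agreement form, not the file content.

How do I write Python code that handles this redirect, then fetches my CSV file and lets me save it to disk?

【Comments】:

Tags: python csv web download session-cookies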

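【Solution 1】:

You need to set the ASP.NET_SessionId cookie. You can find it with Chrome's Inspect element option in the context menu, or with Firefox and the Firebug extension.

With Chrome:

Right-click the web page (after you have agreed to the terms) and select Inspect element
Click Resources -> Cookies
Select the only element in the list
Copy the value of the ASP.NET_SessionId element

With Firebug:

Right-click the web page (after you have agreed to the terms) and click Inspect Element with Firebug
Click Cookies
Copy the value of the ASP.NET_SessionId element

In my case I got ihbjzynwfcfvq4nzkncbviou. It might work for you; if it doesn't, you need to go through the procedure above yourself.

Add the cookie to your request and download the file with the requests module (based on eladc's answer):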

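import requests

# Session cookie copied from the browser after accepting the agreement;
# replace it with your own value.
cookies = {'ASP.NET_SessionId': 'ihbjzynwfcfvq4nzkncbviou'}
r = requests.get(
    url=('https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/'
         'DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0'),
    cookies=cookies,
    stream=True,  # stream the body instead of loading it all into memory
)
r.raise_for_status()  # fail early on an HTTP error status

# Write the response to disk in 1 KiB chunks.
with open('2013-0.csv', 'wb') as ofile:
    for chunk in r.iter_content(chunk_size=1024):
        ofile.write(chunk)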
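
One of the comments below reports getting the page's HTML instead of the CSV, which is what a stale or missing session cookie produces. A minimal sketch of a guard against silently saving that page, assuming the agreement page is served as ordinary HTML. The helper name looks_like_html is hypothetical and the heuristic is deliberately crude:

```python
def looks_like_html(body, content_type=''):
    """Return True if a response looks like an HTML page rather than CSV.

    Hypothetical helper: checks the Content-Type header value first, then
    falls back to sniffing the first bytes of the body.
    """
    if 'html' in content_type.lower():
        return True
    # Skip a UTF-8 byte order mark and leading whitespace before sniffing.
    head = body[:512].lstrip(b'\xef\xbb\xbf \t\r\n').lower()
    return head.startswith((b'<!doctype html', b'<html'))
```

With the request above, it could be called as looks_like_html(r.content, r.headers.get('Content-Type', '')) before the file is written, raising an error instead of saving the agreement page under a .csv name.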

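【Comments】:

Setting the ASP cookie retrieves the page's HTML content, not the CSV download output.
@user3602491 This does work, but you may have to find your own cookie value. I tried it myself and the CSV downloaded without problems.

【Solution 2】:

Here is my suggestion for picking up the server's cookies automatically and basically mimicking standard client session behavior.

(Shamelessly inspired by @pope's answer 554580.)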

import urllib2
import urllib
from lxml import etree

_TARGET_URL = 'https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0'
_AGREEMENT_URL = 'https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/Welcome/Agreement.aspx'
_CSV_OUTPUT = 'urllib2_ProdExport2013-0.csv'


class _MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):

    def http_error_302(self, req, fp, code, msg, headers):
        print 'Follow redirect...'  # Any cookie manipulation in-between redirects should be implemented here.
        return urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)

    http_error_301 = http_error_303 = http_error_307 = http_error_302

cookie_processor = urllib2.HTTPCookieProcessor()

opener = urllib2.build_opener(_MyHTTPRedirectHandler, cookie_processor)
urllib2.install_opener(opener)

response_html = urllib2.urlopen(_TARGET_URL).read()

print 'Cookies collected:', cookie_processor.cookiejar

page_node, submit_form = etree.HTML(response_html), {}  # ElementTree node + dict for storing hidden input fields.
for input_name in ['ctl00$MainContent$AgreeButton', '__EVENTVALIDATION', '__VIEWSTATE']:  # Form `input` fields used on the ``Agreement.aspx`` page.
    submit_form[input_name] = page_node.xpath('//input[@name="%s"][1]' % input_name)[0].attrib['value']
    print 'Form input \'%s\' found (value: \'%s\')' % (input_name, submit_form[input_name])

# Submits the agreement form back to ``_AGREEMENT_URL``, which redirects to the CSV download at ``_TARGET_URL``.
csv_output = opener.open(_AGREEMENT_URL, data=urllib.urlencode(submit_form)).read()
print csv_output

with open(_CSV_OUTPUT, 'wb') as f:  # Dumps the CSV output to ``_CSV_OUTPUT``; the file is closed on exit.
    f.write(csv_output)

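Good luck!

[Edit]

As for why this works, I think @Steinar Lima is right that the session cookie is required. However, unless you have already visited the Agreement.aspx page and submitted a response through the provider's website, the cookie you copy from the browser's web inspector will only result in another redirect to the "Welcome to the PA DEP Oil & Gas Reporting Website" welcome page. That, of course, defeats the whole point of having a Python script do the job for you.

【Comments】:

Nice solution! This does work when the session has timed out, without having to visit the site.
Thanks! Works perfectly.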
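
For readers on Python 3, where urllib2 no longer exists, the same flow can be sketched with the standard library alone. This is an untested-against-the-live-site sketch, not a definitive port: the URLs and form field names are copied from the code above and the site may have changed since. The names InputCollector, extract_form_fields and download_csv are hypothetical, and html.parser stands in for lxml. Unlike the original, a missing form field is skipped rather than raising an error.

```python
import urllib.parse
import urllib.request
from html.parser import HTMLParser
from http.cookiejar import CookieJar

# URLs and field names taken from the answer above (assumed still valid).
TARGET_URL = ('https://www.paoilandgasreporting.state.pa.us/publicreports/'
              'Modules/DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0')
AGREEMENT_URL = ('https://www.paoilandgasreporting.state.pa.us/publicreports/'
                 'Modules/Welcome/Agreement.aspx')
FORM_FIELDS = ('ctl00$MainContent$AgreeButton', '__EVENTVALIDATION',
               '__VIEWSTATE')


class InputCollector(HTMLParser):
    """Collects the value of the first <input> seen for each wanted name."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attrs = dict(attrs)
        name = attrs.get('name')
        if name in self.wanted and name not in self.values:
            self.values[name] = attrs.get('value') or ''


def extract_form_fields(html, wanted=FORM_FIELDS):
    """Return {name: value} for the wanted <input> fields found in html."""
    collector = InputCollector(wanted)
    collector.feed(html)
    return collector.values


def download_csv(path):
    """Accept the agreement and save the CSV to path (performs network I/O)."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))
    # The first request is redirected to the agreement page; the cookie
    # processor stores the session cookie the server sets along the way.
    with opener.open(TARGET_URL) as resp:
        html = resp.read().decode('utf-8', errors='replace')
    form = urllib.parse.urlencode(extract_form_fields(html)).encode('ascii')
    # Posting the form back is expected to redirect to the CSV export.
    with opener.open(AGREEMENT_URL, data=form) as resp, open(path, 'wb') as f:
        f.write(resp.read())
```

Calling download_csv('ProdExport2013-0.csv') would run the whole exchange. The form parsing is kept in its own function so it can be checked against a saved copy of the agreement page without touching the network.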