【Question title】: Download files with Python (REST URL)
【Posted】: 2013-09-21 20:47:20
【Question description】:

I am trying to write a script that will download a bunch of files from a website that uses REST-style URLs.

Here is the GET request:

GET /test/download/id/5774/format/testTitle HTTP/1.1
Host: testServer.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: __utma=11863783.1459862770.1379789243.1379789243.1379789243.1; __utmb=11863783.28.9.1379790533699; __utmc=11863783; __utmz=11863783.1379789243.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); PHPSESSID=fa844952890e9091d968c541caa6965f; loginremember=Qraoz3j%2BoWXxwqcJkgW9%2BfGFR0SDFLi1FLS7YVAfvbcd9GhX8zjw4u6plYFTACsRruZM4n%2FpX50%2BsjXW5v8vykKw2XNL0Vqo5syZKSDFSSX9mTFNd5KLpJV%2FFlYkCY4oi7Qyw%3D%3D; ma-refresh-storage=1; ma-pref=KLSFKJSJSD897897; skipPostLogin=0; pp-sid=hlh6hs1pnvuh571arl59t5pao0; __utmv=11863783.|1=MemberType=Yearly=1; nats_cookie=http%253A%252F%252Fwww.testServer.com%252F; nats=NDc1NzAzOjQ5MzoyNA%2C74%2C0%2C0%2C0; nats_sess=fe3f77e6e326eb8d18ef0111ab6f322e; __utma=163815075.1459708390.1379790355.1379790355.1379790355.1; __utmb=163815075.1.9.1379790485255; __utmc=163815075; __utmz=163815075.1379790355.1.1.utmcsr=ppp.contentdef.com|utmccn=(referral)|utmcmd=referral|utmcct=/postlogin; unlockedNetworks=%5B%22rk%22%2C%22bz%22%2C%22wkd%22%5D
Connection: close

If the request is good, the server returns a 302 response like this one:

HTTP/1.1 302 Found
Date: Sat, 21 Sep 2013 19:32:37 GMT
Server: Apache
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6
Vary: User-Agent,Accept-Encoding
Content-Length: 0
Connection: close
Content-Type: text/html; charset=UTF-8

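What I need the script to do is check whether the response is a 302. If it is not, the script should just "pass"; if it is, it needs to parse out the Location header shown here: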

location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6

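Once I have the Location value, I will have to make another GET request to download the file. I also have to maintain the cookies for my session in order to download the file.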
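The two steps described above (check for a 302, then fetch the Location URL with the session cookie) can be sketched with the standard library alone. This is only a minimal sketch, not a tested solution for the site in question: the function names `get_redirect_location` and `download` are made up for illustration, and it assumes that the cookie string copied from the browser request is still valid and that the download host accepts the same `Cookie` header.

```python
import http.client
import shutil
import urllib.request


def get_redirect_location(host, path, cookie_header, port=80):
    """Send one GET without following redirects.

    Returns the Location header if the status is 302, otherwise None.
    (Hypothetical helper; http.client never follows redirects itself.)
    """
    conn = http.client.HTTPConnection(host, port, timeout=30)
    try:
        conn.request("GET", path, headers={"Cookie": cookie_header})
        resp = conn.getresponse()
        resp.read()  # drain the (empty) body before closing
        if resp.status == 302:
            # Header lookup is case-insensitive, so "location:" works too.
            return resp.getheader("Location")
        return None
    finally:
        conn.close()


def download(url, dest_path, cookie_header):
    """Fetch url with the same Cookie header and stream it to dest_path."""
    req = urllib.request.Request(url, headers={"Cookie": cookie_header})
    with urllib.request.urlopen(req, timeout=30) as resp, \
            open(dest_path, "wb") as fh:
        shutil.copyfileobj(resp, fh)
```

With the request shown earlier, this would be called roughly as `loc = get_redirect_location("testServer.com", "/test/download/id/5774/format/testTitle", cookie)` followed by `download(loc, "picture.jpg", cookie)` whenever `loc` is not `None`, where `cookie` is the full value of the `Cookie:` line.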

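Can someone point me in the right direction as to which library is best suited for this? I can't figure out how to parse the 302 response and how to add a cookie value like the one shown in my GET request above. I'm sure there must be some library that can do all of this.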

Any help would be greatly appreciated.
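On the library question: `urllib.request` in the standard library follows 302 redirects on its own, so parsing the Location header by hand may not be necessary at all. The sketch below is one possible shape for that, under the assumption that a fixed `Cookie` header copied from the browser is enough to keep the session; `download_following_redirects` is a hypothetical name, not part of any library. Note that the cookie is also sent to the host named in the redirect.

```python
import shutil
import urllib.request


def download_following_redirects(url, dest_path, cookie_header):
    """GET url, let urllib follow any redirect, and save the final body.

    Returns the URL the body was finally served from.
    """
    opener = urllib.request.build_opener()
    # Headers in opener.addheaders are added to every request the opener
    # makes, including the follow-up request issued after a redirect.
    opener.addheaders.append(("Cookie", cookie_header))
    with opener.open(url, timeout=30) as resp, open(dest_path, "wb") as fh:
        shutil.copyfileobj(resp, fh)
        return resp.geturl()
```

If the returned URL equals the requested one, no redirect took place, which corresponds to the "not a 302" case in the question.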

【Question discussion】:

    Tags: python http cookies request urllib


    【Solution 1】:
    import urllib.request as ur
    import urllib.error as ue
    
    # Note: HTTPResponse.read([amt]) returns the response body as bytes, or at
    # most the next amt bytes. urlopen() cannot determine the encoding of the
    # byte stream it receives from the HTTP server, so it does not decode it.
    # For binary files such as images the raw bytes are exactly what is needed.
    
    url = "http://www.example.org/images/{}.jpg"
    
    dst = ""  # destination prefix; empty means the current directory
    arr = ["01", "02", "03", "04", "05", "06", "07", "08", "09"]
    # arr = range(10, 20)
    for x in arr:
        print(str(x) + "). ".ljust(4), end="")
        # try/except sits inside the loop so one failed request
        # does not abort the remaining downloads
        try:
            # urlopen() returns an HTTPResponse; read() yields the body only
            with ur.urlopen(url.format(x)) as hrio, \
                    open(dst + str(x) + ".jpg", "wb") as fh:
                fh.write(hrio.read())
            print("\t[REQUEST COMPLETE]\t\t<Error ~ [None]>")
        except ue.URLError as e:
            print("\t[REQUEST INCOMPLETE]\t", end="")
            print("<Error ~ [{}]>".format(e))
    

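    【Discussion】:

    • You can use this Python 3 script to download the images from http://www.example.org/images/01.jpg through http://www.example.org/images/09.jpg
    • That code doesn't work. You commented out the url variable, which isn't a REST-style URL. It doesn't grab the response to parse the location. It also doesn't download the file; it just reads it.
    • Of course I commented out the variable! I have already spoon-fed you the answer :| Now you want me to chew it for you too :D
    • The solution doesn't seem to work. Again, it comes back to the response. I need a way to parse the HTTP response, not the page source. It's a REST-style URL, not the kind of URL used in the code.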