【Question title】: Download PDF links listed in a CSV with the Python Requests module [closed]
【Posted】: 2017-10-11 16:30:00
【Question】:

Download 1000 PDF links listed in a CSV file using the Python Requests module.

【Comments】:

  • Can you add external packages to your project, or do you have to use urllib?

Tags: python python-2.7 python-3.x python-requests urllib2


【Solution 1】:

I suggest using Requests; with it you can do the following:

import os
import csv
import requests

write_path = 'folder_name'  # Assumes this folder already exists!

with open('x.csv', 'r') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        if not row:
            continue  # Skip blank lines in the CSV
        url = row[0]
        print('-' * 72)
        # Use the last path segment of the URL as the local file name
        pdf_file = url.split('/')[-1]
        try:
            # Try to request the PDF from the URL
            print('TRYING {}...'.format(url))
            response = requests.get(url, stream=True)
            response.raise_for_status()  # Treat HTTP 4xx/5xx responses as errors
            # Open the output file only once the request has succeeded,
            # so a failed request does not leave an empty file behind
            with open(os.path.join(write_path, pdf_file), 'wb') as pdf:
                for block in response.iter_content(512):
                    pdf.write(block)
            print('OK.')
        except requests.exceptions.RequestException as e:  # Catches ONLY Requests exceptions
            print('REQUESTS ERROR:')
            print(e)  # Should give more details about the error
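
The script above takes the file name from the last '/'-separated piece of the URL and assumes the output folder already exists. As a minimal sketch of how those two assumptions could be relaxed, the helpers below use only the standard library (Python 3). The names filename_from_url and ensure_folder, and the fallback name download.pdf, are hypothetical and not part of the original answer.

```python
import os
from urllib.parse import urlparse, unquote


def filename_from_url(url, default='download.pdf'):
    """Return the last path segment of `url`, ignoring any query string.

    Falls back to `default` (a hypothetical placeholder name) when the
    URL path ends in '/' and therefore has no file name.
    """
    path = urlparse(url).path
    name = os.path.basename(unquote(path))
    return name or default


def ensure_folder(path):
    """Create `path` (and any missing parents); do nothing if it exists."""
    os.makedirs(path, exist_ok=True)
    return path
```

With these, the loop could call ensure_folder(write_path) once before reading the CSV and use filename_from_url(link[0]) in place of the split('/') expression.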

The test contents of x.csv are:

https://www.pabanker.com/media/3228/qtr1pabanker_final-web.pdf
http://www.pdf995.com/samples/pdf.pdf
https://tcd.blackboard.com/webapps/dur-browserCheck-BBLEARN/samples/sample.pdf
http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf

Sample output:

$ python test.py
------------------------------------------------------------------------
TRYING https://www.pabanker.com/media/3228/qtr1pabanker_final-web.pdf...
REQUESTS ERROR:
("Connection broken: ConnectionResetError(54, 'Connection reset by peer')", ConnectionResetError(54, 'Connection reset by peer'))
------------------------------------------------------------------------
TRYING http://www.pdf995.com/samples/pdf.pdf...
OK.
------------------------------------------------------------------------
TRYING https://tcd.blackboard.com/webapps/dur-browserCheck-BBLEARN/samples/sample.pdf...
OK.
------------------------------------------------------------------------
TRYING http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf...
OK.
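
The first URL in the sample output fails with a connection reset. With 1000 links, a few transient failures like that are likely, so one option is to reuse a single Session with a timeout and automatic retries. The sketch below is one possible approach, not part of the original answer: make_session and download are hypothetical names, and the retry count, backoff, status codes, timeout and chunk size are arbitrary assumptions. Note that the retries cover failed connection attempts and the listed 5xx responses; a transfer that breaks partway through the body is reported as an error and may leave a partial file.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=3, backoff=1.0):
    """Build a Session that retries failed connections and some 5xx responses."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff,
        status_forcelist=(500, 502, 503, 504),  # assumed set of retryable codes
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


def download(session, url, dest_path, timeout=30):
    """Stream `url` to `dest_path`.

    Returns True on success and False if Requests raised an error.
    """
    try:
        with session.get(url, stream=True, timeout=timeout) as response:
            response.raise_for_status()  # Treat HTTP 4xx/5xx as errors
            with open(dest_path, 'wb') as pdf:
                for block in response.iter_content(8192):
                    pdf.write(block)
        return True
    except requests.exceptions.RequestException as e:
        print('REQUESTS ERROR: {}'.format(e))
        return False
```

In the loop, a session created once with make_session() could replace the bare requests.get call, and the URLs for which download returns False could be collected in a list and retried or reported at the end.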

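【Discussion】:

  • I tried reading the csv with urllib, but it could not read the csv file. It shows PdfReadWarning: Xref table not zero-indexed. ID numbers for objects will be corrected. [pdf.py:1736] and import requests is shown in black. Why?
  • But that is a PdfRead library error... Can you post the code snippet that causes it?
  • Here is a screenshot of the code and the error: imgur.com/a/kQaXZ
  • I just updated my answer; see if it is what you need...
  • Don't contact me directly; post your question here and someone will answer quickly ;) To change the folder, look at my updated answer... Also, if the answer works for your use case, please accept it!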