Program to download all PDF files from a webpage: answers

【Question title】: Program to Download all PDF files from webpage
【Posted】: 2020-02-15 21:03:53
【Question】:

I am trying to write a Python 3 program that downloads all PDF files from this website. I currently have two programs, but neither of them works.

import requests 
import urllib.request
import time
import re
from bs4 import BeautifulSoup
url = 'https://fraser.stlouisfed.org/title/1339#556573.html'
response = requests.get(url)
if response.status_code == 200:
      print("Success")
else:
      print("Failure")

soup = BeautifulSoup(response.text, 'html.parser')
for one_a_tag in soup.findAll('a',href=re.compile(r'(.pdf)')):
    link = one_a_tag['href']
    download_url = 'https://fraser.stlouisfed.org/title/1339'+ link
    urllib.request.urlretrieve(download_url) 
    time.sleep(1)

The program runs, but it neither downloads anything nor stops.

Second program

from urllib import request
from bs4 import BeautifulSoup
import re
import os
import urllib
url="https://fraser.stlouisfed.org/title/1339#518552"
response = request.urlopen(url).read()
soup= BeautifulSoup(response, "html.parser")     
links = soup.find_all('a', href=re.compile(r'(.pdf)'))
url_list = []
for el in links:
    if(el['href'].startswith('http')):
        url_list.append(el['href'])
    else:
        url_list.append("https://fraser.stlouisfed.org/title/1339/" + el['href'])

print(url_list)
for url in url_list:
     print(url)
     request.urlretrieve(url, r'C:/Downloads')

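For both programs, if I pass the second argument (filename) to urlretrieve to say where the PDF files should be downloaded, I get this error:

[Errno 13] Permission denied: 'C:/Downloads'

(I have tried several ways to fix this error, but none worked. I am on Windows.) If I leave out the second argument, the second program keeps running and prints output, but nothing is downloaded.

Can anyone help?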

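【Comments on the question】:

On some PCs the C:\ drive is locked down. Can you save to the D:\ drive or somewhere else?
I only have a C:\ drive.
Please share the entire error message.
Shouldn't the path be "C:\Downloads"? By the way, the urllib documentation says that urllib.request.urlretrieve() is considered a legacy function.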
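
Following up on the last comment, here is a minimal sketch of a download without urlretrieve, using urllib.request.urlopen and shutil.copyfileobj from the standard library. The helper name save_url is hypothetical (it does not appear in the question), and the destination is assumed to be a full file path, not a folder.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def save_url(url, dest_file):
    """Stream the body of `url` into `dest_file` and return the path written."""
    dest_file = Path(dest_file)
    # Create the parent folder if it is missing; dest_file itself must be a file path.
    dest_file.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as response, open(dest_file, "wb") as out:
        # Copy in chunks so a large PDF is never held in memory all at once.
        shutil.copyfileobj(response, out)
    return dest_file
```

A call might look like save_url(pdf_url, Path.home() / "Downloads" / "report.pdf"), where pdf_url and report.pdf are placeholders.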

Tags: python-3.x pdf web-scraping download

【Solution 1】:

In most cases the C:/ drive is locked down. Try a different drive such as X:/ or D:/, or point C:/Downloads to C:\Users\Name\Downloads instead.
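
To illustrate the second suggestion: the second argument of urlretrieve is the name of the file to write, so passing a folder such as C:/Downloads is a likely cause of the [Errno 13] error on Windows. Below is a minimal sketch that builds a file path inside a folder, assuming the file name can be taken from the last segment of the URL. The helper destination_for and the download.pdf fallback name are hypothetical.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlsplit


def destination_for(url, folder):
    """Return a file path inside `folder`, named after the last URL path segment."""
    # Use only the path part of the URL, so query strings do not end up in the name.
    name = PurePosixPath(unquote(urlsplit(url).path)).name
    if not name:
        name = "download.pdf"  # hypothetical fallback for URLs without a file name
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / name
```

With this, a call such as urlretrieve(url, destination_for(url, Path.home() / "Downloads")) writes to a file inside the user's Downloads folder and does not try to open the folder itself.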

【Comments】:

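【Solution 2】:

I ran your code without errors, but the PDFs are on the next page. Here is an example that uses a different framework, for reference only.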

import io
from simplified_scrapy import Spider, SimplifiedDoc
class PdfSpider(Spider):
  name = 'stlouisfed'
  concurrencyPer1s = 5
  allowed_domains = ['stlouisfed.org']
  start_urls = ['https://fraser.stlouisfed.org/title/1339']
  refresh_urls = True

  def afterResponse(self, response, url, error=None, extra=None):
    try:
      # Save the response if the URL points to a PDF
      # (assumes a "data" directory already exists)
      if response.code == 200 and url.find('.pdf') > 0:
        name = 'data' + url[url.rindex('/'):]
        with io.open(name, "wb") as file:
          file.write(response.read())
        return None
      else:  # Not a PDF: leave it to the framework
        return Spider.afterResponse(self, response, url, error)
    except Exception as err:
      print (err)

  def extract(self,url,html,models,modelNames):
    urls = SimplifiedDoc(html).listA(url=url['url']).containsOr(['/pdf/','.pdf'],attr='url')
    if(urls):
      self.saveUrl(urls)
    return True

from simplified_scrapy.simplified_main import SimplifiedMain
SimplifiedMain.startThread(PdfSpider()) # Start

【Comments】:

What do you mean by "the PDFs are on the next page"?
The links in your url_list are not PDF links. The pages they point to contain the PDF links.
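
The point of the last comment can be sketched as a two-level crawl: collect the links on the start page, open each same-host page they point to, and gather the PDF links found there. This is only a sketch under assumptions: it uses the standard-library HTMLParser in place of BeautifulSoup to stay dependency-free, it assumes the PDF links are plain <a href> tags no more than one page deep, and all names (LinkCollector, page_links, fetch_html, find_pdf_links) are made up for illustration. It has not been checked against the actual site.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import urlopen

# ".pdf" at the end of the URL or right before a query string (dot escaped on purpose).
PDF_RE = re.compile(r"\.pdf(?:$|\?)", re.IGNORECASE)


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def page_links(html, base_url):
    """Return absolute, fragment-free URLs for all links in `html`."""
    collector = LinkCollector()
    collector.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in collector.hrefs]


def fetch_html(url):
    """Download a page and decode it as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def find_pdf_links(start_url, fetch=fetch_html):
    """Return PDF links found on the start page and on the pages it links to."""
    host = urlsplit(start_url).netloc
    pdfs = []
    seen = {urldefrag(start_url).url}
    pages = [start_url]
    for depth in range(2):  # depth 0: start page, depth 1: pages it links to
        next_pages = []
        for page in pages:
            for link in page_links(fetch(page), page):
                if PDF_RE.search(link):
                    if link not in pdfs:
                        pdfs.append(link)
                elif depth == 0 and urlsplit(link).netloc == host and link not in seen:
                    # Only follow links on the same host, and only one level deep.
                    seen.add(link)
                    next_pages.append(link)
        pages = next_pages
    return pdfs
```

Note that urljoin resolves relative links against the page they came from, which avoids the malformed URLs that gluing strings onto 'https://fraser.stlouisfed.org/title/1339' can produce. For a real run, a short time.sleep between requests (as in the first program) would be polite. If the site builds its links with JavaScript, this approach will not see them.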