Automate downloading all linked PDFs inside multiple PDF files (answers)

【Question title】: Automate download all links (of PDFs) inside multiple pdf files
【Posted】: 2019-10-28 16:33:51
【Question description】:

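I am trying to download journal issues from a website (http://cis-ca.org/islamscience1.php). I ran some code to fetch all the PDFs on this page. However, those PDFs themselves contain links to other PDFs.

I want to get the final articles from all of those PDF links.

Getting all the PDFs from the page http://cis-ca.org/islamscience1.php: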

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://cis-ca.org/islamscience1.php"

# Create the output folder if it does not already exist
folder_location = "webscraping"
os.makedirs(folder_location, exist_ok=True)

response = requests.get(url)
response.raise_for_status()  # stop early if the index page could not be fetched
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # Name each file after the last segment of its link, which is unique in this case
    filename = os.path.join(folder_location, link["href"].split("/")[-1])
    with open(filename, "wb") as f:
        # Resolve relative hrefs against the page URL before downloading
        f.write(requests.get(urljoin(url, link["href"])).content)

I want the articles that are linked inside those PDFs. Thanks in advance.

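【Comments】:

There may already be an answer here: stackoverflow.com/q/27744210/10058326
Possible duplicate of Extract hyperlinks from PDF in Python
I want the whole process to be automated, rather than going through each file one by one.

Tags: python pdf web-scraping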

【Solution 1】:

https://mamclain.com/?page=Blog_Programing_Python_Removing_PDF_Hyperlinks_With_Python

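Take a look at this link. It shows how to identify hyperlinks in a PDF document and scrub them out. You can follow it up to the identification step and then store the hyperlinks instead of removing them.

Alternatively, take a look at this library: https://github.com/metachris/pdfx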
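
The "identify the hyperlinks, then store them" idea could be sketched roughly as follows. This is not the code from either link above. It swaps in the third-party pypdf package (an assumption, installed with pip install pypdf) to read each page's link annotations, and the function names extract_links, pdf_targets and download_linked_pdfs, as well as the "articles" output folder mentioned below, are hypothetical names chosen for illustration. It only finds links stored as URI annotations, so links that appear purely as visible text would be missed.

```python
import os
from urllib.parse import urljoin, urlparse


def extract_links(pdf_path):
    """Return the URI target of every link annotation in one PDF file."""
    # Assumption: the third-party pypdf package is installed.
    # It is imported here so that pdf_targets() below can be used without it.
    from pypdf import PdfReader

    links = []
    for page in PdfReader(pdf_path).pages:
        if "/Annots" not in page:
            continue  # this page has no annotations at all
        for annot in page["/Annots"]:
            annot = annot.get_object()  # resolve indirect references
            if "/A" in annot and "/URI" in annot["/A"]:
                links.append(str(annot["/A"]["/URI"]))
    return links


def pdf_targets(links, base_url, out_folder):
    """Keep only links that point at a .pdf file.

    Returns a list of (absolute_url, local_path) pairs.
    """
    targets = []
    for link in links:
        absolute = urljoin(base_url, link)  # handles relative links too
        name = os.path.basename(urlparse(absolute).path)
        if name.lower().endswith(".pdf"):
            targets.append((absolute, os.path.join(out_folder, name)))
    return targets


def download_linked_pdfs(pdf_folder, out_folder, base_url):
    """Download every PDF linked from the PDFs already saved in pdf_folder."""
    import requests  # same third-party dependency as the question's script

    os.makedirs(out_folder, exist_ok=True)
    for name in os.listdir(pdf_folder):
        if not name.lower().endswith(".pdf"):
            continue
        links = extract_links(os.path.join(pdf_folder, name))
        for link_url, path in pdf_targets(links, base_url, out_folder):
            response = requests.get(link_url)
            if response.ok:  # skip dead links instead of saving error pages
                with open(path, "wb") as f:
                    f.write(response.content)
```

After the question's script has filled the webscraping folder, calling download_linked_pdfs("webscraping", "articles", "http://cis-ca.org/islamscience1.php") would walk every saved issue and fetch the articles it links to, with no per-file manual step. Writing to a separate folder keeps the downloaded articles from being mixed in with the issue files.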
