[Posted]: 2021-08-08 20:24:18
[Problem description]:
I want to download the PDF files from https://www.archives.gov/research/pentagon-papers:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://www.archives.gov/research/pentagon-papers"

# If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

# Download the files
for link in soup.select("a[href$='.pdf']"):
    # Name the PDF files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
However, I would like the files to be named after their descriptions rather than after the link URLs. For example, I want the third file in the table to be named [Part II] U.S. Involvement in the Franco-Viet Minh War, 1950-1954.pdf instead of Pentagon-Papers-Part-II.pdf.
In the for loop, the description is stored in the link element's contents, but I don't know how to extract it.
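One possible approach (a sketch, not tested against the live page) is to read each link's visible text with BeautifulSoup's `get_text()` and sanitize it into a safe filename; the `sanitize` helper and the inline HTML snippet below are illustrative assumptions, not part of the original script:

```python
import re
from bs4 import BeautifulSoup

def sanitize(name):
    # Replace characters that Windows forbids in filenames with a dash
    return re.sub(r'[\\/:*?"<>|]', '-', name).strip()

# Hypothetical snippet standing in for one row of the table on the page
html = ('<a href="/files/research/pentagon-papers/Pentagon-Papers-Part-II.pdf">'
        '[Part II] U.S. Involvement in the Franco-Viet Minh War, 1950-1954</a>')
soup = BeautifulSoup(html, "html.parser")

link = soup.select_one("a[href$='.pdf']")
# get_text(strip=True) returns the link's visible description text
filename = sanitize(link.get_text(strip=True)) + ".pdf"
print(filename)
# → [Part II] U.S. Involvement in the Franco-Viet Minh War, 1950-1954.pdf
```

In the original loop, `filename = os.path.join(folder_location, sanitize(link.get_text(strip=True)) + ".pdf")` would replace the URL-based name, assuming every matching link on the page actually carries a non-empty description.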
[Discussion]:
Tags: python pdf web-scraping beautifulsoup