【Title】: Scrape and download PDF files with modified names using BeautifulSoup in Python
【Posted】: 2021-08-08 20:24:18
【Question】:

I want to download the PDF files from https://www.archives.gov/research/pentagon-papers

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "https://www.archives.gov/research/pentagon-papers"

# If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

response = requests.get(url)

soup = BeautifulSoup(response.text, "html.parser")

# Downloading the files
for link in soup.select("a[href$='.pdf']"):
    # Name the pdf files using the last portion of each link which are unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)

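However, I want the files to be named after their descriptions rather than their original file names. For example, I want the third file in the table to be named [Part II] U.S. Involvement in the Franco-Viet Minh War, 1950-1954.pdf instead of Pentagon-Papers-Part-II.pdf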

Inside the for loop, the description is stored in the link element as contents, but I don't know how to extract it.
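
For reference, the visible text of a tag can be read with get_text(), while .contents is a list of the tag's child nodes. A minimal sketch on a hard-coded snippet (the HTML below is made up for illustration and is not copied from the archives.gov page):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a link to one of the PDFs.
html = (
    '<a href="/files/Pentagon-Papers-Part-II.pdf">'
    "[Part II] U.S. Involvement in the Franco-Viet Minh War, 1950-1954 </a>"
)

soup = BeautifulSoup(html, "html.parser")
link = soup.select_one("a[href$='.pdf']")

# .contents is a list of child nodes; here it holds a single text node.
children = link.contents

# get_text() joins all text inside the tag; strip() trims surrounding whitespace.
description = link.get_text().strip()

print(children)
print(description)
```

If the real links wrap their text in nested tags, .contents would hold those tags rather than plain strings, which is one reason get_text() is usually the more convenient call.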

    Tags: python pdf web-scraping beautifulsoup


    【Solution 1】:

    How about using the text inside the <a> tag as the name, as you wanted?

    Here's how:

    import os
    from urllib.parse import urljoin
    
    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.archives.gov/research/pentagon-papers"
    
    # If there is no such folder, the script will create one automatically
    folder_location = r'E:\webscraping'
    if not os.path.exists(folder_location):
        os.mkdir(folder_location)
    
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    
    # Downloading the files
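    for link in soup.select("a[href$='.pdf']"):
        # Build the file name from the link text instead of the URL:
        # trim trailing whitespace, turn spaces into underscores,
        # and drop commas and periods
        filename = os.path.join(
            folder_location,
            (
                link.get_text()
                .rstrip()
                .replace(" ", "_")
                .replace(",", "")
                .replace(".", "")
            ),
        )
        # Append the .pdf extension (periods were stripped above) and save the file
        with open(f"{filename}.pdf", 'wb') as f:
            f.write(requests.get(urljoin(url, link['href'])).content)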
    

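    This should produce files named after their descriptions (the .pdf extension is appended when each file is written):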

    E:\webscraping/Index
    E:\webscraping/[Part_I]_Vietnam_and_the_US_1940-1950
    E:\webscraping/[Part_II]_US_Involvement_in_the_Franco-Viet_Minh_War_1950-1954
    E:\webscraping/[Part_III]_The_Geneva_Accords
    E:\webscraping/[Part_IV_A_1]_Evolution_of_the_War_NATO_and_SEATO:_A_Comparison
    E:\webscraping/[Part_IV_A_2]_Evolution_of_the_War_Aid_for_France_in_Indochina_1950-54
    E:\webscraping/[Part_IV_A_3]_Evolution_of_the_War_US_and_France's_Withdrawal_from_Vietnam_1954-56
    E:\webscraping/[Part_IV_A_4]_Evolution_of_the_War_US_Training_of_Vietnamese_National_Army_1954-59
    E:\webscraping/[Part_IV_A_5]_Evolution_of_the_War_Origins_of_the_Insurgency
    E:\webscraping/[Part_IV_B_1]_Evolution_of_the_War_Counterinsurgency:_The_Kennedy_Commitments_and_Programs_1961
    
    and more ...
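
Note that some of the names listed above contain a colon. Windows does not allow ":" in file names, so with a target folder such as E:\webscraping those files may fail to save or may not be written as expected. One possible workaround, sketched below, is to strip the reserved characters before building the path. The helper name safe_filename and the exact set of characters it removes are assumptions for illustration, not part of the original answer:

```python
import re

# Characters that Windows does not allow in file names.
_RESERVED = r'[<>:"/\\|?*]'


def safe_filename(text: str) -> str:
    """Turn a link description into a Windows-safe file name (hypothetical helper)."""
    name = text.strip()
    name = re.sub(_RESERVED, "", name)             # drop reserved characters
    name = name.replace(",", "").replace(".", "")  # same removals as the code above
    name = re.sub(r"\s+", "_", name)               # whitespace runs become underscores
    return name


print(safe_filename("[Part II] U.S. Involvement in the Franco-Viet Minh War, 1950-1954"))
print(safe_filename("NATO and SEATO: A Comparison "))
```

In the download loop, the path could then be built as os.path.join(folder_location, safe_filename(link.get_text()) + ".pdf").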
    
