【Question title】: Downloading a PDF into a subdirectory
【Posted】: 2025-12-13 10:15:01
【Question】:

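I am trying to download PDFs into a subdirectory named after the form, so the result should look like Form W-2/Form W-2 2020. At the moment the files are simply downloaded into the same folder as the main application.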

        pdf_link = form_number.find("a")
        i += 1
        print("Downloading file: ", i)
        response = requests.get(pdf_link.get('href'))
        pdf = open(form_number.text.strip() + "-" + form_year.text.strip() + ".pdf", 'wb')
        pdf.write(response.content)
        pdf.close()
        print("File ", i, " downloaded")


【Comments】:

Tags: python pdf web-scraping beautifulsoup


【Answer 1】:

You can use os.path.join to join path components, os.path.exists to check whether a directory exists, and os.makedirs to create it.
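As a minimal sketch of those three calls in isolation, the snippet below creates a form-named subdirectory and writes a file into it. The helper name ensure_subdir, the file name, and the placeholder bytes are assumptions made for illustration; it runs inside a temporary directory so nothing is left on disk.

```python
import os
import tempfile


def ensure_subdir(base_dir, form_name):
    """Create base_dir/form_name if it is missing and return its path."""
    target = os.path.join(base_dir, form_name)
    if not os.path.exists(target):
        os.makedirs(target)
    return target


# Demonstrate inside a throwaway directory (hypothetical names throughout).
with tempfile.TemporaryDirectory() as tmp:
    folder = ensure_subdir(tmp, "Form W-2")
    pdf_path = os.path.join(folder, "Form W-2-2020.pdf")
    with open(pdf_path, "wb") as f_out:
        f_out.write(b"placeholder bytes")  # stand-in for response.content
    print(os.path.isfile(pdf_path))
```

In a real script the placeholder bytes would be replaced by the content of the downloaded response.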

This example combines these functions:

import os
import requests
from bs4 import BeautifulSoup

form = "Form W-2"
URL = (
    "https://apps.irs.gov/app/picklist/list/priorFormPublication."
    "html?resultsPerPage=200&sortColumn=sortOrder&indexOfFirstRow=0&criteria=formNumber&value="
    "" + form + "&isDescending=false"
)
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
for table_element in soup.select(".picklist-dataTable tr:has(td)"):
    form_number = table_element.find("td", class_="LeftCellSpacer")
    u = form_number.a["href"]
    path = os.path.join(form, u.split("/")[-1])

    if not os.path.exists(form):
        os.makedirs(form)

    print(f"Saving {u=} to {path=}")
    with open(path, "wb") as f_out:
        f_out.write(requests.get(u).content)

This prints:

Saving u='https://www.irs.gov/pub/irs-prior/fw2p--1990.pdf' to path='Form W-2/fw2p--1990.pdf'
Saving u='https://www.irs.gov/pub/irs-prior/fw2p--1989.pdf' to path='Form W-2/fw2p--1989.pdf'
Saving u='https://www.irs.gov/pub/irs-prior/fw2p--1988.pdf' to path='Form W-2/fw2p--1988.pdf'

...and so on.

Each document is saved into the Form W-2 directory.


Edit: to save the files under a different file name:

import os
import requests
from bs4 import BeautifulSoup

form = "Form W-2"
URL = (
    "https://apps.irs.gov/app/picklist/list/priorFormPublication."
    "html?resultsPerPage=200&sortColumn=sortOrder&indexOfFirstRow=0&criteria=formNumber&value="
    "" + form + "&isDescending=false"
)
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")
for table_element in soup.select(".picklist-dataTable tr:has(td)"):
    form_number = table_element.find("td", class_="LeftCellSpacer")
    form_year = table_element.find("td", class_="EndCellSpacer")
    u = form_number.a["href"]
    p = "{}-{}.pdf".format(
        form_number.get_text(strip=True), form_year.get_text(strip=True)
    )

    path = os.path.join(form, p)

    if not os.path.exists(form):
        os.makedirs(form)

    print(f"Saving {u=} to {path=}")
    with open(path, "wb") as f_out:
        f_out.write(requests.get(u).content)

This saves the files as:

Saving u='https://www.irs.gov/pub/irs-prior/fw2p--1990.pdf' to path='Form W-2/Form W-2 P-1990.pdf'
Saving u='https://www.irs.gov/pub/irs-prior/fw2p--1989.pdf' to path='Form W-2/Form W-2 P-1989.pdf'
Saving u='https://www.irs.gov/pub/irs-prior/fw2p--1988.pdf' to path='Form W-2/Form W-2 P-1988.pdf'
Saving u='https://www.irs.gov/pub/irs-prior/fw2p--1987.pdf' to path='Form W-2/Form W-2 P-1987.pdf'
Saving u='https://www.irs.gov/pub/irs-prior/fw2p--1986.pdf' to path='Form W-2/Form W-2 P-1986.pdf'

...

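【Comments】:

  • Thank you very much.
  • My only problem is that I am trying to save the PDFs in the format Form W-2/Form W-2 - 2020.pdf. How can I do that?
  • Nice approach! But be careful when you download these files more than once: every file gets overwritten, so it is better to add a check before writing: if not os.path.exists(path):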
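
The overwrite check suggested in the last comment could look roughly like the sketch below. The helper save_if_missing and its fetch argument are hypothetical names, not part of the answer above; fetch is a zero-argument callable that returns bytes, so the download only runs when the file is actually missing. The demonstration uses placeholder bytes instead of a real request and the " - " file name format asked about in the comments.

```python
import os
import tempfile


def save_if_missing(path, fetch):
    """Write fetch() to path unless the file already exists.

    Returns True if the file was written, False if it was skipped.
    """
    if os.path.exists(path):
        return False
    # Create the parent directory on demand; "." covers a bare file name.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f_out:
        f_out.write(fetch())
    return True


# Offline demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Form W-2", "Form W-2 - 2020.pdf")
    print(save_if_missing(path, lambda: b"first"))   # True: file written
    print(save_if_missing(path, lambda: b"second"))  # False: file kept as-is
```

Inside the answer's loop, the call would be something like save_if_missing(path, lambda: requests.get(u).content), which also avoids the network request for files that are already on disk.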