Python count number of pages in multiple .pdf files (faster): answers

【Question title】: Python count number of pages in multiple .pdf files (faster)
【Posted】: 2023-02-20 12:41:23
【Question description】:

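I have a mini application that counts the pages of PDF files. When I run it on my local machine (my PC), it is very fast. The problem is that when I give it a path on a mapped server drive where the files live (for example Z:\scan_easy\myFolder, where Z is the mapped storage HDD and myFolder is the actual input path for the application), the application runs slowly. I would like to know whether there is a way to speed this up. Below is the folder structure in which the actual PDF files are located.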

myFolder
    Box1
        Box1File1
               pdf1
               pdf2
               pdf3
               ....
               pdf30
        Box1File2
               pdf1
               pdf2
               ....
               pdf19
    Box2
        Box2File1
               pdf1
               pdf2
               pdf3
               ....
               pdf25
        Box2File2
               pdf1
               pdf2
               ....
               pdf13

In total there are 13 box folders. Spread across them are 31 file folders, and spread across those 31 folders are 611 PDF files.

My application is as follows:

import PyPDF4 as pdy
import os
import pandas as pd
import tkinter as tk
import tkinter.messagebox as tkm
from datetime import datetime

POINT = 0.35277

def numberOfPages(folder):
    file_list = []
    my_list= []
    total_pages = 0
    no_of_files = 0
    for (dirpath, dirnames, filenames) in os.walk(folder):
        file_list += [os.path.join(dirpath, file) for file in filenames]
    if not file_list:
        tkm.showwarning(title="Verificari Formate",message="Your path is not correct or it's empty!")
    else:
        for item in file_list:
            if item.endswith(".pdf") or item.endswith(".PDF"):
                no_of_files += 1
                reader = pdy.PdfFileReader(item)
                no_of_pages = reader.getNumPages()
                total_pages += no_of_pages
                my_list.append((item, no_of_pages))
        excel = pd.DataFrame(my_list,columns=("File","No. Of Pages"))
        now = datetime.now()
        raport_name = now.strftime("%d.%m.%Y %H.%M.%S")
        excel.to_excel(excel_writer=f"{folder}\\{raport_name}.xlsx",sheet_name="Formate",index=False)
        tkm.showinfo(title="Verificari Formate",message=f"Report Generated successfully! You have {no_of_files} "
                                                        f"files and {total_pages} pages")
        entrybox.delete(0,"end")



app = tk.Tk()

app.geometry("1000x200")
app.title("Verificari Formate")

frame = tk.Frame(app)
frame.pack(side="bottom")

lbl_title = tk.Label(app, text="Paste path in the box below",
                     font=("Calibri", 28, "bold"))
lbl_title.pack()

entrybox = tk.Entry(app, font=("Calibri", 20), width= 70)
entrybox.pack(pady=20)


butt_pages = tk.Button(frame, text="No. Of Pages", font=("Calibri", 18, "bold"),
                       command=lambda: numberOfPages(entrybox.get()))
butt_pages.pack(side="right")

app.mainloop()

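Is there a way to speed up the application? (I think copying the PDF files into a single folder would speed it up a little.)
Are there modules other than PyPDF4 that can do this job faster?
FYI: it took 12 minutes 53 seconds to get the result for these 611 files, which have 8632 pages in total (the path given was Z:\scan_easy\myFolder). I have tried running my application locally on the server, but it does not run on Windows Server 2008 (I built it for Windows with auto-py-to-exe). I want to use it wherever I need to count the pages of thousands of PDFs; sometimes I have 80k PDF files...

PS: I have a similar application written in C# by someone else, and it does the same thing on the same path used above in about 7 minutes. :(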

【Question discussion】:

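I have a feeling that PyPDF4 downloads the whole file in the background before counting the pages. Watch your network activity to confirm. One alternative I can think of is to install pdfinfo on the server, trigger that utility there, and write its output to a .txt file. Your Python code can then read that .txt file instead of reading the original PDFs.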
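
The pdfinfo idea in the comment above could be sketched from Python roughly as follows. This is only a sketch, assuming the pdfinfo command-line tool (shipped with Poppler and Xpdf) is installed and on PATH on the machine that runs it. The helper names parse_page_count and count_pages_with_pdfinfo are hypothetical, not part of any library. The parser reads the "Pages:" line of pdfinfo's text output, so it should work equally on output that was saved to a .txt file on the server.

```python
import re
import subprocess

# Matches the "Pages:" line in pdfinfo's text output, e.g. "Pages:          12".
_PAGES_RE = re.compile(r"^Pages:\s+(\d+)\s*$", re.MULTILINE)


def parse_page_count(pdfinfo_output):
    """Return the page count found in pdfinfo output, or None if there is no Pages line."""
    match = _PAGES_RE.search(pdfinfo_output)
    return int(match.group(1)) if match else None


def count_pages_with_pdfinfo(pdf_path):
    """Run pdfinfo on one file. Assumption: the pdfinfo executable is on PATH."""
    result = subprocess.run(
        ["pdfinfo", pdf_path],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError if pdfinfo fails
    )
    return parse_page_count(result.stdout)
```

Whether this is faster than PyPDF4 over a mapped drive depends on where the time actually goes, so it is worth timing on a small subfolder first.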

Tags: python pandas pdf

【Solution 1】:

Use the .pages attribute

from PyPDF2 import PdfReader

reader = PdfReader("US_Declaration.pdf")
# len(reader.pages) is the number of pages in the document
readpdf = len(reader.pages)
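
Applied to the folder tree from the question, the same call could be wrapped along these lines. This is a hedged sketch rather than a measured fix: the function names (pypdf2_page_count, find_pdfs, count_all_pages) and the max_workers value are hypothetical, and the thread pool is an added assumption that reads over the mapped drive are latency-bound. If parsing dominates instead, threads may not help much.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def pypdf2_page_count(path):
    """Page count via PyPDF2's PdfReader, imported lazily so the rest runs without it."""
    from PyPDF2 import PdfReader
    return len(PdfReader(path).pages)


def find_pdfs(folder):
    """Collect every .pdf path under folder, matching the extension case-insensitively."""
    return [
        os.path.join(dirpath, name)
        for dirpath, _dirnames, filenames in os.walk(folder)
        for name in filenames
        if name.lower().endswith(".pdf")
    ]


def count_all_pages(folder, count_pages=pypdf2_page_count, max_workers=8):
    """Return [(path, pages), ...] for every PDF under folder, read concurrently."""
    paths = find_pdfs(folder)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so each count lines up with its path.
        counts = list(pool.map(count_pages, paths))
    return list(zip(paths, counts))
```

The resulting list of (path, pages) tuples has the same shape as my_list in the question, so it could be passed to pd.DataFrame in the same way. The count_pages parameter lets a different counter be swapped in for comparison.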

【Discussion】: