How do I extract all of the text from a PDF using indexing: answers

【Question title】: How do I extract all of the text from a PDF using indexing
【Posted】: 2020-07-09 01:25:51
【Question description】:

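I'm new to Python and to coding in general. I'm trying to write a program that OCRs a directory of PDFs and then extracts the text, so that I can pick out specific things later. However, I can't get pdfplumber to extract all of the text from all of the pages. You can index from the start to the end, but if the end is unknown, it breaks because the index is out of range.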

import ocrmypdf
import os
import requests
import pdfplumber
import re
import logging
import sys
import PyPDF2

## test folder C:\Users\adams\OneDrive\Desktop\PDF

user_direc = input("Enter the path of your files: ") 

#walks the path and prints out each PDF in the 
#OCRs the documents and skips any OCR'd pages.


for dir_name, subdirs, file_list in os.walk(user_direc):
    logging.info(dir_name + '\n')
    os.chdir(dir_name)
    for filename in file_list:
        file_ext = os.path.splitext(filename)[0--1]
        if file_ext == '.pdf':
            full_path = dir_name + '/' + filename
            print(full_path)
            # run OCR inside the loop so every PDF found is processed, not only the last one
            result = ocrmypdf.ocr(filename, filename, skip_text=True, deskew=True, optimize=1)
            logging.info(result)

#the next step is to extract the text from each individual document and print

directory = os.fsencode(user_direc)
    
for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith('.pdf'):
        with pdfplumber.open(file) as pdf:
            page = pdf.pages[0]
            text = page.extract_text()
            print(text)

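As written, this only gets the text from the first page of each PDF. I want to extract all of the text from each PDF, but pdfplumber breaks if my index is too large, and I don't know how many pages each PDF has. I tried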

page = pdf.pages[0--1]

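but that breaks as well. I couldn't find a workaround with PyPDF2 either. I apologize if this code is sloppy or unreadable. I tried to add comments to explain what I'm doing.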

【Question comments】:

Tags: python pdf pypdf2

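【Solution 1】:

The pdfplumber git page says that pdfplumber.open returns an instance of the pdfplumber.PDF class.

That instance has a pages property, which is a list of pdfplumber.Page instances, one for each page loaded from your PDF. Looking at your code, if you do: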

total_pages = len(pdf.pages)

you should get the total number of pages in the currently loaded PDF.
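
If an index is really needed, the page count gives the upper bound for the loop. The sketch below uses a plain list of strings as a stand-in for pdf.pages (the names pages, last_index and visited are illustrative only). Assuming pdf.pages is a list as described above, the same pattern should apply to an open PDF.

```python
# Stand-in for pdf.pages: any list behaves the same way for len() and indexing.
pages = ["page one", "page two", "page three"]

total_pages = len(pages)      # number of pages
last_index = total_pages - 1  # valid indexes run from 0 to len - 1

# Visit every page by index without running past the end.
visited = [pages[i] for i in range(total_pages)]

# Indexing one past the end raises IndexError, which is the failure
# described in the question.
try:
    pages[total_pages]
    out_of_range = False
except IndexError:
    out_of_range = True

print(total_pages, last_index, out_of_range)
```

Because range(total_pages) stops at the last valid index, the page count never has to be known in advance.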

To combine all of a PDF's text into one big string, you could try a "for in" loop. Try changing your existing code from:

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith('.pdf'):
        with pdfplumber.open(file) as pdf:
            page = pdf.pages[0]
            text = page.extract_text()
            print(text)

to:

for file in os.listdir(directory):
    filename = os.fsdecode(file)
    if filename.endswith('.pdf'):
        all_text = '' # new line
        with pdfplumber.open(file) as pdf:
            # page = pdf.pages[0] - comment out or remove line
            # text = page.extract_text() - comment out or remove line
            for pdf_page in pdf.pages:
                # extract_text() may return None for a page with no text in some pdfplumber versions
                single_page_text = pdf_page.extract_text() or ''
                print(single_page_text)
                # separate each page's text with newline
                all_text = all_text + '\n' + single_page_text
            print(all_text)
            # print(text) - comment out or remove line

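Instead of using an index value such as pdf.pages[0] to access individual pages, use for pdf_page in pdf.pages. The loop stops after the last page without raising an exception, so you don't have to worry about using an out-of-range index.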
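
The same idea can be wrapped in a small helper. This is only a sketch: extract_all_text, FakePage and FakePDF are hypothetical names introduced here, and the stand-in classes exist so the example runs without a real PDF. It assumes only what is described above, namely that the object has a pages list whose items offer extract_text().

```python
def extract_all_text(pdf):
    """Return the text of every page in `pdf`, separated by newlines.

    `pdf` is assumed to expose a `pages` list whose items have an
    `extract_text()` method, as described for pdfplumber above.
    """
    parts = []
    for page in pdf.pages:
        # Treat a page with no extractable text (None or '') as empty.
        parts.append(page.extract_text() or "")
    return "\n".join(parts)


# Hypothetical stand-ins so the sketch runs without a real PDF.
class FakePage:
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


class FakePDF:
    def __init__(self, texts):
        self.pages = [FakePage(t) for t in texts]


combined = extract_all_text(FakePDF(["first", None, "third"]))
print(combined)
```

With pdfplumber, the call would presumably go inside the existing with pdfplumber.open(...) as pdf: block, passing pdf to the helper.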

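【Comments】:

I'm not sure where to include the all_text suggestion. The len function gives the length of the PDF, but '''pdf.pages[0]''' needs to contain an integer. I can't slice it, and it won't accept a tuple.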
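
On the indexing points raised here: if pages is an ordinary list, as the answer describes, then negative indexes and slices should work, while a tuple index such as pages[0, 1] is rejected. The expression 0--1 from the question is plain arithmetic (0 minus -1), so it selects index 1, which does not exist in a one-page PDF. The sketch below uses a plain list as a stand-in for pdf.pages, and the variable names are illustrative only.

```python
pages = ["p1", "p2", "p3"]  # stand-in for pdf.pages

second_index = 0--1          # parsed as 0 - (-1), i.e. 1
last_page = pages[-1]        # negative index: last element
every_page = pages[:]        # full slice: a copy of all elements
first_two = pages[0:2]       # slice: elements 0 and 1

# A slice past the end is clipped rather than raising an error.
clipped = pages[1:100]

# A tuple is not a valid list index.
try:
    pages[0, 1]
    tuple_ok = True
except TypeError:
    tuple_ok = False

print(second_index, last_page, first_two, clipped, tuple_ok)
```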

【Solution 2】:

If you get this error when trying the code above:

fp = open(path_or_fp, "rb") FileNotFoundError: [Errno 2] No such file or directory:

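This happens because os.listdir() returns only file names, relative to the directory you listed. To open those files, you need to join each name with the directory to rebuild its full path.

To resolve this error, try the following code: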
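
A quick way to see the difference is to list a throwaway directory. This sketch uses only the standard library, and the file name sample.pdf is made up for the demonstration.

```python
import os
import tempfile

# Create a throwaway directory with one empty file named like a PDF.
with tempfile.TemporaryDirectory() as directory:
    open(os.path.join(directory, "sample.pdf"), "wb").close()

    names = os.listdir(directory)  # bare names only
    full_paths = [os.path.join(directory, n) for n in names]

    bare_is_absolute = os.path.isabs(names[0])
    full_exists = os.path.exists(full_paths[0])
    print(names, bare_is_absolute, full_exists)
```

The bare name only resolves if the current working directory happens to be the listed directory, which is why opening it elsewhere raises FileNotFoundError.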

import os
import pdfplumber

directory = r'C:\Users\foo\folder'

for filename in os.listdir(directory):
    if filename.endswith('.pdf'):
        fullpath = os.path.join(directory, filename)
        #print(fullpath)
        all_text = ""
        with pdfplumber.open(fullpath) as pdf:
            for page in pdf.pages:
                # extract_text() may return None for a page with no text in some pdfplumber versions
                text = page.extract_text() or ''
                #print(text)
                all_text += '\n' + text
        print(all_text)
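
The loop above covers a single folder. Since the question walks subfolders with os.walk, here is one possible sketch for collecting full paths recursively. find_pdfs is a hypothetical helper name, and the file names in the demonstration are invented.

```python
import os
import tempfile


def find_pdfs(root):
    """Yield the full path of every .pdf file under `root`, recursively."""
    for dir_name, _subdirs, file_list in os.walk(root):
        for filename in sorted(file_list):
            # splitext(...)[1] is the extension; lower() accepts '.PDF' too.
            if os.path.splitext(filename)[1].lower() == ".pdf":
                yield os.path.join(dir_name, filename)


# Demonstration on a temporary tree (hypothetical file names).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for rel in ("a.pdf", "notes.txt", os.path.join("sub", "B.PDF")):
        open(os.path.join(root, rel), "wb").close()

    found = [os.path.relpath(p, root) for p in find_pdfs(root)]
    print(found)
```

Each yielded path already includes its directory, so it can be opened directly without os.chdir.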

Reference: Extract text from pdf file using pdfplumber
