Extract text from pdf file using pdfplumber

【Question title】: Extract text from pdf file using pdfplumber
【Posted】: 2021-06-22 01:58:18
【Question】:

I want to extract text from PDF files, and I tried:

directory = r'C:\Users\foo\folder'

for x in os.listdir(directory):
    print(x)
    x = x.replace('.pdf','')
    filename = os.fsdecode(x)
    print(x)

    if filename.endswith('.pdf'):
        with pdfplumber.open(x) as pdf1:
            page1 = pdf1.pages[0]
            text1 = page1.extract_text()
            print(text1)

It printed:

20170213091544343.pdf
20170213091544343

Seeing that the filename is 20170213091544343, I added:


    else:
        with pdfplumber.open(x) as pdf1:
                page1 = pdf1.pages[0]
                text1 = page1.extract_text()
                print(text1)

to read the file in case the filename has no .pdf, and got this error:


20170213091544343.pdf
20170213091544343
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-34-e370b214f9ba> in <module>
     16 
     17     else:
---> 18         with pdfplumber.open(x) as pdf1:
     19                 page1 = pdf1.pages[0]
     20                 text1 = page1.extract_text()

C:\Python38\lib\site-packages\pdfplumber\pdf.py in open(cls, path_or_fp, **kwargs)
     56     def open(cls, path_or_fp, **kwargs):
     57         if isinstance(path_or_fp, (str, pathlib.Path)):
---> 58             fp = open(path_or_fp, "rb")
     59             inst = cls(fp, **kwargs)
     60             inst.close = fp.close

FileNotFoundError: [Errno 2] No such file or directory: '20170213091544343'

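【Comments】:

os.listdir() returns only the filenames. You need the directory too.
Hi @JohnGordon, can you elaborate?
You have to use the full path os.path.join(directory, x), and you have to keep the extension .pdf - open C:\Users\foo\folder\20170213091544343.pdf instead of 20170213091544343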

Tags: python pdf pdfplumber

【Solution 1】:

os.listdir() gives only the filename; you have to join it with the directory

for filename in os.listdir(directory):

    fullpath = os.path.join(directory, filename)

    #print(fullpath)
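
To see why the join is needed, here is a small self-contained sketch. It builds a throwaway folder with tempfile (the folder and the file name example.pdf are made up for illustration, as is the helper names_and_paths) and pairs each bare name from os.listdir() with its full path:

```python
import os
import tempfile


def names_and_paths(directory):
    """Pair each bare name returned by os.listdir() with its full path."""
    return [(name, os.path.join(directory, name)) for name in os.listdir(directory)]


# Hypothetical demo folder; any existing directory would do.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "example.pdf"), "wb").close()

    for name, fullpath in names_and_paths(demo_dir):
        print(name)                      # bare file name, no directory part
        print(os.path.isfile(fullpath))  # the joined path points at the real file
```

Opening the bare name only works if the current working directory happens to be that folder, which is why the code in the question failed with FileNotFoundError.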

You also have to keep the extension .pdf

import os
import pdfplumber

directory = r'C:\Users\foo\folder'

for filename in os.listdir(directory):
    if filename.endswith('.pdf'):

        fullpath = os.path.join(directory, filename)
        #print(fullpath)

        #all_text = ""

        with pdfplumber.open(fullpath) as pdf:
            for page in pdf.pages:
                text = page.extract_text()
                print(text)
                #all_text += text

        #print(all_text)
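
The same idea can be sketched with pathlib, which yields full paths directly. This is only a variant under a few assumptions: the helper names find_pdfs and extract_all_text are invented here, the extension check is made case-insensitive, and `or ""` is used because extract_text() may give back nothing for a page without a text layer (depending on the pdfplumber version, that can be None or an empty string).

```python
from pathlib import Path


def find_pdfs(directory):
    """Return the full paths of the PDF files directly inside `directory`, sorted."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def extract_all_text(pdf_path):
    """Join the text of every page of one PDF into a single string."""
    # Imported here so find_pdfs() can be tried without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # `or ""` guards against pages that have no extractable text.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


# Example use, with the directory from the question:
# for pdf_path in find_pdfs(r"C:\Users\foo\folder"):
#     print("===", pdf_path.name, "===")
#     print(extract_all_text(pdf_path))
```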

Or with page numbers:

        with pdfplumber.open(fullpath) as pdf:
            for number, page in enumerate(pdf.pages, 1):
                print('--- page', number, '---')
                text = page.extract_text()
                print(text)

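【Comments】:

It needs some OCR program - for example tesseract (originally developed at HP, later sponsored by Google) and the pytesseract module - to convert the image to text.
BTW: you check only the first page - if the text is on a later page then you should use a for-loop
If you want to use 'the_complete_name_of_the_file_i_want.pdf' then you don't need to check endswith() and you don't need os.listdir() - just use open('directory/the_complete_name_of_the_file_i_want.pdf') directly
I don't remember whether this needs ImageMagick - first install tesseract and test it without Python, directly in the console: tesseract.exe file.pdf
It may need ImageMagick to convert every page of the PDF into an image, because tesseract may need images. On ImageMagick's page you can find many installers for Windows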
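
The OCR suggestion in the comments above could look roughly like the sketch below. It is not from the thread: the function name page_text_with_ocr_fallback and its ocr and resolution parameters are made up for illustration, and it assumes that the tesseract program and the pytesseract module are installed. It tries the page's text layer first and only renders the page to an image for OCR when no text comes back.

```python
def page_text_with_ocr_fallback(page, ocr=None, resolution=300):
    """Return the page's text layer, or OCR the rendered page if it has none."""
    text = page.extract_text()
    if text and text.strip():
        return text

    if ocr is None:
        # Imported lazily so OCR is only required for pages that need it.
        import pytesseract
        ocr = pytesseract.image_to_string

    # to_image() renders the page; .original is the underlying PIL image.
    image = page.to_image(resolution=resolution).original
    return ocr(image)
```

It would be called on each page inside the `for page in pdf.pages:` loop from the solution, in place of page.extract_text(). Rendering with to_image() may itself need extra system software, depending on the pdfplumber version, which matches the ImageMagick remark above.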