【Question title】: Ensure loop runs through every file even when errors are raised
【Posted】: 2021-09-02 15:38:39
【Question】:

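I am looping through a bunch of PDFs in a folder, parsing their contents and appending the results to a list. It works for a subset of the PDF files. I don't want to manually remove some PDFs, run the code, add some back and run it again until I find the faulty ones. Since some PDFs cannot be opened or may have corrupted content, I did the following to keep the loop running: I pass check_extractable=True to PDFPage.get_pages (pdfminer should raise PDFTextExtractionNotAllowed if a PDF is not extractable), which should stop it from trying to open a PDF it cannot actually open.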

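Question: what do I need to do so that the code keeps running even when some PDFs cannot be opened or have no content (assuming that is what raises the error at this particular point in the code)?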

import pdfminer
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
import os
import io
from io import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter, PDFPageAggregator
from pdfminer.layout import (LAParams, LTTextBox, LTFigure, LTImage,
                             LTTextLine, LTTextContainer, LTChar, LTTextBoxHorizontal)
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser, PDFSyntaxError

directory = 'C:/Users/'
data = []
for file in os.listdir(directory):
    if not file.endswith(".pdf"):
        continue
    fake_file_handle = io.StringIO()


    with open(os.path.join(directory, file), 'rb') as fh:
        resource_manager = PDFResourceManager()
        laparams = LAParams(line_margin = 0.6)
        device = PDFPageAggregator(resource_manager, laparams = laparams)
        page_interpreter = PDFPageInterpreter(resource_manager, device)

        positions = []
        raw_text = []
        for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
            page_interpreter.process_page(page)
            text = fake_file_handle.getvalue()
            layout = device.get_result()
            for lobj in layout:
                
                if isinstance(lobj, LTTextContainer) or isinstance(lobj, LTTextBox) or isinstance(lobj, pdfminer.layout.LTTextBoxHorizontal):
                    coord, word = int(lobj.bbox[1]), lobj.get_text().strip()
                    raw_text.append([coord, word])

                    for text_line in lobj:
                        for character in text_line:
                            if isinstance(character, LTChar):
                                if character.matrix[0]>0 :
                                    position = character.bbox 
                        positions.append(position)

                # if it's a container, recurse
                elif isinstance(lobj, LTFigure):
                    pass

        # extract elements below y0=781 and above y0=57
        text_pos = []
        maxFontpos = 780
        minFontpos = 58
        for coord, word in raw_text:
            if coord <= maxFontpos and coord >= minFontpos:
                text_pos.append(word)
            else:
                pass
 
        try:
            wap = text_pos[0]
        except:
            pass
        
    data.append([text_pos, wap])
    fake_file_handle.close()

The specific error is raised at:

---> 28                         for character in text_line:
     29                             if isinstance(character, LTChar):
     30                                 if character.matrix[0]>0 :

TypeError: 'LTChar' object is not iterable

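【Comments】:

  • If you don't care why some files fail, just wrap the failing part in a try.
  • You check if isinstance(lobj, LTTextContainer), but that check also succeeds when lobj is an LTTextLine, in which case it directly contains characters. (See github.com/pdfminer/pdfminer.six/blob/…)
  • Leonard, your answer made me curious about why it fails. I am still evaluating what it means if it only contains characters; after all, that is what I parse and append. I also did what Garrit said, and now it returns the data for all 1995 files except 2.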
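
The TypeError in the question fits the second comment: when lobj is already a line, `for text_line in lobj` yields characters, and the inner loop then tries to iterate a character. A minimal sketch of a depth-agnostic walk is shown here; `iter_leaves`, `Char` and `Container` are hypothetical names for illustration, not pdfminer classes. With pdfminer one would presumably pass LTChar as `leaf_type`, but that pairing is an assumption and has not been run against real PDFs.

```python
from typing import Any, Iterator, Type


def iter_leaves(node: Any, leaf_type: Type) -> Iterator[Any]:
    """Yield every `leaf_type` instance under `node`, at any nesting depth."""
    if isinstance(node, leaf_type):
        yield node
        return
    if isinstance(node, (str, bytes)):
        # Strings iterate into more strings; never descend into them.
        return
    try:
        children = iter(node)
    except TypeError:
        # Neither a wanted leaf nor a container: nothing to descend into.
        return
    for child in children:
        yield from iter_leaves(child, leaf_type)


# Hypothetical stand-ins for a layout tree (NOT pdfminer classes).
class Char:
    def __init__(self, text: str) -> None:
        self.text = text


class Container(list):
    """Plays the role of both a line (holds chars) and a box (holds lines)."""


line = Container([Char("a"), Char("b")])
box = Container([line, Container([Char("c")])])

# The same call works from a box (two levels) or from a line (one level).
from_box = [c.text for c in iter_leaves(box, Char)]
from_line = [c.text for c in iter_leaves(line, Char)]
```

Because the walk checks the type of each node before descending, it does not matter whether the top-level object is a box or a line.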

Tags: python exception error-handling pdfminer pdf-parsing


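【Solution 1】:

If this is just a quick-and-dirty script, I would suggest wrapping the whole with block in a generic try/except. Normally you don't want to blindly catch exceptions without specifying the type you are looking for, in case a different exception or error occurs that you did not anticipate, but in this case I think it's fine: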

import io
import os

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import (LAParams, LTChar, LTFigure, LTTextBox,
                             LTTextBoxHorizontal, LTTextContainer)
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed

directory = 'C:/Users/'
data = []
for file in os.listdir(directory):
    if not file.endswith(".pdf"):
        continue
    fake_file_handle = io.StringIO()

    try:
        with open(os.path.join(directory, file), 'rb') as fh:
            resource_manager = PDFResourceManager()
            laparams = LAParams(line_margin = 0.6)
            device = PDFPageAggregator(resource_manager, laparams = laparams)
            page_interpreter = PDFPageInterpreter(resource_manager, device)

            positions = []
            raw_text = []
            for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
                page_interpreter.process_page(page)
                text = fake_file_handle.getvalue()
                layout = device.get_result()
                for lobj in layout:

                    if isinstance(lobj, (LTTextContainer, LTTextBox, LTTextBoxHorizontal)):
                        coord, word = int(lobj.bbox[1]), lobj.get_text().strip()
                        raw_text.append([coord, word])

                        for text_line in lobj:
                            position = None  # reset so a stale value is never appended
                            for character in text_line:
                                if isinstance(character, LTChar):
                                    if character.matrix[0] > 0:
                                        position = character.bbox  # font position
                            if position is not None:
                                positions.append(position)

                    # if it's a container, recurse
                    elif isinstance(lobj, LTFigure):
                        pass

            # extract elements below y0=781 and above y0=57
            text_pos = []
            maxFontpos = 780
            minFontpos = 58
            for coord, word in raw_text:
                if minFontpos <= coord <= maxFontpos:
                    text_pos.append(word)

            # None instead of a leftover value from the previous file
            wap = text_pos[0] if text_pos else None
    except Exception:
        continue # Move on to next loop iteration

    data.append([text_pos, wap])
    fake_file_handle.close()
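
A silent `continue` keeps the loop going but hides which files failed and why. As a sketch of one possible variation (not part of the original answer), the per-file work can be moved into a function and the failures recorded. `process_all` and `fake_parse` are hypothetical names; `fake_parse` only stands in for the pdfminer parsing so the example runs without any PDFs.

```python
import logging
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)


def process_all(
    paths: Iterable[str], parse: Callable[[str], T]
) -> Tuple[List[T], Dict[str, str]]:
    """Run `parse` on every path; collect results and per-file failures."""
    results: List[T] = []
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            results.append(parse(path))
        except Exception as exc:  # deliberately broad: one bad file must not stop the run
            failures[path] = f"{type(exc).__name__}: {exc}"
            logger.warning("Skipping %s (%s)", path, failures[path])
    return results, failures


def fake_parse(path: str) -> str:
    """Hypothetical stand-in for the real PDF parsing step."""
    if "broken" in path:
        raise TypeError("'LTChar' object is not iterable")
    return path.upper()


results, failures = process_all(["a.pdf", "broken.pdf", "c.pdf"], fake_parse)
```

After the run, `failures` maps each skipped file to its exception type and message, so the handful of problem PDFs can be inspected later without re-running everything.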

【Comments】:
