Reading a table from a pdf file by row and not by column: Answers

【Question Title】: Reading a table from a pdf file by row and not by column
【Posted】: 2020-10-21 01:26:06
【Question Description】:

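I am trying to extract all of the text from PDF files. The PDFs are online and they include tables. The code below works, but when it reaches a table in the PDF, the text in the table is printed column by column rather than row by row, which messes up my data. Is there a way to have tables read by row without having to process each table separately? I still need all of the text in the PDF printed together. I am using Python.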

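import io
import urllib.request

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def getTextFromPDF(url):
    # Download the PDF and wrap the bytes in a file-like object
    # (renamed from `open`, which shadowed the builtin)
    pdf_bytes = urllib.request.urlopen(url).read()
    memoryFile = io.BytesIO(pdf_bytes)

    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    with memoryFile as fh:
        # Feed every page through the interpreter; text accumulates in fake_file_handle
        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):
            page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()

    # close open handles
    converter.close()
    fake_file_handle.close()
    return text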

【Question Discussion】:

Tags: python pdf datatables pdf-scraping

【Solution 1】:

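This answer is for anyone who encounters PDFs that contain images and needs to use OCR. I could not find a workable off-the-shelf solution; nothing gave me the accuracy I needed.

Here are the steps I found to work.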

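1. Use pdfimages from https://poppler.freedesktop.org/ to turn the pages of the PDF into images.

2. Use Tesseract to detect rotation and ImageMagick mogrify to fix it.

3. Use OpenCV to find and extract tables.

4. Use OpenCV to find and extract each cell from the table.

5. Use OpenCV to crop and clean up each cell so that there is no noise to confuse the OCR software.

6. Use Tesseract to OCR each cell.

7. Combine the extracted text of each cell into the format you need.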
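
The last step, combining the per-cell text, is where row-versus-column ordering gets decided. As a minimal sketch (not the package's actual code), assuming each OCR'd cell comes with the top-left corner of its bounding box, the cells can be grouped into rows by their y coordinate and sorted by x within each row. The names `cells_to_rows` and `rows_to_csv` and the `row_tolerance` parameter are hypothetical, chosen for illustration.

```python
import csv
import io


def cells_to_rows(cells, row_tolerance=10):
    """Group OCR'd cells into rows (top to bottom), each sorted left to right.

    `cells` is a list of (x, y, text) tuples, where (x, y) is the top-left
    corner of the cell's bounding box in image pixels (an assumed input shape).
    """
    rows = []
    for x, y, text in sorted(cells, key=lambda c: c[1]):
        # Start a new row when this cell sits clearly below the current row.
        if not rows or y - rows[-1]["y"] > row_tolerance:
            rows.append({"y": y, "cells": []})
        rows[-1]["cells"].append((x, text))
    # Within each row, order the cells by their x coordinate.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def rows_to_csv(rows):
    """Serialize a list of rows to CSV text."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Cells arrive in column order, as they might from a column-wise extraction.
cells = [(10, 12, "Name"), (10, 52, "Ada"), (120, 10, "Age"), (120, 50, "36")]
table = cells_to_rows(cells)
print(rows_to_csv(table))
```

The tolerance absorbs small vertical jitter between cells of the same row; it would need tuning to the scan resolution and row height.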

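I wrote a Python package with modules that can help with these steps.

Repo: https://github.com/eihli/image-table-ocr

Docs and source: https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html

Some of the steps don't require code; they take advantage of external tools such as pdfimages and tesseract. I'll provide some brief examples for the few steps that do require code.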
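
For the steps that rely on external tools, a thin Python wrapper might look like the sketch below. It assumes `pdfimages`, `tesseract`, and `mogrify` are installed and on the PATH. The helper names are hypothetical, and where Tesseract prints its orientation report (the `Rotate: N` line) can differ between versions, so the sketch reads both stdout and stderr.

```python
import re
import subprocess


def parse_osd_rotation(osd_text):
    """Return the angle from a 'Rotate: N' line in Tesseract's OSD report, or 0."""
    match = re.search(r"^Rotate:\s*(\d+)", osd_text, flags=re.MULTILINE)
    return int(match.group(1)) if match else 0


def pdf_to_images(pdf_path, prefix):
    """Extract the images embedded in a PDF as PNG files named after `prefix`."""
    subprocess.run(["pdfimages", "-png", pdf_path, prefix], check=True)


def fix_rotation(image_path):
    """Detect page rotation with Tesseract and correct it in place with mogrify."""
    # --psm 0 asks Tesseract for orientation and script detection only.
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "--psm", "0"],
        capture_output=True, text=True, check=True,
    )
    # Assumption: the report may land on either stream depending on the version.
    angle = parse_osd_rotation(result.stdout + result.stderr)
    if angle:
        # mogrify overwrites the file, so keep a copy if the original matters.
        subprocess.run(["mogrify", "-rotate", str(angle), image_path], check=True)
    return angle
```

Keeping the parsing separate from the subprocess calls makes the fragile part, reading the report, easy to check without the tools installed.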

Finding tables: this link was a good reference while figuring out how to find the table. https://answers.opencv.org/question/63847/how-to-extract-tables-from-an-image/

【Discussion】:

Why is OCR needed?