Extract text in a rectangle from pdf - Python answers

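【Question title】: Extract text in a rectangle from pdf - Python
【Posted】: 2020-05-28 20:58:39
【Question description】:

I need to extract the text inside a rectangle from a PDF. I tried several approaches but could not get that specific text. For example, I tested the PyMuPDF, pdfplumber, tabula, camelot, and pdftables packages. With PyMuPDF, the approach I found asks for a starting word and an ending word in order to extract the text. As far as I understand, the remaining packages only extract line and curve information, not the text.

I want to get the text from a rectangle in the PDF without providing any start or end text.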

https://drive.google.com/file/d/1wCvik7VbEvDwbT-mapgXc8fwlq7Ao3BP/view?usp=sharing

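【Question discussion】:

Could you provide a copy of the PDF you are trying to extract text from, and say which text in it you want to extract? Without that, we can only guess.
Sure. Give me 5 minutes and I will prepare one and share it, since the PDF I am working with is confidential.
Hi moys, I have edited the question and added the PDF. Could you check it now?
I suggest using Pillow (or some other image-recognition tool) to get the coordinates of the rectangle first, and then using those coordinates in pymupdf to get the text inside. I have done the second part, but I am not sure whether the first is possible.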
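
The second half of that suggestion (passing known rectangle coordinates to PyMuPDF) could look roughly like the sketch below. It assumes the rectangle is already known in PDF points as (x0, y0, x1, y1). `words_in_rect` and `text_in_rect` are hypothetical helper names, and the centre-point test is only one possible rule for deciding that a word lies inside the rectangle. The word tuples follow the layout returned by PyMuPDF's `page.get_text("words")`: (x0, y0, x1, y1, word, block_no, line_no, word_no).

```python
def words_in_rect(words, rect):
    """Join the words whose centre point lies inside rect.

    words: tuples shaped like PyMuPDF's page.get_text("words") output,
           i.e. (x0, y0, x1, y1, word, ...).
    rect:  (x0, y0, x1, y1) in the same coordinate system.
    """
    x0, y0, x1, y1 = rect
    inside = []
    for w in words:
        wx0, wy0, wx1, wy1, text = w[:5]
        cx = (wx0 + wx1) / 2  # horizontal centre of the word box
        cy = (wy0 + wy1) / 2  # vertical centre of the word box
        if x0 <= cx <= x1 and y0 <= cy <= y1:
            inside.append(text)
    # Words are kept in the order they were supplied.
    return " ".join(inside)


def text_in_rect(pdf_path, page_number, rect):
    """Hypothetical wrapper: read one page with PyMuPDF and filter its words."""
    import fitz  # PyMuPDF; imported lazily so words_in_rect works without it

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_number]  # zero-based page index
        return words_in_rect(page.get_text("words"), rect)
    finally:
        doc.close()
```

A call might look like `text_in_rect("pdf_test.pdf", 1, (50, 100, 300, 200))`; the coordinates here are placeholders, not values measured from the sample file.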

Tags: python text-extraction pdf-extraction pymupdf

【Solution 1】:

You can use the code below

import PyPDF2

def convert_pdf_to_text(document):
    # PyPDF2 >= 3.0 names; older releases spelled these PdfFileReader,
    # getNumPages(), getPage() and extractText().
    reader = PyPDF2.PdfReader(document, strict=False)

    all_text = ""
    for page in reader.pages:
        # extract_text() returns the text of the whole page, not of a region.
        all_text += page.extract_text()
    return all_text.replace("\n", "")

convert_pdf_to_text('pdf_test.pdf')

Output

'A Simple PDF File  This is a small demonstration .pdf file - just for use in the Virtual Mechanics tutorials. More text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Boring, zzzzz. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. And more text. Even more. Continued on page 2 ...  Details  State: State_name     City: City_name    Country: Country_name     Rig No: 4455555  Source Id: k4-3k44 '

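【Discussion】:

OK, thanks for the reply. Let me check.
I think this code extracts the entire text of the pdf, but we need only the text inside the rectangle.
Hi moys, could you help extract only the text inside the rectangular box?
What is the distinguishing feature of your rectangle? Will it be at the same position on every page? Will it have the same content? Something has to define where the rectangle is. What is it?
Hi moys, sorry for the late reply. The actual requirement is that a rectangle can be anywhere on the page, and there can be more than one. The rectangles do not contain any fixed text either; they are dynamic.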
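
For rectangles that can appear anywhere, one possible direction is to read their positions from the page's vector graphics instead of supplying them by hand. The sketch below is untested against the linked file and rests on an assumption: the boxes are drawn with the PDF rectangle operator, so they show up as "re" items in PyMuPDF's `page.get_drawings()`. Boxes drawn as four separate lines, or boxes that are part of a scanned image, would not be found this way. `rects_from_drawings` and `text_per_rectangle` are hypothetical helper names, and `min_size` is an arbitrary threshold for skipping tiny shapes.

```python
def rects_from_drawings(drawings, min_size=5.0):
    """Collect (x0, y0, x1, y1) for each "re" item at least min_size wide and tall.

    drawings: a list shaped like PyMuPDF's page.get_drawings() output, where
              every path is a dict whose "items" list holds tuples such as
              ("re", rect, ...) or ("l", p1, p2).
    """
    rects = []
    for path in drawings:
        for item in path.get("items", []):
            if item[0] != "re":
                continue  # skip lines, curves and quads
            ax, ay, bx, by = tuple(item[1])
            # Normalise so that x0 <= x1 and y0 <= y1.
            x0, x1 = min(ax, bx), max(ax, bx)
            y0, y1 = min(ay, by), max(ay, by)
            if x1 - x0 >= min_size and y1 - y0 >= min_size:
                rects.append((x0, y0, x1, y1))
    return rects


def text_per_rectangle(pdf_path, min_size=5.0):
    """Hypothetical wrapper: return (page_number, rect, text) for every rectangle found."""
    import fitz  # PyMuPDF; imported lazily so rects_from_drawings works without it

    results = []
    doc = fitz.open(pdf_path)
    try:
        for page_number, page in enumerate(doc):
            for rect in rects_from_drawings(page.get_drawings(), min_size):
                # clip restricts text extraction to the given rectangle.
                text = page.get_text("text", clip=fitz.Rect(rect)).strip()
                results.append((page_number, rect, text))
    finally:
        doc.close()
    return results
```

Filled backgrounds and table cells are also rectangles, so the result may need further filtering (for example by size or by stroke colour) before it matches only the boxes of interest.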