[Question Title]: How to extract table as text from the PDF using Python?
[Posted]: 2018-05-12 01:06:10
[Question Description]:

I have a PDF that contains tables, text, and some images. I want to extract the tables wherever they appear in the PDF.

Right now I am finding the tables manually, page by page. I then capture that page and save it into another PDF.

import PyPDF2

PDFfilename = "Sammamish.pdf" #filename of your PDF/directory where your PDF is stored

pfr = PyPDF2.PdfFileReader(open(PDFfilename, "rb")) #PdfFileReader object

pg4 = pfr.getPage(126) #extract pg 127

writer = PyPDF2.PdfFileWriter() #create PdfFileWriter object
#add pages
writer.addPage(pg4)

NewPDFfilename = "allTables.pdf" #filename of your PDF/directory where you want your new PDF to be
with open(NewPDFfilename, "wb") as outputStream:
    writer.write(outputStream) #write pages to new PDF

My objective is to extract the tables from the entire PDF document.

[Question Discussion]:

    Tags: python pdf pdf-parsing


    [Solution 1]:
    • I suggest you extract the tables using tabula.
    • Pass your PDF as an argument to the tabula API and it will return the tables to you as dataframes.
    • Each table in the PDF is returned as one dataframe.
    • The tables are returned in a list of dataframes; use pandas to work with them.

    Here is my code for extracting the tables from the PDF.

    import pandas as pd
    import tabula
    file = "filename.pdf"
    path = 'enter your directory path here'  + file
    df = tabula.read_pdf(path, pages = '1', multiple_tables = True)
    print(df)
    

    For more details, please refer to my repo.

    [Discussion]:

    • This only works for text-based PDFs, not scanned PDFs.
    • For scanned PDFs, use image-processing techniques; there are many tools for that.
    [Solution 2]:

    This answer is for anyone encountering PDFs with images who needs to use OCR. I could not find a workable off-the-shelf solution; nothing gave me the accuracy I needed.

    Here are the steps I found to work.

    1. Convert the PDF pages to images using pdfimages from https://poppler.freedesktop.org/.

    2. Detect rotation using Tesseract and fix it with ImageMagick mogrify.

    3. Find and extract the tables using OpenCV.

    4. Find and extract each cell from the tables using OpenCV.

    5. Crop and clean up each cell using OpenCV so there is no noise that will confuse the OCR software.

    6. OCR each cell using Tesseract.

    7. Combine the extracted text of every cell into the format you need.
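    Steps 1 and 2 above lean on command-line tools rather than Python. A minimal sketch of driving them from Python with subprocess (a sketch only; the file names are placeholders, the helper names are made up, and the exact OSD output format can vary between Tesseract versions):

```python
import re
import subprocess

def pdf_to_images_cmd(pdf_path, prefix):
    # Step 1: poppler's pdfimages writes the page images out as PNGs.
    return ["pdfimages", "-png", pdf_path, prefix]

def detect_rotation_cmd(image_path):
    # Step 2a: Tesseract's --psm 0 mode prints orientation/script detection (OSD) only.
    return ["tesseract", image_path, "stdout", "--psm", "0"]

def parse_rotation(osd_output):
    # The OSD report contains a line such as "Rotate: 90".
    match = re.search(r"Rotate:\s*(\d+)", osd_output)
    return int(match.group(1)) if match else 0

def fix_rotation_cmd(image_path, degrees):
    # Step 2b: ImageMagick's mogrify rotates the image file in place.
    return ["mogrify", "-rotate", str(degrees), image_path]

# Wiring it together (requires pdfimages, tesseract and mogrify on PATH):
# subprocess.run(pdf_to_images_cmd("doc.pdf", "page"), check=True)
# osd = subprocess.run(detect_rotation_cmd("page-000.png"),
#                      capture_output=True, text=True).stdout
# degrees = parse_rotation(osd)
# if degrees:
#     subprocess.run(fix_rotation_cmd("page-000.png", degrees), check=True)
```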

    I wrote a Python package with modules that can help with those steps.

    回购:https://github.com/eihli/image-table-ocr

    文档和来源:https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html

    Some of the steps don't require code at all; they take advantage of external tools like pdfimages and tesseract. I'll provide a few brief examples for the couple of steps that do require code.

    3. Finding the tables:

    This link was a good reference while figuring out how to find the tables: https://answers.opencv.org/question/63847/how-to-extract-tables-from-an-image/

    import cv2
    
    def find_tables(image):
        BLUR_KERNEL_SIZE = (17, 17)
        STD_DEV_X_DIRECTION = 0
        STD_DEV_Y_DIRECTION = 0
        blurred = cv2.GaussianBlur(image, BLUR_KERNEL_SIZE, STD_DEV_X_DIRECTION, sigmaY=STD_DEV_Y_DIRECTION)
        MAX_COLOR_VAL = 255
        BLOCK_SIZE = 15
        SUBTRACT_FROM_MEAN = -2
    
        img_bin = cv2.adaptiveThreshold(
            ~blurred,
            MAX_COLOR_VAL,
            cv2.ADAPTIVE_THRESH_MEAN_C,
            cv2.THRESH_BINARY,
            BLOCK_SIZE,
            SUBTRACT_FROM_MEAN,
        )
        vertical = horizontal = img_bin.copy()
        SCALE = 5
        image_height, image_width = horizontal.shape  # numpy shape is (rows, cols)
        horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (int(image_width / SCALE), 1))
        horizontally_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, horizontal_kernel)
        vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, int(image_height / SCALE)))
        vertically_opened = cv2.morphologyEx(img_bin, cv2.MORPH_OPEN, vertical_kernel)
    
        horizontally_dilated = cv2.dilate(horizontally_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1)))
        vertically_dilated = cv2.dilate(vertically_opened, cv2.getStructuringElement(cv2.MORPH_RECT, (1, 60)))
    
        mask = horizontally_dilated + vertically_dilated
        contours, hierarchy = cv2.findContours(
            mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE,
        )
    
        MIN_TABLE_AREA = 1e5
        contours = [c for c in contours if cv2.contourArea(c) > MIN_TABLE_AREA]
        perimeter_lengths = [cv2.arcLength(c, True) for c in contours]
        epsilons = [0.1 * p for p in perimeter_lengths]
        approx_polys = [cv2.approxPolyDP(c, e, True) for c, e in zip(contours, epsilons)]
        bounding_rects = [cv2.boundingRect(a) for a in approx_polys]
    
        # The link where a lot of this code was borrowed from recommends an
        # additional step to check the number of "joints" inside this bounding rectangle.
        # A table should have a lot of intersections. We might have a rectangular image
        # here though which would only have 4 intersections, 1 at each corner.
        # Leaving that step as a future TODO if it is ever necessary.
        images = [image[y:y+h, x:x+w] for x, y, w, h in bounding_rects]
        return images
    
    4. Extracting cells from the table.

    This is very similar to step 3, so I won't include all of the code. The part I will cover is sorting the cells.

    We want to identify the cells from left to right, top to bottom.

    We'll find the top-left-most rectangle. Then we'll find all rectangles whose center lies between the top-y and bottom-y values of that top-left rectangle. Then we'll sort those rectangles by the x value of their centers. We'll remove those rectangles from the list and repeat.

    def cell_in_same_row(c1, c2):
        # Cells are (x, y, w, h) bounding rects; check whether c1's vertical
        # center falls within c2's vertical span.
        c1_center = c1[1] + c1[3] / 2
        c2_bottom = c2[1] + c2[3]
        c2_top = c2[1]
        return c2_top < c1_center < c2_bottom
    
    orig_cells = [c for c in cells]
    rows = []
    while cells:
        first = cells[0]
        rest = cells[1:]
        cells_in_same_row = sorted(
            [
                c for c in rest
                if cell_in_same_row(c, first)
            ],
            key=lambda c: c[0]
        )
    
        row_cells = sorted([first] + cells_in_same_row, key=lambda c: c[0])
        rows.append(row_cells)
        cells = [
            c for c in rest
            if not cell_in_same_row(c, first)
        ]
    
    # Sort rows by the average height of their centers.
    def avg_height_of_center(row):
        centers = [y + h / 2 for x, y, w, h in row]
        return sum(centers) / len(centers)
    
    rows.sort(key=avg_height_of_center)
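    Once each cell has been OCR'd (step 6), step 7 is just assembling the per-cell strings into your target format. A minimal sketch for CSV output, where the nested list and the helper name are made up for illustration and stand in for the real OCR results:

```python
import csv
import io

def rows_to_csv(rows_of_text):
    # rows_of_text is a list of rows, each a list of per-cell OCR strings,
    # in the left-to-right, top-to-bottom order produced by the sorting above.
    out = io.StringIO()
    writer = csv.writer(out)
    for row in rows_of_text:
        writer.writerow(cell.strip() for cell in row)
    return out.getvalue()

print(rows_to_csv([["ID", "Qty"], ["A-1", "3"], ["B-2", "10"]]))
```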
    

    [Discussion]:

      [Solution 3]:

      If your PDF is text-based rather than a scanned document (i.e., if you can click and drag in a PDF viewer to select text in the table), then you can use the module camelot-py:

      import camelot
      tables = camelot.read_pdf('foo.pdf')
      

      You can then choose how you want to save the tables (csv, json, excel, html, sqlite) and whether the output should be compressed into a ZIP archive.

      tables.export('foo.csv', f='csv', compress=False)
      

      Edit: tabula-py turns out to be about 6 times faster than camelot-py, so it should be used instead.

      import camelot
      import cProfile
      import pstats
      import tabula
      
      cmd_tabula = "tabula.read_pdf('table.pdf', pages='1', lattice=True)"
      prof_tabula = cProfile.Profile().run(cmd_tabula)
      time_tabula = pstats.Stats(prof_tabula).total_tt
      
      cmd_camelot = "camelot.read_pdf('table.pdf', pages='1', flavor='lattice')"
      prof_camelot = cProfile.Profile().run(cmd_camelot)
      time_camelot = pstats.Stats(prof_camelot).total_tt
      
      print(time_tabula, time_camelot, time_camelot/time_tabula)
      

      which gave:

      1.8495559890000015 11.057014036000016 5.978199147125147
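      As a quick sanity check, the printed timings above do work out to roughly a factor of six:

```python
time_tabula = 1.8495559890000015   # seconds, from the output above
time_camelot = 11.057014036000016  # seconds, from the output above

ratio = time_camelot / time_tabula
print(round(ratio, 2))  # prints: 5.98, i.e. roughly 6x slower
```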
      

      [Discussion]:

        [Solution 4]:

        Extracting the table as text from a PDF using Python pdfminer:

        from pprint import pprint
        from io import StringIO
        import re
        from pdfminer.high_level import extract_text_to_fp
        from pdfminer.layout import LAParams
        from lxml import html
        ID_LEFT_BORDER = 56
        ID_RIGHT_BORDER = 156
        QTY_LEFT_BORDER = 355
        QTY_RIGHT_BORDER = 455
        # Read PDF file and convert it to HTML
        output = StringIO()
        with open('example.pdf', 'rb') as pdf_file:
            extract_text_to_fp(pdf_file, output, laparams=LAParams(), output_type='html', codec=None)
        raw_html = output.getvalue()
        # Extract all DIV tags
        tree = html.fromstring(raw_html)
        divs = tree.xpath('.//div')
        # Sort and filter DIV tags
        filtered_divs = {'ID': [], 'Qty': []}
        for div in divs:
            # extract styles from a tag
            div_style = div.get('style')
            # print(div_style)
            # position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:292px; top:1157px; width:27px; height:12px;
            # get the left position
            try:
                left = re.findall(r'left:([0-9]+)px', div_style)[0]
            except IndexError:
                continue
            # the div contains an ID if its left position is between ID_LEFT_BORDER and ID_RIGHT_BORDER
            if ID_LEFT_BORDER < int(left) < ID_RIGHT_BORDER:
                filtered_divs['ID'].append(div.text_content().strip('\n'))
            # the div contains a quantity if its left position is between QTY_LEFT_BORDER and QTY_RIGHT_BORDER
            if QTY_LEFT_BORDER < int(left) < QTY_RIGHT_BORDER:
                filtered_divs['Qty'].append(div.text_content().strip('\n'))
        # Merge and clear lists with data
        data = []
        for row in zip(filtered_divs['ID'], filtered_divs['Qty']):
            if 'ID' in row[0]:
                continue
            data_row = {'ID': row[0].split(' ')[0], 'Quantity': row[1]}
            data.append(data_row)
        pprint(data)
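        The positional filtering at the heart of this approach is easy to test in isolation. A small sketch (the border values are copied from the code above; the helper name is made up for illustration):

```python
import re

ID_BORDERS = (56, 156)    # left-pixel range for the ID column
QTY_BORDERS = (355, 455)  # left-pixel range for the Qty column

def column_for_style(div_style):
    # Pull the `left` position out of a pdfminer-generated inline style
    # and map it onto a column name, or None if it matches neither column.
    match = re.search(r'left:([0-9]+)px', div_style)
    if not match:
        return None
    left = int(match.group(1))
    if ID_BORDERS[0] < left < ID_BORDERS[1]:
        return 'ID'
    if QTY_BORDERS[0] < left < QTY_BORDERS[1]:
        return 'Qty'
    return None

print(column_for_style('position:absolute; left:100px; top:1157px;'))  # prints: ID
```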
        

        [Discussion]:
