Extract images from PDF using python PyPDF2: answers

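【Question title】: Extract images from PDF using python PyPDF2
【Posted】: 2013-12-18 02:59:41
【Question description】:

Is there a way to extract images from a PDF document as streams (using the PyPDF2 library)? And is it possible to replace some images with others (for example, ones generated with PIL or loaded from a file)?

I am able to get an EncodedStreamObject from the PDF object tree and obtain the encoded stream (by calling the getData() method), but it looks like raw content without any image headers or other meta information.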

>>> import PyPDF2
>>> # sample.pdf contains png images
>>> reader = PyPDF2.PdfFileReader(open('sample.pdf', 'rb'))
>>> reader.resolvedObjects[0][9]
{'/BitsPerComponent': 8,
'/ColorSpace': ['/ICCBased', IndirectObject(20, 0)],
'/Filter': '/FlateDecode',
'/Height': 30,
'/Subtype': '/Image',
'/Type': '/XObject',
'/Width': 100}
>>>
>>> reader.resolvedObjects[0][9].__class__
PyPDF2.generic.EncodedStreamObject
>>>
>>> s = reader.resolvedObjects[0][9].getData()
>>> len(s), s[:10]
(9000, '\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc\xcc')

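I have looked at many PyPDF2, ReportLab, and PDFMiner solutions, but have not found anything like what I am looking for.

Any code samples and links would be very helpful.

【Question comments】:

So you want to open a large PDF, extract one page, and add that page to an existing PDF? Is it acceptable to save the merged PDF as a new file?
This answer may help: stackoverflow.com/a/34116472/1513933
Possible duplicate of Extract images from PDF without resampling, in python?

Tags: python pdf image-processing reportlab pypdf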

【Solution 1】:

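import fitz  # PyMuPDF: pip install PyMuPDF

file_path = "sample.pdf"  # placeholder: path of the PDF to read
doc = fitz.open(file_path)
for i in range(len(doc)):
    # get_page_images was called getPageImageList in older PyMuPDF releases
    for img in doc.get_page_images(i):
        xref = img[0]                   # cross-reference number of the image object
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha > 3:       # CMYK: convert to RGB first
            pix = fitz.Pixmap(fitz.csRGB, pix)
        # Pixmap.save replaces the older writePNG method
        pix.save("p%s-%s.png" % (i, xref))
        pix = None                      # release the pixmap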

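【Comments】:

Welcome to Stack Overflow! While this code snippet may solve the problem, it does not explain why or how it answers the question. Please include an explanation for your code, as that really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion.
Thanks @jainam shah, it worked for me. Install the library with pip install PyMuPDF, then import fitz, and it works.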

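【Solution 2】:

Image metadata is not stored in the encoded image inside the PDF. If metadata is stored at all, it is stored in the PDF itself, stripped from the underlying image. The metadata you see in your example is probably all you will be able to get. A PDF encoder might store image metadata elsewhere in the PDF, but I have not seen this. (Note that this metadata question was also asked for Java.)

Extracting the stream is definitely possible, though; as you mentioned, you do that with the getData operation.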
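
To make the metadata point concrete with the numbers from the question: the decoded stream is 9000 bytes for a 100 x 30 image at 8 bits per component, which matches 100 x 30 x 3 raw RGB samples. The following is only a minimal sketch of wrapping such samples in a PNG container with the standard library. It assumes the /ICCBased color space has three components and that no predictor was applied to the stream. `raw_rgb_to_png` and `_chunk` are hypothetical helper names, and `samples` is a stand-in for the result of getData().

```python
import struct
import zlib


def _chunk(tag: bytes, payload: bytes) -> bytes:
    """Build one PNG chunk: length, tag, payload, CRC-32 over tag + payload."""
    crc = zlib.crc32(tag + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + tag + payload + struct.pack(">I", crc)


def raw_rgb_to_png(samples: bytes, width: int, height: int) -> bytes:
    """Wrap decoded 8-bit RGB samples (row-major, no padding) in a PNG file."""
    stride = width * 3
    if len(samples) != stride * height:
        raise ValueError("sample count does not match width x height x 3")
    # PNG needs a filter-type byte (0 = none) in front of every scanline.
    rows = b"".join(
        b"\x00" + samples[y * stride:(y + 1) * stride] for y in range(height)
    )
    # IHDR: width, height, bit depth 8, colour type 2 (RGB), then three zero fields.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + _chunk(b"IHDR", ihdr)
        + _chunk(b"IDAT", zlib.compress(rows))
        + _chunk(b"IEND", b"")
    )


# Stand-in for reader.resolvedObjects[0][9].getData(); the real bytes are assumed
# to have the same layout (the question only shows that they start with 0xCC).
samples = b"\xcc" * 9000
png_bytes = raw_rgb_to_png(samples, width=100, height=30)
```

The width and height come from the /Width and /Height entries of the image dictionary, which is why the dictionary has to be read alongside the stream.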

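As for replacing it, you would need to create a new image object in the PDF, add it to the end, and update the indirect object pointers accordingly. That is difficult to do with PyPDF2.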
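
The replacement steps can be illustrated with a toy model. This is not PyPDF2's API: plain dicts stand in for PDF objects, `("ref", n)` tuples stand in for indirect references, and `replace_image` is a hypothetical helper. It only shows the idea of appending a new object and repointing references to it.

```python
# Toy object table: object number -> object. Object 9 plays the image XObject.
objects = {
    4: {"/Type": "/Page", "/Resources": {"/XObject": {"/Im0": ("ref", 9)}}},
    9: {"/Type": "/XObject", "/Subtype": "/Image", "/Width": 100, "/Height": 30},
}


def replace_image(objects, old_num, new_image):
    """Append new_image as a new object and repoint references to old_num."""
    new_num = max(objects) + 1
    objects[new_num] = new_image

    def repoint(value):
        # Swap a reference to the old object for one to the new object.
        if value == ("ref", old_num):
            return ("ref", new_num)
        if isinstance(value, dict):
            return {key: repoint(item) for key, item in value.items()}
        if isinstance(value, list):
            return [repoint(item) for item in value]
        return value

    for num in list(objects):
        if num != new_num:
            objects[num] = repoint(objects[num])
    return new_num


new_num = replace_image(
    objects,
    old_num=9,
    new_image={"/Type": "/XObject", "/Subtype": "/Image", "/Width": 50, "/Height": 50},
)
```

In this model the old object stays in the table but is no longer referenced. A real PDF writer would also have to rewrite the cross-reference table, which is part of what makes this awkward in practice.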

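【Comments】:

【Solution 3】:

Extracting images from a PDF

This code extracts images from any PDF, whether scanned, machine-generated, or ordinary.
It reports how many images appear on each page.
It saves each image with its original resolution and file extension.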

# Install the library first: pip install PyMuPDF
import fitz
import io
from PIL import Image

# path of the PDF you want to extract images from
file = r"File_path"
# open the file
pdf_file = fitz.open(file)
# iterate over the PDF pages
for page_index in range(pdf_file.page_count):
    # get the page itself
    page = pdf_file[page_index]
    image_li = page.get_images()
    # report the number of images found on this page
    # (page index starts from 0, hence adding 1 for display)
    if image_li:
        print(f"[+] Found a total of {len(image_li)} images in page {page_index+1}")
    else:
        print(f"[!] No images found on page {page_index+1}")
    for image_index, img in enumerate(image_li, start=1):
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes
        base_image = pdf_file.extract_image(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it into PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk; passing a filename lets PIL open and close the file
        image.save(f"image{page_index+1}_{image_index}.{image_ext}")

【Comments】: