Change color scheme when extracting an image from PDF in Python

【Question title】: Change color scheme when extracting an image from PDF in Python
【Posted】: 2017-09-06 16:10:00
【Question description】:

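I am trying to read images from a PDF, following this post: Extract images from PDF without resampling, in python?

So far I have managed to get the image file out of the PDF, but it uses the CMYK color scheme and the picture comes out garbled.

My code is as follows: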

import PyPDF2
import struct

pdf_filename = 'document.pdf'
pdf_file = open(pdf_filename, 'rb')
cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
page = cond_scan_reader.getPage(4)  # page index is zero-based, so this is the fifth page
xObject = page['/Resources']['/XObject'].getObject()
for obj in xObject:
    print(xObject[obj])
    if xObject[obj]['/Subtype'] == '/Image':
        # /DCTDecode means the stream is JPEG data, so it is written out unchanged
        if xObject[obj]['/Filter'] == '/DCTDecode':
            data = xObject[obj]._data
            img = open("image" + ".jpg", "wb")
            img.write(data)
            img.close()

pdf_file.close()

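The problem is that when I save the image, the colors are all wrong, which I believe is caused by the color space. This is what I get in the console: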

{'/Type': '/XObject', '/Subtype': '/Image', '/Width': 1122, '/Height': 502, '/Interpolate': <PyPDF2.generic.BooleanObject object at 0x1061574a8>, '/ColorSpace': '/DeviceCMYK', '/BitsPerComponent': 8, '/Filter': '/DCTDecode'}

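As you can see, the ColorSpace is CMYK, and I believe that is why the image colors look strange.

Here is my image:

Here is the original image (in the PDF file):

Can anyone help me?

Thanks in advance. Israel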

Tags: python pdf

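【Solution 1】:

A CMYK-mode JPG image contained in a PDF must be inverted: every channel value v has to be replaced by 255 - v.

However, PIL does not support inverting CMYK-mode images, so I solved it with numpy.

The full source code is at the link below. https://github.com/Gaia3D/pdfImageExtractor/blob/master/extrectImage.py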

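import numpy as np
from PIL import Image

# img is the extracted image opened with PIL (mode 'CMYK'); outFileName is the output path without extension
imgData = np.frombuffer(img.tobytes(), dtype='B')
invData = np.full(imgData.shape, 255, dtype='B')
invData -= imgData  # 255 - value inverts every channel byte
img = Image.frombytes(img.mode, img.size, invData.tobytes())
img.save(outFileName + ".jpg")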
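
For anyone who wants to try the idea without a PDF at hand, here is a minimal self-contained sketch of the same inversion. The helper names `is_cmyk_jpeg` and `invert_cmyk` are made up for this illustration and are not part of PyPDF2, Pillow, or the linked repository. `is_cmyk_jpeg` only assumes a dict-like object with the same keys as the dictionary printed in the question. Whether a given file actually needs the inversion may depend on how it was produced, so treat this as a starting point rather than a general fix.

```python
import numpy as np
from PIL import Image


def is_cmyk_jpeg(xobj):
    """Hypothetical helper: True for an image dictionary like the one printed
    in the question (JPEG stream stored in the /DeviceCMYK color space)."""
    return (
        xobj.get('/Subtype') == '/Image'
        and xobj.get('/Filter') == '/DCTDecode'
        and xobj.get('/ColorSpace') == '/DeviceCMYK'
    )


def invert_cmyk(img):
    """Hypothetical helper: return a new CMYK image in which every channel
    value v is replaced by 255 - v."""
    if img.mode != 'CMYK':
        raise ValueError('expected a CMYK image, got mode %r' % img.mode)
    # View the raw pixel bytes as unsigned 8-bit integers (C, M, Y, K, C, M, ...).
    data = np.frombuffer(img.tobytes(), dtype=np.uint8)
    # Subtracting from 255 cannot overflow because every value is in 0..255.
    inverted = 255 - data
    return Image.frombytes(img.mode, img.size, inverted.tobytes())


if __name__ == '__main__':
    # Synthetic 2x2 CMYK image used only to demonstrate the call.
    sample = Image.new('CMYK', (2, 2), (0, 10, 200, 255))
    fixed = invert_cmyk(sample)
    print(fixed.mode, fixed.size)
```

With a real file, the bytes written by the code in the question could be opened with `Image.open`, passed through `invert_cmyk` when the image mode is `'CMYK'`, and then saved, or converted with `convert('RGB')` first if an RGB file is preferred.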
