How can I parse content from a PDF page with Swift: answers

[Question title]: How can I parse content from a PDF page with Swift
[Posted]: 2016-03-20 15:13:21
[Question description]:

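The documentation isn't very clear to me. So far, I think I need to set up a CGPDFOperatorTable and then, for each PDF page, create a content stream with CGPDFContentStreamCreateWithPage and a scanner with CGPDFScannerCreate.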

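The documentation mentions setting up callbacks, but it's not clear to me how to do that. How do I actually get the content from a page?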

This is my code so far.

    let pdfURL = NSBundle.mainBundle().URLForResource("titleofdocument", withExtension: "pdf")

    // Create pdf document
    let pdfDoc = CGPDFDocumentCreateWithURL(pdfURL)

    // Number of pages in this PDF
    let numberOfPages = CGPDFDocumentGetNumberOfPages(pdfDoc) as Int

    if numberOfPages <= 0 {
        // The number of pages is zero
        return
    }

    let myTable = CGPDFOperatorTableCreate()

    // Let's go through every page
    for pageNr in 1...numberOfPages {

        let thisPage = CGPDFDocumentGetPage(pdfDoc, pageNr)
        let myContentStream = CGPDFContentStreamCreateWithPage(thisPage)
        let myScanner = CGPDFScannerCreate(myContentStream, myTable, nil)

        CGPDFScannerScan(myScanner)

        // Search for Content here?
        // ??

        CGPDFScannerRelease(myScanner)
        CGPDFContentStreamRelease(myContentStream)

    }

    // Release Table
    CGPDFOperatorTableRelease(myTable)

This is a similar question to "PDF Parsing with SWIFT", which has no answers yet.

[Comments on the question]:

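I think I have to write callbacks that are called while the scanner is scanning. Could someone post a callback example? Is it a custom method that I register with CGPDFOperatorTableSetCallback? An example would be great.
You do know that the if check doesn't do anything, right? It returns from the if block and then continues. To loop through the pages only when there are pages, put everything after the if block inside an else block.
I do. I'd really like to learn more about the callbacks. I know about the if statement, but thanks!
Could you accept one of the answers or post your own to help future readers like me? It is much needed. @TomWolters
I added an answer to stackoverflow.com/questions/33136976/pdf-parsing-with-swift, you can take a look.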

Tags: ios swift parsing pdf

[Solution 1]:

Here is an example of the callbacks implemented in Swift:

    let operatorTableRef = CGPDFOperatorTableCreate()

    CGPDFOperatorTableSetCallback(operatorTableRef, "BT") { (scanner, info) in
        print("Begin text object")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef, "ET") { (scanner, info) in
        print("End text object")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef, "Tf") { (scanner, info) in
        print("Select font")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef, "Tj") { (scanner, info) in
        print("Show text")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef, "TJ") { (scanner, info) in
        print("Show text, allowing individual glyph positioning")
    }

    let numPages = CGPDFDocumentGetNumberOfPages(pdfDocument)
    for pageNum in 1...numPages {
        let page = CGPDFDocumentGetPage(pdfDocument, pageNum)
        let stream = CGPDFContentStreamCreateWithPage(page)
        let scanner = CGPDFScannerCreate(stream, operatorTableRef, nil)
        CGPDFScannerScan(scanner)
        CGPDFScannerRelease(scanner)
        CGPDFContentStreamRelease(stream)
    }
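
The callbacks above only print a label for each operator. To read what an operator was called with, its operands have to be popped from the scanner inside the callback. The following is a minimal sketch of that for `Tj`, written against the Swift 3+ names of the Core Graphics API (so it differs slightly from the Swift 2 style code above). The function name `printShownText(inPDFAt:)` is made up for this example. `TJ` takes an array operand instead and would need `CGPDFScannerPopArray`. The decoded text is only meaningful for fonts with a standard encoding; this sketch does not consult the font's encoding tables.

```swift
import CoreGraphics
import Foundation

/// Prints the string operand of every `Tj` operator in the PDF at `url`.
/// Returns false if the document could not be opened.
/// (Hypothetical helper name, used only for this sketch.)
func printShownText(inPDFAt url: URL) -> Bool {
    guard let document = CGPDFDocument(url as CFURL),
          let table = CGPDFOperatorTableCreate() else { return false }
    defer { CGPDFOperatorTableRelease(table) }

    // The closure is bridged to a C function pointer, so it must not capture anything.
    CGPDFOperatorTableSetCallback(table, "Tj") { scanner, _ in
        var pdfString: CGPDFStringRef?
        // Tj has a single string operand; pop it from the scanner's operand stack.
        guard CGPDFScannerPopString(scanner, &pdfString),
              let pdfString = pdfString,
              let text = CGPDFStringCopyTextString(pdfString) else { return }
        print("Tj:", text as String)
    }

    guard document.numberOfPages > 0 else { return true }
    for pageNumber in 1...document.numberOfPages {
        guard let page = document.page(at: pageNumber) else { continue }
        let stream = CGPDFContentStreamCreateWithPage(page)
        let scanner = CGPDFScannerCreate(stream, table, nil)
        CGPDFScannerScan(scanner)
        CGPDFScannerRelease(scanner)
        CGPDFContentStreamRelease(stream)
    }
    return true
}
```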

[Comments]:

Thanks! I'll test this soon; your code looks great.
Thanks for the answer. How do I get data from info?

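[Solution 2]:

You have actually already spelled out how to do it; all you need to do is put the pieces together and keep trying until it works.

First, you need to set up a table with callbacks, as you state yourself at the beginning of the question (all code here is in Objective-C, not Swift):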

    CGPDFOperatorTableRef operatorTable = CGPDFOperatorTableCreate();
    CGPDFOperatorTableSetCallback(operatorTable, "q", &op_q);
    CGPDFOperatorTableSetCallback(operatorTable, "Q", &op_Q);

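This table contains the list of PDF operators you want to be called back for, and associates a callback with each of them. These callbacks are just functions that you define elsewhere: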

    static void op_q(CGPDFScannerRef s, void *info) {
        // Do whatever you have to do in here
        // info is whatever you passed to CGPDFScannerCreate
    }

    static void op_Q(CGPDFScannerRef s, void *info) {
        // Do whatever you have to do in here
        // info is whatever you passed to CGPDFScannerCreate
    }

Then you create the scanner and run it, passing it the operator table you just defined.

    // Passing "self" is just an example, you can pass whatever you want and it will be provided to your callback whenever it is called by the scanner.
    CGPDFScannerRef contentStreamScanner = CGPDFScannerCreate(contentStream, operatorTable, self);

    CGPDFScannerScan(contentStreamScanner);

If you want to see a complete example, with source code, of how to find and process images, check this website.
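
The Objective-C code above passes `self` as the info argument. In Swift the same slot is an `UnsafeMutableRawPointer?`, and the callback cannot capture variables, so one common way to get data out is to pass an object through `Unmanaged`. The sketch below assumes a made-up `TextCollector` class and a made-up function name `collectStrings(fromPDFAt:)`; it uses the Swift 3+ Core Graphics names and only handles the `Tj` operator.

```swift
import CoreGraphics
import Foundation

/// Hypothetical helper object that accumulates the strings seen by the callback.
final class TextCollector {
    var strings: [String] = []
}

/// Returns the string operands of all `Tj` operators, or nil if the PDF cannot be opened.
func collectStrings(fromPDFAt url: URL) -> [String]? {
    guard let document = CGPDFDocument(url as CFURL),
          let table = CGPDFOperatorTableCreate() else { return nil }
    defer { CGPDFOperatorTableRelease(table) }

    CGPDFOperatorTableSetCallback(table, "Tj") { scanner, info in
        var pdfString: CGPDFStringRef?
        guard let info = info,
              CGPDFScannerPopString(scanner, &pdfString),
              let pdfString = pdfString,
              let text = CGPDFStringCopyTextString(pdfString) else { return }
        // `info` is whatever was passed to CGPDFScannerCreate; turn it back
        // into the collector without changing its retain count.
        let collector = Unmanaged<TextCollector>.fromOpaque(info).takeUnretainedValue()
        collector.strings.append(text as String)
    }

    let collector = TextCollector()
    // The collector is still used after the loop, so it outlives every scan
    // and an unretained pointer is sufficient here.
    let info = Unmanaged.passUnretained(collector).toOpaque()

    guard document.numberOfPages > 0 else { return [] }
    for pageNumber in 1...document.numberOfPages {
        guard let page = document.page(at: pageNumber) else { continue }
        let stream = CGPDFContentStreamCreateWithPage(page)
        let scanner = CGPDFScannerCreate(stream, table, info)
        CGPDFScannerScan(scanner)
        CGPDFScannerRelease(scanner)
        CGPDFContentStreamRelease(stream)
    }
    return collector.strings
}
```

Passing a class instance rather than a struct keeps the pointer stable for the duration of the scan, which is why the collector is a `final class` in this sketch.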

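[Comments]:

Thanks! Although I feel I'm on the right track and your answer does match what I need, I just can't convert the Objective-C functions into working Swift callbacks.
How do I get data from info?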

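[Solution 3]:

To understand why the parser works this way, you need to read the PDF specification more closely. A PDF file contains something close to printing instructions, such as "move to this coordinate, print this character, move over there, change the color, print character number 23 from font #23", and so on.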

The parser gives you a callback for each instruction and lets you retrieve the instruction's parameters. That's it.

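So, to get the content out of a file, you need to rebuild its state manually. That means recomputing the frames of all the characters and trying to reverse-engineer the page layout. This is clearly not an easy task, which is why people have created libraries to do it.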
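
As a rough illustration of what rebuilding the state involves, here is a toy model in plain Swift. It is not a real PDF interpreter: the `TextOp` enum, `PlacedText`, `placeText`, and `lines(from:)` are all invented for this sketch, only simplified versions of `BT`, `Td`, and `Tj` are modeled, and fonts, text matrices, and glyph widths are ignored entirely. In a real parser these operations would arrive through the scanner callbacks.

```swift
/// Hypothetical, heavily simplified stand-ins for PDF text operators.
enum TextOp {
    case beginText                     // BT: reset the text position
    case move(dx: Double, dy: Double)  // Td: offset the current position
    case show(String)                  // Tj: show a string at the current position
}

/// A string together with the position at which it was shown.
struct PlacedText {
    let x: Double
    let y: Double
    let text: String
}

/// Replays the operators and records where each string ends up.
func placeText(_ ops: [TextOp]) -> [PlacedText] {
    var x = 0.0
    var y = 0.0
    var placed: [PlacedText] = []
    for op in ops {
        switch op {
        case .beginText:
            x = 0
            y = 0
        case let .move(dx, dy):
            x += dx
            y += dy
        case let .show(text):
            placed.append(PlacedText(x: x, y: y, text: text))
        }
    }
    return placed
}

/// Groups strings by y coordinate (top of the page first) and joins
/// the strings on each line from left to right.
func lines(from placed: [PlacedText]) -> [String] {
    let byLine = Dictionary(grouping: placed, by: { $0.y })
    return byLine.keys.sorted(by: >).map { y in
        byLine[y]!
            .sorted { $0.x < $1.x }
            .map { $0.text }
            .joined(separator: " ")
    }
}

let sampleOps: [TextOp] = [
    .beginText,
    .move(dx: 72, dy: 700), .show("Hello"),
    .move(dx: 40, dy: 0), .show("world"),
    .move(dx: -40, dy: -14), .show("Second line"),
]
print(lines(from: placeText(sampleOps)))  // prints ["Hello world", "Second line"]
```

Even this toy version has to guess that strings with the same y coordinate belong to one line. Real documents add rotated text, kerning, and custom font encodings on top, which is where the difficulty described above comes from.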

You may want to have a look at PDFKitten, or at PDFParser, which is a Swift port of it that I made with some improvements.

[Comments]: