Create an index from a PDF [duplicate]: answers

【Question title】: Create an index from a PDF [duplicate]
【Posted】: 2011-10-18 03:56:01
【Question description】:

Possible duplicate:
How do I Index PDF files and search for keywords?

Create an index from a PDF.

【Question comments】:

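What have you got so far? If you are using Python, take a look at the collections module.
Oh, look. Lots and lots of people have asked this same question: stackoverflow.com/search?q=python+index+pdf. You can also use the Search box at the top of the page to find questions from other people that might help you.
"This is not what I'm looking for" is no help at all. Please define carefully and completely how your requirements actually differ. We have no idea what is unique or different about what you are doing. It looks exactly the same to us.
@S.Lott - Indexing the pages within a single file is different from indexing across documents, because the pagination of the source document is critical.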
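
The comments above point to the collections module and stress that page numbers matter when indexing a single document. A minimal sketch of that idea follows, assuming the text of each page has already been extracted into a list of strings. The `build_index` name, the regex-based tokenizer, and the sample pages are illustrative assumptions, not something from the original question.

```python
import re
from collections import defaultdict


def build_index(page_texts):
    """Map each lowercased word to the sorted 1-based page numbers it appears on."""
    index = defaultdict(set)
    for page_number, text in enumerate(page_texts, start=1):
        # Assumption: a "word" is any run of word characters.
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_number)
    return {word: sorted(pages) for word, pages in index.items()}


# Hypothetical page texts standing in for real extracted PDF pages.
sample_pages = [
    "Perl and Python are scripting languages.",
    "Python can read PDF files.",
    "An index maps words to pages.",
]

index = build_index(sample_pages)
print(index["python"])  # page numbers on which "python" occurs
```

Using a set per word avoids listing the same page twice when a word repeats on one page, which is usually what a back-of-book index wants.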

Tags: perl

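【Solution 1】:

I think you can use the pyPdf Python library (http://pybrary.net/pyPdf/) for this. This code prints the numbers of the pages that contain the word you are looking for: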

from pyPdf import PdfFileReader

# Open the PDF in binary mode (file() is the Python 2 builtin).
reader = PdfFileReader(file("YourPDFFile.pdf", "rb"))

numberOfPages = reader.getNumPages()

# getPage() is zero-based, so start at 0 to include the first page.
i = 0
while i < numberOfPages:
    oPage = reader.getPage(i)
    text = oPage.extractText()
    if text.find('What are you looking for') != -1:
        print i + 1  # report the 1-based page number
    i += 1

The same, but with Python 3:

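# Note: pyPdf targets Python 2. On Python 3 its fork PyPDF2 (before 3.0)
# offers the same API via "from PyPDF2 import PdfFileReader".
from pyPdf import PdfFileReader

# Open the PDF in binary mode.
reader = PdfFileReader(open("YourPDFFile.pdf", "rb"))

numberOfPages = reader.getNumPages()

# getPage() is zero-based, so start at 0 to include the first page.
i = 0
while i < numberOfPages:
    oPage = reader.getPage(i)
    text = oPage.extractText()
    if text.find('What are you looking for') != -1:
        print(i + 1)  # report the 1-based page number
    i += 1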

【Discussion】:

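I think the main problem is that I used Python 2.7 in this script, and the print construct differs between Python versions: http://diveintopython3.org/porting-code-to-python-3-with-2to3.html
Note that it would be more straightforward to make this a for loop, for i in range(1, numberOfPages):, and to simply test if 'word' in text.
I haven't used PyPdf, but looking at the documentation, it looks like you can't. I don't know much about the PDF standard, but the document itself is probably defined in terms of pages.
Skimming the Wikipedia entry on the PDF standard, it looks like a PDF document is indeed defined as a stream of pages. So you will probably have to interact with the document page by page. Of course, you could build one large str object containing the extracted text of all pages and then work with that.
I think this answer sums it up best for your other question. Just make sure your counter is declared outside your loop, otherwise it resets with every page (which seems to be what is happening).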
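
The suggestions in this discussion (a for loop, the `in` operator, a counter kept outside the loop, and one large string built from all pages) can be sketched as follows. This is only an illustration under stated assumptions: `page_texts` is a hypothetical list holding the text already extracted from each page, and the function names are invented for the example.

```python
def pages_containing(page_texts, phrase):
    """Return the 1-based numbers of the pages whose text contains phrase."""
    hits = []
    for page_number, text in enumerate(page_texts, start=1):
        if phrase in text:
            hits.append(page_number)
    return hits


def count_occurrences(page_texts, phrase):
    """Count phrase across all pages; the counter lives outside the loop."""
    total = 0  # initialised once, so it is not reset on every page
    for text in page_texts:
        total += text.count(phrase)
    return total


# Hypothetical extracted text, one string per page.
page_texts = ["index of terms", "no match here", "index and index"]

# Alternative: join every page into one large str and search that instead.
full_text = "\n".join(page_texts)

print(pages_containing(page_texts, "index"))
print(count_occurrences(page_texts, "index"))
print("index" in full_text)
```

Joining the pages into a single string is convenient for a yes/no search, but it discards the page boundaries, so the per-page loop is the one to use when page numbers are needed.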