Searching for a word in multiple PDF files and indexing the PDFs by word count

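【Question title】: Searching a word in multiple PDF files and indexing PDFs based on the word count
【Posted】: 2015-03-19 17:41:00
【Question description】:

Can anyone help me search for a word in multiple PDF files and count its occurrences?

I need to list the PDFs in descending order of the word's count in each document, and I have to do this in Java.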

【Question discussion】:

Tags: java lucene binary-tree

【Solution 1】:

Get the data:
Download iText (a PDF library), open each PDF you want to scan, read its text, and build a HashMap that stores word -> count(word).
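
The counting step of this approach can be sketched in plain Java. The sketch assumes the text has already been extracted from the PDF into a String (the extraction call depends on the iText version and is not shown). The class name `WordCounts` and the tokenization rule (lowercase, split on anything that is not a letter or digit) are illustrative assumptions, not part of the original answer.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class WordCounts {
    /** Builds a word -> count map from already-extracted text. */
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Split on runs of characters that are neither letters nor digits
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("Lucene indexes text; text search uses the index.");
        System.out.println(counts.get("text")); // prints 2
    }
}
```

Looking up the searched word in the resulting map gives its count for that document.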

Sort your HashMap:
Stack Overflow has already answered this here: Sort a Map<Key, Value> by values (Java)
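
For the question's use case, the map would likely hold one entry per PDF (file name -> number of hits) and be sorted by value in descending order. A minimal sketch, assuming Java 8 or later; the class name `RankByCount`, the file names, and the counts are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RankByCount {
    /** Returns the map's entries ordered by count, highest first. */
    static List<Map.Entry<String, Integer>> sortByCountDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()));
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> hitsPerPdf = new HashMap<>();
        hitsPerPdf.put("a.pdf", 3);   // hypothetical file names and counts
        hitsPerPdf.put("b.pdf", 12);
        hitsPerPdf.put("c.pdf", 7);

        // Prints b.pdf 12, then c.pdf 7, then a.pdf 3
        for (Map.Entry<String, Integer> e : sortByCountDescending(hitsPerPdf)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```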

【Discussion】:

【Solution 2】:

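It sounds like you are looking for a starting point or ideas rather than a specific solution. You have a few options here.

First, you need to make sure the text content of the PDFs is searchable. Here is one way to do that, using Adobe Acrobat.

Second, you need some kind of API to index the PDF files so that they can be searched. This section on the Apache Lucene site may give you some pointers.

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java.

Keep in mind that your question does not give much context, so indexing the PDFs or using Lucene may be overkill for you.

I suggest googling a few approaches: try "text search pdf files", "read pdf files java", and so on.

There is also another answer here that may help you.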

【Discussion】:

【Solution 3】:

You can use PDFBox to count the occurrences of a word in a PDF file:

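import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
// Also needs PDFBox's ExtractText (org.apache.pdfbox.ExtractText in 1.x,
// org.apache.pdfbox.tools.ExtractText in 2.x) and an IOUtils class with
// closeQuietly; Apache Commons IO (org.apache.commons.io.IOUtils) is assumed here.

public static int countWordInFile(String word, String filename, String fileEncoding) throws Exception {
    int count = 0;
    PrintStream ps = null;
    PrintStream originalSystemOut = System.out;

    try {
        // Temporarily redirect System.out so the text printed by ExtractText can be captured
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ps = new PrintStream(baos);
        System.setOut(ps);

        // Extract the document's text to the (redirected) console
        ExtractText.main(new String[] {
                "-encoding", fileEncoding,
                "-console",
                filename
        });

        ps.flush();
        String content = baos.toString(fileEncoding);

        // Count non-overlapping, case-sensitive occurrences of the word.
        // Note: this also matches inside longer words ("cat" in "category").
        if (!word.isEmpty()) {
            int idx = content.indexOf(word);
            while (idx >= 0) {
                count++;
                idx = content.indexOf(word, idx + word.length());
            }
        }
    } finally {
        IOUtils.closeQuietly(ps);
        System.setOut(originalSystemOut);
    }

    return count;
}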
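
The counting step on the extracted text does not depend on PDFBox. One possible sketch uses `java.util.regex` to count whole-word, case-insensitive matches. The class and method names are illustrative, and whether matching should be whole-word or case-insensitive is an assumption about what the asker wants.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordCounter {
    /** Counts whole-word, case-insensitive occurrences of word in content. */
    static int countWholeWord(String content, String word) {
        if (word.isEmpty()) {
            return 0;
        }
        // Pattern.quote treats the word literally; \b anchors the match at word boundaries
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher m = p.matcher(content);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String content = "Index the index, then re-index; indexing is not counted.";
        System.out.println(countWholeWord(content, "index")); // prints 3
    }
}
```

Running this per file and sorting the file names by the returned count would give the descending list the question asks for.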

【Discussion】: