How to get the frequently occurring words from text extracted using tika

【Question title】: How to get the frequently occurring words from the text extracted using tika
【Posted】: 2013-07-03 05:27:59
【Question】:

I extracted the text of several file formats (pdf, html, doc) with the following code (using tika):

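File file1 = new File("c://sample.pdf");
BodyContentHandler handler = new BodyContentHandler(10*1024*1024);
try (InputStream input = new FileInputStream(file1)) {
    // Assumed parse step: without it the handler stays empty
    new AutoDetectParser().parse(input, handler, new Metadata());
}
JSONObject obj = new JSONObject();
obj.put("Content", handler.toString());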

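Now I need to get the frequently occurring words from the extracted content. Can you suggest how to do this?

Thanks

【Question comments】:

Yes, the content is stored in a JSON object.

Tags: java file apache-tika word-frequency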

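【Solution 1】:

Here is a function that returns the most frequent word.

Pass the extracted content to the function and it returns the word that occurs most often.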

import java.util.HashMap;
import java.util.Map;

String getMostFrequentWord(String input) {
    // Split on any run of whitespace; extracted text usually contains newlines and tabs
    String[] words = input.trim().split("\\s+");
    // Create a dictionary using word as key, and frequency as value
    Map<String, Integer> dictionary = new HashMap<String, Integer>();
    for (String word : words) {
        if (dictionary.containsKey(word)) {
            int frequency = dictionary.get(word);
            dictionary.put(word, frequency + 1);
        } else {
            dictionary.put(word, 1);
        }
    }

    // Scan the dictionary for the entry with the highest count
    int max = 0;
    String mostFrequentWord = "";
    for (Map.Entry<String, Integer> entry : dictionary.entrySet()) {
        if (entry.getValue() > max) {
            max = entry.getValue();
            mostFrequentWord = entry.getKey();
        }
    }

    return mostFrequentWord;
}

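The algorithm runs in O(n) time for n words (HashMap lookups and inserts are expected O(1)), so performance should be fine.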
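
The question asks for frequently occurring words (plural), while the function above returns only one. A minimal sketch of a top-N variant might look like the following. The class name `WordFrequency` and the method `topWords` are hypothetical names chosen for illustration, and lowercasing the text and treating every non-letter, non-digit character as a separator are assumptions that may not suit every document.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class WordFrequency {

    // Returns up to n distinct words, most frequent first.
    // Ties are broken alphabetically so the result is deterministic.
    public static List<String> topWords(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumption: case is ignored and any run of characters that is
        // neither a letter nor a digit separates two words.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }

        // Sort entries by descending count, then by word.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        // Keep only the first n words.
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries.subList(0, Math.min(n, entries.size()))) {
            result.add(e.getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(topWords("The cat and the dog and the bird.", 2));
    }
}
```

With the content from the question, the call would be something like `WordFrequency.topWords(handler.toString(), 10)`. Sorting makes this O(k log k) in the number of distinct words k, on top of the O(n) counting pass. For real documents a stop-word filter is probably also needed, otherwise words such as "the" dominate the result.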

【Comments】: