Java - Words and Phrase Frequency Counting [closed]: Answers

【Question title】: Java - Words and Phrase Frequency Counting [closed]
【Posted】: 2013-09-17 02:21:19
【Question description】:

Here is my dilemma.

I need a function that finds the most frequently occurring string patterns in arbitrary text.

So if the input is this:

my name is john jane doe jane doe doe my name is jane doe doe my jane doe name is jane doe I go by the name of john joe jane doe is my name

The output, sorted by number of occurrences, should look like this (case-insensitive):

  Rank    Freq  Phrase
      1       6  jane doe
      2       3  my name
      3       3  name is
      4       2  doe doe
      5       2  doe doe my
      6       2  doe my
      7       2  is jane
      8       2  is jane doe
      9       2  jane doe doe
     10       2  jane doe doe my
     11       2  my name is
     12       2  name is jane
     13       2  name is jane doe
etc...

In my case, I only need phrases that contain 2 or more words. Any idea how to solve this?

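【Question discussion】:

Please tell us what you have tried.
I am writing code that counts occurrences of single words, but it cannot be used to match patterns/phrases (whose size may be unlimited). As we speak, I have really only just started thinking about this. I am considering splitting the whole text into words, pairing each word with the next one first, and then extending the selection as I go while keeping a counter... something like that.
To illustrate, the following online phrase counter does what I need: writewords.org.uk/phrase_count.asp
It sounds like you have the right idea. Do you have a specific question?
You say "I have really only just started thinking about this". Perhaps you could do some more thinking yourself before asking other members of the Stack Overflow community to think for you.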

Tags: java design-patterns word frequency phrase

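【Solution 1】:

Original version - this version wastes a lot of CPU and memory because it uses the string concatenation operator +, which creates a new char[] object and copies the data from one object to another every time + is used.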

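import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class CountPhrases {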
    public static void main(String[] arg){
        String input = "my name is john jane doe jane doe doe my name is jane doe doe my jane doe name is jane doe I go by the name of john joe jane doe is my name";

        String[] split = input.split(" ");
        Map<String, Integer> counts = new HashMap<String,Integer>();
        for(int i=0; i<split.length-1; i++){
            String phrase = split[i];
             for(int j=i+1; j<split.length; j++){
                phrase += " " + split[j];
                Integer count = counts.get(phrase);
                 if(count==null){
                     counts.put(phrase, 1);
                 } else {
                     counts.put(phrase, count+1);
                 }
             }
        }

        Map.Entry<String,Integer>[] entries = counts.entrySet().toArray(new Map.Entry[0]);
        Arrays.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });
        int rank=1;
        System.out.println("Rank Freq Phrase");
        for(Map.Entry<String,Integer> entry:entries){
            int count = entry.getValue();
            if(count>1){
                System.out.printf("%4d %4d %s\n", rank++, count,entry.getKey());
            }
        }
    }
}

Output:

Rank Freq Phrase
   1    6 jane doe
   2    3 name is
   3    3 my name
   4    2 name is jane doe
   5    2 jane doe doe
   6    2 doe my
   7    2 my name is
   8    2 is jane doe
   9    2 jane doe doe my
  10    2 name is jane
  11    2 is jane
  12    2 doe doe
  13    2 doe doe my

Process finished with exit code 0

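New version - uses String.substring, which saves CPU and memory on older JVMs (before Java 7 update 6), where all strings obtained through substring share the same char[] behind the scenes. On those JVMs this should run faster. Since Java 7 update 6, substring copies the characters into a new array, so the sharing benefit no longer applies there.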

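import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

public class CountPhrases {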
    public static void main(String[] arg){
        String input = "my name is john jane doe jane doe doe my name is jane doe doe my jane doe name is jane doe I go by the name of john joe jane doe is my name";

        String[] split = input.split(" ");
        Map<String, Integer> counts = new HashMap<String,Integer>(split.length*(split.length-1)/2,1.0f);
        int idx0 = 0;
        for(int i=0; i<split.length-1; i++){
            int splitIpos = input.indexOf(split[i],idx0);
            int newPhraseLen = splitIpos-idx0+split[i].length();
            String phrase = input.substring(idx0, idx0+newPhraseLen);
            for(int j=i+1; j<split.length; j++){
                newPhraseLen = phrase.length()+split[j].length()+1;
                phrase=input.substring(idx0, idx0+newPhraseLen);
                Integer count = counts.get(phrase);
                if(count==null){
                     counts.put(phrase, 1);
                } else {
                     counts.put(phrase, count+1);
                }
            }
            idx0 = splitIpos+split[i].length()+1;
        }

        Map.Entry<String, Integer>[] entries = counts.entrySet().toArray(new Map.Entry[0]);
        Arrays.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });
        int rank=1;
        System.out.println("Rank Freq Phrase");
        for(Map.Entry<String,Integer> entry:entries){
            int count = entry.getValue();
            if(count>1){
                System.out.printf("%4d %4d %s\n", rank++, count,entry.getKey());
            }
        }
    }
}

Output:

Rank Freq Phrase
   1    6 jane doe
   2    3 name is
   3    3 my name
   4    2 name is jane doe
   5    2 jane doe doe
   6    2 doe my
   7    2 my name is
   8    2 is jane doe
   9    2 jane doe doe my
  10    2 name is jane
  11    2 is jane
  12    2 doe doe
  13    2 doe doe my

Process finished with exit code 0
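
Both listings print ties in hash-map iteration order and compare words exactly as typed, while the question asks for case-insensitive counting and shows ties sorted alphabetically. The following is a minimal sketch of one way to get that ordering; it is not part of the original answer, and `RankedPhrases`, `count` and `rank` are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not taken from the original answer.
class RankedPhrases {

    // Counts every phrase of two or more consecutive words, ignoring case.
    static Map<String, Integer> count(String input) {
        String[] words = input.toLowerCase(Locale.ROOT).trim().split("\\s+");
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i < words.length - 1; i++) {
            StringBuilder phrase = new StringBuilder(words[i]);
            for (int j = i + 1; j < words.length; j++) {
                phrase.append(' ').append(words[j]);
                String key = phrase.toString();
                Integer old = counts.get(key);
                counts.put(key, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    // Keeps phrases seen more than once: highest frequency first, ties alphabetical.
    static List<Map.Entry<String, Integer>> rank(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<Map.Entry<String, Integer>>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 1) {
                entries.add(e);
            }
        }
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                int byFreq = b.getValue().compareTo(a.getValue());
                return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
            }
        });
        return entries;
    }

    public static void main(String[] args) {
        int rank = 1;
        System.out.println("Rank Freq Phrase");
        for (Map.Entry<String, Integer> e : rank(count("b c a d B C A D"))) {
            System.out.printf("%4d %4d %s%n", rank++, e.getValue(), e.getKey());
        }
    }
}
```

The tie-break compares the phrase strings directly, so a phrase sorts before any longer phrase it is a prefix of, which matches the ordering shown in the question's table.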

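【Discussion】:

Wow. I got up early to start work and voilà! My code is finished and ready. Please tell me you live in Toronto so I can take you out for a beer.
No, I'm in Texas. Just make sure you understand everything that is going on in that code. :)
When I ran the code on my computer with about 300 words, it worked fine. But I ran into major problems running it on Android: it fails with an out-of-memory error. Given that I have a top-of-the-line device (HTC One), I am worried that shipping this as part of my code is likely to cause problems. Too bad, because it does what I need.
Updated to make it run faster, without limiting the maximum phrase length/word count. (Upvoting comments is another good way to show appreciation. :))
Also, I just updated it to initialize the hash map with approximately the right number of buckets and load factor, so that it does not have to rehash.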
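
On the out-of-memory report in this discussion: counting phrases of every length stores up to n(n-1)/2 keys for n words, and the keys themselves grow long. One possible mitigation, sketched here as an assumption and not as the answerer's method, is to cap the number of words per phrase. `BoundedPhraseCounter` and `maxWords` are hypothetical names; the trade-off is that repeated phrases longer than `maxWords` are not reported.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical variant, not from the original answer.
class BoundedPhraseCounter {

    // Counts phrases of 2..maxWords consecutive words.
    static Map<String, Integer> count(String[] words, int maxWords) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i < words.length - 1; i++) {
            StringBuilder phrase = new StringBuilder(words[i]);
            // A phrase starting at i may use the words at i .. i + maxWords - 1.
            int end = Math.min(words.length, i + maxWords);
            for (int j = i + 1; j < end; j++) {
                phrase.append(' ').append(words[j]);
                String key = phrase.toString();
                Integer old = counts.get(key);
                counts.put(key, old == null ? 1 : old + 1);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = "my name is jane doe my name is".split(" ");
        // With maxWords = 3 the map holds at most about 2 * words.length keys.
        System.out.println(count(words, 3));
    }
}
```

With a fixed cap, the number of keys grows linearly with the input length instead of quadratically.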

【Solution 2】:

Use the idea from the Markov Algorithm of counting word neighbors to build up relationships between words. Start with one word, then two, and so on.
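
The answer gives no code, so the following is only one possible reading of "start with one word, then two, and so on", not the answerer's implementation. It grows phrases level by level and relies on an assumption worth stating: a phrase of k+1 words can repeat only if its first k words repeat, so start positions whose phrase occurs once can be dropped before the next level. `LevelwisePhraseCounter`, `repeatedPhrases` and `join` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of level-by-level phrase growth with pruning.
class LevelwisePhraseCounter {

    // Returns every phrase of two or more words that occurs at least twice.
    static Map<String, Integer> repeatedPhrases(String[] words) {
        Map<String, Integer> result = new LinkedHashMap<String, Integer>();
        // Start positions still worth extending; initially every word with a successor.
        List<Integer> starts = new ArrayList<Integer>();
        for (int i = 0; i < words.length - 1; i++) {
            starts.add(i);
        }
        for (int len = 2; !starts.isEmpty(); len++) {
            // Group the surviving start positions by the len-word phrase beginning there.
            Map<String, List<Integer>> groups = new HashMap<String, List<Integer>>();
            for (int start : starts) {
                if (start + len > words.length) {
                    continue; // phrase would run past the end of the text
                }
                String phrase = join(words, start, len);
                List<Integer> group = groups.get(phrase);
                if (group == null) {
                    group = new ArrayList<Integer>();
                    groups.put(phrase, group);
                }
                group.add(start);
            }
            // Only positions whose phrase repeats are extended at the next level.
            List<Integer> next = new ArrayList<Integer>();
            for (Map.Entry<String, List<Integer>> e : groups.entrySet()) {
                if (e.getValue().size() > 1) {
                    result.put(e.getKey(), e.getValue().size());
                    next.addAll(e.getValue());
                }
            }
            starts = next;
        }
        return result;
    }

    // Joins words[start .. start + len - 1] with single spaces.
    static String join(String[] words, int start, int len) {
        StringBuilder sb = new StringBuilder(words[start]);
        for (int k = 1; k < len; k++) {
            sb.append(' ').append(words[start + k]);
        }
        return sb.toString();
    }
}
```

Because phrases that occur once are never stored or extended, memory use depends on how much of the text actually repeats, not on the square of its length. The returned map is grouped by phrase length; sorting by frequency is left to the caller.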

【Discussion】:

Is there a code example that fits my needs?

【Solution 3】:

    // Requires: import java.util.HashMap; import java.util.Map;
    String txt = "my name is songxiao name is";
    Map<String, Integer> map = new HashMap<String, Integer>();
    String[] tmp = txt.split(" ");
    for (int i = 0; i < tmp.length - 1; i++) {
        String key = tmp[i];
        // Extend the phrase that starts at word i one word at a time and count it.
        for (int j = 1; j < tmp.length - i; j++) {
            key += " " + tmp[i + j];
            Integer count = map.get(key);
            map.put(key, count == null ? 1 : count + 1);
        }
    }
    // Print every phrase with its count (unsorted, in hash-map order).
    for (Map.Entry<String, Integer> entry : map.entrySet()) {
        System.out.println(entry.getKey() + "     " + entry.getValue());
    }

You can paste the code into your main method and run it.

【Discussion】:

I think it is more complicated than that...