Counting the frequency of words from a .txt file in Java: answers

【Question title】: Counting frequency of words from a .txt file in java
【Posted】: 2015-06-14 02:51:27
【Question description】:

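I'm working on a Comp Sci assignment. In the end, the program will determine whether a file is written in English or French. Right now, I'm struggling with the method that counts the frequency of the words that appear in a .txt file.

I have a set of English and a set of French text files, each in its own folder, labeled 1-20. The method takes a directory (in this case "docs/train/eng/" or "docs/train/fre/") and the number of files the program should go through (there are 20 files in each folder). It then reads each file, splits it into words (I don't need to worry about capitalization or punctuation), and puts every word into a HashMap along with the number of times it appears in the file (key = word, value = frequency).

Here is the code I wrote for the method: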

public static HashMap<String, Integer> countWords(String directory, int nFiles) {
  // Declare the HashMap
  HashMap<String, Integer> wordCount = new HashMap();

  // this large 'for' loop will go through each file in the specified directory.
  for (int k = 1; k < nFiles; k++) {
    // Puts together the string that the FileReader will refer to.
    String learn = directory + k + ".txt";

    try {
      FileReader reader = new FileReader(learn);
      BufferedReader br = new BufferedReader(reader);
      // The BufferedReader reads the lines

      String line = br.readLine();

      // Split the line into a String array to loop through
      String[] words = line.split(" ");
      int freq = 0;

      // for loop goes through every word
      for (int i = 0; i < words.length; i++) {
        // Case if the HashMap already contains the key.
        // If so, just increments the value
        if (wordCount.containsKey(words[i])) {
          wordCount.put(words[i], freq++);
        }
        // Otherwise, puts the word into the HashMap
        else {
          wordCount.put(words[i], freq++);
        }
      }
      // Catching the file not found error
      // and any other errors
    }
    catch (FileNotFoundException fnfe) {
      System.err.println("File not found.");
    }
    catch (Exception e) {
      System.err.print(e);
    }
  }
  return wordCount;
}

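The code compiles. Unfortunately, when I asked it to print the word counts for all 20 files, it printed this. It's complete gibberish (although the words are definitely in there), and not at all what I need from the method.

If anyone could help me debug my code, I'd really appreciate it. I've been at this for ages, running test after test, and I'm ready to give up.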

【Comments on the question】:

You should split your code into separate methods. For example, one method could be static HashMap<String, Integer> frequency(List<String> strings) {...}

Tags: java loops hashmap try-catch

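【Solution 1】:

Let me combine all the good answers here.

1) Split your method so that each piece handles one thing: one reads the file into strings[], one processes strings[], and one calls the first two.

2) When you split the text, think carefully about how you want to split it. As @m0skit0 suggested, you should probably use \b for this.

3) As @jas suggested, you should first check whether your map already contains the word. If it does, increment the count; if not, add the word to the map and set its count to 1.

4) To print the map the way you probably expect, take a look at the following: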

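// Typed map and entry: with a raw Map, entrySet() yields Object and the loop does not compile.
Map<String, Integer> test = new HashMap<>();

for (Map.Entry<String, Integer> entry : test.entrySet()) {
  System.out.println(entry.getKey() + " " + entry.getValue());
}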
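The four points above can be sketched as one small program. This is only a minimal sketch, not the asker's required solution: the class name WordFrequency and the helper names readLines and addCounts are made up for illustration, the files are assumed to be UTF-8, and splitting is done on whitespace because that is all the assignment asks for. Unlike the code in the question, it reads every line of each file and loops with k <= nFiles so the last file is included.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordFrequency {

    // Hypothetical helper: reads all lines of one file (assumes UTF-8 text).
    static List<String> readLines(String path) throws IOException {
        return Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8);
    }

    // Hypothetical helper: adds the words of the given lines to the running counts.
    static void addCounts(List<String> lines, Map<String, Integer> wordCount) {
        for (String line : lines) {
            // Split on runs of whitespace; punctuation stays attached to the words.
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // a blank line produces one empty token
                }
                if (wordCount.containsKey(word)) {
                    wordCount.put(word, wordCount.get(word) + 1);
                } else {
                    wordCount.put(word, 1);
                }
            }
        }
    }

    // Counts words across 1.txt .. nFiles.txt in the given directory.
    static Map<String, Integer> countWords(String directory, int nFiles) {
        Map<String, Integer> wordCount = new HashMap<>();
        for (int k = 1; k <= nFiles; k++) {
            String path = directory + k + ".txt";
            try {
                addCounts(readLines(path), wordCount);
            } catch (IOException e) {
                e.printStackTrace(); // report the problem and continue with the next file
            }
        }
        return wordCount;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("docs/train/eng/", 20);
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + " : " + entry.getValue());
        }
    }
}
```

A HashMap does not keep any particular order, so the printed lines will not be alphabetical or sorted by count.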

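【Discussion】:

To print the map in "key : value" format, where should I put this? In my main method? In my countWords method? Also, this has to be done in a single method. I think I'll just split on spaces, since that's all the assignment requires.
@KommanderKitten You can put it in main, or put it in a view function if you're familiar with MVC. If you're not familiar with MVC, it's a pattern for breaking code into logical units. stackoverflow.com/questions/2056/…
Nice copy-paste of the other answers :)
I'm still not sure how to integrate this with my code. I just want it to print: Cat : 1, the : 4, and : 2, etc...
@KommanderKitten Your question has 3 valid answers. Make an effort to integrate them, or ask specifically about what you don't understand. Don't expect copy-paste code, for your own good, because copy-pasting won't teach you a thing. Good luck.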

【Solution 2】:

I would expect something more like this. Does it make sense?

if (wordCount.containsKey(words[i])) { 
  int n = wordCount.get(words[i]);    
  wordCount.put(words[i], ++n);
}
// Otherwise, puts the word into the HashMap
else {
  wordCount.put(words[i], 1);
}

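If the word is already in the hashmap, we want to get its current count, add 1 to it, and replace the word's entry in the hashmap with the new count.

If the word is not in the hashmap yet, we simply put it in the map with a count of 1 to start with. The next time we see the same word, we'll increase the count to 2, and so on.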
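For comparison, on Java 8 or later the same "insert 1 or add 1" logic can be written with Map.merge. This is an alternative, not what the answer above uses; the class name MergeCount and the method name count are made up for this sketch.

```java
import java.util.HashMap;
import java.util.Map;

class MergeCount {

    // Hypothetical helper: counts how often each word occurs in the array.
    static Map<String, Integer> count(String[] words) {
        Map<String, Integer> wordCount = new HashMap<>();
        for (String word : words) {
            // Absent key: store 1. Present key: store oldValue + 1.
            wordCount.merge(word, 1, Integer::sum);
        }
        return wordCount;
    }

    public static void main(String[] args) {
        System.out.println(count("the cat and the hat".split(" ")));
    }
}
```

The explicit containsKey/get/put version does the same work and may be easier to follow when learning how the map behaves.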

【Discussion】:

That makes sense. I think I had the logic, but I either wasn't paying attention or was just being dumb in the execution. Thanks!

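【Solution 3】:

If you split only on spaces, the words will contain other symbols (parentheses, punctuation, etc.). For example, given "This phrase, contains... funny stuff", splitting on spaces gives you "This", "phrase,", "contains...", "funny" and "stuff".

You can avoid this by splitting on word boundaries (\b).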

line.split("\\b");
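One thing to be aware of: splitting on \b also returns the pieces that sit between the words (such as " " and ", "), so those tokens still have to be filtered out. The sketch below shows one way to do that; the class name BoundarySplit and the method name words are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class BoundarySplit {

    // Hypothetical helper: splits on word boundaries and keeps only the word tokens.
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        for (String token : line.split("\\b")) {
            // Separator tokens such as " " or "... " do not start with a letter or digit.
            if (!token.isEmpty() && Character.isLetterOrDigit(token.charAt(0))) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [This, phrase, contains, funny, stuff]
        System.out.println(words("This phrase, contains... funny stuff"));
    }
}
```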

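By the way, your if and else branches are identical. You always increment freq by one, which doesn't make much sense. If the word is already in the map, you want to get its current frequency, add 1 to it, and update the frequency in the map. If not, put it in the map with a value of 1.

Pro tip: always print/log the full stack trace of an exception.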
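As a rough illustration of the tip, calling printStackTrace() on the caught exception prints its type, its message, and the chain of method calls that led to it, whereas System.err.print(e) prints only the type and message. The class name StackTraceDemo and the method name firstLine are made up for this sketch.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

class StackTraceDemo {

    // Hypothetical helper: returns the first line of the file, or null if it cannot be read.
    static String firstLine(String path) {
        // try-with-resources closes the reader even when an exception is thrown
        try (BufferedReader br = new BufferedReader(new FileReader(path))) {
            return br.readLine();
        } catch (IOException e) {
            // Writes the exception and the calls that led to it to System.err
            e.printStackTrace();
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(firstLine("docs/train/eng/1.txt"));
    }
}
```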

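【Discussion】:

What do you mean by "full stack trace"? I'm new to coding and I'm not up on the "hip lingo".
fr33kk0mpu73r.blogspot.com.es/2013/11/…
I would use line.split("[^a-zA-Z]+"); and then call toLowerCase() on all the words.
Thanks for the tip. My professor said we don't have to worry about punctuation or lowercase/uppercase. The program I'm writing is meant to determine whether a given .txt file is written in English or French, and there are surely enough simple lowercase words to infer that from.
@SpiderPig You're almost right. However, it won't work if there are any apostrophe characters. For example: can't - it will split it into 2: can, t.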
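One possible way to combine the last two comments is to add the apostrophe to the set of characters that count as part of a word, then lowercase each token. This is a sketch under that assumption, not something proposed verbatim in the thread; the class name LetterSplit and the method name words are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LetterSplit {

    // Hypothetical helper: extracts lowercase words, keeping apostrophes inside them.
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        // Any run of characters other than ASCII letters or ' separates two words.
        for (String token : line.split("[^a-zA-Z']+")) {
            if (!token.isEmpty()) { // a line that starts with a separator yields one empty token
                result.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [can't, stop, won't, stop]
        System.out.println(words("Can't stop, won't STOP!"));
    }
}
```

Because the pattern only knows ASCII letters, accented letters in the French files would be treated as separators; a Unicode letter class such as \p{L} in place of a-zA-Z is one option to consider if that matters.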