Counting the number of word occurrences in a file [closed]: answers

【Question title】: Counting the number of word occurrences in a file [closed]
【Posted】: 2015-06-14 11:23:14
【Question description】:

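Suppose we have a txt file and want to know how many times each word in it occurs. I used the following code, but it does not work: it gives 1 for every value. First, I read the txt file and write each word on a separate line. At the same time, I put the words into an ArrayList. Then I read the first line of the txt file, take the first element of the ArrayList, and compare it against the whole txt file. If there is a match, the array holding the occurrence counts is incremented by one. Then I take the second ArrayList item, and so on, until the end of the ArrayList is reached.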

private static void count(String text) throws FileNotFoundException, IOException {

    FileOutputStream thewords = new FileOutputStream(Check);

    ArrayList<String> keyArrayList = new ArrayList<String>();
    int countWord = 0;

    StringTokenizer tokenizer = new StringTokenizer(text);

    while (tokenizer.hasMoreTokens())
    {
        String nextWord = tokenizer.nextToken();
        keyArrayList.add(nextWord);
        thewords.write(nextWord.getBytes());
        thewords.write(System.getProperty("line.separator").getBytes());

        countWord++;
    }

    int[] numbOfOccurance = new int[countWord];

    BufferedReader br = new BufferedReader(new FileReader(Check));
    String readline;
    for (int loopIndex = 0; loopIndex < countWord; loopIndex++)
    {
        readline = br.readLine();
        String test = keyArrayList.get(loopIndex);
        if (test.equals(readline))
        {
            numbOfOccurance[loopIndex]++;
        }
    }
    // (the rest of the method is not shown in the question)
}

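【Question comments】:

Use a HashMap where the String is your word and the Integer is your count.
@Pratik On which line?
Side note: from the javadoc, StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.
Read each word of the file. Check whether it is already in the HashMap; if it is, get the count using the word as the key, increment it by 1, and put it back under the same key. If it is not in the HashMap, insert the word as the key with a count of 1.
@sp00m So you are saying the problem is caused by StringTokenizer?

Tags: java arrays string file arraylist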

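【Solution 1】:

Your approach is very slow: you have to search the whole ArrayList to determine whether a word occurs more than once.

In addition, StringTokenizer is a legacy class whose use is discouraged in new code (it is not formally marked as deprecated).

May I suggest the following approach: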

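import static java.util.function.Function.identity;
import static java.util.stream.Collectors.toMap;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public static void main(String[] args) throws Exception {
    final Path path = Paths.get("path", "to", "file");
    final Map<String, Integer> counts = countOccurrences(path);
}

private static Map<String, Integer> countOccurrences(Path path) throws IOException {
    // One or more characters that are neither ASCII letters nor apostrophes
    final Pattern pattern = Pattern.compile("[^A-Za-z']+");
    try (final Stream<String> lines = Files.lines(path)) {
        return lines
                .flatMap(pattern::splitAsStream)
                // splitting can yield empty strings (blank lines, leading punctuation); skip them
                .filter(w -> !w.isEmpty())
                .collect(toMap(identity(), w -> 1, Integer::sum));
    }
}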

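This uses the Java 8 Stream API to read lines from the file. It then splits each line on [^A-Za-z']+, i.e. runs of characters that are neither ASCII letters nor apostrophes, using flatMap to create a single Stream of individual words.

We then collect the words into a Map, putting 1 into the Map for each word. The merge function Integer::sum adds that 1 to any value already in the Map for the same word.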
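To make the role of the merge function concrete, here is a minimal sketch that applies the same three-argument toMap collector to a small in-memory word list instead of a file. The class name ToMapMergeDemo, the helper count, and the sample words are assumptions made for illustration only; they do not appear in the answer.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class ToMapMergeDemo {

    // Counts words from an in-memory list (hypothetical helper, not from the answer).
    static Map<String, Integer> count(String... words) {
        return Arrays.stream(words)
                .collect(Collectors.toMap(
                        Function.identity(), // key: the word itself
                        w -> 1,              // value: each occurrence contributes 1
                        Integer::sum));      // merge: add the values when a key repeats
    }

    public static void main(String[] args) {
        // TreeMap is used only to get a predictable print order.
        System.out.println(new TreeMap<>(count("to", "be", "or", "not", "to", "be")));
        // prints {be=2, not=1, or=1, to=2}
    }
}
```

Without the third argument, toMap throws an IllegalStateException as soon as a key repeats, which is why the merge function is needed for counting.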

You can then list the contents of the Map, sorted by number of occurrences, with:

counts.entrySet().stream()
        .sorted(Map.Entry.comparingByValue())
        .map(e -> String.format("%s -> %s", e.getKey(), e.getValue()))
        .forEach(System.out::println);
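The snippet above sorts in ascending order, so the least frequent words come first. If the most frequent words should come first, a reversed comparator can be passed instead. The following is a sketch under that assumption; the class name SortByCountDemo, the helper mostFrequentFirst, the alphabetical tie-break, and the sample counts are all illustrative and not part of the answer.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class SortByCountDemo {

    // Formats entries as "word -> count", most frequent first; ties are broken alphabetically.
    static List<String> mostFrequentFirst(Map<String, Integer> counts) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .map(e -> String.format("%s -> %d", e.getKey(), e.getValue()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>(); // sample data, made up for the demo
        counts.put("the", 3);
        counts.put("cat", 1);
        counts.put("sat", 2);
        mostFrequentFirst(counts).forEach(System.out::println);
        // prints:
        // the -> 3
        // sat -> 2
        // cat -> 1
    }
}
```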

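【Comments】:

Wow! Very clever solution.
Very complicated, though.
@lonesome Not really. It is five lines of compact code. It is all standard Java API usage, so I would expect anyone familiar with Java to see at a glance what it does. I can't say the same for your solution. I strongly recommend learning the Stream API, and learning it well...
Thanks. But there is a problem with the last line. It says it cannot find toMap.
Now I have imported the library, and it says the reference in the last line is ambiguous.

【Solution 2】:

As @Pratik pointed out first, this is a classic use of a HashMap. You only need to go through the list once.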

 HashMap<String, Integer> wordMap = new HashMap<String, Integer>();
 StringTokenizer tokenizer =new StringTokenizer(text) ;

 while(tokenizer.hasMoreTokens())
 {
     String nextWord=tokenizer.nextToken();
     Integer count = wordMap.get(nextWord); 
     if (count  == null){
        wordMap.put(nextWord, 1);
     }
     else{
         wordMap.put(nextWord, count + 1);
     }
 }

 //Print word count
 for (String key : wordMap.keySet()) {
    System.out.println(key + " count: " + wordMap.get(key));
 }
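The same single-pass count can also be written without StringTokenizer, using String.split as the javadoc recommends, and with Map.merge (available since Java 8) in place of the explicit null check. This is only a sketch of that variant: the class name SplitMergeDemo and the helper countWords are hypothetical, and splitting on the regex \s+ is an assumption that approximates StringTokenizer's default whitespace delimiters.

```java
import java.util.HashMap;
import java.util.Map;

class SplitMergeDemo {

    // Counts whitespace-separated words in a single pass over the text.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> wordMap = new HashMap<>();
        // trim() avoids an empty first token when the text starts with whitespace
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return wordMap; // "".split(...) would otherwise yield one empty token
        }
        for (String word : trimmed.split("\\s+")) {
            // insert 1 for a new word, otherwise add 1 to the stored count
            wordMap.merge(word, 1, Integer::sum);
        }
        return wordMap;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = countWords("the cat and the hat");
        System.out.println(counts.get("the")); // prints 2
        System.out.println(counts.get("cat")); // prints 1
    }
}
```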

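As for why your current implementation does not work:

I don't think it is feasible with arrays alone. With your current code you create an int array sized to the total number of words rather than the number of distinct words. Even if you used an ArrayList<Integer> to dynamically add an entry for each new word encountered, you would still need to loop through the whole list to process one word. Besides, how would you keep track of which word corresponds to which entry in the Integer array?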
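As far as can be told from the code posted in the question, every count ends up as 1 because the loop compares the word at index i only with line i of the file it has just written, and that line is the same word. The sketch below models this in memory. The class name AllOnesDemo and both helper methods are hypothetical, and the file is replaced by the word list itself on the assumption that it holds one word per line in the same order.

```java
import java.util.Arrays;
import java.util.List;

class AllOnesDemo {

    // Mirrors the comparison in the question: word i is checked only against line i.
    static int[] positionalCounts(List<String> words) {
        // Assumption: the written file holds the same words in the same order,
        // so it is modelled here by the list itself.
        List<String> linesOfFile = words;
        int[] counts = new int[words.size()];
        for (int i = 0; i < words.size(); i++) {
            if (words.get(i).equals(linesOfFile.get(i))) {
                counts[i]++; // always reached: a word equals itself
            }
        }
        return counts;
    }

    // What the question's description intends: compare word i against every line.
    static int[] fullScanCounts(List<String> words) {
        int[] counts = new int[words.size()];
        for (int i = 0; i < words.size(); i++) {
            for (String line : words) {
                if (words.get(i).equals(line)) {
                    counts[i]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a");
        System.out.println(Arrays.toString(positionalCounts(words))); // prints [1, 1, 1]
        System.out.println(Arrays.toString(fullScanCounts(words)));   // prints [2, 1, 2]
    }
}
```

The full scan gives correct counts but repeats them for duplicate words and takes time quadratic in the number of words, which is the cost the HashMap approach avoids.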

【Comments】:

That's right. But what is wrong with my code that makes it count incorrectly?
Oh, yes. That was the point I forgot.