【Question title】: How to read duplicate word counts from a directory or folder
【Posted】: 2016-04-25 23:41:39
【Question】:

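I got the program below from a coding website.

The code reads a text file and finds duplicate words.

I want to read each text file and display its duplicate word counts line by line. Also, how do I read the files if they are not stored as a String? I tried a BufferedReader, but I did not get the output I expected.

My questions:

  1. How can I make the program read multiple files from a given folder?

  2. How can I save the results in Excel file format?

Any suggestions are welcome.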

import java.io.BufferedReader;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.Map.Entry;


public class MaxDuplicateWordCount {

    public Map<String, Integer> getWordCount(String fileName){

        FileInputStream fis = null;
        DataInputStream dis = null;
        BufferedReader br = null;
        Map<String, Integer> wordMap = new HashMap<String, Integer>();

        try {
            fis = new FileInputStream(fileName);
            dis = new DataInputStream(fis);
            br = new BufferedReader(new InputStreamReader(dis));
            String line = null; 
            while((line = br.readLine()) != null){
                StringTokenizer st = new StringTokenizer(line, " ");
                while(st.hasMoreTokens()){
                    String tmp = st.nextToken().toLowerCase();
                    if(wordMap.containsKey(tmp)){
                        wordMap.put(tmp, wordMap.get(tmp)+1);
                    } else {
                        wordMap.put(tmp, 1);
                    }
                }
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally{
            try{if(br != null) br.close();}catch(Exception ex){}
        }
        return wordMap;
    }

    public List<Entry<String, Integer>> sortByValue(Map<String, Integer> wordMap){

        Set<Entry<String, Integer>> set = wordMap.entrySet();
        List<Entry<String, Integer>> list = new ArrayList<Entry<String, Integer>>(set);
        Collections.sort( list, new Comparator<Map.Entry<String, Integer>>()
        {
            public int compare( Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2 )
            {
                return (o2.getValue()).compareTo( o1.getValue() );
            }
        } );
        return list;
    }

    public static void main(String a[]){



        MaxDuplicateWordCount mdc = new MaxDuplicateWordCount();
        Map<String, Integer> wordMap = mdc.getWordCount("E:\\Blog 39.txt");

        List<Entry<String, Integer>> list = mdc.sortByValue(wordMap);
        for(Map.Entry<String, Integer> entry:list){
            System.out.println(entry.getKey()+" ="+entry.getValue());
        }
    }
}

【Comments】:

    Tags: java arrays list duplicates


    【Solution 1】:

    Suppose you have a directory containing all the files you want to read.

    File folder = new File("/Users/you/folder/");
    File[] listOfFiles = folder.listFiles();
    
    for (File file : listOfFiles) {
    
        if (file.isFile()) {
            /*
             * Here, in case your file is not a text file.
             * If I understood you correctly:
             *      "And how to call that files if it is not stored as String"
             * you can read it as a byte[] and decode it to a String.
             */
            byte[] bytes = Files.readAllBytes(file.toPath());
            String decoded = new String(bytes, "UTF-8");
            String[] words = decoded.split("\\s+");
            for (int i = 0; i < words.length; i++) {
                /*  You may want to check for a non-word character before blindly
                 *  performing a replacement
                 *  It may also be necessary to adjust the character class
                 */
                 words[i] = words[i].replaceAll("[^\\w]", "");
                 //Here are all the words from a file. You can do whatever you want with them
             }
         }
    
    }
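
    The loop above can be turned into a self-contained program. The following is a minimal sketch, assuming every regular file in the folder is UTF-8 text; the names `FolderWordCount`, `countWords` and `countFolder` are made up for illustration. It uses `java.nio.file` so that a missing folder raises an exception instead of `listFiles()` returning null, and it strips punctuation with `\p{L}`/`\p{N}` because `\w` matches only ASCII characters by default, which would delete non-Latin words.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class FolderWordCount {

    // Counts lower-cased, whitespace-separated words in one file (assumed to be UTF-8 text).
    static Map<String, Integer> countWords(Path file) throws IOException {
        Map<String, Integer> counts = new HashMap<>();
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        for (String raw : text.split("\\s+")) {
            // Keep letters and digits of any script; adjust the character class as needed.
            String word = raw.replaceAll("[^\\p{L}\\p{N}]", "").toLowerCase(Locale.ROOT);
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns file name -> word counts for every regular file directly inside the folder.
    static Map<String, Map<String, Integer>> countFolder(Path folder) throws IOException {
        Map<String, Map<String, Integer>> result = new TreeMap<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(folder)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    result.put(entry.getFileName().toString(), countWords(entry));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder path taken from the snippet above; pass a real folder as args[0].
        Path folder = Paths.get(args.length > 0 ? args[0] : "/Users/you/folder/");
        countFolder(folder).forEach((name, counts) -> System.out.println(name + " -> " + counts));
    }
}
```

    Sub-directories are skipped rather than descended into; `Files.walk` would be the usual choice if nested folders also need to be read.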
    

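    【Comments】:

    • List list = new ArrayList(Arrays.asList("cat", "cat", "dog", "horse", "monkey", "zebra", "zebra", "dog", "dog", "dog", "flea")); List list2 = new ArrayList(); Can I load from a directory instead of this string? My code works for the given string. Can you make the code work for this?
    【Solution 2】: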

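    Introduction

    After chatting with the OP, here is a short summary of the OP's requirements:

    1- Read files from a specific folder; the files are usually Unicode text files.
    2- Process the files with the OP's algorithm from the question and save the results as Unicode files again (later the OP asked to save them as Excel files (.XLS), because Excel handles Unicode text)

    Solution

    This can be solved in the following steps:

    Step 1 We define (declare) our workspace.
    Step 2 We create the output folder in the workspace if it does not exist.
    Step 3 We read all existing files in the workspace folder and process them with the algorithm.
    Step 4 The result for each file is saved as an Excel file in the output folder.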
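
    The four steps can be outlined without reference to any Excel library. This is only a sketch under assumptions: `WorkspaceRunner`, `processWorkspace` and the `ResultWriter` callback are hypothetical names, the `.output.xls` suffix is chosen here for illustration, and step 4 is delegated to the callback so that POI (or any other writer) can be plugged in.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class WorkspaceRunner {

    // Hypothetical callback: receives one input file and the path its result should go to.
    interface ResultWriter {
        void write(Path input, Path output) throws IOException;
    }

    // Returns the output paths handed to the writer, in directory-listing order.
    static List<Path> processWorkspace(Path workspace, String outputFolderName, ResultWriter writer)
            throws IOException {
        // Step 1: the workspace is supplied by the caller.
        // Step 2: create the output folder; no error is raised if it already exists.
        Path outputDir = Files.createDirectories(workspace.resolve(outputFolderName));

        List<Path> written = new ArrayList<>();
        // Step 3: visit every regular file directly inside the workspace.
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(workspace)) {
            for (Path input : entries) {
                if (!Files.isRegularFile(input)) {
                    continue; // skips the output folder and any other sub-directory
                }
                // Step 4: one result file per input file, stored in the output folder.
                Path output = outputDir.resolve(input.getFileName() + ".output.xls");
                writer.write(input, output);
                written.add(output);
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // "C:/testfolder/" and "output" mirror the constants used in this answer.
        processWorkspace(Paths.get("C:/testfolder/"), "output",
                (in, out) -> System.out.println(in + " -> " + out));
    }
}
```

    Keeping the writer behind a callback means the folder handling can be exercised without the POI jar on the classpath.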

    Code

    First you need to import the POI package, which lets you create XLS sheets. I downloaded poi/poi-3.5-FINAL.jar.zip( 1,372 k); the following imports should then be added to your code.

    import org.apache.poi.hssf.usermodel.HSSFSheet;
    import org.apache.poi.hssf.usermodel.HSSFWorkbook;
    import org.apache.poi.hssf.usermodel.HSSFRow;
    

    Next, add the following code to your program; it is self-explanatory:

    final static String WORKSPACE = "C:/testfolder/";
    
    private static void createOutputFolder(String outputFolderName) {
        File outputDirectory = new File(WORKSPACE + outputFolderName);
    
        if (!outputDirectory.exists()) {
            try {
                outputDirectory.mkdir();
            } catch (Exception e) {
            }
        }
    }
    
    private static void exlCreator() {
    
        String outputFolder = "output/";
        String fileName, fileNameWPathInput;
        int serialNumber = 1;
        createOutputFolder(outputFolder);
    
        MaxDuplicateWordCount mdc = new MaxDuplicateWordCount();
        File folder = new File(WORKSPACE);
        File[] listOfFiles = folder.listFiles();
    
        for (int i = 0; i < listOfFiles.length; i++) {
            if (listOfFiles[i].isFile()) {
                fileName = listOfFiles[i].getName();
                fileNameWPathInput = WORKSPACE + fileName;
                Map<String, Integer> wordMap = mdc.getWordCount(fileNameWPathInput);
                List<Entry<String, Integer>> list = mdc.sortByValue(wordMap);
                String fileNameWPathOutput = WORKSPACE + outputFolder +
                        fileName.substring(0, fileName.length() - 4)
                        + "output.xls";
                try {
                    HSSFWorkbook workbook = new HSSFWorkbook();
                    HSSFSheet sheet = workbook.createSheet("ResultSheet");
    
                    HSSFRow rowhead = sheet.createRow((short) 0);
                    rowhead.createCell(0).setCellValue("Serial No.");
                    rowhead.createCell(1).setCellValue("Word");
                    rowhead.createCell(2).setCellValue("Count");
    
                    for (Map.Entry<String, Integer> entry : list) {
                        HSSFRow row = sheet.createRow((short) serialNumber);
                        row.createCell(0).setCellValue(serialNumber);
                        row.createCell(1).setCellValue(entry.getKey());
                        row.createCell(2).setCellValue(entry.getValue());
                        serialNumber++;
                    }
                    FileOutputStream fileOut = new FileOutputStream(fileNameWPathOutput);
                    workbook.write(fileOut);
                    fileOut.close();
                    serialNumber = 1;
                    System.out.println(fileNameWPathOutput + " is created");
    
                } catch (Exception ex) {
                    System.out.println(ex);
                }
            }
        }
    
    
    }
    
    public static void main(String [] args) throws IOException {
        exlCreator();
    }
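
    One fragile spot in the code above is `fileName.substring(0, fileName.length() - 4)`. It assumes every input name ends in a dot plus three characters, such as `.txt`; a name shorter than four characters throws `StringIndexOutOfBoundsException` (outside the `try` block), and `notes.text` is cut to `notes.`. A small helper that looks for the last dot avoids this. The sketch below is one possible approach; `OutputNames`, `stripExtension` and `outputName` are hypothetical names.

```java
class OutputNames {

    // Removes the last extension, if any; names such as ".profile" or "README" come back unchanged.
    static String stripExtension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot > 0 ? fileName.substring(0, dot) : fileName;
    }

    // Builds the result file name using the same "output.xls" suffix as the code above.
    static String outputName(String fileName) {
        return stripExtension(fileName) + "output.xls";
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Blog 39.txt", "notes.text", "a", "README"}) {
            System.out.println(name + " -> " + outputName(name));
        }
    }
}
```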
    

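    Finally

    By adjusting the code, you could instead create a single output file with each result on its own sheet. As the screenshot shows, the output file opens in Excel and displays the Unicode text without problems, which was the issue with my first solution.

    Links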

    Download POI
    POI documentation
    Unicode problem in CSV
    More about CSV

    Full code, as requested by the OP

    import java.io.*;
    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import java.util.StringTokenizer;
    import java.util.Map.Entry;
    // for the Excel sheet
    import org.apache.poi.hssf.usermodel.HSSFSheet;
    import org.apache.poi.hssf.usermodel.HSSFWorkbook;
    import org.apache.poi.hssf.usermodel.HSSFRow;
    
    public class MaxDuplicateWordCount {
    
        public Map<String, Integer> getWordCount(String fileName) {
    
            FileInputStream fis = null;
            DataInputStream dis = null;
            BufferedReader br = null;
            Map<String, Integer> wordMap = new HashMap<String, Integer>();
    
            try {
                fis = new FileInputStream(fileName);
                dis = new DataInputStream(fis);
                br = new BufferedReader(new InputStreamReader(dis));
                String line = null;
                while ((line = br.readLine()) != null) {
                    StringTokenizer st = new StringTokenizer(line, " ");
                    while (st.hasMoreTokens()) {
                        String tmp = st.nextToken().toLowerCase();
                        if (wordMap.containsKey(tmp)) {
                            wordMap.put(tmp, wordMap.get(tmp) + 1);
                        } else {
                            wordMap.put(tmp, 1);
                        }
                    }
                }
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            } finally {
                try {
                    if (br != null) br.close();
                } catch (Exception ex) {
                }
            }
            return wordMap;
        }
    
        public List<Entry<String, Integer>> sortByValue(Map<String, Integer> wordMap) {
    
            Set<Entry<String, Integer>> set = wordMap.entrySet();
            List<Entry<String, Integer>> list = new ArrayList<Entry<String, Integer>>(set);
            Collections.sort(list, new Comparator<Map.Entry<String, Integer>>() {
    
                public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
    
                    return (o2.getValue()).compareTo(o1.getValue());
                }
    
    
            });
            return list;
        }
    
        final static String WORKSPACE = "C:/testfolder/";
    
        private static void createOutputFolder(String outputFolderName) {
            File outputDirectory = new File(WORKSPACE + outputFolderName);
    
            if (!outputDirectory.exists()) {
                try {
                    outputDirectory.mkdir();
                } catch (Exception e) {
                }
            }
        }
    
        private static void exlCreator() {
    
            String outputFolder = "output/";
            String fileName, fileNameWPathInput;
            int serialNumber = 1;
            createOutputFolder(outputFolder);
    
            MaxDuplicateWordCount mdc = new MaxDuplicateWordCount();
            File folder = new File(WORKSPACE);
            File[] listOfFiles = folder.listFiles();
    
            for (int i = 0; i < listOfFiles.length; i++) {
                if (listOfFiles[i].isFile()) {
                    fileName = listOfFiles[i].getName();
                    fileNameWPathInput = WORKSPACE + fileName;
                    Map<String, Integer> wordMap = mdc.getWordCount(fileNameWPathInput);
                    List<Entry<String, Integer>> list = mdc.sortByValue(wordMap);
                    String fileNameWPathOutput = WORKSPACE + outputFolder +
                            fileName.substring(0, fileName.length() - 4)
                            + "output.xls";
                    try {
                        HSSFWorkbook workbook = new HSSFWorkbook();
                        HSSFSheet sheet = workbook.createSheet("ResultSheet");
    
                        HSSFRow rowhead = sheet.createRow((short) 0);
                        rowhead.createCell(0).setCellValue("Serial No.");
                        rowhead.createCell(1).setCellValue("Word");
                        rowhead.createCell(2).setCellValue("Count");
    
                        for (Map.Entry<String, Integer> entry : list) {
                            HSSFRow row = sheet.createRow((short) serialNumber);
                            row.createCell(0).setCellValue(serialNumber);
                            row.createCell(1).setCellValue(entry.getKey());
                            row.createCell(2).setCellValue(entry.getValue());
                            serialNumber++;
                        }
                        FileOutputStream fileOut = new FileOutputStream(fileNameWPathOutput);
                        workbook.write(fileOut);
                        fileOut.close();
                        serialNumber = 1;
                        System.out.println(fileNameWPathOutput + " is created");
    
                    } catch (Exception ex) {
                        System.out.println(ex);
                    }
                }
            }
    
    
        }
    
        public static void main(String[] args) throws IOException {
            exlCreator();
        }
    }
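
    Note that the program above writes every word to the sheet, including words that occur only once. If only repeated words are wanted, as the question title suggests, the entries can be filtered before sorting. A minimal sketch follows; `DuplicateFilter` and `duplicatesByCount` are hypothetical names, and breaking ties alphabetically is a choice made here, not something the original code does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

class DuplicateFilter {

    // Keeps only words seen more than once, highest count first, ties in alphabetical order.
    static List<Entry<String, Integer>> duplicatesByCount(Map<String, Integer> wordMap) {
        List<Entry<String, Integer>> list = new ArrayList<>();
        for (Entry<String, Integer> entry : wordMap.entrySet()) {
            if (entry.getValue() > 1) {
                list.add(entry);
            }
        }
        list.sort((a, b) -> {
            int byCount = b.getValue().compareTo(a.getValue()); // descending count
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return list;
    }

    public static void main(String[] args) {
        Map<String, Integer> wordMap = new HashMap<>();
        for (String word : "cat cat dog horse zebra zebra dog dog".split(" ")) {
            wordMap.merge(word, 1, Integer::sum);
        }
        System.out.println(duplicatesByCount(wordMap)); // prints [dog=3, cat=2, zebra=2]
    }
}
```

    The returned list has the same type as the result of `sortByValue`, so it could be passed to the row-writing loop unchanged.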
    

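    【Comments】:

    • Exception in thread "main" java.lang.Error: Unresolved compilation problems: Syntax error on token ",", TypeArgument1 expected after this token; Syntax error on token "=", ">" to complete ReferenceType2; Syntax error, insert "()" to complete Expression; Syntax error, insert ")" to complete MethodInvocation; Syntax error, insert ";" to complete Statement; Syntax error, insert "}" to complete MethodBody at ramki.maxoccurrence.main(maxoccurrence.java:38). I am getting these errors :/
    • Yes, I also changed the compiler to Java 8
    • @RamKi Please see my updated answer; it generates an Excel sheet instead of a CSV file.
    • Did you import POI?
    • That's right :)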