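How to sort very large files: answers

【Question title】: How do I sort very large files
【Posted】: 2011-12-16 14:41:54
【Question description】:

I have some files that should be sorted according to the id at the beginning of each line. The files are about 2-3 GB.

I tried to read all the data into an ArrayList and sort it, but there is not enough memory to hold all of it, so that does not work.

The lines look like this: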

0052304 0000004000000000000000000000000000000041 John Teddy 000023
0022024 0000004000000000000000000000000000000041 George Clan 00013

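How can I sort the files?

【Question comments】:

If you use a recent version of Java 6, you will need about 4 GB of memory. I guess you don't have that much?
What if you read only the ids into an ArrayList and sort those?

Tags: java file sorting

【Solution 1】:

This isn't exactly a Java problem. You need to look into an efficient algorithm for sorting data that is never read completely into memory. A few adaptations to merge sort can achieve this.

Take a look at: http://en.wikipedia.org/wiki/Merge_sort

and: http://en.wikipedia.org/wiki/External_sorting

Basically, the idea is to break the file into smaller pieces, sort each piece (with merge sort or another method), and then use the merge step from merge sort to create the new, sorted file.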
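
The idea can be sketched in Java as follows. This is a minimal illustration, not a tuned implementation: the class name ExternalSort, the maxLines parameter, and the temporary file names are all hypothetical, and lines are compared as whole strings, which orders them by id only on the assumption that the ids are zero-padded to a fixed width as in the sample lines.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class ExternalSort {
    // Head of one sorted run: the current line plus the reader it came from.
    static final class Head implements Comparable<Head> {
        final String line;
        final BufferedReader reader;
        Head(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
        public int compareTo(Head other) { return line.compareTo(other.line); }
    }

    // Phase 1: read at most maxLines lines at a time, sort them in memory,
    // and write each sorted batch to its own temporary file (a "run").
    static List<Path> splitIntoSortedRuns(Path input, int maxLines) throws IOException {
        List<Path> runs = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> batch = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                batch.add(line);
                if (batch.size() == maxLines) {
                    runs.add(writeRun(batch));
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) runs.add(writeRun(batch));
        }
        return runs;
    }

    static Path writeRun(List<String> batch) throws IOException {
        Collections.sort(batch);
        Path run = Files.createTempFile("run", ".txt");
        Files.write(run, batch);
        return run;
    }

    // Phase 2: k-way merge. Only the current head line of each run is kept in memory.
    static void mergeRuns(List<Path> runs, Path output) throws IOException {
        PriorityQueue<Head> heads = new PriorityQueue<>();
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            for (Path run : runs) {
                BufferedReader reader = Files.newBufferedReader(run);
                readers.add(reader);
                String first = reader.readLine();
                if (first != null) heads.add(new Head(first, reader));
            }
            while (!heads.isEmpty()) {
                Head smallest = heads.poll();
                out.write(smallest.line);
                out.newLine();
                String next = smallest.reader.readLine();
                if (next != null) heads.add(new Head(next, smallest.reader));
            }
        } finally {
            for (BufferedReader reader : readers) reader.close();
            for (Path run : runs) Files.delete(run);
        }
    }

    static void sort(Path input, Path output, int maxLines) throws IOException {
        mergeRuns(splitIntoSortedRuns(input, maxLines), output);
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ExternalSort <input> <output>; one million lines per run is an arbitrary choice.
        sort(Paths.get(args[0]), Paths.get(args[1]), 1_000_000);
    }
}
```

The batch size bounds the memory used in the first phase, and the merge phase needs only one buffered reader per run, so the input can be far larger than the heap.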

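【Comments】:

1) Are you saying that we sort each individual small file and write the sorted data back to each file, then read from those files again and write into a new, larger final file? 2) Will this scale to a 16 GB flat file on 8 GB of RAM, or will we run into memory problems in any case with the above solution?
Yes, that's the basic idea. It will work because you never load all the files into memory; you only need to look at the beginning of each file (similar to merge sort).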

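【Solution 2】:

Since your records are already in a flat-file text format, you can pipe them into UNIX sort(1), e.g. sort -n -t' ' -k1,1 < input > output. It will automatically chunk the data and perform a merge sort using available memory and /tmp. If the data does not fit in memory, sort spills to temporary files; add -T /tmpdir to the command to point it at a directory with enough free space.

It's funny that everyone tells you to download huge C# or Java libraries or to implement merge sort yourself when you can use a tool that is available on every platform and has been around for decades.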
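
If the sort still has to be started from a Java program, the same command can be launched with ProcessBuilder. The following is a sketch only: the class UnixSort and its methods are hypothetical names, it assumes a sort executable is on the PATH, and it uses the -o option to name the output file because shell redirection such as > is not available without a shell.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UnixSort {
    // Builds: sort -n -t ' ' -k1,1 [-T tmpDir] -o output input
    static List<String> command(String input, String output, String tmpDir) {
        List<String> cmd = new ArrayList<>(Arrays.asList("sort", "-n", "-t", " ", "-k1,1"));
        if (tmpDir != null) {
            cmd.add("-T");      // directory for sort's temporary files
            cmd.add(tmpDir);
        }
        cmd.add("-o");          // write the result to this file
        cmd.add(output);
        cmd.add(input);
        return cmd;
    }

    // Runs sort and waits for it; inheritIO() forwards any error messages to the console.
    static void run(String input, String output, String tmpDir) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(input, output, tmpDir)).inheritIO().start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("sort exited with code " + exitCode);
        }
    }

    public static void main(String[] args) throws Exception {
        // Usage: java UnixSort <input> <output> [tmpDir]
        run(args[0], args[1], args.length > 2 ? args[2] : null);
    }
}
```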

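【Comments】:

Sorry for the downvote, but the original question is tagged java and says nothing about using *nix.
I think this is the best answer, even taking the Java tag into account. The OP said he needs to sort some files, not that he must sort them with Java. Even if the OP is on Windows, he can still easily get a sort executable.

【Solution 3】:

You need an external merge sort to do this. Here is a Java implementation of it that sorts very large files.

【Comments】:

It seems to be the only available Java library with this functionality. Have you used it in production?
I just used this library to sort a 24 GB CSV file (about 850 million lines of text data) and it worked very well. It was straightforward to use a custom comparator to specify how I wanted it sorted. So I can definitely recommend this implementation.

【Solution 4】:

Instead of loading all the data into memory at once, you could read just the key and an index of where each line starts (and possibly its length), for example: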

class Line {
   int key, length;
   long start;
}

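This would use about 40 bytes per line.

Once you have sorted this array, you can use RandomAccessFile to read the lines in the order in which they appear in the sorted array.

Note: because you will be accessing the disk randomly instead of using memory, this could be very slow. A typical disk takes 8 ms for a random access, so with 10 million lines this would take about a day (10,000,000 × 8 ms = 80,000 s, roughly 22 hours). That is the absolute worst case. In memory it would take about 10 seconds.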
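
A minimal sketch of this approach is shown below. The class IndexSort and its methods are hypothetical, and it assumes that every line starts with a decimal id small enough for an int and that lines end with '\n'. The first pass records one small entry per line; the second pass seeks to each line in key order and copies it to the output.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class IndexSort {
    // One entry per line: the numeric id, where the line starts, and its length in bytes.
    static final class Line {
        final int key;
        final long start;
        final int length;
        Line(int key, long start, int length) { this.key = key; this.start = start; this.length = length; }
    }

    // Pass 1: scan the file sequentially and record (key, start, length) for every line.
    static List<Line> buildIndex(Path file) throws IOException {
        List<Line> index = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            long pos = 0;          // number of bytes consumed so far
            long start = 0;        // offset of the current line
            int key = 0;           // id parsed from the leading digits
            boolean inKey = true;  // still reading the leading digits?
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n') {
                    // The recorded length excludes the trailing '\n'.
                    index.add(new Line(key, start, (int) (pos - 1 - start)));
                    start = pos;
                    key = 0;
                    inKey = true;
                } else if (inKey) {
                    if (b >= '0' && b <= '9') key = key * 10 + (b - '0');
                    else inKey = false;
                }
            }
            // Last line without a trailing newline.
            if (pos > start) index.add(new Line(key, start, (int) (pos - start)));
        }
        return index;
    }

    // Pass 2: sort the small index in memory, then fetch each line with a random read.
    static void writeSorted(Path input, List<Line> index, Path output) throws IOException {
        index.sort(Comparator.comparingInt((Line line) -> line.key));
        try (RandomAccessFile raf = new RandomAccessFile(input.toFile(), "r");
             OutputStream out = new BufferedOutputStream(Files.newOutputStream(output))) {
            for (Line line : index) {
                byte[] buffer = new byte[line.length];
                raf.seek(line.start);
                raf.readFully(buffer);
                out.write(buffer);
                out.write('\n');
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java IndexSort <input> <output>
        Path input = Paths.get(args[0]);
        writeSorted(input, buildIndex(input), Paths.get(args[1]));
    }
}
```

The one seek per line in the second pass is exactly the random disk access that the note above warns about.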

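【Comments】:

【Solution 5】:

You need to perform an external sort. It is the driving idea behind Hadoop/MapReduce, except that the version here does not consider a distributed cluster and works on a single node.

For better performance, you should use Hadoop/Spark.

Change these lines according to your system. fPath is your one big input file (tested with 20 GB). The shared path is where the execution log is stored. fdir is where the intermediate files are stored and merged. Change these paths to suit your machine.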

    public static final String fdir = "/tmp/";
    public static final String shared = "/exports/home/schatterjee/cs553-pa2a/";
    public static final String fPath = "/input/data-20GB.in";
    public static final String opLog = shared+"Mysort20GB.log";

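Then run the following program. Your final sorted file will be created in the fdir path with the name op401. The valsort call at the end of main checks whether the output is sorted. If you don't have valsort installed, or the input file was not generated with gensort (http://www.ordinal.com/gensort.html), remove that call.

Also, don't forget to change int totalLines = 200000000; to the total number of lines in your file. The thread count (int threadCount = 16) should always be a power of 2 and large enough that (total size * 2 / thread count) of data can reside in memory. Changing the thread count changes the name of the final output file: for 16 it will be op401, for 32 it will be op501, for 8 it will be op301, and so on.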
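
The naming rule can be checked with a few lines of Java. This is a sketch that assumes, as the examples above suggest, that each merge level adds 100 to the file number and that the last level produces a single file numbered 1; finalOutputName is a hypothetical helper and not part of the program.

```java
class OutputName {
    // Name of the last merged file for a power-of-two thread count.
    static String finalOutputName(int threadCount) {
        if (threadCount < 2 || Integer.bitCount(threadCount) != 1) {
            throw new IllegalArgumentException("threadCount must be a power of 2 and at least 2");
        }
        // For a power of two, the number of trailing zero bits equals log2(threadCount),
        // which is the number of merge levels.
        int treeHeight = Integer.numberOfTrailingZeros(threadCount);
        return "op" + (treeHeight * 100 + 1);
    }

    public static void main(String[] args) {
        // Prints op301, op401 and op501 for 8, 16 and 32 threads.
        for (int threads : new int[] {8, 16, 32}) {
            System.out.println(threads + " threads -> " + finalOutputName(threads));
        }
    }
}
```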

Enjoy.

    import java.io.*;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.stream.Stream;


    class SplitFile extends Thread {
        String fileName;
        int startLine, endLine;

        SplitFile(String fileName, int startLine, int endLine) {
            this.fileName = fileName;
            this.startLine = startLine;
            this.endLine = endLine;
        }

        public static void writeToFile(BufferedWriter writer, String line) {
            try {
                writer.write(line + "\r\n");
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        public void run() {
            int totalLines = endLine + 1 - startLine;
            // try-with-resources closes the writer and the line stream even if sorting fails.
            // Note: sorted() buffers this thread's whole chunk in memory before anything is written.
            try (BufferedWriter writer = Files.newBufferedWriter(Paths.get(fileName));
                 Stream<String> lines = Files.lines(Paths.get(Mysort20GB.fPath))) {
                lines.skip(startLine - 1)
                        .limit(totalLines)
                        .sorted(Comparator.naturalOrder())
                        .forEach(line -> writeToFile(writer, line));
                System.out.println(" Done Writing " + Thread.currentThread().getName());
            } catch (Exception e) {
                System.out.println(e);
            }
        }
    }
    }

    class MergeFiles extends Thread {
        String file1, file2, file3;
        MergeFiles(String file1, String file2, String file3) {
            this.file1 = file1;
            this.file2 = file2;
            this.file3 = file3;
        }

        public void run() {
            System.out.println(file1 + " Started Merging " + file2 );
            // try-with-resources closes all three files even if the merge fails.
            try (BufferedReader bufferedReader1 = new BufferedReader(new FileReader(file1));
                 BufferedReader bufferedReader2 = new BufferedReader(new FileReader(file2));
                 BufferedWriter writer = new BufferedWriter(new FileWriter(file3))) {
                String line1 = bufferedReader1.readLine();
                String line2 = bufferedReader2.readLine();
                // Merge the two sorted files: always write the smaller head line next.
                while (line1 != null || line2 != null) {
                    if (line1 == null || (line2 != null && line1.compareTo(line2) > 0)) {
                        writer.write(line2 + "\r\n");
                        line2 = bufferedReader2.readLine();
                    } else {
                        writer.write(line1 + "\r\n");
                        line1 = bufferedReader1.readLine();
                    }
                }
            } catch (Exception e) {
                System.out.println(e);
                return; // keep the input files if the merge failed
            }
            System.out.println(file1 + " Done Merging " + file2 );
            // Delete the inputs only after the merged output has been closed.
            new File(file1).delete();
            new File(file2).delete();
        }
    }

    public class Mysort20GB {
        //public static final String fdir = "/Users/diesel/Desktop/";
        public static final String fdir = "/tmp/";
        public static final String shared = "/exports/home/schatterjee/cs553-pa2a/";
        public static final String fPath = "/input/data-20GB.in";
        public static final String opLog = shared+"Mysort20GB.log";

        public static void main(String[] args) throws Exception{
            long startTime = System.nanoTime();
            int threadCount = 16; // Number of threads
            int totalLines = 200000000;
            int linesPerFile = totalLines / threadCount;
            ArrayList<Thread> activeThreads = new ArrayList<Thread>();

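            for (int i = 1; i <= threadCount; i++) {
                int startLine = (i - 1) * linesPerFile + 1;
                // The last chunk also takes the remainder when totalLines is not divisible by threadCount.
                int endLine = (i == threadCount) ? totalLines : i * linesPerFile;
                SplitFile mapThreads = new SplitFile(fdir + "op" + i, startLine, endLine);
                activeThreads.add(mapThreads);
                mapThreads.start();
            }
            // Wait until every chunk has been sorted and written before merging.
            activeThreads.stream().forEach(t -> {
                try {
                    t.join();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });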

            // log2 of the power-of-two thread count, computed without floating-point rounding
            int treeHeight = Integer.numberOfTrailingZeros(threadCount);

            for (int i = 0; i < treeHeight; i++) {
                ArrayList<Thread> actvThreads = new ArrayList<Thread>();

                // Level i has threadCount / 2^i input files, merged in pairs.
                for (int j = 1, itr = 1; j <= (threadCount >> i); j += 2, itr++) {
                    int offset = i * 100;
                    String tempFile1 = fdir + "op" + (j + offset);
                    String tempFile2 = fdir + "op" + ((j + 1) + offset);
                    String opFile = fdir + "op" + (itr + ((i + 1) * 100));

                    MergeFiles reduceThreads =
                            new MergeFiles(tempFile1,tempFile2,opFile);
                    actvThreads.add(reduceThreads);
                    reduceThreads.start();
                }
                actvThreads.stream().forEach(t -> {
                    try {
                        t.join();
                    } catch (Exception e) {
                    }
                });
            }
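            long endTime = System.nanoTime();
            double timeTaken = (endTime - startTime)/1e9;
            System.out.println(timeTaken);
            try (BufferedWriter logFile = new BufferedWriter(new FileWriter(opLog, true))) {
                logFile.write("Time Taken in seconds:" + timeTaken);
                logFile.newLine();
            }
            // Parenthesize the arithmetic so the name is op401 and not "op400" followed by "1".
            String sortedFile = fdir + "op" + (treeHeight * 100 + 1);
            // exec() does not start a shell, so " > file" is not a redirect;
            // append valsort's output to the log through ProcessBuilder instead.
            new ProcessBuilder("valsort", sortedFile)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.appendTo(new File(opLog)))
                    .start()
                    .waitFor();
        }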
    }

【Comments】:

【Solution 6】:

The Java library big-sorter can be used to sort very large text or binary files.

Here is how your specific problem could be implemented:

// write the input to a file
String s = "0052304 0000004000000000000000000000000000000041   John Teddy   000023\n"
        + "0022024 0000004000000000000000000000000000000041   George Clan 00013";
File input = new File("target/input");
// the default options create the file if it is missing and truncate any existing content
Files.write(input.toPath(), s.getBytes(StandardCharsets.UTF_8));

File output = new File("target/output");


//sort the input
Sorter
    .serializerLinesUtf8()
    .comparator((a,b) -> {
        String ida = a.substring(0, a.indexOf(' '));
        String idb = b.substring(0, b.indexOf(' '));
        return ida.compareTo(idb);
    }) 
    .input(input) 
    .output(output) 
    .sort();

// display the output
Files.readAllLines(output.toPath()).forEach(System.out::println);

Output:

0022024 0000004000000000000000000000000000000041   George Clan 00013
0052304 0000004000000000000000000000000000000041   John Teddy   000023

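【Comments】:

【Solution 7】:

You can use a SQLite file database: load the data into the database, then let it sort and return the results for you.

Advantage: no need to worry about writing the best sorting algorithm.

Disadvantage: requires disk space and is slower to process.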

https://sites.google.com/site/arjunwebworld/Home/programming/sorting-large-data-files

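【Comments】:

【Solution 8】:

What you need to do is chunk the file via a stream and process the chunks separately. Then you can merge the files back together, since each one is already sorted; this is similar to how merge sort works.

The answers to this SO question will be of value: Stream large files

【Comments】:

【Solution 9】:

Operating systems come with a powerful file sorting utility. A simple function that calls a bash script should help.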

import java.io.File;
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Runs the given script with bash and fails if it is missing or exits with a non-zero code.
public static void runScript(final Logger log, final String scriptFile) throws IOException, InterruptedException {
    final File script = new File(scriptFile);
    if (!script.exists() || !script.canRead() || !script.canExecute()) {
        log.log(Level.SEVERE, "Cannot find or read " + scriptFile);
        log.log(Level.WARNING, "Make sure the file is executable and you have permissions to execute it. Hint: use \"chmod +x filename\" to make it executable");
        throw new IOException("Cannot find or read " + scriptFile);
    }
    // inheritIO() forwards the script's output, so a full pipe buffer cannot stall the child process
    final int returncode = new ProcessBuilder("bash", "-c", scriptFile).inheritIO().start().waitFor();
    if (returncode != 0) {
        log.log(Level.SEVERE, "The script returned an Error with exit code: " + returncode);
        throw new IOException("Script failed with exit code " + returncode);
    }
}

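【Comments】:

The main problems with an external sort are: (1) lack of OS portability, and (2) it is hard to sort Java objects, even with serialization.

【Solution 10】:

Using my own logic, I sorted a BIG JSON file in the following format: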

{"name":"hoge.finance","address":"0xfAd45E47083e4607302aa43c65fB3106F1cd7607"}

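The complete source code, along with test cases, is available at https://github.com/sitetester/token-sorter. The code is well documented and easy to understand.

It splits the input file into multiple smaller SORTED files (configurable) and then compares the data.

Pasting some comments here...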

// at this point, we have sorted data sets in respective files
// next, we will take first token from first file and compare it with tokens of all other files
// during comparison, if some token from other file is in sorted order, then we make it default/initial sorted token
// & jump to next file, since all remaining tokens in THAT file are already in sorted form
// at end of comparisons with all files, we remove it from specific file (so it's not compared next time) and put/append in final sorted file
// this process continues, until all entries are matched
// if some file has no entries, then we simply delete it (so it's not compared next time)
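
The procedure in these comments amounts to repeatedly scanning the first token of every sorted file and moving the smallest one to the output. The sketch below illustrates that selection loop with in-memory queues standing in for the files; it is written in Java for illustration, the class HeadScanMerge and the sample tokens are hypothetical, and none of it is taken from the token-sorter repository.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class HeadScanMerge {
    // Merges already-sorted runs by scanning the head of every run for the smallest token.
    // The input queues are consumed in the process.
    static List<String> merge(List<Deque<String>> runs) {
        List<Deque<String>> open = new ArrayList<>(runs);
        open.removeIf(Deque::isEmpty);           // a run with no entries is dropped
        List<String> result = new ArrayList<>();
        while (!open.isEmpty()) {
            Deque<String> best = open.get(0);    // start with the first run's head as the candidate
            for (Deque<String> run : open) {
                if (run.peekFirst().compareTo(best.peekFirst()) < 0) {
                    best = run;                  // a smaller head becomes the new candidate
                }
            }
            result.add(best.pollFirst());        // move the winner to the final output
            if (best.isEmpty()) {
                open.remove(best);               // exhausted runs are not compared again
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Deque<String>> runs = new ArrayList<>();
        runs.add(new ArrayDeque<>(Arrays.asList("apple", "mango")));
        runs.add(new ArrayDeque<>(Arrays.asList("banana", "zebra")));
        runs.add(new ArrayDeque<>(Arrays.asList("cherry")));
        // Prints [apple, banana, cherry, mango, zebra]
        System.out.println(merge(runs));
    }
}
```

Each output token costs one pass over the heads of all open runs, which is fine for a handful of files; with many files a priority queue would reduce that cost.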

【Comments】: