In Java, how do I efficiently trim 0's from the start and the end of a byte array (answers)

【Question title】: In Java, how do I efficiently trim 0's from the start and the end of a byte array
【Posted】: 2020-03-21 18:37:13
【Question】:

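For reasons beyond my control, I need to parse a huge file that has a large number of null bytes at both the start and the end, with only a small part (at most 5 KB) that is actually valid. This is the code I came up with: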

@NonNull
public static byte[] readFileToByteArray(@NonNull File file, boolean bTrimNulls) throws IOException {
    byte[] buffer = new byte[(int) file.length()];
    FileInputStream fis = null;
    try {
        fis = new FileInputStream(file);
        if (fis.read(buffer) == -1) {
            throw new IOException("EOF reached while trying to read the whole file");
        }
    } finally {
        closeSafely(fis);
    }
    if (!bTrimNulls) {
        return buffer;
    }
    int nFirstValidByteIndex = 0;
    for (int i = 0; i < buffer.length; i++) {
        if (buffer[i] != 0) {
            nFirstValidByteIndex = i;
            break;
        }
    }
    int nLastValidByteIndex = 0;
    for (int i = buffer.length - 1; i > 0; i--) {
        if (buffer[i] != 0) {
            nLastValidByteIndex = i;
            break;
        }
    }
    return copyBufferRange(buffer, nFirstValidByteIndex, nLastValidByteIndex + 1);
}

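Is there a better option?

Edit: the valid bytes in the buffer correspond to an XML file.

【Comments】:

Can there be null bytes in the middle of the file? I mean, after you hit the first non-zero byte and before you reach the last non-zero byte, can zero bytes appear?
Yes, there can be.
How big is huge? Is it guaranteed that there is no more than 5 KB of data?
One thing you could do is avoid copying the whole file into memory and instead filter out the zeros while you are reading them (at least at the start; the end may be trickier, though I suppose you could do that too if you know the payload is at most 5k).
Does fis.read (without a loop) reliably work like that? And isn't it free to stop copying at any point?

Tags: java arrays parsing trim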

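【Solution 1】:

The code is fine. For very large files, you could use a bounded buffer with a FileChannel, or a SeekableByteChannel with a ByteBuffer.

The code could just be a little nicer. Taking a Path parameter instead of a File would be more general and more modern.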

public static byte[] readFileToByteArray(@NonNull File file, boolean trimNulls)
        throws IOException {
    Path path = file.toPath();
    byte[] content = Files.readAllBytes(path);
    if (trimNulls) {
        int start = 0;
        while (start < content.length && content[start] == 0) {
            ++start;
        }
        int end = content.length;
        while (end > start && content[end - 1] == 0) {
            --end;
        }
        content = Arrays.copyOfRange(content, start, end);
    }
    return content;
}
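
The bounded-buffer idea mentioned at the top of this answer can be sketched roughly as follows. This is only an illustration, not the answer author's code: it swaps in RandomAccessFile (rather than FileChannel) because it is available on older Android versions, and the class name ChunkedTrim, the method names, and the 64 KB chunk size are all assumptions made for the example. It scans forward for the first non-zero byte, backward for the last one, and then reads only the range in between.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical helper: trims leading/trailing zero bytes without loading the whole file.
final class ChunkedTrim {
    private static final int CHUNK = 64 * 1024; // assumed chunk size, tune as needed

    /** Returns the bytes between the first and last non-zero byte, or an empty array. */
    static byte[] readTrimmed(String path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long length = raf.length();
            long start = firstNonZero(raf, length);
            if (start < 0) {
                return new byte[0]; // file is empty or all zeros
            }
            long end = lastNonZero(raf, length) + 1; // exclusive
            if (end - start > Integer.MAX_VALUE) {
                throw new IOException("Non-zero range too large for a byte array");
            }
            byte[] result = new byte[(int) (end - start)];
            raf.seek(start);
            raf.readFully(result);
            return result;
        }
    }

    /** Scans forward chunk by chunk; returns the offset of the first non-zero byte or -1. */
    static long firstNonZero(RandomAccessFile raf, long length) throws IOException {
        byte[] chunk = new byte[CHUNK];
        long pos = 0;
        while (pos < length) {
            int n = (int) Math.min(CHUNK, length - pos);
            raf.seek(pos);
            raf.readFully(chunk, 0, n);
            for (int i = 0; i < n; i++) {
                if (chunk[i] != 0) {
                    return pos + i;
                }
            }
            pos += n;
        }
        return -1;
    }

    /** Scans backward chunk by chunk; returns the offset of the last non-zero byte or -1. */
    static long lastNonZero(RandomAccessFile raf, long length) throws IOException {
        byte[] chunk = new byte[CHUNK];
        long pos = length;
        while (pos > 0) {
            int n = (int) Math.min(CHUNK, pos);
            raf.seek(pos - n);
            raf.readFully(chunk, 0, n);
            for (int i = n - 1; i >= 0; i--) {
                if (chunk[i] != 0) {
                    return pos - n + i;
                }
            }
            pos -= n;
        }
        return -1;
    }
}
```

Every byte of padding is still read once, so this mainly saves memory (one chunk plus the payload) rather than I/O.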

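【Comments】:

I used File because this will run on Android, and Files.readAllBytes() only exists on Android >= 8.X.X. I'll use it on newer versions, since it probably does a better job than FileInputStream + buffer. Thanks!

【Solution 2】:

Your code has time complexity n, which, as you say, may be too much for large files. Fortunately, we know the non-zero part has a maximum size of m, so we can search the file in steps of m. If we miss (that is, we hit a zero in the middle of the payload), we have to repeat the search until we find it. So if the probability of a zero in the payload is low enough, the complexity is roughly n/m.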

import java.util.Arrays;
import java.util.Random;

class Test
{

    public static int findNonZero(byte[] sparse, int max)
    {
        // looks quadratic but isn't in practice if the probability of zero in the payload is low, i.e. 1/256 for random values
        for(int offset=0;offset<max;offset++)
        {
            for(int i=0;(i+offset)<sparse.length; i+=max)
            {
                if(sparse[i+offset]!=0)
                {
                    return i+offset;                    
                }
            }
        }
         // in production code you could handle this differently but this is just an example
        throw new RuntimeException("Nonzero value not found");
    }

    public static byte[] trim(byte[] sparse, int max)
    {
        int index = findNonZero(sparse, max);
        // go to the left and go to the right until you find (max) zeroes
        int from = ...
        int to = ...
        return Arrays.copyOfRange(sparse, from, to);        
    }

    public static void main(String[] args)
    {
        // create test data
        int size = 5000;
        byte[] test = new byte[1_000_000_000];
        byte[] payload = new byte[size];
        Random r = new Random();
        r.nextBytes(payload);
        payload[0]=(byte)(r.nextInt(Byte.MAX_VALUE-1)+1); // ensure start isnt zero
        payload[payload.length-1]=(byte)(r.nextInt(Byte.MAX_VALUE-1)+1);  // ensure end isnt zero
        System.arraycopy(payload, 0, test, r.nextInt(test.length-size), size);

        System.out.println(Arrays.equals(payload,trim(test,size)));
    }
}

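I've left the last part for you: you need to go to the left and to the right until you find (max) zeros, and determine the from and to indices.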
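
For readers who want a starting point, here is one possible way to work out the from and to indices. It is a sketch under stated assumptions, not the answer author's intended solution: instead of counting runs of (max) zeros, it relies on the payload spanning at most max bytes and on everything outside the payload being zero, so the payload must lie within max - 1 bytes of any known non-zero index. The class and method names (TrimAround, trimAround) are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper: trims around a known non-zero index.
final class TrimAround {
    /**
     * Assumes sparse[index] != 0, max >= 1, the payload spans at most max bytes,
     * and every byte outside the payload is zero.
     */
    static byte[] trimAround(byte[] sparse, int index, int max) {
        // The payload cannot start earlier than index - (max - 1).
        int from = Math.max(0, index - (max - 1));
        while (from < index && sparse[from] == 0) {
            from++;
        }
        // The payload cannot end later than index + (max - 1).
        int to = (int) Math.min((long) sparse.length - 1, (long) index + max - 1);
        while (to > index && sparse[to] == 0) {
            to--;
        }
        return Arrays.copyOfRange(sparse, from, to + 1); // upper bound is exclusive
    }
}
```

Because both scans stop at the first non-zero byte they meet from the outside, zeros inside the payload are preserved.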

You can further improve practical performance by spacing successive offsets farther apart, e.g. offset_1 = 0, offset_2 = max/2, offset_3 = 1/4 max, offset_4 = 3/4 max, and so on.
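
A simple way to approximate that offset ordering is to halve the stride on each pass instead of shifting the offset by one. The sketch below is a simplified variant, not the answer author's code: it re-probes some positions already visited on earlier passes (which keeps it correct when max is not a power of two), and the final pass with stride 1 is a full scan, so the worst case is about 2n probes. The class name CoarseToFineProbe is hypothetical, and unlike the findNonZero above it returns -1 instead of throwing.

```java
// Hypothetical coarse-to-fine variant of findNonZero.
final class CoarseToFineProbe {
    /** Returns the index of some non-zero byte, or -1 if the array is all zeros. */
    static int findNonZero(byte[] sparse, int max) {
        if (max < 1) {
            throw new IllegalArgumentException("max must be >= 1");
        }
        // Strides max, max/2, max/4, ..., 1: each pass probes between earlier probe points.
        for (int stride = max; stride >= 1; stride /= 2) {
            // long index avoids int overflow near the end of very large arrays
            for (long i = 0; i < sparse.length; i += stride) {
                if (sparse[(int) i] != 0) {
                    return (int) i;
                }
            }
        }
        return -1;
    }
}
```

If the payload is dense and close to max bytes long, the first pass usually succeeds after roughly n/max probes.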

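【Comments】:

@Thilo: I hadn't finished writing yet. I've added the difference now. You're right that it isn't really a binary search; I'll revise that.
If we still have to read the file of size n, the time complexity can't really drop below O(n).
@Thilo: That's true. However, on many kinds of storage devices you can also access arbitrary parts of a file, and you could rewrite the code to seek directly to certain parts of the file. Whether that is more efficient depends on the specific values of n and m and on the device. But I also find the problem interesting in theory, where you can assume the array is already in memory.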

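【Solution 3】:

I think your solution is quite efficient. In effect, you are looking for the index of the first non-zero byte from each end of the array, and then creating a subarray of the data.

Why do you feel you need to improve your algorithm?

Beware: "Premature optimization is the root of all evil (or at least most of it) in programming", a quote by Donald Knuth.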

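【Comments】:

Because the code is actually slow, and this is code for an application framework rather than an actual application, so performance is a must. I'm also not too fond of that Donald Knuth quote. Very often the best time to optimize is while designing the algorithm, not later, even though I think I shouldn't sacrifice code readability unless it is necessary.
Are you sure the slow part is trimming the array, and not reading the whole file into memory?
Trimming isn't the slow part; finding the first and last indices is. I'll try searching in larger chunks instead of byte by byte.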
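
The "larger chunks" idea from the last comment could look something like the sketch below, which tests eight bytes at a time by reading them as a long through a ByteBuffer view of the array. This is an assumption about what the commenter had in mind, the names (WordScan, firstNonZero, endOfNonZero, trim) are hypothetical, and whether it actually beats the byte-by-byte loop depends on the JVM or Android runtime, so it should be measured before being adopted.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper: scans for non-zero bytes eight at a time.
final class WordScan {
    /** Index of the first non-zero byte, or data.length if there is none. */
    static int firstNonZero(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        int i = 0;
        // Skip whole 8-byte words that are entirely zero.
        while (i <= data.length - 8 && buf.getLong(i) == 0L) {
            i += 8;
        }
        // Finish byte by byte inside the first non-zero word (or the short tail).
        while (i < data.length && data[i] == 0) {
            i++;
        }
        return i;
    }

    /** Exclusive end index just past the last non-zero byte, or 0 if there is none. */
    static int endOfNonZero(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        int end = data.length;
        while (end >= 8 && buf.getLong(end - 8) == 0L) {
            end -= 8;
        }
        while (end > 0 && data[end - 1] == 0) {
            end--;
        }
        return end;
    }

    /** Copies the range between the first and last non-zero byte. */
    static byte[] trim(byte[] data) {
        int start = firstNonZero(data);
        int end = endOfNonZero(data);
        return start >= end ? new byte[0] : Arrays.copyOfRange(data, start, end);
    }
}
```

Byte order does not matter here because the long is only compared against zero.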