Efficient parsing of integers from substrings in Java

【Question title】: Efficient parsing of integers from substrings in Java
【Posted】: 2013-10-23 02:32:44
【Question】:

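AFAIK there is no efficient way in the standard Java libraries to parse an integer from a substring without actually newing up a new string that contains the substring.

I am parsing millions of integers out of strings, and I do not want to create a new string for each substring. The copying is overhead I do not need.

Given a string s, I would like a method along these lines: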

parseInteger(s, startOffset, endOffset)

with the same semantics as:

Integer.parseInt(s.substring(startOffset, endOffset))

Now, I know I could write something like this myself:

public static int parse(String s, int start, int end) {
    long result = 0;
    boolean foundMinus = false;

    while (start < end) {
        char ch = s.charAt(start);
        if (ch == ' ')
            /* ok */;
        else if (ch == '-') {
            if (foundMinus)
                throw new NumberFormatException();
            foundMinus = true;
        } else if (ch < '0' || ch > '9')
            throw new NumberFormatException();
        else
            break;
        ++start;
    }

    if (start == end)
        throw new NumberFormatException();

    while (start < end) {
        char ch = s.charAt(start);
        if (ch < '0' || ch > '9')
            break;
        result = result * 10 + (int) ch - (int) '0';
        ++start;
    }

    while (start < end) {
        char ch = s.charAt(start);
        if (ch != ' ')
            throw new NumberFormatException();
        ++start;
    }
    if (foundMinus)
        result *= -1;
    if (result < Integer.MIN_VALUE || result > Integer.MAX_VALUE)
        throw new NumberFormatException();
    return (int) result;
}

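But that is not the point. I would rather get this from a tested, supported third-party library. For example, parsing a long and handling Long.MIN_VALUE correctly is a bit subtle, and I cheat above by parsing the int into a long. The code above also still overflows if the parsed integer is larger than Long.MAX_VALUE.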
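
One common way to handle the Long.MIN_VALUE subtlety is to accumulate the value as a negative number, because the negative range of a long is one larger than the positive range. The following is only a minimal sketch of that idea, not a tested library routine. The class name `RangeParse` and the method `parseLong(CharSequence, int, int)` are hypothetical names chosen for illustration, and unlike the code above this sketch rejects surrounding spaces.

```java
class RangeParse {
    /** Hypothetical helper: parses a decimal long from s[start, end) without creating a substring. */
    static long parseLong(CharSequence s, int start, int end) {
        if (start < 0 || end > s.length() || start >= end) {
            throw new NumberFormatException("empty or invalid range");
        }
        int i = start;
        boolean negative = false;
        char first = s.charAt(i);
        if (first == '-' || first == '+') {
            negative = (first == '-');
            i++;
            if (i == end) {
                throw new NumberFormatException("sign without digits");
            }
        }
        // Accumulate negatively so that Long.MIN_VALUE is representable.
        long limit = negative ? Long.MIN_VALUE : -Long.MAX_VALUE;
        long multMin = limit / 10;
        long result = 0;
        while (i < end) {
            int digit = s.charAt(i++) - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a decimal digit");
            }
            // Check before each step that could overflow.
            if (result < multMin) {
                throw new NumberFormatException("overflow");
            }
            result *= 10;
            if (result < limit + digit) {
                throw new NumberFormatException("overflow");
            }
            result -= digit;
        }
        return negative ? result : -result;
    }

    public static void main(String[] args) {
        String line = "x=-9223372036854775808;";
        // Parse the digits between '=' and ';' in place.
        System.out.println(parseLong(line, 2, line.length() - 1));
    }
}
```

The two range checks run before the multiplication and before the subtraction, so the accumulator itself never wraps around.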

Is there such a library?

My searching has turned up very little.
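
For readers on a newer JDK than the one available when this was asked: as far as I know, Java 9 added overloads of `Integer.parseInt` and `Long.parseLong` that take a `CharSequence` plus begin and end indices and a radix, which is very close to the method requested above. A minimal sketch, assuming JDK 9 or later; the wrapper name `parseInteger` comes from the question and the class name is hypothetical. Note that, like `Integer.parseInt(String)`, this does not tolerate surrounding spaces.

```java
class SubrangeParseDemo {
    // Thin wrapper with the signature asked for in the question (requires Java 9+).
    static int parseInteger(String s, int startOffset, int endOffset) {
        return Integer.parseInt(s, startOffset, endOffset, 10);
    }

    public static void main(String[] args) {
        String s = "id=12345;delta=-678";
        int id = parseInteger(s, 3, 8);              // the characters "12345"
        int delta = parseInteger(s, 15, s.length()); // the characters "-678"
        System.out.println(id + delta);
    }
}
```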

【Comments on the question】:

I am tempted to do the whole thing in C and just use standard input and output.

Tags: java string parsing int

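【Answer 1】:

Have you profiled your application? Have you found the source of the problem?

Since Strings are immutable, there is a good chance that creating a substring needs very little memory and very few operations.

Unless you really are having trouble with memory, garbage collection and so on, just use the substring method. Do not look for complex solutions to problems you do not have.

Besides: if you implement something yourself, you may lose more efficiency than you gain. Your code does a lot of work and is fairly complex, whereas with the default implementation you can be fairly sure that it is relatively fast. And free of bugs.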

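【Comments】:

If you had read my question, you would know that I explicitly do not want to use my own code, and that I explained why it has bugs and why it is hard to get right. The key to efficiency when processing gigabytes of data is minimizing the number of operations per byte. The copying that String does (via Arrays.copyOfRange) currently stands out...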

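【Answer 2】:

If you are not experiencing an actual performance problem, do not worry too much about objects. Use a current JVM; performance and memory overhead keep improving.

If you want substrings that share the underlying string, you could look at "ByteString" from Google Protocol Buffers: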

https://developers.google.com/protocol-buffers/docs/reference/java/com/google/protobuf/ByteString#substring%28int,%20int%29

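【Comments】:

I am less bothered by GC than by the copying. These strings become garbage almost immediately, and in a nested loop with a static working set, GC should be almost free.
So try Google protobuf's 'ByteString', which does not create a new string for a substring.
The problem is not the substring, but the int / long parser that needs to consume the substring. ByteString would give me a ByteString back, and then my problem is parsing that...
If you actually look at the source of the Oracle JDK's Long#parseLong implementation, it looks quite efficient and should be well proven. Since it only works on Strings and not on Protobuf's ByteString, I would take that code and port it to ByteString. (The inner loop only counts forward, so an Iterable should work.) Since we are talking about digits, encoding should not be a problem either.
There may well be licensing problems with doing that, but it seems the answer to my question is no.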
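
The porting idea from the comments can be sketched roughly as follows. This is an independently written illustration, not the JDK's code. To stay self-contained it reads from a plain `byte[]` range instead of a protobuf ByteString, and it assumes the input is ASCII. The names `AsciiBytesParse` and `parseInt(byte[], int, int)` are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class AsciiBytesParse {
    /** Hypothetical helper: parses a decimal int from the ASCII bytes buf[start, end). */
    static int parseInt(byte[] buf, int start, int end) {
        if (start < 0 || end > buf.length || start >= end) {
            throw new NumberFormatException("empty or invalid range");
        }
        int i = start;
        boolean negative = (buf[i] == '-');
        if (negative && ++i == end) {
            throw new NumberFormatException("sign without digits");
        }
        // Accumulate negatively so that Integer.MIN_VALUE is representable.
        int limit = negative ? Integer.MIN_VALUE : -Integer.MAX_VALUE;
        int multMin = limit / 10;
        int result = 0;
        // The loop only moves forward, one byte at a time.
        while (i < end) {
            int digit = buf[i++] - '0'; // non-ASCII bytes are negative and get rejected
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a decimal digit");
            }
            if (result < multMin) {
                throw new NumberFormatException("overflow");
            }
            result *= 10;
            if (result < limit + digit) {
                throw new NumberFormatException("overflow");
            }
            result -= digit;
        }
        return negative ? result : -result;
    }

    public static void main(String[] args) {
        byte[] buf = "12,-2147483648,99".getBytes(StandardCharsets.US_ASCII);
        System.out.println(parseInt(buf, 0, 2));
        System.out.println(parseInt(buf, 3, 14));
        System.out.println(parseInt(buf, 15, 17));
    }
}
```

Because decimal digits occupy the same code points in ASCII and UTF-8, no decoding step is needed before the digit check.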

【Answer 3】:

I could not resist measuring how much your method improves things:

package test;

public class TestIntParse {

    static final int MAX_NUMBERS = 10000000;
    static final int MAX_ITERATIONS = 100;

    public static void main(String[] args) {
        long timeAvoidNewStrings = 0;
        long timeCreateNewStrings = 0;

        for (int i = 0; i < MAX_ITERATIONS; i++) {
            timeAvoidNewStrings += test(true);
            timeCreateNewStrings += test(false);
        }

        System.out.println("Average time method 'AVOID new strings': " + (timeAvoidNewStrings / MAX_ITERATIONS) + " ms");
        System.out.println("Average time method 'CREATE new strings': " + (timeCreateNewStrings / MAX_ITERATIONS) + " ms");
    }

    static long test(boolean avoidStringCreation) {
        long start = System.currentTimeMillis();

        for (int i = 0; i < MAX_NUMBERS; i++) {
            // Parenthesize before casting: "(int) Math.random()" on its own is always 0.
            String value = Integer.toString((int) (Math.random() * 100000));
            int intValue = avoidStringCreation ? parse(value, 0, value.length()) : parse2(value, 0, value.length());
            String value2 = Integer.toString(intValue);
            if (!value2.equals(value)) {
                System.err.println("Error at iteration " + i + (avoidStringCreation ? " without" : " with") + " string creation: " + value + " != " + value2);
            }
        }

        return System.currentTimeMillis() - start;
    }

    public static int parse2(String s, int start, int end) {
        return Integer.valueOf(s.substring(start, end));
    }

    public static int parse(String s, int start, int end) {
        long result = 0;
        boolean foundMinus = false;

        while (start < end) {
            char ch = s.charAt(start);
            if (ch == ' ')
                /* ok */;
            else if (ch == '-') {
                if (foundMinus)
                    throw new NumberFormatException();
                foundMinus = true;
            } else if (ch < '0' || ch > '9')
                throw new NumberFormatException();
            else
                break;
            ++start;
        }

        if (start == end)
            throw new NumberFormatException();

        while (start < end) {
            char ch = s.charAt(start);
            if (ch < '0' || ch > '9')
                break;
            result = result * 10 + ch - '0';
            ++start;
        }

        while (start < end) {
            char ch = s.charAt(start);
            if (ch != ' ')
                throw new NumberFormatException();
            ++start;
        }
        if (foundMinus)
            result *= -1;
        if (result < Integer.MIN_VALUE || result > Integer.MAX_VALUE)
            throw new NumberFormatException();
        return (int) result;
    }

}

Results:

Average time method 'AVOID new strings': 432 ms
Average time method 'CREATE new strings': 500 ms

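Your approach is roughly 14% faster (432 ms versus 500 ms) and presumably uses less memory, although the benchmark only measures time. It is also rather complex (and error-prone). From my point of view your approach does not pay off, although it might in your case.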
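
A comparison like the one above is sensitive to what sits inside the timed loop (here it also includes generating and formatting the numbers) and to JIT warm-up. The following is a rough sketch of a variant that generates the inputs up front and runs an untimed warm-up pass first. All names are hypothetical, the in-place parser is deliberately simplistic (non-negative values only, no overflow check), and a dedicated harness such as JMH would give more trustworthy numbers.

```java
import java.util.Random;

class ParseBenchmarkSketch {
    static final int PREFIX = "value=".length();

    // Build inputs of the form "value=<n>;" before any timing starts.
    static String[] makeInputs(int count, long seed) {
        Random rnd = new Random(seed);
        String[] inputs = new String[count];
        for (int i = 0; i < count; i++) {
            inputs[i] = "value=" + rnd.nextInt(100000) + ";";
        }
        return inputs;
    }

    static int parseViaSubstring(String s, int start, int end) {
        return Integer.parseInt(s.substring(start, end));
    }

    // Simplistic stand-in: digits only, no sign, no overflow check.
    static int parseInPlace(String s, int start, int end) {
        if (start >= end) {
            throw new NumberFormatException("empty range");
        }
        int result = 0;
        for (int i = start; i < end; i++) {
            int digit = s.charAt(i) - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at index " + i);
            }
            result = result * 10 + digit;
        }
        return result;
    }

    // Parse the number in every input and return the sum as a checksum.
    static long sumAll(String[] inputs, boolean inPlace) {
        long sum = 0;
        for (String s : inputs) {
            int end = s.length() - 1; // exclude the trailing ';'
            sum += inPlace ? parseInPlace(s, PREFIX, end) : parseViaSubstring(s, PREFIX, end);
        }
        return sum;
    }

    static long timeNanos(String[] inputs, boolean inPlace, int rounds) {
        long sink = 0;
        long t0 = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            sink += sumAll(inputs, inPlace);
        }
        long elapsed = System.nanoTime() - t0;
        if (sink == 42) {
            System.out.println("unlikely"); // keeps the computed result observable
        }
        return elapsed;
    }

    public static void main(String[] args) {
        String[] inputs = makeInputs(100_000, 42L);
        timeNanos(inputs, true, 20);  // warm-up, not reported
        timeNanos(inputs, false, 20); // warm-up, not reported
        System.out.println("in place:  " + timeNanos(inputs, true, 50) / 1_000_000 + " ms");
        System.out.println("substring: " + timeNanos(inputs, false, 50) / 1_000_000 + " ms");
    }
}
```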

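【Comments】:

This site must have a reading-comprehension problem. The code I wrote above took about 5 minutes and was not designed for performance... I said explicitly that I do not want to use it... I only included it to head off the amateurs who would try to write their own version.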