How to split a string by multiple separators and know which separator matched

【Question title】: How to split a string by multiple separators - and know which separator matched
【Posted】: 2024-01-21 07:22:01
【Question description】:

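With String.split it is easy to split a string by multiple separators. You only need to define a regular expression that matches every separator you want to use. For example,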

"1.22-3".split("[.-]")

yields an array containing the elements "1", "22", and "3". So far so good.
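For reference, the call above can be wrapped in a small runnable class. This is only a sketch; the class name SplitDemo and the helper splitOnDotOrDash are made up for illustration. It shows that the separators themselves are discarded, which is exactly the limitation the question is about.

```java
import java.util.Arrays;

public class SplitDemo {

    // Splits on '.' or '-'. Inside a character class both are literals,
    // so no escaping is needed. The matched separators are thrown away.
    static String[] splitOnDotOrDash(String input) {
        return input.split("[.-]");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitOnDotOrDash("1.22-3"))); // prints [1, 22, 3]
    }
}
```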

But now I also need to know which separator was found between the segments. Is there a straightforward way to do this?

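I looked at String.split, its discouraged legacy predecessor StringTokenizer, and other supposedly more modern libraries (for example StrTokenizer from Apache Commons), but none of them gives me access to the matched separator.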

【Question discussion】:

Tags: java regex algorithm string-parsing string-split

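【Solution 1】:

I think I was looking for the wrong algorithm for what I wanted to achieve. Instead of splitting by separators, the following two-step approach worked better: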

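First, I implemented a lexer (aka tokenizer, scanner) that splits the string into tokens which include the separators, i.e. it splits 1.22-3 into 1, ., 22, -, 3.
Then, I implemented a parser that interprets this token stream, i.e. distinguishes the segments from their separators.

A possible implementation of the lexer: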

import java.util.ArrayList;
import java.util.List;

public final class FixedStringTokenScanner {

    /**
     * Splits the given input into tokens. Each token is either one of the given constant string
     * tokens or a string consisting of the other characters between the constant tokens.
     *
     * @param input
     *            The string to split.
     * @param fixedStringTokens
     *            A list of strings to be recognized as separate tokens.
     * @return A list of strings, which when concatenated would result in the input string.
     *         Occurrences of the fixed string tokens in the input string are returned as separate
     *         list entries. These entries are reference-equal to the respective fixedStringTokens
     *         entry. Characters which did not match any of the fixed string tokens are concatenated
     *         and returned as list entries at the respective positions in the list. The list does
     *         not contain empty or <code>null</code> entries.
     */
    public static List<String> splitToFixedStringTokensAndOtherTokens(final String input, final String... fixedStringTokens) {
        return new FixedStringTokenScannerRun(input, fixedStringTokens).splitToFixedStringAndOtherTokens();
    }

    private static class FixedStringTokenScannerRun {

        private final String input;
        private final String[] fixedStringTokens;

        private int scanIx = 0;
        private final StringBuilder otherContent = new StringBuilder();
        private final List<String> result = new ArrayList<String>();

        public FixedStringTokenScannerRun(final String input, final String[] fixedStringTokens) {
            this.input = input;
            this.fixedStringTokens = fixedStringTokens;
        }

        List<String> splitToFixedStringAndOtherTokens() {
            while (scanIx < input.length()) {
                scanIx += matchFixedStringOrAppendToOther();
            }
            storeOtherTokenIfNotEmpty();
            return result;
        }

        /**
         * @return the number of matched characters.
         */
        private int matchFixedStringOrAppendToOther() {
            for (String fixedString : fixedStringTokens) {
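                // Skip empty tokens: they would match everywhere with length 0 and stall the scan.
                if (!fixedString.isEmpty() && input.regionMatches(scanIx, fixedString, 0, fixedString.length())) {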
                    storeOtherTokenIfNotEmpty();
                    result.add(fixedString); // add string instance so that identity comparison works
                    return fixedString.length();
                }
            }
            appendCharacterToOther();
            return 1;
        }

        private void appendCharacterToOther() {
            otherContent.append(input.charAt(scanIx));
        }

        private void storeOtherTokenIfNotEmpty() {
            if (otherContent.length() > 0) {
                result.add(otherContent.toString());
                otherContent.setLength(0);
            }
        }
    }
}
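The answer only shows the lexer. The second step, the parser, might look roughly like the sketch below. It is not the author's code: the class TokenStreamParserSketch, the holder type Parsed, and the method parse are hypothetical names, and the token list is passed in directly so that the example runs on its own. It assumes that two non-separator tokens never appear next to each other, which holds for the lexer's output because it concatenates all other characters into one token. Separators are recognized with equals here; with the lexer above, an identity comparison would work as well.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TokenStreamParserSketch {

    /** Hypothetical result holder: segments plus the separators found between them. */
    public static final class Parsed {
        public final List<String> segments = new ArrayList<>();
        public final List<String> separators = new ArrayList<>();
    }

    /**
     * Interprets a token stream such as [1, ., 22, -, 3].
     * Keeps the invariant segments.size() == separators.size() + 1 by
     * inserting an empty segment wherever a separator has no segment before it
     * (leading, adjacent, or trailing separators).
     */
    public static Parsed parse(List<String> tokens, List<String> separatorTokens) {
        Parsed parsed = new Parsed();
        boolean segmentMissing = true; // true while the current slot has no segment yet
        for (String token : tokens) {
            if (separatorTokens.contains(token)) {
                if (segmentMissing) {
                    parsed.segments.add("");
                }
                parsed.separators.add(token);
                segmentMissing = true;
            } else {
                parsed.segments.add(token);
                segmentMissing = false;
            }
        }
        if (segmentMissing) {
            parsed.segments.add(""); // empty input or trailing separator
        }
        return parsed;
    }

    public static void main(String[] args) {
        Parsed p = parse(Arrays.asList("1", ".", "22", "-", "3"), Arrays.asList(".", "-"));
        System.out.println(p.segments);   // prints [1, 22, 3]
        System.out.println(p.separators); // prints [., -]
    }
}
```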

【Discussion】:

Answering your own question? And doing so as if it were somebody else's?
blog.*.com/2011/07/…
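I spent a long time on this problem, and only after finding a solution did I decide it might be worth asking the question here.
You could write it in the first person.
@vks: If it makes you happy, I can rewrite it to make clear that I asked the question myself. The community recommends my original approach, though: meta.stackexchange.com/a/137369/191131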

【Solution 2】:

It is quite simple if you retrace what String.split(regex) does and record the information that String.split ignores:

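import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String source = "1.22-3";
Matcher m = Pattern.compile("[.-]").matcher(source);
ArrayList<String> elements = new ArrayList<>();
ArrayList<String> separators = new ArrayList<>();
int pos;
// Each match is one separator; the text between the previous match and this one is an element.
for (pos = 0; m.find(); pos = m.end()) {
    elements.add(source.substring(pos, m.start()));
    separators.add(m.group());
}
// Whatever follows the last separator is the final element (possibly empty).
elements.add(source.substring(pos));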

At the end of this code, separators.get(x) is the separator found between elements.get(x) and elements.get(x+1). It follows that separators has exactly one item fewer than elements.

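If you want the elements and the separators in a single list, just change the code so that both lists are the same list. The items are already added in order of occurrence.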
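The single-list variant described here could look like the following sketch. The class and method name SplitKeepingSeparators / splitKeepingSeparators are made up for illustration, and the code assumes the regular expression never matches the empty string. Elements end up at even indices and separators at odd indices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SplitKeepingSeparators {

    // Returns elements and separators interleaved in order of occurrence.
    // Assumption: separatorRegex never matches an empty string.
    static List<String> splitKeepingSeparators(String source, String separatorRegex) {
        Matcher m = Pattern.compile(separatorRegex).matcher(source);
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (m.find()) {
            tokens.add(source.substring(pos, m.start())); // element before the separator
            tokens.add(m.group());                        // the separator that matched
            pos = m.end();
        }
        tokens.add(source.substring(pos)); // remainder after the last separator
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(splitKeepingSeparators("1.22-3", "[.-]")); // prints [1, ., 22, -, 3]
    }
}
```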

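【Discussion】:

Nice! One small note: elements in this code differs slightly from the result of String.split, because it may end with an empty string element. This code does not ignore a trailing separator. But IMHO that's fine.
That is easy to fix, but for most applications it is more useful to keep the invariant that x+1 elements have exactly x separators. It would be different if you collected both into a single list; then allowing the list to end with a separator would be fine. It seems the requirement to keep the separators implicitly forbids ignoring a trailing separator, since otherwise the result would be inconsistent.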
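The trailing-separator difference discussed in these comments can be checked with a short sketch. The class TrailingSeparatorDemo and its elements method are hypothetical names; the loop is the one from the answer, reduced to collecting the elements. For comparison it also calls String.split with a negative limit, which keeps trailing empty strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrailingSeparatorDemo {

    // Same loop as in the answer, returning only the elements.
    static List<String> elements(String source, String separatorRegex) {
        Matcher m = Pattern.compile(separatorRegex).matcher(source);
        List<String> elements = new ArrayList<>();
        int pos = 0;
        while (m.find()) {
            elements.add(source.substring(pos, m.start()));
            pos = m.end();
        }
        elements.add(source.substring(pos)); // empty string if the input ends with a separator
        return elements;
    }

    public static void main(String[] args) {
        String source = "1.22-"; // note the trailing separator
        System.out.println(Arrays.asList(source.split("[.-]")).size());     // prints 2
        System.out.println(Arrays.asList(source.split("[.-]", -1)).size()); // prints 3
        System.out.println(elements(source, "[.-]").size());                // prints 3
    }
}
```

With the trailing empty element kept, the two separators still sit between three elements, so the x+1 elements / x separators invariant holds.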