Split a string on \b's but not on the \b's inside a substring: answers

【Question title】: Split String on \b's but not on \b's between a substring
【Posted】: 2015-08-29 18:32:27
【Question】:

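How can I split a string into words while keeping certain phrases/terms intact? Right now I have String[] strarr = str.split("\\b");, but I would like to modify the regex argument to accomplish the above. The solution does not have to involve regex.

For example, if str equals "The city of San Francisco is truly beautiful!" and the term is "San Francisco", how can I split str so that the resulting String[] array looks like this: ["The", "city", "of", "San Francisco", "is", "truly", "beautiful!"]?

After seeing @Radiodef's comment, I decided that I don't need regex as such. If anyone can help me solve this, the help is still very much appreciated!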
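
To make the starting point concrete, here is a minimal sketch of what the split("\\b") call from the question produces for the example sentence. The class and method names (WordBoundarySplitDemo, splitOnWordBoundaries) are made up for illustration and are not part of the original post. It assumes Java 8 or later, where a zero-width match at the start of the input does not produce a leading empty string.

```java
import java.util.Arrays;

// Hypothetical demo class, not part of the original question.
final class WordBoundarySplitDemo {

    // Splits on every word boundary, exactly like the call in the question.
    static String[] splitOnWordBoundaries(String str) {
        return str.split("\\b");
    }

    public static void main(String[] args) {
        String str = "The city of San Francisco is truly beautiful!";
        // Every boundary is a cut, so the spaces come back as their own
        // elements and "San" and "Francisco" end up in separate elements.
        System.out.println(Arrays.toString(splitOnWordBoundaries(str)));
    }
}
```

Because \b matches on both sides of every word, the phrase cannot survive this split; the answers below either split on whitespace and merge the words afterwards, or match tokens instead of splitting.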

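【Question comments】:

You can't do this accurately with a regex... a regex matches character patterns, not place names. That's what libraries are for.
@Radiodef I agree that regex isn't the right way to do this, but I posted a regex answer anyway :)
Regex has a number of performance problems in Java, as described at eyalsch.wordpress.com/2009/05/21/regex. With my answer you can even find 3-word or n-word phrases. Just saying ;)

Tags: java string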

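【Solution 1】:

Well, this is a very interesting problem. My approach is to write a generic method that detects a phrase of any number of words and returns a simple string array.

Here is a demo

Here is the method: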

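 // requires: import java.util.ArrayList;
 // m = words of the main string, c = words of the phrase,
 // catchStr = the phrase itself, added to the result as a single element
 String[] find(String m[], String c[], String catchStr){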

    String comp = c[0];
    ArrayList<String> list = new ArrayList<String>();
    for(int i=0;i<m.length;i++){

        boolean flag = false;

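        //check that the phrase starts here and still fits in the remaining words
        //(without the length check, m[i+j] below can run past the end of the array)
        if(comp.equals(m[i]) && i + c.length <= m.length){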
            flag = true;
            for(int j=0;j<c.length;j++){
                //you can use equalsIgnoreCase() if you want to compare the string 
                //ignoring the case
                if(!m[i+j].equals(c[j])){
                    flag = false;
                    break;
                }
            }

        }

        if(flag){
            list.add(catchStr);
            i = i + c.length-1;
        }else{
            list.add(m[i]);
        }

    }

    //converting result into String array
    String finalArr[] = list.toArray(new String[list.size()]);

    return finalArr;

}

You can call this function like this:

String mainStr = "The city of San Francisco is truly beautiful!";
String catchStr = "San Francisco";
String mainStrArr[] = mainStr.split(" ");
String catchStrArr[] = catchStr.split(" ");

String finalArr[] = find(mainStrArr, catchStrArr, catchStr);
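
The same idea extends to several phrases at once. The following is only a sketch under stated assumptions, not the answer author's code: the class PhraseTokenizer and its methods are hypothetical names, words are assumed to be separated by single spaces, and a phrase only matches when each of its words equals a word of the text exactly (so punctuation attached to a word prevents a match).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, sketched for illustration only.
final class PhraseTokenizer {

    // Splits text on single spaces, but keeps every phrase in `phrases`
    // as one element of the result.
    static List<String> tokenize(String text, List<String> phrases) {
        String[] words = text.split(" ");
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            String matched = null;
            int matchedLen = 0;
            for (String phrase : phrases) {
                String[] parts = phrase.split(" ");
                // Prefer the longest phrase that starts at position i.
                if (parts.length > matchedLen && startsAt(words, i, parts)) {
                    matched = phrase;
                    matchedLen = parts.length;
                }
            }
            if (matched != null) {
                result.add(matched);
                i += matchedLen; // skip the words consumed by the phrase
            } else {
                result.add(words[i]);
                i++;
            }
        }
        return result;
    }

    // True if the words starting at index i equal all of parts (bounds-checked).
    private static boolean startsAt(String[] words, int i, String[] parts) {
        if (i + parts.length > words.length) {
            return false;
        }
        for (int j = 0; j < parts.length; j++) {
            if (!words[i + j].equals(parts[j])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("San Francisco", "New York City");
        System.out.println(tokenize("The city of San Francisco is truly beautiful!", phrases));
        System.out.println(tokenize("I love New York City", phrases));
    }
}
```

Picking the longest phrase at each position means that "New York City" is preferred over a shorter phrase such as "New York" when both are listed.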

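【Discussion】:

@javaislife I prefer this solution because it is more general: unlike the regex given by Evgeniy Dorofeev, it works for any string containing any number of phrases

【Solution 2】:

If San Francisco is the only exception, then this works: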

    String[] a = str.split("(?<!San)\\s+(?!Francisco)");

The shortest solution I could find for multiple exclusions is this:

    // requires: import java.util.*; import java.util.regex.*;
    String str = "The city of San Francisco is truly beautiful!";
    String[] exclusions = { "San Francisco", "Los Angeles" };
    List<String> l = new ArrayList<>();
    Matcher m = Pattern.compile("\\w+").matcher(str);
    while (m.find()) {
        l.add(m.group());
        for (String ex : exclusions) {
            if (str.regionMatches(m.start(), ex, 0, ex.length())) {
                l.set(l.size() - 1, ex);
                m.find(); // skip the second word of the exclusion; only two-word exclusions are handled
                break;
            }
        }
    }
    System.out.println(l);
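
One way to lift the two-word limit is to match tokens with a single alternation instead of patching the list afterwards. This is a sketch of a different technique from the loop above, not a drop-in fix for it: ExclusionMatcher and tokenize are hypothetical names, a token is assumed to be any run of non-whitespace (so "beautiful!" keeps its punctuation), and an exclusion only counts when it is followed by whitespace or the end of the input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, sketched for illustration only.
final class ExclusionMatcher {

    static List<String> tokenize(String str, List<String> exclusions) {
        // Try longer exclusions first so "New York City" wins over "New York".
        List<String> sorted = new ArrayList<>(exclusions);
        sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));

        StringBuilder regex = new StringBuilder();
        for (String ex : sorted) {
            if (ex.isEmpty()) {
                continue; // an empty exclusion would match everywhere
            }
            // Pattern.quote makes the phrase a literal; (?!\S) requires the
            // phrase to be followed by whitespace or the end of the input.
            regex.append(Pattern.quote(ex)).append("(?!\\S)|");
        }
        regex.append("\\S+"); // fallback: any run of non-whitespace

        List<String> tokens = new ArrayList<>();
        Matcher m = Pattern.compile(regex.toString()).matcher(str);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> exclusions = Arrays.asList("San Francisco", "Los Angeles", "New York City");
        System.out.println(tokenize("The city of San Francisco is truly beautiful!", exclusions));
        System.out.println(tokenize("I love New York City", exclusions));
    }
}
```

Because each match resumes where the previous one ended, a phrase can only start at the beginning of a token, and \S+ is not restricted to English letters.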

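【Discussion】:

This is a better approach. However, your regex doesn't work when you include Los Angeles
@EvgeniyDorofeev I tried to solve this and, in case you find it useful, came up with \s(?=[a-z]+), although it has a bug at "of"
Thanks for the reply! The multiple-exclusion solution doesn't seem to work as well as @Saumil Soni's, but that makes it no less clever. I am very uncomfortable with regex, so I would like to know how to modify the first solution so that it works for three-word phrases (i.e. "New York City").
The second solution only works for English letters and two-word exclusions

【Solution 3】:

Find the substring to exclude and temporarily remove the spaces from it. Once the whole string has been split, find the edited substring and replace it with the original to restore its spaces.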

    // requires: import java.util.Arrays; import java.util.Collections;
    // let's say:
    // whole = "The city of San Francisco is truly beautiful!",
    // token = "San Francisco"

    public static String[] excludeString(String whole, String token) {

        // replaces token string "San Francisco" with "SanFrancisco"
        // (replace() treats token literally; replaceAll() would parse it as a regex)
        whole = whole.replace(token, token.replaceAll("\\s+", ""));

        // splits whole string using space as delimiter, place tokens in a string array
        String[] strarr = whole.split("\\s+");

        // brings "SanFrancisco" back to "San Francisco" in strarr
        Collections.replaceAll(Arrays.asList(strarr), token.replaceAll("\\s+", ""), token);

        // returns the array of strings
        return strarr;
    }

Example usage:

    public static void main(String[] args) {

        String[] arr = excludeString("The city of San Francisco is truly beautiful!", "San Francisco");
        System.out.println(Arrays.asList(arr));

    }

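Suppose your string is: "The city of San Francisco is truly beautiful!"

The result will be: [The, city, of, San Francisco, is, truly, beautiful!]

【Discussion】:

【Solution 4】:

I know the posted answers are better, but since I struggled a bit with this, I wanted to share a regex answer as well.

So, one possible regex approach that achieves this with a capturing group is to use this regex: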

([A-Z][a-z]*(?:\s?[A-Z][a-z]+)*|[a-z!]+)

Working demo

Match information

MATCH 1
1.  [0-3]   `The`
MATCH 2
1.  [4-8]   `city`
MATCH 3
1.  [9-11]  `of`
MATCH 4
1.  [12-25] `San Francisco`
MATCH 5
1.  [26-28] `is`
MATCH 6
1.  [29-34] `truly`
MATCH 7
1.  [35-45] `beautiful!`

Java code

String line = "The city of San Francisco is truly beautiful!";
Pattern pattern = Pattern.compile("([A-Z][a-z]*(?:\\s?[A-Z][a-z]+)*|[a-z!]+)");
Matcher matcher = pattern.matcher(line);

while (matcher.find()) {
    System.out.println("Result: " + matcher.group(1));
}
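
Note that this regex does not know about any particular phrase: it groups every run of capitalized words. The sketch below collects the matches into a list and shows that side effect on a second sentence. CapitalizedRunDemo and tokenize are hypothetical names used only for illustration, and the second sentence is made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, sketched for illustration only.
final class CapitalizedRunDemo {

    // Same regex as in the answer above.
    private static final Pattern TOKEN =
            Pattern.compile("([A-Z][a-z]*(?:\\s?[A-Z][a-z]+)*|[a-z!]+)");

    // Collects every match of the regex into a list.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(line);
        while (matcher.find()) {
            tokens.add(matcher.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // [The, city, of, San Francisco, is, truly, beautiful!]
        System.out.println(tokenize("The city of San Francisco is truly beautiful!"));
        // The capitalized first word is merged into the phrase:
        // [Visit San Francisco, today]
        System.out.println(tokenize("Visit San Francisco today"));
    }
}
```

So this approach fits inputs where capitalization reliably marks the phrases; for a fixed list of terms, matching the terms explicitly is more predictable.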

【Discussion】:

I had the same idea and came up with a very similar regex: [A-Z]\\S+((\\s+[A-Z]\\S+)+)?|\\S+