Java regex to remove duplicate substrings from a string: answers

【Question title】: Java regex to remove duplicate substrings from string
【Posted】: 2016-07-31 11:42:06
【Question】:

I'm trying to build a regular expression to "reduce" repeated consecutive substrings in a string in Java. For example, for the following input:

The big black dog big black dog is a friendly friendly dog who lives nearby nearby.

I'd like to get the following output:

The big black dog is a friendly dog who lives nearby.

This is my code so far:

String input = "The big black dog big black dog is a friendly friendly dog who lives nearby nearby.";

Pattern dupPattern = Pattern.compile("((\\b\\w+\\b\\s)+)\\1+", Pattern.CASE_INSENSITIVE);
Matcher matcher = dupPattern.matcher(input);

while (matcher.find()) {
    input = input.replace(matcher.group(), matcher.group(1));
}

This works for every repeated substring except the one at the end of the sentence:

The big black dog is a friendly dog who lives nearby nearby.

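I understand that my regex requires whitespace after every word in the substring, which means it won't catch a repeat that is followed by a period instead of a space. I can't seem to find a workaround. I tried using capture groups and changing the regex to look for either whitespace or a period instead of just whitespace, but that only works when there is a period after each repeated part of the substring ("nearby.nearby.").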
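
The limitation can be reproduced in isolation. The following is a minimal sketch, assuming a hypothetical helper class `TrailingSpaceDemo` and using `Matcher.replaceAll` in place of the `find()` loop; the pattern itself is the one from the code above. It compares a duplicate followed by a space with a duplicate followed by a period.

```java
import java.util.regex.Pattern;

class TrailingSpaceDemo {
    // Pattern from the question: every word of the repeated phrase must be followed by whitespace.
    static final Pattern DUP =
            Pattern.compile("((\\b\\w+\\b\\s)+)\\1+", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: replace each run of repeated phrases with one copy (group 1).
    static String collapse(String text) {
        return DUP.matcher(text).replaceAll("$1");
    }

    public static void main(String[] args) {
        // The second "friendly" is followed by a space, so the repeat is collapsed.
        System.out.println(collapse("a friendly friendly dog"));
        // The second "nearby" is followed by a period, so the trailing \s cannot match
        // and the string is left unchanged.
        System.out.println(collapse("lives nearby nearby."));
    }
}
```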

Can anyone point me in the right direction? Ideally, the input to this method will be short paragraphs, not just single lines.

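【Comments】:

Do you have to use a regex, or are you interested in any solution that works?
I don't actually have to use a regex. I just thought a regex would make it easy to find repeated phrases rather than just repeated words. Any other solution would be welcome too!

Tags: java regex string duplicates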

【Answer 1】:

You can use

input.replaceAll("([ \\w]+)\\1", "$1");

See the live demo:

import java.util.regex.Pattern;

class Ideone
{
    public static void main (String[] args)
    {
        String input = "The big black dog big black dog is a friendly friendly dog who lives nearby nearby.";

        // Capture a run of word characters and spaces that is immediately followed by itself.
        Pattern dupPattern = Pattern.compile("([ \\w]+)\\1", Pattern.CASE_INSENSITIVE);

        // Matcher.replaceAll already replaces every match, so no find() loop is needed.
        input = dupPattern.matcher(input).replaceAll("$1");
        System.out.println(input);
    }
}

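【Comments】:

This doesn't work for the following input: "big black dog big black dog is a friendly friendly dog who lives nearby."
@Matt The OP said nothing about conflicting duplicates. Even if they had, the same regex can deduplicate that way: replace repeatedly until the string no longer has any matches.
Thanks Thomas, but there's a problem with word boundaries. For the following input: "This is my dog", I would get "This my dog", wouldn't I?
@ak_charlie Just replace the regex with \\b([ \\w]+)\\1
Thanks Thomas, I was just about to comment that I'd added the word boundary :)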
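
The word-boundary problem raised in these comments can be checked directly. This is a small sketch, assuming a hypothetical class `BoundaryDemo` with two helper methods; the two patterns are the ones quoted in the answer and in the comment. Without `\b`, the regex is free to start in the middle of a word, so the "is " inside "This " pairs up with the following "is ".

```java
class BoundaryDemo {
    // Pattern from the answer: the repeated run may start anywhere, even inside a word.
    static String withoutBoundary(String text) {
        return text.replaceAll("([ \\w]+)\\1", "$1");
    }

    // Pattern from the comment: the repeated run must start at a word boundary.
    static String withBoundary(String text) {
        return text.replaceAll("\\b([ \\w]+)\\1", "$1");
    }

    public static void main(String[] args) {
        String input = "This is my dog";
        // "is " at the end of "This " is followed by "is ", so one copy is dropped.
        System.out.println(withoutBoundary(input)); // This my dog
        // No run that starts at a word boundary is immediately repeated here.
        System.out.println(withBoundary(input));    // This is my dog
    }
}
```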

【Answer 2】:

Combining @Thomas Ayoub's answer with @Matt's comment.

public class Test2 {
    public static void main(String args[]){
        String input = "The big big black dog big black dog is a friendly friendly dog who lives nearby nearby.";
        String result = input.replaceAll("\\b([ \\w]+)\\1", "$1");
        while(!input.equals(result)){
            input = result;
            result = input.replaceAll("\\b([ \\w]+)\\1", "$1");
        }
        System.out.println(result);
    }
}

【Comments】:

Why introduce result?
@ThomasAyoub Well, maybe for better readability. What do you think?
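
One way to tidy the repeat-until-stable loop is to wrap it in a helper with a `do`/`while`, which still needs two variables to compare the text before and after a pass. The sketch below is not from the thread: the class name `PhraseDeduplicator` and the stricter pattern are assumptions made for illustration. The pattern treats a phrase as whole whitespace-separated words and requires the repeat to end on a word boundary, because a pattern with only a leading `\b` would turn "in the theater" into "in theater".

```java
import java.util.regex.Pattern;

class PhraseDeduplicator {
    // Assumed pattern (not from the thread): a phrase of one or more whole words,
    // followed one or more times by whitespace plus the same phrase ending on a word boundary.
    private static final Pattern REPEATED_PHRASE =
            Pattern.compile("\\b(\\w+(?:\\s+\\w+)*)(?:\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);

    // Collapse repeated phrases, re-running the replacement until a pass changes nothing.
    static String collapse(String text) {
        String previous;
        String current = text;
        do {
            previous = current;
            current = REPEATED_PHRASE.matcher(previous).replaceAll("$1");
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        String input = "The big big black dog big black dog is a friendly friendly dog who lives nearby nearby.";
        // The first pass collapses "big big"; the second collapses the repeated "big black dog".
        System.out.println(collapse(input));
    }
}
```

Because the repeat is matched before the period rather than after a required space, the duplicate at the end of the sentence is collapsed as well. Backreferences with nested repetition can be slow on long inputs, so this is intended for the short paragraphs described in the question.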