Replace repeating strings in a string: answers

【Question title】: Replace repeating strings in a string
【Posted】: 2011-10-10 13:11:52
【Question】:

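I am trying to find (and replace) repeating strings within a string.

My string might look like this:

Lorem ipsum dolor sit amet sit amet sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.

This should become:

Lorem ipsum dolor sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.

Note how amit sit is not removed, because it is not repeated.

Or the string could look like this:

Lorem ipsum dolor sit amet () sit amet () sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip aliquip ex ea commodo consequat.

which should become:

Lorem ipsum dolor sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

So it is not just a-z; other (ASCII) characters can appear as well. I would be glad if someone could help me with this.

The next step would be to match (and replace) something like this:

2 questions 3 questions 4 questions 5 questions

which would become:

2 questions

The number in the final output can be any of them (2, 3, 4); it does not matter. In this last example only the numbers differ; the words are the same.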

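【Question discussion】:

Why isn't the second sit in the first paragraph removed? It is still a repetition of the first sit. How do we determine the word boundaries properly?
Because it is not repeated directly. So there is no repetition in one two one, but there is in one one two. Does that answer your question?
Is this only for words? Then define what a word is, because () clearly is not one. To quote tandu above: "How do we determine the word boundaries properly?" What results do you expect from these examples: foo foo., foo foobar, foo foo-foo, #¤% #¤% #¤%, #¤%#¤%#¤%.
After that many drinks, it had not occurred to me that the regex might not be so simple...
I think your first two examples are wrong; the string to be reduced is not "... sit [amet sit] [amet sit] [amet sit] ..." but "... [sit amet] [sit amet] [sit amet] sit ...". So the repeated string is sit amet, not amet sit. (The resulting reduction looks the same, but the logic is different.)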

Tags: php regex perl string-search

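【Solution 1】:

If it helps, \1, \2, etc. refer back to previously captured groups. So, for example, the following picks out repeated words and leaves a single occurrence of each: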

$string =~ s/(\w+) ( \1)+/$1/g

Repeated phrases can be handled in a similar way.
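Since the question is tagged php, here is a minimal PHP sketch of the same backreference idea. It assumes words are separated by single spaces; the function name collapse_repeated_words and the \b anchors are additions made for this illustration and are not part of the answer.

```php
<?php
// Collapse a word that is directly repeated one or more times into one occurrence.
// Assumption: words are separated by exactly one space.
// The \b anchors (added here) stop a word from matching only part of its neighbour.
function collapse_repeated_words(string $text): string
{
    return preg_replace('/\b(\w+)(?: \1\b)+/', '$1', $text);
}

echo collapse_repeated_words('ut aliquip aliquip ex ea'), "\n"; // ut aliquip ex ea
```

Because \w only covers letters, digits and underscore, tokens such as () are not handled by this sketch.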

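【Discussion】:

Two small points: \w only matches alphanumerics and the underscore, so you should use .* instead. And you have an extra space between the two captures.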

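【Solution 2】:

((?:\b|^)[\x20-\x7E]+)(\1)+ will match any repeated string of printable ASCII characters that starts at a word boundary. That means it will match hello hello but not the double l in hello.

If you want to adjust which characters are matched, you can change and add ranges of the form \x##-\x##\x##-\x## (where ## is a hex value), and omit the -\x## part when you only want to add a single character.

The only problem I can see is that this fairly simple approach will also pick out legitimately repeated words, not just repeated phrases. If you want to force it to pick only repeated phrases made up of several words, you can use ((?:\b|^)[\x20-\x7E]+\s)(\1)+ (note the extra \s).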
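To see the variant with the extra \s at work in PHP, here is a minimal sketch that hands the pattern to preg_replace(). The function name collapse_phrases is hypothetical, and the second group is written as a plain \1+ because only $1 is kept in the replacement.

```php
<?php
// Sketch: collapse directly repeated phrases with the pattern that ends in \s.
// Every copy of the phrase has to end in whitespace, so a doubled word such as
// 'aliquip aliquip' at the very end of a string (no trailing space) is left as is.
function collapse_phrases(string $text): string
{
    return preg_replace('/((?:\b|^)[\x20-\x7E]+\s)\1+/', '$1', $text);
}

echo collapse_phrases('dolor sit amet sit amet sit amet sit nostrud'), "\n";
// dolor sit amet sit nostrud
```

The character class is greedy, so the engine tries the longest candidate phrase first and backtracks until it finds one that is followed by a copy of itself.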

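((?:\b|^)[\x20-\x7E]+\s)(.*(\1))+ comes close to solving your second problem, but I may have thought myself into a corner on that one.

Edit: to clarify, in Perl you can apply it with $string =~ s/((?:\b|^)[\x20-\x7E]+\s)(.*(\1))+/$1/ig; in PHP, pass the same pattern to preg_replace().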

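【Discussion】:

This seems to work fine and is probably good enough for my needs. I made one change: /(((?:\b|^)[\x20-\x7E]+)\s)(\1){2,}/ so that it only replaces when there are more than 2 repeated strings. It still has some problems, though.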

【Solution 3】:

Good old brute force...

It is so ugly that I was tempted to post it as eval(base64_decode(...)), but here it is:

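// Split the string on spaces, collapse repeated word groups, and join it again.
function fixi($str) {
    $a = explode(" ", $str);
    return implode(' ', fix($a));
}

// Find the longest run of directly repeated word groups, keep one copy of the
// group, and recurse on the words to the left and right of that run.
function fix($a) {
    $l = count($a);
    $len = 0; // length in words of the longest repeated run found so far
    // $i is the group size in words, $j the start index of the first copy.
    for($i=1; $i <= $l/2; $i++) {
        for($j=0; $j <= $l - 2*$i; $j++) {
            $n = 1; // number of consecutive copies counted so far
            $found = false;
            while(1) {
                $a1 = array_slice($a, $j, $i);
                $a2 = array_slice($a, $j+$n*$i, $i);
                if ($a1 != $a2)
                    break;
                $found = true;
                $n++;
            }
            if ($found && $n*$i > $len) {
                $len = $n*$i;
                $f_j = $j;
                $f_i = $i;
            }
        }
    }
    if ($len) {
        // Keep the words before the run, one copy of the group, and the words after it.
        return array_merge(
            fix(array_slice($a, 0, $f_j)),
            array_slice($a, $f_j, $f_i),
            fix(array_slice($a, $f_j+$len, $l))
        );
    }
    return $a;
}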

Punctuation is treated as part of the word, so don't expect miracles.
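A small sketch of what that caveat means in practice: splitting on spaces leaves punctuation attached to the word before it, so foo and foo. are different tokens to a word-by-word comparison. The helper name split_words is hypothetical and only wraps explode() for this illustration.

```php
<?php
// Sketch of the caveat: explode() on spaces keeps punctuation glued to a word,
// so 'foo' and 'foo.' do not compare as equal tokens.
function split_words(string $text): array
{
    return explode(' ', $text);
}

$words = split_words('foo foo. bar');
var_dump($words[0] === $words[1]); // bool(false)
```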

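【Discussion】:

While this seems to work quite well, it is slow. A string of length 3500 takes about 18 seconds...
Well, what did you expect from a brute-force algorithm? :)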

【Solution 4】:

2 questions 3 questions 4 questions 5 questions

becoming

2 questions

can be done with:

$string =~ s/(\d+ (.*))( \d+ (\2))+/$1/g;

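It matches a number followed by anything (greedily), and then a series of items that each start with a space, followed by a number, followed by something that matches the earlier "anything". The whole match is replaced with the first number and the text that follows it.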
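The same substitution can be sketched in PHP with preg_replace(). The function name collapse_numbered is hypothetical, and the inner capture around \2 is dropped because it is not needed for the replacement.

```php
<?php
// Sketch: keep the first "<number> <text>" pair and drop the pairs after it
// that repeat the same text with any other number.
function collapse_numbered(string $text): string
{
    return preg_replace('/(\d+ (.*))( \d+ \2)+/', '$1', $text);
}

echo collapse_numbered('2 questions 3 questions 4 questions 5 questions'), "\n";
// 2 questions
```

The greedy .* first grabs the rest of the line and then backtracks until the text after it is a space, a number, a space and a copy of what it captured.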

【Discussion】:

【Solution 5】:

Code for the first task:

<?php

    function split_repeating($string)
    {
        $words = explode(' ', $string);
        $words_count = count($words);

        $need_remove = array();
        for ($i = 0; $i < $words_count; $i++) {
            $need_remove[$i] = false;
        }

        // Try every phrase length $i, longest first, and check each start position $j.
        // The upper bound on $j keeps the second copy ($j + $i .. $j + 2 * $i - 1) inside the array.
        for ($i = intdiv($words_count, 2); $i >= 1; $i--) {
            for ($j = 0; $j <= ($words_count - 2 * $i); $j++) {
                $need_remove_item = !$need_remove[$j];
                for ($k = $j; $k < ($j + $i); $k++) {
                    if ($words[$k] != $words[$k + $i]) {
                        $need_remove_item = false;
                        break;
                    }
                }
                if ($need_remove_item) {
                    for ($k = $j; $k < ($j + $i); $k++) {
                        $need_remove[$k] = true;
                    }
                }
            }
        }

        $result_string = '';
        for ($i = 0; $i < $words_count; $i++) {
            if (!$need_remove[$i]) {
                $result_string .= ' ' . $words[$i];
            }
        }
        return trim($result_string);
    }



    $string = 'Lorem ipsum dolor sit amet sit amet sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.';

    echo $string . '<br>';
    echo split_repeating($string) . '<br>';
    echo 'Lorem ipsum dolor sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.' . '<br>' . '<br>';



    $string = 'Lorem ipsum dolor sit amet () sit amet () sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip aliquip ex ea commodo consequat.';

    echo $string . '<br>';
    echo split_repeating($string) . '<br>';
    echo 'Lorem ipsum dolor sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.';

?>

Code for the second task:

<?php

    function split_repeating($string)
    {
        $words = explode(' ', $string);
        $words_count = count($words);

        $need_remove = array();
        for ($i = 0; $i < $words_count; $i++) {
            $need_remove[$i] = false;
        }

        for ($j = 0; $j < ($words_count - 1); $j++) {
            $need_remove_item = !$need_remove[$j];
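            // Compare every second word (the non-numeric part); ?? null avoids reading past the end of the array.
            for ($k = $j + 1; $k < ($words_count - 1); $k += 2) {
                if ($words[$k] != ($words[$k + 2] ?? null)) {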
                    $need_remove_item = false;
                    break;
                }
            }
            if ($need_remove_item) {
                for ($k = $j + 2; $k < $words_count; $k++) {
                    $need_remove[$k] = true;
                }
            }
        }

        $result_string = '';
        for ($i = 0; $i < $words_count; $i++) {
            if (!$need_remove[$i]) {
                $result_string .= ' ' . $words[$i];
            }
        }
        return trim($result_string);
    }



    $string = '2 questions 3 questions 4 questions 5 questions';

    echo $string . '<br>';
    echo split_repeating($string) . '<br>';
    echo '2 questions';

?>

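【Discussion】:

【Solution 6】:

Interesting problem. This can be solved with a single preg_replace() statement, but the length of the repeated phrase has to be limited to avoid excessive backtracking. Here is a solution with a commented regex that works on the test data and fixes doubled, tripled (or n times repeated) phrases of up to 50 characters:

Solution for part 1: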

$result = preg_replace('/
    # Match a doubled "phrase" having length up to 50 chars.
    (            # $1: Phrase having whitespace boundaries.
      (?<=\s|^)  # Assert phrase preceded by ws or BOL.
      \S         # First char of phrase is non-whitespace.
      .{0,49}?   # Lazily match phrase (50 chars max).
    )            # End $1: Phrase
    (?:          # Group for one or more duplicate phrases.
      \s+        # Doubled phrase separated by whitespace.
      \1         # Match duplicate of phrase.
    ){1,}        # Require one or more duplicate phrases.
    /x', '$1', $text);

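Note that with this solution a "phrase" can consist of a single word, and there are legitimate cases where a doubled word is valid grammar and should not be fixed. If that is not the desired behavior, the regex can easily be modified to define a "phrase" as two or more "words".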
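One way such a two-or-more-words variant might look, as a sketch rather than the answer's own regex. The function name collapse_multiword_phrases, the cap of six words per phrase (used here in place of the 50 character limit) and the trailing lookahead are all assumptions made for this illustration.

```php
<?php
// Sketch: only collapse repeated phrases that consist of at least two words,
// so a doubled single word such as 'aliquip aliquip' is left untouched.
// Assumptions: a phrase has 2 to 6 whitespace-separated words, and the last
// copy must be followed by whitespace or the end of the string.
function collapse_multiword_phrases(string $text): string
{
    $pattern = '/(?<=\s|^)(\S+(?:\s+\S+){1,5}?)(?:\s+\1)+(?=\s|$)/';
    return preg_replace($pattern, '$1', $text);
}

echo collapse_multiword_phrases('dolor sit amet sit amet sit amet sit nostrud'), "\n";
// dolor sit amet sit nostrud
```

The lazy {1,5}? quantifier tries the shortest multi-word phrase first, which mirrors the lazy .{0,49}? in the regex above.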

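Edit: modified the regex above to handle any number of phrase repetitions. Also added a solution for the second part of the question below.

Here is a similar solution in which the phrase starts with a word made of digits, and each repeated phrase must also start with a digits word (although the digits word of a repeated phrase does not have to match the original one):

Solution for part 2: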

$result = preg_replace('/
    # Match doubled "phrases" with wildcard digits first word.
    (            # $1: 1st word of phrase (digits).
    \b           # Anchor 1st phrase word to word boundary.
    \d+          # Phrase 1st word is string of digits.
    \s+          # 1st and 2nd words separated by whitespace.
    )            # End $1:  1st word of phrase (digits).
    (            # $2: Part of phrase after 1st digits word.
      \S         # First char of phrase is non-whitespace.
      .{0,49}?   # Lazily match phrase (50 chars max).
    )            # End $2: Part of phrase after 1st digits word.
    (?:          # Group for one or more duplicate phrases.
      \s+        # Duplicate separated from previous phrase by whitespace.
      \d+        # First word of duplicate: any string of digits.
      \s+        # Digits word and rest of duplicate separated by whitespace.
      \2         # Match duplicate of the rest of the phrase ($2).
    ){1,}        # Require one or more duplicate phrases.
    /x', '$1$2', $text);

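【Discussion】:

【Solution 7】:

Many thanks to everyone who answered this question. It helped me a lot! I tried the Ridgerunners and dtanders regexes, and although they worked on some test strings (after some modification), I ran into trouble with other strings.

So I went with a brute-force approach inspired by Nox :). That way I can combine both problems and still get good performance (even better than the regex, since that is slow in PHP).

For anyone interested, here is the concept code: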

function split_repeating_num($string) {
$words = explode(' ', $string);
$all_words = $words;
$num_words = count($words);
$max_length = 100; //max length of substring to check
$max_words = 4; //maximum number of words in substring 
$found = array();
$current_pos = 0;
$unset = array();
foreach ($words as $key=>$word) {
    //see if this word exist in the next part of the string
    $len = strlen($word);
    if ($len === 0) continue;
    $current_pos += $len + 1; //+1 for the space
    $substr = substr($string, $current_pos, $max_length);
    if (($pos = strpos(substr($string, $current_pos, $max_length), $word)) !== false) {
        //found it
        //set pointer words and all_words to same value
        while (key($all_words) < $key ) next($all_words);
        while (key($all_words) > $key ) prev($all_words);
        $next_word = next($all_words);

        while (is_numeric($next_word) || $next_word === '') {
            $next_word = next($all_words);
        }
        // see if it follows the word directly
        if ($word === $next_word) {
            $unset [$key] = 1;
        } elseif ($key + 3 < $num_words) {
            for($i = $max_words; $i > 0; $i --) {
                $x = 0;
                $string_a = '';
                $string_b = '';
                while ($x < $i ) {
                    while (is_numeric($next_word) || $next_word === '' ) {
                        $next_word = next($all_words); // each() returned a key/value array and was removed in PHP 8.0
                    }
                    $x ++;
                    $string_a .= $next_word;
                    $string_b .= $words [key($all_words) + $i];
                }

                if ($string_a === $string_b) {
                    //we have a match
                    for($x = $key; $x < $i + $key; $x ++)
                        $unset [$x] = 1;
                }
            }
        }
    }

}
foreach ($unset as $k=>$v) {
    unset($words [$k]);
}
return implode(' ', $words);

}

There are still some small issues and I do need to test it more, but it seems to do its job.

【Discussion】: