Find the begin and end index of the shortest substring in a string that matches another pattern: answers

【Question title】: find the begin and end index of shortest substring in a string that match another pattern
【Posted】: 2018-02-28 11:20:02
【Question】:

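Given two strings, text and pattern, find the start and end indices of the shortest substring of text that matches pattern. A substring matches when all characters of pattern appear in it in the same order as in pattern, although other characters may lie between them.

If such a substring can be found in text, print its start and end indices; otherwise print -1, -1. If there are several shortest matching substrings, return the indices of the one with the smallest start index.

Sample input: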

axxxbcaxbcaxxbc abc

abcd x

axxxbaxbab ab

Sample output:

Does anyone have a good algorithm for solving this without using the built-in regular-expression support in C++ or Python?

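【Comments】:

With string = 'xxxxxxxxxx' and pattern = 'x', is 0 1 a valid answer?
You can return the indices of the first substring found, so the output can be 0, 0.
Then why not just find the first occurrence of the pattern in the string, and add the length of pattern to get the end index?
Because the characters of pattern do not need to be adjacent in the substring that is found. Please read the problem statement carefully.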
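
To make the last comment concrete, here is a minimal Python sketch. The helper name is_subsequence is made up for illustration and does not come from the thread. It shows that a plain substring search finds nothing in the first sample input, even though the pattern does occur there as an in-order, non-adjacent sequence of characters.

```python
def is_subsequence(pattern, text):
    """Return True if the characters of pattern occur in text in order,
    possibly with other characters in between (hypothetical helper)."""
    remaining = iter(text)
    # `ch in remaining` consumes the iterator up to and including the first
    # occurrence of ch, so each later search resumes after that position.
    return all(ch in remaining for ch in pattern)


text, pattern = "axxxbcaxbcaxxbc", "abc"

# str.find looks for a contiguous occurrence, and "abc" never appears as one.
contiguous_index = text.find(pattern)

# The subsequence test succeeds, for example via the a, b, c at indices 6, 8, 9.
has_match = is_subsequence(pattern, text)
```

This only answers whether a match exists; finding the shortest matching window still needs a scan such as the ones in the answers.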

Tags: python c++ string algorithm

【Solution 1】:

Python

def shortest_match(text, pattern):
    # assumption: an empty pattern is treated as "no match", since there is
    # no first character to anchor the search on
    if not pattern:
        return (-1, -1)

    stack = [] # to store matches

    for i in range(len(text) - len(pattern) + 1):
        # if we match the first character of pattern in
        # text then we start to search for the rest of it
        if pattern[0] == text[i]:
            j = 1 # pattern[0] already matched, let's check from 1 onwards
            k = i + 1 # text[i] == pattern[0], let's check from text[i+1] onwards
            # advance k through text while part of pattern is still unmatched
            while j < len(pattern) and k < len(text):
                if pattern[j] == text[k]:
                    j += 1 # pattern[j] matched. Let's move to the next character
                k += 1
            if j == len(pattern): # if the match was found add it to the stack
                stack.append((i, k-1))
            else: # otherwise break the loop (we won't find any other match)
                break
    if not stack: # no match found
        return (-1, -1)
    lengths = [y - x for x, y in stack] # list of matches lengths
    return stack[lengths.index(min(lengths))] # return the shortest

C++

#include <iostream>
#include <vector>
#include <cstring>
using namespace std;

struct match_pair
{
    int start;
    int end;
    int length;
};

void
print_match (match_pair m)
{
    cout << "(" << m.start << ", " << m.end << ")";
}

match_pair 
shortest_match (const char * text, const char * pattern) 
{

  vector <match_pair> stack; // to store matches

  // measure both strings once; signed ints keep the loop bound below from
  // wrapping around when pattern is longer than text
  const int text_len = static_cast<int>(strlen(text));
  const int pattern_len = static_cast<int>(strlen(pattern));

  // an empty pattern is skipped entirely, so the caller gets (-1, -1)
  for (int i = 0; pattern_len > 0 && i < text_len - pattern_len + 1; ++i)
  {
    // if we match the first character of pattern in
    // text then we start to search for the rest of it
    if (pattern[0] == text[i])
    {
        int j = 1; // pattern[0] already matched, let's check from 1 onwards
        int k = i + 1; // text[i] == pattern[0], let's check from text[i+1] onwards
        // advance k through text while part of pattern is still unmatched
        while (j < pattern_len && k < text_len)
        {
            if (pattern[j] == text[k])
            {
                ++j; // pattern[j] matched. Let's move to the next character
            }
            ++k;
        }
        if (j == pattern_len) // if the match was found add it to the stack
        {
            match_pair current_match;
            current_match.start = i;
            current_match.end = k - 1;
            current_match.length = current_match.end - current_match.start;
            stack.push_back(current_match);
        } else // otherwise break the loop (we won't find any other match)
        {
            break;
        }
    }
  }

  match_pair shortest;
  if (stack.empty()) // no match, return (-1, -1)
  {
    shortest.start = -1;
    shortest.end = -1;
    shortest.length = 0;
    return shortest;
  }
  // search for shortest match
  shortest.start = stack[0].start;
  shortest.end = stack[0].end;
  shortest.length = stack[0].length;
  for (size_t i = 1; i < stack.size(); ++i)
  {
    if (stack[i].length < shortest.length)
    {
        shortest.start = stack[i].start;
        shortest.end = stack[i].end;
        shortest.length = stack[i].length;
    }
  }

  return shortest;

}

// overload << for printing match_pair
std::ostream& 
operator<< (std::ostream& os, const match_pair& m)
{
    return os << "(" <<  m.start << ", " << m.end << ")"; 
}

int
main () 
{
  char text[] = "axxxbcaxbcaxxbc";
  char pattern[] = "abc";

  cout << shortest_match(text, pattern);

  return 0;
}

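【Comments】:

Does it work for the pattern "abaa" and the string "ababa"?
@algrid As I understand the question, such a pair should not match, so I would say it works.
Well, as I understand it, it should match :)
@algrid I think you are right... I edited the answer. KMP no longer helps, though.
"abaa" must match "ababa"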

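【Solution 2】:

Iterate over the characters of the text and find the first character of the pattern. Once it is found, search the remaining text for the second character of the pattern, and repeat this for every character of the pattern, skipping the characters of the text that are not needed. When that is done, start again from the next occurrence of the pattern's first character in the text.

With the pattern abc this may be more intuitive: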

axxxbcaxbcaxxbc
[axxx|b|c] -> 6 chars
[ax|b|c] -> 4 chars
[axx|b|c] -> 5 chars

Or:

 aababaccccccc
[aa|baba|c] -> 7 chars
[a|baba|c] -> 6 chars
[a|ba|c] -> 4 chars
[accccccc] -> -1, as this substring does not match the pattern (no b follows the a)
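
The walk-through above can be reproduced with a short Python sketch. The function name candidate_windows is hypothetical and not part of the answer; it yields one greedy window per occurrence of the pattern's first character, with both indices inclusive, and stops at the first start position from which the pattern no longer fits.

```python
def candidate_windows(text, pattern):
    """Yield (start, end) for the greedy match that begins at each
    occurrence of pattern[0]; both indices are inclusive."""
    if not pattern:
        return  # assumption: an empty pattern produces no windows
    for start, ch in enumerate(text):
        if ch != pattern[0]:
            continue
        j, k = 1, start + 1  # next pattern index, next text index
        while j < len(pattern) and k < len(text):
            if text[k] == pattern[j]:
                j += 1
            k += 1
        if j < len(pattern):
            return  # the pattern does not fit here, so no later start fits either
        yield start, k - 1


# Print every candidate window and its length for the two example texts.
for text in ("axxxbcaxbcaxxbc", "aababaccccccc"):
    for start, end in candidate_windows(text, "abc"):
        print(text[start:end + 1], end - start + 1, "chars")
```

Taking the shortest of these windows, and the leftmost one on ties, gives the answer the question asks for.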

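Edit: you should try implementing this algorithm starting from the end of the text, since that is where the substring you are looking for is most likely to appear.

【Comments】: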
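
One possible reading of this suggestion, sketched in Python under stated assumptions: visit the end positions from right to left and, for each occurrence of the pattern's last character, match the rest of the pattern backwards. The name shortest_match_backward and the choice to return (-1, -1) for an empty pattern are assumptions made for this illustration, not something the answer specifies.

```python
def shortest_match_backward(text, pattern):
    """Return inclusive (start, end) of the shortest window of text that
    contains pattern as a subsequence, or (-1, -1) if there is none."""
    if not pattern:
        return (-1, -1)  # assumption: an empty pattern counts as no match
    best = (-1, -1)
    best_len = len(text) + 1
    for end in range(len(text) - 1, -1, -1):
        if text[end] != pattern[-1]:
            continue
        j = len(pattern) - 2  # next pattern index to match, moving left
        k = end - 1           # next text index to inspect, moving left
        while j >= 0 and k >= 0:
            if text[k] == pattern[j]:
                j -= 1
            k -= 1
        if j >= 0:
            break  # pattern does not fit before this end, nor before earlier ones
        start = k + 1
        # `<=` keeps the leftmost window among equally short ones, because
        # end positions are visited from right to left.
        if end - start + 1 <= best_len:
            best, best_len = (start, end), end - start + 1
    return best


print(shortest_match_backward("axxxbcaxbcaxxbc", "abc"))
print(shortest_match_backward("abcd", "x"))
print(shortest_match_backward("axxxbaxbab", "ab"))
```

Matching backwards from a fixed end gives the tightest window for that end, so the overall shortest window is among the candidates. The worst-case cost is the same as for the forward scan.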