【Question title】: Most efficient way to search for unknown patterns in a string?
【Posted】: 2017-10-22 15:59:04
【Question description】:

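I am trying to find patterns that:

  • occur more than once
  • are more than 1 character long
  • are not substrings of any other known pattern

without knowing any of the patterns that might occur.

For example: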

  • The string "the boy fell by the bell" would return 'ell', 'the b', 'y '.
  • The string "the boy fell by the bell, the boy fell by the bell" would return 'the boy fell by the bell'.

Using nested for loops, this can be brute-forced, but very inefficiently:

ArrayList<String> patternsList = new ArrayList<>();
int length = string.length();
for (int i = 0; i < length; i++) {
    int limit = (length - i) / 2;
    for (int j = limit; j >= 1; j--) {
        int candidateEndIndex = i + j;
        String candidate = string.substring(i, candidateEndIndex);

        if(candidate.length() <= 1) {
            continue;
        }

        if (string.substring(candidateEndIndex).contains(candidate)) {
            boolean notASubpattern = true;
            for (String pattern : patternsList) {
                if (pattern.contains(candidate)) {
                    notASubpattern = false;
                    break;
                }
            }

            if (notASubpattern) {
                patternsList.add(candidate);
            }
        }
    }
}

However, this is extremely slow when searching large strings that contain many patterns.

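【Question discussion】:

  • In a sense, this is a form of compression. You might want to do some research into the various compression algorithms.
  • Why is the single space not an element in your first result example?
  • @Björn Because it is only one character long.
  • Of course. /me wipes glasses
  • Why is ", " (a comma followed by a space) not part of your second result example?

Tags: java algorithm substring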


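【Solution 1】:

You can build a suffix tree for your string in linear time: https://en.wikipedia.org/wiki/Suffix_tree

The patterns you are looking for are the strings that correspond to internal nodes whose children are all leaves.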
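
As a rough illustration of that criterion, here is a minimal Java sketch. It is a stand-in under stated assumptions rather than a linear-time suffix tree: it builds an uncompressed suffix trie in O(n²) time and memory, which has the same branching structure. The class name `SuffixTrieRepeats`, the `'\0'` terminator (assumed not to occur in the input) and the minimum length of 2 are choices made for this example only. A trie node with at least two children plays the role of an internal node, and a child whose subtree never branches plays the role of a leaf.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SuffixTrieRepeats {
    static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
    }

    // Returns every string of length >= 2 whose trie node branches and
    // whose children all lead to a single suffix (i.e. are "leaves").
    static List<String> repeats(String text) {
        String s = text + '\0';  // terminator so that every suffix ends at its own leaf
        Node root = new Node();
        for (int i = 0; i < s.length(); i++) {
            Node cur = root;
            for (int j = i; j < s.length(); j++) {
                cur = cur.children.computeIfAbsent(s.charAt(j), c -> new Node());
            }
        }
        List<String> out = new ArrayList<>();
        collect(root, new StringBuilder(), out);
        return out;
    }

    // Returns true when the subtree at `node` is an unbranched chain,
    // which corresponds to a leaf edge in the compressed suffix tree.
    private static boolean collect(Node node, StringBuilder path, List<String> out) {
        boolean allChildrenAreLeaves = true;
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            path.append(e.getKey());
            allChildrenAreLeaves &= collect(e.getValue(), path, out);
            path.setLength(path.length() - 1);
        }
        boolean branching = node.children.size() >= 2;
        if (branching && allChildrenAreLeaves && path.length() >= 2) {
            out.add(path.toString());
        }
        return !branching && allChildrenAreLeaves;
    }

    public static void main(String[] args) {
        System.out.println(repeats("the boy fell by the bell"));
    }
}
```

For "the boy fell by the bell" the result contains 'the b', 'ell' and 'y ', but also suffixes of those such as 'he b' and 'll', so this criterion on its own would still need a filter to satisfy the "not a substring of another pattern" requirement.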

【Discussion】:

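    【Solution 2】:

    You can use n-grams to find patterns in a string. Scanning the string for n-grams takes O(n) time. Each time you extract a substring with an n-gram, put it into a hash table together with a count of how many times that substring has been found in the string. When you have finished scanning, search the hash table for counts greater than 1 to find the recurring patterns.

    For example, in the string "the boy fell by the bell, the boy fell by the bell", a 6-gram will find the substring "the boy fell by the bell". The hash table entry for that substring will have a count of 2 because it occurs twice in the string. Varying the number of words in the n-gram will help you discover different patterns in the string.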

    // requires: using System.Collections.Generic; using System.Text;
    Dictionary<string, int> dict = new Dictionary<string, int>();
    int count = 0;            // current position in str
    const int ngramSize = 6;  // number of words per n-gram
    
    // Add entries to the hash table
    while (count < str.Length) {
        // copy the next ngramSize words into the substring
        StringBuilder sb = new StringBuilder();
        int wordsLeft = ngramSize;
        while (wordsLeft > 0 && count < str.Length) {
            sb.Append(str[count]);
            if (str[count] == ' ')
                wordsLeft--;
            count++;
        }
        // get rid of the last blank (and a trailing comma) in the substring
        string substring = sb.ToString().Trim(' ', ',');
        // Update the dictionary (hash table) with the substring
        if (dict.ContainsKey(substring))  // substring is already in hash table so increment the count
            dict[substring]++;
        else
            dict[substring] = 1;
    }
    
    // Find the most commonly occurring pattern in the string
    // by searching the hash table for the greatest count.
    int maxCount = 0;
    string mostCommonPattern = "";
    foreach (KeyValuePair<string, int> pair in dict) {
        if (pair.Value > maxCount) {
            maxCount = pair.Value;
            mostCommonPattern = pair.Key;
        }
    }
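
The loop above takes consecutive, non-overlapping groups of words. A sliding-window variant is sketched below in Java (the language of the question) as one assumption about how every word n-gram could be counted. `WordNgramCounter`, its `count` method and the split on whitespace and commas are illustrative choices, not part of the answer.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class WordNgramCounter {
    // Counts every run of n consecutive words, moving the window one word at a time.
    static Map<String, Integer> count(String text, int n) {
        // Assumption for this sketch: words are separated by whitespace and/or commas.
        String[] words = text.trim().split("[\\s,]+");
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i + n <= words.length; i++) {
            String gram = String.join(" ", Arrays.copyOfRange(words, i, i + n));
            counts.merge(gram, 1, Integer::sum);  // insert with 1 or increment
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "the boy fell by the bell, the boy fell by the bell";
        count(text, 6).forEach((gram, c) -> {
            if (c > 1) System.out.println(gram + " -> " + c);
        });
    }
}
```

With the example string and n = 6, the only entry with a count above 1 is "the boy fell by the bell", with a count of 2. As in the answer, n has to be chosen in advance, so shorter or longer repeats require further passes with other values of n.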
    

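    【Discussion】:

      【Solution 3】:

      I wrote this just for fun. I hope I have understood the problem correctly and that this is valid and fast enough; if not, please go easy on me :) If anyone finds it useful, I think I could optimize it a bit more.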

      private static IEnumerable<string> getPatterns(string txt)
      {
          char[] arr = txt.ToArray();
          BitArray ba = new BitArray(arr.Length);
          for (int shingle = getMaxShingleSize(arr); shingle >= 2; shingle--)
          {
              char[] arr1 = new char[shingle];
              int[] indexes = new int[shingle];
              HashSet<int> hs = new HashSet<int>();
              Dictionary<int, int[]> dic = new Dictionary<int, int[]>();
              for (int i = 0, count = arr.Length - shingle; i <= count; i++)
              {
                  for (int j = 0; j < shingle; j++)
                  {
                      int index = i + j;
                      arr1[j] = arr[index];
                      indexes[j] = index;
                  }
                  int h = getHashCode(arr1);
                  if (hs.Add(h))
                  {
                      int[] indexes1 = new int[indexes.Length];
                      Buffer.BlockCopy(indexes, 0, indexes1, 0, indexes.Length * sizeof(int));
                      dic.Add(h, indexes1);
                  }
                  else
                  {
                      bool exists = false;
                      foreach (int index in indexes)
                          if (ba.Get(index))
                          {
                              exists = true;
                              break;
                          }
                      if (!exists)
                      {
                          int[] indexes1 = dic[h];
                          if (indexes1 != null)
                              foreach (int index in indexes1)
                                  if (ba.Get(index))
                                  {
                                      exists = true;
                                      break;
                                  }
                      }
                      if (!exists)
                      {
                          foreach (int index in indexes)
                              ba.Set(index, true);
                          int[] indexes1 = dic[h];
                          if (indexes1 != null)
                              foreach (int index in indexes1)
                                  ba.Set(index, true);
                          dic[h] = null;
                          yield return new string(arr1);
                      }
                  }
              }
          }
      }
      private static int getMaxShingleSize(char[] arr)
      {            
          for (int shingle = 2; shingle <= arr.Length / 2 + 1; shingle++)
          {
              char[] arr1 = new char[shingle];
              HashSet<int> hs = new HashSet<int>();
              bool noPattern = true;
              for (int i = 0, count = arr.Length - shingle; i <= count; i++)
              {
                  for (int j = 0; j < shingle; j++)
                      arr1[j] = arr[i + j];
                  int h = getHashCode(arr1);
                  if (!hs.Add(h))
                  {
                      noPattern = false;
                      break;
                  }
              }
              if (noPattern)
                  return shingle - 1;
          }
          return -1;
      }
      private static int getHashCode(char[] arr)
      {
          unchecked
          {
              int hash = (int)2166136261;
              foreach (char c in arr)
                  hash = (hash * 16777619) ^ c.GetHashCode();
              return hash;
          }
      }
      

      EDIT
      My previous code had serious problems. This one is better:

      private static IEnumerable<string> getPatterns(string txt)
      {
          Dictionary<int, int> dicIndexSize = new Dictionary<int, int>();
          for (int shingle = 2, count0 = txt.Length / 2 + 1; shingle <= count0; shingle++)
          {   
              Dictionary<string, int> dic = new Dictionary<string, int>();
              bool patternExists = false;
              for (int i = 0, count = txt.Length - shingle; i <= count; i++)
              {
                  string sub = txt.Substring(i, shingle);
                  if (!dic.ContainsKey(sub))
                      dic.Add(sub, i);
                  else
                  {   
                      patternExists = true;
                      int index0 = dic[sub];
                      if (index0 >= 0)
                      {
                          dicIndexSize[index0] = shingle;
                          dic[sub] = -1;
                      }
                  }
              }
              if (!patternExists)
                  break;
          }
          List<int> lst = dicIndexSize.Keys.ToList();
          lst.Sort((a, b) => dicIndexSize[b].CompareTo(dicIndexSize[a]));
          BitArray ba = new BitArray(txt.Length);
          foreach (int i in lst)
          {
              bool ok = true;
              int len = dicIndexSize[i];
              for (int j = i, max = i + len; j < max; j++)
              {
                  if (ok) ok = !ba.Get(j);
                  ba.Set(j, true);
              }
              if (ok)
                  yield return txt.Substring(i, len);
          }
      }
      

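      Processing the text of this book took 3.4 seconds on my computer.

      【Discussion】:

      • Hi @AlexQuilliam. I am wondering whether you have found a good solution. If so, it would be great if you could add your code. I am curious how my code compares with the best solution in terms of performance and validity.

      【Solution 4】:

      A suffix array is the right idea, but a non-trivial piece is missing, namely identifying what the literature calls "supermaximal repeats". Here is a GitHub repository with working code: https://github.com/eisenstatdavid/commonsub. Suffix array construction uses the SAIS library, vendored in as a submodule. The supermaximal repeats are found using a corrected version of the pseudocode for findsmaxr in Efficient repeat finding via suffix arrays (Becher–Deymonnaz–Heiber).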

      static void FindRepeatedStrings(void) {
        // findsmaxr from https://arxiv.org/pdf/1304.0528.pdf
        printf("[");
        bool needComma = false;
        int up = -1;
        for (int i = 1; i < Len; i++) {
          if (LongCommPre[i - 1] < LongCommPre[i]) {
            up = i;
            continue;
          }
          if (LongCommPre[i - 1] == LongCommPre[i] || up < 0) continue;
          for (int k = up - 1; k < i; k++) {
            if (SufArr[k] == 0) continue;
            unsigned char c = Buf[SufArr[k] - 1];
            if (Set[c] == i) goto skip;
            Set[c] = i;
          }
          if (needComma) {
            printf("\n,");
          }
          printf("\"");
          for (int j = 0; j < LongCommPre[up]; j++) {
            unsigned char c = Buf[SufArr[up] + j];
            if (iscntrl(c)) {
              printf("\\u%.4x", c);
            } else if (c == '\"' || c == '\\') {
              printf("\\%c", c);
            } else {
              printf("%c", c);
            }
          }
          printf("\"");
          needComma = true;
        skip:
          up = -1;
        }
        printf("\n]\n");
      }
      

      Here is sample output for the text of the first paragraph:

      Davids-MBP:commonsub eisen$ ./repsub input
      ["\u000a"
      ," S"
      ," as "
      ," co"
      ," ide"
      ," in "
      ," li"
      ," n"
      ," p"
      ," the "
      ," us"
      ," ve"
      ," w"
      ,"\""
      ,"&ndash;"
      ,"("
      ,")"
      ,". "
      ,"0"
      ,"He"
      ,"Suffix array"
      ,"`"
      ,"a su"
      ,"at "
      ,"code"
      ,"com"
      ,"ct"
      ,"do"
      ,"e f"
      ,"ec"
      ,"ed "
      ,"ei"
      ,"ent"
      ,"ere's a "
      ,"find"
      ,"her"
      ,"https://"
      ,"ib"
      ,"ie"
      ,"ing "
      ,"ion "
      ,"is"
      ,"ith"
      ,"iv"
      ,"k"
      ,"mon"
      ,"na"
      ,"no"
      ,"nst"
      ,"ons"
      ,"or"
      ,"pdf"
      ,"ri"
      ,"s are "
      ,"se"
      ,"sing"
      ,"sub"
      ,"supermaximal repeats"
      ,"te"
      ,"ti"
      ,"tr"
      ,"ub "
      ,"uffix arrays"
      ,"via"
      ,"y, "
      ]
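
For readers who want to try the same scan in Java, the following is a minimal sketch under loud assumptions: the suffix array is built by naively sorting suffixes (roughly O(n² log n)) instead of using SAIS, the LCP array is computed by direct comparison, and `SupermaximalRepeats`, `find` and `minLen` are names invented for this example. The scan follows the structure of the C function above: remember where an LCP plateau starts, and when the LCP value drops, accept the plateau only if the characters preceding its suffixes are pairwise distinct.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SupermaximalRepeats {
    static List<String> find(String s, int minLen) {
        int n = s.length();
        // Naive suffix array: sort the start positions by the suffixes they denote.
        Integer[] sa = new Integer[n];
        for (int i = 0; i < n; i++) sa[i] = i;
        Arrays.sort(sa, (a, b) -> s.substring(a).compareTo(s.substring(b)));

        // lcp[i] = length of the common prefix of suffixes sa[i-1] and sa[i];
        // lcp[0] and the sentinel lcp[n] stay 0.
        int[] lcp = new int[n + 1];
        for (int i = 1; i < n; i++) {
            int a = sa[i - 1], b = sa[i], h = 0;
            while (a + h < n && b + h < n && s.charAt(a + h) == s.charAt(b + h)) h++;
            lcp[i] = h;
        }

        List<String> out = new ArrayList<>();
        int up = -1;  // index where the current LCP plateau started, or -1
        for (int i = 1; i <= n; i++) {
            if (lcp[i - 1] < lcp[i]) { up = i; continue; }       // rise: new plateau
            if (lcp[i - 1] == lcp[i] || up < 0) continue;        // flat, or no open plateau
            // Drop: suffixes sa[up-1] .. sa[i-1] share a prefix of length lcp[up].
            Set<Character> left = new HashSet<>();
            boolean distinct = true;
            for (int k = up - 1; k < i && distinct; k++) {
                // A suffix starting at position 0 has no left neighbour and never clashes.
                if (sa[k] > 0) distinct = left.add(s.charAt(sa[k] - 1));
            }
            if (distinct && lcp[up] >= minLen) {
                out.add(s.substring(sa[up], sa[up] + lcp[up]));
            }
            up = -1;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(find("the boy fell by the bell", 2));
    }
}
```

Called with a minimum length of 2, this returns 'ell', 'the b' and 'y ' for the first example in the question and only 'the boy fell by the bell' for the second. For large inputs the naive sort would need to be replaced by a real suffix array construction such as the SAIS library the answer uses.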
      

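      【Discussion】:

        【Solution 5】:

        I would use the Knuth–Morris–Pratt algorithm (linear time complexity, O(n)) to search for substrings. I would try to find the largest substring pattern, remove it from the input string, then try to find the second largest, and so on. I would do it like this: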

        string pattern = input.Substring(0, input.Length / 2);
        string toMatchString = input.Substring(pattern.Length);
        
        List<string> matches = new List<string>();
        
        while (pattern.Length > 0)
        {
            // KMP is assumed to return the match index, or -1 when there is no match
            int index = KMP(pattern, toMatchString);
            if (index >= 0)
            {
                matches.Add(pattern);
        
                // remove the matched pattern occurrences from the input string
                // (one possible reading of the outline: drop every occurrence,
                // then restart with the first half of the shrunken input)
                input = input.Replace(pattern, "");
                pattern = input.Substring(0, input.Length / 2);
                toMatchString = input.Substring(pattern.Length);
                // keep looking for the next largest substring
            }
            else
            {
                // no match: shorten the candidate by one character and retry
                pattern = input.Substring(0, pattern.Length - 1);
                toMatchString = input.Substring(pattern.Length);
            }
        }
        

        KMP implements the Knuth–Morris–Pratt algorithm. You can find Java implementations of it on GitHub or at Princeton, or write it yourself.
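
For the "write it yourself" route, a minimal Java sketch of Knuth–Morris–Pratt search could look like the following. The class name `Kmp`, the argument order (pattern first, to mirror the `KMP(pattern, toMatchString)` call above) and the convention of returning -1 when there is no match are assumptions of this example.

```java
class Kmp {
    // Returns the index of the first occurrence of pattern in text, or -1 if there is none.
    static int indexOf(String pattern, String text) {
        int m = pattern.length();
        if (m == 0) return 0;

        // fail[i] = length of the longest proper prefix of pattern[0..i]
        // that is also a suffix of pattern[0..i].
        int[] fail = new int[m];
        for (int i = 1, k = 0; i < m; i++) {
            while (k > 0 && pattern.charAt(i) != pattern.charAt(k)) k = fail[k - 1];
            if (pattern.charAt(i) == pattern.charAt(k)) k++;
            fail[i] = k;
        }

        // Scan the text once; k is the number of pattern characters currently matched.
        for (int i = 0, k = 0; i < text.length(); i++) {
            while (k > 0 && text.charAt(i) != pattern.charAt(k)) k = fail[k - 1];
            if (text.charAt(i) == pattern.charAt(k)) k++;
            if (k == m) return i - m + 1;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("bell", "the boy fell by the bell"));  // prints 20
    }
}
```

Each call runs in O(n + m) time, but the loop in this answer calls it once per candidate pattern, so the overall search is not linear.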

        PS: I don't code in Java, and my first bounty is about to end soon, so please don't beat me up if I have missed something trivial or made an off-by-one error.

        【Discussion】:
