Find the list of substrings of a given string that are in a dictionary: answers

【Question title】: Find list of substrings for a given string that are in the dictionary
【Posted】: 2016-04-13 18:42:58
【Question】:

The input is essentially a dictionary (an array of strings) and an InputString.

We want to find every substring of that string that appears in the dictionary.

Input:
Dictionary:  ["hell", "hello", "heaven", "ample", "his", "some", "other", "words"]
String: "hello world, this is an example"

Output: ["hell", "hello", "his", "ample"] //all the substrings that are in dictionary.

The solution I can think of is to build a trie-like structure from the dictionary and then run the following loop:

for(i= 0 to inputString.length)
   substring = inputString.substring(i,length)
   lookupInTrie(substring) 

lookupInTrie(string)
   this function returns list of complete words from trie that match the prefix of string. 
   i.e, if you pass in string "hello world" to this function and dictionary has word "hell" and "hello" then our lookup will return ["hell","hello"];
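
The loop above can be sketched in Python roughly as follows. This is a minimal sketch, not the asker's actual code: the nested-dict trie, the `END` marker, and the function names `build_trie`, `lookup_in_trie`, and `find_dictionary_substrings` are all assumptions made for illustration. Instead of copying a substring at each position, the lookup walks the trie directly from index `start`.

```python
END = ""  # hypothetical marker key; cannot clash with a one-character key


def build_trie(words):
    """Build a nested-dict trie; node[END] holds the word ending at that node."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def lookup_in_trie(root, text, start):
    """Return every dictionary word that is a prefix of text[start:]."""
    found = []
    node = root
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:  # no dictionary word continues with this character
            break
        if END in node:
            found.append(node[END])
    return found


def find_dictionary_substrings(dictionary, text):
    """Try every start position; report each word once, in order of first match."""
    root = build_trie(dictionary)
    seen = set()
    result = []
    for start in range(len(text)):
        for word in lookup_in_trie(root, text, start):
            if word not in seen:
                seen.add(word)
                result.append(word)
    return result


dictionary = ["hell", "hello", "heaven", "ample", "his", "some", "other", "words"]
print(find_dictionary_substrings(dictionary, "hello world, this is an example"))
# ['hell', 'hello', 'his', 'ample']
```

Each start position can walk at most as deep as the longest dictionary word, so the quadratic behaviour only shows up when the dictionary contains words about as long as the input string.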

So, if we don't count the dictionary-to-trie conversion, finding all substrings of the given string that are in the dictionary can be done in O(n^2) time.

I'd like to know whether we can optimize this further and bring the complexity below n^2.

【Comments】:

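Could you provide more details, such as sample input and expected output? When you say substrings, do you mean only prefixes? Is the dictionary sorted?
I'm not sure sorting the dictionary matters, since I convert it to a trie anyway, but you can assume it is sorted if you like. I'll update the question with an example.
If the dictionary is small (say, d elements averaging dw characters each) and the string has length l, you can invert the problem: search for each dictionary word in the input string. With a good search algorithm (such as Boyer-Moore), the resulting complexity is roughly d*(l/dw). You also save the time needed to build the trie.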
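
The inverted approach from the last comment can be sketched in a couple of lines of Python. This is only an illustration: it uses Python's built-in `in` substring test as a stand-in for a Boyer-Moore search (an assumption, not a Boyer-Moore implementation), and the function name `words_found_in` is hypothetical.

```python
def words_found_in(dictionary, text):
    """Keep each dictionary word that occurs somewhere in text (one search per word)."""
    return [word for word in dictionary if word in text]


dictionary = ["hell", "hello", "heaven", "ample", "his", "some", "other", "words"]
print(words_found_in(dictionary, "hello world, this is an example"))
# ['hell', 'hello', 'ample', 'his']
```

Note that the result follows dictionary order rather than the order in which the words appear in the string.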

Tags: string algorithm dictionary trie

【Solution 1】:

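What you're describing looks like a good fit for the Aho-Corasick string-matching algorithm, which is essentially an optimized version of the algorithm you described above. It works by building a trie from the pattern strings and then running the original string through it, but in a way that avoids most of the backtracking. The total time complexity is O(m + n + z), where m is the length of the string being searched, n is the total length of the pattern strings, and z is the number of matches.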
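
As a rough illustration of the idea, here is a compact Python sketch of an Aho-Corasick matcher. The list-based state layout and the names `build_automaton` and `find_matches` are assumptions made for this example, not part of any library. For simplicity each state stores a merged list of the patterns that end there, which is easy to read but can cost more memory than the output links a tuned implementation would use.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie (goto), failure links (fail), and per-state outputs (out)."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(pat)
    # Breadth-first pass: depth-1 states keep fail = 0; deeper states
    # derive their failure link from their parent's failure link.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]  # inherit matches that end here as suffixes
    return goto, fail, out


def find_matches(patterns, text):
    """Return (start_index, pattern) for every occurrence, scanning text once."""
    goto, fail, out = build_automaton(patterns)
    matches = []
    s = 0
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]  # fall back instead of restarting from the next position
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            matches.append((i - len(pat) + 1, pat))
    return matches


dictionary = ["hell", "hello", "heaven", "ample", "his", "some", "other", "words"]
print(find_matches(dictionary, "hello world, this is an example"))
# [(0, 'hell'), (0, 'hello'), (14, 'his'), (26, 'ample')]
```

The text index `i` only ever moves forward; all the "backtracking" happens through the failure links, which is where the saving over the per-position trie lookup comes from.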

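You could also use a suffix tree here. Building a suffix tree for the sentence and then searching it for each pattern takes O(m + n + z) time, where m, n, and z are defined as above, though it will be much harder to code up from scratch.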
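
A full suffix tree is too long to show here, so the sketch below swaps in a suffix array, a simpler relative that supports the same kind of query: index the sentence once, then look up each pattern. It is not the suffix-tree method itself and does not reach the O(m + n + z) bound; the construction is deliberately naive, and the names `build_suffix_array` and `occurrences` are hypothetical.

```python
def build_suffix_array(text):
    """Start indices of all suffixes of text, in lexicographic order of the suffixes.

    Naive construction for illustration only; it compares whole suffixes.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of pattern in text via two binary searches."""
    m = len(pattern)

    def prefix(k):
        # First m characters of the k-th smallest suffix.
        start = suffix_array[k]
        return text[start:start + m]

    # First suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "hello world, this is an example"
sa = build_suffix_array(text)
dictionary = ["hell", "hello", "heaven", "ample", "his", "some", "other", "words"]
print([w for w in dictionary if occurrences(text, sa, w)])
# ['hell', 'hello', 'ample', 'his']
```

The index is built once per sentence, so this layout pays off when many patterns are checked against the same text.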

【Discussion】: