文本分割：将输入与字典中最长的单词匹配的算法答案

【问题标题】：Text segmentation: Algorithm to match input with the longest words from the dictionary文本分割：将输入与字典中最长的单词匹配的算法
【发布时间】：2013-05-22 10:35:04
【问题描述】：

我需要将字符串拆分为单词，以便每个单词都来自字典。还要确保选择左边最长的单词。因此

thisisinsane => this is insane (correct as longest possible word from left)
thisisinsane => this is in sane(wrong) 
Assuming 'this', 'is', 'in', 'insane' are all words in the dictionary.

我设法通过从字符串的末尾遍历到匹配最长单词的开头来解决了这个问题。但是问题开始困扰我们这些问题..

shareasale => share as ale(wrong as 'ale' not in the dictionary)
shareasale => share a sale(correct)
Assuming 'share', 'a', 'sale' are all words in the dictionary unlike 'ale'.

我试图通过删除遇到错误之前找到的有效段来解决这个问题，即

shareasale => 'share' and 'as' (error = 'ale')

并从字典中删除它们一次，然后解决问题。所以

shareasale => no valid segments when share is removed
shareasale => share a sale (when 'as' is removed from the dictionary.

因此我也设法解决了这个问题。但是后来我无法解决这个问题

asignas => 'as' ( error = 'ignas')

然后我的解决方案将从字典中删除“as”并尝试解决它

asignas => 'a' 'sign' (error = 'as')

因为在新的递归调用中，'as' 已从字典中删除。我写的函数是in this link。我希望有人可以通过它并帮助我找到更好的算法来解决这个问题，否则建议修改我现有的算法。

【问题讨论】：

标签： python algorithm text-segmentation

【解决方案1】：

本质上，您的问题是一个树问题，在每个级别上，构成树前缀的所有单词都形成分支。不留下字符串任何部分的分支是正确的解决方案。

                      thisisinsane
                          |
                          |
                     (this)isinsane
                     /            \
                    /              \
          (this,i)sinsane         (this,is)insane
              /                     /           \
             /                     /             \
  (this,i,sin)ane          (this,is,in)sane    (this,is,insane)
                                /
                               /
                       (this,is,in,sane)

所以在这个例子中有两个解决方案，但是我们想使用最长的单词来选择解决方案，即我们想使用深度优先搜索策略从右边探索树。

所以我们的算法应该：

按长度降序对字典进行排序。
查找当前字符串的所有前缀。如果没有，请返回False。
将prefix 设置为最长的未探索前缀。
从字符串中删除它。如果字符串为空，我们找到了解决方案，返回所有前缀的列表。
递归到 2。
此分支失败，返回False。

此解决方案的示例实现：

def segment(string,wset):
    """Segments a string into words prefering longer words givens
    a dictionary wset."""
    # Sort wset in decreasing string order
    wset.sort(key=len, reverse=True)
    result = tokenize(string, wset, "")
    if result:
        result.pop() # Remove the empty string token
        result.reverse() # Put the list into correct order
        return result
    else:
        raise Exception("No possible segmentation!")

def tokenize(string, wset, token):
    """Returns either false if the string can't be segmented by 
    the current wset or a list of words that segment the string
    in reverse order."""
    # Are we done yet?
    if string == "":
        return [token]
    # Find all possible prefixes
    for pref in wset:
        if string.startswith(pref):
            res = tokenize(string.replace(pref, '', 1), wset, pref)
            if res:
                res.append(token)
                return res
    # Not possible
    return False

print segment("thisisinsane", ['this', 'is', 'in', 'insane']) # this is insane
print segment("shareasale", ['share', 'a', 'sale', 'as'])     # share a sale
print segment("asignas", ['as', 'sign', 'a'])                 # a sign as

【讨论】：

谢谢@Jakub！尽管 Photon 的回答解决了我的问题，但这个回答通过清晰的解释帮助我更好地理解了解决方案。

【解决方案2】：

只需进行递归扫描，每次在最后一个单词中添加一个字母（如果它在字典中），并尝试让它开始一个新单词。这意味着在每次通话中，您有 1 或 2 个假设要测试（是否有空格）。当您到达输入的末尾并且您有一组有效的单词时，如果其中的第一个单词比您迄今为止找到的最佳解决方案长，请保存此解决方案。

示例代码：

words=['share','as','a','sale','bla','other','any','sha','sh']
wdict={}
best=[]

def scan(input,i,prevarr):
    global best
    arr=list(prevarr)
    # If array is empty, we automatically add first letter
    if len(arr)<1:
        arr.append(input[0:1])
        return scan(input,i+1,arr)
    # If no more input is available, evaluate the solution
    if i>=len(input):
        # Is the last word a valid word
        if wdict.has_key(arr[-1]):
            # Is there a current best solution?
            if len(best)==0:
                best=arr      # No current solution so select this one
            elif len(arr[0])>len(best[0]):
                best=arr      # If new solution has a longer first word
        return best
    # If the last word in the sequence is a valid word, we can add a space and try
    if wdict.has_key(arr[-1]):
        arr.append(input[i:i+1])
        scan(input,i+1,arr)
        del arr[-1]
    # Add a letter to the last word and recurse
    arr[-1]=arr[-1]+input[i:i+1]
    return scan(input,i+1,arr)

def main():
    for w in words:
        wdict[w]=True
    res=scan('shareasasale',0,[])
    print res

if __name__ == '__main__':
    main()

【讨论】：

感谢@photon 努力写出完整的解决方案。它似乎非常适合我的需求。
我正在浏览代码，发现它有点难以理解。如果您能用 cmets 解释解决方案，我将不胜感激。
添加了 cmets。玩得开心。

【解决方案3】：

我将使用的算法类似于：

让您的字典按长度递减排序
构建一个函数prefix(s)，它按顺序生成以s 开头的字典单词
- 即。 prefix("informativecow") 首先产生“informative”，然后是“inform”，然后是“info”，然后是“in”
构建递归函数test(s)：
- 如果s是空字符串，返回True
- 对于每个前缀，从 s 中删除前缀并使用该字符串调用 test()。如果你找到一个是真的，返回真。如果您没有找到任何正确的，请返回 False。

【讨论】：

感谢您回答@Dave，但 Photon 的回答解决了我的问题。不过，我很好奇这是否适用于 'asignas' => 'a' 'sign' 'as'。

【解决方案4】：

有关如何进行英语分词的真实示例，请查看Python wordsegment module 的来源。它有点复杂，因为它使用单词和短语频率表，但它说明了记忆方法。

特别是score 说明了分割方法：

def score(word, prev=None):
    "Score a `word` in the context of the previous word, `prev`."

    if prev is None:
        if word in unigram_counts:

            # Probability of the given word.

            return unigram_counts[word] / 1024908267229.0
        else:
            # Penalize words not found in the unigrams according
            # to their length, a crucial heuristic.

            return 10.0 / (1024908267229.0 * 10 ** len(word))
    else:
        bigram = '{0} {1}'.format(prev, word)

        if bigram in bigram_counts and prev in unigram_counts:

            # Conditional probability of the word given the previous
            # word. The technical name is *stupid backoff* and it's
            # not a probability distribution but it works well in
            # practice.

            return bigram_counts[bigram] / 1024908267229.0 / score(prev)
        else:
            # Fall back to using the unigram probability.

            return score(word)

如果你替换了 score 函数，让它在你的字典中为更长的匹配返回更高的分数，那么它会按照你的意愿做。

【讨论】：