How to Find Out-of-Order Characters in a String

【Question title】: How to Find Out-of-Order Characters in a String
【Posted】: 2018-08-17 04:27:54
【Question】:

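This question relates to a problem I am currently facing at work, but since that problem is fairly broad, I have tried to phrase it more like an interview question to encourage discussion.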

Suppose we have the following two strings:

str1 = 'helloworld'
str2 = 'helloldwor'

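I would like to compare str1 and str2 and determine which characters in str2 are out of order, assuming str1 is "correct". You can also assume that str2 contains exactly the same characters as str1 (str2 is just a jumbled version of str1).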

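Edit: In this case, I would say 'ld' is out of order. I define the "out-of-order" substring as the smallest substring of str2 that, if moved to the same position as the matching substring in str1, would make str1 == str2.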
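
To make that definition concrete, here is a minimal brute-force sketch that applies it literally: it tries every substring of str2, shortest first, cuts it out, and checks whether re-inserting it at some position reproduces str1. The function name smallest_moved_block is hypothetical, and the sketch assumes a single moved block; it is meant as a reference for checking other approaches, not as an efficient solution.

```python
def smallest_moved_block(str1, str2):
    """Return the shortest substring of str2 that, cut out and re-inserted
    elsewhere, turns str2 into str1.

    Returns '' if the strings are already equal and None if no single
    block move works. (Hypothetical helper, cubic-time brute force.)
    """
    if str1 == str2:
        return ''
    n = len(str2)
    for length in range(1, n):                  # shortest blocks first
        for start in range(n - length + 1):
            block = str2[start:start + length]
            rest = str2[:start] + str2[start + length:]
            # Try re-inserting the block at every position of the remainder.
            for pos in range(len(rest) + 1):
                if rest[:pos] + block + rest[pos:] == str1:
                    return block
    return None


print(smallest_moved_block('helloworld', 'helloldwor'))  # ld
```

When several blocks of the same length qualify, this sketch returns the one that starts earliest in str2.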

This problem has bothered me for a long time: it is easy to work out visually, but I am struggling to turn it into an algorithm.

My attempt:

def mid(s, start, length):
    # Helper assumed from context (not shown in the question):
    # substring of s beginning at index start with the given length
    return s[start:start + length]

def get_ooo(str1, str2):
    # Potential options: substrings of str2 that do not occur in str1
    local_set = set()

    # Loop from len(str1) down to 2, taking substrings of str2 of length i
    for i in range(len(str1), 1, -1):
        print('Iteration #' + str(len(str1) - i))

        # Try to find substrings of str2 of length i in str1
        # (as in the original attempt, this range stops one start position short of the end)
        for j in range(0, len(str1) - i):
            candidate = mid(str2, j, i)
            if str1.find(candidate) < 0:
                # Failed to find substring in str1

                # Add to local_set only if it is a substring of all other failed substrings
                intersect = True
                for k in local_set:
                    if k.find(candidate) < 0:
                        intersect = False

                # If substring was a substring of all other failed substrings
                if intersect:
                    # Add to local_set
                    local_set.add(candidate)
                    print(candidate + ' - FAIL, PASS')
                else:
                    print(candidate + ' - FAIL, FAIL')
            else:
                print(candidate + ' - PASS')

    # Solution found? Pick the shortest remaining option
    best_option = ''
    for option in local_set:
        if len(option) < len(best_option) or best_option == '':
            best_option = option
    return best_option

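Essentially, the logic is to look for substrings of str2 in str1, starting with the largest possible substring. When I find one that does not fit, I add it to a set of potential solutions. If I find another substring that does not occur in str1, I add it to the potential options only if it is also a substring of every other potential option. At the end, the smallest substring in the set therefore contains the first out-of-order character.

So with this algorithm I always know where the out-of-order part starts. However, I do not know how to actually extract the out-of-order part.

I tried passing the reversed strings to the function, which gives me the first out-of-order character counting from the back of the string, and in this case that gives me the complete out-of-order substring. But what if several sections are out of order? Also, according to my tests, the script only returns the first instance of an out-of-order substring in str2. For example: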

str1 = 'helloworld'
str2 = 'hworldello'

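will return 'hw', telling me that 'w' is where the string goes out of order. But in this example it would make more sense to report 'ello' as out of order rather than 'world', since 'world' is the larger substring.

I have been staring at this problem for more than a day and decided it was time to open it up to other opinions, especially since I feel there must be a better way. So what does everyone think? Does anyone have a brilliant idea?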

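【Question discussion】:

You could read up on Levenshtein distance: en.wikipedia.org/wiki/Levenshtein_distance
I think you should start with a formal definition of "out of order".
@juanbits Is Levenshtein distance useful here? The OP is looking for out-of-order blocks of characters.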
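
For readers unfamiliar with the metric mentioned above, here is a minimal sketch of the standard dynamic-programming Levenshtein distance (the function name levenshtein is chosen for illustration). It returns a single number, the count of single-character insertions, deletions and substitutions, which is why it does not by itself identify which block was moved.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


print(levenshtein('kitten', 'sitting'))  # 3
```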

Tags: python string sorting

【Solution 1】:

You can use recursion and search the string both forwards and backwards to make sure the smallest substring is returned:

def find_substring_forwards(str1, str2, storage, index):

    global main1, main2

    for i, (x, y) in enumerate(zip(str1, str2)):
        if x!=y and not storage: index = i
        expected_letter = main1[main1.index(''.join(storage))+len(storage)]
        if (x!=y or (expected_letter==x and main1.count(expected_letter)>1)) and index==i:
            storage.append(y)
            str2 = str2[:i] + str2[i+1:]
            if str1[:len(str2)]==str2: break
            return find_substring_forwards(str1, str2, storage, index)

    return ''.join(storage)

def find_substring_backwards(str1, str2, storage, index):

    global main1, main2

    for i, (x, y) in enumerate(zip(str1, str2)):
        if x!=y and not storage: index = i
        if x!=y and index==i:
            storage.append(y)
            str2 = str2[:i] + str2[i+1:]
            if str1[:len(str2)]==str2: break
            return find_substring_backwards(str1, str2, storage, index)

    return ''.join(storage)

def out_of_order(str1, str2):

    x = ''.join(find_substring_forwards(str1, str2, [], None))
    y = ''.join(find_substring_backwards(str1[::-1], str2[::-1], [], None)[::-1])
    final = x if len(x)<=len(y) else y

    return final

A few test cases:

test_cases = [('helloworld','heworldllo','llo'),
            ('helloworld','hwoellorld','wo'),
            ('helloworld','hworldello','ello'),
            ('helloworld','helloldwor','ld'),
            ('helloworld','helloowrld','o'),
            ('helloworld','whelloorld','w')]

for test in test_cases:
    main1 = test[0]; main2 = test[1]
    x = out_of_order(main1, main2)
    print(main1, '|', main2)
    print('Expected:', test[2], '| Returned:', x, '\n')

Output:

helloworld | heworldllo
Expected: llo | Returned: llo 

helloworld | hwoellorld
Expected: wo | Returned: wo 

helloworld | hworldello
Expected: ello | Returned: ello 

helloworld | helloldwor
Expected: ld | Returned: ld 

helloworld | helloowrld
Expected: o | Returned: o 

helloworld | whelloorld
Expected: w | Returned: w

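Explanation:

We iterate over both strings simultaneously, knowing that str1 is the desired string. When we find a letter in str2 that does not match the letter str1 has at that position, we add that letter to storage and note the index. We then remove that letter from the string and repeat the process. We continue this recursive loop until the index at which we are removing letters changes, which indicates that we have reached the end of the "out-of-order" substring. To make sure we find the smallest substring (in terms of characters), we must run the same approach in reverse (traversing the strings backwards). In the out_of_order function we simply take the shorter of the two substrings; if they have the same length, we take the result of the forward iteration (since both are technically correct).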
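
For comparison, here is a non-recursive sketch of a different technique (it is not the answer's algorithm): trim the prefix and suffix that the two strings share, then check whether the remaining middle sections are rotations of each other. It relies on two assumptions: every character occurs only once in str1, and exactly one block was moved. The name moved_block_unique is hypothetical.

```python
def moved_block_unique(str1, str2):
    """Shortest moved block, assuming unique characters and a single move.

    Returns '' if the strings are equal and None if the difference cannot
    be explained by moving one block.
    """
    if str1 == str2:
        return ''
    if len(str1) != len(str2):
        return None
    # Skip the prefix and the suffix that the two strings have in common.
    start = 0
    while str1[start] == str2[start]:
        start += 1
    end = len(str1)
    while str1[end - 1] == str2[end - 1]:
        end -= 1
    mid1, mid2 = str1[start:end], str2[start:end]
    # With one moved block, mid1 == A + B and mid2 == B + A.
    split = mid1.find(mid2[0])
    if split < 0 or mid1[split:] + mid1[:split] != mid2:
        return None
    first, second = mid1[:split], mid1[split:]
    return first if len(first) <= len(second) else second


print(moved_block_unique('abcdefgh', 'abfgcdeh'))  # fg
```

With repeated characters the trimming step can consume part of the moved block, so this sketch should not be used on inputs such as 'helloworld' without further checks.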

Update

There was a problem with repeated letters in the strings for the following test case:

('helloworld','heworldllo','llo')

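The algorithm now checks whether the current letter, as a candidate for removal from the string, is actually the expected next letter of the out-of-order substring. If it is, the letter is added to the out-of-order substring's storage container instead of ending the substring search prematurely. For clarity and readability, the functions for forward and backward iteration have also been separated.

【Discussion】:

Thanks for the answer! I think this works well; the only problem is the case where the characters in each string match but we are still reading through the "out-of-order" block. For example, out_of_order('helloworld','heworldllo') will return 'wor'. However, this still works for me, because the "strings" I am working with have unique characters, so I should not run into this case. To address it, I am now looking for an approach that handles non-unique characters, and I will update if I find anything.
Great! If this helped, an accept would be much appreciated!
Actually, by your criteria, 'wor' is correct. Did you want it to return 'llo'?
I would expect 'world' to be returned, because 'world' was moved from the end to between the pieces of 'hello'.
This was your criterion in the OP: "I define the 'out-of-order' substring as the smallest substring of str2 that, if moved to the same position as the substring in str1, would make str1 == str2." Therefore 'llo' is correct. I see what you meant in your first comment; sorry for not understanding at first. I am working on it too.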
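
The three candidates debated in these comments can be checked against a literal reading of the quoted definition. The sketch below uses a hypothetical helper, move_fixes, that reports whether cutting one occurrence of a block out of str2 and re-inserting it somewhere yields str1. Under that reading, 'llo' and 'world' both qualify, 'wor' does not, and 'llo' is the shorter of the two.

```python
def move_fixes(str1, str2, block):
    """True if cutting one occurrence of block out of str2 and
    re-inserting it at some position produces str1."""
    start = str2.find(block)
    while start != -1:
        rest = str2[:start] + str2[start + len(block):]
        if any(rest[:pos] + block + rest[pos:] == str1
               for pos in range(len(rest) + 1)):
            return True
        # Try the next occurrence of the block, if there is one.
        start = str2.find(block, start + 1)
    return False


for candidate in ('wor', 'llo', 'world'):
    print(candidate, move_fixes('helloworld', 'heworldllo', candidate))
# wor False
# llo True
# world True
```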