Count letter differences of two strings: answers

【Question title】: Count letter differences of two strings
【Posted】: 2012-09-01 10:11:23
【Question description】:

This is the behavior I want:

a: IGADKYFHARGNYDAA
c: KGADKYFHARGNYEAA
2 difference(s).

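【Question discussion】:

No, I'm new to Python, so I'm working through some logic problems to sort out my thinking!
How far did you get in your own attempts (before asking here)? I'd recommend some tutorials; my favourite is Udacity's [CS101](www.udacity.com/course/cs101).
Actually, I've been working on it since this morning and got tired of it, so that's why I'm only asking now!
@NiklasB. I guess I can only put one link in each comment
@hayden: Actually, what you posted isn't a URL :)

Tags: python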

【Solution 1】:

def diff_letters(a, b):
    # Count the positions where the two strings differ (True counts as 1)
    return sum(a[i] != b[i] for i in range(len(a)))

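【Discussion】:

hayden, I applied your method, but it doesn't count how many differences there are
You could also use zip: sum(1 for x,y in zip(a, b) if x != y). Summing booleans is a bit unintuitive, in my opinion :P
@rocker: I think you're missing some basic understanding here
What do I do after this point: a = raw_input("Enter line 1: ") b = raw_input("Enter line 2: ")? I don't know what to do!! ://
I think sum(x!=y for x,y in zip(a,b)) is the more robust approach, because the given example can raise an error if a and b have different lengths (an IndexError when b is shorter than a)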
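
To make the trade-off from the comments concrete, here is a minimal sketch in Python 3 syntax. The function names diff_letters_zip and diff_letters_strict are made up for illustration and do not appear in the answers. The zip-based count silently ignores any tail of the longer string, so a stricter variant that refuses unequal lengths is shown next to it.

```python
def diff_letters_zip(a, b):
    """Count positions where a and b differ, up to the shorter length."""
    # zip() stops at the end of the shorter string, so no IndexError occurs
    return sum(1 for x, y in zip(a, b) if x != y)


def diff_letters_strict(a, b):
    """Hypothetical stricter variant: refuse strings of unequal length."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return diff_letters_zip(a, b)


print("%d difference(s)." % diff_letters_zip("IGADKYFHARGNYDAA", "KGADKYFHARGNYEAA"))
# prints: 2 difference(s).
```

Which behaviour is preferable depends on whether a length mismatch should count as an input error or simply be ignored.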

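【Solution 2】:

I think this example will work for your specific case without much hassle and without running into compatibility problems with your version of Python (please upgrade to 2.7):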

a='IGADKYFHARGNYDAA'
b='KGADKYFHARGNYEAA'

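u=zip(a,b)
# dict() keeps one entry per distinct letter of a, so repeated letters collapse
# and the result has 10 entries rather than 16
d=dict(u)

x=[]
for i,j in d.items():
    if i==j:
        x.append('*')   # same letter in both strings
    else:
        x.append(j)     # letter from b that differs

print x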

Output: ['*', 'E', '*', '*', 'K', '*', '*', '*', '*', '*']

With a few tweaks you can get what you want... let me know if this helps :-)

Update

You can also do it this way:

a='IGADKYFHARGNYDAA'
b='KGADKYFHARGNYEAA'

u=zip(a,b)
for i,j in u:
    if i==j:
        print i,'--',j
    else: 
        print i,'  ',j

Output:

I    K
G -- G
A -- A
D -- D
K -- K
Y -- Y
F -- F
H -- H
A -- A
R -- R
G -- G
N -- N
Y -- Y
D    E
A -- A
A -- A

Update 2

You can modify the code like this:

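y=[]
counter=0   # declared but never incremented; len(y) gives the count instead
for i,j in u:   # u is the zip(a,b) list from above (Python 2)
    if i==j:
        print i,'--',j
    else:
        y.append(j)   # collect the letters of b that differ
        print i,'  ',j

print '\n', y

print '\n Length = ',len(y)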

Output (this run evidently used a b ending in X rather than A):

I    K
G -- G
A -- A
D -- D
K -- K
Y -- Y
F -- F
H -- H
A -- A
R -- R
G -- G
N -- N
Y -- Y
D    E
A -- A
A    X

['K', 'E', 'X']

 Length =  3

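【Discussion】:

But I want to count how many differences there are and print that, so how can I do that with your code above??
Please see Update 2 in my answer... let me know if it helps, my friend. Keep in mind that there are many variations of the same answer, and you can adapt it as needed... have fun :-)
Bro, what should I do if I just want to count the differences without showing the letters??
I mean I want to print how many are different, and it should print like this: 2 difference(s).
@securecurve: In Update 2 you added a counter, but you don't increment or use it.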
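
Picking up the last comment, here is a minimal sketch of what Update 2 might look like if the counter were actually incremented and only the count were printed. It is written in Python 3 syntax and uses the two strings from the question; it is one possible reading of what the asker wanted, not the answerer's own code.

```python
a = 'IGADKYFHARGNYDAA'
b = 'KGADKYFHARGNYEAA'

counter = 0
for i, j in zip(a, b):
    if i != j:
        counter += 1  # this increment is the step Update 2 leaves out

print("%d difference(s)." % counter)
# prints: 2 difference(s).
```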

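【Solution 3】:

Theory

Iterate over both strings simultaneously and compare the characters.
Store the result as a new string by appending a space or a | character to it, respectively. In addition, increment an integer value, starting from zero, for each differing character.
Output the result.

Implementation

You can use the built-in zip function or itertools.izip to iterate over both strings simultaneously; the latter performs somewhat better on large inputs. If the strings are not the same size, iteration only covers the shorter one. In that case, you can fill up the rest with the no-match indicator character.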

import itertools

def compare(string1, string2, no_match_c=' ', match_c='|'):
    if len(string2) < len(string1):
        string1, string2 = string2, string1
    result = ''
    n_diff = 0
    for c1, c2 in itertools.izip(string1, string2):
        if c1 == c2:
            result += match_c
        else:
            result += no_match_c
            n_diff += 1
    delta = len(string2) - len(string1)
    result += delta * no_match_c
    n_diff += delta
    return (result, n_diff)

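Example

Here is a simple test, with slightly different options than in your example above. Note that I used an underscore to indicate non-matching characters, to better demonstrate how the resulting string is expanded to the size of the longer string.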

def main():
    string1 = 'IGADKYFHARGNYDAA AWOOH'
    string2 = 'KGADKYFHARGNYEAA  W'
    result, n_diff = compare(string1, string2, no_match_c='_')

    print "%d difference(s)." % n_diff  
    print string1
    print result
    print string2

main()

Output:

niklas@saphire:~/Desktop$ python foo.py 
6 difference(s).
IGADKYFHARGNYDAA AWOOH
_||||||||||||_|||_|___
KGADKYFHARGNYEAA  W

【Discussion】:

This is the most thorough answer, and it also handles missing letters. It should be the accepted one.
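
As a variation on this answer (not its original code), the padding can also be done during iteration with itertools.zip_longest, which exists in Python 3 (Python 2 calls it izip_longest). The sketch below assumes Python 3; the name compare_padded is made up for illustration.

```python
from itertools import zip_longest


def compare_padded(string1, string2, no_match_c=' ', match_c='|'):
    """Hypothetical variant of compare() built on zip_longest."""
    marks = []
    n_diff = 0
    # fillvalue=None never equals a real character, so every position past
    # the end of the shorter string is counted as a difference.
    for c1, c2 in zip_longest(string1, string2, fillvalue=None):
        if c1 == c2:
            marks.append(match_c)
        else:
            marks.append(no_match_c)
            n_diff += 1
    return ''.join(marks), n_diff


result, n_diff = compare_padded('IGADKYFHARGNYDAA AWOOH',
                                'KGADKYFHARGNYEAA  W', no_match_c='_')
print("%d difference(s)." % n_diff)  # prints: 6 difference(s).
print(result)                        # prints: _||||||||||||_|||_|___
```

With this approach the strings no longer need to be swapped so that the shorter one comes first, because zip_longest pads whichever side runs out.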

【Solution 4】:

Python has the excellent difflib, which should provide the needed functionality.

Here is example usage from the documentation:

>>> import difflib  # Works for python >= 2.1
>>> s = difflib.SequenceMatcher(lambda x: x == " ",
...                     "private Thread currentThread;",
...                     "private volatile Thread currentThread;")
>>> for block in s.get_matching_blocks():
...     print "a[%d] and b[%d] match for %d elements" % block
a[0] and b[0] match for 8 elements
a[8] and b[17] match for 21 elements
a[29] and b[38] match for 0 elements

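【Discussion】:

But Thomas, how should I use it in Python 2.6.5?
Not sure this answers the question. The OP wants a simple letter-by-letter match, not an LCS
Yes, but I thought the OP might be interested in more advanced string comparison, since the letter-by-letter comparison is covered in the other answers anyway.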
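
For the strings in the question, a count can be derived from the matching blocks, as in the following sketch (Python 3 syntax; the variable names are illustrative). SequenceMatcher aligns the longest common blocks rather than comparing position by position, so the result only coincides with a letter-by-letter count when the strings line up without shifts, as they do here.

```python
import difflib

a = 'IGADKYFHARGNYDAA'
c = 'KGADKYFHARGNYEAA'

matcher = difflib.SequenceMatcher(None, a, c)

# get_matching_blocks() returns (a, b, size) triples; the final one is a
# zero-size sentinel, so it adds nothing to the sum.
matched = sum(block.size for block in matcher.get_matching_blocks())
unmatched = max(len(a), len(c)) - matched

print("%d difference(s)." % unmatched)
# prints: 2 difference(s).
```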

【Solution 5】:

a = "IGADKYFHARGNYDAA"
b = "KGADKYFHARGNYEAAXXX"
match_pattern = zip(a, b)                                    # pairs of letters at each index
difference = sum(1 for e in match_pattern if e[0] != e[1])   # count pairs with non-matching elements
difference = difference + abs(len(a) - len(b))               # if the two strings differ in length, add the length difference

【Discussion】:

【Solution 6】:

I haven't seen anyone use the reduce function, so I'll include a piece of code I've been using:

reduce(lambda x, y: x + 1 if y[0] != y[1] else x, zip(source, target), 0)

This will give you the number of differing characters between source and target
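
In Python 3, reduce is no longer a builtin and has to be imported from functools. A minimal runnable sketch of the same one-liner, using the strings from the question as assumed inputs:

```python
from functools import reduce  # needed in Python 3, where reduce is not a builtin

source = 'IGADKYFHARGNYDAA'
target = 'KGADKYFHARGNYEAA'

# x is the running count; y is one (source_char, target_char) pair
n_diff = reduce(lambda x, y: x + 1 if y[0] != y[1] else x, zip(source, target), 0)

print(n_diff)
# prints: 2
```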

【Discussion】:

【Solution 7】:

Using difflib.ndiff you can do this in a one-liner that is still somewhat comprehensible:

>>> import difflib
>>> a = 'IGADKYFHARGNYDAA'
>>> c = 'KGADKYFHARGNYEAA'
>>> sum([i[0] != ' '  for i in difflib.ndiff(a, c)]) / 2
2

(sum works here because, well, kind of True == 1 and False == 0)

The following makes it clear what is going on and why the / 2 is needed:

>>> [i for i in difflib.ndiff(a,c)]
['- I',
 '+ K',
 '  G',
 '  A',
 '  D',
 '  K',
 '  Y',
 '  F',
 '  H',
 '  A',
 '  R',
 '  G',
 '  N',
 '  Y',
 '- D',
 '+ E',
 '  A',
 '  A']

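This also runs if the strings have different lengths, although an unpaired insertion or deletion then adds only one marked line, so halving undercounts it.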
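
The / 2 assumes that every change shows up as a '-'/'+' pair, and under Python 3 it returns a float (2.0). One possible way around both points is to count the two kinds of lines separately and take the larger count. The helper name count_ndiff and the choice of max() are assumptions made for this sketch, not part of the answer.

```python
import difflib


def count_ndiff(a, b):
    """Hypothetical helper: count changes reported by difflib.ndiff."""
    removed = added = 0
    for line in difflib.ndiff(a, b):
        if line.startswith('- '):
            removed += 1   # letter present only in a
        elif line.startswith('+ '):
            added += 1     # letter present only in b
    # A substitution appears as one '-' plus one '+', so take the larger
    # count instead of halving; a lone insertion then still counts as 1.
    return max(removed, added)


print(count_ndiff('IGADKYFHARGNYDAA', 'KGADKYFHARGNYEAA'))  # prints: 2
print(count_ndiff('apple', 'apples'))                       # prints: 1
```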

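【Discussion】:

【Solution 8】:

When looping through one string, create a counter object that identifies the letter you are on at each iteration. Then use this counter as an index to refer to the other sequence.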

a = 'IGADKYFHARGNYDAA'
b = 'KGADKYFHARGNYEAA'

counter = 0
differences = 0
for i in a:
    if i != b[counter]:
        differences += 1
    counter += 1

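Here, each time we come across a letter in sequence a that differs from the letter at the same position in sequence b, we add 1 to 'differences'. We then add 1 to the counter before moving on to the next letter.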
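
The manual counter can also be replaced by the built-in enumerate, which yields the index together with each letter. This is a minimal sketch of the same idea in Python 3 syntax. Like the loop above, it assumes b is at least as long as a; otherwise b[index] raises an IndexError.

```python
a = 'IGADKYFHARGNYDAA'
b = 'KGADKYFHARGNYEAA'

differences = 0
# enumerate() yields (index, letter) pairs, so no separate counter is needed
for index, letter in enumerate(a):
    if letter != b[index]:
        differences += 1

print(differences)
# prints: 2
```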

【Discussion】:

【Solution 9】:

I like the answer from Niklas R, but it has an issue (depending on your expectations). Using that answer with the following two test cases:

print compare('berry','peach')
print compare('berry','cherry')

We might reasonably expect cherry to be more similar to berry than peach is. Yet we get a smaller difference between berry and peach than between berry and cherry:

(' |   ', 4)  # berry, peach
('   |  ', 5) # berry, cherry

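This occurs when the strings are more similar backwards than forwards. To extend the answer from Niklas R, we can add a helper function that returns the minimum of the normal (forward) diff and the diff of the reversed strings: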

def fuzzy_compare(string1, string2):
    (fwd_result, fwd_diff) = compare(string1, string2)
    (rev_result, rev_diff) = compare(string1[::-1], string2[::-1])
    diff = min(fwd_diff, rev_diff)
    return diff

Using the following test cases again:

print fuzzy_compare('berry','peach')
print fuzzy_compare('berry','cherry')

...we get

4 # berry, peach
2 # berry, cherry

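As I said, this really just extends, rather than modifies, the answer from Niklas R.

If you are just looking for a simple diff function (taking the aforementioned gotcha into account), the following will do: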

def diff(a, b):
    delta = do_diff(a, b)
    delta_rev = do_diff(a[::-1], b[::-1])
    return min(delta, delta_rev)

def do_diff(a,b):
    delta = 0
    i = 0
    while i < len(a) and i < len(b):
        delta += a[i] != b[i]
        i += 1
    delta += len(a[i:]) + len(b[i:])
    return delta

Test cases:

print diff('berry','peach')
print diff('berry','cherry')

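A final consideration when dealing with words of different lengths is the diff function itself. There are two options:

Treat the difference in length as differing characters.
Ignore the difference in length and compare only over the shortest word.

For example:

apple and apples differ by 1 character when all characters are considered.
apple and apples differ by 0 characters when only the shortest word is considered.

When considering only the shortest word, we can use: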

def do_diff_shortest(a,b):
    delta, i = 0, 0
    if len(a) > len(b):
        a, b = b, a
    for i in range(len(a)):
        delta += a[i] != b[i]
    return delta

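...the number of iterations is dictated by the shortest word, and everything else is ignored. Or we can take the different lengths into account: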

def do_diff_both(a, b):
    delta, i = 0, 0
    while i < len(a) and i < len(b):
        delta += a[i] != b[i]
        i += 1
    delta += len(a[i:]) + len(b[i:])
    return delta

In this example, any remaining characters are counted and added to the diff value. Testing both functions

print do_diff_shortest('apple','apples')
print do_diff_both('apple','apples')

will output:

0 # Ignore extra characters belonging to longest word.
1 # Consider extra characters.

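【Discussion】:

【Solution 10】:

Here is my solution to a similar problem of comparing two strings, based on the solution provided here: https://stackoverflow.com/a/12226960/3542145.

Since itertools.izip did not work for me in Python3, I found a solution that simply uses the zip function instead: https://stackoverflow.com/a/32303142/3542145.

The function to compare the two strings: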

def compare(string1, string2, no_match_c=' ', match_c='|'):
    if len(string2) < len(string1):
        string1, string2 = string2, string1
    result = ''
    n_diff = 0
    for c1, c2 in zip(string1, string2):
        if c1 == c2:
            result += match_c
        else:
            result += no_match_c
            n_diff += 1
    delta = len(string2) - len(string1)
    result += delta * no_match_c
    n_diff += delta
    return (result, n_diff)

Set up the two strings to compare and call the function:

def main():
    string1 = 'AAUAAA'
    string2 = 'AAUCAA'
    result, n_diff = compare(string1, string2, no_match_c='_')
    print("%d difference(s)." % n_diff)
    print(string1)
    print(result)
    print(string2)

main()

Returns:

1 difference(s).
AAUAAA
|||_||
AAUCAA
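
If the same script has to run under both Python 2 and Python 3, a common idiom is to fall back to the built-in zip when itertools.izip is unavailable. The sketch below is an assumption about how that might be wired up, not part of the answer; it reuses the two strings from the example above.

```python
try:
    from itertools import izip  # Python 2: lazy version of zip
except ImportError:
    izip = zip  # Python 3: the built-in zip is already lazy

string1 = 'AAUAAA'
string2 = 'AAUCAA'

# Count the positions where the paired characters differ
n_diff = sum(1 for c1, c2 in izip(string1, string2) if c1 != c2)

print("%d difference(s)." % n_diff)
# prints: 1 difference(s).
```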

【Discussion】:

【Solution 11】:

Here is my solution. It compares 2 strings, no matter which one you put in a or b.

#Declare Variables
a='Here is my first string'
b='Here is my second string'
notTheSame=0
count=0

#Check which string is bigger and put the bigger string in C and smaller string in D
if len(a) >= len(b):
    c=a
    d=b
if len(b) > len(a):
    d=a
    c=b

#While the counter is less than the length of the longest string, compare each letter.
while count < len(c):
    if count == len(d):
        break
    if c[count] != d[count]:
        print(c[count] + " not equal to " + d[count])
        notTheSame = notTheSame + 1
    else:
        print(c[count] + " is equal to " + d[count])
    count=count+1

#the below output is a count of all the differences + the difference between the 2 strings
print("Number of Differences: " + str(len(c)-len(d)+notTheSame))

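【Discussion】:

【Solution 12】:

a = 'IGADKYFHARGNYDAA'  # example strings from the question
b = 'KGADKYFHARGNYEAA'
diff = 0
for i, j in zip(a, b):
    if i != j:
        diff += 1   # count each position where the letters differ
print(diff)

【Discussion】: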