Fast Way to Find Difference between Two Strings of Equal Length in Perl: Answers

【Question title】: Fast Way to Find Difference between Two Strings of Equal Length in Perl
【Posted】: 2011-06-10 05:17:27
【Question description】:

Given a pair of strings like this:

    my $s1 = "ACTGGA";
    my $s2 = "AGTG-A";

   # Note the string can be longer than this.

I want to find the positions and characters in $s1 that differ from $s2. In this case the answer would be:

#String Position 0-based
# First col = Base in S1
# Second col = Base in S2
# Third col = Position in S1 where they differ
C G 1
G - 4

I can achieve this easily with substr(), but it is horribly slow. Typically I need to compare millions of such pairs.

Is there a fast way to do it?

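【Comments on the question】:

Could you post your substr example along with a benchmark? We could then use it as the baseline for comparing our potential solutions. Also, these are not Unicode strings, right? (They look like genetic data...) Is the input always drawn from a narrow subset of characters (i.e. [ACTG-])?
TimToady's classic answer at perlmonks.org/?node_id=840593: $matches = ($first ^ $second) =~ tr/\0//;
@snoopy: That counts how many characters are the same, which is not what is wanted here.

Tags: linux perl string unix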

【Solution 1】:

Stringwise ^ is your friend:

use strict;
use warnings;
my $s1 = "ACTGGA";
my $s2 = "AGTG-A";

my $mask = $s1 ^ $s2;
while ($mask =~ /[^\0]/g) {
    print substr($s1,$-[0],1), ' ', substr($s2,$-[0],1), ' ', $-[0], "\n";
}

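Explanation:

The ^ (exclusive or) operator, when applied to strings, returns a string built by XORing the numeric values of the characters at each position, bit by bit. Breaking an example down into equivalent code: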

"AB" ^ "ab"
( "A" ^ "a" ) . ( "B" ^ "b" )
chr( ord("A") ^ ord("a") ) . chr( ord("B") ^ ord("b") )
chr( 65 ^ 97 ) . chr( 66 ^ 98 )
chr(32) . chr(32)
" " . " "
"  "

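The useful property here is that a null character ("\0") occurs if and only if the two strings have the same character at a given position. So ^ can compare every character of the two strings in one fast operation, and the result can then be searched for non-null characters (which indicate a difference). The search can be repeated by using the /g regex flag in scalar context, and the position of each differing character is found with $-[0], which gives the start offset of the last successful match.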
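
To make the null-byte property concrete, here is a small sketch that dumps the XOR mask as hex and wraps the loop from the answer in a function. The name diff_positions is a hypothetical helper introduced for illustration, not part of the original answer, and the sketch assumes plain byte strings of equal length.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original answer): returns one
# [char_in_s1, char_in_s2, zero_based_position] triple per mismatch.
sub diff_positions {
    my ($s1, $s2) = @_;
    my $mask = $s1 ^ $s2;          # "\0" wherever the two strings agree
    my @diffs;
    while ($mask =~ /[^\0]/g) {    # each match is one differing position
        my $pos = $-[0];           # start offset of the current match
        push @diffs, [ substr($s1, $pos, 1), substr($s2, $pos, 1), $pos ];
    }
    return @diffs;
}

# The mask for the strings in the question, one hex byte pair per position:
# 00 means "same character", anything else means "different".
print unpack('H*', "ACTGGA" ^ "AGTG-A"), "\n";    # 000400006a00

print join(' ', @$_), "\n" for diff_positions("ACTGGA", "AGTG-A");
# C G 1
# G - 4
```

The non-zero bytes sit at offsets 1 and 4, which are exactly the positions the question asks for.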

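【Discussion】:

Very neat example of using @-, by the way.
It would be great if you could explain what is going on here.
Thanks for the suggested edit adding an explanation, @carandraug; I did it slightly differently.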

【Solution 2】:

Use binary bit operations on the complete strings.

Things like $s1 & $s2 or $s1 ^ $s2 run incredibly fast and work with strings of arbitrary length.
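
As a minimal sketch of the whole-string approach, the XOR result can be fed to tr to count mismatches without any explicit loop. The function name count_diffs is a hypothetical helper, and the sketch assumes byte strings of equal length. One caveat: under `use v5.28` or later (or `use feature 'bitwise'`), `^` treats its operands as numbers and the string form is spelled `^.`, so this sketch deliberately does not enable that feature.

```perl
use strict;
use warnings;

# Hypothetical helper: number of positions at which two strings differ.
sub count_diffs {
    my ($s1, $s2) = @_;
    my $xor = $s1 ^ $s2;          # "\0" at every position where they agree
    return ($xor =~ tr/\0//c);    # /c counts every byte that is NOT "\0"
}

my $n = count_diffs("ACTGGA", "AGTG-A");
print "different=$n\n";    # different=2
```

Dropping the /c flag gives the number of matching positions instead, which is the TimToady one-liner quoted in the comments on the question.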

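【Discussion】:

【Solution 3】:

I was bored over the 2012 Thanksgiving break, so I answered the question and then some. It works on strings of equal length, and it also works if they are not. I added help and option handling just for fun. I thought someone might find it useful. If you are new to Perl: do not add any code below the __DATA__ line of the script. Have fun.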

./diftxt -h

    usage: diftxt [-v ] string1 string2
                   -v = Verbose 
                  diftxt [-V|--version]
                  diftxt [-h|--help]  "This help!"
Examples:  diftxt test text
           diftxt "This is a test" "this is real"

    Place Holders:  space = "·" , no character = "ζ"

cat ./diftxt ----------- cut ✂ ------------

#!/usr/bin/perl -w

use strict;
use warnings;
use Getopt::Std;
my %options=();
getopts("Vhv", \%options);
my $helptxt='
        usage: diftxt [-v ] string1 string2
                       -v = Verbose 
                      diftxt [-V|--version]
                      diftxt [-h|--help]  "This help!"
    Examples:  diftxt test text
               diftxt "This is a test" "this is real"

        Place Holders:  space = "·" , no character = "ζ"';
my $Version = "initial-release 1.0 - Quincey Craig 11/21/2012";

print "$helptxt\n\n" if defined $options{h};
print "$Version\n" if defined $options{V};
if (@ARGV == 0 ) {
 if (not defined $options{h}) {usage()};
 exit;
}

my $s1 = "$ARGV[0]";
my $s2 = "$ARGV[1]";
my $mask = $s1 ^ $s2;

#  setup unicode output to STDOUT
binmode DATA, ":utf8";
my $ustring = <DATA>;
binmode STDOUT, ":utf8";

my $_DIFF = '';
my $_CHAR1 = '';
my $_CHAR2 = '';

sub usage
{
        print "\n";
        print "usage: diftxt [-v ] string1 string2\n";
        print "               -v = Verbose \n";
        print "       diftxt [-V|--version]\n";
        print "       diftxt [-h|--help]\n\n";
        exit;
}

sub main
{
 print "\nOrig\tDiff\tPos\n----\t----\t----\n" if defined $options{v};
 while ($mask =~ /[^\0]/g) {
### Positions past the end of a string (unequal lengths) get the "no character"
### placeholder; checking the length avoids substr warnings without silencing STDERR
    my $pos = $-[0];
    $_CHAR1 = $pos < length($s1) ? substr($s1,$pos,1) : "\x{03B6}";
    $_CHAR2 = $pos < length($s2) ? substr($s2,$pos,1) : "\x{03B6}";
### Show spaces as a middle dot
    $_CHAR1 = "\x{00B7}" if $_CHAR1 eq " ";
    $_CHAR2 = "\x{00B7}" if $_CHAR2 eq " ";
### Print verbose Data  
   print $_CHAR1, "\t", $_CHAR2, "\t", $+[0], "\n" if defined $options{v};
### Build difference list 
   $_DIFF = "$_DIFF$_CHAR2";
### Build mask 
   substr($s1,"$-[0]",1) = "\x{00B7}";
 } ### end loop

 print "\n" if defined $options{v};
 print "$_DIFF, ";
 print "Mask: \"$s1\"\n";
} ### end main
if ($#ARGV == 1) {main()};
__DATA__

【Discussion】:

【Solution 4】:

This is the simplest form you can get:

my $s1 = "ACTGGA";
my $s2 = "AGTG-A";

my @s1 = split //,$s1;
my @s2 = split //,$s2;

my $i = 0;
foreach  (@s1) {
    if ($_ ne $s2[$i]) {
        print "$_, $s2[$i] $i\n";
    }
    $i++;
}
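
Since the question is about speed and a commenter asked for a benchmark, here is a rough sketch that compares the split-based loop with the XOR approach using the core Benchmark module. The function names diff_split and diff_xor and the synthetic 1000-character input are assumptions made for illustration; actual timings depend on the Perl build and on the data, so no numbers are claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Split both strings into characters and compare element by element.
sub diff_split {
    my ($s1, $s2) = @_;
    my @a = split //, $s1;
    my @b = split //, $s2;
    return grep { $a[$_] ne $b[$_] } 0 .. $#a;
}

# XOR the strings and collect the offsets of the non-null bytes.
sub diff_xor {
    my ($s1, $s2) = @_;
    my $mask = $s1 ^ $s2;
    my @pos;
    push @pos, $-[0] while $mask =~ /[^\0]/g;
    return @pos;
}

# Synthetic test data (an assumption): 1000 bases with three mismatches.
my $s1 = "ACTG" x 250;
my $s2 = $s1;
substr($s2, $_, 1, '-') for 10, 500, 990;

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    split => sub { my @d = diff_split($s1, $s2) },
    xor   => sub { my @d = diff_xor($s1, $s2) },
});
```

Both functions return the zero-based positions of the mismatches, so they can be checked against each other before trusting the timings.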

【Discussion】: