【Question title】: Finding length of common prefix in two strings
【Posted】: 2017-08-03 14:06:53
【Question description】:

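For every line in a file (about 30000 lines), I want to find how many characters at the start of the current line are the same as in the previous line. For example, for the input: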

#to
#top
/0linyier
/10000001659/item/1097859586891251/
/10000001659/item/1191085827568626/
/10000121381/item/890759920974460/
/10000154478/item/1118425481552267/
/10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
/1175332/item/10150825241495757/
/806123/item/10210653847881125/
/51927642128/item/488930816844251927642128/341878905879428/

I would like to get:

0   #to
3   #top
0   /0linyier
1   /10000001659/item/1097859586891251/
19  /10000001659/item/1191085827568626/
6   /10000121381/item/890759920974460/
7   /10000154478/item/1118425481552267/
3   /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2   /1175332/item/10150825241495757/
1   /806123/item/10210653847881125/
1   /51927642128/item/488930816844251927642128/341878905879428/

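I am trying to do this in perl by unpacking the strings into characters and counting until the first mismatch, but I wonder if there is some not-too-slow method using awk or perl's built-in functions.

Update: I have added my attempt as an answer.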

Tags: perl awk command-line

【Solution 1】:

Maybe something like this?

It is written in Perl.

use strict;
use warnings 'all';

my $prev = "";

while ( my $line = <DATA> ) {

    chomp $line;

    my $max = 0;
    ++$max until $max > length($line) or substr($prev, 0, $max) ne substr($line, 0, $max);

    printf "%-2d  %s\n", $max-1, $line;

    $prev = $line;
}

__DATA__
#to
#top
/0linyier
/10000001659/item/1097859586891251/
/10000001659/item/1191085827568626/
/10000121381/item/890759920974460/
/10000154478/item/1118425481552267/
/10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
/1175332/item/10150825241495757/
/806123/item/10210653847881125/
/51927642128/item/488930816844251927642128/341878905879428/

Output

0   #to
3   #top
0   /0linyier
1   /10000001659/item/1097859586891251/
19  /10000001659/item/1191085827568626/
6   /10000121381/item/890759920974460/
7   /10000154478/item/1118425481552267/
3   /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2   /1175332/item/10150825241495757/
1   /806123/item/10210653847881125/
1   /51927642128/item/488930816844251927642128/341878905879428/


【Solution 2】:

Using gawk

awk -v FS="" 'NR>1{
    pl=0; 
    split(p,a,""); 
    for(i=1;i in a; i++)
          if(a[i]==$i){ pl++ }else { break }
}
{ 
   print pl+0,$0; p=$0
}' file

or

awk -v FS="" 'NR>1{
     pl=0; 
     for(i=1;i<=NF; i++)
     if(substr(p,i,1)==$i){ pl++ }else { break }
}
{ 
   print pl+0,$0; p=$0
}' file

Input

$ cat file
#to
#top
/0linyier
/10000001659/item/1097859586891251/
/10000001659/item/1191085827568626/
/10000121381/item/890759920974460/
/10000154478/item/1118425481552267/
/10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
/1175332/item/10150825241495757/
/806123/item/10210653847881125/
/51927642128/item/488930816844251927642128/341878905879428/

Output

$ awk -v FS="" 'NR>1{pl=0; split(p,a,""); for(i=1;i in a; i++)if(a[i]==$i){ pl++ }else { break }}{ print pl+0,$0; p=$0}' file
0 #to
3 #top
0 /0linyier
1 /10000001659/item/1097859586891251/
19 /10000001659/item/1191085827568626/
6 /10000121381/item/890759920974460/
7 /10000154478/item/1118425481552267/
3 /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2 /1175332/item/10150825241495757/
1 /806123/item/10210653847881125/
1 /51927642128/item/488930816844251927642128/341878905879428/

Explanation

awk -v FS="" '                                  # call awk with field separator set to ""
       NR>1{                                    # skip the first record: nothing to compare it with
           pl=0;                                # reset variable pl
           split(p,a,"");                       # split previous record p into characters
           for(i=1;i in a; i++)                 # loop through array
                 if(a[i]==$i){                  # compare array element with current field
                     pl++                       # if matched then increment pl
                 }else { 
                     break                      # otherwise stop at the first mismatch
                 }
        }
        { 
            print pl+0,$0;                      # print count and current record
            p=$0                                # store current record in variable p
        }
     ' file

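Note that the POSIX standard says the result is unspecified if the empty string is assigned to FS. Some versions of awk produce the output shown in the example above. The version of awk on OS X issues a warning and prints:

awk: field separator FS is empty

So the special meaning of setting FS to the empty string does not work in every awk.

【Discussion】:

Correct. The behaviour produced by setting FS to the empty string is undefined by POSIX, so any awk can do whatever it likes with it and still be POSIX compliant. GNU awk (and some others?) chose to split the record into characters with that setting, since that is potentially useful.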
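
Since the behaviour is unspecified, it may be worth probing the local awk before relying on an empty FS. The following is only a minimal sketch of such a check: the function name empty_fs_splits_chars is made up for illustration, and it assumes a POSIX shell with awk on the PATH.

```shell
# Report whether this system's awk splits a record into single characters
# when FS is the empty string (behaviour that POSIX leaves unspecified).
empty_fs_splits_chars() {
    # "abc" yields NF == 3 only if every character became its own field.
    # Warnings are discarded; a failing awk simply leaves n empty.
    n=$(printf 'abc\n' | awk -v FS= '{ print NF }' 2>/dev/null) || n=
    if [ "$n" = "3" ]; then
        echo yes
    else
        echo no
    fi
}

empty_fs_splits_chars
```

If this prints "no", a substr()-based loop over the two lines avoids the dependency on FS altogether.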

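【Solution 3】:

There is no built-in function that does this for you, but you could compare half of each string at a time in a kind of binary search, instead of going 1 character at a time, e.g. (rough awk sketch):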

prev     = curr
lgthPrev = lgthCurr
curr     = $0
lgthCurr = length(curr)
lo = 0                                              # a prefix of this length is known to match
hi = (lgthPrev > lgthCurr ? lgthCurr : lgthPrev)    # the match cannot be longer than this
while ( lo < hi ) {
    partLgth = int((lo + hi + 1) / 2)               # midpoint, rounded up
    partCurr = substr(curr,1,partLgth)
    partPrev = substr(prev,1,partLgth)
    if ( partCurr == partPrev ) {
        # matched: search the upper half of the remaining range
        lo = partLgth
    }
    else {
        # no match: search the lower half of the remaining range
        hi = partLgth - 1
    }
}
# lo is now the length of the common prefix

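You exit the loop above when there are no more substrings left to compare, at which point either:

the 2 substrings matched on some iteration, so the longest length that matched is the length of the common prefix, or
the 2 substrings never matched, so the 2 strings have no common prefix.

This may use fewer iterations than a character-by-character comparison but, as written, it does a string comparison rather than a character comparison on each iteration, so I don't know what the end result for performance would be. You might speed it up by doing a character comparison first on each iteration, and only doing the string comparison if the characters at the current position match: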

prev     = curr
lgthPrev = lgthCurr
curr     = $0
lgthCurr = length(curr)
lo = 0                                              # a prefix of this length is known to match
hi = (lgthPrev > lgthCurr ? lgthCurr : lgthPrev)    # the match cannot be longer than this
while ( lo < hi ) {
    partLgth = int((lo + hi + 1) / 2)               # midpoint, rounded up
    if ( substr(curr,partLgth,1) == substr(prev,partLgth,1) ) {
        # last characters agree, so the full prefixes are worth comparing
        isMatch = (substr(curr,1,partLgth) == substr(prev,1,partLgth) ? 1 : 0)
    }
    else {
        isMatch = 0
    }
    if ( isMatch ) {
        # matched: search the upper half of the remaining range
        lo = partLgth
    }
    else {
        # no match: search the lower half of the remaining range
        hi = partLgth - 1
    }
}
# lo is now the length of the common prefix
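
The loop above is only a fragment. As a rough illustration of how the binary-search idea could be wired into a complete command, here is one possible sketch. The names common_prefix_bsearch and cpl are invented for this example, and the search keeps explicit lower and upper bounds, which is one way (not necessarily the answer author's) to make sure the loop terminates.

```shell
# Print "<common prefix length> <line>" for each input line, using a binary
# search over prefix lengths instead of a character-by-character scan.
common_prefix_bsearch() {
    awk '
    # cpl(a, b): length of the longest common prefix of strings a and b
    function cpl(a, b,    lo, hi, mid) {
        lo = 0                                 # a prefix of this length is known to match
        hi = (length(a) < length(b) ? length(a) : length(b))   # upper bound
        while (lo < hi) {
            mid = int((lo + hi + 1) / 2)       # round up so the range always shrinks
            if (substr(a, 1, mid) == substr(b, 1, mid))
                lo = mid                       # the prefix of length mid matches
            else
                hi = mid - 1                   # there is a mismatch within 1..mid
        }
        return lo
    }
    { print cpl(prev, $0), $0; prev = $0 }
    ' "$@"
}

printf '#to\n#top\n/0linyier\n' | common_prefix_bsearch
```

The function reads standard input, or the files named as arguments. Whether it actually beats the simple loop on lines this short would have to be measured.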

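【Discussion】:

You seem to be optimising something that is probably already fast enough.
The OP says in the question that he has a solution that works 1 character at a time, and asks for a faster approach because that is too slow (I am trying to work in perl by unpacking the strings into characters and counting till first mismatch but I wonder if there is some not too slow method), so I'm not sure where you're coming from with that statement.
Ah, I see. I didn't read that as meaning they already had a solution. You may be right. The slow part of processing individual characters is the split, which has to create an array and a number of scalar variables.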

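【Solution 4】:

A perl script:

#!/usr/bin/perl -ln
$p ||= [];            # previous record; empty before the first line
$c = [ unpack "C*" ]; # current record as a list of byte values
$i = 0;
$i++ while $i < @$p && $i < @$c && $p->[$i] == $c->[$i]; # count till mismatch or end of a line
print "$i $_";
$p = $c               # save current record for next time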

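The same thing without command-line flags:

#!/usr/bin/perl
$p = [];              # previous record; empty before the first line
while (<>) {
    chomp;
    $c = [ unpack "C*" ];
    $i = 0;
    # the bounds checks stop the loop when two lines are identical
    $i++ while $i < @$p && $i < @$c && $p->[$i] == $c->[$i];
    print "$i $_\n";
    $p = $c
}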

The same as a one-liner:

perl -lne '$p ||= []; $c=[unpack "C*"]; $i=0; $i++ while $i < @$p && $i < @$c && $p->[$i] == $c->[$i]; print "$i $_"; $p = $c'

Pass the file containing the lines as an argument, or pipe the data into the command.

On my real data it runs about as fast as Borodin's solution:

$ xzcat href.xz |wc -l
33150
$ time xzcat href.xz | ./borodin.pl >borodin.out

real    0m2.437s
user    0m2.684s
sys     0m0.080s
$ time xzcat href.xz | ./pk.pl > pk.out 

real    0m2.305s
user    0m2.564s
sys     0m0.088s
$ diff pk.out borodin.out


【Solution 5】:

In awk:

$ awk -F '' '{n=split(p,a,"");for(i=1;i<=(NF<n?NF:n)&&a[i]==$i;i++);print --i,$0; p=$0}' file
0 #to
3 #top
0 /0linyier
1 /10000001659/item/1097859586891251/
19 /10000001659/item/1191085827568626/
6 /10000121381/item/890759920974460/
7 /10000154478/item/1118425481552267/
3 /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2 /1175332/item/10150825241495757/
1 /806123/item/10210653847881125/
1 /51927642128/item/488930816844251927642128/341878905879428/

Explanation:

awk -F '' '{                                # each char on its own field
    n=split(p,a,"")                         # split prev record p each char in own a cell
    for(i=1;i<=(NF<n?NF:n)&&a[i]==$i;i++);  # compare while $i == a[i]
    print --i,$0                            # print comparison count (--fix)
    p=$0                                    # store record to p(revious)
}' file

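【Discussion】:

OK, so my result turned out similar to @AkshayHegde's solution (++ for the excellent taste, I didn't peek :) but it has a few differences, so I dare to leave it here anyway. The comments on his solution about FS apply to this solution too.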

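【Solution 6】:

You can do this directly with gawk. Here it simply compares the current line with the previous line and counts the number of common leading characters: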

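BEGIN{
    prev="";                    # no previous line yet
}
{
    curr=$0;                    # whole line, so lines containing spaces are not truncated
    n = length(curr);
    m = length(prev);
    s = n<m?n:m;                # only the overlapping part can match
    cnt = 0;
    for(i = 1;i <= s;i++){
        if(substr(curr, i, 1) == substr(prev, i, 1)){
            cnt++;
        }else{
            break;              # stop at the first mismatch
        }
    }
    print(cnt, curr);

    prev=curr;
}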
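
The answer shows only the program body. As a hedged sketch of one way to run it, the same logic can be passed to awk inline from a shell function. The name prefix_lengths is made up here, and the sketch compares the whole line ($0), which is an assumption about the intended behaviour when lines contain spaces.

```shell
# Print "<common prefix length> <line>" for each input line by comparing it
# with the previous line one character at a time. No empty FS is needed.
prefix_lengths() {
    awk '
    {
        curr = $0                          # whole line, including any spaces
        n = length(curr); m = length(prev)
        s = (n < m ? n : m)                # only the overlapping part can match
        cnt = 0
        for (i = 1; i <= s; i++) {
            if (substr(curr, i, 1) != substr(prev, i, 1))
                break                      # stop at the first mismatch
            cnt++
        }
        print cnt, curr
        prev = curr
    }
    ' "$@"
}

# Edge cases: spaces inside lines, and two identical lines in a row
printf 'a b c\na b d\na b d\nab\n' | prefix_lengths
```

On that input the counts should come out as 0, 4, 5 and 1: the repeated line matches over its full length, and the loop stops there because it never runs past the shorter of the two lines.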
