How to trim a file - remove rows that have the same value in every column except the first two - Answers

【Question title】: How to trim a file - remove the rows that have the same value in every column except the first two
【Posted】: 2011-06-16 22:15:56
【Question】:

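I would like your help trimming a file by removing every row in which all columns, other than the first two, hold the same value.

The file I have (tab-delimited, with millions of rows and dozens of columns):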

Jack Mike Jones Dan Was
1 2 7 3 4
2 3 9 4 8
T T C T T
T M T T T
W A S I S

The file I want (rows in which every cell after the first two holds the same value are removed):

Jack Mike Jones Dan Was
1 2 7 3 4
2 3 9 4 8
T T C T T
W A S I S

Could you give me some hints on this problem? Thanks a lot.

I have seen several excellent awk, shell, and perl scripts in a related question. Many thanks to those who helped.

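【Comments】:

Please post the code you have so far, what you have tried, and so on. Build on your last question.
You are posting questions really quickly; I wonder whether you have had time to incorporate the answers you received into your program before posting a new one.
Compare the first two columns and discard the row if they are the same. Then check the remaining data rows for duplicates. (This assumes the question contains a mistake, as FMc pointed out.)
@aartist: My understanding is: for each line { read the line, ignore the first 2 columns, and if the remaining columns all hold one and the same value: discard the line. }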
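
The rule stated in the last comment above can be sketched in Python as follows. This is only an illustration of the rule, not code from any of the answers; the function name `keep_row`, the `skip` parameter, and the in-memory `sample` list are assumptions made for the example, and fields are assumed to be separated by single tabs as the question describes.

```python
def keep_row(line, skip=2):
    """Return True if the columns after the first `skip` are not all identical."""
    fields = line.rstrip("\n").split("\t")
    rest = fields[skip:]
    # Keep the row as soon as any remaining column differs from the first one.
    # With fewer than two remaining columns nothing can differ, so the row is dropped.
    return any(value != rest[0] for value in rest[1:])


# The sample data from the question, written with explicit tabs.
sample = [
    "Jack\tMike\tJones\tDan\tWas",
    "1\t2\t7\t3\t4",
    "2\t3\t9\t4\t8",
    "T\tT\tC\tT\tT",
    "T\tM\tT\tT\tT",
    "W\tA\tS\tI\tS",
]
kept = [line for line in sample if keep_row(line)]
```

For a real file with millions of rows, the same function could be applied line by line while iterating over `sys.stdin`, so that the whole file never has to be held in memory.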

Tags: linux perl shell awk

【Solution 1】:

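awk '{
    val = $3                      # value of the third column
    for (i = 4; i <= NF; i++)     # compare every later column against it
        if (val != $i) {
            print                 # found a differing value: keep the row
            break
        }
}'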

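【Comments】:

+1: readable, and in theory just as efficient as the regex approach.
I am not very familiar with perl, so I have chosen to accept this answer. Thanks a lot, Glenn and everyone else. I have learned a lot from you.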

【Solution 2】:

The simplest thing I can think of (half-jokingly :)

#!/usr/bin/perl
while (<>)
{
    my (undef, undef, @flds) = split;
    print if 1<scalar keys %{{ map { $_ => 1 } @flds }}
}

Explanation

It uses a temporary hash table to find the unique column values in each row. Here it is, spelled out:

while (<>)   # for each line
{
    # split the line into columns, discarding the first two
    my (undef, undef, @flds) = split; 

    my %columns   = map { $_ => 1 } @flds; # insert the value as key into a hashtable
    my @uniq_cols = keys %columns;         # get just the keys
    my $uniq_count= scalar @uniq_cols;     # count the keys

    print if 1<$uniq_count                 # if count == 1, all columns are the same
}

To be more explicit, the 'map' call is roughly equivalent to the usual idiom:

    # my %columns   = map { $_ => 1 } @flds;
    my %columns;

    foreach my $fld (@flds)
    {
         $columns{$fld}++; # actually the map version does '$columns{$fld} = 1;' every time
    }
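
For comparison, the same "count the distinct values" idea can be sketched in Python, where a `set` plays the role of the temporary hash. This is a rough equivalent rather than a translation of the answer's code; the name `distinct_count` and the sample `rows` are assumptions for the example, and `str.split()` with no arguments is used because, like Perl's bare `split`, it splits on runs of whitespace.

```python
def distinct_count(line, skip=2):
    """Count the distinct values in the columns after the first `skip`."""
    fields = line.split()            # whitespace split, like Perl's bare split
    return len(set(fields[skip:]))   # a set keeps one copy of each value


rows = ["T T C T T", "T M T T T", "W A S I S"]
counts = [distinct_count(r) for r in rows]
# Keep a row only when more than one distinct value remains.
kept = [r for r in rows if distinct_count(r) > 1]
```

Unlike the loop in Solution 1, this version always looks at every column of a row, since it cannot stop early at the first differing value.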

HTH

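【Comments】:

Thanks, sehe. I really appreciate your help; it is a very nice solution. This makes it clearer to me.
@Rahul Dravid: I have just added an annotated version with more temporary variables, for explanation.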

【Solution 3】:

Try this: perl -ne 'next if /^\w+\W+\w+\W+(\w+)(\W+\1)+\W*$/; print;'

That is, it matches:

^        beginning of line
\w+      first word
\W+      non-word (like spaces, tabs, etc)
\w+\W+   second word and spaces
(\w+)    third word (and remember)
(\W+\1)+ spaces followed by a copy of the third word as many times as necessary
\W*      optional trailing spaces
$        end of line
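
To see how the pattern broken down above behaves, here is a small sketch that runs the same regular expression through Python's `re` module, which also supports the `\1` backreference. The constant name `SAME_TAIL` and the sample rows are assumptions made for this example; they do not come from the answer.

```python
import re

# Same pattern as in the perl one-liner above.
SAME_TAIL = re.compile(r"^\w+\W+\w+\W+(\w+)(\W+\1)+\W*$")

rows = [
    "1 2 7 3 4",     # columns 3..5 differ: no match
    "T T C T T",     # third column differs from the rest: no match
    "T M T T T",     # columns 3..5 are all "T": match
    "T M T T TT",    # "TT" is not a copy of "T": no match
    "a b 3.5 3.5",   # "." is a non-word character, so "3.5" is not one \w+ word
]
dropped = [r for r in rows if SAME_TAIL.search(r)]
```

The last sample row suggests one thing to watch for: because the pattern is built from `\w` and `\W`, a value containing punctuation is not treated as a single word, so such a row is not recognized as having identical columns.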

【Comments】:

urff... I still think my version is clearer. And it was obfuscated to begin with :)