perl: convert lines to columns when values match (answers)

【Question title】: perl convert lines to columns when values match
【Posted】: 2016-08-09 09:44:27
【Question description】:

I have a tab-delimited text file with the following structure:

col1    col2    col3    col4    col5    col6
T1      a       b       c       d       x
T5      a       b       c       d       x 
T3      a       b       c       e       y
T50     e       f       g       h       y
T2      e       f       g       h       y
T60     a       b       c       d       y

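* If two or more lines have the same values in col2, col3, col4 and col5, I want the col6 value of every matching line placed in an extra column headed by that line's col1 value. In this case lines 1 and 2 should be considered the same, because their values in col2,3,4,5 are identical (abcd). The same goes for lines 4 and 5 (efgh).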

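* If the values in col2, col3, col4 and col5 are unique, only the col6 value of that one line should be placed in the new column. In this case line 3 is unique, because no other line has abce as the values of col2,3,4,5.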

So the output should look like this:

col1    col2    col3    col4    col5    col6    T1    T2    T3    T5    T50  T60
T1      a       b       c       d       x       x     -     -     x     -    y
T5      a       b       c       d       x       x     -     -     x     -    y
T3      a       b       c       e       y       -     -     y     -     -    -
T50     e       f       g       h       y       -     y     -     -     y    -
T2      e       f       g       h       y       -     y     -     -     y    -
T60     a       b       c       d       y       x     -     -     x     -    y

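I would like to do this in perl, but I don't know how. Should I use a hash to store the values of col2-col3-col4-col5? Below is just the start of the script, which opens the input and output files.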

#! /usr/bin/perl
use strict;
use warnings;

my $input  = "input.txt";     # input file
my $output = "output.txt";    # output file

open(my $table1, '<', $input)  or die "Can't read $input: $!";
open(my $table2, '>', $output) or die "Can't write $output: $!";

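【Comments】:

I don't understand the desired output. Do you mean unique across the whole file? How big is the file?
The file contains about 400000 lines and about 60 distinct values in col1
What if two lines have col2 .. col5 = a b c d, but col6 is x on one line and y on the other?

Tags: perl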

【Solution 1】:

This solution makes two passes over the file. It has some special handling for __DATA__, which you can remove if you work with a real file.

use strict;
use warnings 'all';
use feature 'say';
use List::MoreUtils 'uniq';
use Fcntl 'SEEK_SET';

# grab the header and split up the headings
chomp( my $header = <DATA> );
my @fields = split /\s+/, $header;

# we need this because __DATA__ is not a regular filehandle
my $data_start = tell DATA;

# this name is bad, because we don't know what this column really contains
my %col6;

# first pass
while ( my $row = <DATA> ) {
    chomp $row;
    my @cols = split /\s+/, $row;

    # in a hash with col2 to col5 as key, use col1 as key and col6 as value
    $col6{ join q{::}, @cols[ 1 .. 4 ] }->{ $cols[0] } = $cols[5];
}

# reset DATA to beginning, this is not needed if you work with a file
seek DATA, $data_start, SEEK_SET;

# get the additional headings from the col1 mapping
my @new_fields = sort +uniq map { keys %{ $col6{$_} } } keys %col6;

# output them
say join "\t", @fields, @new_fields;

# second pass
while ( my $row = <DATA> ) {
    chomp $row;
    my @cols = split /\s+/, $row;

    # go through all the new headings and either print the value of the col6, or a dash
    # (// instead of || so that a legitimate col6 value of 0 is not replaced by a dash)
    say join "\t", @cols, map { $col6{ join q{::}, @cols[ 1 .. 4 ] }->{$_} // q{-} } @new_fields;
}

__DATA__
col1    col2    col3    col4    col5    col6
T1      a       b       c       d       x
T5      a       b       c       d       x
T3      a       b       c       e       y
T50     e       f       g       h       y
T2      e       f       g       h       y

Output:

col1    col2    col3    col4    col5    col6    T1  T2  T3  T5  T50
T1  a   b   c   d   x   x   -   -   x   -
T5  a   b   c   d   x   x   -   -   x   -
T3  a   b   c   e   y   -   -   y   -   -
T50 e   f   g   h   y   -   y   -   -   y
T2  e   f   g   h   y   -   y   -   -   y
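
For a real file the tell/seek bookkeeping around __DATA__ is not needed. What follows is only a minimal sketch of the same two-pass idea under some assumptions: the helper name pivot_lines is hypothetical, the path is passed in by the caller, and fields are split on whitespace as in the answer above, so field values must not contain spaces. It returns the output lines instead of printing them so the caller decides where they go.

```perl
use strict;
use warnings;
use 5.010;    # for the // (defined-or) operator

# Hypothetical helper: returns the output lines for the file at $path.
sub pivot_lines {
    my ($path) = @_;

    # First pass: map "col2::col3::col4::col5" => { col1 => col6 }.
    open my $in, '<', $path or die "Cannot open $path: $!";
    chomp( my $header = <$in> );
    my %seen;
    while ( my $row = <$in> ) {
        chomp $row;
        my @cols = split /\s+/, $row;
        $seen{ join q{::}, @cols[ 1 .. 4 ] }{ $cols[0] } = $cols[5];
    }

    # Distinct col1 values become the new headings (hash keys are unique).
    my %ids = map { $_ => 1 } map { keys %$_ } values %seen;
    my @new_fields = sort keys %ids;

    # Second pass: rewind the same handle and skip the header line.
    seek $in, 0, 0 or die "Cannot rewind $path: $!";
    my $skipped_header = <$in>;
    my @out = ( join "\t", $header, @new_fields );
    while ( my $row = <$in> ) {
        chomp $row;
        my @cols  = split /\s+/, $row;
        my $group = $seen{ join q{::}, @cols[ 1 .. 4 ] };
        push @out, join "\t", @cols, map { $group->{$_} // '-' } @new_fields;
    }
    close $in;
    return @out;
}
```

A caller could then write the result with something like print {$table2} "$_\n" for pivot_lines($input); using the handles from the question. Note that the new headings are sorted as strings, so T50 comes before T6.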

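【Discussion】:

【Solution 2】:

You need to break the problem down:

Identify all unique combinations of col2 .. col5 and associate the corresponding col1 values with each of them. This screams for a hash: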

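my ( %unique, %label );
my $header = <$table1>;                     # skip the heading line
while ( <$table1> ) {
    my @record = split;                     # split $_ on whitespace
    my $id     = $record[0];                # col1
    my $label  = $record[5];                # col6
    my $result = join '~', @record[1..4];   # col2 .. col5 as one key
    push @{ $unique{$result} }, $id;
    $label{$id} = $label;
}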

Generate the output matrix according to your requirements
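
The answer leaves that second step open. One possible sketch, reusing the %unique and %label hashes from above, is shown here. The helper name build_matrix and passing the rows in as a list are assumptions made for illustration. It also assumes each col1 value occurs only once, as in the sample data; with about 60 distinct col1 values over 400000 lines that will not hold for the real file, where the labels would have to be stored per combination as well (as Solution 1 does).

```perl
use strict;
use warnings;

# Hypothetical helper: takes the header line and the data lines,
# returns the output lines with one extra column per col1 value.
sub build_matrix {
    my ( $header, @rows ) = @_;

    # Step 1: group col1 ids by their col2..col5 combination, remember col6 per id.
    my ( %unique, %label );
    for my $row (@rows) {
        my @record = split ' ', $row;
        my $id     = $record[0];
        push @{ $unique{ join '~', @record[ 1 .. 4 ] } }, $id;
        $label{$id} = $record[5];    # assumes one row per col1 value
    }

    # Step 2: for every row, print the label of each id that shares its group.
    my @ids = sort keys %label;
    my @out = ( join "\t", $header, @ids );
    for my $row (@rows) {
        my @record   = split ' ', $row;
        my %in_group = map { $_ => 1 } @{ $unique{ join '~', @record[ 1 .. 4 ] } };
        push @out, join "\t", @record, map { $in_group{$_} ? $label{$_} : '-' } @ids;
    }
    return @out;
}
```

Run on the six sample rows from the question, the T1 row gets x under T1 and T5 and y under T60, which matches the desired output shown there.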

【Discussion】: