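Delete columns from a space-delimited file where the file header matches

【Question title】: Delete columns from space delimited file where file header matches
【Posted】: 2012-07-19 10:21:25
【Question】:

I have a space-delimited input text file. I want to use sed or awk to delete the columns whose header is size.

Input file: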

id quantity colour shape size colour shape size colour shape size
1 10 blue square 10 red triangle 8 pink circle 3
2 12 yellow pentagon 3 orange rectangle 9 purple oval 6

Expected output:

id quantity colour shape colour shape colour shape
1 10 blue square red triangle pink circle
2 12 yellow pentagon orange rectangle purple oval

【Question comments】:

Do you know which positions the size columns are in?

Tags: unix sed awk

【Solution 1】:

The `awk` command

awk '
NR==1{
    for(i=1;i<=NF;i++)
        if($i!="size")
            cols[i]
}
{
    for(i=1;i<=NF;i++)
        if(i in cols)
            printf "%s ",$i
    printf "\n"
}' input > output

Pretty printing

column -t -s ' ' output

Result

id  quantity  colour  shape     colour  shape      colour  shape
1   10        blue    square    red     triangle   pink    circle
2   12        yellow  pentagon  orange  rectangle  purple  oval
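
The awk command above prints a space after every kept field, so each output line ends with a trailing space. A minimal sketch of a variant that avoids this and takes the header name as a parameter follows. The function name `drop_cols` and the temporary sample file are illustrative assumptions, not part of the original answer.

```shell
# Hypothetical helper: drop every column whose header equals the given name.
# Usage: drop_cols HEADER_NAME FILE
drop_cols() {
    awk -v name="$1" '
    NR == 1 {
        # Remember, in order, the positions of the columns to keep.
        for (i = 1; i <= NF; i++)
            if ($i != name) keep[++n] = i
    }
    {
        # Join the kept fields with OFS; no separator after the last one.
        line = ""
        for (k = 1; k <= n; k++)
            line = line (k > 1 ? OFS : "") $(keep[k])
        print line
    }' "$2"
}

# Sample data from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
id quantity colour shape size colour shape size colour shape size
1 10 blue square 10 red triangle 8 pink circle 3
2 12 yellow pentagon 3 orange rectangle 9 purple oval 6
EOF

result=$(drop_cols size "$input")
printf '%s\n' "$result"
rm -f "$input"
```

Storing the kept positions in an indexed array keeps the output order explicit and lets the separator be emitted only between fields.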

【Comments】:

Perfect. Exactly what I wanted.

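【Solution 2】:

A general solution using awk. The BEGIN block has a hard-coded variable (columns_to_delete) that lists the positions of the fields to delete. The script then computes the width of each field and deletes the fields whose position matches the variable. Note that it relies on the four-argument split() and on FIELDWIDTHS, both of which are GNU awk extensions.

Assuming infile has the content from the question and script.awk contains the following: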

BEGIN {
    ## Hard-coded positions of fields to delete. Separate them with spaces.
    columns_to_delete = "5 8 11"

    ## Save positions in an array to handle it better.
    split( columns_to_delete, arr_columns )
}


## Process header.
FNR == 1 { 

    ## Split header with a space followed by any non-space character.
    split( $0, h, /([[:space:]])([^[:space:]])/, seps )

    ## Use FIELDWIDTHS to handle fixed format of data. Set that variable with
    ## length of each field, taking into account spaces.
    for ( i = 1; i <= length( h ); i++ ) { 
        len = length( h[i] seps[i] )
        FIELDWIDTHS = FIELDWIDTHS " " (i == 1 ? --len : i == length( h ) ? ++len : len)
    }   

    ## Re-calculate fields with new FIELDWIDTHS variable.
    $0 = $0
}

## Process header too, and every line with data.
{
    ## Flag to know if 'p'rint to output a field.
    p = 1 

    ## Go through all fields; if one is found in the array of columns to
    ## delete, reset the 'print' flag.
    for ( i = 1; i <= NF; i++ ) { 
        for ( j = 1; j <= length( arr_columns ); j++ ) { 
            if ( i == arr_columns[j] ) { 
                p = 0 
                break
            }   
        }   

        ## Check 'print' flag and print if set.
        if ( p ) { 
            printf "%s", $i
        }
        else {
            printf " " 
        }
        p = 1 
    }   
    printf "\n"
}

Run it like this:

awk -f script.awk infile

With the following output:

id  quantity colour shape    colour shape      colour  shape    
1   10       blue   square   red    triangle   pink    circle   
2   12       yellow pentagon orange rectangle  purple   oval

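Edit: Oops, I just realized the output is incorrect because two fields get joined together. Fixing it would be too much work, since the maximum column width of every line would have to be checked before processing anything. Still, I hope this script gives you the idea. I don't have time right now; maybe I can try to fix it later, but I'm not sure.

Edit 2: Fixed by adding one extra space for each deleted field. It was easier than expected :-)

Edit 3: See the comments.

I have modified the BEGIN block to check whether an extra variable has been provided as an argument.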

BEGIN {
    ## Check if a variable 'delete_col' has been provided as argument.
    if ( ! delete_col ) { 
        printf "%s\n", "Usage: awk -v delete_col=\"column_name\" -f script.awk " ARGV[1]
        exit 0
    }   

}

And I added to the FNR == 1 rule the step that computes the positions of the columns to delete:

## Process header.
FNR == 1 { 

    ## Find column position to delete given the name provided as argument.
    for ( i = 1; i <= NF; i++ ) { 
        if ( $i == delete_col ) { 
            columns_to_delete = columns_to_delete " " i
        }   
    }   

    ## Save positions in an array to handle it better.
    split( columns_to_delete, arr_columns )

    ## ...
    ## No modifications from here until the end. Same code as in the original script.
    ## ...
}

Now run it:

awk -v delete_col="size" -f script.awk infile

The result is the same.
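
To check which positions a header name resolves to before deleting anything, the lookup step can be run on its own. This is a sketch using only POSIX awk features; the function name `header_positions` and the temporary sample file are assumptions made for illustration.

```shell
# Hypothetical helper: print the 1-based positions of the header fields
# that are exactly equal to the given name, separated by spaces.
# Usage: header_positions HEADER_NAME FILE
header_positions() {
    awk -v name="$1" '
    NR == 1 {
        out = ""
        for (i = 1; i <= NF; i++)
            if ($i == name) out = out (out == "" ? "" : " ") i
        print out
        exit    # only the header line is needed
    }' "$2"
}

# First two lines of the sample data from the question.
infile=$(mktemp)
printf '%s\n' \
    'id quantity colour shape size colour shape size colour shape size' \
    '1 10 blue square 10 red triangle 8 pink circle 3' > "$infile"

positions=$(header_positions size "$infile")
echo "$positions"    # prints: 5 8 11
rm -f "$infile"
```

For the sample header this yields the same list that was hard-coded in columns_to_delete in the first version of the script.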

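【Comments】:

Is there a way to do this without hard-coding the column numbers (using the column header name instead)?
@SantoshPillai: Which separator?
I mean the output has no delimiter/separator at all (no spaces between the columns).
@SantoshPillai: First, the sample file in your previous comment seems to differ from the one you pasted in the question. It has a different format, or, to be clear, no format at all, just single spaces. I think awk can do that job far more simply than this script. Second, I ran a test and the fields in the output file are also separated by spaces, so I don't understand what you mean. Give us a good (and short) example of an input file and show us a case where this script fails. That makes it easier for me and other users to help you.
I'm sorry. I had added extra spaces to the input data in my question to make it readable (I have now removed them). Your script successfully deletes the size columns, but it also deletes the spaces between the other consecutive data columns. I get the output idquantitycolourshape colourshape colourshape 110bluesquare redtriangle pinkcircle 212yellowpentagon orangerectangle purpleoval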

【Solution 3】:

Using cut:

$ cut -d' ' -f1-4,6,7,9,10 < in.txt   
id quantity colour shape colour shape colour shape
1 10 blue square red triangle pink circle
2 12 yellow pentagon orange rectangle purple oval

【Comments】:

I was looking for matching on the header. @kev's solution does the job.

【Solution 4】:

Given a fixed file format:

cut -d' ' -f 1-4,6-7,9-10 infile

【Comments】:

The file format is not fixed, which is why I want to delete the columns by matching the header.

【Solution 5】:

If you have GNU cut available, this can be done like so:

columns=$(head -n1 INPUT_FILE \
          | tr ' ' '\n'       \
          | cat -n            \
          | grep size         \
          | tr -s ' '         \
          | cut -f1           \
          | tr -d ' '         \
          | paste -sd ",")

cut --complement -d' ' -f$columns INPUT_FILE

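It builds a comma-separated list of column positions from the header, then cuts the complement of that list from INPUT_FILE, that is, it prints every column except the listed ones.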
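
Where `cut --complement` is not available, one possible alternative is to build the list of columns to keep rather than the list to drop, and pass that to plain `cut`. The sketch below assumes single-space delimiters and an exact match on the header name (unlike `grep size`, which would also match headers that merely contain "size"). The variable names and the temporary sample file are illustrative.

```shell
# Sample data from the question, written to a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
id quantity colour shape size colour shape size colour shape size
1 10 blue square 10 red triangle 8 pink circle 3
2 12 yellow pentagon 3 orange rectangle 9 purple oval 6
EOF

# Build a comma-separated list of the positions whose header is NOT "size".
keep=$(head -n 1 "$input" | awk -v name=size '{
    for (i = 1; i <= NF; i++)
        if ($i != name) list = list (list == "" ? "" : ",") i
    print list
}')

# Select only those columns; no --complement needed.
kept=$(cut -d ' ' -f "$keep" "$input")
printf '%s\n' "$kept"
rm -f "$input"
```

For the sample header, `keep` becomes 1,2,3,4,6,7,9,10, which is equivalent to the field list used in the fixed-position cut answers.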
