【Question title】: awk: print duplicate entries after a condition is met
【Posted】: 2021-05-12 10:36:59
【Question】:

I have a large file containing different types of entries, separated by tabs:

## HEADER 1
## HEADER 2
## HEADER 3
#Col1   Col2    Col3
1_222_A/G   value1  ISO
1_222_A/G   value1  ISO
1_222_A/G   value1  ISO
1_222_A/G   value1  CANON
1_506_C/T   value2  ISO
1_506_C/T   value2  CANON
1_245_A/T   value3  SINGLE
2_1156_C/G  value4  ISO
2_1156_C/G  value4  ISO
2_1221_A/T/C    value5  ISO
2_1221_A/T/C    value5  ISO
2_1221_A/T/C    value5  CANON
2_1221_A/T/C    value5  CANON
3_787_G/T   value6  ISO
3_99089_A/C value7  ISO
3_99089_A/C value7  ISO
3_99089_A/C value7  CANON
4_12_T/C    value8   SINGLE
4_167_A/G   value9  ISO
4_167_A/G   value9  CANON
4_167_A/G   value9  CANON

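I want to print everything, but change the $3 value to "CANON_DUPL" for entries that meet all of these conditions:

  1. The line does not start with #.
  2. The $3 value is "CANON".
  3. The $1 value is duplicated (that is, the same $1 occurs in more than one "CANON" row).

So the final table must be: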

## HEADER 1
## HEADER 2
## HEADER 3
#Col1   Col2    Col3
1_222_A/G   value1  ISO
1_222_A/G   value1  ISO
1_222_A/G   value1  ISO
1_222_A/G   value1  CANON
1_506_C/T   value2  ISO
1_506_C/T   value2  CANON
1_245_A/T   value3  SINGLE
2_1156_C/G  value4  ISO
2_1156_C/G  value4  ISO
2_1221_A/T/C    value5  ISO
2_1221_A/T/C    value5  ISO
2_1221_A/T/C    value5  CANON_DUPL
2_1221_A/T/C    value5  CANON_DUPL
3_787_G/T   value6  ISO
3_99089_A/C value7  ISO
3_99089_A/C value7  ISO
3_99089_A/C value7  CANON
4_12_T/C    value8  SINGLE
4_167_A/G   value9  ISO
4_167_A/G   value9  CANON_DUPL
4_167_A/G   value9  CANON_DUPL

I tried this with awk, but I only managed to satisfy the first two conditions:

> awk 'BEGIN {FS=OFS="\t"}; !/#/$3~"CANON"{$3="CANON_DUPL"} {print $0}' file.txt
## HEADER 1
## HEADER 2
## HEADER 3
#Col1   Col2    Col3
1_222_A/G   value1  ISO
1_222_A/G   value1  ISO
1_222_A/G   value1  ISO
1_222_A/G   value1  CANON_DUPL #should not be modified
1_506_C/T   value2  ISO
1_506_C/T   value2  CANON_DUPL #should not be modified
1_245_A/T   value3  SINGLE
2_1156_C/G  value4  ISO
2_1156_C/G  value4  ISO
2_1221_A/T/C    value5  ISO
2_1221_A/T/C    value5  ISO
2_1221_A/T/C    value5  CANON_DUPL
2_1221_A/T/C    value5  CANON_DUPL
3_787_G/T   value6  ISO
3_99089_A/C value7  ISO
3_99089_A/C value7  ISO
3_99089_A/C value7  CANON_DUPL #should not be modified
4_12_T/C    value8  SINGLE
4_167_A/G   value9  ISO
4_167_A/G   value9  CANON_DUPL
4_167_A/G   value9  CANON_DUPL

I don't know whether a solution in awk would be the easiest to implement.
Any ideas?

Note: edited to better reflect the file structure.

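【Discussion】:

    Tags: awk duplicates


    【Solution 1】:

    For your shown samples, could you please try the following. It reads Input_file twice and keeps a counter for every ($1,$3) pair in memory, so memory use grows with the number of distinct pairs in large datasets. If your actual Input_file is tab-delimited, change awk ' to awk 'BEGIN{FS=OFS="\t"} in the code below.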

    awk '
    (FNR==1 || FNR==2 || FNR==3 ){
      if(++count<=3){ print }
      next
    }
    FNR==NR{
      arr[$1,$3]++
      next
    }
    arr[$1,$3]>1 && $0!~/^#/ && $3=="CANON"{
      $3="CANON_DUPL"
    }
    1
    '  Input_file  Input_file
    

    Explanation: a detailed walkthrough of the code above.

    awk '                                     ##Start of the awk program.
    (FNR==1 || FNR==2 || FNR==3 ){            ##True for lines 1, 2 and 3 of each pass over the file.
      if(++count<=3){ print }                 ##Print only while count is at most 3, i.e. during the first pass.
      next                                    ##Skip the remaining rules for this line.
    }
    FNR==NR{                                  ##True only while Input_file is being read for the first time.
      arr[$1,$3]++                            ##Count how many times each ($1,$3) pair occurs.
      next                                    ##Skip the remaining rules for this line.
    }
    arr[$1,$3]>1 && $0!~/^#/ && $3=="CANON"{  ##If the ($1,$3) pair occurs more than once AND the line does not start with # AND the 3rd field is CANON, then:
      $3="CANON_DUPL"                         ##Set the 3rd field to CANON_DUPL.
    }
    1                                         ##Print the current line.
    ' Input_file  Input_file                  ##Input_file is passed twice so it is read in two passes.
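
A more compact variant of the same two-pass idea can be sketched as follows. It is not the answerer's code: it assumes tab-separated fields and that every header line starts with #, so no fixed header count is needed. The function name mark_canon_dupl and the array name seen are illustrative only.

```shell
# Hypothetical helper: two-pass marking of duplicated CANON rows.
# Assumes tab-separated input and header lines that start with a hash sign.
mark_canon_dupl() {
  awk '
    BEGIN { FS = OFS = "\t" }
    /^#/ { if (FNR != NR) print; next }                      # headers: skip in pass 1, print in pass 2
    FNR == NR { seen[$1, $3]++; next }                       # pass 1: count each ($1,$3) pair
    $3 == "CANON" && seen[$1, $3] > 1 { $3 = "CANON_DUPL" }  # pass 2: relabel repeated CANON rows
    { print }
  ' "$1" "$1"
}

# Usage on a small made-up sample.
sample=$(mktemp)
printf '## H\n#Col1\tCol2\tCol3\na\tv1\tISO\na\tv1\tCANON\nb\tv2\tCANON\nb\tv2\tCANON\n' > "$sample"
mark_canon_dupl "$sample"
rm -f "$sample"
```

Because header lines are recognised by their leading #, this sketch should behave the same whether the file has three header lines or thirty.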
    

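    【Discussion】:

    • It doesn't work properly, because I just realized that almost every entry is repeated 2 to 6 times, with both CANON and ISO values in $3 (e.g. 1_222_A/G - ISO is present 5 times and 1_222_A/G - CANON is present once), so in the end this command flags every CANON as a duplicate (following the example, I don't want 1_222_A/G - CANON to be marked as CANON_DUPL, since it appears only once). I edited the question to clarify this.
    • @ALG, OK, please try the following once: awk '(FNR==1 || FNR==2 || FNR==3 ){if(++count<=3){ print };next} FNR==NR{arr[$1,$3]++;next};arr[$1,$3]>1 && $0!~/^#/ && $3=="CANON"{$3="CANON_DUPL"}1' file file and let me know if this helps you?
    • @ALG, I have also edited my answer; its output now matches your shown sample. Please check it once and let me know if you have any queries, thanks.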
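    【Solution 2】:

    Here is a one-pass solution. It compares each row with the previous one, so it assumes duplicated rows are adjacent, as in the sample: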

    parse.awk

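    # Pass the 4 header lines through unchanged.
    NR<=4 { print; next }
    # Buffer the first data row; every row is printed one step late.
    NR==5 { P1=$1; P2=$2; P3=$3; next }
    # Same $1 as the previous row and both CANON: mark both.
    # P3 may already be CANON_DUPL when a run has 3 or more CANON rows.
    $1 == P1 && $3 == "CANON" && P3 ~ /^CANON(_DUPL)?$/ { $3 = P3 = "CANON_DUPL" }
    { print P1, P2, P3; P1=$1; P2=$2; P3=$3 }
    END { if (NR >= 5) print P1, P2, P3 }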
    

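    Run it like this (the OFS assignment must come before the file name, otherwise it only takes effect after the file has been read):

    awk -f parse.awk OFS='\t' infile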
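
Command-line assignments such as OFS='\t' are processed in operand order, at the point where awk would otherwise open that operand as a file. The following sketch, using a made-up two-line file, shows the difference between placing the assignment before and after the file name; the variable names demo, before and after are illustrative.

```shell
demo=$(mktemp)
printf 'a b\nc d\n' > "$demo"

# Assignment before the file: OFS is already a tab while records are printed.
before=$(awk '{ $1 = $1; print }' OFS='\t' "$demo")

# Assignment after the file: it is applied only once the file has been read,
# so the records are still joined with the default single space.
after=$(awk '{ $1 = $1; print }' "$demo" OFS='\t')

printf '%s\n---\n%s\n' "$before" "$after"
rm -f "$demo"
```

Setting the separator with -v OFS='\t' or in a BEGIN block avoids the ordering question altogether.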
    

    Output:

    ## HEADER 1
    ## HEADER 2
    ## HEADER 3
    #Col1   Col2    Col3
    1_222_A/G   value1  ISO
    1_222_A/G   value1  ISO
    1_222_A/G   value1  ISO
    1_222_A/G   value1  CANON
    1_506_C/T   value2  ISO
    1_506_C/T   value2  CANON
    1_245_A/T   value3  SINGLE
    2_1156_C/G  value4  ISO
    2_1156_C/G  value4  ISO
    2_1221_A/T/C    value5  ISO
    2_1221_A/T/C    value5  ISO
    2_1221_A/T/C    value5  CANON_DUPL
    2_1221_A/T/C    value5  CANON_DUPL
    3_787_G/T   value6  ISO
    3_99089_A/C value7  ISO
    3_99089_A/C value7  ISO
    3_99089_A/C value7  CANON
    4_12_T/C    value8  SINGLE
    4_167_A/G   value9  ISO
    4_167_A/G   value9  CANON_DUPL
    4_167_A/G   value9  CANON_DUPL
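
The one-pass program compares each row only with the row before it, so it relies on duplicated rows being adjacent. If that cannot be assumed, one possible alternative is to read the file once, buffer the lines in memory, and relabel them at the end. This is a sketch under the assumptions of tab-separated fields and a file that fits in memory; the function name mark_canon_dupl_buffered and the array names are illustrative, not part of the original answer.

```shell
# Hypothetical helper: single read of the file, lines buffered in memory.
# Assumes tab-separated input; duplicated rows do not need to be adjacent.
mark_canon_dupl_buffered() {
  awk '
    BEGIN { FS = OFS = "\t" }
    { line[NR] = $0 }                                      # keep every line in input order
    !/^#/ && $3 == "CANON" { canon[$1]++; key[NR] = $1 }   # count CANON rows per $1
    END {
      for (i = 1; i <= NR; i++) {
        if ((i in key) && canon[key[i]] > 1) {
          n = split(line[i], f, FS)                        # re-split the stored line on tabs
          f[3] = "CANON_DUPL"
          out = f[1]
          for (j = 2; j <= n; j++) out = out OFS f[j]
          line[i] = out
        }
        print line[i]
      }
    }
  ' "$1"
}

# Usage on a small made-up sample with non-adjacent duplicates.
sample=$(mktemp)
printf '#h\nx\tv\tCANON\ny\tv\tISO\nx\tv\tCANON\ny\tv\tCANON\n' > "$sample"
mark_canon_dupl_buffered "$sample"
rm -f "$sample"
```

The trade-off is memory: the whole file is held in the line array, whereas the previous-row comparison only ever keeps one row.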
    

    【Discussion】: