【Question title】:awk - find pattern in line and remove it together with an upstream part
【Posted】:2014-11-05 10:15:09
【Question description】:

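I initially filtered my text file to keep only the lines that contain a known pattern (in this case "TCTGTACTATATTG"). Now, in the resulting file, I want to remove this pattern, together with all characters upstream of it, from every line that contains it. What is the best way to do this with AWK?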

Here is my input:

@DGTKZQN1:384:C364AACXX:1:1109:19757:66886 2:N:0:GTGAAA
AACAGTTTCTGTACTATATTGACTCATAAGAGTGGTTTAATACGAAGGGAGGAGAAGTTTCCTGGAAATAATCGATTTCCTAGCTTTTAGTTGCAATAAT
+
CCCFFFFFHHHHDIIJJJJJJJJJIIJEIJHHCFGFFGHIIIIJGGIJGG@GHIGEEFDGGIGIJJIEHGIEHHHEDFFFDEEEDDEDDCCDBDDDCDDD
@DGTKZQN1:384:C364AACXX:1:1109:20360:66756 2:N:0:GTGAAA
TTTCTGTACTATATTGGGTGTGAGAAGTAATGGTGCACTCCACAGACCTCCAGTGGCTGCTTGTTCGCCAGAACAGCAAATTTCTGCAGAAGCGCAAAAG
+
@@CFFFFFHHHGHIIIJI;GCGGIIIJFHIIJGEDGGIJIICBDFIIIIJHIIGHIDHGEEHGHHIIJHGD?DDFEECEDDDDCDCCDDDCDDDDDDBC>
@DGTKZQN1:384:C364AACXX:1:1109:21207:66784 2:N:0:GTGAAA
AACAGTTTCTGTACTATATTGTACGTTGTGGATTATTAAAGGGAATAAAAGTGGTAGATTGTGCAGTTGAGGCAGGCTCTCAACTGTGAAACAGCGGTGG
+
@@CFFBDDFHBDCGG<?:CEEAFEEF@A3<?<3C>FEGHGG@DB?8BF@G>?0909??DF>HE@C=)8CEH9DHCB:AED>?C@6>C;6>C3?3=@B8B=
@DGTKZQN1:384:C364AACXX:1:1109:21026:66836 2:N:0:GTGAAA
AGAACAGTTTCTGTACTATATTGTTATACTTCTGTTGTGGGTGTAGAGTTTTCTCCGGCGTTGGCTTCAATGGAATAAGGCACGAGATGAATCCGTGGAG
+
@@@FFFFDHHHDHHIIJJEHHJGJJIGIIEIIIIEHEGHIJDF?DGEE4??DG@FGEG:FHHHHF@D@CEACEEEDDDCCCDDBDDDDDDDACDB??>BD

The output should look like this:

@DGTKZQN1:384:C364AACXX:1:1109:19757:66886 2:N:0:GTGAAA
ACTCATAAGAGTGGTTTAATACGAAGGGAGGAGAAGTTTCCTGGAAATAATCGATTTCCTAGCTTTTAGTTGCAATAAT
+
CCCFFFFFHHHHDIIJJJJJJJJJIIJEIJHHCFGFFGHIIIIJGGIJGG@GHIGEEFDGGIGIJJIEHGIEHHHEDFFFDEEEDDEDDCCDBDDDCDDD
@DGTKZQN1:384:C364AACXX:1:1109:20360:66756 2:N:0:GTGAAA
GGTGTGAGAAGTAATGGTGCACTCCACAGACCTCCAGTGGCTGCTTGTTCGCCAGAACAGCAAATTTCTGCAGAAGCGCAAAAG
+
@@CFFFFFHHHGHIIIJI;GCGGIIIJFHIIJGEDGGIJIICBDFIIIIJHIIGHIDHGEEHGHHIIJHGD?DDFEECEDDDDCDCCDDDCDDDDDDBC>
@DGTKZQN1:384:C364AACXX:1:1109:21207:66784 2:N:0:GTGAAA
TACGTTGTGGATTATTAAAGGGAATAAAAGTGGTAGATTGTGCAGTTGAGGCAGGCTCTCAACTGTGAAACAGCGGTGG
+
@@CFFBDDFHBDCGG<?:CEEAFEEF@A3<?<3C>FEGHGG@DB?8BF@G>?0909??DF>HE@C=)8CEH9DHCB:AED>?C@6>C;6>C3?3=@B8B=
@DGTKZQN1:384:C364AACXX:1:1109:21026:66836 2:N:0:GTGAAA
TTATACTTCTGTTGTGGGTGTAGAGTTTTCTCCGGCGTTGGCTTCAATGGAATAAGGCACGAGATGAATCCGTGGAG
+
@@@FFFFDHHHDHHIIJJEHHJGJJIGIIEIIIIEHEGHIJDF?DGEE4??DG@FGEG:FHHHHF@D@CEACEEEDDDCCCDDBDDDDDDDACDB??>BD

I have tried using awk with the split function, but I am struggling to use a string as the field separator.

【Comments】:

  • What is your desired result/output?

Tags: bash search awk


【Solution 1】:

It looks like a simple sed command should work for you:

sed -i.bak 's/^.*TCTGTACTATATTG//g' file

Using awk:

awk '{gsub(/^.*TCTGTACTATATTG/, "")} 1' file

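With sed, however, you also get the benefit of in-place editing (the -i.bak option keeps a backup of the original file with a .bak suffix).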
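Since the question mentions struggling with a string as the field separator, here is a minimal sketch of that route next to the awk command above. It assumes the pattern contains no regex metacharacters (true for a plain DNA string) and uses two lines from the question, with the read shortened for readability; the `sample` helper function is hypothetical and only feeds test data.

```shell
# Hypothetical helper that emits a header line and a (shortened) read line
# taken from the question's input.
sample() {
  printf '%s\n' \
    '@DGTKZQN1:384:C364AACXX:1:1109:20360:66756 2:N:0:GTGAAA' \
    'TTTCTGTACTATATTGGGTGTGAGAAGTAATGG'
}

# Regex approach from the answer. sub() is sufficient here, because the
# anchored greedy pattern can match at most once per line.
trimmed=$(sample | awk '{sub(/^.*TCTGTACTATATTG/, "")} 1')

# Field-separator approach: split each line on the pattern and print the
# last field. Lines without the pattern have one field and pass unchanged.
split_result=$(sample | awk -F 'TCTGTACTATATTG' '{print $NF}')

printf '%s\n' "$trimmed"
```

Both variants drop everything up to and including the last occurrence of the pattern on a line, so they should agree whenever the pattern is a literal string like this one.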

【Discussion】:

【Solution 2】:

sed -i.bak 's/.*TCTGTACTATATTG//g' file
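To see what the in-place edit does to the file and its backup, the following sketch runs the command on a throwaway copy. The directory from `mktemp -d` and the file name `reads.txt` are assumptions for illustration, and the read is a shortened line from the question. The `g` flag is left out because a greedy `.*` starting at the beginning of the line can match only once, so the leading `^` is also optional.

```shell
# Work in a temporary directory so no real data is touched.
workdir=$(mktemp -d)
printf '%s\n' 'AACAGTTTCTGTACTATATTGACTCATAAGAGTGG' > "$workdir/reads.txt"

# Edit in place; the unmodified original is saved as reads.txt.bak.
sed -i.bak 's/.*TCTGTACTATATTG//' "$workdir/reads.txt"

edited=$(cat "$workdir/reads.txt")
backup=$(cat "$workdir/reads.txt.bak")
printf 'edited: %s\nbackup: %s\n' "$edited" "$backup"

# Clean up the temporary directory.
rm -r "$workdir"
```

Checking the .bak file against the edited file on a small sample like this is a cheap way to confirm the pattern before running the command on the full data set.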
    

【Discussion】:
