【Question title】: How to extract some text from a column to create a new column
【Posted】: 2015-10-01 16:54:18
【Question】:

Dear Stack Overflow community,

I have a two-column file that looks like this:

Ccrux.00013.c0_g1_i1    .
Ccrux.00013.c0_g2_i1    .
Ccrux.00014.c0_g1_i1    .
Ccrux.00014.c0_g2_i1    .
Ccrux.00015.c0_g1_i1    .
Ccrux.00015.c0_g1_i1    GO:0005789^cellular_component^endoplasmic reticulum membrane`GO:0016021^cellular_component^integral component of membrane`GO:0005509^molecular_function^calcium ion binding`GO:0005506^molecular_function^iron ion binding`GO:0031418^molecular_function^L-ascorbic acid binding`GO:0016706^molecular_function^oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, 2-oxoglutarate as one donor, and incorporation of one atom each of oxygen into both donors`GO:0045646^biological_process^regulation of erythrocyte differentiation
Ccrux.00015.c0_g2_i1    GO:0005789^cellular_component^endoplasmic reticulum membrane`GO:0016021^cellular_component^integral component of membrane`GO:0005509^molecular_function^calcium ion binding`GO:0005506^molecular_function^iron ion binding`GO:0031418^molecular_function^L-ascorbic acid binding`GO:0016706^molecular_function^oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, 2-oxoglutarate as one donor, and incorporation of one atom each of oxygen into both donors`GO:0045646^biological_process^regulation of erythrocyte differentiation
Ccrux.00016.c0_g1_i1    .
Ccrux.00016.c0_g2_i1    .
Ccrux.00017.c0_g1_i1    .
Ccrux.00018.c0_g1_i1    .
Ccrux.00019.c0_g1_i1    .

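I need a new two-column file that:

  • contains no rows whose second-column value is .
  • has only the GO:XXXXXXX identifiers as the second-column value (that is, all other text is removed from column 2 and only the GO numbers are kept)

The new file should look like this: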

Ccrux.00015.c0_g1_i1    GO:0005789,GO:0016021,GO:0005509,GO:0005506,GO:0031418,GO:0016706,GO:0045646
Ccrux.00015.c0_g2_i1    GO:0005789,GO:0016021,GO:0005509,GO:0005506,GO:0031418,GO:0016706,GO:0045646
Ccrux.00029.c0_g1_i1    GO:0035869,GO:0005737,GO:0005615,GO:0016020,GO:0021956,GO:0060271,GO:0021904,GO:0001701,GO:0001841,GO:0008589,GO:0021523,GO:0021537

I have been trying with perl:

perl -ne '/(GO:\d+)/ && print "$1"' input.file > output.file

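But this only prints all my GO numbers in a single column. I really don't know how to do this. Any suggestions would be welcome.

Thanks in advance, everyone.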

    Tags: perl selection text-extraction


    【Solution 1】:

    The pattern you have there matches one piece of text per line and then prints it.

    From what it sounds like you're trying to do, given:

    GO:0005789^cellular_component^endoplasmic reticulum membrane`
    

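    you want to remove the "bit" between each ^ and the next GO?

    The nice thing about perl is that -ne just wraps a small while loop around your command, so it lets you run several statements.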

    So, as an expanded example:

    #!/usr/bin/env perl 
    use strict;
    use warnings;
    
    while (<DATA>) {
        next unless m/GO/;
        s/\^[^`]+`/,/g;
        s/\^[^`]+$/\n/g;
        print;
    }
    
    __DATA__
    Ccrux.00013.c0_g1_i1    .
    Ccrux.00013.c0_g2_i1    .
    Ccrux.00014.c0_g1_i1    .
    Ccrux.00014.c0_g2_i1    .
    Ccrux.00015.c0_g1_i1    .
    Ccrux.00015.c0_g1_i1    GO:0005789^cellular_component^endoplasmic reticulum membrane`GO:0016021^cellular_component^integral component of membrane`GO:0005509^molecular_function^calcium ion binding`GO:0005506^molecular_function^iron ion binding`GO:0031418^molecular_function^L-ascorbic acid binding`GO:0016706^molecular_function^oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, 2-oxoglutarate as one donor, and incorporation of one atom each of oxygen into both donors`GO:0045646^biological_process^regulation of erythrocyte differentiation
    Ccrux.00015.c0_g2_i1    GO:0005789^cellular_component^endoplasmic reticulum membrane`GO:0016021^cellular_component^integral component of membrane`GO:0005509^molecular_function^calcium ion binding`GO:0005506^molecular_function^iron ion binding`GO:0031418^molecular_function^L-ascorbic acid binding`GO:0016706^molecular_function^oxidoreductase activity, acting on paired donors, with incorporation or reduction of molecular oxygen, 2-oxoglutarate as one donor, and incorporation of one atom each of oxygen into both donors`GO:0045646^biological_process^regulation of erythrocyte differentiation
    Ccrux.00016.c0_g1_i1    .
    Ccrux.00016.c0_g2_i1    .
    Ccrux.00017.c0_g1_i1    .
    Ccrux.00018.c0_g1_i1    .
    Ccrux.00019.c0_g1_i1    .
    

    This produces the following output:

    Ccrux.00015.c0_g1_i1    GO:0005789,GO:0016021,GO:0005509,GO:0005506,GO:0031418,GO:0016706,GO:0045646
    Ccrux.00015.c0_g2_i1    GO:0005789,GO:0016021,GO:0005509,GO:0005506,GO:0031418,GO:0016706,GO:0045646
    

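    We:

    • skip any line that doesn't contain GO;
    • replace every literal ^, followed by one or more characters that are not a backtick, and the backtick that ends them, with a comma;
    • and replace the same thing, when it runs to the end of the line instead of to a backtick, with \n.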

    We can condense this into one line:

    perl -ne 'next unless m/GO/;s/\^[^`]+`/,/g;s/\^[^`]+$/\n/g;print' inputfile > outputfile
    

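    Or better still, drop the print: see perlrun. -p is like -n but with the print built in (so it works a bit like sed). Under -p a next still prints the line, so lines without GO are emptied rather than skipped:

    perl -pe '$_ = "" unless m/GO/; s/\^[^`]+`/,/g; s/\^[^`]+$/\n/g;' inputfile > outputfile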
    

    【Comments】:

    • Almost, @Sobrique! Both of them gave me this: Ccrux.00015.c0_g1_i1 GO:0005789,GO:0016021,GO:0005509,GO:0005506,GO:0031418,GO:0016706,GO:0045646^biological_process^regulation of erythrocyte differentiation Ccrux.00015.c0_g2_i1 GO:0005789,GO:0016021,GO:0005509,GO:0005506,GO:0031418,GO:0016706,GO:0045646^biological_process^regulation of erythrocyte differentiation
    • Yes, there is no backtick at the end of the line. Updated. Try it now.
    • It's working now, @Sobrique. Thanks! Any ideas on how to remove the lines whose second-column value is .?
    • next unless m/GO/;
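    【Solution 2】:

    I think your requirement is a bit too long for a one-line solution, but it can still be very brief. This program produces the output you describe. It expects the path to the input file as an argument on the command line.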

    use strict;
    use warnings;
    
    while ( <> ) {
        next unless my @values = /GO:\d+/g;
        local $" = ',';
        s/\S\s+\K.+/@values/;
        print;
    }
    

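    A one-line version is a little clumsy. It uses -n rather than -p, because with -p the explicit print would print each matching line twice and next would not suppress the other lines:

    perl -ne '@v=/GO:\d+/g or next; $"=","; s/\S\s+\K.+/@v/; print;' myfile > newfile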
    
