sed: replacing a pattern that occurs multiple times in the second column

【Question title】: sed to replace pattern in second column for multiple times
【Posted】: 2018-08-17 13:14:09
【Question】:

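I am new to sed and completely stuck on the following task: I am trying to use sed to replace a pattern in the second column. The pattern occurs multiple times on each line.

I have: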

Gene1 GO:0000045^biological_process^autophagosome assembly`GO:0005737^cellular_component^cytoplasm
Gene2 GO:0000030^molecular_function^mannosyltransferase activity`GO:0006493^biological_process^protein O-linked glycosylation`GO:0016020^cellular_component^membrane

I want to get:

Gene1 GO:0000045,GO:0005737
Gene2 GO:0000030,GO:0006493,GO:0016020

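So I want to strip all the descriptive parts and use "," as the separator. I chose sed because I thought it would be easy to match the text between ^ and `. Instead, my command removes the first GO terms entirely.

Code: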

sed -E 's/(^)'.+'(`)/,/g'

Can anyone help me?

【Comments】:

Tags: unix sed

【Solution 1】:

Try this, shown in two steps:

$ # showing how to remove from ^ to ` and replace with ,
$ sed 's/\^[^`]*`/,/g' ip.txt
Gene1 GO:0000045,GO:0005737^cellular_component^cytoplasm
Gene2 GO:0000030,GO:0006493,GO:0016020^cellular_component^membrane

$ # removing remaining data from ^ to end of line as well
$ sed 's/\^[^`]*`/,/g; s/\^.*//' ip.txt
Gene1 GO:0000045,GO:0005737
Gene2 GO:0000030,GO:0006493,GO:0016020

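Because ^ is a regex metacharacter (the start-of-line anchor), use \^ to match a literal caret.
[^`]* matches zero or more characters that are not a backtick.
Do not use \^.*` here: quantifiers are greedy, so it would delete everything from the first ^ to the last backtick on the line.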
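
The greedy pitfall is easiest to see on the Gene2 line, which contains two backticks. The following is a minimal sketch, not part of the original answer; the variable names `line`, `greedy`, `bounded` and `attempt` are only illustrative. The last command re-runs the command from the question, on the assumption that the shell joins its quoted pieces into `s/(^).+(`)/,/g`, where `(^)` acts as the start-of-line anchor under `-E`.

```shell
# Gene2 line from the question; it contains two backticks
line='Gene2 GO:0000030^molecular_function^mannosyltransferase activity`GO:0006493^biological_process^protein O-linked glycosylation`GO:0016020^cellular_component^membrane'

# Greedy: \^.*` runs from the first ^ to the LAST backtick, so GO:0006493 is lost
greedy=$(printf '%s\n' "$line" | sed 's/\^.*`/,/g')
printf '%s\n' "$greedy"

# Bounded: [^`]* cannot cross a backtick, so each description is removed on its own
bounded=$(printf '%s\n' "$line" | sed 's/\^[^`]*`/,/g')
printf '%s\n' "$bounded"

# The attempt from the question: (^) anchors at the start of the line and .+ is greedy,
# so everything up to the last backtick, including the gene name, is replaced
attempt=$(printf '%s\n' "$line" | sed -E 's/(^).+(`)/,/g')
printf '%s\n' "$attempt"
```

Only the bounded form keeps all three GO IDs, which is why the answer uses [^`]* rather than .* between the caret and the backtick.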

【Comments】:

【Solution 2】:

sed -e 's/\^[^`]*//g' -e 's/`/,/g' your_file

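The first command deletes (replaces with nothing) each ^ together with all following characters that are not a `.

The second command replaces each ` with ,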
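
To see what each of the two expressions contributes, they can be run one at a time on the Gene1 line. This is a small sketch for illustration only; the variable names `line`, `step1` and `step2` are not from the original answer.

```shell
# Gene1 line from the question
line='Gene1 GO:0000045^biological_process^autophagosome assembly`GO:0005737^cellular_component^cytoplasm'

# Step 1: drop every run that starts at a ^ and stops before the next backtick (or at end of line)
step1=$(printf '%s\n' "$line" | sed -e 's/\^[^`]*//g')
printf '%s\n' "$step1"

# Step 2: the backticks that remain sit exactly between GO IDs, so turn them into commas
step2=$(printf '%s\n' "$step1" | sed -e 's/`/,/g')
printf '%s\n' "$step2"
```

Because [^`]* also matches further ^ characters, a single match in step 1 removes both the category and the description that follow a GO ID.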

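【Comments】:

【Solution 3】:

It can be more useful to split each line into individual fields and then operate on each field, rather than relying only on a regular expression to pick out parts of the line: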

$ awk -F'^' -v OFS=',' '{print NR") "$0; for (i=1;i<=NF;i++) print "\t"i") "$i}' file
1) Gene1 GO:0000045^biological_process^autophagosome assembly`GO:0005737^cellular_component^cytoplasm
        1) Gene1 GO:0000045
        2) biological_process
        3) autophagosome assembly`GO:0005737
        4) cellular_component
        5) cytoplasm
2) Gene2 GO:0000030^molecular_function^mannosyltransferase activity`GO:0006493^biological_process^protein O-linked glycosylation`GO:0016020^cellular_component^membrane
        1) Gene2 GO:0000030
        2) molecular_function
        3) mannosyltransferase activity`GO:0006493
        4) biological_process
        5) protein O-linked glycosylation`GO:0016020
        6) cellular_component
        7) membrane

$ awk -F'^' -v OFS=',' '{out=$1; for (i=2;i<=NF;i++) if (sub(/.*`/,"",$i)) out=out OFS $i; print out}' file
Gene1 GO:0000045,GO:0005737
Gene2 GO:0000030,GO:0006493,GO:0016020
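
For readers who find the one-liner dense, here is the same awk program spread over several lines with comments. It is a sketch that feeds the two sample lines from a shell variable instead of a file; the names `input` and `result` are illustrative and not part of the original answer.

```shell
# The two sample lines from the question
input='Gene1 GO:0000045^biological_process^autophagosome assembly`GO:0005737^cellular_component^cytoplasm
Gene2 GO:0000030^molecular_function^mannosyltransferase activity`GO:0006493^biological_process^protein O-linked glycosylation`GO:0016020^cellular_component^membrane'

result=$(printf '%s\n' "$input" | awk -F'^' -v OFS=',' '
{
    out = $1                          # $1 holds the gene name plus the first GO ID
    for (i = 2; i <= NF; i++) {
        # sub() returns 1 only when the field contained a backtick;
        # it removes the description before the backtick and leaves the next GO ID
        if (sub(/.*`/, "", $i))
            out = out OFS $i
    }
    print out
}')
printf '%s\n' "$result"
```

Fields without a backtick (the category names and the final description) are skipped, so only the GO IDs are appended to the output.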

【Comments】: