Join two files by matching a specific column: answers

【Question title】: Join two files by matching a specific column
【Posted】: 2021-11-21 15:03:38
【Question description】:

I am trying to join two files that are already sorted.

File1

70 CBLB Cbl proto-oncogene B
70 HOXC11 centrosomal protein 57
70 CHD4 chromodomain helicase
70 FANCF FA complementation
70 LUZP2 leucine zipper protein 2

File2

0.700140820757797 ELAVL1
0.700229616476825 HOXC11
0.700328646327188 CHD4
0.700328951649384 LUZP2

Desired output

Gene Symbol  Gene Description         Target Score mirDB   Target Score Diana
HOXC11       centrosomal protein 57   70                   0.700229616476825
CHD4         chromodomain helicase    70                   0.700328646327188
LUZP2        leucine zipper protein 2 70                   0.700328951649384

To do this I tried the following script, but it returns an empty file:

join -j 2 -o 1.1,1.2,1.3,1.4,2.4 File1 File2 | column -t | sed '1i Gene Symbol, Gene Description, Target Score mirDB, Target Score Diana' > Output

Any help with the awk or join commands would be appreciated.

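【Question comments】:

In code questions, please give a minimal reproducible example: cut-and-paste, runnable code; example input with desired and actual output (including verbatim error messages); tags and versions; a clear specification and explanation. For errors, that means the least code you can give that consists of code you show works, extended by code you show does not work. (Debugging fundamentals.) For SQL, include DDL and tabular initialization code. When you get a result you do not expect, pause your overall goal, cut down to the first subexpression with an unexpected result, and say what you expected and why, justified by documentation. How to Ask, Help center

Tags: linux join awk sed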

【Solution 1】:

You can try this awk:

$ awk 'BEGIN {OFS="\t"; print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"} NR==FNR{array[$2]=$1; next} $0!~array[$2]{print $2,OFS $3" "$4" "$5,$6, $1,OFS array[$2]}' file2 file1

Gene Symbol     Gene Description        Target Score mirDB      Target Score Diana
HOXC11          centrosomal protein 57          70              0.700229616476825
CHD4            chromodomain helicase           70              0.700328646327188
LUZP2           leucine zipper protein  2       70              0.700328951649384

The same script, formatted for readability:

BEGIN {
    OFS="\t" 
    print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
} NR==FNR {
    array[$2]=$1
    next
} $0!~array[$2] {
    print $2,OFS $3" "$4" "$5,$6, $1,OFS array[$2]
}
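
The pattern `$0!~array[$2]` selects matching lines only indirectly: for a gene that is absent from file2, `array[$2]` is empty, and an empty regular expression matches every line, so the negated test is false. As a minimal sketch (not the answerer's code), the same join can be written with an explicit membership test, taking the description as everything after the second field so that its word count does not matter. The scratch directory, the array name `score`, and the output file `joined.tsv` are assumptions made for this illustration; the header line is omitted for brevity.

```shell
# Scratch copies of the sample data from the question (assumed location).
cd "$(mktemp -d)" || exit 1
printf '%s\n' '70 CBLB Cbl proto-oncogene B' \
              '70 HOXC11 centrosomal protein 57' \
              '70 CHD4 chromodomain helicase' \
              '70 FANCF FA complementation' \
              '70 LUZP2 leucine zipper protein 2' > file1
printf '%s\n' '0.700140820757797 ELAVL1' \
              '0.700229616476825 HOXC11' \
              '0.700328646327188 CHD4' \
              '0.700328951649384 LUZP2' > file2

awk '
BEGIN { OFS = "\t" }
NR == FNR { score[$2] = $1; next }     # first file (file2): remember the score per gene symbol
$2 in score {                          # second file (file1): keep only genes seen in file2
    desc = $0
    sub(/^[^ ]+ +[^ ]+ +/, "", desc)   # drop the first two fields, keep the whole description
    print $2, desc, $1, score[$2]
}' file2 file1 > joined.tsv
cat joined.tsv
```

Because the description is cut from the whole line rather than assembled from `$3` to `$6`, it is emitted as a single tab-separated field regardless of how many words it contains.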

【Comments】:

【Solution 2】:

Update: the awk code now strips Windows line endings (\r), since this came up as an issue in OP's comments/other question.

Issues:

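OP's current code would require both files to be pre-sorted on the join field (the gene symbol) before calling join
Because File1 has a variable number of space-delimited columns, it is difficult (impossible?) to get join to produce a format that will not be scrambled by the follow-on column call
column cannot tell the difference between a space used as a field delimiter and a space that is part of a field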
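
The first two points can be checked directly on the sample data. The sketch below is only a diagnostic under assumed names (the scratch directory and `field_counts.txt` are not from the question): `sort -c` reports whether File1 is ordered on its second field, and awk prints the number of whitespace-separated fields on each line.

```shell
# Scratch copy of File1 from the question (assumed location).
cd "$(mktemp -d)" || exit 1
printf '%s\n' '70 CBLB Cbl proto-oncogene B' \
              '70 HOXC11 centrosomal protein 57' \
              '70 CHD4 chromodomain helicase' \
              '70 FANCF FA complementation' \
              '70 LUZP2 leucine zipper protein 2' > File1

# sort -c only checks the order; it exits non-zero when the input is not sorted on the key.
if LC_ALL=C sort -c -k2b,2 File1 2>/dev/null; then
    sorted=yes
else
    sorted=no
fi
echo "File1 sorted on field 2: $sorted"

# Number of whitespace-separated fields per line, next to the gene symbol.
awk '{ print NF, $2 }' File1 > field_counts.txt
cat field_counts.txt
```

On this sample the field count differs from line to line, which is why a fixed `-o` list such as `1.3,1.4` cannot capture the whole description.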

Because of these issues, I think an awk solution, combined with column for the "simple" reformatting, is easier to implement and understand, e.g.:

awk '
BEGIN      { OFS="|"                              # "|" will be used as the input delimiter for a follow-on "column" call
             print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
           }
           { sub(/\r/,"") }                       # remove Windows line ending "\r" for all lines in all files
FNR==NR    { gene[$2]=$1 ; next }
$2 in gene { lastF=pfx=""
             for (i=3;i<=NF;i++) {                # pull fields #3 to #NF into a single variable 
                 lastF=lastF pfx $i
                 pfx=" "
             }
             print $2, lastF, $1, gene[$2]
           }
' File2 File1

This generates:

Gene Symbol|Gene Description|Target Score mirDB|Target Score Diana
HOXC11|centrosomal protein 57|70|0.700229616476825
CHD4|chromodomain helicase|70|0.700328646327188
LUZP2|leucine zipper protein 2|70|0.700328951649384

While more code could be added so that awk prints the output in "pretty" columns, I have opted for the simpler approach of letting column do the extra work:

awk '
BEGIN      { OFS="|" 
             print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
           }
           { sub(/\r/,"") }                       # remove Windows line ending "\r" for all lines in all files
FNR==NR    { gene[$2]=$1 ; next }
$2 in gene { lastF=pfx=""
             for (i=3;i<=NF;i++) {
                 lastF=lastF pfx $i
                 pfx=" "
             }
             print $2, lastF, $1, gene[$2]
           }
' File2 File1 | column -s'|' -t

This generates:

Gene Symbol  Gene Description          Target Score mirDB  Target Score Diana
HOXC11       centrosomal protein 57    70                  0.700229616476825
CHD4         chromodomain helicase     70                  0.700328646327188
LUZP2        leucine zipper protein 2  70                  0.700328951649384

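【Comments】:

This is a very good solution. I was wondering whether the output could be delimited with (;) or (,) to produce CSV. Because the gene descriptions are long, the values get mixed up.
If I understand your question: 1) set OFS=';' or OFS=',' (or whatever you decide on) and 2) do not pipe to column (i.e., remove | column -s'|' -t); if this does not answer your question, I suggest creating a new question with the necessary details...
I tried your suggestion, but I only get the header: Gene Symbol;Gene Description;Target Score mirDB;Target Score Diana
With just the change to OFS=";" I still get the same 4 lines of output as shown above; are you feeding the same 2 files (File2 and File1) to awk?
Thank you very much for your support. I am just fixing the format. The only problem I have is at the end of the Gene Description value: there is a line break, and the ; is written on the next line. Gene Symbol;Gene Description;Target Score mirDB;Target Score Diana NPTX1;neuronal pentraxin 1 ;100;0.999551316662558 FGD4;FYVE, RhoGEF and PH domain containing 4 ;70;0.768034158332055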
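
The symptoms in these comments (only the header printed, or a line break before the `;`) are what one would expect if the input files have Windows line endings, which is the case the update at the top of this answer addresses. A minimal sketch of the semicolon-separated variant follows, assuming CRLF input; the scratch directory, the CRLF sample files, the variable name `desc`, and the output name `Output.csv` are assumptions for illustration.

```shell
# Sample data written with Windows (CRLF) line endings; this is an assumption about OP's files.
cd "$(mktemp -d)" || exit 1
printf '%s\r\n' '70 CBLB Cbl proto-oncogene B' \
                '70 HOXC11 centrosomal protein 57' \
                '70 CHD4 chromodomain helicase' \
                '70 FANCF FA complementation' \
                '70 LUZP2 leucine zipper protein 2' > File1
printf '%s\r\n' '0.700140820757797 ELAVL1' \
                '0.700229616476825 HOXC11' \
                '0.700328646327188 CHD4' \
                '0.700328951649384 LUZP2' > File2

awk '
BEGIN      { OFS = ";"
             print "Gene Symbol", "Gene Description", "Target Score mirDB", "Target Score Diana"
           }
           { sub(/\r$/, "") }                     # strip a trailing carriage return, if present
FNR == NR  { gene[$2] = $1; next }                # File2: remember the score per gene symbol
$2 in gene { desc = $3                            # File1: rebuild the description from field 3 onward
             for (i = 4; i <= NF; i++) desc = desc " " $i
             print $2, desc, $1, gene[$2]
           }
' File2 File1 > Output.csv
cat Output.csv
```

A semicolon is used here because descriptions such as the one quoted in the last comment already contain commas.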

【Solution 3】:

This might work for you (GNU sed, join and column):

( echo 'Gene Symbol@Gene Description@Target Score mirDB@Target Score Diana';
join -j2 -t@ --no -o 0,1.3,1.1,2.1 <(sed 's/ /@/;s//@/' file1) <(sed 's/ /@/' file2) ) |
column -s@ -t

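Formulate the final header, join the two input files, and pipe the combined output to the column command, which tabulates the result.

Note that the header is delimited by @, an arbitrary character that does not occur in the header or in the files being joined. The input files are modified so that their field delimiter matches the header's, and the column command uses the same delimiter to tabulate the final result. --no (short for --nocheck-order) suppresses the warning messages.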
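
`--nocheck-order` only silences the check; join still assumes that both inputs are ordered on the join field, so results on unsorted data depend on how the lines happen to be ordered. A hedged variant sorts both inputs on the gene symbol first and uses temporary files instead of process substitution, so that it also runs in a plain POSIX shell. The scratch directory and the file names `f1.sorted`, `f2.sorted`, and `joined.txt` are assumptions; `-1 2 -2 2` is the portable spelling of `-j2`.

```shell
# Scratch copies of the sample data from the question (assumed location).
cd "$(mktemp -d)" || exit 1
printf '%s\n' '70 CBLB Cbl proto-oncogene B' \
              '70 HOXC11 centrosomal protein 57' \
              '70 CHD4 chromodomain helicase' \
              '70 FANCF FA complementation' \
              '70 LUZP2 leucine zipper protein 2' > file1
printf '%s\n' '0.700140820757797 ELAVL1' \
              '0.700229616476825 HOXC11' \
              '0.700328646327188 CHD4' \
              '0.700328951649384 LUZP2' > file2

# Turn the first two spaces (file1) or the first space (file2) into @,
# then sort on the gene symbol, which is field 2 in both files.
sed 's/ /@/;s//@/' file1 | LC_ALL=C sort -t@ -k2,2 > f1.sorted
sed 's/ /@/'       file2 | LC_ALL=C sort -t@ -k2,2 > f2.sorted

{
    echo 'Gene Symbol@Gene Description@Target Score mirDB@Target Score Diana'
    LC_ALL=C join -t@ -1 2 -2 2 -o 0,1.3,1.1,2.1 f1.sorted f2.sorted
} > joined.txt
cat joined.txt
```

Piping `joined.txt` through `column -s@ -t`, as in the answer, tabulates the result; the rows come out ordered by gene symbol rather than in the original file order.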

【Comments】: