Merging two data tables with missing values using bash: answers

【Question title】: Merging two data tables with missing values using bash
【Posted】: 2014-12-17 01:26:17
【Question】:

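I am looking for a script that can merge two files containing tables. The columns are bacterial counts for individual samples, and the rows hold the names of the bacteria. I cannot simply sort the files and paste them together, because some bacteria appear in only one file and not in the other. In that case I want to fill the row with zeros.

Here is an example:

File 1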

Header                         S1    S2    S3    S4
Acetobacterium submarinus     1350  1000   1541 1541
Abiotrophia defectiva         100   110    112  166
Acetobacterium tundrae         2     1      0     0

File 2

Header                         S5    S6     S7    S8
Acholeplasma cavigenitalium   100    90    88    120
Acetobacterium woodii          2     3      4     0
Acetobacterium submarinus     500   600    400   480

The resulting file should be (sorted alphabetically):

Header                         S1    S2    S3    S4    S5    S6     S7    S8
Abiotrophia defectiva         100   110    112  166     0     0     0      0
Acetobacterium submarinus     1350  1000   1541  1541  500   600    400   480
Acetobacterium tundrae         2     1      0     0     0     0      0     0
Acetobacterium woodii          0     0      0     0     2     3      4     0
Acholeplasma cavigenitalium    0     0      0     0    100    90    88    120

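Any ideas?

I know the paste command can combine the files by the first column, but I am not sure how to deal with the missing species.

Update: here are two example data sets. The number of columns is the same as in the original data sets; I only reduced the number of rows.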

https://www.dropbox.com/s/h46nwjwwfdyzwqr/Class_Level_Aggregate_Counts-1.csv?dl=0
https://www.dropbox.com/s/x8wtdxl45bej729/Class_Level_Aggregate_Counts-2.csv?dl=0

【Comments】:

Tags: bash text merge

【Solution 1】:

You should use join with the -a 1 -a 2, -e '0' and -o '0,1.2,1.3,1.4,1.5,2.2,2.3,2.4,2.5' options:

join -a 1 -a 2 -e '0' -1 1 -2 1 -o '0,1.2,1.3,1.4,1.5,2.2,2.3,2.4,2.5' -t $'\t' file1 file2 > joinedfile

Because join needs sorted input and you want the header line to stay at the top, you have to exclude the first line and then sort:

sed -n '2,$p' file1unsorted | sort >file1
sed -n '2,$p' file2unsorted | sort >file2

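After that, run the join command above on the sorted files (also note the -t option, which specifies the column separator; I am assuming your files are tab-separated).

Join the header lines separately: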

head -1 file1unsorted | join -1 1 -2 1 -o '0,1.2,1.3,1.4,1.5,2.2,2.3,2.4,2.5' -t $'\t' - <(head -1 file2unsorted) >headerfile

Then "reassemble" the final file (put the new header in front of the rest of the file):

cat headerfile joinedfile >resultfile
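The steps above can be run end to end on the sample data from the question. This is only a minimal sketch under a few assumptions that are not part of the answer: the input is tab-separated, the scratch files live in a temporary directory, `LC_ALL=C` is set so that sort and join agree on collation, and the tab is stored in a `tab` variable instead of `$'\t'` so the snippet also runs under a plain POSIX shell.

```shell
#!/bin/bash
# Sketch of the join-based merge, run on the sample data from the question.
# Assumptions: tab-separated input, scratch files in a temporary directory.
export LC_ALL=C                      # sort and join must use the same collation
tab=$(printf '\t')
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

{
    printf 'Header\tS1\tS2\tS3\tS4\n'
    printf 'Acetobacterium submarinus\t1350\t1000\t1541\t1541\n'
    printf 'Abiotrophia defectiva\t100\t110\t112\t166\n'
    printf 'Acetobacterium tundrae\t2\t1\t0\t0\n'
} > file1unsorted
{
    printf 'Header\tS5\tS6\tS7\tS8\n'
    printf 'Acholeplasma cavigenitalium\t100\t90\t88\t120\n'
    printf 'Acetobacterium woodii\t2\t3\t4\t0\n'
    printf 'Acetobacterium submarinus\t500\t600\t400\t480\n'
} > file2unsorted

fmt='0,1.2,1.3,1.4,1.5,2.2,2.3,2.4,2.5'

# Data rows: drop the header line, then sort on the name column
tail -n +2 file1unsorted | sort -t "$tab" -k 1,1 > file1
tail -n +2 file2unsorted | sort -t "$tab" -k 1,1 > file2

# Full outer join; fields missing on either side are filled with 0
join -a 1 -a 2 -e '0' -o "$fmt" -t "$tab" file1 file2 > joinedfile

# Header lines are joined on their shared first field ("Header")
head -n 1 file1unsorted > header1
head -n 1 file2unsorted > header2
join -o "$fmt" -t "$tab" header1 header2 > headerfile

cat headerfile joinedfile > resultfile
cat resultfile
```

With this input the result should contain the header plus five rows, one per species, with zeros wherever a species is missing from one of the files.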

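Update:

Regarding the dependency of join on the number of columns (in case your files have more columns): yes, to some extent there is one. To be precise, column numbers are used in the -1 and -2 options (both have the value 1, which is the number of the column to join on in the respective file; obviously this does not depend on the total number of columns as long as you join on the first column). Column numbers are also used in the -o option, which specifies the output format (that is, which columns to output and in what order, in the form "file#.column#", both counted from 1, with the special syntax "0" for the join column). The format specified in the example is actually the default one (the join column first, then all remaining columns of the first file, then all remaining columns of the second file), but unfortunately the option still cannot be omitted because the -e option needs it (this may not be the case in your version of join, so try leaving out the -o part and see what happens).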
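For files with many columns, the -o list does not have to be typed by hand; it can be generated from the header lines. The sketch below is one possible way to do that, not part of the original answer: `build_join_format` is a hypothetical helper name, and it assumes tab-separated files whose first column is the join key. GNU join also accepts `-o auto`, which infers the format from the first line of each file, but that is specific to GNU coreutils.

```shell
# Sketch: generate the -o list for join from the header lines of two files.
# build_join_format is a hypothetical helper; it assumes tab-separated files
# whose first column is the join key.
build_join_format() {
    local n1 n2
    # Count the fields in the first line of each file
    n1=$(head -n 1 "$1" | awk -F '\t' '{ print NF }')
    n2=$(head -n 1 "$2" | awk -F '\t' '{ print NF }')
    # Emit 0 (the join field), then 1.2..1.n1, then 2.2..2.n2
    awk -v n1="$n1" -v n2="$n2" 'BEGIN {
        fmt = "0"
        for (i = 2; i <= n1; i++) fmt = fmt ",1." i
        for (i = 2; i <= n2; i++) fmt = fmt ",2." i
        print fmt
    }'
}

# Example with the two headers from the question (written to temp files)
h1=$(mktemp) || exit 1
h2=$(mktemp) || exit 1
printf 'Header\tS1\tS2\tS3\tS4\n' > "$h1"
printf 'Header\tS5\tS6\tS7\tS8\n' > "$h2"
fmt=$(build_join_format "$h1" "$h2")
echo "$fmt"
```

The resulting string can then be passed as `-o "$fmt"` to both the data join and the header join.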

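【Comments】:

Comments are not for extended discussion; this conversation has been moved to chat.
@bluefeet: I tried, but the OP does not have enough rep to talk in chat.

【Solution 2】:

Sometimes the old-fashioned brute-force approach works for cases where the data is hard to shoehorn into a single function. The Bash script below reads the two data files, works them step by step into sorted tmp files (in /tmp), then reads the values into arrays and finally, for each unique name, combines the S1 - S8 values if data exists in both files, otherwise pads the missing values with 0s. The trap function removes the temporary files on exit. The file is well commented to help explain the logic. Note that this is a Bash solution (mainly because of the substring test operators), but it can easily be adapted to other shell environments. Let me know if you have any questions: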

#!/bin/bash

## simple error/usage function
function usage {
    errno=${2:-0}
    if test -n "$1" ; then
        printf "\n %s\n" "$1"
    fi
cat >&2 <<TAG

  Merge two bacterial count data files zeroing non-common columns.

  Usage:  ./${0//*\//} file_1 file_2

TAG
    exit $((errno))
}

## semi-random tmp file timestamp 'mmdd????'
function tstamp {
    local rd=$(date +%N)
    printf "%s" "$(date +%m%d)${rd:4:4}"
}

## trap function - cleanup temp files
function cleanup {
    ## -f: stay quiet if a file was never created
    rm -f "$tfn1" "$tfn2" "$tfnall" "$tfnuniq"
}

## respond to help
test "$1" = "-h" -o "$1" = "--help" && usage

## validate input files
test -z "$1" && usage "error: insufficient input." 1
test -z "$2" && usage "error: insufficient input." 1
test -r "$1" || usage "error: invalid input, file not readable '$1'" 1
test -r "$2" || usage "error: invalid input, file not readable '$2'" 1

## assign temp file names (mktemp creates each file and guarantees unique names;
## date +%N is GNU-only, so timestamp-based names can collide elsewhere)
tfn1=$(mktemp /tmp/bmrg_XXXXXX)      # temp file_1 (sorted w/o header)
tfn2=$(mktemp /tmp/bmrg_XXXXXX)      # temp file_2 (sorted w/o header)
tfnall=$(mktemp /tmp/bmrg_XXXXXX)    # concatenated $tfn1 $tfn2
tfnuniq=$(mktemp /tmp/bmrg_XXXXXX)   # uniq records $tfn1 $tfn2

## create $tfn1 $tfn2 & validate
tail -n+2 "$1" | sort > "$tfn1"
tail -n+2 "$2" | sort > "$tfn2"

test -f "$tfn1" || usage "error: failed to create tmp file '${tfn1}'" 1
test -f "$tfn2" || usage "error: failed to create tmp file '${tfn2}'" 1

## set trap for cleanup on exit
trap cleanup EXIT

## read names from $tfn1
while read -r name || test -n "$name" ; do
    name1+=( "$name" )
done <<<"$(cut -c -30 "$tfn1")"
unset name

## read names from $tfn2
while read -r name || test -n "$name" ; do
    name2+=( "$name" )
done <<<"$(cut -c -30 "$tfn2")"
unset name 

## concatenate $tfn1 $tfn2
printf "%s\n" "${name1[@]}" > "$tfnall"
printf "%s\n" "${name2[@]}" >> "$tfnall"

## get unique names
sort -u "$tfnall" > "$tfnuniq"

## read $tfn1 values into separate arrays
while read -r v1 v2 v3 v4 || test -n "$v4" ; do
    s1+=( "$v1" )
    s2+=( "$v2" )
    s3+=( "$v3" )
    s4+=( "$v4" )
done <<<"$(cut -c 31- "$tfn1")"

## read $tfn2 values into separate arrays
while read -r v5 v6 v7 v8 || test -n "$v8" ; do
    s5+=( "$v5" )
    s6+=( "$v6" )
    s7+=( "$v7" )
    s8+=( "$v8" )
done <<<"$(cut -c 31- "$tfn2")"

printf "Header                         S1    S2    S3    S4    S5    S6    S7    S8\n"

## for each unique name in $tfnuniq
while read -r name || test -n "$name" ; do

    ## test if found in name1, print values by index, else print 0's
    found=0
    for ((i=0; i < ${#name1[@]}; i++)); do
        test "${name1[i]}" = "$name" && { found=1; break; }
    done
    if test "$found" -eq 1 ; then
        printf "%-30s%-6s%-6s%-6s%-6s" "$name" "${s1[i]}" "${s2[i]}" "${s3[i]}" "${s4[i]}"
    else
        printf "%-30s%-6s%-6s%-6s%-6s" "$name" "0" "0" "0" "0"
    fi

    ## test if found in name2, print values by index, else print 0's
    found=0
    for ((i=0; i < ${#name2[@]}; i++)); do
        test "${name2[i]}" = "$name" && { found=1; break; }
    done
    if test "$found" -eq 1 ; then
        printf "%-6s%-6s%-6s%-6s\n" "${s5[i]}" "${s6[i]}" "${s7[i]}" "${s8[i]}"
    else
        printf "%-6s%-6s%-6s%-6s\n" "0" "0" "0" "0"
    fi

done <"$tfnuniq"

exit 0

Output:

$ bash bactmerge.sh dat/bact1.txt dat/bact2.txt
Header                         S1    S2    S3    S4    S5    S6    S7    S8
Abiotrophia defectiva         100   110   112   166   0     0     0     0
Acetobacterium submarinus     1350  1000  1541  1541  500   600   400   480
Acetobacterium tundrae        2     1     0     0     0     0     0     0
Acetobacterium woodii         0     0     0     0     2     3     4     0
Acholeplasma cavigenitalium   0     0     0     0     100   90    88    120

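Note: the script relies on the spacing of the data files as given in your question. If your data files differ from what was posted (for example because of tab/space conversion), you can adjust the values passed to the cut commands above.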
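The fixed character offsets can be sidestepped by first converting the space-aligned files to tab-separated form, which is also the format the join-based answer assumes. The following is a minimal sketch rather than part of the script above: `to_tsv` is a hypothetical helper name, and it assumes that the header label is a single word, that every data row carries exactly as many values as the header has sample columns, and that the values themselves contain no spaces.

```shell
# Sketch: convert a space-aligned table to tab-separated fields.
# to_tsv is a hypothetical helper. Assumptions: the header label is one word,
# and each row ends with exactly as many values as the header has sample columns.
to_tsv() {
    awk '
        NR == 1 { k = NF - 1 }       # number of value columns, taken from the header
        NF > k {
            # everything before the last k fields is the (multi-word) name
            name = $1
            for (i = 2; i <= NF - k; i++) name = name " " $i
            line = name
            for (i = NF - k + 1; i <= NF; i++) line = line "\t" $i
            print line
        }' "$1"
}

# Example with file 1 from the question
sample=$(mktemp) || exit 1
{
    printf 'Header                         S1    S2    S3    S4\n'
    printf 'Acetobacterium submarinus     1350  1000   1541 1541\n'
    printf 'Abiotrophia defectiva         100   110    112  166\n'
    printf 'Acetobacterium tundrae         2     1      0     0\n'
} > "$sample"
converted=$(to_tsv "$sample")
printf '%s\n' "$converted"
```

Because the number of value columns is read from the header line, the same helper should work unchanged for wider files as long as the assumptions above hold.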

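【Comments】:

Thanks for the suggestion. I tried the script; however, in the output I run into the same problem as with the other solution: individual bacteria are duplicated, i.e. each bacterium has two rows, and the counts are split between the rows, not by input file but in some other way. Also, would it be possible to read the header from the files? What I gave above is only a short example; the real files contain about 100 columns each, and the sample names are not that simple.
Oh. That does change things. Your example files were relatively simple. If you can post a link to the actual data files (or at least 20 lines with complete columns), I can adjust the script. The key is the length of the names. As long as that is consistent, adjusting the script to read an unlimited number of values is not difficult. From your example I assumed it was just S1-S8. You can confirm this by running it against your example data; it works as advertised. It also contains a lot of bash subtleties, which make a good digest if you are learning.
Sorry, I should have mentioned that. I will add links to some example data files, but I need to truncate them a bit.
Thanks! I will definitely go through the script carefully and learn as much from it as I can.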