在玩了一两个小时后,我写了这篇可憎的文章:
cat <<EOF >file
ID Time A_in Time B_in Time C_in
Ax 0.1 10 0.1 15 0.1 45
By 0.2 12 0.2 35 0.2 30
Cz 0.3 20 0.3 20 0.3 15
Fr 0.4 35 0.4 15 0.4 05
Exp 0.5 10 0.5 25 0.5 10
EOF
# fix stackoverflow formatting
# input file should be separated with tabs
<file tr -s ' ' | tr ' ' '\t' > file2
mv file2 inputfile
# read headers to an array
IFS=$'\t' read -r -a hdrs < <(head -n1 inputfile)
# exp line read into an array
IFS=$'\t' read -r -a exps < <(grep -m1 $'^Exp\t' inputfile)
# column count
colcnt="${#hdrs[@]}"
if [ "$colcnt" -eq 0 ]; then
echo >&2 "ERROR - must be at least one column"
exit 1
fi
# numbers of those columns which headers have _in suffix
incolnums=$(
paste <(
printf "%s\n" "${hdrs[@]}"
) <(
# puff, the numbers will start from zero cause bash indexes arrays from zero
# but `cut` indexes fields from 1, so.. just keep in mind it's from 0
seq 0 $((colcnt - 1))
) |
grep $'_in\t' |
cut -f2
)
# read the input file
{
# preserve header line
IFS= read -r hdrline
( IFS=$'\t'; printf "%s\n" "$hdrline" )
# ok. read the file field by field
# I think we could awk here
while IFS=$'\t' read -a vals; do
# for each column number with _in suffix
while IFS= read -r incolnum; do
# update the column value
# I use bc for float calculations
vals[$incolnum]=$(bc <<-EOF
define abs(i) {
if (i < 0) return (-i)
return (i)
}
scale=2
abs(${vals[$incolnum]} - ${exps[$incolnum]})
EOF
)
done <<<"$incolnums"
# output the line
( IFS=$'\t'; printf "%s\n" "${vals[*]}" )
done
} < inputfile > MyFirstDesiredOutcomeIsThis.txt
# ok so, first part done
{
# output headers names with _in suffix
printf "%s\n" "${hdrs[@]}" |
grep '_in$' |
tr '\n' '\t' |
# omg, fix tr, so stupid
sed 's/\t$/\n/'
# puff
# output the corresponding ID of 2 smallest values of the specified column number
# @arg: $1 column number
tmpf() {
# remove header line
<MyFirstDesiredOutcomeIsThis.txt tail -n+2 |
# extract only this column
cut -f$(($1 + 1)) |
# unique numeric sort and extract two smallest values
sort -n -u | head -n2 |
# now, well, extract the id's that match the numbers
# append numbers with tab (to match the separator)
# suffix numbers with dollar (to match end of line)
sed 's/^/\t/; s/$/$/;' |
# how good is grep at buffering(!)
grep -f /dev/stdin <(
<MyFirstDesiredOutcomeIsThis.txt tail -n+2 |
cut -f1,$(($1 + 1))
) |
# extract numbers only
cut -f1
}
# the following is something like foldr $'\t' $(tmpf ...) for each $incolnums
# we need to buffer here, we are joining the output column-wise
output=""
while IFS= read -r incolnum; do
output=$(<<<$output paste - <(tmpf "$incolnum"))
done <<<"$incolnums"
# because with start with empty $output, paste inserts leading tabs
# remove them ... and finally output $output
<<<"$output" cut -f2-
} > MySecondDesiredOutcomeIs.txt
# fix formatting to post it on stackoverflow
# files have tabs, and column will output them with space
# which is just enough
echo '==> MyFirstDesiredOutcomeIsThis.txt <=='
column -t -s$'\t' MyFirstDesiredOutcomeIsThis.txt
echo
echo '==> MySecondDesiredOutcomeIs.txt <=='
column -t -s$'\t' MySecondDesiredOutcomeIs.txt
脚本会输出:
==> MyFirstDesiredOutcomeIsThis.txt <==
ID Time A_in Time B_in Time C_in
Ax 0.1 0 0.1 10 0.1 35
By 0.2 2 0.2 10 0.2 20
Cz 0.3 10 0.3 5 0.3 5
Fr 0.4 25 0.4 10 0.4 5
Exp 0.5 0 0.5 0 0.5 0
==> MySecondDesiredOutcomeIs.txt <==
A_in B_in C_in
Ax Cz Cz
By Exp Fr
Exp Exp
在tutorialspoint 编写和测试。
我使用 bash 和 core-/more-utils 来操作文件。首先,我确定以_in 后缀结尾的列数。然后我缓冲存储在Exp 行中的值。
然后我只是逐行、逐个字段地读取文件,并且对于每个具有标题以_in 后缀结尾的列号的字段,我用@987654327 中的字段值减去字段值@ 线。我认为这部分应该是最慢的(我使用普通的while IFS=$'\t' read -r -a vals),但是一个聪明的awk 脚本可以加快这个过程。正如您所说,这会生成您的“第一个所需输出”。
然后我只需要输出以_in 后缀结尾的标题名称。然后对于以_in 后缀结尾的每个列号,我需要确定列中的 2 个最小值。我使用普通的sort -n -u | head -n2。然后,它变得有点棘手。我需要提取在此类列中具有相应 2 个最小值之一的 ID。这是grep -f 的工作。我使用sed 在输入中准备适当的正则表达式,并让grep -f /dev/stdin 完成过滤工作。