How to use the value in a file as input for a calculation in awk - in bash? Answers

【Question title】: How to use the value in a file as input for a calculation in awk - in bash?
【Posted】: 2020-02-27 15:56:26
【Question】:

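I am trying to work out, for each row, whether its count exceeds a certain value, namely 30% of the total count.

Inside a for loop, I get the percentage with awk '$1=($1/100)*30' ${i}_counts > ${i}_percentage-value; the result is a single number, and the output file contains only that number.

How can I run a "greater than" comparison of every row of ${i}_counts against ${i}_percentage-value? In other words, how do I use the number stored in a file as a numeric value in a calculation?

Data:

data.csv (excerpt)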

SampleID    ASV    Count
1000A   ASV_1216    14
1000A   ASV_12580   150
1000A   ASV_12691   260
1000A   ASV_135     434
1000A   ASV_147     79
1000A   ASV_15      287
1000A   ASV_16      361
1000A   ASV_184     8
1000A   ASV_19      42

samples-ID-short

1000A
1000B
1000C

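So for each SampleID there are many ASVs, and their number can differ a lot, e.g. 50 ASVs for 1000A, 120 for 1000B, and so on. Each ASV_## has a count. My code sums the counts, works out the 30% value for each sample, and reports which ASV_## is greater than that 30% value. In the end, it should report a 1 for every ASV over 30%.

This is my code so far: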

    for i in $(cat samplesID-short)
    do
    grep ${i} data.csv | cut -d , -f3 - > ${i}_count_sample
    grep ${i} data.csv | cut -d , -f2 - > ${i}_ASV
    awk '{ sum += $1; } END { print sum; }' ${i}_count_sample > ${i}_counts
    awk '$1=($1/100)*30' ${i}_counts > ${i}_percentage-value

#I was thinking about replicate the numeric value for the entire column and make the comparison "greater than", but the repetition times depend on the ASV counts for each sample, and they are always different.

    wc -l ${i}_ASV > n
    for (( c=1; c<=n; c++)) ; do echo ${i}_percentage-value ; done

    paste <(sed 's/^[[:blank:]]*//' ${i}_ASV) ${i}_count_sample ${i}_percentage-value > ${i}_tmp; 
    awk 'BEGIN{OFS="\t"}{if($2 >= $3) print $1}' ${i}_tmp > ${i}_is30;

#How the output should be:

    paste <(sed 's/^[[:blank:]]*//' ${i}_ASV) ${i}_count_sample ${i}_counts ${i}_percentage-value ${i}_is30 > ${i}_summary_nh
    echo -e "ASV_ID\tASV_in_sample\ttotal_ASVs_inSample\ttreshold_for_30%\tASV_over30%" | cat - ${i}_summary_nh > ${i}_summary
    rm ${i}_count_sample ${i}_counts ${i}_percentage-value ${i}_ASV ${i}_summary_nh ${i}_is30
    done &

【Question comments】:

Tags: bash for-loop awk cycle

【Solution 1】:

You can filter rows based on the value in a column, for example

$ awk '$3>300' data.csv
SampleID    ASV    Count
1000A   ASV_135     434
1000A   ASV_16      361

You can use >= for greater than or equal.

It looks like your script is more complicated than it needs to be.
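To connect this to the question of using a number stored in a file: one common way is to hand the file's content to awk as a variable with -v. The sketch below is only an illustration; the file names demo_counts.txt and demo_threshold.txt and the variable name thr are made up here, and the threshold file is assumed to hold a single number, like ${i}_percentage-value in the question.

```shell
# Sample data (whitespace-separated, like the excerpt in the question)
cat > demo_counts.txt <<'EOF'
SampleID ASV Count
1000A ASV_1216 14
1000A ASV_135 434
1000A ASV_16 361
EOF

# Hypothetical file holding one number, standing in for ${i}_percentage-value
echo 300 > demo_threshold.txt

# -v copies the file's content into the awk variable thr;
# thr+0 forces a numeric comparison, NR>1 skips the header line
result=$(awk -v thr="$(cat demo_threshold.txt)" 'NR>1 && $3 > thr+0 {print $2}' demo_counts.txt)
echo "$result"
```

With this data the command prints ASV_135 and ASV_16, the two rows whose count is above the value read from the file.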

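【Comments】:

I am a beginner, so you are probably right about it being overcomplicated. But the 30% value is always different because it depends on the ASV counts, and it is not the same for every sample. I calculate the 30% for each sample, and then I want to use that value the way you used 300 in your example.
You can use something like: awk 'NR>1{A[$1]+=$3;I[$1]++}END{for(i in A) if (A[i]) print i,A[i]*.3}' data.csv to get the 30% values, and then loop over them.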

【Solution 2】:

This should work

$ awk 'NR==1 || $3>$1*3/10' file

SampleID    ASV    Count
1000A   ASV_135     434
1000A   ASV_16      361

Or, with an indicator column

$ awk 'NR==1{print $0, "Ind"} NR>1{print $0, ($3>$1*3/10)}' file | column -t

SampleID  ASV        Count  Ind
1000A     ASV_1216   14     0
1000A     ASV_12580  150    0
1000A     ASV_12691  260    0
1000A     ASV_135    434    1
1000A     ASV_147    79     0
1000A     ASV_15     287    0
1000A     ASV_16     361    1
1000A     ASV_184    8      0
1000A     ASV_19     42     0
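One detail worth checking before reusing these commands: $1 is the SampleID column, and in arithmetic awk converts a string such as 1000A to the number formed by its leading digits (1000). The threshold in $3>$1*3/10 is therefore 300 for sample 1000A, not 30% of the summed counts (the excerpt sums to 1635, so 30% would be 490.5). A quick check, using one row of the excerpt:

```shell
# In arithmetic, awk uses the leading numeric prefix of "1000A", i.e. 1000
threshold=$(echo "1000A ASV_135 434" | awk '{print $1*3/10}')
echo "$threshold"
```

This prints 300. If the threshold is meant to be 30% of each sample's total, the total has to be computed first, for instance in a separate pass over the data.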

【Comments】:

【Solution 3】:

Would you please try the following:

awk -v OFS="\t" '
    NR==FNR {   # this block is executed in the 1st pass only
        if (FNR > 1) sum[$1] += $3
                # accumulate the "count" for each "SampleID"
        next
    }
                # the following block is executed in the 2nd pass only
    FNR > 1 {   # skip the header line
        if ($1 != prev_id) {
                # SampleID has changed. then update the output filename and print the header line
            if (outfile) close(outfile)
                # close previous outfile
            outfile = $1 "_summary"
            print "ASV_ID", "ASV_in_sample", "total_ASVs_inSample", "treshold_for_30%", "ASV_over30%" >> outfile
            prev_id = $1
        }
        mark = ($3 > sum[$1] * 0.3) ? 1 : 0
                # set the mark to "1" if the "Count" exceeds 30% of sum
        print $2, $3, sum[$1], sum[$1] * 0.3, mark >> outfile
                # append the line to the summary file
    }
' data.csv data.csv

data.csv:

SampleID    ASV    Count
1000A   ASV_1216    14
1000A   ASV_12580   150
1000A   ASV_12691   260
1000A   ASV_135     434
1000A   ASV_147     79
1000A   ASV_15      287
1000A   ASV_16      361
1000A   ASV_184     8
1000A   ASV_19      42
1000B   ASV_1       90
1000B   ASV_2       90
1000B   ASV_3       20
1000C   ASV_4       100
1000C   ASV_5       10
1000C   ASV_6       10

In the following output examples, the last field ASV_over30% is 1 if the count exceeds 30% of the sum.

1000A_summary:

ASV_ID  ASV_in_sample   total_ASVs_inSample     treshold_for_30%        ASV_over30%
ASV_1216        14      1635    490.5   0
ASV_12580       150     1635    490.5   0
ASV_12691       260     1635    490.5   0
ASV_135 434     1635    490.5   0
ASV_147 79      1635    490.5   0
ASV_15  287     1635    490.5   0
ASV_16  361     1635    490.5   0
ASV_184 8       1635    490.5   0
ASV_19  42      1635    490.5   0

1000B_summary:

ASV_ID  ASV_in_sample   total_ASVs_inSample     treshold_for_30%        ASV_over30%
ASV_1   90      200     60      1
ASV_2   90      200     60      1
ASV_3   20      200     60      0

1000C_summary:

ASV_ID  ASV_in_sample   total_ASVs_inSample     treshold_for_30%        ASV_over30%
ASV_4   100     120     36      1
ASV_5   10      120     36      0
ASV_6   10      120     36      0

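[Explanation]

To calculate an aggregate of the input data, such as an average or (as here) a per-sample sum, we have to read through to the end of the data. If we want to print each input record together with that aggregate (or other information derived from it), we need a trick:

Store all the input records in memory.
Read the input data twice.

Since awk handles multiple input files well and can change its processing depending on which file it is reading, I picked the second method.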
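For comparison, the first method (keeping the records in memory) could look roughly like the sketch below. It is not the approach used in this answer: the array names and the file name demo_data.txt are invented for illustration, and it prints to standard output instead of writing per-sample summary files.

```shell
cat > demo_data.txt <<'EOF'
SampleID ASV Count
1000B ASV_1 90
1000B ASV_2 90
1000B ASV_3 20
1000C ASV_4 100
1000C ASV_5 10
1000C ASV_6 10
EOF

result=$(awk -v OFS="\t" '
    NR > 1 {                     # skip the header line
        n++                      # number of data records stored so far
        id[n] = $1; asv[n] = $2; cnt[n] = $3 + 0
        sum[$1] += $3            # running total per SampleID
    }
    END {
        # every total is known now, so replay the stored records in input order
        for (k = 1; k <= n; k++)
            print id[k], asv[k], cnt[k], sum[id[k]] * 0.3, (cnt[k] > sum[id[k]] * 0.3)
    }
' demo_data.txt)
echo "$result"
```

This reads the input only once, at the cost of holding every data record in memory until the END block runs.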

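The condition NR==FNR is TRUE only while the first file is being read. In this block we accumulate the sum of the Count field as the first pass.
The next statement at the end of the block skips the code that follows it.
When the first file is finished, the script reads the second file, which is of course the same file as the first.
While the second file is being read, the condition NR==FNR is no longer TRUE, so the first block is skipped.
The second block reads the input again line by line, opens a file for output, and adds information such as the per-sample sum obtained in the first pass.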
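The behaviour of NR and FNR described above can be seen in a small sketch that passes the same file twice, as the answer does with data.csv data.csv. The file name demo_pass.txt and the "1st pass"/"2nd pass" labels are made up for this illustration.

```shell
printf 'a\nb\n' > demo_pass.txt

# NR counts records across all inputs; FNR restarts at 1 for each file,
# so NR==FNR holds only while the first file argument is being read
result=$(awk '{ print NR, FNR, (NR==FNR ? "1st pass" : "2nd pass"), $0 }' demo_pass.txt demo_pass.txt)
echo "$result"
```

The two lines read from the first argument are labelled "1st pass" (NR and FNR both 1, then both 2); on the second argument NR continues with 3 and 4 while FNR restarts at 1, so those lines are labelled "2nd pass". Note that this idiom assumes the first file is not empty.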

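【Comments】:

Thank you very much, it works! But it stops when a sample has many ASVs; for example, it stopped at 89 ASVs, saying "awk: 1004A_summary make too many open files". Since I also have samples with more than 500 ASVs, how can I get around this limit?
Maybe it is because the real "data.csv" I am using is 17MB in size.
@camcecc10 Thank you for the feedback. To be honest, I did not expect you to have so many sample IDs :). I have updated my code to close the previous file to avoid the error. Would you please try it? BTW I am assuming the SampleIDs in data.csv are sorted into contiguous blocks rather than scattered randomly. Otherwise the header line is written to the *_summary files multiple times. Is my assumption correct? BR.
Your assumption is correct, and with that addition the code works great! Thank you very much, I hope one day I can code like you, you saved me!
Sorry to bother you again, but could you briefly explain how it uses the two inputs?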