awk geometric average on the same row value: answers

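【Question title】: awk geometric average on the same row value
【Posted】: 2014-08-17 01:09:23
【Question description】:

I have the input below, and I want to compute the geometric mean of "activity" over the rows that share the same "Cpd_number" and "ID3". The files contain a lot of data, so we may need arrays to do the trick. As an awk beginner, though, I am not sure how to start. Could anyone offer some hints?

Input: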

“ID1”,“Cpd_number”, “ID2”,”ID3”,”activity”
“95”,“123”,”4”,”5”,”10”
“95”, “123”,”4”,”5”,”100”
“95”, “123”,”4”,”5”,”1”
“95”, “123”,”4”,”6”,”10”
“95”, “123”,”4”,”6”,”100”
“95”, “456”,”4”,”6”,”10”
“95”, “456”,”4”,”6”,”100”

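The three rows “95”, “123”, “4”, “5” should be combined into one geometric mean.

The two rows “95”, “123”, “4”, “6” should be combined into one geometric mean.

The two rows “95”, “456”, “4”, “6” should be combined into one geometric mean.

This is the desired output: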

“ID1”,“Cpd_number”, “ID2”,”ID3”,”activity”
“95”,“123”,”4”,”5”,”10”
“95”, “123”,”4”,”6”,”31.62”
“95”, “456”,”4”,”6”,”31.62”

Some information about the geometric mean:

http://en.wikipedia.org/wiki/Geometric_mean

This script computes the geometric mean of the first column:

 #!/usr/bin/awk -f
 {
   b  = $1;   # value of 1st column
   C += log(b);  
   D++; 
 }

 END {
   print "Geometric mean  : ",exp(C/D);
   }
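
As a brief aside on why the script sums logarithms rather than multiplying values: it relies on the identity below, shown here with the numbers from the example groups. The symbols x_i and n are introduced only for illustration; n plays the role of D and the sum of logarithms plays the role of C in the script.

```latex
\[
\left(\prod_{i=1}^{n} x_i\right)^{1/n}
  = \exp\!\left(\frac{1}{n}\sum_{i=1}^{n} \ln x_i\right)
\]
\[
(10 \cdot 100 \cdot 1)^{1/3} = 1000^{1/3} = 10,
\qquad
(10 \cdot 100)^{1/2} = \sqrt{1000} \approx 31.62
\]
```

Summing logarithms also avoids overflowing the running product when a group has many rows.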

【Question comments】:

Tags: arrays bash awk mean

【Solution 1】:

Given this file:

$ cat infile
"ID1","Cpd_number","ID2","ID3","activity"
"95","123","4","5","10"
"95","123","4","5","100"
"95","123","4","5","1"
"95","123","4","6","10"
"95","123","4","6","100"
"95","456","4","6","10"
"95","456","4","6","100"

this snippet:

awk -F\" 'BEGIN{print}            # Runs before any input is read, so this prints an empty line
      last != $4""$8 && last{     # ONLY when the last key "Cpd_number + ID3"
          print line,exp(C/D)     # differs from the current one: print line + average
          C=D=0}                  # reset accumulators
      { # This block processes each line of infile
       C += log($(NF-1)+0)        # C: running sum of log(activity)
       D++                        # D: row counter
       $(NF-1)=""                 # Blank the activity column in order to print the line
       line=$0                    # line is the current line without activity
       last=$4""$8}               # Store the key in order to track key changes
      END{ # This block runs after the whole file has been read,
           # to print the last average, which cannot be triggered
           # in the previous block
          print line,exp(C/D)}' infile

prints:

 ID1 , Cpd_number ,  ID2 , ID3 ,   0
 95 ,  123 , 4 , 5 ,   10
 95 ,  123 , 4 , 6 ,   31.6228
 95 ,  456 , 4 , 6 ,   31.6228

The formatting still needs some work.

NOTE: the straight " character is used instead of “ and ”.
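
Since the question mentions arrays and the formatting above is left unfinished, here is a minimal sketch of an array-based variant. It is not the answer author's method: the function name geomean_by_key and the array names are hypothetical, it assumes straight " quotes and exactly five quoted columns, and it formats every mean with two decimals (so the first group prints as "10.00" rather than "10").

```shell
# Sketch: group rows by (Cpd_number, ID3) using awk arrays.
# Assumes straight double quotes and five quoted columns per row.
geomean_by_key() {
    awk -F'"' '
    NR == 1 { print; next }            # pass the header line through unchanged
    NF < 10 { next }                   # skip blank or short lines
    {
        key = $4 SUBSEP $8             # Cpd_number and ID3 form the group key
        if (!(key in n)) {             # first time this key is seen
            order[++groups] = key      # remember first-seen order
            prefix[key] = "\"" $2 "\",\"" $4 "\",\"" $6 "\",\"" $8 "\""
        }
        logsum[key] += log($10)        # running sum of log(activity)
        n[key]++                       # number of rows in this group
    }
    END {
        for (i = 1; i <= groups; i++) {
            key = order[i]
            printf "%s,\"%.2f\"\n", prefix[key], exp(logsum[key] / n[key])
        }
    }' "$@"
}
```

It could be called as `geomean_by_key infile`. Unlike the snippet above, rows with the same key do not have to be adjacent, at the cost of holding one entry per group in memory.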

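EDIT: NF is the number of fields in the current record, so NF-1 is the second-to-last field. With " as the separator, the last field is the empty string after the closing quote, so $(NF-1) is the activity value:

$ awk -F\" 'BEGIN{getline}{print $(NF-1)}' infile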
10
100
1
10
100
10
100

So in log($(NF-1)+0) we apply the logarithm function to that value (adding 0 forces it to be treated as a number).

D++ is just a counter.
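
To make the field numbering concrete, the following small sketch prints NF, $4, $8 and $(NF-1) for one row. The helper name show_fields is hypothetical, and the NF > 1 guard is an addition: on an empty line NF is 0, so $(NF-1) would ask for field -1, which gawk rejects with "attempt to access field -1".

```shell
# Show how -F'"' splits one quoted CSV row (straight quotes assumed).
# Fields: $1="" $2=95 $3="," $4=123 $5="," $6=4 $7="," $8=5 $9="," $10=10 $11=""
show_fields() {
    awk -F'"' 'NF > 1 { print NF, $4, $8, $(NF-1) }'
}

echo '"95","123","4","5","10"' | show_fields   # prints: 11 123 5 10
```

So $4 is Cpd_number, $8 is ID3, and $(NF-1) is the activity value.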

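【Comments】:

Also, may I ask what "$8" means here?
I have to admit, I like the awk solution.
Hi klashxx, could you explain this script? I am having a hard time fully understanding it, probably because I am an awk beginner. Also, every time I run the script I get an "error" message: "awk: cmd.line:4: (FILENAME=infile FNR=10) fatal: attempt to access field -1". May I know what this means?
For the explanation, see my edited answer, @HenrySu. As for the error you are getting, I suspect it is related to your file format. Note: I used the " character instead of “ and ”.
Thanks klashxx. Could you give me some idea about "C += log($(NF-1)+0)" and "D++"? I know NF = "number of fields in the record", but I still cannot work out what these two lines mean. Thanks!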

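【Solution 2】:

Why use awk at all? Just do it in bash, using bc or calc for the floating-point math. You can download calc from http://www.isthe.com/chongo/src/calc/ (2.12.4.13-11 was the latest at the time of writing). RPMs, binaries, and source tarballs are available. In my opinion it is far superior to bc. The routine is fairly simple. You first need to strip the extraneous " quotes from the data file, leaving a plain CSV file. That helps. See the sed command in the comments below. Note that the geometric mean below is the 4th root of (id1*cpd*id2*id3). If you need a different mean, just adjust the following code: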

#!/bin/bash

##
##  You must strip all quotes from data before processing, or write more code to do
##  it here. Just do:  $ sed 's/"//g' < datafile > newdatafile
##  Then use newdatafile as command line argument to this program
##
##  Additionally, this script uses 'calc' for floating point math. go download it
##  from: http://www.isthe.com/chongo/src/calc/ (2.12.4.13-11 is latest). You can also
##  use bc if you like, but why, calc is so much better.
##

## test to make sure file passed as argument is readable
test -r "$1" || { echo "error: invalid input, usage: ${0//*\//} filename"; exit 1; }

## function to strip extraneous whitespace from input
trimWS() {
    [[ -z $1 ]] && return 1
    strln="${#1}"
    [[ strln -lt 2 ]] && return 1
    trimSTR=$1
    trimSTR="${trimSTR#"${trimSTR%%[![:space:]]*}"}"  # remove leading whitespace characters
    trimSTR="${trimSTR%"${trimSTR##*[![:space:]]}"}"  # remove trailing whitespace characters
    echo $trimSTR
    return 0
}

let cnt=0
let oldsum=0    # holds value to compare against new Cpd_number & ID3
product=1       # initialize product to 1
pcnt=0          # initialize the number of values in product
IFS=$',\n'      # Internal Field Separator, set to break on ',' or newline

while read newid1 newcpd newid2 newid3 newact || test -n "$act"; do

    cpd=`trimWS $cpd`  # trimWS from cpd (only one that needed it)

    # if first iteration, just output first row
    test "$cnt" -eq 0 && echo " $newid1 $newcpd $newid2 $newid3 $newact"

    # after first iteration, test oldsum -ne sum, if so do geometric mean
    # and reset product and counters
    if test "$cnt" -gt 0 ; then

        # caveat: a sum is a weak key, e.g. (123,6) and (124,5) both give 129
        sum=$((newcpd+newid3))   # calculate sum to test against oldsum
        if test "$oldsum" -ne "$sum" && test "$cnt" -gt 1; then
            # geometric mean (nth root of product)
            # mean=`calc -p "root ($product, $pcnt)"`  # using calc
            mean=`echo "scale=6; e( l($product) / $pcnt)" | bc -l` # using bc
            echo " $id1 $cpd $id2 $id3  average: $mean"
            pcnt=0
            product=1
        fi

        # update last values to new values
        oldsum=$sum
        id1="$newid1"
        cpd="$newcpd"
        id2="$newid2"
        id3="$newid3"
        act="$newact"

        ((product*=act))  # accumulate product
        ((pcnt+=1))
    fi

    ((cnt+=1))

done < "$1"

Output:

# output using calc
ID1 Cpd_number  ID2 ID3 activity
95 123 4 5  average: 10
95 123 4 6  average: 31.62277660168379331999
95 456 4 6  average: 31.62277660168379331999

# output using bc
ID1 Cpd_number  ID2 ID3 activity
95 123 4 5  average: 9.999999
95 123 4 6  average: 31.622756
95 456 4 6  average: 31.622756

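The updated script computes the correct mean. It is more involved because the old and new values have to be kept in order to detect changes in cpd and id3. awk may well be the simpler approach here. But if you need more flexibility later, bash may be the answer.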
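
The script expects the quotes to have been stripped beforehand. A minimal sketch of that preprocessing step follows; the function name strip_quotes is hypothetical, and the second expression, which removes spaces after commas, is an optional extra that goes beyond the sed command mentioned in the script comments.

```shell
# Strip straight double quotes, then drop any spaces that follow a comma.
strip_quotes() {
    sed -e 's/"//g' -e 's/, */,/g' "$@"
}
```

For example, `strip_quotes datafile > newdatafile` would produce the file to pass to the script as its argument.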

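【Comments】:

Thanks David! May I know what "bc" stands for?
bc is an arbitrary-precision calculator language, usually installed alongside bash. Henry, I misunderstood what you wanted. The code above reads the data but computes the mean of the id and cpd values, rather than the mean of the activity values for rows where cpd and id3 are equal. I understand now what you are trying to do.
@HenrySu I will pull the answer until I get a chance to fix the mean calculation. Let me know if you would rather I leave it up, or copy the code before I pull it down. I will wait a few minutes.
Thanks, I have copied the script! Feel free to pull it down if you like :)
Henry, I updated the script to compute the correct mean of the activity values.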