[Question title]: Counting field occurrences
[Posted]: 2013-09-07 23:57:20
[Question]:

How can I count the unique strings across multiple columns and display their counts, using only awk?

My input file c.txt:

US A one
IN A two
US B one
LK C one
US B two
US A three
IN A three
US B one
LK C two
US B three
US A one
IN A one
US B three
LK C three
US B two
US A two
IN A two
US B two
LK C three
US B two
US A one
IN A two
US B one
LK C one
US B two
US A three
IN A three
US B one
LK C two
US B three
US A one
IN A one
US B three
LK C three
US B two
US A two
IN A two
US B two
LK C three
US B two
US A one
IN A two
US B one
LK C one
US B two
US A three
IN A three
US B one
LK C two
US B three
US A one
IN A one
US B three
LK C three
US B two
US A two
IN A two
US B two
LK C three
US B two
US A one
IN A two
US B one
LK C one
US B two
US A three
IN A three
US B one
LK C two
US B three
US A one
IN A one
US B three
LK C three
US B two
US A two
IN A two
US B two
LK C three
US B two
US A one
IN A two
US B one
LK C one
US B two
US A three
IN A three
US B one
LK C two
US B three
US A one
IN A one
US B three
LK C three
US B two
US A two
IN A two
US B two
LK C three
US B two

I can do this, but only with three separate commands. Is it possible to get all of the output from a single command?

awk '{a[$1]++}END{for (i in a)print i,a[i]}' c.txt
awk '{a[$1" "$2]++}END{for (i in a)print i,a[i]}' c.txt
awk '{a[$1" "$2" "$3]++}END{for (i in a)print i,a[i]}' c.txt

My desired output is:

IN 20 A 20 one 5
IN 20 A 20 three 5
IN 20 A 20 two 10
LK 20 C 20 one 5
LK 20 C 20 three 10
LK 20 C 20 two 5
US 60 A 20 one 10
US 60 A 20 three 5
US 60 A 20 two 5
US 60 B 40 one 10
US 60 B 40 three 10
US 60 B 40 two 20

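Column 2 is the number of input lines that share the column 1 value.

Column 4 is the number of input lines that share the combination of columns 1 and 2.

Column 6 is the number of input lines that share the combination of columns 1, 2, and 3.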

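[Discussion]:

  • Please use a smaller data sample for your example. We don't want to scroll down.
  • Good job providing sample input and output, and posting what you have already tried and what you expected to happen. You will find this formatting guide helpful when posting future questions: stackoverflow.com/help/formatting

Tags: awk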


[Solution 1]:

With GNU awk, you can use the following script:

$ cat count.awk 
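{
    lines[$0]=$0            # remember each distinct input line (only the key matters)
    count1[$1]++            # occurrences of column 1
    count2[$1,$2]++         # occurrences of columns 1 and 2 together
    count3[$1,$2,$3]++      # occurrences of columns 1, 2 and 3 together
}
END{
    # asorti (GNU awk only) replaces lines with its own indices in sorted order,
    # so lines[1..n] now hold the distinct input lines, sorted
    n = asorti(lines)
    for (i=1;i<=n;i++) {
        split(lines[i],field,FS)
        total1 = count1[field[1]]
        total2 = count2[field[1],field[2]]
        total3 = count3[field[1],field[2],field[3]]

        print field[1],total1,field[2],total2,field[3],total3
    }
}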

To run the script on your file:

$ awk -f count.awk file 
IN 20 A 20 one 5
IN 20 A 20 three 5
IN 20 A 20 two 10
LK 20 C 20 one 5
LK 20 C 20 three 10
LK 20 C 20 two 5
US 60 A 20 one 10
US 60 A 20 three 5
US 60 A 20 two 5
US 60 B 40 one 10
US 60 B 40 three 10
US 60 B 40 two 20
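
Note that `asorti` is a GNU awk extension, so the script above will not run under other awk implementations. As a rough sketch of a variant that stays within POSIX awk: read the file twice, counting on the first pass and printing each distinct combination once on the second, then let `sort` do the ordering. The function name `count_fields` is made up for illustration, and the approach assumes the input is a regular file that can be read twice (not a pipe).

```shell
# Hypothetical helper: count_fields FILE
# Pass 1 builds the three counters; pass 2 prints each distinct line once.
count_fields() {
    awk '
        NR == FNR {                  # true only while reading the first copy
            c1[$1]++
            c2[$1, $2]++
            c3[$1, $2, $3]++
            next
        }
        !seen[$1, $2, $3]++ {        # first time this combination is seen in pass 2
            print $1, c1[$1], $2, c2[$1, $2], $3, c3[$1, $2, $3]
        }
    ' "$1" "$1" | LC_ALL=C sort      # fixed locale so the ordering is predictable
}
```

Calling `count_fields c.txt` should produce the same twelve rows as above; I have only reasoned this through on a small sample, so treat it as a starting point.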

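[Discussion]:

  • If you look at the desired output, the last three lines have B 20 instead of B 40. This is probably a mistake in the question.
  • I have edited the question to remove this error. The OP describes what the output columns should be, so it is safe to assume it was a mistake. Your answer also gives the same output, which convinces me that I did not make a simple mistake myself.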
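
The reasoning in these comments can be checked mechanically: within each column 1 / column 3 group, the values in column 6 should add up to the value in column 4 (for US B, 10 + 10 + 20 = 40, not 20). A small sketch of such a check, where `check_totals` is a hypothetical name and the six-column layout from the question is assumed:

```shell
# Hypothetical helper: reads the six-column result on stdin and reports every
# group whose column 4 differs from the sum of column 6 over that group.
check_totals() {
    awk '
        {
            sum[$1, $3] += $6            # add up the per-row counts
            claimed[$1, $3] = $4         # the group total the table claims
            label[$1, $3] = $1 " " $3
        }
        END {
            bad = 0
            for (k in sum)
                if (sum[k] != claimed[k]) {
                    print "mismatch: " label[k] " claims " claimed[k] ", rows add up to " sum[k]
                    bad = 1
                }
            exit bad                     # non-zero status when something is off
        }
    '
}
```

Feeding it the three US B rows with `B 20` reports a mismatch, while the corrected rows with `B 40` pass silently.
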
[Solution 2]:

Try this awk one-liner:

$ awk '{a[$1]++;b[$1,$2]++;c[$1,$2,$3]++}END{for (i in c) {split (i, d, SUBSEP); print d[1],a[d[1]],d[2],b[d[1],d[2]],d[3],c[d[1],d[2],d[3]] } }' file | sort
IN 20 A 20 one 5
IN 20 A 20 three 5
IN 20 A 20 two 10
LK 20 C 20 one 5
LK 20 C 20 three 10
LK 20 C 20 two 5
US 60 A 20 one 10
US 60 A 20 three 5
US 60 A 20 two 5
US 60 B 40 one 10
US 60 B 40 three 10
US 60 B 40 two 20

Or in a more readable format:

$ awk '
    {
        a[$1]++
        b[$1,$2]++
        c[$1,$2,$3]++
    }
    END{
        for (i in c) {
            split (i, d, SUBSEP); 
            print d[1], a[d[1]],
                  d[2], b[d[1], d[2]],
                  d[3], c[d[1], d[2], d[3]] 
        } 
    }' file | sort
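
In case the `split (i, d, SUBSEP)` step looks opaque: awk has no true multi-dimensional arrays, so a subscript such as `b[$1,$2]` is stored under one string key, with the parts joined by the built-in variable `SUBSEP`. Splitting the key on `SUBSEP` recovers the parts. Here is a tiny illustration (the function name `subsep_demo` and the values in it are made up):

```shell
# Hypothetical demo: how a two-part subscript is stored and split back.
subsep_demo() {
    awk 'BEGIN {
        b["US", "A"] = 40                  # stored under the key "US" SUBSEP "A"
        for (k in b) {
            n = split(k, part, SUBSEP)     # recover the two components
            print n, part[1], part[2], b[part[1], part[2]]
        }
        # the comma form and explicit concatenation name the same element
        if (("US" SUBSEP "A") in b) print "same key"
    }'
}
```

The trailing `| sort` in the answer is needed because the order in which `for (i in c)` visits keys is unspecified.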
