【Question Title】: Remove duplicates, but keeping only the last occurrence in linux file [duplicate]
【Posted】: 2016-10-11 11:22:20
【Question Description】:

Input file:

5,,OR1,1000,Nawras,OR,20160105T05:30:17+0400,20181231T23:59:59+0400,,user,,aaa8016058f008ddceae6329f0c5d551,50293277591,,,30001,C
5,,OR1,1000,Nawras,OR,20160105T05:30:17+0400,20181231T23:59:59+0400,20160217T01:45:18+0400,,user,aaa8016058f008ddceae6329f0c5d551,50293277591,,,30001,H
5,,OR2,2000,Nawras,OR,20160216T06:30:18+0400,20191231T23:59:59+0400,,user,,f660818af5625b3be61fe12489689601,50328589469,,,30002,C
5,,OR2,2000,Nawras,OR,20160216T06:30:18+0400,20191231T23:59:59+0400,20160216T06:30:18+0400,,user,f660818af5625b3be61fe12489689601,50328589469,,,30002,H
5,,OR1,1000,Nawras,OR,20150328T03:00:13+0400,20171230T23:59:59+0400,,user,,22bf18b024e1d4f42ac79943062cf576,50212935879,,,10001,C
5,,OR1,1000,Nawras,OR,20150328T03:00:13+0400,20171230T23:59:59+0400,20150328T03:00:13+0400,,user,22bf18b024e1d4f42ac79943062cf576,50212935879,,,10001,H
0,,OR5,5000,Nawras,OR,20160421T02:45:16+0400,20191231T23:59:59+0400,,user,,c7c501ac92d85a04bb26c575929e9317,50329769192,,,11001,C
0,,OR5,5000,Nawras,OR,20160421T02:45:16+0400,20191231T23:59:59+0400,20160421T02:45:16+0400,,user,c7c501ac92d85a04bb26c575929e9317,50329769192,,,11001,H
0,,OR1,1000,Nawras,OR,20160330T02:00:14+0400,20181231T23:59:59+0400,,user,,d4ea749306717ec5201d264fc8044201,50285524333,,,11001,C

Desired output:

5,,OR1,1000,UY,OR,20160105T05:30:17+0400,20181231T23:59:59+0400,20160217T01:45:18+0400,,user,aaa8016058f008ddceae6329f0c5d551,50293277591,,,30001,H
5,,OR2,2000,UY,OR,20160216T06:30:18+0400,20191231T23:59:59+0400,20160216T06:30:18+0400,,user,f660818af5625b3be61fe12489689601,50328589469,,,30002,H
5,,OR1,1000,UY,OR,20150328T03:00:13+0400,20171230T23:59:59+0400,20150328T03:00:13+0400,,user,22bf18b024e1d4f42ac79943062cf576,50212935879,,,10001,H
0,,OR5,5000,UY,OR,20160421T02:45:16+0400,20191231T23:59:59+0400,20160421T02:45:16+0400,,user,c7c501ac92d85a04bb26c575929e9317,50329769192,,,11001,H
0,,OR1,1000,UY,OR,20160330T02:00:14+0400,20181231T23:59:59+0400,,user,,d4ea749306717ec5201d264fc8044201,50285524333,,,11001,C

Code used:

for i in `cat file | awk -F, '{print $13}' | sort | uniq`
do
grep $i file | tail -1 >> TESTINGGGGGGG_SV
done

This takes a very long time, because the file has 300 million records and 65 million unique values in column 13.

So what I need as output is, for each value in column 13, the last line in the file containing that value.

【Comments】:

  • perl -F, -le '$seen{$F[12]} = $_; END { print $seen{$_} for sort keys %seen }'

Tags: linux shell awk


【Solution 1】:

awk to the rescue!

awk -F, 'p!=$13 && p0 {print p0} {p=$13; p0=$0} END{print p0}' file

Expects sorted input.
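For example, on unsorted data one way to satisfy that requirement (my own sketch, not part of the answer; the sample rows are made up, with only field 13 and the last field populated meaningfully) is to group rows by field 13 with a stable `sort` first:

```shell
# Tiny made-up sample: field 13 (k1/k2) is the dedup key.
printf '%s\n' \
  'a,,,,,,,,,,,,k1,,,1,C' \
  'z,,,,,,,,,,,,k1,,,1,H' \
  'b,,,,,,,,,,,,k2,,,2,C' > sample.csv

# -s keeps the original order within equal keys, so the last row of
# each group is still the file's last occurrence of that key.
sort -t, -k13,13 -s sample.csv |
  awk -F, 'p!=$13 && p0 {print p0} {p=$13; p0=$0} END{print p0}'
```

Note that the output comes out grouped by key, not in the original file order.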

If you can run the script successfully, please post the timings.

If sorting is not feasible, an alternative is

tac file | awk -F, '!a[$13]++' | tac

Reverse the file, take the first entry for each $13, and reverse the result back.
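A quick demonstration on a three-line made-up sample (only field 13 and the last field are populated meaningfully):

```shell
printf '%s\n' \
  'x,,,,,,,,,,,,k1,,,1,C' \
  'y,,,,,,,,,,,,k2,,,2,C' \
  'z,,,,,,,,,,,,k1,,,1,H' > sample.csv

# Reversed, the last occurrence of each $13 becomes the first one,
# which '!a[$13]++' keeps; the final tac restores the original order.
tac sample.csv | awk -F, '!a[$13]++' | tac
```

This preserves the original file order of the surviving lines, but awk still holds one array entry per distinct key, which matters with 65 million keys.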

【Comments】:

    【Solution 2】:

    Here is a solution that should work:

    awk -F, '{rows[$13]=$0} END {for (i in rows) print rows[i]}' file
    

    Explanation:

    • rows is an associative array indexed by field 13, $13; whenever field 13 repeats, the array element indexed by $13 is overwritten. Its value is the whole line $0.

    But this is memory-inefficient, because holding the array takes space.

    An improvement over the above no-sort solution is to store only the line numbers in the associative array:

    awk -F, '{rows[$13]=NR} END {for(i in rows) print rows[i]}' file | while read lN; do sed "${lN}q;d" file; done
    

    Explanation:

    • rows is as before, but the value is the line number instead of the whole line
    • awk -F, '{rows[$13]=NR} END {for(i in rows) print rows[i]}' file outputs the list of line numbers of the lines being searched for
    • sed "${lN}q;d" file fetches line number lN
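    One sed invocation per key re-scans the file from the top each time, so with 65 million keys this stays slow. A two-pass awk variant (my own sketch, not part of the answer; sample rows are made up) keeps just one line number per key and prints the matching lines in a single second pass:

```shell
printf '%s\n' \
  'x,,,,,,,,,,,,k1,,,1,C' \
  'y,,,,,,,,,,,,k2,,,2,C' \
  'z,,,,,,,,,,,,k1,,,1,H' > sample.csv

# Pass 1 (NR==FNR, first copy of the file): remember the last line
# number seen for each $13.
# Pass 2 (second copy): print a line only if its line number matches
# the remembered last occurrence of its key.
awk -F, 'NR==FNR {last[$13]=FNR; next} last[$13]==FNR' sample.csv sample.csv
```

    This reads the file twice but preserves the original order of the surviving lines and stores only an integer per key.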

    【Comments】:

    • Have you thought about how much memory your program will use? 65 million unique records. If each record is 50 bytes, that comes to roughly 3 GB of raw data, and that is before counting what AWK needs to maintain the array's structure. Do the math yourself: perl -le 'print 65_000_000 * 50 / 1024 / 1024 / 1024'