【Question title】: Optimize multiple sed statements
【Posted】: 2014-02-11 10:46:32
【Question】:

I want to optimize how I process a file with this structure:

2014-01-21 14:26:05.900,2014-01-21 14:26:05.740,    0.000,    192.168.40.2,   192.168.40.26,6    ,  8000, 33311,  172000,    2000,.A..S.,  0
2014-01-21 14:29:23.900,2014-01-21 14:29:23.340,    0.000,   192.168.40.26,    192.168.40.2,6    , 33317,  8000, 3052000,    2000,.A....,  0
2014-01-21 14:30:25.900,2014-01-21 14:30:25.330,    0.000,   192.168.40.26,    192.168.40.2,17   , 36193,   514,  558000,    2000,......,  0
2014-01-21 14:31:04.901,2014-01-21 14:31:04.222,    0.000,  192.168.40.242,    192.168.40.2,17   , 57516,   514,  422000,    2000,......,  0
2014-01-21 14:31:13.900,2014-01-21 14:31:13.143,    0.000,   192.168.40.16,    192.168.40.2,17   , 53313,   514,  540000,    2000,......,  0

into a file with this structure:

2014-01-21 14:26:05.900,900,0.000,192.168.40.2,192.168.40.26,6,8000,33311,172000,2000,.A..S.,0
2014-01-21 14:29:23.900,900,0.000,192.168.40.26,192.168.40.2,6,33317,8000,3052000,2000,.A....,0
2014-01-21 14:30:25.900,900,0.000,192.168.40.26,192.168.40.2,17,36193,514,558000,2000,......,0
2014-01-21 14:31:04.901,901,0.000,192.168.40.242,192.168.40.2,17,57516,514,422000,2000,......,0
2014-01-21 14:31:13.900,900,0.000,192.168.40.16,192.168.40.2,17,53313,514,540000,2000,......,0

The commands to optimize:

sed -e 's/,\s\+/,/g' -i /tmp/to_filter
sed -e 's/\s\+,/,/g' -i /tmp/to_filter
while IFS=, read -r f1 f2 f3 f4 f5 f6 f7 f8 f9 f10; do
    echo "$f1,${f1##*.},$f3,$f4,$f5,$f6,$f7,$f8,$f9,$f10"
done < /tmp/to_filter

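【Comments】:

  • Simply put, you can combine the first two sed operations into a single command by using two -e options. You should also pipe the output of sed straight into the while loop rather than rewriting the file. Think of temporary files as "a dirty hack". Of course they are sometimes necessary, and when they are, use them without hesitation. But don't use them when you don't need to. Beyond that, you have a concurrency problem (the file name is fixed, so two people running the script at the same time will interfere with each other), and you also have a cleanup problem.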
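A minimal sketch of what that comment suggests, assuming the same input layout as in the question: a single sed process with two -e expressions, piped directly into the loop so that no temporary file is rewritten. The function name normalize is hypothetical, and `[[:space:]]\{1,\}` is used here as a portable stand-in for the GNU-specific `\s\+`.

```shell
# Sketch only: "normalize" is a hypothetical helper name, not part of the original post.
# $1 is the path of the comma-separated input file.
normalize() {
    # One sed process, two expressions: strip whitespace after and before each comma.
    sed -e 's/,[[:space:]]\{1,\}/,/g' -e 's/[[:space:]]\{1,\},/,/g' "$1" |
    # Read the first two fields separately; "rest" receives the remainder of the line.
    while IFS=, read -r f1 f2 rest; do
        # Replace the second field with whatever follows the last "." in the first field.
        printf '%s,%s,%s\n' "$f1" "${f1##*.}" "$rest"
    done
}
```

Because the input file is only read and never modified in place, two concurrent runs do not interfere with each other and there is nothing to clean up afterwards.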

Tags: python perl bash sed awk


【Solution 1】:
awk 'BEGIN{FS=OFS=","} {t=$2=$1; sub(/.*\./,"",$2); gsub(/ /,""); $1=t} 1' file      
2014-01-21 14:26:05.900,900,0.000,192.168.40.2,192.168.40.26,6,8000,33311,172000,2000,.A..S.,0
2014-01-21 14:29:23.900,900,0.000,192.168.40.26,192.168.40.2,6,33317,8000,3052000,2000,.A....,0
2014-01-21 14:30:25.900,900,0.000,192.168.40.26,192.168.40.2,17,36193,514,558000,2000,......,0
2014-01-21 14:31:04.901,901,0.000,192.168.40.242,192.168.40.2,17,57516,514,422000,2000,......,0
2014-01-21 14:31:13.900,900,0.000,192.168.40.16,192.168.40.2,17,53313,514,540000,2000,......,0

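【Comments】:

  • This looks great, except that the second field should contain what lies between the . and the , in the first field.
  • You changed the input file, didn't you? OK, I updated the script to copy $1 into $2 before the sub().
  • This one wins the test! Processing time against 500,000 records: real 0m29.505s user 0m28.670s sys 0m0.816s
【Solution 2】: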

This might work for you (GNU sed):

sed -r 's/^([^,.]*\.([^,]*)),[^,]*/\1,\2/;s/\s*,\s*/,/g' file

Edit:

sed -r 's/\.([^,]*),[^,]*/.\1,\1/;s/\s*,\s*/,/g' file
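As an illustration of how the edited command works, here is a sketch that applies the same two substitutions to one sample line from the question. It is written with -E and `[[:space:]]` on the assumption that these behave like GNU sed's -r and `\s` here; the function name reformat_line is hypothetical.

```shell
# Sketch only: "reformat_line" is a hypothetical name; it filters stdin to stdout.
reformat_line() {
    # 1st expression: at the first "." in the line, capture the text up to the next
    #   comma (the milliseconds) and overwrite the following field with a copy of it.
    # 2nd expression: drop any whitespace surrounding each comma.
    sed -E \
        -e 's/\.([^,]*),[^,]*/.\1,\1/' \
        -e 's/[[:space:]]*,[[:space:]]*/,/g'
}

# Usage on one sample line from the question:
printf '%s\n' '2014-01-21 14:29:23.900,2014-01-21 14:29:23.340,    0.000,   192.168.40.26,    192.168.40.2,6    , 33317,  8000, 3052000,    2000,.A....,  0' | reformat_line
```

The first substitution is not global, so only the first "." in each line (the one in the first timestamp) is affected; the dots in the IP addresses and the flags field are left alone.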

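【Comments】:

    【Solution 3】:

    I would use a perl one-liner. It maps over every field to strip leading and trailing whitespace, then removes from the second field every character up to the last ., and finally prints all fields joined with commas: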

    perl -F, -ane '
        @F = map { s/\A\s+//; s/\s+\Z//; $_ } @F; 
        $F[1] =~ s/\A.*\.//; 
        printf qq|%s\n|, join q|,|, @F
    ' infile
    

    It yields:

    2014-01-21 14:26:05.900,900,0.000,192.168.40.2,192.168.40.26,6,8000,33311,172000,2000,.A..S.,0
    2014-01-21 14:29:23.900,900,0.000,192.168.40.26,192.168.40.2,6,33317,8000,3052000,2000,.A....,0
    2014-01-21 14:30:25.900,900,0.000,192.168.40.26,192.168.40.2,17,36193,514,558000,2000,......,0
    2014-01-21 14:31:04.901,901,0.000,192.168.40.242,192.168.40.2,17,57516,514,422000,2000,......,0
    2014-01-21 14:31:13.900,900,0.000,192.168.40.16,192.168.40.2,17,53313,514,540000,2000,......,0
    

    【Comments】:

    • This looks great, except that the second field should contain what lies between the . and the , in the first field.
    【Solution 4】:

    Using awk

    awk '{t=$1;gsub(/ /,"");split($1,a,".");$1=t;$2=a[2]}1' FS=, OFS=, file
    

    【Comments】:
