Optimal way to parse a log file: answers

【Question title】: Optimal way to parse a log file
【Posted】: 2012-11-29 12:32:05
【Question】:

I have a log file that looks like this:

Client connected with ID 8127641241
< multiple lines of unimportant log here>
Client not responding
Total duration: 154.23583
Sent: 14
Received: 9732
Client lost

Client connected with ID 2521598735
< multiple lines of unimportant log here>
Client not responding
Total duration: 12.33792
Sent: 2874
Received: 1244
Client lost

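The log contains many blocks that start with Client connected with ID 1234 and end with Client lost. The blocks are never interleaved (there is only one client at a time).

How would I parse this file and generate per-client statistics (ID, duration, sent, received)?

I'm mainly asking about the parsing process, not the formatting.

I suppose I could loop over all the lines, set a flag when I find a Client connected line, and save the ID in a variable. Then I would grep the following lines and save the values until I find a Client lost line. Is this a good approach? Is there a better one?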
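
For reference, the flag-based loop described above might look roughly like the following in plain bash. This is only a sketch: parse_log is a hypothetical helper name, it reads the log on stdin, and it assumes the field lines look exactly like the ones in the sample.

```shell
#!/bin/bash
# Sketch of the flag-based loop from the question.
# parse_log is a hypothetical name; it reads the log on stdin.
parse_log() {
    local line in_block=0 id= duration= sent= received=
    printf 'ID Duration Sent Received\n'
    while IFS= read -r line; do
        case $line in
            'Client connected with ID '*)
                in_block=1              # flag: we are now inside a block
                id=${line##* }          # the last word on the line is the ID
                duration= sent= received=
                ;;
            'Total duration: '*) (( in_block )) && duration=${line##* } ;;
            'Sent: '*)           (( in_block )) && sent=${line##* } ;;
            'Received: '*)       (( in_block )) && received=${line##* } ;;
            'Client lost'*)
                # end of block: emit one row, then clear the flag
                if (( in_block )); then
                    printf '%s %s %s %s\n' "$id" "$duration" "$sent" "$received"
                fi
                in_block=0
                ;;
        esac
    done
}
```

It could be called as parse_log < logfile | column -t. A shell loop like this is usually much slower than awk on large logs, so it is mainly useful for showing the logic.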

【Comments】:

Tags: bash shell logging sed grep

【Solution 1】:

Here's a quick way using awk:

awk 'BEGIN { print "ID Duration Sent Received" } /^(Client connected|Total duration:|Sent:)/ { printf "%s ", $NF } /^Received:/ { print $NF }' file | column -t

Result:

ID          Duration   Sent  Received
8127641241  154.23583  14    9732
2521598735  12.33792   2874  1244

【Discussion】:

【Solution 2】:

A solution in Perl:

#!/usr/bin/perl

use warnings;
use strict;

print "\tID\tDuration\tSent\tReceived\n";

while (<>) {
  chomp;
  if (/Client connected with ID (\d+)/) {
    print "$1\t";
  }
  if (/Total duration: ([\d\.]+)/) {
    print "$1\t";
  }
  if (/Sent: (\d+)/) {
    print "$1\t";
  }
  if (/Received: (\d+)/) {
    print "$1\n";
  }
}

Sample output:

        ID  Duration    Sent    Received
8127641241  154.23583   14  9732
2521598735  12.33792    2874    1244

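【Discussion】:

【Solution 3】:

If you are sure the log file will never contain errors and the fields always appear in the same order, you can use the following: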

#!/bin/bash

ids=()
declare -a duration
declare -a sent
declare -a received
# each block yields exactly four lines from grep: ID, duration, sent, received
while read -r _ _ _ _ id; do
   ids+=( "$id" )
   read -r _ _ "duration[$id]"
   read -r _ "sent[$id]"
   read -r _ "received[$id]"
done < <(grep '\(^Client connected with ID\|^Total duration:\|^Sent:\|^Received:\)' logfile)

# printing the data out, for control purposes only
for id in "${ids[@]}"; do
   printf "ID=%s\n\tDuration=%s\n\tSent=%s\n\tReceived=%s\n" "$id" "${duration[$id]}" "${sent[$id]}" "${received[$id]}"
done

The output is:

$ ./parsefile
ID=8127641241
    Duration=154.23583
    Sent=14
    Received=9732
ID=2521598735
    Duration=12.33792
    Sent=2874
    Received=1244

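The point, though, is that the data is stored in the corresponding arrays, keyed by client ID. This is reasonably efficient. It might be slightly more efficient in another programming language (e.g. perl), but since you only tagged your post with bash, sed and grep, I think I have fully answered your question.

Explanation: grep keeps only the lines we are interested in, and bash reads only the fields we are interested in, assuming they always appear in the same order. The script should be easy to understand and to adapt to your needs.

【Discussion】:

Nice use of grep to pre-filter the file. You can omit the "ids" array and iterate over the keys of the "duration" array: for id in "${!duration[@]}"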
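
A small sketch of what this comment suggests, assuming the arrays are declared with declare -a as in the answer. print_stats is a hypothetical helper, and the values are hard-coded from the question's sample log. One caveat: for indexed arrays bash expands the keys in ascending numeric order, not in the order the IDs appeared in the file, so dropping "ids" loses the original ordering.

```shell
#!/bin/bash
# Sample data copied from the question's log; in the answer these arrays
# are filled by the read loop instead.
declare -a duration sent received
duration[8127641241]=154.23583; sent[8127641241]=14;   received[8127641241]=9732
duration[2521598735]=12.33792;  sent[2521598735]=2874; received[2521598735]=1244

# print_stats is a hypothetical helper name.
print_stats() {
    local id
    # "${!duration[@]}" expands to the indices that are set in the array
    for id in "${!duration[@]}"; do
        printf 'ID=%s Duration=%s Sent=%s Received=%s\n' \
            "$id" "${duration[$id]}" "${sent[$id]}" "${received[$id]}"
    done
}
```

With the sample data, print_stats lists 2521598735 before 8127641241, which is the reverse of the order in the log.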

【Solution 4】:

awk:

awk 'BEGIN{print "ID Duration Sent Received"}/with ID/&&!f{f=1}f&&/Client lost/{print a[1],a[2],a[3],a[4];f=0}f{for(i=1;i<=NF;i++){
        if($i=="ID")a[1]=$(i+1)
        if($i=="duration:")a[2]=$(i+1)
        if($i=="Sent:")a[3]=$(i+1)
        if($i=="Received:")a[4]=$(i+1)
}}' log

If there is always a blank line between your data blocks, the awk script above can be simplified to:

 awk -vRS="" 'BEGIN{print "ID Duration Sent Received"}
{for(i=1;i<=NF;i++){
        if($i=="ID")a[1]=$(i+1)
        if($i=="duration:")a[2]=$(i+1)
        if($i=="Sent:")a[3]=$(i+1)
        if($i=="Received:")a[4]=$(i+1)
}print a[1],a[2],a[3],a[4];}' log

Output:

ID Duration Sent Received
8127641241 154.23583 14 9732
2521598735 12.33792 2874 1244

If you want nicer formatting, pipe the output into | column -t

You get:

ID          Duration   Sent  Received
8127641241  154.23583  14    9732
2521598735  12.33792   2874  1244

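【Discussion】:

【Solution 5】:

Walking the file in paragraph mode

With Perl or AWK, you can use a special paragraph mode to read in whole records, which treats the blank lines between records as the separator. In Perl, use -00 to enable paragraph mode; in AWK, set the RS variable to the empty string (i.e. "") to do the same. You can then parse the fields within each record.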

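Using line-oriented statements

Alternatively, you can use a shell while loop to read one line at a time, then parse each line with grep or sed. You can even use a case statement, depending on how complex the parsing is.

For example, assuming your records always have 5 matching fields, you could do something like this: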

while read; do
    grep -Eo '[[:digit:]]+'
done < /tmp/foo | xargs -n5 | sed 's/ /\t/g'

The loop yields:

23583   14  9732    2521598735  33792
2874    1244    8127641241  23583   14
9732    2521598735  33792   2874    1244

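You can of course play with the formatting, add a header line, and so on. The point is that you have to know your data.

AWK, Perl, or even Ruby are better choices for parsing record-oriented formats, but if your needs are basic, the shell is certainly an option.

【Discussion】:

Your shell loop will fail for lines that start or end with whitespace or that contain backslashes. Always use while IFS= read -r line unless you have a very good reason not to and know exactly what you are doing. You also aren't giving grep a target to operate on, so the script will hang. Use awk.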
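
To make the difference concrete, here is a minimal comparison of the two read forms. read_plain and read_safe are hypothetical helper names used only for this demonstration.

```shell
#!/bin/bash
# Both helpers echo each stdin line wrapped in brackets.
# read_plain uses a bare read: surrounding whitespace is trimmed and
# backslashes are treated as escape characters.
read_plain() { local line; while read line; do printf '[%s]\n' "$line"; done; }

# read_safe uses IFS= (no trimming) and -r (backslashes kept literally).
read_safe()  { local line; while IFS= read -r line; do printf '[%s]\n' "$line"; done; }
```

Feeding both the same line, printf '%s\n' '  a\tb  ' | read_plain prints [atb], while the same input through read_safe prints [  a\tb  ] unchanged.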

【Solution 6】:

A short Perl snippet:

perl -ne '
    BEGIN {print "ID Duration Sent Received\n";}
    print "$1 " if /(?:ID|duration:|Sent:|Received:) (.+)$/;
    print "\n" if /^Client lost/;
' filename | column -t

【Discussion】:

【Solution 7】:

awk -v RS= -F'\n' '
BEGIN{ printf "%15s%15s%15s%15s\n","ID","Duration","Sent","Received" }
{
   for (i=1;i<=NF;i++) {
      n = split($i,f,/ /)    
      if ( $i ~ /^(Client connected|Total duration:|Sent:|Received:)/ ) {
         printf "%15s",f[n]
      }
   }
   print ""
}'

【Discussion】: