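【Question title】: Parsing Amazon S3 log files (PHP)
【Posted】: 2014-04-11 22:54:20
【Question】:

I want to parse Amazon S3 log files, which are space-delimited. The only problem is that some of the space-delimited fields themselves contain spaces. How would I go about parsing a file like this?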

450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -

【Comments】:

    Tags: php file parsing logging amazon-s3


    【Solution 1】:

    You can probably use a regular expression to parse the log file and extract the individual fields.

    Here is a PHP example:

    <?php 
    $string ='450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -';
    
    // "version" uses \S+ (not \S) so that a multi-character version ID is captured in full.
    $pattern = '/(?P<owner>\S+) (?P<bucket>\S+) (?P<time>\[[^]]*\]) (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) (?P<request>"[^"]*") (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) (?P<referrer>"[^"]*") (?P<useragent>"[^"]*") (?P<version>\S+)/';
    
    preg_match($pattern, $string, $matches);
    print_r($matches);
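
To apply the same idea to a whole log file rather than a single string, the pattern can be run line by line. The following is only a minimal sketch: `parse_s3_log_lines()` is a hypothetical helper name, and the second demo line is made up to show what happens when a line does not match.

```php
<?php
// Same field layout as the pattern above; "version" uses \S+ so that
// multi-character version IDs are captured in full.
$pattern = '/(?P<owner>\S+) (?P<bucket>\S+) (?P<time>\[[^]]*\]) (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) (?P<request>"[^"]*") (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) (?P<referrer>"[^"]*") (?P<useragent>"[^"]*") (?P<version>\S+)/';

/**
 * Hypothetical helper: run $pattern over each line and keep matches
 * separate from the lines that did not match.
 */
function parse_s3_log_lines(iterable $lines, string $pattern): array
{
    $parsed = [];
    $skipped = [];
    foreach ($lines as $number => $line) {
        $line = rtrim($line, "\r\n");
        if ($line === '') {
            continue; // ignore blank lines
        }
        if (preg_match($pattern, $line, $matches) === 1) {
            $parsed[] = $matches;
        } else {
            $skipped[] = $number; // remember which lines need a closer look
        }
    }
    return [$parsed, $skipped];
}

// Demo input: the sample line from the question plus one made-up malformed line.
$lines = [
    '450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -',
    'this line is not an S3 access log entry',
];

[$parsed, $skipped] = parse_s3_log_lines($lines, $pattern);
echo count($parsed), " parsed, ", count($skipped), " skipped\n";
```

For a real file, something like `parse_s3_log_lines(new SplFileObject($path), $pattern)` should work, where `$path` is whatever location your logs are stored at. Keeping the indexes of skipped lines makes it easier to notice when the log format has changed.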
    

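    【Discussion】:

    • Awesome. This works perfectly. Thanks for sharing your solution, Jeremy!
    【Solution 2】:

    I slightly modified Jeremy Quinton's answer to make the matches cleaner: the surrounding brackets and quotes are left out of the captured values, and the request is split into a method and a path.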

      <?php 
      $string ='450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -';
    
      $pattern = '/(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^]]*)\] (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) "(?P<method>[^ ]*) (?P<path>[^"]*)" (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) "(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)" (?P<version>\S+)/';
    
      preg_match($pattern, $string, $matches);
      print_r($matches);
    
      ?>
    
      Result:
      Array
      (
          [0] => 450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1" 206 - 5016183 16149754 216682 39 "http://example.com" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36" -
          [owner] => 450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c
          [1] => 450227804f8fd31c931036b020ddd0003a03b421d8c669d8858c7c15d72c
          [bucket] => renderd
          [2] => renderd
          [time] => 10/Apr/2014:19:32:23 +0000
          [3] => 10/Apr/2014:19:32:23 +0000
          [ip] => 75.256.56.200
          [4] => 75.256.56.200
          [requester] => 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c
          [5] => 450227804f8fd31c931036b020000343afa03b421d8c669d8858c7c15d72c
          [reqid] => 0231400AA3D3533C
          [6] => 0231400AA3D3533C
          [operation] => REST.GET.OBJECT
          [7] => REST.GET.OBJECT
          [key] => Trailer.mp4
          [8] => Trailer.mp4
          [method] => GET
          [9] => GET
          [path] => /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1
          [10] => /Trailer.mp4?AWSAccessKeyId=AKIAJFV33YRQMN63AQCQ&Expires=1397159234&Signature=8ipN9ymsB5gCzxChTu9lD6ZMrdA%3D HTTP/1.1
          [status] => 206
          [11] => 206
          [error] => -
          [12] => -
          [bytes] => 5016183
          [13] => 5016183
          [size] => 16149754
          [14] => 16149754
          [totaltime] => 216682
          [15] => 216682
          [turnaround] => 39
          [16] => 39
          [referrer] => http://example.com
          [17] => http://example.com
          [useragent] => Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36
          [18] => Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36
          [version] => -
          [19] => -
      )
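
Because the `time` value is captured without the square brackets, it can be handed straight to PHP's date parser. This is a small sketch that assumes the timestamp always has the `dd/Mon/yyyy:hh:mm:ss +zzzz` shape seen in the sample line; the variable names are illustrative only.

```php
<?php
// Assumption: $time holds the bracket-free value captured by the "time" group.
$time = '10/Apr/2014:19:32:23 +0000';

// d/M/Y:H:i:s O = day/short month name/year:hour:minute:second, then the UTC offset.
$when = DateTimeImmutable::createFromFormat('d/M/Y:H:i:s O', $time);
if ($when === false) {
    throw new RuntimeException("Unrecognised timestamp: $time");
}

echo $when->format(DATE_ATOM), "\n"; // 2014-04-10T19:32:23+00:00
echo $when->getTimestamp(), "\n";    // Unix timestamp, handy for sorting or storage
```

Checking for `false` matters here, since `createFromFormat()` returns `false` rather than throwing when the input does not fit the format.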
    

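    【Discussion】:

      【Solution 3】:

      Amazon now appends more fields to the log, so here is a new regular expression that includes the new fields:

      • hostid
      • sigversion
      • ciphersuite
      • authtype
      • hostheader
      • tlsversion

      There are a few other changes as well:

      • Yi's last regular expression fails to match a log line at all when the method+path, referrer, or useragent value is not enclosed in quotes, which is usually the case when the value is empty (logged as a single dash).
      • path previously captured the trailing HTTP protocol version as well, so I split that out into a new protocol value.

      Updated regular expression: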

      $pattern = '/(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^]]*)\] (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) (-|"-"|"(?P<method>[^ ]*) (?P<path>\S+) (?P<protocol>[^"]*)") (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) (-|"(?P<referrer>[^"]*)") (-|"(?P<useragent>[^"]*)") (?P<version>\S+) (?P<hostid>\S+) (?P<sigversion>\S+) (?P<ciphersuite>\S+) (?P<authtype>\S+) (?P<hostheader>\S+) (?P<tlsversion>\S+)/';
      preg_match($pattern, $string, $matches);
      

      You can replace the empty values (dashes) and filter the duplicate numeric indexes out of the $matches array like this:

      $matches = array_map(
          function($val) { return $val === '-' ? '' : $val; },
          array_filter(
              $matches,
              function($key) { return !is_numeric($key); },
              ARRAY_FILTER_USE_KEY
          )
      );
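
Put together, the updated pattern and the cleanup step might be wrapped in one function, as in the sketch below. `parse_s3_log_line()` is a hypothetical name, the integer casts are an extra assumption (that these fields are numeric whenever they are not empty), and the demo line is made up to exercise the dash handling; it is not a real log entry.

```php
<?php
// Hypothetical wrapper: returns the named fields, or null if the line does not match.
function parse_s3_log_line(string $line): ?array
{
    $pattern = '/(?P<owner>\S+) (?P<bucket>\S+) \[(?P<time>[^]]*)\] (?P<ip>\S+) (?P<requester>\S+) (?P<reqid>\S+) (?P<operation>\S+) (?P<key>\S+) (-|"-"|"(?P<method>[^ ]*) (?P<path>\S+) (?P<protocol>[^"]*)") (?P<status>\S+) (?P<error>\S+) (?P<bytes>\S+) (?P<size>\S+) (?P<totaltime>\S+) (?P<turnaround>\S+) (-|"(?P<referrer>[^"]*)") (-|"(?P<useragent>[^"]*)") (?P<version>\S+) (?P<hostid>\S+) (?P<sigversion>\S+) (?P<ciphersuite>\S+) (?P<authtype>\S+) (?P<hostheader>\S+) (?P<tlsversion>\S+)/';

    if (preg_match($pattern, $line, $matches) !== 1) {
        return null;
    }

    // Drop the numeric indexes, then turn single dashes into empty strings.
    $fields = array_map(
        function ($val) { return $val === '-' ? '' : $val; },
        array_filter(
            $matches,
            function ($key) { return !is_numeric($key); },
            ARRAY_FILTER_USE_KEY
        )
    );

    // Assumption: these fields hold integers whenever they are not empty.
    foreach (['status', 'bytes', 'size', 'totaltime', 'turnaround'] as $name) {
        if ($fields[$name] !== '') {
            $fields[$name] = (int) $fields[$name];
        }
    }

    return $fields;
}

// Made-up line in the newer format: quoted dash for the referrer,
// bare dashes for the requester, user agent and version.
$line = 'OWNERID renderd [10/Apr/2014:19:32:23 +0000] 75.256.56.200 - 0231400AA3D3533C REST.GET.OBJECT Trailer.mp4 "GET /Trailer.mp4 HTTP/1.1" 206 - 5016183 16149754 216682 39 "-" - - HOSTID SigV4 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader renderd.s3.amazonaws.com TLSv1.2';

$fields = parse_s3_log_line($line);
print_r($fields);
```

Returning `null` for a non-matching line lets the caller decide whether to skip it or report it, instead of silently working with an empty `$matches` array.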
      

      【Discussion】:
