【Question title】: Python: parsing web access logs
【Posted】: 2019-04-20 16:23:33
【Question description】:

I am trying to parse web access logs with the following regular expression:

import re

pattern = re.compile(r"""(?x)^
    (?P<remote_host>\S+)            \s+         # host %h
    \S+                             \s+         # ident %l (unused)
    (?P<remote_user>\S+)            \s+         # user %u
    \[(?P<time_received>.*?)\]      \s+         # time %t
    "(?P<request>.*?)"              \s+         # request "%r"
    (?P<status>[0-9]+)              \s+         # status %>s
    (?P<response_bytes_clf>\S+)     (?:\s+      # size %b (careful, can be '-')
    "(?P<referrer>[^"?\s]*[^"]*)"   \s+         # referrer "%{Referer}i"
    "(?P<user_agent>[^"]*)"         (?:\s+      # user agent "%{User-agent}i"
    "[^"]*"                         )?)?        # optional argument (unused)
$""")

def get_structured_access_log(access_log):
    return pattern.match(access_log).groupdict()

However, some log lines contain malicious requests like these:

190.2.7.178 - - [21/Dec/2011:05:47:03 +0000] "GET /gnu3/index.php?doc=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 273 "-" "<?php system(\"id\"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:04 +0000] "GET /gnu/index.php?doc=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 271 "-" "<?php system(\"id\"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:04 +0000] "GET /phpgwapi/setup/tables_update.inc.php?appdir=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 286 "-" "<?php system(\"id\"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:05 +0000] "GET /forum/install.php?phpbb_root_dir=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 274 "-" "<?php system(\"id\"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:06 +0000] "GET /includes/calendar.php?phpc_root_path=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 275 "-" "<?php system(\"id\"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:06 +0000] "GET /includes/setup.php?phpc_root_path=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 273 "-" "<?php system(\"id\"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:07 +0000] "GET /inc/authform.inc.php?path_pre=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 275 "-" "<?php system(\"id\"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:07 +0000] "GET /include/authform.inc.php?path_pre=../../../../../../../proc/self/environ%00 HTTP/1.1" 404 278 "-" "<?php system(\"id\"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:08 +0000] "GET /index.php?nic=../../../../../../../proc/self/environ%00 HTTP/1.1" 200 4399 "-" "<?php system(\"id\"); ?>"
190.2.7.178 - - [21/Dec/2011:05:47:11 +0000] "GET /index.php?sec=../../../../../../../proc/self/environ%00 HTTP/1.1" 200 4399 "-" "<?php system(\"id\"); ?>"

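The regex above fails to parse these requests, while ordinary web requests parse successfully.

Here are some access log lines that parse successfully: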

123.125.71.79 - - [28/Apr/2012:08:12:57 +0100] "GET /robots.txt HTTP/1.1" 404 268 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
157.56.95.126 - - [28/Apr/2012:10:23:02 +0100] "GET /robots.txt HTTP/1.1" 404 268 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
157.56.95.126 - - [28/Apr/2012:10:23:02 +0100] "GET / HTTP/1.1" 200 4399 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
110.75.173.193 - - [28/Apr/2012:11:57:26 +0100] "GET / HTTP/1.1" 200 4399 "-" "Yahoo! Slurp China"

The exception message:

'NoneType' object has no attribute 'groupdict'

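How can I fix the regular expression so that it also parses these more complex requests?

【Comments】:

  • Regarding the error, this page may help. Try replacing the user_agent group with "(?P<user_agent>.*?)" (demo), or with .* if it is the last part of the string to match. If the last part is unused, why keep it?
  • The last part may be used; these log formats are dynamic.
  • Did the demo work for you?
  • Yes, I replaced the user_agent group with (?P<user_agent>.*?) and it works.
  • Glad it works! I have added it as an answer.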

Tags: python regex


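【Solution 1】:

re.match returns a corresponding match object, or None if the string does not match the pattern. Calling .groupdict() on that None is what raises the error you are seeing.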
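As a minimal sketch of guarding against that None, the helper below checks the match object before calling groupdict(). The names simple_pattern and parse_or_none are hypothetical, and the pattern is deliberately shortened to keep the example small; it is not the full pattern from the question.

```python
import re

# Hypothetical, shortened pattern used only to illustrate the None check.
simple_pattern = re.compile(r'^(?P<remote_host>\S+) \S+ (?P<remote_user>\S+)')

def parse_or_none(line):
    """Return the named groups as a dict, or None when the line does not match."""
    match = simple_pattern.match(line)
    if match is None:
        # No match: report it to the caller instead of raising AttributeError.
        return None
    return match.groupdict()

print(parse_or_none("190.2.7.178 - - [21/Dec/2011:05:47:03 +0000]"))
print(parse_or_none("garbage"))  # None
```

Returning None (or logging and skipping the line) keeps one unparseable entry from aborting the whole run.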

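In the first set of sample data, the culprit is the last field, which contains escaped double quotes: "<?php system(\"id\"); ?>"

A negated character class that excludes double quotes, [^"]*, cannot get past the first double quote in (\"id. Because the pattern also asserts the end of the string with $, the match then fails.

You can match the double quotes in this field by replacing the negated character class in "(?P<user_agent>[^"]*)" with .*?, which matches any character except a newline.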
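To see the difference in isolation, here is a small sketch that applies both variants to just the user-agent field of one failing line. The names field, strict and relaxed are made up for this illustration.

```python
import re

# The user-agent field from one of the failing log lines, outer quotes included.
field = r'"<?php system(\"id\"); ?>"'

# Negated character class: stops at the quote after the first backslash,
# so the closing quote is found too early and "$" cannot match.
strict = re.compile(r'^"(?P<user_agent>[^"]*)"$')

# Lazy dot: keeps expanding until a quote that is followed by the end of the string.
relaxed = re.compile(r'^"(?P<user_agent>.*?)"$')

print(strict.match(field))                        # None
print(relaxed.match(field).group("user_agent"))   # <?php system(\"id\"); ?>
```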

Your pattern could then look like this:

(?x)^
    (?P<remote_host>\S+)            \s+         # host %h
    \S+                             \s+         # ident %l (unused)
    (?P<remote_user>\S+)            \s+         # user %u
    \[(?P<time_received>.*?)\]      \s+         # time %t
    "(?P<request>.*?)"              \s+         # request "%r"
    (?P<status>[0-9]+)              \s+         # status %>s
    (?P<response_bytes_clf>\S+)     (?:\s+      # size %b (careful, can be '-')
    "(?P<referrer>[^"?\s]*[^"]*)"   \s+         # referrer "%{Referer}i"
    "(?P<user_agent>.*?)"           (?:\s+      # user agent "%{User-agent}i"
    "[^"]*"                         )?)?        # optional argument (unused)
$

Regex demo
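For completeness, here is a sketch of the adjusted pattern wired into the function from the question and run against one failing line and one ordinary line taken from the question. The None check is an addition that the original function did not have.

```python
import re

# The adjusted pattern, compiled in verbose mode via the inline (?x) flag.
pattern = re.compile(r"""(?x)^
    (?P<remote_host>\S+)            \s+         # host %h
    \S+                             \s+         # ident %l (unused)
    (?P<remote_user>\S+)            \s+         # user %u
    \[(?P<time_received>.*?)\]      \s+         # time %t
    "(?P<request>.*?)"              \s+         # request "%r"
    (?P<status>[0-9]+)              \s+         # status %>s
    (?P<response_bytes_clf>\S+)     (?:\s+      # size %b (careful, can be '-')
    "(?P<referrer>[^"?\s]*[^"]*)"   \s+         # referrer "%{Referer}i"
    "(?P<user_agent>.*?)"           (?:\s+      # user agent "%{User-agent}i"
    "[^"]*"                         )?)?        # optional argument (unused)
$""")

def get_structured_access_log(access_log):
    """Return the named groups as a dict, or None if the line does not match."""
    match = pattern.match(access_log)
    return match.groupdict() if match else None

# One of the lines that previously failed, and one that already parsed.
malicious = (r'190.2.7.178 - - [21/Dec/2011:05:47:03 +0000] '
             r'"GET /gnu3/index.php?doc=../../../../../../../proc/self/environ%00 HTTP/1.1" '
             r'404 273 "-" "<?php system(\"id\"); ?>"')
normal = ('110.75.173.193 - - [28/Apr/2012:11:57:26 +0100] '
          '"GET / HTTP/1.1" 200 4399 "-" "Yahoo! Slurp China"')

for line in (malicious, normal):
    fields = get_structured_access_log(line)
    print(fields["status"], fields["user_agent"])
```

Because .*? is lazy and the pattern is anchored with $, the engine keeps extending user_agent until the remainder of the line fits, which is why the inner escaped quotes end up inside the group.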

【Comments】:

  • Thanks, could the same be done for the referrer as well?
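One possible way to cover the referrer too, offered as a sketch rather than as the answerer's method: instead of .*?, both quoted fields can use (?:[^"\\]|\\.)*, which consumes either an ordinary character or a backslash together with the character it escapes. The name tail, the shortened pattern (last two fields only) and the sample string are assumptions made up for this illustration.

```python
import re

# Hypothetical fragment covering only the last two quoted fields of a log line.
# (?:[^"\\]|\\.)* matches ordinary characters, or a backslash plus the escaped character.
tail = re.compile(r'''(?x)
    "(?P<referrer>(?:[^"\\]|\\.)*)"     \s+     # referrer, escaped quotes allowed
    "(?P<user_agent>(?:[^"\\]|\\.)*)"           # user agent, escaped quotes allowed
$''')

# Made-up sample in which both fields contain escaped double quotes.
sample = r'"http://example.com/?q=\"x\"" "<?php system(\"id\"); ?>"'

m = tail.search(sample)
print(m.group("referrer"))
print(m.group("user_agent"))
```

This variant only treats a quote as closing when it is not preceded by an escaping backslash, so it does not rely on backtracking against the end-of-line anchor.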