【Posted】: 2020-01-03 02:41:48
【Question】:
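I have a Python 3 dictionary whose keys are tuples and whose values are strings naming the masked regular expressions they correspond to. I want to remove the overlapping tuples.
Problem
Basically, I am trying to build a regular expression that matches a given string. I have a catalogue of regular expressions that I run over the string, and I store the ones that match in a dictionary: the key is the span (a tuple) of each match, and the value is the mask name of the regex that matched that span.
After this, my goal is to combine these regexes into one. However, I am facing a problem that is blocking my progress.
Example
Consider the log line -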
Mar 29 2004 09:54:18: %PIX-6-302005: Built UDP connection for faddr 198.207.223.240/53337 gaddr 10.0.0.187/53 laddr 192.168.0.2/53
Once I pass this string through my matching logic, this is the dictionary I get -
pos_dict:
(0, 20) CISCOTIMESTAMP
(23, 35) CISCOTAG
(37, 42) CISCO_ACTION
(68, 83) IPV4
(83, 89) URIPATH
(96, 106) IPV4
(106, 109) URIPATH
(116, 127) IPV4
(127, 130) URIPATH
After this, I combine the regexes masked by the values (CISCOTAG, IPV4, etc.) to get the final regex.
However, if I feed this log line to the same code -
2016-11-16 06:43:19.79 kali sshd[37727]: Failed password for root from 127.0.0.1 port 22 ssh2
the generated position dictionary is -
pos_dict:
(0, 4) INT
(0, 22) TIMESTAMP_ISO8601
(4, 7) INT
(7, 10) INT
(11, 13) INT
(14, 16) INT
(17, 19) INT
(20, 22) INT
(32, 39) SYSLOG5424SD
(33, 38) INT
(71, 74) INT
(71, 80) IPV4
(75, 76) INT
(77, 78) INT
(79, 80) INT
(86, 88) INT
(92, 93) INT
While this is not exactly "wrong", we can see that the following entries are unnecessary:
(0, 4) INT
(4, 7) INT
(7, 10) INT
(11, 13) INT
(14, 16) INT
(17, 19) INT
(20, 22) INT
(33, 38) INT
(71, 74) INT
(75, 76) INT
(77, 78) INT
(79, 80) INT
because they already fall within the spans of
(0, 22) TIMESTAMP_ISO8601
(32, 39) SYSLOG5424SD
(71, 80) IPV4
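The redundancy described above can be checked mechanically: a span is unnecessary when some other span in the dictionary fully contains it. Below is a minimal sketch of that check, assuming a hypothetical helper named `contained_spans` (it is not part of my code) and using only a subset of the entries listed above.

```python
def contained_spans(pos_dict: dict) -> list:
    """Return the spans that lie fully inside a different span of pos_dict."""
    redundant = []
    for inner in pos_dict:
        for outer in pos_dict:
            # outer must be a different span that covers inner completely
            if outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]:
                redundant.append(inner)
                break
    return sorted(redundant)


# Subset of the pos_dict generated for the sshd log line
pos_dict = {
    (0, 4): "INT",
    (0, 22): "TIMESTAMP_ISO8601",
    (32, 39): "SYSLOG5424SD",
    (33, 38): "INT",
    (71, 74): "INT",
    (71, 80): "IPV4",
    (86, 88): "INT",
}
print(contained_spans(pos_dict))  # [(0, 4), (33, 38), (71, 74)]
```

The nested loop is quadratic in the number of spans, which is probably acceptable for a single log line but is only meant to illustrate the containment condition.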
Attempt
This is the code I use to match the regexes and generate the full regex:
# Assumed to be the third-party "regex" package; the standard library
# "re" module exposes the same finditer() call.
import regex


def get_order(results: list, string: str) -> dict:
    """
    Get the order of the regex occurrences in a dictionary.

    Parameters
    ----------
    results : list
        list of matches
    string : str
        input string

    Returns
    -------
    dict
    """
    pos_dict = {}
    for result in results:
        # all_regex is a dictionary of regular expressions keyed by their 'masked' names.
        expr = all_regex.get(result)
        # Iterate through the matches and store the span of each matched value as a key in pos_dict.
        for match in regex.finditer(expr, string):
            pos_dict[match.span()] = result
    return pos_dict


def get_final_regex(pos_dict: dict) -> str:
    """
    Combine the grok regexes into a final regex pattern.

    Parameters
    ----------
    pos_dict : dict
        mapping of match span to masked regex name

    Returns
    -------
    str
    """
    final_regex = ''
    filler_start = '(.*?'
    filler_end = ')'
    for key in sorted(pos_dict):
        ## DEBUG START
        print(key, pos_dict[key])
        ## DEBUG END
        expr = pos_dict.get(key)
        q = all_regex.get(expr)
        q = q.replace('/', r'\/')
        # Wrap the pattern in a group unless it already contains one.
        if not (('(' in q) and (')' in q)):
            q = '(' + q + ')'
        final_regex = final_regex + filler_start + q + filler_end
    return final_regex
Expectation
For the log line 2016-11-16 06:43:19.79 kali sshd[37727]: Failed password for root from 127.0.0.1 port 22 ssh2, the expected value of pos_dict is -
(0, 22) TIMESTAMP_ISO8601
(32, 39) SYSLOG5424SD
(71, 80) IPV4
(86, 88) INT
(92, 93) INT
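so that I can combine the regexes later.
Ideally, this boils down to a problem of "sorting" the tuples and "ignoring" the overlapping ones.
Any help would be greatly appreciated.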
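To make the "sort and ignore" idea concrete, here is one possible sketch, not a definitive solution. It assumes a hypothetical function `drop_contained` (my own name) and treats "overlapping" as "fully contained in another span". Spans are sorted by start ascending and end descending, so a wider span always precedes the narrower ones it covers, and a single pass can then skip every span that ends at or before the furthest end seen so far.

```python
def drop_contained(pos_dict: dict) -> dict:
    """Keep only the spans that are not fully covered by another span."""
    kept = {}
    max_end = -1
    # Start ascending, end descending: a wider span sorts before any
    # narrower span that shares its start.
    for start, end in sorted(pos_dict, key=lambda span: (span[0], -span[1])):
        if end <= max_end:
            # Every earlier span starts at or before this one, so one of
            # them already covers it completely.
            continue
        kept[(start, end)] = pos_dict[(start, end)]
        max_end = end
    return kept


# pos_dict generated for the sshd log line
pos_dict = {
    (0, 4): "INT", (0, 22): "TIMESTAMP_ISO8601", (4, 7): "INT",
    (7, 10): "INT", (11, 13): "INT", (14, 16): "INT", (17, 19): "INT",
    (20, 22): "INT", (32, 39): "SYSLOG5424SD", (33, 38): "INT",
    (71, 74): "INT", (71, 80): "IPV4", (75, 76): "INT", (77, 78): "INT",
    (79, 80): "INT", (86, 88): "INT", (92, 93): "INT",
}
for span, name in drop_contained(pos_dict).items():
    print(span, name)
# (0, 22) TIMESTAMP_ISO8601
# (32, 39) SYSLOG5424SD
# (71, 80) IPV4
# (86, 88) INT
# (92, 93) INT
```

Note that this sketch keeps spans that only partially overlap, such as (0, 5) and (3, 8); deciding which of those to drop would need an extra rule, for example preferring the longer match.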
【Discussion】:
Tags: python python-3.x dictionary tuples