【Question title】: Python - Searching file lines for multiple patterns efficiently
【Posted】: 2015-10-17 23:55:01
【Question description】:

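I am parsing a large number of large files and want to make sure I do it as efficiently as possible. One of the lines I am parsing looks like this (Windows Security Event Log, event 4624):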

Security/Microsoft-Windows-Security-Auditing ID [4624] :EventData/Data -> SubjectUserSid = S-1-0-0 SubjectUserName = - SubjectDomainName = - SubjectLogonId = 0x0000000000000000 TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111 TargetUserName = johndoe TargetDomainName = TestDomain TargetLogonId = 0x0000000001111111 LogonType = 3 LogonProcessName = NtLmSsp  AuthenticationPackageName = NTLM WorkstationName = TestWorkstation LogonGuid = {00000000-0000-0000-0000-000000000000} TransmittedServices = - LmPackageName = NTLM V2 KeyLength = 128 ProcessId = 0x0000000000000000 ProcessName = - IpAddress = 1.1.1.1 IpPort = 11111 

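What is the most efficient way to extract multiple fields from that line? I could partition the line repeatedly until I reach each field I am interested in, but scanning the line over and over feels like a waste of time and resources.

Is there a smart way to go over the line only once and still pull out, for example, the following fields: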

LogonType = 3
TargetUserName = johndoe
TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111

For example, what I could do is repeat the following process:

part = line.partition('TargetUserName = ')[2]
username = part.partition(' ')[0]

for each field I want (the example above only gets the username), but again this seems inefficient to me.

Is there a better way to handle this?

【Comments】:

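  • Have you looked at regexes?
  • Yes, I have used regular expressions in the past. Is there a way to match multiple patterns in a single regex operation? Or do I still need a separate re.match() or search() for each pattern I am interested in? Thanks!

Tags: python file search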


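【Solution 1】:

Each field name is a run of upper- and lowercase letters, separated from its value by " = ". Each value is a run of non-whitespace characters. You can use re.findall with capturing groups to locate every "letters = non-whitespace" occurrence in a single pass. This gives you a list of tuples, which you can store, or iterate over and pass to a format string: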

>>> s = '''Security/Microsoft-Windows-Security-Auditing ID [4624] :EventData/Data -> SubjectUserSid = S-1-0-0 SubjectUserName = - SubjectDomainName = - SubjectLogonId = 0x0000000000000000 TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111 TargetUserName = johndoe TargetDomainName = TestDomain TargetLogonId = 0x0000000001111111 LogonType = 3 LogonProcessName = NtLmSsp  AuthenticationPackageName = NTLM WorkstationName = TestWorkstation LogonGuid = {00000000-0000-0000-0000-000000000000} TransmittedServices = - LmPackageName = NTLM V2 KeyLength = 128 ProcessId = 0x0000000000000000 ProcessName = - IpAddress = 1.1.1.1 IpPort = 11111 '''
>>> import re
>>> for item in re.findall(r'([A-Za-z]+) = (\S+)', s):
...     print('{} = {}'.format(*item))
...
SubjectUserSid = S-1-0-0
SubjectUserName = -
SubjectDomainName = -
SubjectLogonId = 0x0000000000000000
TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111
TargetUserName = johndoe
TargetDomainName = TestDomain
TargetLogonId = 0x0000000001111111
LogonType = 3
LogonProcessName = NtLmSsp
AuthenticationPackageName = NTLM
WorkstationName = TestWorkstation
LogonGuid = {00000000-0000-0000-0000-000000000000}
TransmittedServices = -
LmPackageName = NTLM
KeyLength = 128
ProcessId = 0x0000000000000000
ProcessName = -
IpAddress = 1.1.1.1
IpPort = 11111
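
One thing to watch in the output above: LmPackageName comes back as NTLM although the line contains NTLM V2, because \S+ stops at the first space. If multi-word values matter, one possible variation (a sketch, not part of the original answer) is to match the value lazily and stop where the next "Name = " begins. The shortened sample string and the name `pattern` are illustrative assumptions, and the approach assumes no value itself contains text shaped like "Word = ".

```python
import re

# Shortened, illustrative sample taken from the log line in the question.
s = ('LogonProcessName = NtLmSsp  AuthenticationPackageName = NTLM '
     'LmPackageName = NTLM V2 KeyLength = 128 IpPort = 11111 ')

# The value is matched lazily and ends where the next "Name = " starts
# (or at the end of the line), so embedded spaces are kept.
pattern = re.compile(r'([A-Za-z]+) = (.*?)(?=\s+[A-Za-z]+ = |\s*$)')

d = dict(pattern.findall(s))
print(d['LmPackageName'])     # NTLM V2
print(d['LogonProcessName'])  # NtLmSsp
```

The lookahead does more backtracking than plain \S+, so it is probably worth using only if truncated values are actually a problem for your data.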

You can also turn it into a dictionary for easy lookup:

>>> d = dict(re.findall(r'([A-Za-z]+) = (\S+)', s))
>>> d['LogonType']
'3'
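
Since the question is about many large files, here is a minimal sketch of how the dictionary approach might be applied line by line. The names `FIELD_RE`, `WANTED`, `extract_fields`, and `parse_lines` are hypothetical helpers introduced for illustration; the pattern is the one from the answer, compiled once outside the loop.

```python
import re

# Compile once, outside the loop, so the pattern is not looked up per line.
FIELD_RE = re.compile(r'([A-Za-z]+) = (\S+)')

# Hypothetical selection of fields; adjust to whatever you need.
WANTED = ('LogonType', 'TargetUserName', 'TargetUserSid')


def extract_fields(line, wanted=WANTED):
    """Return {name: value} for the wanted fields found in one log line."""
    fields = dict(FIELD_RE.findall(line))  # single pass over the line
    return {name: fields[name] for name in wanted if name in fields}


def parse_lines(lines, wanted=WANTED):
    """Lazily yield one dict per line that contains at least one wanted field."""
    for line in lines:
        found = extract_fields(line, wanted)
        if found:
            yield found


sample = ('SubjectUserSid = S-1-0-0 TargetUserSid = S-1-1-11-1111 '
          'TargetUserName = johndoe LogonType = 3 LmPackageName = NTLM V2')
print(extract_fields(sample))
# {'LogonType': '3', 'TargetUserName': 'johndoe', 'TargetUserSid': 'S-1-1-11-1111'}
```

Because `parse_lines` accepts any iterable of strings, an open file object can be passed in directly, which reads the file one line at a time instead of loading it all into memory.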

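【Comments】:

  • This is awesome. One question: is there a way to do the same thing, but apply the .group() logic so that it only grabs my 'nonwhitespace' part after the '='? Thanks!
  • If you have already turned it into a dictionary (which I would recommend), you can get its values with d.values().
  • But if you really don't want to capture the field names, you can simply remove the parentheses around [A-Za-z]+.
  • Super helpful. Thanks a lot!
【Solution 2】: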
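
To make the two suggestions in these comments concrete, here is a small sketch on a shortened, made-up sample string. With a single capturing group, re.findall returns a flat list of strings rather than a list of tuples.

```python
import re

# Shortened sample for illustration only.
s = 'LogonType = 3 TargetUserName = johndoe IpAddress = 1.1.1.1'

# Only the value is captured; the field name is matched but not returned.
values = re.findall(r'[A-Za-z]+ = (\S+)', s)
print(values)  # ['3', 'johndoe', '1.1.1.1']

# Same values via the dictionary route (insertion order is kept in Python 3.7+).
d = dict(re.findall(r'([A-Za-z]+) = (\S+)', s))
print(list(d.values()))  # ['3', 'johndoe', '1.1.1.1']
```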
    st = 'Security/Microsoft-Windows-Security-Auditing ID [4624] :EventData/Data -> SubjectUserSid = S-1-0-0 SubjectUserName = - SubjectDomainName = - SubjectLogonId = 0x0000000000000000 TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111 TargetUserName = johndoe TargetDomainName = TestDomain TargetLogonId = 0x0000000001111111 LogonType = 3 LogonProcessName = NtLmSsp  AuthenticationPackageName = NTLM WorkstationName = TestWorkstation LogonGuid = {00000000-0000-0000-0000-000000000000} TransmittedServices = - LmPackageName = NTLM V2 KeyLength = 128 ProcessId = 0x0000000000000000 ProcessName = - IpAddress = 1.1.1.1 IpPort = 11111';

Using the re module and re.findall, I think you can get what you want:

    import re
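    # \S+ stops at the first whitespace, so the SID is returned without a trailing space;
    # the MULTILINE and DOTALL flags are not needed for a single line.
    li = re.findall(r'LogonType\s*=\s*\d+|TargetUserName\s*=\s*\w+|TargetUserSid\s*=\s*\S+', st)
    >>> li
    ['TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111', 'TargetUserName = johndoe', 'LogonType = 3']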
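
A possible refinement of the same idea, sketched under the assumption that every wanted value is a single non-whitespace token: build the alternation from a list of field names and capture name and value separately, so the result can go straight into a dictionary. `WANTED`, `PATTERN`, and the shortened `line` are illustrative names and data, not part of the original answer.

```python
import re

# Hypothetical field selection; re.escape guards against special characters in names.
WANTED = ('LogonType', 'TargetUserName', 'TargetUserSid')
PATTERN = re.compile(r'\b(%s) = (\S+)' % '|'.join(map(re.escape, WANTED)))

# Shortened version of the log line from the question.
line = ('SubjectUserSid = S-1-0-0 SubjectUserName = - '
        'TargetUserSid = S-1-1-11-1111111111-1111111111-1111111111-1111 '
        'TargetUserName = johndoe TargetDomainName = TestDomain LogonType = 3')

# One scan of the line; only the wanted fields are captured.
fields = dict(PATTERN.findall(line))
print(fields['TargetUserName'])  # johndoe
print(fields['LogonType'])       # 3
```

This answers the question raised in the comments under the question: a single pattern with alternation can match several fields in one findall call, so there is no need for one re.search() per field.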
