How to generate regex patterns in Python using re.compile

【Question title】: How to generate regex patterns in python using re.compile
【Posted】: 2019-10-31 05:03:23
【Question description】:

I am trying to write Python code that uses a regular expression to extract information from strings such as the one below.

date=2019-10-26 time=17:59:00 logid="0000000020" type="traffic" subtype="forward" level="notice" vd="root" eventtime=1572127141 srcip=192.168.6.15 srcname="TR" srcport=522 srcintf="port1" srcintfrole="lan" dstip=172.217.15.194 dstport=43 dstintf="wan2" dstintfrole="wan" poluuid="feb1fa32-d08b-51e7-071f-19e3b5d2213c" sessionid =195421734 proto=6 action="accept" policyid=4 policytype="policy" service="HTTPS" dstcountry="United States" srccountry="Reserved" trandisp="snat" transip=168.168.140.247 transport=294 appid=537 app="Google.Ads" appcat="General.Interest" apprisk="elevated" applist="Seniors" appact="detected" duration=719 sentbyte=2691 rcvdbyte=2856 sentpkt=19 rcvdpkt=25 shapepolicyid=1 sentdelta=449 rcvddelta=460 devtype="Linux" devcategory="Linux" mastersrcmac="fa:cc:4e:a3:56:2d" srcmac="fa:cc:4e:a3:56:2d" srcserver=0

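I found someone's code on GitHub that uses the line below to extract the information. However, that code does not extract all the fields I need, most notably srcip=192.168.1.105

I don't want to post that person's entire code, because it isn't mine. I can if needed, though.

I want to extract every field from this jumble of information so that I can save the fields to a .csv file.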

【Question comments】:

Could you add the code or pattern you tried to the question and specify what you want it to match? Please see How do I ask a good question?

Tags: regex python-3.x

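【Solution 1】:

The regex \w+=([^\s"]+|"[^"]*") matches

the field name (at least one word character), then
an = sign, then
either:
- an unquoted field value (at least one character, excluding whitespace and quotes), or
- a quoted field value (a ", then any number of non-quote characters, then a closing ").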
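To make the breakdown above easier to follow, here is a minimal sketch of the same pattern written with re.VERBOSE, so that each part carries its own comment. The names FIELD, sample and values are illustrative only, and the sample string is a shortened excerpt of the log line in the question.

```python
import re

# The pattern described above, spelled out with re.VERBOSE.
# FIELD is a name chosen for this sketch, not part of the answer.
FIELD = re.compile(r'''
    \w+            # field name: one or more word characters
    =              # a literal equals sign
    (              # the value, which is either
      [^\s"]+      #   unquoted: one or more characters, no whitespace or quotes
      |            #   or
      "[^"]*"      #   quoted: a quote, any run of non-quotes, a closing quote
    )
''', re.VERBOSE)

sample = 'srcip=192.168.6.15 srcname="TR" dstcountry="United States"'

# group(1) is the whole value; quoted values still include their quotes here.
values = [m.group(1) for m in FIELD.finditer(sample)]
print(values)
```

With only one capturing group around the value, quoted values come back with their quotation marks still attached, and the field name is not captured at all.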

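By adding parentheses around the part of the regex that matches the field name, and around the unquoted and the quoted value, we can use the findall method to extract the relevant parts and put them into a dictionary: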

import re

pattern = re.compile(r'(\w+)=(([^\s"]+)|"([^"]*)")')
def parse_fields(text):
    return {
        name: (value or quoted_value)
        for name,_,value,quoted_value in pattern.findall(text)
    }
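Since the question asks for a .csv file, the following is one possible sketch of feeding parse_fields into csv.DictWriter. The two log lines are shortened, made-up examples in the same key=value format, and the names log_lines, rows, fieldnames, buffer and csv_text are assumptions of this sketch. It writes to an in-memory buffer so it can run anywhere; a real file opened with newline='' should behave the same way.

```python
import csv
import io
import re

pattern = re.compile(r'(\w+)=(([^\s"]+)|"([^"]*)")')

def parse_fields(text):
    # findall yields (name, whole value, unquoted value, quoted value).
    return {
        name: (value or quoted_value)
        for name, _, value, quoted_value in pattern.findall(text)
    }

# Hypothetical sample input: one log record per string.
log_lines = [
    'date=2019-10-26 srcip=192.168.6.15 dstcountry="United States"',
    'date=2019-10-27 srcip=192.168.6.16 service="HTTPS"',
]

rows = [parse_fields(line) for line in log_lines]

# Collect every field name seen, in first-seen order, for the CSV header.
fieldnames = list(dict.fromkeys(name for row in rows for name in row))

# restval fills in fields that a given record does not have.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval='')
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Building the header from all records, rather than from the first one only, keeps the writer from raising an error when a later record contains a field that an earlier one lacks.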

【Comments】:

Thanks. I used this line to solve the problem: pattern = re.compile( '(\w+)(?:=)(?:([^\s,""]+|"(?:\\.|[^" "])*"))|(\w+)=(?:([\w\-\.:\=]+))')

【Solution 2】:

Same as kaya3's answer, but I don't keep the quotes

s = '''date=2019-10-26 time=17:59:00 logid="0000000020" type="traffic"
subtype="forward" level="notice" vd="root" eventtime=1572127141
srcip=192.168.6.15 srcname="TR" srcport=522 srcintf="port1" srcintfrole="lan"
dstip=172.217.15.194 dstport=43 dstintf="wan2" dstintfrole="wan"
poluuid="feb1fa32-d08b-51e7-071f-19e3b5d2213c" sessionid=195421734 proto=6
action="accept" policyid=4 policytype="policy" service="HTTPS"
dstcountry="United States" srccountry="Reserved" trandisp="snat"
transip=168.168.140.247 transport=294 appid=537 app="Google.Ads"
appcat="General.Interest" apprisk="elevated" applist="Seniors"
appact="detected" duration=719 sentbyte=2691 rcvdbyte=2856 sentpkt=19
rcvdpkt=25 shapingpolicyid=1 sentdelta=449 rcvddelta=460 devtype="Linux"
devcategory="Linux" mastersrcmac="fa:cc:4e:a3:56:2d" srcmac="fa:cc:4e:a3:56:2d"
srcserver=0'''

import re

matches = re.findall(r'([a-zA-Z_][a-zA-Z0-9_]*)=(?:"([^"]+)"|(\S+))', s)

d = {
    name: quoted or unquoted
    for name, quoted, unquoted in matches
}
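One edge case may be worth checking before relying on this pattern: because the quoted branch uses [^"]+, an empty quoted value such as srcname="" does not match it and falls through to \S+, which keeps the two quote characters. The sketch below compares the pattern above with a hypothetical variant that uses [^"]* instead. The names ORIGINAL, VARIANT and to_dict and the sample string are made up for illustration, and the log line in the question contains no empty values.

```python
import re

# Pattern from the answer above: a quoted value needs at least one character.
ORIGINAL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)=(?:"([^"]+)"|(\S+))')

# Hypothetical variant: [^"]* lets the quoted branch match "" as well.
VARIANT = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)=(?:"([^"]*)"|(\S+))')

def to_dict(pattern, text):
    # Each match is (name, quoted value, unquoted value); one of the two is ''.
    return {
        name: quoted or unquoted
        for name, quoted, unquoted in pattern.findall(text)
    }

sample = 'srcip=192.168.6.15 dstcountry="United States" srcname=""'

original_result = to_dict(ORIGINAL, sample)
variant_result = to_dict(VARIANT, sample)

print(repr(original_result['srcname']))  # the two quote characters
print(repr(variant_result['srcname']))   # an empty string
```

For non-empty values the two patterns give the same dictionary, so the variant only matters if the logs can contain empty quoted fields.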

【Comments】: