Extracting key-value pairs from a string with quotes

【Question title】: Extracting key value pairs from string with quotes
【Posted】: 2016-12-08 19:13:03
【Question description】:

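I'm having trouble writing an "elegant" parser for this requirement (one that doesn't look like a piece of C breakfast). The input is a string of key-value pairs separated by ',' and joined with '='.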

key1=value1,key2=value2

The part that trips me up is that values can be quoted ("), and a ',' inside the quotes does not end the key-value pair.

key1=value1,key2="value2,still_value2"

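This last part makes it hard for me to use split or re.split, so I end up resorting to "for i in range" loops :(.

Can anyone demonstrate a clean way to do this?

It is OK to assume that quotes appear only in values, and that there is no whitespace or non-alphanumeric characters.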
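
To make the difficulty concrete, here is a minimal sketch of what happens when the second example is split on every comma. The variable names are only illustrative and are not part of the original question.

```python
# Naive splitting on every comma cuts the quoted value in two.
text = 'key1=value1,key2="value2,still_value2"'

naive_items = text.split(',')
print(naive_items)
# ['key1=value1', 'key2="value2', 'still_value2"']

# The last fragment contains no '=', so it cannot become a (key, value) pair.
try:
    pairs = dict(item.split('=') for item in naive_items)
except ValueError as exc:
    pairs = None
    print("naive split failed:", exc)
```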

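【Question discussion】:

Can you post the expected output?
Does the value of key2 in the second example include the quotes? That is, in your example, does key2 map to "value2,still_value2" or "\"value2,still_value2\""?

Tags: python parsing

【Solution 1】:

I'm not sure this avoids looking like a C breakfast, or that it is all that elegant :)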

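data = {}
original = 'key1=value1,key2="value2,still_value2"'
converted = ''

# Replace every comma outside quotes with a newline; quoted commas are kept.
is_open = False
for c in original:
    if c == ',' and not is_open:
        c = '\n'
    elif c in ('"', "'"):
        is_open = not is_open
    converted += c

# Each line now holds one pair; split on the first '=' only,
# so a '=' inside a value does not break the unpacking.
for item in converted.split('\n'):
    k, v = item.split('=', 1)
    data[k] = v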
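
As a possible variation on the loop above, the same quote-tracking idea can be wrapped in a function that collects items in a list instead of inserting newlines. This is only a sketch: the name parse_pairs is made up here, it tracks double quotes only, and stripping the surrounding quotes from values is an assumption (the question does not say whether they should be kept).

```python
def parse_pairs(text):
    """Split on commas outside double quotes, then on the first '=' of each item."""
    items = []
    current = ''
    is_open = False
    for c in text:
        if c == ',' and not is_open:
            # An unquoted comma closes the current item.
            items.append(current)
            current = ''
            continue
        if c == '"':
            is_open = not is_open
        current += c
    items.append(current)

    result = {}
    for item in items:
        k, v = item.split('=', 1)
        # Assumption: the surrounding quotes are not wanted in the value.
        result[k] = v.strip('"')
    return result


print(parse_pairs('key1=value1,key2="value2,still_value2"'))
# {'key1': 'value1', 'key2': 'value2,still_value2'}
```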

【Discussion】:

【Solution 2】:

Using some regex magic from "Split a string, respect and preserve quotes", we can do:

import re

string = 'key1=value1,key2="value2,still_value2"'

key_value_pairs = re.findall(r'(?:[^\s,"]|"(?:\\.|[^"])*")+', string)

for key_value_pair in key_value_pairs:
    key, value = key_value_pair.split("=")

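Per BioGeek's request, here is my attempt to guess at (I mean, interpret) the regex Janne Karila used: the pattern breaks the string on commas while respecting double-quoted sections (which may contain commas). It has two alternatives: runs of characters that involve no quotes, and double-quoted runs of characters, where a double quote ends the run unless it is escaped with a backslash: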

(?:              # parenthesis for alternation (|), not memory
[^\s,"]          # any 1 character except white space, comma or quote
|                # or
"(?:\\.|[^"])*"  # a quoted string containing 0 or more characters
                 # other than quotes (unless escaped)
)+               # one or more of the above
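
As a quick check of that reading, the sketch below runs the same pattern on the question's input and builds a dictionary from the tokens. The names PAIR_PATTERN, tokens and pairs are illustrative only; note that the quotes stay in the value.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
PAIR_PATTERN = re.compile(r'(?:[^\s,"]|"(?:\\.|[^"])*")+')

string = 'key1=value1,key2="value2,still_value2"'

# The pattern has no capturing groups, so findall returns whole matches.
tokens = PAIR_PATTERN.findall(string)
print(tokens)
# ['key1=value1', 'key2="value2,still_value2"']

# Split each token on the first '=' only.
pairs = dict(token.split("=", 1) for token in tokens)
print(pairs)
# {'key1': 'value1', 'key2': '"value2,still_value2"'}
```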

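【Discussion】:

Could you add some explanation of how the regex works?
@BioGeek, I tried to do as you asked; let me know if I succeeded!
I usually avoid re whenever possible, and I manage that in most tasks. Today I stumbled upon a network device log format that uses spaces as field separators, with optionally quoted values, some of which contain spaces (and are quoted). I really liked the shlex approach in the other answer and wanted to use it, but it did not work, while your regex did. Thanks, +1.

【Solution 3】:

I came up with this regex solution: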

import re
match = re.findall(r'([^=]+)=(("[^"]+")|([^,]+)),?', 'key1=value1,key2=value2,key3="value3,stillvalue3",key4=value4')

This makes "match":

[('key1', 'value1', '', 'value1'), ('key2', 'value2', '', 'value2'), ('key3', '"value3,stillvalue3"', '"value3,stillvalue3"', ''), ('key4', 'value4', '', 'value4')]

Then you can use a for loop to get the keys and values:

for m in match:
    key = m[0]
    value = m[1]
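
If a dictionary is the goal, the loop can be folded into a comprehension. This is a sketch under the assumption that the surrounding quotes should be dropped from values; PAIR_RE and result are illustrative names.

```python
import re

PAIR_RE = re.compile(r'([^=]+)=(("[^"]+")|([^,]+)),?')
text = 'key1=value1,key2=value2,key3="value3,stillvalue3",key4=value4'

# findall returns one tuple per pair: index 0 is the key, index 1 the whole value.
# Assumption: the surrounding quotes are not wanted, so they are stripped.
result = {m[0]: m[1].strip('"') for m in PAIR_RE.findall(text)}
print(result)
# {'key1': 'value1', 'key2': 'value2', 'key3': 'value3,stillvalue3', 'key4': 'value4'}
```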

【Discussion】:

【Solution 4】:

Based on several other answers, I came up with the following solution:

import re
import itertools

data = 'key1=value1,key2="value2,still_value2"'

# Based on Alan Moore's answer on http://stackoverflow.com/questions/2785755/how-to-split-but-ignore-separators-in-quoted-strings-in-python
def split_on_non_quoted_equals(string):
    return re.split('''=(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''', string)
def split_on_non_quoted_comma(string):
    return re.split(''',(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''', string)

split1 = split_on_non_quoted_equals(data)
split2 = map(lambda x: split_on_non_quoted_comma(x), split1)

# 'Unpack' the sublists in to a single list. Based on Alex Martelli's answer on http://stackoverflow.com/questions/952914/making-a-flat-list-out-of-list-of-lists-in-python
flattened = [item for sublist in split2 for item in sublist]

# Convert alternating elements of a list into keys and values of a dictionary. Based on Sven Marnach's answer on http://stackoverflow.com/questions/6900955/python-convert-list-to-dictionary
# itertools.zip_longest is the Python 3 name (it was izip_longest in Python 2).
d = dict(itertools.zip_longest(*[iter(flattened)] * 2, fillvalue=""))

The resulting d is the following dictionary:

{'key1': 'value1', 'key2': '"value2,still_value2"'}
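
The lookahead in those two split functions is the core of this approach, so here is a small sketch of it in isolation. The name UNQUOTED_COMMA and the sample input are illustrative; the pattern itself is the one used in split_on_non_quoted_comma.

```python
import re

# A comma is treated as a separator only if everything after it, up to the end
# of the string, is made of plain characters and complete quoted sections.
# A comma inside quotes is followed by an unmatched quote, so the lookahead fails.
UNQUOTED_COMMA = re.compile(r''',(?=(?:[^'"]|'[^']*'|"[^"]*")*$)''')

parts = UNQUOTED_COMMA.split('a,"b,c",d')
print(parts)
# ['a', '"b,c"', 'd']
```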

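【Discussion】:

【Solution 5】:

I would suggest not using regular expressions for this task, because the language you want to parse is not regular.

You have a string containing multiple key-value pairs. The best way to parse it is not to match patterns against it, but to tokenize it properly.

The Python standard library has a module called shlex, which mimics the parsing done by POSIX shells and provides a lexer implementation that can easily be customized to your needs.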

from shlex import shlex

def parse_kv_pairs(text, item_sep=",", value_sep="="):
    """Parse key-value pairs from a shell-like text."""
    # initialize a lexer, in POSIX mode (to properly handle escaping)
    lexer = shlex(text, posix=True)
    # set ',' as whitespace for the lexer
    # (the lexer will use this character to separate words)
    lexer.whitespace = item_sep
    # include '=' as a word character 
    # (this is done so that the lexer returns a list of key-value pairs)
    # (if your option key or value contains any unquoted special character, you will need to add it here)
    lexer.wordchars += value_sep
    # then we separate option keys and values to build the resulting dictionary
    # (maxsplit is required to make sure that '=' in value will not be a problem)
    return dict(word.split(value_sep, maxsplit=1) for word in lexer)

Example run:

parse_kv_pairs(
  'key1=value1,key2=\'value2,still_value2,not_key1="not_value1"\''
)

Output:

{'key1': 'value1', 'key2': 'value2,still_value2,not_key1="not_value1"'}

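EDIT: I forgot to add that the reason I usually stick with shlex rather than regular expressions (which are faster in this case) is that it gives you fewer surprises, especially if you later need to allow more kinds of input. I have never found a way to parse such key-value pairs correctly with regular expressions; there will always be inputs (for example: A="B=\"1,2,3\"") that trick the engine.

If you do not care about such inputs (or, put another way, if you can ensure that your input follows the definition of a regular language), then regular expressions are perfectly fine.

EDIT2: split has a maxsplit argument, which is much cleaner to use than splitting/slicing/joining. Thanks to @cdlane for the sound input!

【Discussion】:

I believe shlex is a solid production solution, and this is a fine example of how to adapt it to the problem at hand. However, this answer loses all elegance for me in its return statement: split() the same data twice and then join() to clean up after the excessive split(), just so you can use a dictionary comprehension? How about return dict(word.split(value_sep, maxsplit=1) for word in lexer)
Yes, that is much better. I forgot about the maxsplit argument when writing it, and adding support for = inside values did make it less elegant. Thanks for the suggestion, I have edited the answer.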
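
For reference, here is a sketch of how the shlex-based function handles the tricky input mentioned above. The function is repeated without its comments so the snippet runs on its own; the variable name tricky is illustrative.

```python
from shlex import shlex


def parse_kv_pairs(text, item_sep=",", value_sep="="):
    """Same approach as the answer's function, repeated so this runs alone."""
    lexer = shlex(text, posix=True)   # POSIX mode handles quotes and escapes
    lexer.whitespace = item_sep       # ',' separates items
    lexer.wordchars += value_sep      # keep 'key=value' together as one token
    return dict(word.split(value_sep, maxsplit=1) for word in lexer)


# The input A="B=\"1,2,3\"" written as a raw string literal.
tricky = r'A="B=\"1,2,3\""'
print(parse_kv_pairs(tricky))
# {'A': 'B="1,2,3"'}
```

In POSIX mode the lexer removes the outer quotes and resolves the backslash escapes, so the inner quotes, commas and '=' all end up inside the single value.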