Parse non-standard semicolon separated "JSON"

【Question title】: Parse non-standard semicolon separated "JSON"
【Posted】: 2016-08-13 22:40:55
【Question description】:

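I have a non-standard "JSON" file to parse. Each item is separated by a semicolon instead of a comma. I can't simply replace ; with , because some values may themselves contain ;, e.g. "hello; world". How can I parse this into the same structure that JSON would normally produce?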

{
  "client" : "someone";
  "server" : ["s1"; "s2"];
  "timestamp" : 1000000;
  "content" : "hello; world";
  ...
}

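【Question comments】:

How did this abomination come about?
Is the ; separator always at the end of a line?
Is there only a single object in the JSON?
That's not "non-standard JSON", that's not JSON. Find out what it is, and get a parser for that.

Tags: python json parsing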

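【Solution 1】:

Use the Python tokenize module to transform the text stream into one with commas instead of semicolons. The Python tokenizer is happy to handle JSON input too, even including the semicolons. The tokenizer presents each string as a whole token, and the "raw" semicolons appear in the stream as single tokenize.OP tokens for you to replace: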

import tokenize
import json

corrected = []

with open('semi.json', 'r') as semi:
    for token in tokenize.generate_tokens(semi.readline):
        if token[0] == tokenize.OP and token[1] == ';':
            corrected.append(',')
        else:
            corrected.append(token[1])

data = json.loads(''.join(corrected))

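This assumes that the format becomes valid JSON once the semicolons have been replaced with commas; for example, no trailing commas are allowed before a closing ] or }. You could, however, keep track of the last comma added and remove it again if the next non-newline token is a closing bracket or brace.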
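The trailing-comma idea mentioned above could be sketched as follows. This is only an illustration, not part of the original answer: the function name `semicolons_to_commas` and the `pending_comma` bookkeeping are made up for this example, and the text is read from a string through `io.StringIO` rather than from a file.

```python
import io
import json
import tokenize


def semicolons_to_commas(text):
    """Replace bare ';' tokens with ',' and drop a replaced comma that
    would end up directly before a closing ']' or '}'."""
    out = []
    pending_comma = None  # index in `out` of a comma that may be trailing
    for tok in tokenize.generate_tokens(io.StringIO(text).readline):
        if tok.type == tokenize.OP and tok.string == ';':
            pending_comma = len(out)
            out.append(',')
            continue
        if (tok.type == tokenize.OP and tok.string in ('}', ']')
                and pending_comma is not None):
            # the comma we just added turned out to be a trailing one
            out[pending_comma] = ''
        if tok.type not in (tokenize.NL, tokenize.NEWLINE):
            # any token other than a line break settles the pending comma
            pending_comma = None
        out.append(tok.string)
    return ''.join(out)


sample = '''{
  "client" : "someone";
  "server" : ["s1"; "s2";];
  "content" : "hello; world";
}
'''
data = json.loads(semicolons_to_commas(sample))
```

Only commas that this function itself added are ever removed; a trailing comma already present in the input would still be rejected by `json.loads`.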

Demo (a Python 2 session):

>>> import tokenize
>>> import json
>>> open('semi.json', 'w').write('''\
... {
...   "client" : "someone";
...   "server" : ["s1"; "s2"];
...   "timestamp" : 1000000;
...   "content" : "hello; world"
... }
... ''')
>>> corrected = []
>>> with open('semi.json', 'r') as semi:
...     for token in tokenize.generate_tokens(semi.readline):
...         if token[0] == tokenize.OP and token[1] == ';':
...             corrected.append(',')
...         else:
...             corrected.append(token[1])
...
>>> print ''.join(corrected)
{
"client":"someone",
"server":["s1","s2"],
"timestamp":1000000,
"content":"hello; world"
}
>>> json.loads(''.join(corrected))
{u'content': u'hello; world', u'timestamp': 1000000, u'client': u'someone', u'server': [u's1', u's2']}

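The whitespace between tokens has been dropped, but it could be reinstated by paying attention to the tokenize.NL tokens and to the (lineno, start) and (lineno, end) position tuples that are part of each token. Since whitespace around tokens doesn't matter to a JSON parser, I didn't bother with this.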
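If the original layout does matter, one possible shortcut, which is not what the answer itself does, is to hand the full tokens back to `tokenize.untokenize`, which uses those same position tuples to restore the spacing. The sketch below assumes Python 3, where tokens are named tuples; the function name `fix_preserving_layout` is hypothetical.

```python
import io
import json
import tokenize


def fix_preserving_layout(text):
    """Swap bare ';' tokens for ',' while keeping each token's position
    information, so untokenize() can rebuild the original spacing."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(text).readline):
        if tok.type == tokenize.OP and tok.string == ';':
            # same length as the original token, so positions stay valid
            tok = tok._replace(string=',')
        tokens.append(tok)
    return tokenize.untokenize(tokens)


sample = '''{
  "client" : "someone";
  "server" : ["s1"; "s2"];
  "timestamp" : 1000000;
  "content" : "hello; world"
}
'''
fixed = fix_preserving_layout(sample)
data = json.loads(fixed)
```

Because ';' and ',' have the same width, no positions need to be recomputed after the replacement.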

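【Discussion】:

【Solution 2】:

You can do something odd and (probably) get it right.

Because JSON strings cannot contain control characters such as \t, you can replace every ; with \t, (a tab followed by a comma). The file will then parse correctly, provided your JSON parser is able to load non-strict JSON (as Python's can).

After that, you only need to convert the data back to JSON, so that you can replace every \t, with ; again, and then use a normal JSON parser to finally load the correct object.

Some sample code in Python: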

data = '''{
  "client" : "someone";
  "server" : ["s1"; "s2"];
  "timestamp" : 1000000;
  "content" : "hello; world"
}'''

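import json
# strict=False accepts the literal tabs that now sit inside string values
dec = json.JSONDecoder(strict=False).decode(data.replace(';', '\t,'))
# re-encoding writes those tabs as the two-character escape \t
enc = json.dumps(dec)
# turn each escaped '\t,' back into ';' and parse the result normally
out = json.loads(enc.replace('\\t,', ';'))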

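【Discussion】:

【Solution 3】:

Using a simple character state machine, you can convert this text back to valid JSON. The basic thing we need to handle is determining the current "state" (whether we are escaping a character, or inside a string, list, dictionary, etc.), and replacing ';' with ',' when in a certain state.

I don't know whether this is the proper way to write it. There is probably a way to make it shorter, but I don't have enough programming skill to produce an optimal version.

I have commented it as much as I could: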

def filter_characters(text):
    # we use this dictionary to match opening/closing tokens
    STATES = {
        '"': '"', "'": "'",
        "{": "}", "[": "]"
    }

    # these two variables represent the current state of the parser
    escaping = False
    state = list()

    # we iterate through each character
    for c in text:
        if escaping:
            # if we are currently escaping, no special treatment
            escaping = False
        else:
            if c == "\\":
                # character is a backslash, set the escaping flag for the next character
                escaping = True
            elif state and c == state[-1]:
                # character is expected closing token, update state
                state.pop()
            elif c in STATES:
                # character is known opening token, update state
                state.append(STATES[c])
            elif c == ';' and state == ['}']:
                # this is the delimiter we want to change
                c = ','
        yield c

    assert not state, "unexpected end of file"

def filter_text(text):
    return ''.join(filter_characters(text))

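Test, showing the input followed by the filtered output. Note that the ';' inside the nested list is left unchanged, because the filter only replaces ';' when the state stack is exactly ['}']: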

{
  "client" : "someone";
  "server" : ["s1"; "s2"];
  "timestamp" : 1000000;
  "content" : "hello; world";
  ...
}

{
  "client" : "someone",
  "server" : ["s1"; "s2"],
  "timestamp" : 1000000,
  "content" : "hello; world",
  ...
}
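If semicolons inside nested lists and objects should be converted as well, a variant that only tracks whether the current character sits inside a double-quoted string is enough. The following is a minimal sketch and not the answer's own code: the name `replace_bare_semicolons` is made up, only JSON-style double-quoted strings are recognised, and a trailing ';' before a closing bracket is not removed.

```python
import json


def replace_bare_semicolons(text):
    """Replace every ';' that is not inside a double-quoted string."""
    out = []
    in_string = False
    escaping = False
    for c in text:
        if in_string:
            if escaping:
                # the character after a backslash is taken literally
                escaping = False
            elif c == '\\':
                escaping = True
            elif c == '"':
                in_string = False
        elif c == '"':
            in_string = True
        elif c == ';':
            # a bare delimiter, at any nesting depth
            c = ','
        out.append(c)
    return ''.join(out)


sample = '''{
  "client" : "someone";
  "server" : ["s1"; "s2"];
  "timestamp" : 1000000;
  "content" : "hello; \\"world\\""
}'''
data = json.loads(replace_bare_semicolons(sample))
```

Since the nesting depth no longer matters, no stack is needed; the price is that the input must not end a list or object with a ';'.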

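【Discussion】:

【Solution 4】:

Pyparsing makes it easy to write string transformers. Write an expression for the string to be changed, and add a parse action (a parse-time callback) that replaces the matched text with what you want. If there are cases you need to avoid (such as quoted strings or comments), include them in the scanner too, but leave them unchanged. Then, to actually transform the string, call scanner.transformString.

(It was not clear from your example whether you might have a ';' after the last element of one of your bracketed lists, so I added a term to suppress these, since a trailing ',' in a bracketed list is also invalid JSON.)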

sample = """
{
  "client" : "someone";
  "server" : ["s1"; "s2"];
  "timestamp" : 1000000;
  "content" : "hello; world";
}"""


from pyparsing import Literal, replaceWith, Suppress, FollowedBy, quotedString
import json

SEMI = Literal(";")
repl_semi = SEMI.setParseAction(replaceWith(','))
term_semi = Suppress(SEMI + FollowedBy('}'))
qs = quotedString

scanner = (qs | term_semi | repl_semi)
fixed = scanner.transformString(sample)
print(fixed)
print(json.loads(fixed))

Prints:

{
  "client" : "someone",
  "server" : ["s1", "s2"],
  "timestamp" : 1000000,
  "content" : "hello; world"}
{'content': 'hello; world', 'timestamp': 1000000, 'client': 'someone', 'server': ['s1', 's2']}

【Discussion】: