Retrieving JSON objects from a text file (using Python): answers

【Question title】: Retrieving JSON objects from a text file (using Python)
【Posted】: 2012-02-02 13:06:55
【Question】:

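I have thousands of text files containing multiple JSON objects, but unfortunately there is no delimiter between the objects. The objects are stored as dictionaries, and some of their fields are themselves objects. Each object may have a variable number of nested objects. Concretely, an object might look like this: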

{field1: {}, field2: "some value", field3: {}, ...}

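Hundreds of such objects are concatenated in a text file with no delimiter. This means I can use neither json.load() nor json.loads().

Any suggestions on how to solve this? Is there a known parser that does this?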

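【Comments】:

Are they at least separated onto different lines, or is it just one long single line of {...}{...}{...} piled together?
No, that is exactly the problem: it is just one long single line.
Could you use str.replace to add a delimiter? For example: single_line_json.replace('}{', '}\n{')
If you need a faster solution, you can avoid building a large list of objects by switching to a generator: while end != s_len: obj, end = decoder.raw_decode(s, idx=end); yield obj.

Tags: python json object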

【Solution 1】:

This decodes your "list" of JSON objects from a string:

from json import JSONDecoder

def loads_invalid_obj_list(s):
    decoder = JSONDecoder()
    s_len = len(s)

    objs = []
    end = 0
    while end != s_len:
        obj, end = decoder.raw_decode(s, idx=end)
        objs.append(obj)

    return objs

The bonus here is that you play nicely with the parser, so it keeps telling you exactly where it found an error.

Examples

>>> loads_invalid_obj_list('{}{}')
[{}, {}]

>>> loads_invalid_obj_list('{}{\n}{')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "decode.py", line 9, in loads_invalid_obj_list
    obj, end = decoder.raw_decode(s, idx=end)
  File     "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 2 column 2 (char 5)

Clean solution (added later)

import json
import re

#shameless copy paste from json/decoder.py
FLAGS = re.VERBOSE | re.MULTILINE | re.DOTALL
WHITESPACE = re.compile(r'[ \t\n\r]*', FLAGS)

class ConcatJSONDecoder(json.JSONDecoder):
    def decode(self, s, _w=WHITESPACE.match):
        s_len = len(s)

        objs = []
        end = 0
        while end != s_len:
            obj, end = self.raw_decode(s, idx=_w(s, end).end())
            end = _w(s, end).end()
            objs.append(obj)
        return objs

Examples

>>> print json.loads('{}', cls=ConcatJSONDecoder)
[{}]

>>> print json.load(open('file'), cls=ConcatJSONDecoder)
[{}]

>>> print json.loads('{}{} {', cls=ConcatJSONDecoder)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 339, in loads
    return cls(encoding=encoding, **kw).decode(s)
  File "decode.py", line 15, in decode
    obj, end = self.raw_decode(s, idx=_w(s, end).end())
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 376, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting object: line 1 column 5 (char 5)
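The generator variant suggested in the question comments can be sketched as follows. This is a minimal Python 3 sketch, not part of the original answer: the name iter_concatenated_json is made up for illustration, and it skips the same four whitespace characters as the WHITESPACE pattern above so that objects separated by spaces or newlines are also accepted.

```python
import json

def iter_concatenated_json(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while True:
        # Skip whitespace between (or after) the concatenated objects.
        while pos < end and text[pos] in " \t\n\r":
            pos += 1
        if pos == end:
            return
        # raw_decode parses one value starting at pos and returns the
        # value together with the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj
```

Because objects are yielded one at a time, the caller can process a large file without holding the whole list of decoded objects in memory (the raw text itself is still read in full).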

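【Comments】:

Really cool. I was hoping the json module had something like this, and it does. This is perfect. Thanks!

【Solution 2】:

Sebastian Blask has the right idea, but there is no reason to use regular expressions for such a simple change.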

objs = json.loads("[%s]"%(open('your_file.name').read().replace('}{', '},{')))

Or, more legibly:

raw_objs_string = open('your_file.name').read() #read in raw data
raw_objs_string = raw_objs_string.replace('}{', '},{') #insert a comma between each object
objs_string = '[%s]'%(raw_objs_string) #wrap in a list, to make valid json
objs = json.loads(objs_string) #parse json

【Comments】:

【Solution 3】:

How about something like this:

import re
import json

jsonstr = open('test.json').read()

# Put each object on its own line
# (assumes the objects contain no newlines and no "}{" inside string values)
p = re.compile(r'}\s*{')
jsonstr = p.sub('}\n{', jsonstr)

jsonarr = jsonstr.split('\n')

for jsonstr in jsonarr:
    jsonobj = json.loads(jsonstr)
    print(json.dumps(jsonobj))

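【Comments】:

【Solution 4】:

Solution

As far as I know, }{ does not appear in valid JSON, so the following should be perfectly safe when trying to get the strings for the separate concatenated objects (txt is the content of your file). It does not require any import (not even the re module) to do that: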

retrieved_strings = map(lambda x: '{'+x+'}', txt.strip('{}').split('}{'))

Or, if you prefer a list comprehension (as David Zwicker mentioned in the comments), you can use it like this:

retrieved_strings = ['{'+x+'}' for x in txt.strip('{}').split('}{')]

This makes retrieved_strings a list of strings, each containing a separate JSON object. See the proof here: http://ideone.com/Purpb

Example

The following string:

'{field1:"a",field2:"b"}{field1:"c",field2:"d"}{field1:"e",field2:"f"}'

will be turned into:

['{field1:"a",field2:"b"}', '{field1:"c",field2:"d"}', '{field1:"e",field2:"f"}']

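As the example I mentioned demonstrates.

【Comments】:

This should be done with a list comprehension: retrieved_strings = ['{'+x+'}' for x in txt.strip('{}').split('}{')]
@DavidZwicker: Why? Are you one of the people who consider the map() function deprecated? It is perfectly valid. It may look simpler, though, so I will add it to my answer.
Valid JSON containing }{: '{"f1" : "}{}{", "b" : "{{}{}}{{{}{}"}'
@Tadeck: See stackoverflow.com/questions/1247486/… for a discussion of map versus list comprehensions. Actually, I use map myself sometimes, but only when the function already exists. Combining lambda with map does not make much sense to me.
@soulcheck: +1, very good! It can still be solved, but now you need to check whether the }{ sequence occurs inside quotes...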
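The counterexample raised in the comments is easy to check. The sketch below uses a made-up two-object input whose first object has a string value containing }{, and compares the split-based approach with JSONDecoder.raw_decode, which tracks string boundaries itself. The variable names are illustrative only.

```python
import json

# Two concatenated objects; the first has "}{" inside a string value.
txt = '{"f1": "}{"}{"f2": 2}'

# Split-based approach from the answer: it also cuts inside the string.
naive = ['{' + x + '}' for x in txt.strip('{}').split('}{')]

def is_valid_json(s):
    """Return True if s parses as a single JSON document."""
    try:
        json.loads(s)
        return True
    except ValueError:
        return False

naive_valid = [is_valid_json(s) for s in naive]

# Parser-based approach: raw_decode returns one value plus the index
# where it ended, so a "}{" inside a string is never mistaken for a boundary.
decoder = json.JSONDecoder()
objs, pos = [], 0
while pos < len(txt):
    obj, pos = decoder.raw_decode(txt, pos)
    objs.append(obj)
```

Here naive ends up with three fragments, of which only the last is valid JSON, while objs holds the two original objects.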

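【Solution 5】:

Why not load the file as a string, replace every }{ with },{ and surround the whole thing with []? Something like:

re.sub(r'\}\s*?\{', '}, {', string_read_from_a_file)

Or use a simple string replace if you are sure you always have }{ with no whitespace in between.

If you expect }{ to occur inside strings as well, you could also split on }{ and evaluate each fragment with json.loads; if you get an error, the fragment is incomplete and you have to append the next fragment to it, and so on.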
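The split-and-retry idea in the last paragraph could look roughly like this. It is only a sketch under a few assumptions: the function name split_and_retry is made up, every top-level value is an object, and there is no whitespace between one object's closing brace and the next object's opening brace.

```python
import json

def split_and_retry(text):
    """Split on '}{' and glue fragments back together until each one parses."""
    pieces = text.split('}{')
    last = len(pieces) - 1
    objs = []
    buf = None  # text of the object currently being assembled
    for i, piece in enumerate(pieces):
        if buf is not None:
            # The previous '}{' was inside the object, so put it back.
            buf += '}{' + piece
        elif i == 0:
            buf = piece            # the first piece still has its '{'
        else:
            buf = '{' + piece      # restore the '{' removed by split
        # Every piece except the last lost its closing '}' to split.
        candidate = buf if i == last else buf + '}'
        try:
            objs.append(json.loads(candidate))
        except ValueError:
            continue               # incomplete: append the next piece
        buf = None
    if buf is not None:
        raise ValueError('input ended inside an incomplete JSON object')
    return objs
```

A fragment that ends inside a string value fails to parse because the string is left unterminated, which is what triggers the retry with the next fragment.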

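【Comments】:

Cool! That is clever and easy to do. I will try it and report back. Thanks!
What happens if you have the string '}{' somewhere else, such as in an attribute value? For example: '{"field1" : "}{123", "field2" : "123"}'

【Solution 6】:

How about reading the file, incrementing a counter every time you find a { and decrementing it when you encounter a }? When the counter reaches 0, you know you have reached the end of the first object, so send that through json.loads and start counting again. Then repeat until done.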
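A minimal sketch of this counting approach follows. It adds one thing the description above does not mention: braces inside string values (including strings with escaped quotes) have to be ignored, otherwise the counter goes wrong. The function name split_by_brace_depth is made up for illustration.

```python
import json

def split_by_brace_depth(text):
    """Cut text into top-level {...} chunks by tracking brace depth."""
    objs = []
    depth = 0
    start = None
    in_string = False   # currently inside a JSON string literal?
    escaped = False     # was the previous character a backslash in a string?
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == '\\':
                escaped = True
            elif ch == '"':
                in_string = False
            continue    # braces inside strings do not count
        if ch == '"':
            in_string = True
        elif ch == '{':
            if depth == 0:
                start = i          # a new top-level object begins here
            depth += 1
        elif ch == '}':
            if depth == 0:
                raise ValueError('unbalanced closing brace at index %d' % i)
            depth -= 1
            if depth == 0:
                # Counter back at zero: one complete object.
                objs.append(json.loads(text[start:i + 1]))
    return objs
```

Compared with the raw_decode loop in Solution 1, this scans every object twice (once to find its end, once to parse it), so it is mainly useful for seeing how the boundaries are found.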

【Comments】:

【Solution 7】:

import json

# Assumes each line of the file holds exactly one JSON object
with open('filepath', 'r') as file1:
    for line in file1:
        values = json.loads(line)
        # Now you can access the fields of each object using values.get('key')

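【Comments】:

【Solution 8】:

Suppose you added a [ to the start of the text in the file and used a version of json.load() that, when it detects the error of finding a { instead of the expected comma (or hits the end of the file), spits out the object it has just completed?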

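【Comments】:

Oh, I see what you mean. Are you suggesting using try/except and then splitting at the column index it reports? I tried it quickly and got the exception "Expecting , delimiter: line 1 column 1332 (char 1332)". That is workable. I was just hoping there was a parser out there, since this seems like something that might come up. But thanks for the suggestion.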
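A variant of this try/except idea can be sketched with the json module in Python 3.5 or later, where json.JSONDecodeError exposes msg and pos attributes. Note that this is not exactly what the answer proposes: instead of prefixing a [ and catching the missing-comma error, it parses the raw text and splits at the position reported by the "Extra data" error. The function name split_on_extra_data is made up, and relying on the exact message text is an assumption about CPython's implementation.

```python
import json

def split_on_extra_data(text):
    """Peel objects off the front of text using the decoder's error position."""
    objs = []
    rest = text.strip()
    while rest:
        try:
            objs.append(json.loads(rest))
            break                      # the remainder was a single document
        except json.JSONDecodeError as e:
            if e.msg != "Extra data":
                raise                  # a genuine syntax error
            # e.pos is where the first complete document (plus any
            # trailing whitespace) ended and the extra data began.
            objs.append(json.loads(rest[:e.pos]))
            rest = rest[e.pos:].lstrip()
    return objs
```

Each failed attempt parses the whole remaining text, so this is quadratic in the worst case; the raw_decode loop in Solution 1 gets the same end positions without re-parsing.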

【Solution 9】:

Rewrite in place a file that contains this junk:

$ sed -i -e 's;}{;}, {;g' foo

Do it on the fly in Python:

junkJson.replace('}{', '}, {')

【Comments】: