【Question Title】: How to parse a BIG JSON file in Python
【Posted】: 2026-01-22 21:05:01
【Question】:

I am working with a very large dataset and have run into a problem I can't find an answer to anywhere. I am trying to parse JSON data; here is what I do on a slice of the full dataset:

import json

s = set()

with open("data.raw", "r") as f:

    for line in f:
        d = json.loads(line)

Confusingly, when I apply this code to my main data (about 200 GB), it raises the following error (it does not run out of memory):

    d = json.loads(line)
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)

type(f) = TextIOWrapper, if that helps… but the file object has the same type for the small dataset, which works fine…

Here are a few lines of my data so you can see the format:

{"MessageType": "SALES.CONTRACTS.SALESTATUSCHANGED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-0801743-2330650"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:57.9681547", "Source": 1}, "OldStatus": {"Status": 3, "AutoRemoveInfo": null}, "NewStatus": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 1193}, "Sku": {"Sku": "Con BM20"}, "Quantity": 1, "UnitPrice": {"amount": 11.92, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "", "Class": "", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-13T13:39:57", "PendingItems": null, "Kits": null, "Products": null, "OldSaleDate": "0001-01-01T00:00:00", "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALESHIPPINGINFOCHANGED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-0801743-2330650"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:57.9681547", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 1193}, "Sku": {"Sku": "Con BM20"}, "Quantity": 1, "UnitPrice": {"amount": 11.92, "currency": 840}}], "FulfilledItems": []}, "OldShippingInfo": {"Carrier": "", "Class": "", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "NewShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "0001-01-01T00:00:00", "PendingItems": null, "Kits": null, "Products": null, "OldSaleDate": "0001-01-01T00:00:00", "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALECREATED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-4851828-6514632"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:58.1402505", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 9223372036854775807}, "Sku": {"Sku": "NFL Blanket Seahawks"}, "Quantity": 1, "UnitPrice": {"amount": 22.99, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-13T15:51:12", "Kits": null, "Products": null, "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALECREATED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-102-3824485-2270645"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:58.3436109", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 9223372036854775807}, "Sku": {"Sku": "NFL CD Wallet Chargers"}, "Quantity": 1, "UnitPrice": {"amount": 12.99, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-12T02:49:58", "Kits": null, "Products": null, "AdditionalSaleInfo": null}}

It is JSON: I have already parsed the first 2000 lines and they work fine. But when I try the same procedure on the large file, it reports an error starting from the first lines of the data.
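One common cause of "Expecting value: line 2 column 1" on newline-delimited data like this is a blank line between records, which is not a valid JSON value. A minimal sketch that tolerates blank lines (the inline sample string and its field names are hypothetical, standing in for data.raw):

```python
import json

# Hypothetical newline-delimited JSON with a stray blank line,
# reproducing the "Expecting value: line 2 column 1" symptom.
raw = '{"a": 1}\n\n{"a": 2}\n'

records = []
for line in raw.splitlines():
    line = line.strip()
    if not line:              # a blank line is not a valid JSON value; skip it
        continue
    records.append(json.loads(line))

print(records)  # → [{'a': 1}, {'a': 2}]
```

For a real 200 GB file you would iterate over the open file object instead of a string, so only one line is in memory at a time.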

【Comments】:

  • What changes should be made to that JSON data?
  • Is data.raw a single JSON file, or a file with one JSON object per line? If the former, use json.load.
  • Your file is not valid JSON as a whole, though it does appear to contain a valid JSON text on each line. My advice: fix whatever generates this "JSON" (which, as is, is not JSON). Failing that, I suppose you could deserialize it line by line and accumulate the objects into a list or something similar.
  • .raw — from MATLAB?
  • Could you run more data.raw | head and show us your file's format?

Tags: python json load


【Solution 1】:

Below is some sample JSON data. It contains records for two people, but it could just as well be a million. The code below is one solution: it reads the file line by line, retrieves one person's data at a time, and returns it as a JSON object.

Data:

[
  {
    "Name" : "Joy",
    "Address" : "123 Main St",
    "Schools" : [
      "University of Chicago",
      "Purdue University"
    ],
    "Hobbies" : [
      {
        "Instrument" : "Guitar",
        "Level" : "Expert"
      },
      {
        "percussion" : "Drum",
        "Level" : "Professional"
      }
    ],
    "Status" : "Student",
    "id" : 111,
    "AltID" : "J111"
  },
  {
    "Name" : "Mary",
    "Address" : "452 Jubal St",
    "Schools" : [
      "University of Pensylvania",
      "Washington University"
    ],
    "Hobbies" : [
      {
        "Instrument" : "Violin",
        "Level" : "Expert"
      },
      {
        "percussion" : "Piano",
        "Level" : "Professional"
      }
    ],
    "Status" : "Employed",
    "id" : 112,
    "AltID" : "M112"
  }
]

Code:

import json

curly_idx = []
jstr = ""
first_curly_found = False
with open("C:\\Users\\Rajeshs\\PycharmProjects\\Project1\\data\\test.json", 'r') as fp:
    #Reading file line by line
    line = fp.readline()
    lnum = 0
    while line:
        for a in line:
            if a == '{':
                curly_idx.append(lnum)
                first_curly_found = True
            elif a == '}':
                curly_idx.pop()

        # when the right curly for every left curly is found,
        # it would mean that one complete data element was read
        if len(curly_idx) == 0 and first_curly_found:
            jstr = f'{jstr}{line}'
            jstr = jstr.rstrip()
            jstr = jstr.rstrip(',')  # drop a trailing comma between array elements
            print("------------")
            if len(jstr) > 10:
                print("making json")
                j = json.loads(jstr)
            print(jstr)
            jstr = ""
            line = fp.readline()
            lnum += 1
            continue

        if first_curly_found:
            jstr = f'{jstr}{line}'

        line = fp.readline()
        lnum += 1
        if lnum > 100:
            break
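Brace-counting by hand works for this data, but the standard library can do the splitting itself: json.JSONDecoder.raw_decode parses one JSON value from a string and reports where it ended. A sketch under the assumption that the whole array fits in one string (the inline sample and its keys are illustrative; a real 200 GB file would be read in chunks):

```python
import json

# Hypothetical stand-in for the file contents: an array of person records.
buf = '[{"id": 111, "Name": "Joy"}, {"id": 112, "Name": "Mary"}]'

decoder = json.JSONDecoder()
idx = buf.index('{')          # skip past the opening '[' of the array
people = []
while idx < len(buf):
    obj, end = decoder.raw_decode(buf, idx)   # parse one object, get its end offset
    people.append(obj)
    idx = end
    # advance past separators/whitespace to the next object, if any
    while idx < len(buf) and buf[idx] in ', \n\t\r':
        idx += 1
    if idx < len(buf) and buf[idx] == ']':
        break

print([p["Name"] for p in people])  # → ['Joy', 'Mary']
```

This avoids the pitfalls of counting braces by eye, such as `{` or `}` characters appearing inside string values.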

【Discussion】:

  • I found this answer very useful. I modified the code above to run under Linux.
【Solution 2】:

A good way to read a large JSON dataset is to use a generator (yield) in Python: 200 GB is far too large for your RAM if the JSON parser keeps the whole file in memory, whereas iterating step by step keeps memory usage low.

You can use an iterative JSON parser with a Pythonic interface: http://pypi.python.org/pypi/ijson/.

But here your file has a .raw extension, so it is not necessarily a JSON file.

To read it as raw binary you could do:

import numpy as np

content = np.fromfile("data.raw", dtype=np.int16, sep="")

But this solution may crash for big files.

If in fact the .raw file turns out to be a .csv file, you can create your reader like this:

import csv

def read_big_file(filename):
    # Python 3: csv files are opened in text mode with newline=""
    with open(filename, "r", newline="") as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            yield row

Or, for a plain text file:

def read_big_file(filename):
    with open(filename, "r") as _file:
         for line in _file:
             yield line

Use "rb" only if your file is binary.

Usage:

for line in read_big_file(filename):
    <treatment>
    <free memory after a size of chunk>
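A concrete version of the loop above might look like this (the field name and file contents are hypothetical, standing in for the real data): each yielded line is parsed as JSON, one field is kept, and only one line is ever in memory.

```python
import json
import os
import tempfile

def read_big_file(filename):
    # Generator as above: yields one line at a time, keeping memory bounded.
    with open(filename, "r") as _file:
        for line in _file:
            yield line

# Tiny temporary file standing in for data.raw:
fd, path = tempfile.mkstemp(suffix=".jsonl")
with os.fdopen(fd, "w") as out:
    out.write('{"MessageType": "A"}\n{"MessageType": "B"}\n')

types = set()
for line in read_big_file(path):
    if line.strip():                      # skip blank lines defensively
        types.add(json.loads(line)["MessageType"])
os.remove(path)

print(sorted(types))  # → ['A', 'B']
```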

If you post the first lines of your file, I can give a more precise answer.

【Discussion】:

  • The solution should include more details about how to use ijson.
【Solution 3】:

Here is some simple code to see which data is not valid JSON, and where it is:

import json

with open("data.raw", "r") as f:
    for i, line in enumerate(f):
        try:
            d = json.loads(line)
        except json.decoder.JSONDecodeError:
            print('Error on line', i + 1, ':\n', repr(line))
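Run against a small in-memory sample, this check flags exactly the kind of blank line that triggers "Expecting value" (the sample below is hypothetical, reproducing the symptom with a blank line between two records):

```python
import io
import json

# Hypothetical sample with a blank line between records, as in the big file.
f = io.StringIO('{"ok": 1}\n\n{"ok": 2}\n')

bad_lines = []
for i, line in enumerate(f):
    try:
        json.loads(line)
    except json.decoder.JSONDecodeError:
        bad_lines.append(i + 1)

print(bad_lines)  # → [2]
```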

【Discussion】:

  • Thanks @alex. I used this code and the result is strange: according to it, every even-numbered line has an error! But when I ran it on the first 2000 lines of my big file, it showed no errors at all… very confusing…
  • @Mina Can you show us one of those error messages? In particular, I'd like to see one of the failing lines.
  • You won't believe it, but that was exactly the point: the main large file had extra newlines between the lines, and that was the cause of the error messages! By the way, your suggestion was very helpful for finding the source of the error. Thank you.