【Question Title】: How to parse a BIG JSON file in Python
【Posted】: 2026-01-22 21:05:01
【Question】:

I am working with a very large dataset and have run into a problem I can't find an answer to anywhere. I am trying to parse JSON data; here is what I do on a slice of the full dataset:

import json

s = set()

with open("data.raw", "r") as f:

    for line in f:
        d = json.loads(line)

Confusingly, when I apply this code to my main data (about 200 GB), it raises the following error (it does not run out of memory):

    d = json.loads(line)
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Users\Sathyanarayanan\AppData\Local\Programs\Python\Python35-32\lib\json\decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)

type(f) = TextIOWrapper, if that helps… but the file object has the same type for the small dataset, which works fine…

Here are a few lines of my data so you can see the format:

{"MessageType": "SALES.CONTRACTS.SALESTATUSCHANGED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-0801743-2330650"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:57.9681547", "Source": 1}, "OldStatus": {"Status": 3, "AutoRemoveInfo": null}, "NewStatus": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 1193}, "Sku": {"Sku": "Con BM20"}, "Quantity": 1, "UnitPrice": {"amount": 11.92, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "", "Class": "", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-13T13:39:57", "PendingItems": null, "Kits": null, "Products": null, "OldSaleDate": "0001-01-01T00:00:00", "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALESHIPPINGINFOCHANGED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-0801743-2330650"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:57.9681547", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 1193}, "Sku": {"Sku": "Con BM20"}, "Quantity": 1, "UnitPrice": {"amount": 11.92, "currency": 840}}], "FulfilledItems": []}, "OldShippingInfo": {"Carrier": "", "Class": "", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "NewShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "0001-01-01T00:00:00", "PendingItems": null, "Kits": null, "Products": null, "OldSaleDate": "0001-01-01T00:00:00", "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALECREATED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-002-4851828-6514632"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:58.1402505", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 9223372036854775807}, "Sku": {"Sku": "NFL Blanket Seahawks"}, "Quantity": 1, "UnitPrice": {"amount": 22.99, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-13T15:51:12", "Kits": null, "Products": null, "AdditionalSaleInfo": null}}
{"MessageType": "SALES.CONTRACTS.SALECREATED", "Event": {"Id": {"Source": 1, "SourceId": "ZGA=-3-1-102-3824485-2270645"}, "RefInfo": {"TenantId": {"Id": "ZGA="}, "UserId": {"Id": "ZMKj"}, "SentUtc": "2013-01-14T20:17:58.3436109", "Source": 1}, "Status": {"Status": 4, "AutoRemoveInfo": null}, "Items": {"Items": [{"Id": {"Id": 9223372036854775807}, "Sku": {"Sku": "NFL CD Wallet Chargers"}, "Quantity": 1, "UnitPrice": {"amount": 12.99, "currency": 840}}], "FulfilledItems": []}, "ShippingInfo": {"Carrier": "USPS", "Class": "FIRST/RECTPARCEL", "Region": null, "Country": 0, "PostalCode": null, "Costs": null, "Charges": null}, "SaleDate": "2013-01-12T02:49:58", "Kits": null, "Products": null, "AdditionalSaleInfo": null}}

It is JSON: I have already parsed the first 2000 lines and they work fine. But when I try the same procedure on the large file, it reports an error starting from the first lines of the data.
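One common cause of "Expecting value: line 2 column 1" on newline-delimited data like this is a blank line between records, which is not a valid JSON value. A minimal sketch that tolerates blank lines (the inline sample string and its field names are hypothetical, standing in for data.raw):

```python
import json

# Hypothetical newline-delimited JSON with a stray blank line,
# reproducing the "Expecting value: line 2 column 1" symptom.
raw = '{"a": 1}\n\n{"a": 2}\n'

records = []
for line in raw.splitlines():
    line = line.strip()
    if not line:              # a blank line is not a valid JSON value; skip it
        continue
    records.append(json.loads(line))

print(records)  # → [{'a': 1}, {'a': 2}]
```

For a real 200 GB file you would iterate over the open file object instead of a string, so only one line is in memory at a time.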

【Comments】:

  • What changes should be made to that JSON data?
  • Is data.raw a single JSON file, or a file with one JSON object per line? If the former, use json.load.
  • Your file is not valid JSON as a whole, though it does appear to contain a valid JSON text on each line. My advice: fix whatever generates this "JSON" (which, as is, is not JSON). Failing that, I suppose you could deserialize it line by line and accumulate the objects into a list or something similar.
  • .raw — from MATLAB?
  • Could you run more data.raw | head and show us your file's format?

Tags: python json load


【Solution 1】:

Below is some sample JSON data. It contains records for two people, but it could just as well be a million. The code below is one solution: it reads the file line by line, retrieves one person's data at a time, and returns it as a JSON object.

Data:

[
  {
    "Name" : "Joy",
    "Address" : "123 Main St",
    "Schools" : [
      "University of Chicago",
      "Purdue University"
    ],
    "Hobbies" : [
      {
        "Instrument" : "Guitar",
        "Level" : "Expert"
      },
      {
        "percussion" : "Drum",
        "Level" : "Professional"
      }
    ],
    "Status" : "Student",
    "id" : 111,
    "AltID" : "J111"
  },
  {
    "Name" : "Mary",
    "Address" : "452 Jubal St",
    "Schools" : [
      "University of Pensylvania",
      "Washington University"
    ],
    "Hobbies" : [
      {
        "Instrument" : "Violin",
        "Level" : "Expert"
      },
      {
        "percussion" : "Piano",
        "Level" : "Professional"
      }
    ],
    "Status" : "Employed",
    "id" : 112,
    "AltID" : "M112"
  }
]

Code:

import json

curly_idx = []
jstr = ""
first_curly_found = False
with open("C:\\Users\\Rajeshs\\PycharmProjects\\Project1\\data\\test.json", 'r') as fp:
    #Reading file line by line
    line = fp.readline()
    lnum = 0
    while line:
        for a in line:
            if a == '{':
                curly_idx.append(lnum)
                first_curly_found = True
            elif a == '}':
                curly_idx.pop()

        # when the right curly for every left curly is found,
        # it would mean that one complete data element was read
        if len(curly_idx) == 0 and first_curly_found:
            jstr = f'{jstr}{line}'
            jstr = jstr.rstrip()
            jstr = jstr.rstrip(',')  # drop a trailing comma between array elements
            print("------------")
            if len(jstr) > 10:
                print("making json")
                j = json.loads(jstr)
            print(jstr)
            jstr = ""
            line = fp.readline()
            lnum += 1
            continue

        if first_curly_found:
            jstr = f'{jstr}{line}'

        line = fp.readline()
        lnum += 1
        if lnum > 100:
            break
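Brace-counting by hand works for this data, but the standard library can do the splitting itself: json.JSONDecoder.raw_decode parses one JSON value from a string and reports where it ended. A sketch under the assumption that the whole array fits in one string (the inline sample and its keys are illustrative; a real 200 GB file would be read in chunks):

```python
import json

# Hypothetical stand-in for the file contents: an array of person records.
buf = '[{"id": 111, "Name": "Joy"}, {"id": 112, "Name": "Mary"}]'

decoder = json.JSONDecoder()
idx = buf.index('{')          # skip past the opening '[' of the array
people = []
while idx < len(buf):
    obj, end = decoder.raw_decode(buf, idx)   # parse one object, get its end offset
    people.append(obj)
    idx = end
    # advance past separators/whitespace to the next object, if any
    while idx < len(buf) and buf[idx] in ', \n\t\r':
        idx += 1
    if idx < len(buf) and buf[idx] == ']':
        break

print([p["Name"] for p in people])  # → ['Joy', 'Mary']
```

This avoids the pitfalls of counting braces by eye, such as `{` or `}` characters appearing inside string values.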

【Discussion】:

  • I found this answer very useful. I modified the code above to run under Linux.
【Solution 2】:

A good way to read a large JSON dataset is to use a generator (yield) in Python: 200 GB is far too large for your RAM if the JSON parser keeps the whole file in memory, whereas iterating step by step keeps memory usage low.

You can use an iterative JSON parser with a Pythonic interface: http://pypi.python.org/pypi/ijson/.

But here your file has a .raw extension, so it is not necessarily a JSON file.

To read it as raw binary you could do:

import numpy as np

content = np.fromfile("data.raw", dtype=np.int16, sep="")

But this solution may crash for big files.

If in fact the .raw file turns out to be a .csv file, you can create your reader like this:

import csv

def read_big_file(filename):
    # Python 3: csv files are opened in text mode with newline=""
    with open(filename, "r", newline="") as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
            yield row

Or, for a plain text file:

def read_big_file(filename):
    with open(filename, "r") as _file:
         for line in _file:
             yield line

Use "rb" only if your file is binary.

Usage:

for line in read_big_file(filename):
    <treatment>
    <free memory after a size of chunk>
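A concrete version of the loop above might look like this (the field name and file contents are hypothetical, standing in for the real data): each yielded line is parsed as JSON, one field is kept, and only one line is ever in memory.

```python
import json
import os
import tempfile

def read_big_file(filename):
    # Generator as above: yields one line at a time, keeping memory bounded.
    with open(filename, "r") as _file:
        for line in _file:
            yield line

# Tiny temporary file standing in for data.raw:
fd, path = tempfile.mkstemp(suffix=".jsonl")
with os.fdopen(fd, "w") as out:
    out.write('{"MessageType": "A"}\n{"MessageType": "B"}\n')

types = set()
for line in read_big_file(path):
    if line.strip():                      # skip blank lines defensively
        types.add(json.loads(line)["MessageType"])
os.remove(path)

print(sorted(types))  # → ['A', 'B']
```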

If you post the first lines of your file, I can give a more precise answer.

【Discussion】:

  • The solution should include more details about how to use ijson.
【Solution 3】:

Here is some simple code to see which data is not valid JSON, and where it is:

import json

with open("data.raw", "r") as f:
    for i, line in enumerate(f):
        try:
            d = json.loads(line)
        except json.decoder.JSONDecodeError:
            print('Error on line', i + 1, ':\n', repr(line))
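Run against a small in-memory sample, this check flags exactly the kind of blank line that triggers "Expecting value" (the sample below is hypothetical, reproducing the symptom with a blank line between two records):

```python
import io
import json

# Hypothetical sample with a blank line between records, as in the big file.
f = io.StringIO('{"ok": 1}\n\n{"ok": 2}\n')

bad_lines = []
for i, line in enumerate(f):
    try:
        json.loads(line)
    except json.decoder.JSONDecodeError:
        bad_lines.append(i + 1)

print(bad_lines)  # → [2]
```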

【Discussion】:

  • Thanks @alex. I used this code and the result is strange: according to it, every even-numbered line has an error! But when I ran it on the first 2000 lines of my big file, it showed no errors at all… very confusing…
  • @Mina Can you show us one of those error messages? In particular, I'd like to see one of the failing lines.
  • You won't believe it, but that was exactly the point: the main large file had extra newlines between the lines, and that was the cause of the error messages! By the way, your suggestion was very helpful for finding the source of the error. Thank you.