【Posted】: 2020-08-18 10:18:21
【Question】:
I have a JSON file that is 5 GB in size. I want to load it and run some EDA on it to find out where the relevant information is.
I tried:
import json
import pprint
json_fn = 'abc.ndjson'
data = json.load(open(json_fn, 'rb'))
pprint.pprint(data, depth=2)
But this simply crashes with:
Process finished with exit code 137 (interrupted by signal 9: SIGKILL)
I also tried:
import ijson
with open(json_fn) as f:
    items = ijson.items(f, 'item', multiple_values=True) # "multiple values" needed as it crashes otherwise with a "trailing garbage parse error" (https://stackoverflow.com/questions/59346164/ijson-fails-with-trailing-garbage-parse-error)
    print('Data loaded - no processing ...')
    print("---items---")
    print(items)
    for item in items:
        print("---item---")
        print(item)
But this only prints:
Data loaded, now importing
---items---
<_yajl2.items object at 0x7f436de97440>
Process finished with exit code 0
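The ndjson file contains valid ASCII characters (checked with vi), but the lines are very long, so it is hard to make sense of it in a text editor.
The file starts as follows: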
{"visitId":257057,"staticFeatures":[{"type":"CODES","value":"9910,51881,42833,486,4280,42731,2384,V5861,9847,3962,49320,3558,2720,4019,99092"},{"type":"visitID","value":"357057"},{"type":"VISITOR_ID","value":"68824"}, {"type":"ADMISSION_ID","value":"788457"},{"type":"AGE","value":"34"}, ...
What am I doing wrong, and how can I process this file?
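For illustration, here is a minimal sketch of one way such a file could be explored. It assumes that every line is a standalone JSON object, as the .ndjson extension and the sample line above suggest. Exit code 137 (SIGKILL) often means the process was killed for using too much memory, which would fit json.load trying to parse all 5 GB at once. The 'item' prefix in ijson selects the elements of a top-level array, and this file appears to have objects at the top level, which may explain why that loop printed nothing. The helper names iter_ndjson and summarize_keys are hypothetical, introduced only for this example.

```python
import json
from collections import Counter
from itertools import islice


def iter_ndjson(path):
    """Yield one parsed record per non-empty line, without loading the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                # Assumption: each line is one complete JSON document.
                yield json.loads(line)


def summarize_keys(records, limit=1000):
    """Count top-level keys over the first `limit` records (hypothetical EDA helper).

    Assumes each record is a JSON object, i.e. a Python dict.
    """
    counts = Counter()
    for record in islice(records, limit):
        counts.update(record.keys())
    return counts
```

Calling summarize_keys(iter_ndjson('abc.ndjson')) would then give a rough picture of which top-level fields occur in the first 1000 records, while only one line is held in memory at a time.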
【Comments】: