【Posted】: 2021-10-28 16:42:41
【Question】:
I am running the following code:
import pyarrow
import pyarrow.parquet as pq
import pandas as pd
import json
parquet_schema = schema = pyarrow.schema(
[('id', pyarrow.string()),
('firstname', pyarrow.string()),
('lastname', pyarrow.string())])
user_json = '{"id" : "id1", "firstname": "John", "lastname":"Doe"}'
writer = pq.ParquetWriter('user.parquet', schema=parquet_schema)
df = pd.DataFrame.from_dict(json.loads(user_json))
table = pyarrow.Table.from_pandas(df)
print(table.schema)
writer.write_table(table)
writer.close()
But I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-14-a427a4cdd392> in <module>()
15 writer = pq.ParquetWriter('user.parquet', schema=parquet_schema)
16
---> 17 df = pd.DataFrame.from_dict(json.loads(user_json))
18 table = pyarrow.Table.from_pandas(df)
19 print(table.schema)
4 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in extract_index(data)
385
386 if not indexes and not raw_lengths:
--> 387 raise ValueError("If using all scalar values, you must pass an index")
388
389 if have_series:
ValueError: If using all scalar values, you must pass an index
I followed the documentation and tutorials, but I am missing something.
【Comments】:
-
Is your goal to create a single-row parquet file?
-
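pyarrow and pandas work on batches of records rather than one record at a time. If you only have a single record, put it in a list:
pd.DataFrame.from_dict([json.loads(user_json)]). That will work, but it will not be very efficient, and it defeats the purpose of pyarrow/pandas.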
Tags: python json pandas dataframe pyarrow