【Question title】: Write JSON to parquet file using pyarrow
【Posted】: 2021-10-28 16:42:41
【Question】:

I am running the following code:

import pyarrow
import pyarrow.parquet as pq
import pandas as pd
import json

parquet_schema = schema = pyarrow.schema(
    [('id', pyarrow.string()),
     ('firstname', pyarrow.string()),
     ('lastname', pyarrow.string())])



user_json = '{"id" : "id1", "firstname": "John", "lastname":"Doe"}'

writer = pq.ParquetWriter('user.parquet', schema=parquet_schema)

df = pd.DataFrame.from_dict(json.loads(user_json))
table = pyarrow.Table.from_pandas(df)
print(table.schema)
writer.write_table(table)
writer.close()

But I get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-14-a427a4cdd392> in <module>()
     15 writer = pq.ParquetWriter('user.parquet', schema=parquet_schema)
     16 
---> 17 df = pd.DataFrame.from_dict(json.loads(user_json))
     18 table = pyarrow.Table.from_pandas(df)
     19 print(table.schema)

4 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/internals/construction.py in extract_index(data)
    385 
    386         if not indexes and not raw_lengths:
--> 387             raise ValueError("If using all scalar values, you must pass an index")
    388 
    389         if have_series:

ValueError: If using all scalar values, you must pass an index

I followed the documentation and tutorials, but I am missing something.

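【Comments】:

  • Is your goal to create a single-row parquet file?
  • pyarrow and pandas work on batches of records rather than record by record. If you only have one record, put it in a list: pd.DataFrame.from_dict([json.loads(user_json)]). It will work, but it won't be very efficient and it defeats the purpose of pyarrow/pandas.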

Tags: python json pandas dataframe pyarrow


【Solution 1】:

You have three options:

  1. Stop using scalar values: turn each value of the dict (parsed from the JSON string) into a list.
import pyarrow
import pyarrow.parquet as pq
import pandas as pd
import json


user_json = '{"id" : "id1", "firstname": "John", "lastname":"Doe"}'
user_dict = json.loads(user_json)

# Make all values in the dict a list
for key, value in user_dict.items():
    user_dict[key] = [value]
df = pd.DataFrame(user_dict)

df.to_parquet('myfile.parquet')

  2. Just pass an index when loading scalar values (a scalar being e.g. 2 rather than [2]).
import pyarrow
import pyarrow.parquet as pq
import pandas as pd
import json


user_json = '{"id" : "id1", "firstname": "John", "lastname":"Doe"}'
user_dict = json.loads(user_json)

# Pass an index instead
df = pd.DataFrame(user_dict, index=[0])
df.to_parquet('myfile.parquet')

  3. Use `DataFrame.from_records`.
import pyarrow
import pyarrow.parquet as pq
import pandas as pd
import json


user_json = '{"id" : "id1", "firstname": "John", "lastname":"Doe"}'
user_dict = json.loads(user_json)

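# Use `DataFrame.from_records`, wrapping the dict in a list so it is
# treated as one record; a bare dict of scalars raises the same ValueError
df = pd.DataFrame.from_records([user_dict])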
df.to_parquet('myfile.parquet')

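The third is the simplest, but I would probably get into the habit of not passing scalar values to a DataFrame and go with the first option.

For more on the scalar issue, see: Constructing pandas DataFrame from values in variables gives "ValueError: If using all scalar values, you must pass an index"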
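
As a quick sanity check, the three options can be compared side by side. This is a sketch rather than part of the original answer; it assumes the dict is wrapped in a list for `from_records`, since that method expects a sequence of records.

```python
import json

import pandas as pd

user_json = '{"id" : "id1", "firstname": "John", "lastname":"Doe"}'
user_dict = json.loads(user_json)

# Option 1: every value becomes a one-element list (one row per element).
df_lists = pd.DataFrame({key: [value] for key, value in user_dict.items()})

# Option 2: keep the scalars and supply an explicit one-element index.
df_index = pd.DataFrame(user_dict, index=[0])

# Option 3: treat the dict as a single record inside a list.
df_records = pd.DataFrame.from_records([user_dict])

# Each frame should hold the same single row.
for frame in (df_lists, df_index, df_records):
    print(frame.to_dict("records"))
```

Any of the three frames can then be passed to `to_parquet` or `pyarrow.Table.from_pandas`.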

【Comments】:

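    【Solution 2】:

    Given that you are trying to work with columnar data, the libraries you are using expect you to pass the values (rows) for each column.

    I guess you won't be writing single-row parquet files in real life. In that case you can group the values by column, which works with both pandas and Arrow.

    You can also avoid pandas entirely and go through the pyarrow.Table.from_pydict method: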
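
"Grouping the values by column" can be done in plain Python before anything is handed to pandas or Arrow. A minimal sketch, assuming the records arrive as one JSON document each and share the same keys; `records_json` and `rows_to_columns` are hypothetical names used only for illustration.

```python
import json

# Hypothetical input: one JSON document per record.
records_json = [
    '{"id": "id1", "firstname": "John", "lastname": "Doe"}',
    '{"id": "id2", "firstname": "Jack", "lastname": "Ryan"}',
]


def rows_to_columns(rows):
    """Turn a list of row dicts into a dict of column lists.

    Column names are taken from the first row; a key missing from a
    later row becomes None in that column.
    """
    keys = list(rows[0]) if rows else []
    return {key: [row.get(key) for row in rows] for key in keys}


rows = [json.loads(text) for text in records_json]
users = rows_to_columns(rows)
print(users)
```

The resulting dict of lists has the shape that both `pd.DataFrame(...)` and `pyarrow.Table.from_pydict(...)` accept.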

    import pyarrow
    import pyarrow.parquet as pq
    
    users = {"id" : ["id1", "id2"], 
             "firstname": ["John", "Jack"], 
             "lastname": ["Doe", "Ryan"]}
    
    table = pyarrow.Table.from_pydict(users)
    print(table.schema)
    
    with pq.ParquetWriter('user.parquet', schema=table.schema) as writer:
        writer.write_table(table)
    

    https://arrow.apache.org/cookbook/py/create.html#create-table-from-plain-types
    https://arrow.apache.org/cookbook/py/io.html#write-a-parquet-file

    【Comments】:
