How to transform JSON Lines to Parquet with Python? Answers

【Question title】: How to transform Json lines to parquet with Python?
【Posted】: 2019-11-11 17:23:36
【Question description】:

I need to do this in Python in a simple way. I am trying to use Pandas, but I am just getting started and I find it very difficult.

Right now I am trying json2parquet:

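# Imports and client setup are assumed; the original snippet did not show them.
import boto3
from json2parquet import convert_json

s3 = boto3.client('s3')

try:
    input_filename = '/tmp/source_file'
    # Download the JSON Lines object from S3 and save it locally.
    source_file = s3.get_object(Bucket="myBucket", Key="myJsonLinesFile")
    datajson = source_file['Body'].read()
    with open(input_filename, 'wb') as f:
        f.write(datajson)
    # Convert the local JSON Lines file to Parquet.
    convert_json(input_filename, '/tmp/final.parquet')
except Exception as e:
    print(e)
    raise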

But I get the following error: "errorMessage": "cannot mix list and non-list, non-null values", "errorType": "ArrowInvalid",
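
One plausible reading of this error (an assumption, since the thread does not show the data) is that some field holds a list in some records and a scalar or object in others, so Arrow cannot settle on a single column type. The sketch below uses only the standard library to list such fields in a JSON Lines input. The helper name `find_mixed_list_fields` and the sample records are hypothetical, and it assumes every line is a JSON object.

```python
import json
from collections import defaultdict


def find_mixed_list_fields(json_lines):
    """Return field names that are lists in some records and non-lists in others."""
    kinds = defaultdict(set)
    for line in json_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # assumes each line is a JSON object
        for key, value in record.items():
            if value is None:
                continue  # nulls are compatible with either kind
            kinds[key].add("list" if isinstance(value, list) else "non-list")
    return sorted(key for key, seen in kinds.items() if len(seen) > 1)


# Hypothetical sample: "tags" is a list in one record and a string in another.
sample = [
    '{"id": 1, "tags": ["a", "b"]}',
    '{"id": 2, "tags": "c"}',
    '{"id": 3, "tags": null}',
]
mixed = find_mixed_list_fields(sample)
print(mixed)
```

Running the same helper over the lines of the downloaded file (for example `open('/tmp/source_file')`) would show which fields, if any, need to be made uniform before conversion.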

【Question discussion】:

Tags: python json pandas parquet

【Solution 1】:

If you are using pandas 0.25.3, you can install either the fastparquet or the pyarrow library and run the following code:

>>> import pandas as pd
>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
>>> df.to_parquet('df.parquet.gzip', compression='gzip')

Here are the links:

fastparquet - https://pypi.org/project/fastparquet/
pyarrow - https://arrow.apache.org/docs/python/install.html#using-pip

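【Discussion】:

Hi jjayadeep, my code has to work with files that do not share the same schema, and I don't know how many columns each file is going to have. Thanks for your answer, I will read the links you shared.
Do you want to write different schemas into the same parquet file?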
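This code works:
    source_file = s3.get_object(Bucket=source_bucket, Key=key)
    datajson = source_file['Body'].read()
    con = pd.read_json(datajson, lines=True)
    con = con.astype(str)
    con.to_parquet(tmp_file, compression=None)
But when I run a query in Athena, I find that one field in the parquet file is INT64 and I get the following error: HIVE_BAD_DATA: Field xxxxxx's type INT64 in parquet is incompatible with type string defined in table schema
Why doesn't con.astype(str) cast this field? Or is the problem in to_parquet, which infers the data types? If that is the case, how can I fix it?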
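
One way to narrow this down is to check locally what `astype(str)` produces before anything is written. The sketch below is only a small check on hypothetical sample data, not a diagnosis of the Athena table. It reads two JSON Lines records, casts the frame to strings, and confirms that the values of the numeric column are Python strings afterwards.

```python
import io

import pandas as pd

# Hypothetical sample with an integer field, standing in for the real file.
raw = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'
con = pd.read_json(io.StringIO(raw), lines=True)
print(con["id"].tolist())  # integers, as inferred by read_json

# astype returns a new frame, so the result must be assigned.
con_str = con.astype(str)
id_values = con_str["id"].tolist()
print(id_values)

# True when every value in the column is a Python string.
all_strings = all(isinstance(v, str) for v in id_values)
print(all_strings)
```

If the same check passes on the real data, one possibility (an assumption that the thread does not confirm) is that the INT64 column lives in a different file under the same Athena table location, for example one written before the cast was added. With pyarrow installed, `pyarrow.parquet.read_schema(path)` can show the column types actually stored in a given file.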