Converting .parquet file to CSV using PyArrow

【Question title】: Converting .parquet file to CSV using Pyarrow
【Posted】: 2017-05-05 14:16:13
【Question description】:

I have a .parquet file and I am using PyArrow. I read the .parquet file into a table with the following code:

import pyarrow.parquet as pq
import pandas as pd
filepath = "xxx"  # This contains the exact location of the file on the server
from pandas import Series, DataFrame
table = pq.read_table(filepath)

Running table.shape returns (39014 rows, 19 columns).

The table's schema is:

col1: int64 not null
col2: string not null
col3: string not null
col4: int64 not null
col5: string not null
col6: string not null
col7: int64 not null
col8: int64 not null
col9: string not null
col10: string not null
col11: string not null
col12: string not null
col13: string not null
col14: string not null
col15: string not null
col16: int64 not null
col17: int64 not null
col18: int64 not null
col19: string not null

Running p = table.to_pandas() raises the following error:

ImportError: cannot import name RangeIndex

How can I convert this Parquet file to a DataFrame and then to CSV? Please help. Thanks.

【Comments】:

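Which versions of pyarrow and pandas are you using? They may be incompatible. pandas released a new version in the last few days, and a new PyArrow release is coming as well. For now, upgrading or downgrading your pandas installation may help until the new pyarrow release is out.
Try from pandas import RangeIndex and update your question with the output.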
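The two checks suggested in the comments can be sketched as a short script. This is a minimal sketch using only the standard library plus the import being tested; the helper names `installed_versions` and `can_import_rangeindex` are made up for illustration and are not part of pandas or pyarrow.

```python
from importlib import metadata


def installed_versions(packages=("pandas", "pyarrow")):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def can_import_rangeindex():
    """Return True if `from pandas import RangeIndex` succeeds."""
    try:
        from pandas import RangeIndex  # noqa: F401
    except ImportError:
        # Also covers the case where pandas itself is missing.
        return False
    return True


print(installed_versions())
print("RangeIndex importable:", can_import_rangeindex())
```

If the import check fails while pandas is installed, that points to a version mismatch between the two packages rather than to a problem with the Parquet file itself.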

Tags: python pandas parquet bigdata

【Solution 1】:

Try the following:

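import pyarrow.parquet as pq

def read_pyarrow(path, use_threads=False):
    # Read the Parquet file into an Arrow Table, then convert it to a pandas DataFrame.
    # Older pyarrow releases took nthreads=1 here; current ones use use_threads instead.
    return pq.read_table(path, use_threads=use_threads).to_pandas()

path = './test.parquet'
df1 = read_pyarrow(path)

# Write the DataFrame out as pipe-delimited text without the index column.
# pandas older than 1.5 spells this keyword line_terminator.
df1.to_csv(
    './test.csv',
    sep='|',
    index=False,
    mode='w',
    lineterminator='\n',
    encoding='utf-8')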

【Discussion】:

Technically, if the delimiter is "|" then it isn't CSV, but the principle is the same :-)