【Posted】: 2021-08-06 23:22:59
【Question】:
When I save the same table to Parquet with Pandas and with Dask, Pandas produces a 4 KB file while Dask produces a 39 MB one.
Create the dataframe:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import dask.dataframe as dd
n = int(1e7)
df = pd.DataFrame({'col': ['a'*64]*n})
Save it in different ways:
# Pandas: 4k
df.to_parquet('example-pandas.parquet')
# PyArrow: 4k
pq.write_table(pa.Table.from_pandas(df), 'example-pyarrow.parquet')
# Dask: 39M
dd.from_pandas(df, npartitions=1).to_parquet('example-dask.parquet', compression='snappy')
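Note that pandas writes a single file here, while Dask's to_parquet writes a directory (part.0.parquet plus possible metadata files), so a fair size comparison should sum everything under the directory. A minimal stdlib sketch of such a check (path_size is a hypothetical helper name, not part of any library):

```python
import os
import tempfile

def path_size(path):
    """Total bytes on disk: a single file, or all files under a directory."""
    if os.path.isfile(path):
        return os.path.getsize(path)
    return sum(
        os.path.getsize(os.path.join(root, name))
        for root, _dirs, names in os.walk(path)
        for name in names
    )

# Demo on a throwaway directory with two small files.
d = tempfile.mkdtemp()
with open(os.path.join(d, 'a.bin'), 'wb') as f:
    f.write(b'x' * 100)
with open(os.path.join(d, 'b.bin'), 'wb') as f:
    f.write(b'x' * 50)
print(path_size(d))  # → 150
```

Applied to 'example-pandas.parquet' and 'example-dask.parquet', this compares like with like; the 4 KB vs. 39 MB gap reported above persists either way.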
At first I thought Dask was not using dictionary and run-length encoding, but that doesn't seem to be the case. I'm not sure I'm interpreting the metadata correctly, but at the very least it looks identical:
>>> pq.read_metadata('example-pandas.parquet').row_group(0).column(0)
<pyarrow._parquet.ColumnChunkMetaData object at 0x7fbee7d1a770>
file_offset: 548
file_path:
physical_type: BYTE_ARRAY
num_values: 10000000
path_in_schema: col
is_stats_set: True
statistics:
<pyarrow._parquet.Statistics object at 0x7fbee7d2cc70>
has_min_max: True
min: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
max: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
null_count: 0
distinct_count: 0
num_values: 10000000
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY
encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
has_dictionary_page: True
dictionary_page_offset: 4
data_page_offset: 29
total_compressed_size: 544
total_uncompressed_size: 596
>>> pq.read_metadata('example-dask.parquet/part.0.parquet').row_group(0).column(0)
<pyarrow._parquet.ColumnChunkMetaData object at 0x7fbee7d3d180>
file_offset: 548
file_path:
physical_type: BYTE_ARRAY
num_values: 10000000
path_in_schema: col
is_stats_set: True
statistics:
<pyarrow._parquet.Statistics object at 0x7fbee7d3d1d0>
has_min_max: True
min: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
max: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
null_count: 0
distinct_count: 0
num_values: 10000000
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: SNAPPY
encodings: ('PLAIN_DICTIONARY', 'PLAIN', 'RLE')
has_dictionary_page: True
dictionary_page_offset: 4
data_page_offset: 29
total_compressed_size: 544
total_uncompressed_size: 596
Why is the Dask-created Parquet file so large? Alternatively, how can I inspect it further to find out what might be wrong?
【Discussion】:
-
Maybe raise an issue with the dask developers?
-
@drum Good point: github.com/dask/dask/issues/8009
Tags: python pandas dask parquet pyarrow