Based on the data below, I'd say gzip wins everywhere except scenarios such as streaming, where write latency matters.
Keep in mind that speed is essentially compute cost. But cloud compute is a one-time cost, while cloud storage is a recurring one, so the tradeoff depends on how long the data is retained.
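That one-time-vs-recurring tradeoff can be put into numbers. A minimal back-of-envelope sketch — the prices are hypothetical placeholders, not real cloud rates, so substitute your provider's:

```python
# Back-of-envelope break-even for compression choice. The prices below are
# hypothetical placeholders, NOT real cloud rates -- substitute your own.

def breakeven_months(extra_cpu_seconds, bytes_saved,
                     cpu_price_per_hour=0.05,            # hypothetical $/vCPU-hour
                     storage_price_per_gb_month=0.023):  # hypothetical $/GB-month
    """Months of retention before gzip's smaller size pays for its slower write."""
    one_time_cost = extra_cpu_seconds / 3600 * cpu_price_per_hour
    monthly_saving = bytes_saved / 1e9 * storage_price_per_gb_month
    return one_time_cost / monthly_saving

# Plugging in the large-file benchmark below (7.65 s vs 1.62 s write,
# 35484122 vs 17269656 bytes):
months = breakeven_months(7.65 - 1.62, 35484122 - 17269656)
print(f"gzip pays for itself after ~{months:.2f} months of storage")
```

At these placeholder rates the extra CPU time is repaid within the first month of storage; the longer the retention, the more the balance tilts toward gzip.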
Let's measure speed and file size in Python for a large and a small parquet file.
Results (large file, 117 MB):
+--------------+----------+----------+--------------------------+
|              | snappy   | gzip     | (gzip-snappy)/snappy*100 |
+--------------+----------+----------+--------------------------+
| write        | 1.62 s   | 7.65 s   | 372% slower              |
+--------------+----------+----------+--------------------------+
| size (bytes) | 35484122 | 17269656 | 51% smaller              |
+--------------+----------+----------+--------------------------+
| read         | 973 ms   | 1140 ms  | 17% slower               |
+--------------+----------+----------+--------------------------+
Results (small file, 4 KB, iris dataset):
+--------------+---------+---------+--------------------------+
|              | snappy  | gzip    | (gzip-snappy)/snappy*100 |
+--------------+---------+---------+--------------------------+
| write        | 1.56 ms | 2.09 ms | 34% slower               |
+--------------+---------+---------+--------------------------+
| size (bytes) | 6990    | 6647    | 4.9% smaller             |
+--------------+---------+---------+--------------------------+
| read         | 3.22 ms | 3.44 ms | 6.8% slower              |
+--------------+---------+---------+--------------------------+
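The comparison column in both tables follows directly from the raw measurements:

```python
# Derive the comparison column from the raw measurements in the tables above.
def pct(gzip_val, snappy_val):
    """(gzip - snappy) / snappy * 100: positive = gzip worse, negative = gzip better."""
    return (gzip_val - snappy_val) / snappy_val * 100

print(f"large write: {pct(7.65, 1.62):+.0f}%")          # +372% (slower)
print(f"large size:  {pct(17269656, 35484122):+.0f}%")  # -51% (smaller)
print(f"large read:  {pct(1140, 973):+.0f}%")           # +17% (slower)
```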
small_file.ipynb
import os
import pyarrow
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(
    data=np.c_[iris['data'], iris['target']],
    columns=iris['feature_names'] + ['target'],
)
# ========= WRITE =========
%timeit df.to_parquet(path='iris.parquet.snappy', compression='snappy', engine='pyarrow', index=True)
# 1.56 ms
%timeit df.to_parquet(path='iris.parquet.gzip', compression='gzip', engine='pyarrow', index=True)
# 2.09 ms
# ========= SIZE =========
os.stat('iris.parquet.snappy').st_size
# 6990
os.stat('iris.parquet.gzip').st_size
# 6647
# ========= READ =========
%timeit pd.read_parquet(path='iris.parquet.snappy', engine='pyarrow')
# 3.22 ms
%timeit pd.read_parquet(path='iris.parquet.gzip', engine='pyarrow')
# 3.44 ms
large_file.ipynb
import os
import pyarrow
import pandas as pd
df = pd.read_csv('file.csv')
# ========= WRITE =========
%timeit df.to_parquet(path='file.parquet.snappy', compression='snappy', engine='pyarrow', index=True)
# 1.62 s
%timeit df.to_parquet(path='file.parquet.gzip', compression='gzip', engine='pyarrow', index=True)
# 7.65 s
# ========= SIZE =========
os.stat('file.parquet.snappy').st_size
# 35484122
os.stat('file.parquet.gzip').st_size
# 17269656
# ========= READ =========
%timeit pd.read_parquet(path='file.parquet.snappy', engine='pyarrow')
# 973 ms
%timeit pd.read_parquet(path='file.parquet.gzip', engine='pyarrow')
# 1.14 s