【Posted】: 2021-11-02 21:40:32
【Question】:
I am using the following code to read Parquet files from S3. Next, I want to iterate over the data in chunks. How can I do that?
import s3fs
import fastparquet as fp
s3 = s3fs.S3FileSystem()
fs = s3fs.core.S3FileSystem()
bucket, path = 'mybucket', 'mypath'
root_dir_path = f'{bucket}/{path}'
s3_path = f"{root_dir_path}/*.parquet"
all_paths_from_s3 = fs.glob(path=s3_path)
fp_obj = fp.ParquetFile(all_paths_from_s3, open_with=s3.open, root=root_dir_path)
df = fp_obj.to_pandas()
One approach is to use a generator:
def chunks(df, chunksize):
    for i in range(0, len(df), chunksize):
        yield df[i:i + chunksize]

for chunk in chunks(df, 1000):
    # dummy code to transform & operate on chunk
    print(len(chunk))
    # dummy code ends
Is there a more space- and time-efficient way to do this?
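One possible direction, offered as a sketch rather than a definitive answer: instead of calling to_pandas() on the whole dataset and slicing afterwards, fastparquet's ParquetFile.iter_row_groups() can be used to load one row group at a time and slice each of those. The helper name iter_chunks below is hypothetical, and it assumes the object passed in behaves like the fp_obj from the code above.

```python
def iter_chunks(parquet_file, chunksize):
    """Yield chunks of at most `chunksize` rows, one row group at a time.

    Assumption: `parquet_file` is a fastparquet.ParquetFile (such as fp_obj
    above), whose iter_row_groups() yields one pandas DataFrame per row group.
    """
    for row_group_df in parquet_file.iter_row_groups():
        for start in range(0, len(row_group_df), chunksize):
            # Positional row slice, the same slicing the generator above uses.
            yield row_group_df[start:start + chunksize]


# Usage with the objects from the question (needs S3 access, so left commented):
# for chunk in iter_chunks(fp_obj, 1000):
#     print(len(chunk))
```

With this approach, peak memory should be bounded by the largest row group rather than by the full dataset. Chunks do not span row-group boundaries, so the last chunk of each row group may be shorter than chunksize.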
【Discussion】:
Tags: python amazon-web-services amazon-s3 parquet fastparquet