Unzip a large file on AWS S3 using Lambda functions

【Question title】: Unzip a large file on AWS S3 using Lambda functions
【Posted】: 2021-03-12 12:24:16
【Question description】:

I have a large file of about 6 GB. When the file is uploaded to an S3 bucket, an AWS Lambda trigger unzips it using Python and Boto3, but I get a memory error when the file is read into a buffer with BytesIO.

# zip file is in output dir
if '-output' in file.key:
    # get base path other the zip file name
    save_base_path = file.key.split('//')[0]
    # starting unzip process
    zip_obj = s3_resource.Object(bucket_name=source_bucket, key=file.key)
    buffer = BytesIO(zip_obj.get()["Body"].read())
    
    z = zipfile.ZipFile(buffer)
    print(f'Unziping....')
    for filename in z.namelist():
        file_info = z.getinfo(filename)
        try:
            response = s3_resource.meta.client.upload_fileobj(
                z.open(filename),
                Bucket=target_bucket,
                Key=f'{save_base_path}/{filename}'
            )
        except Exception as e:
            print(e)
    print('unziping process completed')        
    # deleting zip file after unzip
    s3_resource.Object(source_bucket, file.key).delete()
    my_bucket.delete_object()
    print("iteration completed")
    
    
else:
    print('Zip file invalid position')
    s3_resource.Object(source_bucket, file.key).delete
    print(f'{file.key} deleted...')

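Problem 1

I get a memory error when I read the bytes.
I have set the memory to 10240 MB (10 GB) in the general configuration of the AWS Lambda function.

Problem 2

I want to delete the object from S3. The code runs without raising any error, but the file is not deleted.

Is there any way to solve my unzip problem?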

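【Comments】:

This may be a silly question, but is the file 6 GB compressed or uncompressed? If it is 6 GB compressed, you may not be able to unzip it from Lambda.
Also, you need enough RAM to hold both the zip file and its unzipped contents. If the zip file is 6 GB and contains a single file, you need at least 12 GB of RAM, probably more.
It is a 6 GB zip. I am new to AWS, so any solution I could look into would help me a lot.
You could run it on EC2. Launch, run the script, terminate.

Tags: amazon-web-services amazon-s3 aws-lambda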

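【Solution 1】:

You can wrap the reads of the file in a small wrapper so that the whole zip file never has to be downloaded from S3. From there, each extracted file can be uploaded straight back to S3 without holding the entire contents in RAM: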

import zipfile

import boto3


# Download a zip file from S3 and upload its unzipped contents back
# to S3
def s3_zip_to_s3(source_bucket, source_key, dest_bucket, dest_prefix):
    s3 = boto3.client('s3')

    # Use the S3Wrapper class to avoid having to transfer the entire
    # file into RAM
    with zipfile.ZipFile(S3Wrapper(s3, source_bucket, source_key)) as zf:
        for name in zf.namelist():
            print(f"Uploading {name}...")
            # Use upload_fileobj to only stream to S3
            s3.upload_fileobj(zf.open(name, 'r'), dest_bucket, dest_prefix + name)


# Create a file-like object with a bare-bones implementation
# for reading only.  Other than caching the file size, data is read
# from S3 for each call
class S3Wrapper:
    def __init__(self, s3, bucket, key):
        self.s3 = s3
        self.bucket = bucket
        self.key = key
        self.pos = 0
        self.length = s3.head_object(Bucket=bucket, Key=key)['ContentLength']

    def seekable(self):
        return True

    def seek(self, offset, whence=0):
        if whence == 0:
            self.pos = offset
        elif whence == 1:
            self.pos += offset
        else:
            self.pos = self.length + offset
        # File-like objects return the new absolute position from seek()
        return self.pos

    def tell(self):
        return self.pos

    def read(self, count=None):
        # At or past the end there is nothing to fetch, and S3 would
        # reject a range that starts beyond the last byte
        if count == 0 or self.pos >= self.length:
            return b''
        if count is None or count < 0:
            byte_range = f'bytes={self.pos}-'
        else:
            byte_range = f'bytes={self.pos}-{self.pos + count - 1}'
        resp = self.s3.get_object(Bucket=self.bucket, Key=self.key, Range=byte_range)
        data = resp['Body'].read()
        self.pos += len(data)
        return data
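
To call `s3_zip_to_s3` from a Lambda trigger like the one described in the question, the bucket and key of the uploaded zip can be taken from the S3 event notification. The following is only a sketch: the helper `iter_s3_objects`, the `unzip` hook, and the `my-target-bucket` / `unzipped/` destination are hypothetical names that are not part of the original answer. Object keys arrive URL-encoded in the event, so they are decoded with `unquote_plus`.

```python
from urllib.parse import unquote_plus


def iter_s3_objects(event):
    """Yield (bucket, key) for every record in an S3 event notification."""
    for record in event.get('Records', []):
        s3_info = record['s3']
        bucket = s3_info['bucket']['name']
        # Keys are URL-encoded in the event (for example, spaces become '+')
        key = unquote_plus(s3_info['object']['key'])
        yield bucket, key


def lambda_handler(event, context, unzip=None):
    # `unzip` is a hypothetical hook so this sketch can run without AWS;
    # in a deployment it would be the s3_zip_to_s3 function shown above.
    handled = []
    for bucket, key in iter_s3_objects(event):
        # Ignore anything that is not a zip archive
        if not key.lower().endswith('.zip'):
            continue
        if unzip is not None:
            # Destination bucket and prefix are placeholder values
            unzip(bucket, key, 'my-target-bucket', 'unzipped/')
        handled.append((bucket, key))
    return handled
```

Lambda itself only passes `event` and `context`, so in a real function the `unzip` default would be replaced by a direct reference to `s3_zip_to_s3`.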

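While this wrapper approach works, and in my testing it used far less RAM than the size of the zip file, it is not fast by any means.

I would probably recommend a different solution, such as a worker on EC2 or ECS that does the work for you.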
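
One likely contributor to the slowness is that every `read` call on `S3Wrapper` becomes its own ranged GET request. A possible variation, shown here only as a sketch that has not been benchmarked against S3, is to write the wrapper as an `io.RawIOBase` subclass and put `io.BufferedReader` in front of it, so that small reads can be served from a local buffer. The names `S3RawReader` and `open_s3_buffered` and the 8 MiB buffer size are assumptions, not part of the original answer.

```python
import io


class S3RawReader(io.RawIOBase):
    """Read-only raw stream over an S3 object, using ranged GET requests."""

    def __init__(self, s3, bucket, key):
        super().__init__()
        self.s3 = s3
        self.bucket = bucket
        self.key = key
        self.pos = 0
        self.length = s3.head_object(Bucket=bucket, Key=key)['ContentLength']

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self.pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        else:
            self.pos = self.length + offset
        return self.pos

    def readinto(self, buf):
        # Returning 0 signals end of file without issuing a request
        if len(buf) == 0 or self.pos >= self.length:
            return 0
        last = min(self.pos + len(buf), self.length) - 1
        resp = self.s3.get_object(
            Bucket=self.bucket, Key=self.key, Range=f'bytes={self.pos}-{last}'
        )
        data = resp['Body'].read()
        buf[:len(data)] = data
        self.pos += len(data)
        return len(data)


def open_s3_buffered(s3, bucket, key, buffer_size=8 * 1024 * 1024):
    # buffer_size is an assumed value; tune it against available memory
    return io.BufferedReader(S3RawReader(s3, bucket, key), buffer_size=buffer_size)
```

The returned object could be passed to `zipfile.ZipFile` in place of `S3Wrapper(...)`. Whether it actually reduces the run time would need to be measured, since the buffer also costs memory.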
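
Problem 2 from the question is not covered above. One thing that can be seen in the question's code is that the `else` branch writes `s3_resource.Object(source_bucket, file.key).delete` without parentheses. In Python that expression only looks up the method and never calls it, so no error is raised and nothing is deleted, which may be part of what the asker observed. The stand-in class below is hypothetical and only illustrates the difference. It does not talk to S3.

```python
class FakeS3Object:
    """Hypothetical stand-in for an S3 object resource; records delete calls."""

    def __init__(self):
        self.deleted = False

    def delete(self):
        self.deleted = True


obj = FakeS3Object()

# Attribute access only: the method is looked up but never invoked
obj.delete
deleted_without_call = obj.deleted

# With parentheses the method actually runs
obj.delete()
deleted_with_call = obj.deleted
```

Applied to the question's code, the line in the `else` branch would become `s3_resource.Object(source_bucket, file.key).delete()`.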

【Comments】: