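List all directory and subdirectory paths from S3: answers

【Question title】: List all directories and subdirectories path from s3
【Posted】: 2021-07-30 15:09:57
【Question】:

I know many similar questions have been asked on SO (especially this one), but none of the answers really solves my case. Of course, I know there is no such thing as a folder in S3. Internally, everything is stored as keys.

I have the following directory structure: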

TWEAKS/date=2020-03-19/hour=20/file.gzip
TWEAKS/date=2020-03-20/hour=21/file.gzip
TWEAKS/date=2020-03-21/hour=22/file.gzip
TWEAKS/date=2020-03-22/hour=23/file.gzip

Here is what I tried:

def list_folders(s3_client, bucket_name):
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix='TWEAKS/', Delimiter='/')
    for content in response.get('CommonPrefixes', []):
        yield content.get('Prefix')

s3_client = session.client('s3')
folder_list = list_folders(s3_client, bucket_name)
for folder in folder_list:
    print('Folder found: %s' % folder)

But this only lists the directories down to the first level:

Folder found: TWEAKS/date=2020-03-19/
Folder found: TWEAKS/date=2020-03-20/
Folder found: TWEAKS/date=2020-03-21/
Folder found: TWEAKS/date=2020-03-22/

I can't simply append the subdirectory to the prefix, because the names differ (hour=21, hour=22, ...). Is there a way to get the following output?

Folder found: TWEAKS/date=2020-03-19/hour=20/
Folder found: TWEAKS/date=2020-03-20/hour=21/
Folder found: TWEAKS/date=2020-03-21/hour=22/
Folder found: TWEAKS/date=2020-03-22/hour=23/

【Question comments】:

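You would need to look at each CommonPrefix recursively: pass the CommonPrefix as the new Prefix, then work through the new list of CommonPrefixes. Frankly, it would be easier to list all objects and then parse the key strings, since that requires the fewest API calls. If your bucket is big, you could consider using Amazon S3 Inventory to obtain a daily CSV file of the bucket contents.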
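The recursive approach described in the comment above could look roughly like the sketch below. It is only an illustration: `list_prefixes_at_depth` is a hypothetical helper name, and `s3_client` is assumed to be a client created with `boto3.client('s3')`. Each parent prefix costs at least one listing request, which is why the comment suggests enumerating keys instead when there are many prefixes.

```python
# Hypothetical helper (not part of boto3): walk CommonPrefixes level by level.
def list_prefixes_at_depth(s3_client, bucket_name, prefix, depth):
    """Yield the "folder" prefixes found `depth` levels below `prefix`."""
    paginator = s3_client.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix, Delimiter='/')
    for page in pages:
        for common_prefix in page.get('CommonPrefixes', []):
            child = common_prefix['Prefix']
            if depth <= 1:
                yield child
            else:
                # Descend: the CommonPrefix becomes the new Prefix.
                yield from list_prefixes_at_depth(
                    s3_client, bucket_name, child, depth - 1)

# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   s3 = boto3.client('s3')
#   for folder in list_prefixes_at_depth(s3, 'mybucket', 'TWEAKS/', 2):
#       print('Folder found: %s' % folder)
```

With the layout from the question, a depth of 2 should reach the `hour=` level, at the price of one extra request per `date=` prefix.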

Tags: python amazon-web-services amazon-s3 boto3

【Solution 1】:

I think you need to actually enumerate all the objects and then infer the unique folder names, like this:

import os
import boto3

def list_folders(s3_client, bucket_name):
    folders = set()
    response = s3_client.list_objects_v2(Bucket=bucket_name, Prefix='TWEAKS/')

    for content in response.get('Contents', []):
        folders.add(os.path.dirname(content['Key']))

    return sorted(folders)

s3 = boto3.client("s3")
folder_list = list_folders(s3, 'mybucket')

for folder in folder_list:
    print('Folder found: %s' % folder)

The output is:

Folder found: TWEAKS/date=2020-03-19/hour=20
Folder found: TWEAKS/date=2020-03-20/hour=21
Folder found: TWEAKS/date=2020-03-21/hour=22
Folder found: TWEAKS/date=2020-03-22/hour=23
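One caveat with the function above: a single `list_objects_v2` call returns at most 1,000 keys, so folders in a larger prefix could be missed. A minimal sketch of the same idea with a paginator follows. `iter_keys` and `folders_from_keys` are illustrative names rather than boto3 functions, and `s3_client` is assumed to come from `boto3.client('s3')`. This variant appends a trailing slash to match the output format the question asks for.

```python
import posixpath

# Illustrative helper: page through every key under a prefix.
def iter_keys(s3_client, bucket_name, prefix):
    """Yield every object key under `prefix`, following pagination."""
    paginator = s3_client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for obj in page.get('Contents', []):
            yield obj['Key']

# Illustrative helper: derive "folders" purely from key strings.
def folders_from_keys(keys):
    """Return the sorted, de-duplicated parent "folders" of the given keys."""
    folders = set()
    for key in keys:
        # S3 keys always use '/', so posixpath is used regardless of the OS.
        parent = posixpath.dirname(key)
        if parent:  # keys at the bucket root have no parent folder
            folders.add(parent + '/')
    return sorted(folders)

# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   s3 = boto3.client('s3')
#   for folder in folders_from_keys(iter_keys(s3, 'mybucket', 'TWEAKS/')):
#       print('Folder found: %s' % folder)
```

Keeping the string parsing separate from the listing makes the inference step easy to check without touching S3.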

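【Comments】:

So boto3 has no option like the AWS CLI does? I really need to avoid reading every file in the directories, because that takes a lot of time.
I don't see how you can do this without enumerating the objects. In general, S3 has no folders; they are all inferred from the object keys. So you have to list all the keys in order to make that inference.
My bad, the AWS CLI command aws s3 ls s3://<bucket_name> --recursive also shows the file names.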

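【Solution 2】:

I stumbled on this question while trying to implement an ls that lists the S3 objects and "subdirectories" immediately under a given path. (Note that S3 has no "folders", only key/value pairs.)

This is not exactly an answer, but it is related, and I thought I should share it because it builds on jarmod's answer.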

import boto3
S3_CLIENT = boto3.client(...)

def ls(bucket_and_path):
    # Split "bucket/some/path" into the bucket name and the key prefix
    parts = bucket_and_path.split('/')
    bucket, prefix = parts[0], '/'.join(parts[1:])

    # A non-empty prefix must end with '/'; the bucket root keeps an empty prefix
    if prefix and not prefix.endswith('/'):
        prefix += '/'

    # Retrieve results in batches (default list methods will truncate)
    paginator = S3_CLIENT.get_paginator('list_objects')
    page_iterator = paginator.paginate(Bucket=bucket, Prefix=prefix)

    # Get immediate child "folders" and/or files of prefix
    children_of_prefix = set()
    for response in page_iterator:
        for content in response.get('Contents', []):
            full_path_to_object = content['Key']
            # Every listed key starts with prefix, so slicing removes it
            relative_path_after_prefix = full_path_to_object[len(prefix):]
            child_of_prefix = relative_path_after_prefix.split('/')[0]
            if child_of_prefix:  # skip an object whose key equals the prefix itself
                children_of_prefix.add(child_of_prefix)

    # Return a sorted list so the result is stable
    return sorted(children_of_prefix)

Usage:

>>> ls('my-bucket')
['dir_1', 'dir_2', 'somefile.txt']
>>> ls('my-bucket/dir_1')
['another_file.txt']
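Because ls only needs the immediate children, the `Delimiter='/'` parameter from the question may be a cheaper fit here: S3 then groups deeper keys into CommonPrefixes instead of returning every object. The following is a sketch under that assumption, not the answer author's method. `ls_immediate` is a hypothetical name, and `s3_client` is assumed to be created with `boto3.client('s3')`.

```python
# Hypothetical variant of ls that lets S3 do the grouping via Delimiter.
def ls_immediate(s3_client, bucket_name, prefix=''):
    """Return the sorted immediate children of `prefix`.

    Sub-"folders" keep their trailing '/', plain objects do not.
    """
    if prefix and not prefix.endswith('/'):
        prefix += '/'

    children = set()
    paginator = s3_client.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket_name, Prefix=prefix, Delimiter='/')
    for page in pages:
        # Keys with a further '/' after the prefix are rolled up here.
        for common_prefix in page.get('CommonPrefixes', []):
            children.add(common_prefix['Prefix'][len(prefix):])
        # Objects directly under the prefix are returned individually.
        for obj in page.get('Contents', []):
            name = obj['Key'][len(prefix):]
            if name:  # skip an object whose key equals the prefix itself
                children.add(name)
    return sorted(children)

# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   s3 = boto3.client('s3')
#   print(ls_immediate(s3, 'my-bucket', 'dir_1'))
```

Unlike the function above, this version marks "folders" with a trailing slash, which makes them distinguishable from objects of the same name.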

【Comments】: