Python Google Drive API - Downloading duplicate files (answers)

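【Question title】: Python Google Drive API - Downloading duplicate files
【Posted】: 2023-03-11 06:10:01
【Question description】:

I'm trying to download a lot of different files from Google Drive and then combine them into smaller files. However, for some reason my code is downloading duplicate files, or possibly just reading the BytesIO object incorrectly. I've pasted the code below; here is a quick explanation of the file structure first.

I have about 135 folders, each containing 52 files. My goal is to loop through each folder, download its 52 files, and then convert those 52 files into a single, more compressed file (getting rid of unnecessary/duplicate data).

Code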

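import io
import os

from googleapiclient.discovery import build
from googleapiclient.http import MediaIoBaseDownload
from httplib2 import Http
from oauth2client import file, client, tools

# SCOPES and convert_all_netcdf are defined elsewhere in the asker's script.


def main(temporary_workspace, workspace):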
    store = file.Storage('tokenRead.json')
    big_list_of_file_ids = []

    creds = store.get()
    if not creds or creds.invalid:
        flow = client.flow_from_clientsecrets('credentials.json', SCOPES)
        creds = tools.run_flow(flow, store)
    service = build('drive', 'v3', http=creds.authorize(Http()))

    # Call the Drive v3 API
    results = service.files().list(
        q="'MAIN_FOLDER_WITH_SUBFOLDERS_ID' in parents",
        pageSize=1000, fields="nextPageToken, files(id, name)").execute()
    items = results.get('files', [])

    list_of_folders_and_ids = []
    if not items:
        raise RuntimeError('No files found.')
    else:
        for item in items:
            list_of_folders_and_ids.append((item['name'], item['id']))

    list_of_folders_and_ids.sort(key=lambda x: x[0])

    for folder_id in list_of_folders_and_ids:
        start_date = folder_id[0][:-3]
        id = folder_id[1]

        print('Folder: ', start_date, ', ID: ', id)

        query_string = "'{}' in parents".format(id)
        results = service.files().list(
            q=query_string, fields="nextPageToken, files(id, name)"
        ).execute()
        items = results.get('files', [])

        list_of_files_and_ids = []
        if not items:
            raise RuntimeError('No files found.')
        else:
            for item in items:
                list_of_files_and_ids.append((item['name'], item['id']))

        for file_id in list_of_files_and_ids:
            # Downloading the files
            if file_id[1] not in big_list_of_file_ids:
                big_list_of_file_ids.append(file_id[1])
            else:
                print('Duplicate file ID!')
                exit()

            print('\tFile: ', file_id[0], ', ID: ', file_id[1])

            request = service.files().get_media(fileId=file_id[1])
            fh = io.BytesIO()
            downloader = MediaIoBaseDownload(fh, request)
            done = False
            while done is False:
                status, done = downloader.next_chunk()
                print("Download: {}".format(int(status.progress() * 100)))

            fh.seek(0)

            temporary_location = os.path.join(temporary_workspace, file_id[0])
            with open(temporary_location, 'wb') as out:
                out.write(fh.read())

            fh.close()

        convert_all_netcdf(temporary_workspace, start_date, workspace, r'Qout_south_america_continental',
                           num_of_rivids=62317)

        os.system('rm -rf %s/*' % temporary_workspace)

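As you can see, I first get the IDs of all the folders, then loop through each folder and get the 52 files in it. I save all 52 files to a temporary folder, convert them into one file that I save in another directory, then delete all 52 files and move on to the next folder in Google Drive. The problem is that when I compare the files produced by the convert_all_netcdf method, they are all identical. I feel as if I'm doing something wrong with the BytesIO object; do I need to do more to clear it? It could also be that I'm accidentally reading from the same folder each time in my Google Drive API calls. Any help is appreciated.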
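
One way to tell whether the downloads themselves are duplicated, or whether the problem lies later in the conversion step, is to hash the files once they are written to the temporary folder. The sketch below is only an illustration and is not part of the asker's script; `group_by_digest` and `duplicated_files` are hypothetical helper names, and only the standard library is used. It reads each file fully into memory, which is assumed to be acceptable for files of this size.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_by_digest(folder):
    """Map each SHA-256 digest to the names of the files in `folder` with that content."""
    groups = defaultdict(list)
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path.name)
    return dict(groups)


def duplicated_files(folder):
    """Return only the groups of file names that share byte-identical content."""
    return [names for names in group_by_digest(folder).values() if len(names) > 1]
```

If `duplicated_files` returns an empty list for the temporary folder, and the digests also differ from one Drive folder to the next, the download and BytesIO handling are probably fine and the fault is more likely downstream.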

【Comments】:

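Not an answer to your question, but there is a separate bug in your code. You need to iterate your files.list until nextPageToken == null. I suspect you think that setting pageSize to 1000, with 53 files in each folder, ensures that you fetch all 53 in a single call. That is not how pageSize works. pageSize is the maximum number of results fetched at a time, so you need to keep iterating until nextPageToken is null.
@pinoyyid So should I remove that part of the code? The default seems to be 100; is this just an optimization?
Please read up on pageSize, because I suspect you are still misunderstanding it. Don't remove any code; just add extra code to loop files.list until nextPageToken is null.
@pinoyyid Thanks for helping, but I don't see anything like what you're suggesting in the API...
Indeed there isn't. You need to write the loop yourself. There are 53 files in each folder. Imagine you had set pageSize to 10. Obviously you would need to keep calling files.list until nextPageToken == null to get all 53 files. Your code assumes that by setting pageSize to 1000, all 53 files will be returned in a single call to files.list. That assumption is incorrect. Your code needs to iterate the calls to files.list exactly as it would if you had set pageSize to 10.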
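
The loop described in these comments can be sketched roughly as follows. This is a minimal sketch rather than the asker's code: `list_all_files` is a hypothetical helper name, and `service` is assumed to be the Drive v3 client object returned by `build('drive', 'v3', ...)`. It relies only on `files().list()` with the `q`, `pageSize`, `pageToken` and `fields` parameters, and stops once a response no longer carries a `nextPageToken`.

```python
def list_all_files(service, query, page_size=1000):
    """Return every (name, id) pair matching `query`, following nextPageToken."""
    files = []
    page_token = None  # no token on the first request
    while True:
        response = service.files().list(
            q=query,
            pageSize=page_size,
            pageToken=page_token,
            fields="nextPageToken, files(id, name)",
        ).execute()
        for item in response.get("files", []):
            files.append((item["name"], item["id"]))
        # A missing nextPageToken means this was the last page.
        page_token = response.get("nextPageToken")
        if page_token is None:
            break
    return files
```

In the question's script this would stand in for both single `files().list(...).execute()` calls, for example `list_all_files(service, "'{}' in parents".format(id))` for the files of one folder.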

Tags: python python-3.x google-drive-api bytesio

【Solution 1】:

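I realize this may not have been a good question; I asked it mainly because I thought I was doing something wrong with the BytesIO object, but I found the answer. I was reading all the downloaded files with a library called Xarray and forgot to close the connection. As a result, subsequent loops only read from the first connection, which gave me duplicates. Thanks to anyone who tried!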
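
A minimal sketch of that fix, assuming the files are opened with `xarray.open_dataset`. The asker's `convert_all_netcdf` is not shown in the post, so `read_variable` and its `open_dataset` parameter are hypothetical names used only for illustration. Opening the dataset in a `with` block closes the underlying file when the block exits, so a later loop iteration that writes a new file to the same path should not be served data from the earlier, still-open handle.

```python
def read_variable(path, variable, open_dataset=None):
    """Read one variable fully into memory and close the file before returning."""
    if open_dataset is None:
        import xarray as xr  # imported lazily; xarray is a third-party dependency
        open_dataset = xr.open_dataset
    # The with block closes the dataset on exit, even if reading raises.
    with open_dataset(path) as ds:
        # .values materializes the data as an array while the file is still open.
        return ds[variable].values
```

The `open_dataset` argument exists only so the function can be exercised without a real NetCDF file; in normal use it would be left at its default.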

【Discussion】: