GCP Cloud Function (Python) - GCS copy files - duplicate files: answers

【Question Title】: GCP cloud function python - GCS copy files - duplicate files
【Posted】: 2020-07-22 13:15:39
【Question Description】:

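I'm trying to copy files from GCS to a different location, and I need to do it in real time with a Cloud Function. I created a function and it works. The problem is that the file gets copied multiple times, into multiple folders.

Example: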

source file path: gs://logbucket/mylog/2020/07/22/log.csv

Expected Target: gs://logbucket/hivelog/2020/07/22/log.csv

My code:

from google.cloud import storage

def hello_gcs_generic(data, context):
    sourcebucket=format(data['bucket'])
    source_file=format(data['name'])
    year = source_file.split("/")[1]
    month = source_file.split("/")[2]
    day = source_file.split("/")[3]
    filename=source_file.split("/")[4]
    print(year)
    print(month)
    print(day)
    print(filename)
    print(sourcebucket)
    print(source_file)


    storage_client = storage.Client()

    source_bucket = storage_client.bucket(sourcebucket)
    source_blob = source_bucket.blob(source_file)
    destination_bucket = storage_client.bucket(sourcebucket)
    destination_blob_name = 'hivelog/year='+year+'/month='+month+'/day='+day+'/'+filename


    blob_copy = source_bucket.copy_blob(
        source_blob, destination_bucket, destination_blob_name
    )
    source_blob.delete()
    print(
        "Blob {} in bucket {} copied to blob {} in bucket {}.".format(
            source_blob.name,
            source_bucket.name,
            blob_copy.name,
            destination_bucket.name,
        )
    )

Output:

You can see year=year=2020 here. Where does that come from? Inside it I also have folders like year=year=2020/month=month=07/.

I can't figure out how to fix this.

【Comments】:

Did these answers help you?
Yes, @dustin-ingram's answer helped me.

Tags: python google-cloud-platform google-cloud-functions

【Solution 1】:

You are writing to the same bucket you are copying from:

destination_bucket = storage_client.bucket(sourcebucket)

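Every time a new file is added to the bucket, the Cloud Function is triggered again, including for the files the function itself writes.

You need to either use two different buckets, or add a condition based on the first part of the path: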

top_level_directory = source_file.split("/")[0]
if top_level_directory == "mylog":
    # Do the copying
    pass
elif top_level_directory == "hivelog":
    # This is a file created by the function, do nothing
    pass
else:
    # We weren't expecting this top level directory
    pass
【Comments】:

【Solution 2】:

As an educated guess, the source path you are copying from actually has the format

/foo/year=2020/month=42/...

so when you split on the slash, you get

foo
year=2020
month=42

When you reassemble these components, you add yet another year=/month=/... prefix

destination_blob_name = 'hivelog/year='+year+'/month='+month+ ...

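and there you have it: year=year=year= after 3 iterations...

Are you also sure you aren't iterating over files that have already been copied? That would cause this too.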
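
The accumulation described above can be reproduced without GCS at all. The sketch below takes the path-building lines from the question and feeds each result back in, which is roughly what happens when the function is re-triggered by its own output. The function names `asker_destination` and `simulate_retriggers` are made up for illustration.

```python
def asker_destination(source_name):
    """Path logic from the question, with no check on the top-level folder."""
    parts = source_name.split("/")
    year, month, day, filename = parts[1], parts[2], parts[3], parts[4]
    return "hivelog/year=" + year + "/month=" + month + "/day=" + day + "/" + filename


def simulate_retriggers(first_name, rounds):
    """Feed each output name back in, as a trigger on the same bucket would."""
    names = []
    name = first_name
    for _ in range(rounds):
        name = asker_destination(name)
        names.append(name)
    return names


for name in simulate_retriggers("mylog/2020/07/22/log.csv", 3):
    print(name)
```

The second line printed is hivelog/year=year=2020/month=month=07/day=day=22/log.csv, which matches the folders described in the question.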

【Comments】:

I'm not iterating.

【Solution 3】:

import os
import gcsfs


def hello_gcs_generic(data, context):
    fs = gcsfs.GCSFileSystem(project="Project_Name", token=os.getenv("GOOGLE_APPLICATION_CREDENTIALS"))
    source_filepath = f"{data['bucket']}/{data['name']}"
    destination_filepath = source_filepath.replace("mylog","hivelog")
    fs.cp(source_filepath,destination_filepath)
    print(f"Blob {data['name']} in bucket {data['bucket']} copied to hivelog")

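This should give you a head start on what you're trying to accomplish. Replace Project_Name with the name of the GCP project that contains the bucket.

It also assumes you have service account credentials in a JSON file pointed to by the GOOGLE_APPLICATION_CREDENTIALS environment variable, which I'm assuming based on your use of google.cloud storage.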
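
One caveat with the string replacement in the function above: `str.replace` swaps every occurrence of "mylog", including one inside the bucket name, and it leaves a path that is already under hivelog unchanged, so source and destination would be identical. A hedged sketch of a stricter variant follows; `swap_top_level` is a hypothetical helper, not part of gcsfs, and it assumes the path is built as "<bucket>/<object name>" as in the function above.

```python
def swap_top_level(path, source_prefix="mylog", target_prefix="hivelog"):
    """Swap the folder right after the bucket name, or return None.

    None means the object is not under source_prefix and should be skipped.
    """
    bucket, _, object_name = path.partition("/")
    top_level, _, rest = object_name.partition("/")
    if top_level != source_prefix or not rest:
        return None
    return f"{bucket}/{target_prefix}/{rest}"


print(swap_top_level("logbucket/mylog/2020/07/22/log.csv"))
print(swap_top_level("logbucket/hivelog/2020/07/22/log.csv"))
```

The caller would run `fs.cp` only when the helper returns a path, which also keeps the function from acting on the files it wrote itself.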

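You can now accept "mylog" or "hivelog" as parameters to make this useful in other scenarios. Also, if you ever need to split the filename again, one line does it:

_,year,month,day,filename = data['name'].split('/')

In this case the underscore just tells you and others that you don't intend to use that part of the split.

You can use extended unpacking to ignore multiple values, e.g.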

*_,month,day,filename = data['name'].split('/')

or you can combine the two

*_,month,day,_ = data['name'].split('/')
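
To make the difference between these forms concrete, here is a small sketch using the object name from the question (the variable names with a trailing `2` are only there to keep the two results apart):

```python
name = "mylog/2020/07/22/log.csv"

# Fixed-length unpacking: the split must yield exactly five parts.
_, year, month, day, filename = name.split("/")

# Extended unpacking: *_ absorbs any number of leading parts.
*_, month2, day2, filename2 = name.split("/")

print(year, month, day, filename)
print(month2, day2, filename2)
```

The fixed-length form raises ValueError when the name has more or fewer than five parts. The starred form only fails when there are fewer than three parts, so it is more forgiving of deeper folder nesting.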

Edit: link to the gcsfs documentation

【Comments】: