【Posted】: 2016-05-18 19:34:35
【Question】:
I have been googling for a long time, but I haven't found a way to export my backups (stored inside a bucket) to BigQuery without doing it manually...
Is this possible?
Many thanks!
【Discussion】:
Tags: python google-app-engine google-bigquery
You should be able to do this through the python-bigquery API.
First, you need to connect to the BigQuery service. This is the code I use to do so:
import httplib2
from apiclient.discovery import build
from oauth2client.client import SignedJwtAssertionCredentials


class BigqueryAdapter(object):
    def __init__(self, **kwargs):
        self._project_id = kwargs['project_id']
        self._key_filename = kwargs['key_filename']
        self._account_email = kwargs['account_email']
        self._dataset_id = kwargs['dataset_id']
        self.connector = None
        self.start_connection()

    def start_connection(self):
        # Read the service account's private key and build an authorized
        # BigQuery v2 client from it.
        with open(self._key_filename) as key_file:
            key = key_file.read()
        credentials = SignedJwtAssertionCredentials(
            self._account_email,
            key,
            'https://www.googleapis.com/auth/bigquery')
        authorization = credentials.authorize(httplib2.Http())
        self.connector = build('bigquery', 'v2', http=authorization)
After that, you can use self.connector to run jobs (in this answer you will find some examples).
To load your backup from Google Cloud Storage, you have to define a configuration like this:
body = {
    "configuration": {
        "load": {
            "sourceFormat": "",  # Either "CSV", "DATASTORE_BACKUP", "NEWLINE_DELIMITED_JSON" or "AVRO".
            "fieldDelimiter": ",",  # (if it's comma separated)
            "destinationTable": {
                "projectId": "",  # your_project_id
                "datasetId": "",  # your_dataset_id
                "tableId": "",    # your_table_to_save_the_data
            },
            "writeDisposition": "",  # "WRITE_TRUNCATE" or "WRITE_APPEND"
            "sourceUris": [
                # The path to your backup in Google Cloud Storage. It could be
                # something like "gs://bucket_name/filename*". Notice you can
                # use the "*" wildcard.
            ],
            "schema": {  # [Optional] The schema for the destination table. The schema can be omitted if the destination table already exists, or if you're loading data from Google Cloud Datastore.
                "fields": [  # Describes the fields in a table.
                    {
                        "name": "A String",  # [Required] The field name. The name must contain only letters (a-z, A-Z), numbers (0-9), or underscores (_), and must start with a letter or underscore. The maximum length is 128 characters.
                        "type": "A String",  # [Required] The field data type. Possible values include STRING, BYTES, INTEGER, FLOAT, BOOLEAN, TIMESTAMP or RECORD (where RECORD indicates that the field contains a nested schema).
                        "mode": "A String",  # [Optional] The field mode. Possible values include NULLABLE, REQUIRED and REPEATED. The default value is NULLABLE.
                        "description": "A String",  # [Optional] The field description. The maximum length is 16K characters.
                        "fields": [  # [Optional] Describes the nested schema fields if the type property is set to RECORD.
                            # Object with schema name: TableFieldSchema
                        ],
                    },
                ],
            },
        },
    },
}
Then run:
self.connector.jobs().insert(projectId=self._project_id, body=body).execute()
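To tie the pieces together, here is a minimal sketch assuming a Datastore backup (so no fieldDelimiter or schema is needed). The helper names make_backup_load_body and wait_for_job are my own, not part of the API; the insert response carries the job id under job['jobReference']['jobId'], which you can poll with jobs().get():

```python
import time


def make_backup_load_body(project_id, dataset_id, table_id, source_uri):
    """Assemble a load-job body for a Datastore backup file in GCS."""
    return {
        "configuration": {
            "load": {
                # DATASTORE_BACKUP files carry their own schema, so neither
                # "fieldDelimiter" nor "schema" is required here.
                "sourceFormat": "DATASTORE_BACKUP",
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "sourceUris": [source_uri],
            }
        }
    }


def wait_for_job(connector, project_id, job_id, poll_seconds=5):
    """Poll jobs().get() until the job reaches the DONE state.

    Returns True on success, False if the finished job reports an error.
    """
    while True:
        job = connector.jobs().get(projectId=project_id,
                                   jobId=job_id).execute()
        if job['status']['state'] == 'DONE':
            # "errorResult" is only present when the job failed.
            return 'errorResult' not in job['status']
        time.sleep(poll_seconds)
```

With the adapter above you would do something like: insert the job, read the id from the response, then wait_for_job(adapter.connector, project_id, job_id) before querying the new table.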
Hope this is what you were looking for. Let us know if you run into any problems.
【Discussion】: