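I know this comes well after the question was asked and answered, but the accepted answer above does not work. I tried to do the same thing you describe, and I also tried to use the same approach to update an existing external table that had gained some new columns. Assuming you have the JSON file stored somewhere like /tmp/schema.json, the snippet further down is the correct one to use. The schema file looks like this: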
[
  {
    "mode": "NULLABLE",
    "name": "mycolumn1",
    "type": "INTEGER"
  },
  {
    "mode": "NULLABLE",
    "name": "mycolumn2",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "mycolumn3",
    "type": "STRING"
  }
]
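Before handing a file like this to `schema_from_json`, it can be worth sanity-checking its shape. This is only a minimal sketch using the standard library. The helper name `check_schema_fields` and the choice of required keys are my own assumptions, not part of the BigQuery client.

```python
import json

# Assumption: every field object should carry at least these two keys.
REQUIRED_KEYS = {"name", "type"}


def check_schema_fields(raw_json):
    """Parse a schema JSON string and return the column names in order."""
    fields = json.loads(raw_json)
    if not isinstance(fields, list):
        raise ValueError("schema must be a JSON array of field objects")
    names = []
    for index, field in enumerate(fields):
        missing = REQUIRED_KEYS - set(field)
        if missing:
            raise ValueError(f"field {index} is missing {sorted(missing)}")
        names.append(field["name"])
    return names


example = '[{"mode": "NULLABLE", "name": "mycolumn1", "type": "INTEGER"}]'
print(check_schema_fields(example))
```

For a real file you would pass the contents of /tmp/schema.json instead of the inline `example` string.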
If you already have the API representation of the options you want to add to the external table, the following is all you need.
from google.cloud import bigquery

client = bigquery.Client()

# The dataset must exist first
dataset_name = 'some_dataset'
# Note: newer client releases deprecate client.dataset() in favor of
# bigquery.DatasetReference(client.project, dataset_name)
dataset_ref = client.dataset(dataset_name)
table_name = 'tablename'

# Or wherever your JSON schema lives
schema = client.schema_from_json('/tmp/schema.json')

external_table_options = {
    "autodetect": True,
    "maxBadRecords": 9999999,
    "csvOptions": {
        "skipLeadingRows": 1
    },
    "sourceFormat": "CSV",
    "sourceUris": [
        "gs://bucketname/file_*.csv"
    ]
}
# from_api_repr is a classmethod on bigquery.ExternalConfig, not on the client
external_config = bigquery.ExternalConfig.from_api_repr(external_table_options)

table = bigquery.Table(dataset_ref.table(table_name), schema=schema)
table.external_data_configuration = external_config

client.create_table(
    table,
    # Now you can create the table safely with this option
    # so that it does not fail if the table already exists
    exists_ok=True
)

# And if you seek to update the table's schema and/or its
# external options through the same script then use
client.update_table(
    table,
    # As a side note, this portion of the code had me confounded for hours.
    # I could not for the life of me figure out that "fields" did not point
    # to the table's columns, but pointed to the `google.cloud.bigquery.Table`
    # object's attributes. IMHO, the naming of this parameter is horrible
    # given "fields" are already a thing (i.e. `SchemaField`s).
    fields=['schema', 'external_data_configuration']
)
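If you run both calls from the same script every time, they can be wrapped in one helper. This is a rough sketch, not something from the client library: the function name `create_or_update_external_table` and the `RecordingClient` stand-in are hypothetical, and the stand-in exists only so the example runs without contacting BigQuery. With a real `bigquery.Client` you would pass the client and the `table` object built above.

```python
def create_or_update_external_table(
    client, table, fields=("schema", "external_data_configuration")
):
    """Create the table if it is missing, then push the listed Table attributes."""
    # exists_ok=True makes the create step a no-op for an existing table.
    client.create_table(table, exists_ok=True)
    # `fields` names attributes of the Table object, not table columns.
    return client.update_table(table, list(fields))


class RecordingClient:
    """Hypothetical stand-in that records calls instead of talking to BigQuery."""

    def __init__(self):
        self.calls = []

    def create_table(self, table, exists_ok=False):
        self.calls.append(("create_table", table, exists_ok))
        return table

    def update_table(self, table, fields):
        self.calls.append(("update_table", table, fields))
        return table


fake = RecordingClient()
create_or_update_external_table(fake, "some_dataset.tablename")
for call in fake.calls:
    print(call)
```

The update is redundant right after a fresh create, but it keeps the script idempotent when the schema or external options change later.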
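Besides setting the external table configuration from its API representation, you can set all of the same properties by assigning to the correspondingly named attributes on the bigquery.ExternalConfig object itself. So this is another way to write just the external_config part of the code above.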
external_config = bigquery.ExternalConfig('CSV')
external_config.autodetect = True
external_config.max_bad_records = 9999999
external_config.options.skip_leading_rows = 1
external_config.source_uris = ["gs://bucketname/file_*.csv"]
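Comparing the two versions, the API representation uses camelCase keys while the Python attributes use snake_case. A small helper can illustrate that relationship. This is only a sketch: `camel_to_snake` is a name I made up, and the rule is not guaranteed to hold for every property (for example, `csvOptions` is reached through `.options` in the code above, not through a mechanically converted name).

```python
import re


def camel_to_snake(name):
    """Convert an API-style camelCase key to a snake_case attribute name."""
    # Insert an underscore before each capital that follows a lowercase
    # letter or digit, then lowercase the whole string.
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


for api_name in ("autodetect", "maxBadRecords", "skipLeadingRows", "sourceUris"):
    print(api_name, "->", camel_to_snake(api_name))
```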
However, I have to raise another gripe about the Google documentation. The bigquery.ExternalConfig.options property claims it can be set with a dictionary:
>>> from google.cloud import bigquery
>>> help(bigquery.ExternalConfig.options)
Help on property:
Optional[Dict[str, Any]]: Source-specific options.
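But that is completely false. As you can see above, the Python object attribute names and the API representation names of those same attributes differ slightly. Either way, if you have a dict of the source-specific options (e.g. CSVOptions, GoogleSheetsOptions, BigtableOptions, etc.) and try to assign that dict to the options attribute, it laughs in your face and says mean things like this.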
>>> from google.cloud import bigquery
>>> external_config = bigquery.ExternalConfig('CSV')
>>> options = {'skip_leading_rows': 1}
>>> external_config.options = options
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: can't set attribute
>>> options = {'skipLeadingRows': 1}
>>> external_config.options = options
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: can't set attribute
>>> options = {'CSVOptions': {'skip_leading_rows': 1}}
>>> external_config.options = options
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: can't set attribute
>>> options = {'CSVOptions': {'skipLeadingRows': 1}}
>>> external_config.options = options
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: can't set attribute
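The workaround is to loop through the options dict and use the __setattr__() method on the options object, which worked well for me. Pick your favorite approach from the ones above. I have tested all of this code and will be using it for some time.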
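Here is roughly what that loop looks like. To keep the example runnable without BigQuery credentials, it uses two hypothetical stand-in classes (`StandInExternalConfig` and `StandInCSVOptions`) that mimic the behavior described above: an `options` property with no setter that returns a mutable options object. The helper name `apply_options` is also my own. It calls the built-in `setattr()`, which does the same job as calling `__setattr__()` directly. With the real library you would pass `external_config.options` as the first argument, and the dict keys must be the Python attribute names, not the API names.

```python
class StandInCSVOptions:
    """Hypothetical stand-in for the source-specific options object."""

    skip_leading_rows = None
    field_delimiter = None


class StandInExternalConfig:
    """Hypothetical stand-in whose `options` property has no setter."""

    def __init__(self):
        self._options = StandInCSVOptions()

    @property
    def options(self):
        return self._options


def apply_options(options_obj, values):
    """Set each key of `values` as an attribute on `options_obj`."""
    for name, value in values.items():
        setattr(options_obj, name, value)


config = StandInExternalConfig()

# Assigning a dict to the property itself fails, as in the tracebacks above.
try:
    config.options = {"skip_leading_rows": 1}
except AttributeError as exc:
    print("assignment rejected:", type(exc).__name__)

# Setting attributes on the object the property returns works.
apply_options(config.options, {"skip_leading_rows": 1, "field_delimiter": ";"})
print(config.options.skip_leading_rows, config.options.field_delimiter)
```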