[Posted]: 2020-08-18 00:01:43
[Problem description]:
I am trying to create an external table over partitioned data in GCS, written by a Spark job and partitioned by date in PARQUET format.
The data in the GCS bucket is laid out as shown in the attached image.
I created the external table with the following table definition:
{
  "hivePartitioningOptions": {
    "mode": "AUTO",
    "sourceUriPrefix": "gs://transaction_data_bucket_for_bigquery/trx_data"
  },
  "sourceFormat": "PARQUET",
  "sourceUris": [
    "gs://transaction_data_bucket_for_bigquery/trx_data/*"
  ]
}
using the command:
bq mk --external_table_definition=/tmp/table_def <project>:<dataset>.sample_trx_external
When I try to query the table, I get a strange error:
Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 0 partition keys ([]), but 1 ([transaction_date]) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data/transaction_date=2016-01-01.; Cannot add hive partitioning to table <data_set>.sample_trx_external -- table creation from underlying uri failed.. Underlying error: Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 0 partition keys ([]), but 1 ([transaction_date]) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data/transaction_date=2016-01-01..
Unfortunately, I can't decipher this message. Only a single day of transactions has been written to the GCS bucket.
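One way to sanity-check what AUTO partition detection will see is to list the prefix directly (a sketch; the bucket path is taken from the question):

```shell
# Every entry directly under the prefix should be a partition directory
# (transaction_date=YYYY-MM-DD/). Any loose file sitting at this level is
# matched by the sourceUris wildcard too, and carries 0 partition keys,
# which conflicts with the 1-key paths below it.
gsutil ls gs://transaction_data_bucket_for_bigquery/trx_data/
```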
When I tried a CUSTOM mode definition instead:
{
  "hivePartitioningOptions": {
    "mode": "CUSTOM",
    "sourceUriPrefix": "gs://transaction_data_bucket_for_bigquery/trx_data/{transaction_date:DATE}"
  },
  "sourceFormat": "PARQUET",
  "sourceUris": [
    "gs://transaction_data_bucket_for_bigquery/trx_data/*"
  ]
}
I got a slightly different error:
Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 1 partition keys ([transaction_date]), but 0 ([) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data.; Cannot add hive partitioning to table <data_Set>.sample_trx_external_2 -- table creation from underlying uri failed.. Underlying error: Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 1 partition keys ([transaction_date]), but 0 ([) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data..
I'm stuck here; any suggestions would be a great help.
[Comments]:
-
In which locations are your BigQuery dataset and GCS bucket placed? Have you checked the limitations?
-
Yes, I checked the limitations. The data follows the supported Hive partitioning layout. As for location: initially, when the data was in a regional EU bucket, it failed with an error saying the data could not be accessed, but after I created a multi-region EU bucket I no longer saw that error. The BigQuery dataset is in the same region.
-
Solved it. The _SUCCESS file was the culprit: Spark writes this extra marker file alongside the Parquet output, and the external table definition was picking it up as well.
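For anyone hitting the same thing, two possible remedies, sketched with the bucket path from the question (the job script name is a hypothetical placeholder; `spark.hadoop.*` is the standard way to pass a Hadoop property through `spark-submit`):

```shell
# One-time cleanup: delete the marker file that the sourceUris wildcard
# was matching at the prefix root
gsutil rm gs://transaction_data_bucket_for_bigquery/trx_data/_SUCCESS

# Prevent future runs from writing _SUCCESS at all: disable the Hadoop
# output committer's success marker when submitting the Spark job
spark-submit \
  --conf spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs=false \
  trx_job.py  # hypothetical job script name
```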
Tags: google-cloud-platform google-bigquery partitioning