【Question Title】: Unable to query BigQuery external table on partitioned data with Spark
【Posted】: 2020-08-18 00:01:43
【Question Description】:

I am trying to create an external table over partitioned data in GCS that was written by a Spark job, partitioned by date, in PARQUET format.

The data is laid out in the GCS bucket as shown in the attached screenshot.

I created the external table with the following table definition:

{
  "hivePartitioningOptions": {
    "mode": "AUTO",
    "sourceUriPrefix": "gs://transaction_data_bucket_for_bigquery/trx_data"
  },
  "sourceFormat": "PARQUET",
  "sourceUris": [
    "gs://transaction_data_bucket_for_bigquery/trx_data/*"
  ]
}

using the command:

bq mk --external_table_definition=/tmp/table_def <project>:<dataset>.sample_trx_external
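For reference, the definition file above can be built and sanity-checked programmatically before calling bq mk; a minimal Python sketch (file path and URIs taken from the question):

```python
import json

# Table definition matching the question's AUTO-mode setup.
table_def = {
    "hivePartitioningOptions": {
        "mode": "AUTO",
        "sourceUriPrefix": "gs://transaction_data_bucket_for_bigquery/trx_data",
    },
    "sourceFormat": "PARQUET",
    "sourceUris": ["gs://transaction_data_bucket_for_bigquery/trx_data/*"],
}

# Write it where the bq command above expects to find it.
with open("/tmp/table_def", "w") as f:
    json.dump(table_def, f, indent=2)
```

Round-tripping the file through json.load is a cheap way to catch the kind of truncation or syntax error that is easy to introduce when editing the definition by hand.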

When I try to query the table, I get a puzzling error:

Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 0 partition keys ([]), but 1 ([transaction_date]) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data/transaction_date=2016-01-01.; Cannot add hive partitioning to table <data_set>.sample_trx_external -- table creation from underlying uri failed.. Underlying error: Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 0 partition keys ([]), but 1 ([transaction_date]) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data/transaction_date=2016-01-01..

Unfortunately, I cannot make sense of this message. Only a single day of transactions has been written to the GCS bucket.

When I try a CUSTOM mode definition instead,

{
  "hivePartitioningOptions": {
    "mode": "CUSTOM",
    "sourceUriPrefix": "gs://transaction_data_bucket_for_bigquery/trx_data/{transaction_date:DATE}"
  },
  "sourceFormat": "PARQUET",
  "sourceUris": [
    "gs://transaction_data_bucket_for_bigquery/trx_data/*"
  ]
}

I get a slightly different error:

Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 1 partition keys ([transaction_date]), but 0 ([) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data.; Cannot add hive partitioning to table <data_Set>.sample_trx_external_2 -- table creation from underlying uri failed.. Underlying error: Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 1 partition keys ([transaction_date]), but 0 ([) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data..

I am stuck here; any suggestions would be a great help.

【Comments】:

  • In which locations are your BigQuery dataset and GCS bucket? Have you checked the limitations?
  • Yes, I checked the limitations. The data layout follows the supported Hive format structure. As for location: initially, when the data was kept in an EU-region bucket, queries failed with an error about being unable to access the data, but after I created a multi-region EU bucket I no longer saw that error. The BigQuery dataset is in the same region.
  • Solved it. _SUCCESS was the culprit. Spark writes this extra file alongside the generated parquet files, and the external table definition was picking it up as well.

Tags: google-cloud-platform google-bigquery partitioning


【Solution 1】:

As shown in the screenshot attached to the question, there is a _SUCCESS file written by the Spark job that created the partitioned dataset. The problem is that under the path "gs://transaction_data_bucket_for_bigquery/trx_data/", the BigQuery external table expects every directory and file to follow the partition layout. _SUCCESS clearly does not follow that layout, which leads to the error messages above.
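The mismatch is easy to reproduce. Hive-style partition keys are inferred from `key=value` path segments under the source URI prefix; a file like _SUCCESS sitting directly under the prefix contributes zero keys, so the key count is no longer invariant across paths. A simplified sketch of that inference (not BigQuery's actual implementation; the file names are illustrative):

```python
def partition_keys(path, prefix):
    """Infer hive partition key names from the path segments below prefix."""
    rel = path[len(prefix):].strip("/")
    # Every directory segment of the form key=value names a partition key;
    # the final segment is the file itself and is ignored.
    return [seg.split("=", 1)[0] for seg in rel.split("/")[:-1] if "=" in seg]

prefix = "gs://transaction_data_bucket_for_bigquery/trx_data"
data_file = prefix + "/transaction_date=2016-01-01/part-00000.parquet"
marker = prefix + "/_SUCCESS"

print(partition_keys(data_file, prefix))  # ['transaction_date']
print(partition_keys(marker, prefix))     # []
```

This is exactly the "Expected 1 partition keys ([transaction_date]), but 0" shape of the error: one path under the prefix yields a key, the other yields none.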

Fix: I simply deleted the file and everything worked fine.
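Before creating the table, stray files like this can be spotted mechanically: anything under the prefix that does not sit below one or more `key=value` directories breaks the layout. A hedged sketch (in practice the listing would come from a GCS client or gsutil ls; the file names here are illustrative):

```python
def stray_files(paths, prefix):
    """Return paths under prefix that break the hive partition layout."""
    strays = []
    for path in paths:
        rel = path[len(prefix):].strip("/")
        dirs = rel.split("/")[:-1]
        # A well-formed partitioned file sits under one or more key=value dirs.
        if not dirs or not all("=" in d for d in dirs):
            strays.append(path)
    return strays

prefix = "gs://transaction_data_bucket_for_bigquery/trx_data"
listing = [
    prefix + "/transaction_date=2016-01-01/part-00000.parquet",
    prefix + "/_SUCCESS",  # Spark's job-completion marker
]
print(stray_files(listing, prefix))  # only the _SUCCESS path
```

Any path this returns is a candidate for deletion (or for moving outside the source URI prefix) before running bq mk.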

【Discussion】:
