【Question Title】: Unable to query BigQuery external table on partitioned data with Spark
【Posted】: 2020-08-18 00:01:43
【Question Description】:

I am trying to create an external table over partitioned data in GCS that was written by a Spark job, partitioned by date, in PARQUET format.

The data is laid out in the GCS bucket as shown in the attached screenshot.

I created the external table with the following table definition:

{
  "hivePartitioningOptions": {
    "mode": "AUTO",
    "sourceUriPrefix": "gs://transaction_data_bucket_for_bigquery/trx_data"
  },
  "sourceFormat": "PARQUET",
  "sourceUris": [
    "gs://transaction_data_bucket_for_bigquery/trx_data/*"
  ]
}

using the command:

bq mk --external_table_definition=/tmp/table_def <project>:<dataset>.sample_trx_external
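For reference, the definition file above can be built and sanity-checked programmatically before calling bq mk; a minimal Python sketch (file path and URIs taken from the question):

```python
import json

# Table definition matching the question's AUTO-mode setup.
table_def = {
    "hivePartitioningOptions": {
        "mode": "AUTO",
        "sourceUriPrefix": "gs://transaction_data_bucket_for_bigquery/trx_data",
    },
    "sourceFormat": "PARQUET",
    "sourceUris": ["gs://transaction_data_bucket_for_bigquery/trx_data/*"],
}

# Write it where the bq command above expects to find it.
with open("/tmp/table_def", "w") as f:
    json.dump(table_def, f, indent=2)
```

Round-tripping the file through json.load is a cheap way to catch the kind of truncation or syntax error that is easy to introduce when editing the definition by hand.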

When I try to query the table, I get a puzzling error:

Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 0 partition keys ([]), but 1 ([transaction_date]) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data/transaction_date=2016-01-01.; Cannot add hive partitioning to table <data_set>.sample_trx_external -- table creation from underlying uri failed.. Underlying error: Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 0 partition keys ([]), but 1 ([transaction_date]) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data/transaction_date=2016-01-01..

Unfortunately, I cannot make sense of this message. Only a single day of transactions has been written to the GCS bucket.

When I try a CUSTOM mode definition instead,

{
  "hivePartitioningOptions": {
    "mode": "CUSTOM",
    "sourceUriPrefix": "gs://transaction_data_bucket_for_bigquery/trx_data/{transaction_date:DATE}"
  },
  "sourceFormat": "PARQUET",
  "sourceUris": [
    "gs://transaction_data_bucket_for_bigquery/trx_data/*"
  ]
}

I get a slightly different error:

Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 1 partition keys ([transaction_date]), but 0 ([) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data.; Cannot add hive partitioning to table <data_Set>.sample_trx_external_2 -- table creation from underlying uri failed.. Underlying error: Partition keys should be invariant from table creation across all partitions, with the number of partition keys held constant with invariant names. Expected 1 partition keys ([transaction_date]), but 0 ([) were encountered along path /bigstore/transaction_data_bucket_for_bigquery/trx_data..

I am stuck here; any suggestions would be a great help.

【Comments】:

  • In which locations are your BigQuery dataset and GCS bucket? Have you checked the limitations?
  • Yes, I checked the limitations. The data layout follows the supported Hive format structure. As for location: initially, when the data was kept in an EU-region bucket, queries failed with an error about being unable to access the data, but after I created a multi-region EU bucket I no longer saw that error. The BigQuery dataset is in the same region.
  • Solved it. _SUCCESS was the culprit. Spark writes this extra file alongside the generated parquet files, and the external table definition was picking it up as well.

Tags: google-cloud-platform google-bigquery partitioning


【Solution 1】:

As shown in the screenshot attached to the question, there is a _SUCCESS file written by the Spark job that created the partitioned dataset. The problem is that under the path "gs://transaction_data_bucket_for_bigquery/trx_data/", the BigQuery external table expects every directory and file to follow the partition layout. _SUCCESS clearly does not follow that layout, which leads to the error messages above.
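The mismatch is easy to reproduce. Hive-style partition keys are inferred from `key=value` path segments under the source URI prefix; a file like _SUCCESS sitting directly under the prefix contributes zero keys, so the key count is no longer invariant across paths. A simplified sketch of that inference (not BigQuery's actual implementation; the file names are illustrative):

```python
def partition_keys(path, prefix):
    """Infer hive partition key names from the path segments below prefix."""
    rel = path[len(prefix):].strip("/")
    # Every directory segment of the form key=value names a partition key;
    # the final segment is the file itself and is ignored.
    return [seg.split("=", 1)[0] for seg in rel.split("/")[:-1] if "=" in seg]

prefix = "gs://transaction_data_bucket_for_bigquery/trx_data"
data_file = prefix + "/transaction_date=2016-01-01/part-00000.parquet"
marker = prefix + "/_SUCCESS"

print(partition_keys(data_file, prefix))  # ['transaction_date']
print(partition_keys(marker, prefix))     # []
```

This is exactly the "Expected 1 partition keys ([transaction_date]), but 0" shape of the error: one path under the prefix yields a key, the other yields none.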

Fix: I simply deleted the file and everything worked fine.
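Before creating the table, stray files like this can be spotted mechanically: anything under the prefix that does not sit below one or more `key=value` directories breaks the layout. A hedged sketch (in practice the listing would come from a GCS client or gsutil ls; the file names here are illustrative):

```python
def stray_files(paths, prefix):
    """Return paths under prefix that break the hive partition layout."""
    strays = []
    for path in paths:
        rel = path[len(prefix):].strip("/")
        dirs = rel.split("/")[:-1]
        # A well-formed partitioned file sits under one or more key=value dirs.
        if not dirs or not all("=" in d for d in dirs):
            strays.append(path)
    return strays

prefix = "gs://transaction_data_bucket_for_bigquery/trx_data"
listing = [
    prefix + "/transaction_date=2016-01-01/part-00000.parquet",
    prefix + "/_SUCCESS",  # Spark's job-completion marker
]
print(stray_files(listing, prefix))  # only the _SUCCESS path
```

Any path this returns is a candidate for deletion (or for moving outside the source URI prefix) before running bq mk.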

【Discussion】:
