Invalid Schema error in AWS Glue created via Terraform: answers

【Question title】: Invalid Schema error in AWS Glue created via Terraform
【Posted】: 2021-09-08 12:34:23
【Question】:

I have a Kinesis Firehose configuration in Terraform that reads JSON data from a Kinesis stream, converts it to Parquet using Glue, and writes it to S3. The data format conversion fails with the following error (some details removed):

{"attemptsMade":1,"arrivalTimestamp":1624541721545,"lastErrorCode":"DataFormatConversion.InvalidSchema","lastErrorMessage":"The schema is invalid. The specified table has no columns.","attemptEndingTimestamp":1624542026951,"rawData":"xx","sequenceNumber":"xx","subSequenceNumber":null,"dataCatalogTable":{"catalogId":null,"databaseName":"db_name","tableName":"table_name","region":null,"versionId":"LATEST","roleArn":"xx"}}

The Terraform configuration for the Glue table I am using is as follows:

resource "aws_glue_catalog_table" "stream_format_conversion_table" {
  name          = "${var.resource_prefix}-parquet-conversion-table"
  database_name = aws_glue_catalog_database.stream_format_conversion_db.name

  table_type = "EXTERNAL_TABLE"

  parameters = {
    EXTERNAL              = "TRUE"
    "parquet.compression" = "SNAPPY"
  }

  storage_descriptor {
    location      = "s3://${element(split(":", var.bucket_arn), 5)}/"
    input_format  = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
    output_format = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"

    ser_de_info {
      name                  = "my-stream"
      serialization_library = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"

      parameters = {
        "serialization.format" = 1
      }
    }
    columns {
      name = "metadata"
      type = "struct<tenantId:string,env:string,eventType:string,eventTimeStamp:timestamp>"   
    }
    columns {
      name = "eventpayload"
      type = "struct<operation:string,timestamp:timestamp,user_name:string,user_id:int,user_email:string,batch_id:string,initiator_id:string,initiator_email:string,payload:string>"         
    }
  }
}

What needs to change here?

【Comments】:

Tags: amazon-web-services terraform aws-glue amazon-kinesis amazon-kinesis-firehose

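【Solution 1】:

I ran into "The schema is invalid. The specified table has no columns" with the following combination:

an Avro schema in the Glue Schema Registry,
a Glue table created through the console using "Add table from existing schema",
a Kinesis Data Firehose (KDF) delivery stream configured with Parquet conversion and referencing the Glue table created from the Schema Registry.

It turns out that KDF cannot read the table's schema if the table was created from an existing schema. The table has to be created from scratch (as opposed to "Add table from existing schema"). This is not documented... for now.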
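For the Terraform case, here is a minimal sketch of the Firehose side. It assumes the Glue table declares its columns inline (as the table in the question does) rather than through a `schema_reference` block, which appears to be the Terraform counterpart of a table backed by the Schema Registry. The resource name `example`, the stream name, `var.kinesis_stream_arn` and `aws_iam_role.firehose_role` are hypothetical and do not come from the question. `buffering_size` is the attribute name in AWS provider 5.x; earlier versions call it `buffer_size`.

```hcl
resource "aws_kinesis_firehose_delivery_stream" "example" {
  name        = "${var.resource_prefix}-parquet-stream" # hypothetical name
  destination = "extended_s3"

  # Source: the Kinesis stream that carries the JSON records.
  kinesis_source_configuration {
    kinesis_stream_arn = var.kinesis_stream_arn         # hypothetical variable
    role_arn           = aws_iam_role.firehose_role.arn # hypothetical role
  }

  extended_s3_configuration {
    role_arn   = aws_iam_role.firehose_role.arn
    bucket_arn = var.bucket_arn

    # Firehose expects a buffer of at least 64 MB when format conversion is enabled.
    buffering_size = 64

    data_format_conversion_configuration {
      # Incoming records are JSON.
      input_format_configuration {
        deserializer {
          open_x_json_ser_de {}
        }
      }

      # Outgoing objects are Parquet.
      output_format_configuration {
        serializer {
          parquet_ser_de {}
        }
      }

      # Firehose reads the column definitions from this Glue table,
      # so the table itself must list its columns.
      schema_configuration {
        database_name = aws_glue_catalog_table.stream_format_conversion_table.database_name
        table_name    = aws_glue_catalog_table.stream_format_conversion_table.name
        role_arn      = aws_iam_role.firehose_role.arn
      }
    }
  }
}
```

Referencing the table resource instead of hard-coding names keeps the delivery stream pointed at the table Terraform actually manages.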

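【Comments】:

Thanks, it works now :-)

【Solution 2】:

In addition to mberchon's answer, I found that the default generated policy for the Kinesis delivery stream does not include the IAM permissions needed to actually read the schema.

I had to manually modify the IAM policy to include glue:GetSchema and glue:GetSchemaVersion.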
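In Terraform, that policy change might look like the sketch below. It assumes the delivery stream assumes a role defined as `aws_iam_role.firehose_role`; that role and the policy name are hypothetical. The `glue:GetTable*` actions are the ones Firehose normally needs for format conversion, and the last two are the additions described above.

```hcl
# Permissions for Firehose to read the Glue table and its Schema Registry schema.
data "aws_iam_policy_document" "firehose_glue_schema_access" {
  statement {
    effect = "Allow"

    actions = [
      "glue:GetTable",
      "glue:GetTableVersion",
      "glue:GetTableVersions",
      "glue:GetSchema",        # added manually, per this answer
      "glue:GetSchemaVersion", # added manually, per this answer
    ]

    # Wildcard for brevity only; narrow this to the catalog, database,
    # table, registry and schema ARNs in a real deployment.
    resources = ["*"]
  }
}

resource "aws_iam_role_policy" "firehose_glue_schema_access" {
  name   = "firehose-glue-schema-access" # hypothetical policy name
  role   = aws_iam_role.firehose_role.id # hypothetical role
  policy = data.aws_iam_policy_document.firehose_glue_schema_access.json
}
```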

【Comments】: