Title: Glue table creation using Boto3
Posted: 2021-10-28 10:45:24
Question:

We create a Glue table using the boto3 `create_table` method. The table is created successfully, and we can run `msck repair` from Hive or via the Athena boto3 API.

The problem is that the data columns are not populated in Athena; only the partition column shows values there. In Hive, all columns are populated.

Code used to create the table with boto3:

response = glue_client.create_table(
        DatabaseName='avro_database',
        TableInput={
            "Name": "avro_table_name",
            "Description": "Table created with boto3 API",
            "StorageDescriptor": {
                "Location": "s3://bucket_name/api/avro_folder",
                "InputFormat": "org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat",
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.avro.AvroSerDe",
                    "Parameters": { 
                        "DeserializationLibrary": "org.apache.hadoop.hive.serde2.avro.AvroSerDe",
                    },
                },
            },
            "PartitionKeys": [
                {
                    "Name": "insert_yyyymmdd",
                    "Type": "string",
                }
            ],
            "TableType": "EXTERNAL_TABLE",
            "Parameters": {
                "avro.schema.url": "s3://bucket/schema/L1/api/schema_avro.avsc",
                'transient_lastDdlTime': '1635259605'
                
            }
        },
    )

After the table is created, we can also query its definition in Athena.

DDL shown in Athena:

CREATE EXTERNAL TABLE avro_table(
  `error_error_error_error_error_error_error` string COMMENT 'from deserializer', 
  `cannot_determine_schema` string COMMENT 'from deserializer', 
  `check` string COMMENT 'from deserializer', 
  `schema` string COMMENT 'from deserializer', 
  `url` string COMMENT 'from deserializer', 
  `and` string COMMENT 'from deserializer', 
  `literal` string COMMENT 'from deserializer')
PARTITIONED BY ( 
  `insert_yyyymmdd` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
WITH SERDEPROPERTIES ( 
  'DeserializationLibrary'='org.apache.hadoop.hive.serde2.avro.AvroSerDe') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  's3://bucket_name/api/avro_folder'
TBLPROPERTIES (
  'avro.schema.url"= "s3://bucket/schema/L1/api/schema_avro.avsc',
   'transient_lastDdlTime'= '1635259605')

When I query in Athena:

select * from "avro_database"."avro_table"

only the partition column (insert_yyyymmdd) is populated.

Comments:

  • What is the result of msck repair table? You need hive-style partitions for that to work, i.e. key=value folders in your S3 location. Otherwise you need to use alter table add partition.
  • @Eman, msck repair works fine, and I do have hive-style partitions, i.e. start_yyyyddmm=date in the S3 location.
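As the comment above notes, MSCK REPAIR TABLE only discovers partitions laid out as key=value folders under the table location. A minimal sketch of building such an S3 key (hypothetical helper; the prefix and date below are placeholders matching the question's layout):

```python
# Hive-style partition layout that MSCK REPAIR TABLE can discover.
def partition_key(prefix: str, insert_yyyymmdd: str, filename: str) -> str:
    """Build an S3 key of the form <prefix>/insert_yyyymmdd=<date>/<file>."""
    return f"{prefix.rstrip('/')}/insert_yyyymmdd={insert_yyyymmdd}/{filename}"

key = partition_key("api/avro_folder", "20211028", "part-0000.avro")
print(key)  # api/avro_folder/insert_yyyymmdd=20211028/part-0000.avro
```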

Tags: python hive boto3 aws-glue amazon-athena


Solution 1:

Test case:

Schema:

{"namespace": "example.avro",
 "type": "record",
 "name": "avro_table",
 "fields": [
     {"name": "error_error_error_error_error_error_error", "type": "string"},
     {"name": "cannot_determine_schema",  "type": ["string", "null"]},
     {"name": "check", "type": ["string", "null"]},
     {"name": "schema", "type": ["string", "null"]},
     {"name": "url", "type": ["string", "null"]},
     {"name": "and", "type": ["string", "null"]},
     {"name": "literal", "type": ["string", "null"]}
 ]
}

Python test code:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

schema = avro.schema.parse(open("schema_avro.avsc", "rb").read())

writer = DataFileWriter(open("avro_table.avro", "wb"), DatumWriter(), schema)
writer.append({"error_error_error_error_error_error_error": "META confused", "cannot_determine_schema": "256", "check":"","schema": "meta verse","url":"http://aws.amazon.com", "and":"test", "literal": "chero"})
writer.append({"error_error_error_error_error_error_error": "Meta who", "cannot_determine_schema": "256", "check":"","schema": "meta verse","url":"http://aws.amazon.com", "and":"test", "literal": "chero"})
writer.append({"error_error_error_error_error_error_error": "META dead", "cannot_determine_schema": "256", "check":"","schema": "meta verse","url":"http://aws.amazon.com", "and":"test", "literal": "chero"})
writer.close()

Athena table (or using a crawler) — create the table as follows:

CREATE EXTERNAL TABLE `avro_avro_data`(
  `error_error_error_error_error_error_error` string COMMENT 'from deserializer', 
  `cannot_determine_schema` string COMMENT 'from deserializer', 
  `check` string COMMENT 'from deserializer', 
  `schema` string COMMENT 'from deserializer', 
  `url` string COMMENT 'from deserializer', 
  `and` string COMMENT 'from deserializer', 
  `literal` string COMMENT 'from deserializer')
PARTITIONED BY ( 
  `insert_yyyymmdd` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
WITH SERDEPROPERTIES ( 
  'avro.schema.literal'='{\"type\":\"record\",\"name\":\"avro_table\",\"namespace\":\"example.avro\",\"fields\":[{\"name\":\"error_error_error_error_error_error_error\",\"type\":\"string\"},{\"name\":\"cannot_determine_schema\",\"type\":[\"string\",\"null\"]},{\"name\":\"check\",\"type\":[\"string\",\"null\"]},{\"name\":\"schema\",\"type\":[\"string\",\"null\"]},{\"name\":\"url\",\"type\":[\"string\",\"null\"]},{\"name\":\"and\",\"type\":[\"string\",\"null\"]},{\"name\":\"literal\",\"type\":[\"string\",\"null\"]}]}') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION
  's3://yourbucket/avro-data/'
TBLPROPERTIES (
  'avro.schema.literal'='{\"type\":\"record\",\"name\":\"avro_table\",\"namespace\":\"example.avro\",\"fields\":[{\"name\":\"error_error_error_error_error_error_error\",\"type\":\"string\"},{\"name\":\"cannot_determine_schema\",\"type\":[\"string\",\"null\"]},{\"name\":\"check\",\"type\":[\"string\",\"null\"]},{\"name\":\"schema\",\"type\":[\"string\",\"null\"]},{\"name\":\"url\",\"type\":[\"string\",\"null\"]},{\"name\":\"and\",\"type\":[\"string\",\"null\"]},{\"name\":\"literal\",\"type\":[\"string\",\"null\"]}]}', 
  'classification'='avro', 
  'compressionType'='none')

The table you created raises the following exception in Athena: table property 'avro.schema.url"= "s3://yourbucket/avro-schema/schema_avro.avsc' is not supported.
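Since Athena rejects the `avro.schema.url` table property, one common workaround is to embed the schema inline as `avro.schema.literal`, as the DDL above does. A sketch of the corresponding boto3 `TableInput` (bucket, database, and table names are placeholders taken from the question; the `create_table` call is left commented out because it needs live AWS credentials):

```python
import json

# The Avro schema from the test case above, embedded inline so it can be
# passed as 'avro.schema.literal' (Athena does not honour 'avro.schema.url').
schema = {
    "namespace": "example.avro",
    "type": "record",
    "name": "avro_table",
    "fields": [
        {"name": "error_error_error_error_error_error_error", "type": "string"},
        {"name": "cannot_determine_schema", "type": ["string", "null"]},
        {"name": "check", "type": ["string", "null"]},
        {"name": "schema", "type": ["string", "null"]},
        {"name": "url", "type": ["string", "null"]},
        {"name": "and", "type": ["string", "null"]},
        {"name": "literal", "type": ["string", "null"]},
    ],
}

table_input = {
    "Name": "avro_table_name",
    "StorageDescriptor": {
        "Location": "s3://bucket_name/api/avro_folder/",
        "InputFormat": "org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat",
        "OutputFormat": "org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat",
        "SerdeInfo": {
            "SerializationLibrary": "org.apache.hadoop.hive.serde2.avro.AvroSerDe",
            # Inline schema instead of 'avro.schema.url' / 'DeserializationLibrary'.
            "Parameters": {"avro.schema.literal": json.dumps(schema)},
        },
    },
    "PartitionKeys": [{"Name": "insert_yyyymmdd", "Type": "string"}],
    "TableType": "EXTERNAL_TABLE",
    "Parameters": {
        "avro.schema.literal": json.dumps(schema),
        "classification": "avro",
    },
}

# With live credentials this would create the table:
# import boto3
# glue_client = boto3.client("glue")
# glue_client.create_table(DatabaseName="avro_database", TableInput=table_input)
```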

Comments:

Solution 2:

I think a "/" is missing after avro_folder in "Location": "s3://bucket_name/api/avro_folder"
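To rule this out programmatically, a one-line normalizer (hypothetical helper) can guarantee the location always carries the trailing slash:

```python
def with_trailing_slash(location: str) -> str:
    """Ensure an S3 location ends with '/', as some Hive/Athena tooling expects."""
    return location if location.endswith("/") else location + "/"

print(with_trailing_slash("s3://bucket_name/api/avro_folder"))
# s3://bucket_name/api/avro_folder/
```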

Comments: