AWS Glue Custom Classifiers Json Path: Answers

【Question title】: AWS Glue Custom Classifiers Json Path
【Posted】: 2018-08-28 22:34:51
【Question description】:

I have a set of JSON data files that look like this:

[
  {"client":"toys",
   "filename":"toy1.csv",
   "file_row_number":1,
   "secondary_db_index":"4050",
   "processed_timestamp":1535004075,
   "processed_datetime":"2018-08-23T06:01:15+0000",
   "entity_id":"4050",
   "entity_name":"4050",
   "is_emailable":false,
   "is_txtable":false,
   "is_loadable":false}
]

I created a Glue Crawler with a custom classifier that uses the following JSON path:

$[*]

Glue returns the correct schema, with all columns correctly identified.
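
To illustrate what the `$[*]` path selects, here is a minimal Python sketch. It only mimics the wildcard selection on the document root and is not Glue's actual implementation; the helper name `select_all_elements` and the shortened sample document are assumptions made for illustration.

```python
import json

# Shortened sample: a top-level array of records, as in the files above.
document = '[{"client":"toys","file_row_number":1},{"client":"toys2","file_row_number":1}]'

def select_all_elements(text):
    """Mimic JSONPath `$[*]`: return every child of the document root."""
    root = json.loads(text)          # `$` refers to the document root
    if isinstance(root, list):
        return list(root)            # each array element is one record
    if isinstance(root, dict):
        return list(root.values())   # for an object, the wildcard selects its values
    return []                        # scalars have no children

records = select_all_elements(document)
# The union of the record keys corresponds to the columns a schema would list.
columns = sorted({key for record in records for key in record})
```

Each selected element is one record, which is why the crawler can infer one column per key even though the file as a whole is a single array.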

However, when I query the data in Athena, all of the data ends up in the first column and the remaining columns are empty.

How can I get the data distributed into the proper columns?

[Image: Athena query result]

Thanks!

【Comments】:

I ran into exactly the same problem.
Radu, converting the data to Parquet solves the problem.

Tags: jsonpath amazon-athena aws-glue

【Solution 1】:

This is a Hive-related issue. I suggest two approaches. First, you can create a new table in Athena that uses a struct data type, like this:

CREATE EXTERNAL TABLE `example`(
`row` struct<client:string,filename:string,file_row_number:int,secondary_db_index:string,processed_timestamp:int,processed_datetime:string,entity_id:string,entity_name:string,is_emailable:boolean,is_txtable:boolean,is_loadable:boolean> COMMENT 'from deserializer')
ROW FORMAT SERDE 
'org.openx.data.jsonserde.JsonSerDe' 
STORED AS INPUTFORMAT 
'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://example'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0', 
'CrawlerSchemaSerializerVersion'='1.0', 
'UPDATED_BY_CRAWLER'='example', 
'averageRecordSize'='271', 
'classification'='json', 
'compressionType'='none', 
'jsonPath'='$[*]', 
'objectCount'='1', 
'recordCount'='1', 
'sizeKey'='271', 
'transient_lastDdlTime'='1535533583', 
'typeOfData'='file')

You can then run a query like this:

SELECT row.client, row.filename, row.file_row_number FROM "example"

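Second, you can restructure your JSON files as shown below and then run the crawler again. This example uses the single-JSON-record-per-line format, in which each record is a complete JSON object on its own line.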

{"client":"toys","filename":"toy1.csv","file_row_number":1,"secondary_db_index":"4050","processed_timestamp":1535004075,"processed_datetime":"2018-08-23T06:01:15+0000","entity_id":"4050","entity_name":"4050","is_emailable":false,"is_txtable":false,"is_loadable":false}
{"client":"toys2","filename":"toy2.csv","file_row_number":1,"secondary_db_index":"4050","processed_timestamp":1535004075,"processed_datetime":"2018-08-23T06:01:15+0000","entity_id":"4050","entity_name":"4050","is_emailable":false,"is_txtable":false,"is_loadable":false}
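
Existing array-style files can be rewritten into this layout with a small script. The following is a minimal sketch, assuming each source file fits in memory and holds a single top-level JSON array; the function name `array_to_json_lines` and the shortened sample are hypothetical and only for illustration.

```python
import json

def array_to_json_lines(text):
    """Turn a JSON document holding a top-level array into one compact JSON object per line."""
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array")
    # Compact separators keep every record on a single line without extra spaces.
    return "".join(json.dumps(record, separators=(",", ":")) + "\n" for record in records)

# Shortened sample in the original array layout.
sample = '[{"client":"toys","filename":"toy1.csv","file_row_number":1,"is_emailable":false}]'
converted = array_to_json_lines(sample)
```

The converted text can be written back to the S3 location that the crawler scans; reading and writing the files (for example with `open()` or an S3 client) is left out of the sketch.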

【Discussion】: