[Question title]: AWS Athena struct not parsing JSON string
[Posted]: 2021-03-23 21:57:42
[Question]:

I am using AWS Athena to run some queries on AWS CloudTrail data object log entries.

The first few fields of a typical log entry look like this (pretty-printed for clarity):

{
  "Records": [
    {
      "eventVersion": "1.08",
      "userIdentity": {
        "type": "AWSAccount",
        "principalId": "",
        "accountId": "ANONYMOUS_PRINCIPAL"
      },
      "eventTime": "2021-03-23T14:04:38Z",
      "eventSource": "s3.amazonaws.com",
      "eventName": "GetObject",
      "awsRegion": "us-east-1",
      "sourceIPAddress": "12.34.45.56",
      "userAgent": "[Amazon CloudFront]",
      "requestParameters": {
        "bucketName": "mybucket",
        "Host": "mybucket.s3.amazonaws.com",
        "key": "bin/some/path/to/a/file"
      },
      "responseElements": null,
...

The AWS CloudTrail console will create a standard table for querying these entries. The table is defined as follows:

CREATE EXTERNAL TABLE `cloudtrail_logs_mybucket_logs`(
  `eventversion` string COMMENT 'from deserializer', 
  `useridentity` struct<type:string,principalid:string,arn:string,accountid:string,invokedby:string,accesskeyid:string,username:string,sessioncontext:struct<attributes:struct<mfaauthenticated:string,creationdate:string>,sessionissuer:struct<type:string,principalid:string,arn:string,accountid:string,username:string>>> COMMENT 'from deserializer', 
  `eventtime` string COMMENT 'from deserializer', 
  `eventsource` string COMMENT 'from deserializer', 
  `eventname` string COMMENT 'from deserializer', 
  `awsregion` string COMMENT 'from deserializer', 
  `sourceipaddress` string COMMENT 'from deserializer', 
  `useragent` string COMMENT 'from deserializer', 
  `errorcode` string COMMENT 'from deserializer', 
  `errormessage` string COMMENT 'from deserializer', 
  `requestparameters` string COMMENT 'from deserializer', 
  `responseelements` string COMMENT 'from deserializer', 
  `additionaleventdata` string COMMENT 'from deserializer', 
  `requestid` string COMMENT 'from deserializer', 
  `eventid` string COMMENT 'from deserializer', 
  `resources` array<struct<arn:string,accountid:string,type:string>> COMMENT 'from deserializer', 
  `eventtype` string COMMENT 'from deserializer', 
  `apiversion` string COMMENT 'from deserializer', 
  `readonly` string COMMENT 'from deserializer', 
  `recipientaccountid` string COMMENT 'from deserializer', 
  `serviceeventdetails` string COMMENT 'from deserializer', 
  `sharedeventid` string COMMENT 'from deserializer', 
  `vpcendpointid` string COMMENT 'from deserializer')
COMMENT 'CloudTrail table for adafruit-circuit-python-logs bucket'
ROW FORMAT SERDE 
  'com.amazon.emr.hive.serde.CloudTrailSerde' 
STORED AS INPUTFORMAT 
  'com.amazon.emr.cloudtrail.CloudTrailInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://mybucket/AWSLogs/12345678901234/CloudTrail'
TBLPROPERTIES (
  'classification'='cloudtrail', 
  'transient_lastDdlTime'='1616514617')

Note that useridentity is declared as a struct, but requestparameters is a string. I would like to use the struct feature to pre-parse requestparameters, so I tried this:

CREATE EXTERNAL TABLE `cloudtrail_logs_mybucket_logs2`(
  `eventversion` string COMMENT 'from deserializer', 
  `useridentity` struct<type:string,principalid:string,arn:string,accountid:string,invokedby:string,accesskeyid:string,username:string,sessioncontext:struct<attributes:struct<mfaauthenticated:string,creationdate:string>,sessionissuer:struct<type:string,principalid:string,arn:string,accountid:string,username:string>>> COMMENT 'from deserializer', 
  `eventtime` string COMMENT 'from deserializer', 
  `eventsource` string COMMENT 'from deserializer', 
  `eventname` string COMMENT 'from deserializer', 
  `awsregion` string COMMENT 'from deserializer', 
  `sourceipaddress` string COMMENT 'from deserializer', 
  `useragent` string COMMENT 'from deserializer', 
  `errorcode` string COMMENT 'from deserializer', 
  `errormessage` string COMMENT 'from deserializer', 
  `requestparameters` struct<`bucketName`:string, `Host`:string, `key`:string> COMMENT 'THIS IS NEW', 
...[rest same as above]

The table is created, but running a simple query against it ("Preview table") gives this error:

GENERIC_INTERNAL_ERROR: parent builder is null

What is wrong with my attempt to use a struct for requestparameters? As far as the JSON goes, it does not look any different from the useridentity case.

[Comments]:

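  • Total guess here, but perhaps the CloudTrail SerDe does not support alternative views of the structs. You could try a CTAS that rewrites the records as JSON, using a JSON SerDe but the same table schema. Then create a new table schema, pointing at those same JSON files, that treats requestParameters as a struct. By the way, if requestparameters comes through as a JSON-encoded string, you can use Presto's JSON functions (e.g. json_extract) to inspect the properties inside it.
  • I ended up using json_extract_scalar to get the fields I needed; it is simple and works fine. The mystery is why the SerDe can parse the useridentity JSON but not the requestparameters JSON.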
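
For reference, the json_extract_scalar approach mentioned in the comment above might look roughly like the query below. This is only a sketch: it assumes the console-generated table from the question (where requestparameters is a plain string column holding JSON), and the column aliases bucket_name and object_key are made up for illustration.

```sql
-- Sketch only: assumes the original console-generated table, in which
-- requestparameters is a JSON-encoded string rather than a struct.
SELECT
  eventtime,
  eventname,
  -- JSON path keys are case-sensitive, so they match the raw log ("bucketName", "key").
  json_extract_scalar(requestparameters, '$.bucketName') AS bucket_name,  -- hypothetical alias
  json_extract_scalar(requestparameters, '$.key')        AS object_key    -- hypothetical alias
FROM cloudtrail_logs_mybucket_logs
WHERE eventsource = 's3.amazonaws.com'
  AND eventname = 'GetObject'
LIMIT 10;
```

json_extract_scalar returns a varchar (or NULL when the path is missing), which is usually enough for filtering and grouping without changing the table definition.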

Tags: amazon-athena amazon-cloudtrail


[Answer 1]:

You should use the JSON serializer/deserializer (SerDe) instead:

...
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ( 'paths'='<LIST OF COLUMNS>' )
...

See the documentation: https://docs.aws.amazon.com/athena/latest/ug/json-serde.html#hive-json-serde
