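【Posted】: 2021-03-23 21:57:42
【Question】:
I am using AWS Athena to run some queries against AWS CloudTrail data-event (object-level) log entries.
The first few fields of a typical log entry look like this (pretty-printed for clarity):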
{
"Records": [
{
"eventVersion": "1.08",
"userIdentity": {
"type": "AWSAccount",
"principalId": "",
"accountId": "ANONYMOUS_PRINCIPAL"
},
"eventTime": "2021-03-23T14:04:38Z",
"eventSource": "s3.amazonaws.com",
"eventName": "GetObject",
"awsRegion": "us-east-1",
"sourceIPAddress": "12.34.45.56",
"userAgent": "[Amazon CloudFront]",
"requestParameters": {
"bucketName": "mybucket",
"Host": "mybucket.s3.amazonaws.com",
"key": "bin/some/path/to/a/file"
},
"responseElements": null,
...
The AWS CloudTrail console will create a standard table for querying these entries. The table is defined as follows:
CREATE EXTERNAL TABLE `cloudtrail_logs_mybucket_logs`(
`eventversion` string COMMENT 'from deserializer',
`useridentity` struct<type:string,principalid:string,arn:string,accountid:string,invokedby:string,accesskeyid:string,username:string,sessioncontext:struct<attributes:struct<mfaauthenticated:string,creationdate:string>,sessionissuer:struct<type:string,principalid:string,arn:string,accountid:string,username:string>>> COMMENT 'from deserializer',
`eventtime` string COMMENT 'from deserializer',
`eventsource` string COMMENT 'from deserializer',
`eventname` string COMMENT 'from deserializer',
`awsregion` string COMMENT 'from deserializer',
`sourceipaddress` string COMMENT 'from deserializer',
`useragent` string COMMENT 'from deserializer',
`errorcode` string COMMENT 'from deserializer',
`errormessage` string COMMENT 'from deserializer',
`requestparameters` string COMMENT 'from deserializer',
`responseelements` string COMMENT 'from deserializer',
`additionaleventdata` string COMMENT 'from deserializer',
`requestid` string COMMENT 'from deserializer',
`eventid` string COMMENT 'from deserializer',
`resources` array<struct<arn:string,accountid:string,type:string>> COMMENT 'from deserializer',
`eventtype` string COMMENT 'from deserializer',
`apiversion` string COMMENT 'from deserializer',
`readonly` string COMMENT 'from deserializer',
`recipientaccountid` string COMMENT 'from deserializer',
`serviceeventdetails` string COMMENT 'from deserializer',
`sharedeventid` string COMMENT 'from deserializer',
`vpcendpointid` string COMMENT 'from deserializer')
COMMENT 'CloudTrail table for adafruit-circuit-python-logs bucket'
ROW FORMAT SERDE
'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT
'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://mybucket/AWSLogs/12345678901234/CloudTrail'
TBLPROPERTIES (
'classification'='cloudtrail',
'transient_lastDdlTime'='1616514617')
Note that useridentity is declared as a struct, but requestparameters is a string. I would like to use the struct feature to pre-parse requestparameters, so I tried this:
CREATE EXTERNAL TABLE `cloudtrail_logs_mybucket_logs2`(
`eventversion` string COMMENT 'from deserializer',
`useridentity` struct<type:string,principalid:string,arn:string,accountid:string,invokedby:string,accesskeyid:string,username:string,sessioncontext:struct<attributes:struct<mfaauthenticated:string,creationdate:string>,sessionissuer:struct<type:string,principalid:string,arn:string,accountid:string,username:string>>> COMMENT 'from deserializer',
`eventtime` string COMMENT 'from deserializer',
`eventsource` string COMMENT 'from deserializer',
`eventname` string COMMENT 'from deserializer',
`awsregion` string COMMENT 'from deserializer',
`sourceipaddress` string COMMENT 'from deserializer',
`useragent` string COMMENT 'from deserializer',
`errorcode` string COMMENT 'from deserializer',
`errormessage` string COMMENT 'from deserializer',
`requestparameters` struct<`bucketName`:string, `Host`:string, `key`:string> COMMENT 'THIS IS NEW',
...[rest same as above]
The table is created, but trying a simple query against it ("Preview table") produces this error:
GENERIC_INTERNAL_ERROR: parent builder is null
What is wrong with my attempt to use a struct for requestparameters? As far as the JSON is concerned, the useridentity case does not seem any different.
【Comments】:
-
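Total guess here, but maybe the CloudTrail SerDe does not support alternative views of the structure. You could try a CTAS that rewrites the records as JSON, using the JSON SerDe but the same table schema. Then create a new table schema, pointing at those same JSON files, that treats requestParameters as a struct. By the way, if requestparameters comes through as a JSON-encoded string, you can use Presto's JSON functions (e.g. json_extract) to inspect the properties inside it.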
-
I ended up using json_extract_scalar to get the fields I needed; it was simple and works fine. The mystery is why the SerDe is able to parse the useridentity JSON but not the requestparameters JSON.
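For illustration, the json_extract_scalar workaround described in this comment might look roughly like the query below. This is only a sketch, assuming the original console-generated table (where requestparameters is a string holding JSON text); the column aliases and the WHERE filter are made up for the example and are not from the original post.

```sql
-- Sketch only: query the original table, where requestparameters is a JSON string.
-- The aliases bucket_name, host and object_key are hypothetical names.
SELECT
  eventtime,
  sourceipaddress,
  json_extract_scalar(requestparameters, '$.bucketName') AS bucket_name,
  json_extract_scalar(requestparameters, '$.Host')       AS host,
  json_extract_scalar(requestparameters, '$.key')        AS object_key
FROM cloudtrail_logs_mybucket_logs
WHERE eventsource = 's3.amazonaws.com'
  AND eventname = 'GetObject'
LIMIT 10;
```

json_extract_scalar returns a varchar (or NULL when the path is missing), so the extracted values can be used directly in WHERE or GROUP BY clauses without changing the table definition.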
Tags: amazon-athena amazon-cloudtrail