【Title】: Kinesis Firehose - What is S3 extended destination configuration?
【Posted】: 2020-06-24 02:28:39
【Question】:

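Question

What is the S3 extended destination configuration, and where in the AWS documentation is its purpose clearly explained?

As the name suggests, it must be about the S3 destination. However, the S3 destination section of the AWS documentation does not mention it.

If there is an article or blog post that explains it clearly, please point me to it.

I have been looking for clues in the following documentation but, as is often the case with AWS documentation, it is unclear. It appears to be partly related to input record transformation or record processing.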

resource "aws_kinesis_firehose_delivery_stream" "extended_s3_stream" {
  name        = "terraform-kinesis-firehose-extended-s3-test-stream"
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = "${aws_iam_role.firehose_role.arn}"
    bucket_arn = "${aws_s3_bucket.bucket.arn}"

    processing_configuration {
      enabled = "true"

      processors {
        type = "Lambda"

        parameters {
          parameter_name  = "LambdaArn"
          parameter_value = "${aws_lambda_function.lambda_processor.arn}:$LATEST"
        }
      }
    }
  }
}

【Comments】:

    Tags: amazon-web-services amazon-kinesis-firehose


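    【Answer 1】:

    The Terraform documentation does the best job of showing the difference between the S3 and extended S3 destinations: https://www.terraform.io/docs/providers/aws/r/kinesis_firehose_delivery_stream.html

    The extended S3 destination inherits the S3 destination configuration parameters and adds extra ones, such as data_format_conversion_configuration and error_output_prefix.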
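To make the difference concrete, here is a minimal sketch of an extended S3 destination that sets error_output_prefix alongside arguments shared with the plain S3 destination. It assumes Terraform 0.12 or later syntax and reuses the aws_iam_role.firehose_role and aws_s3_bucket.bucket resources from the question; the stream name and both prefix values are made up for illustration.

```hcl
resource "aws_kinesis_firehose_delivery_stream" "example" {
  name        = "example-extended-s3-stream" # hypothetical name
  destination = "extended_s3"

  extended_s3_configuration {
    # Arguments shared with the plain S3 destination configuration
    role_arn   = aws_iam_role.firehose_role.arn
    bucket_arn = aws_s3_bucket.bucket.arn
    prefix     = "data/" # hypothetical key prefix for delivered objects

    # Argument that exists only on the extended configuration
    error_output_prefix = "errors/" # hypothetical key prefix for failed records
  }
}
```

With this layout, successfully delivered objects and failed records end up under separate key prefixes in the same bucket.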

    【Comments】:

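      【Answer 2】:

      I'm afraid the Kinesis Firehose documentation is so poorly written that I wonder how anyone figures out how to use Firehose from the documentation alone.

      Initially, it seems, Firehose simply relayed data to an S3 bucket with no built-in transformation mechanism, and the S3 destination configuration has no processing configuration, as in AWS::KinesisFirehose::DeliveryStream S3DestinationConfiguration.

      Then, as described in Amazon Kinesis Firehose Data Transformation with AWS Lambda, a mechanism for transforming records appears to have been introduced around early 2017, and AWS::KinesisFirehose::DeliveryStream ExtendedS3DestinationConfiguration was added to support it.

      Evidently people have a hard time finding out how to configure it:

      Well, after a lot of effort and documentation searching, I figured it out.

      Can anyone figure this out by reading the AWS documentation?

      Firehose extended S3 configuration for Lambda transformation

      I could not figure it out from the AWS documentation, but after looking at actual implementations on the Internet, the required configuration appears to be as follows.


      Update

      As suggested by Kevin Eid.

      s3_configuration - (Optional) Required for non-S3 destinations. For the S3 destination, use extended_s3_configuration instead.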

      The extended_s3_configuration object supports the same fields from s3_configuration as well as the following:
      
          data_format_conversion_configuration - (Optional) Nested argument for the serializer, deserializer, and schema for converting data from the JSON format to the Parquet or ORC format before writing it to Amazon S3. More details given below.
          error_output_prefix - (Optional) Prefix added to failed records before writing them to S3. This prefix appears immediately following the bucket name.
          processing_configuration - (Optional) The data processing configuration. More details are given below.
          s3_backup_mode - (Optional) The Amazon S3 backup mode. Valid values are Disabled and Enabled. Default value is Disabled.
          s3_backup_configuration - (Optional) The configuration for backup in Amazon S3. Required if s3_backup_mode is Enabled. Supports the same fields as s3_configuration object.
      

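      I believe s3_configuration still exists for compatibility or legacy reasons, so for an S3 destination only extended_s3_configuration needs to be used, but the AWS documentation does not explain this properly. It is a pity that the AWS documentation cannot serve as a source of truth.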
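As an illustration of the extended-only arguments listed above, the following sketch combines processing_configuration with s3_backup_mode and s3_backup_configuration. It assumes Terraform 0.12 or later syntax and the role, bucket, and Lambda resources from the question; aws_s3_bucket.backup and the stream name are hypothetical and would need to be defined elsewhere.

```hcl
resource "aws_kinesis_firehose_delivery_stream" "transformed" {
  name        = "example-transformed-stream" # hypothetical name
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = aws_iam_role.firehose_role.arn
    bucket_arn = aws_s3_bucket.bucket.arn

    # Transform records with a Lambda function before delivery
    processing_configuration {
      enabled = true

      processors {
        type = "Lambda"

        parameters {
          parameter_name  = "LambdaArn"
          parameter_value = "${aws_lambda_function.lambda_processor.arn}:$LATEST"
        }
      }
    }

    # Keep a copy of the untransformed source records
    s3_backup_mode = "Enabled"

    s3_backup_configuration {
      role_arn   = aws_iam_role.firehose_role.arn
      bucket_arn = aws_s3_bucket.backup.arn # hypothetical second bucket
    }
  }
}
```

Backing up the source records is optional, but it can help when a transformation bug needs to be diagnosed or the data replayed.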

      【Comments】:

        【Answer 3】:

        The ExtendedS3DestinationConfiguration property type configures an Amazon S3 destination for an Amazon Kinesis Data Firehose delivery stream. See: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-kinesisfirehose-deliverystream-extendeds3destinationconfiguration.html

        Thanks.

        【Comments】:

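          【Answer 4】:

          This small screenshot shows the components that are new in ExtendedS3DestinationConfiguration compared with S3DestinationConfiguration:

          Also, for what the extended S3 configuration is and how it is defined, see the API documentation: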

          {
            "RoleARN": "string",
            "BucketARN": "string",
            "Prefix": "string",
            "ErrorOutputPrefix": "string",
            "BufferingHints": {
              "SizeInMBs": integer,
              "IntervalInSeconds": integer
            },
            "CompressionFormat": "UNCOMPRESSED"|"GZIP"|"ZIP"|"Snappy",
            "EncryptionConfiguration": {
              "NoEncryptionConfig": "NoEncryption",
              "KMSEncryptionConfig": {
                "AWSKMSKeyARN": "string"
              }
            },
            "CloudWatchLoggingOptions": {
              "Enabled": true|false,
              "LogGroupName": "string",
              "LogStreamName": "string"
            },
            "ProcessingConfiguration": {
              "Enabled": true|false,
              "Processors": [
                {
                  "Type": "Lambda",
                  "Parameters": [
                    {
                      "ParameterName": "LambdaArn"|"NumberOfRetries"|"RoleArn"|"BufferSizeInMBs"|"BufferIntervalInSeconds",
                      "ParameterValue": "string"
                    }
                    ...
                  ]
                }
                ...
              ]
            },
            "S3BackupMode": "Disabled"|"Enabled",
            "S3BackupUpdate": {
              "RoleARN": "string",
              "BucketARN": "string",
              "Prefix": "string",
              "ErrorOutputPrefix": "string",
              "BufferingHints": {
                "SizeInMBs": integer,
                "IntervalInSeconds": integer
              },
              "CompressionFormat": "UNCOMPRESSED"|"GZIP"|"ZIP"|"Snappy",
              "EncryptionConfiguration": {
                "NoEncryptionConfig": "NoEncryption",
                "KMSEncryptionConfig": {
                  "AWSKMSKeyARN": "string"
                }
              },
              "CloudWatchLoggingOptions": {
                "Enabled": true|false,
                "LogGroupName": "string",
                "LogStreamName": "string"
              }
            },
            "DataFormatConversionConfiguration": {
              "SchemaConfiguration": {
                "RoleARN": "string",
                "CatalogId": "string",
                "DatabaseName": "string",
                "TableName": "string",
                "Region": "string",
                "VersionId": "string"
              },
              "InputFormatConfiguration": {
                "Deserializer": {
                  "OpenXJsonSerDe": {
                    "ConvertDotsInJsonKeysToUnderscores": true|false,
                    "CaseInsensitive": true|false,
                    "ColumnToJsonKeyMappings": {"string": "string"
                      ...}
                  },
                  "HiveJsonSerDe": {
                    "TimestampFormats": ["string", ...]
                  }
                }
              },
              "OutputFormatConfiguration": {
                "Serializer": {
                  "ParquetSerDe": {
                    "BlockSizeBytes": integer,
                    "PageSizeBytes": integer,
                    "Compression": "UNCOMPRESSED"|"GZIP"|"SNAPPY",
                    "EnableDictionaryCompression": true|false,
                    "MaxPaddingBytes": integer,
                    "WriterVersion": "V1"|"V2"
                  },
                  "OrcSerDe": {
                    "StripeSizeBytes": integer,
                    "BlockSizeBytes": integer,
                    "RowIndexStride": integer,
                    "EnablePadding": true|false,
                    "PaddingTolerance": double,
                    "Compression": "NONE"|"ZLIB"|"SNAPPY",
                    "BloomFilterColumns": ["string", ...],
                    "BloomFilterFalsePositiveProbability": double,
                    "DictionaryKeyThreshold": double,
                    "FormatVersion": "V0_11"|"V0_12"
                  }
                }
              },
              "Enabled": true|false
            }
          }
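Several of the fields in the syntax above have direct counterparts in Terraform's extended_s3_configuration block. The sketch below shows compression, KMS encryption, and CloudWatch logging together. It assumes Terraform 0.12 or later syntax and the role and bucket from the question; aws_kms_key.firehose, the stream name, and the log group and stream names are hypothetical.

```hcl
resource "aws_kinesis_firehose_delivery_stream" "compressed" {
  name        = "example-compressed-stream" # hypothetical name
  destination = "extended_s3"

  extended_s3_configuration {
    role_arn   = aws_iam_role.firehose_role.arn
    bucket_arn = aws_s3_bucket.bucket.arn

    # Corresponds to CompressionFormat
    compression_format = "GZIP"

    # Corresponds to EncryptionConfiguration.KMSEncryptionConfig.AWSKMSKeyARN
    kms_key_arn = aws_kms_key.firehose.arn # hypothetical KMS key

    # Corresponds to CloudWatchLoggingOptions
    cloudwatch_logging_options {
      enabled         = true
      log_group_name  = "/aws/kinesisfirehose/example" # hypothetical
      log_stream_name = "S3Delivery"                   # hypothetical
    }
  }
}
```

None of these arguments is mandatory; only role_arn and bucket_arn are needed for a working stream.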
          

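          【Comments】:

          • Thanks for the follow-up. But who needs to use it, and for what?
          • @mon It gives you a lot of options, such as compression, encryption, an S3 backup bucket, and logging. For example, you can aggregate all of the streaming data in a compressed format to save on S3 storage costs. You don't have to use all of these options, but they are there.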