【Question Title】: How to use Data Pipeline to export a DynamoDB table that has on-demand provision
【Posted】: 2019-02-13 09:35:12
【Question】:

I used to use the Data Pipeline template called Export DynamoDB table to S3 to export a DynamoDB table to a file. I recently switched all of my DynamoDB tables to on-demand provisioning, and the template no longer works. I'm fairly sure this is because the old template specifies a percentage of DynamoDB throughput to consume, which is irrelevant for on-demand tables.

I tried exporting the old template to JSON, removing the reference to throughput percentage consumption, and creating a new pipeline. However, this was unsuccessful.

Can anyone suggest how to convert the old-style pipeline script with provisioned throughput into a script for the new on-demand tables?

Here is my original, working script:

{
  "objects": [
    {
      "name": "DDBSourceTable",
      "id": "DDBSourceTable",
      "type": "DynamoDBDataNode",
      "tableName": "#{myDDBTableName}"
    },
    {
      "name": "EmrClusterForBackup",
      "coreInstanceCount": "1",
      "coreInstanceType": "m3.xlarge",
      "releaseLabel": "emr-5.13.0",
      "masterInstanceType": "m3.xlarge",
      "id": "EmrClusterForBackup",
      "region": "#{myDDBRegion}",
      "type": "EmrCluster"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "output": {
        "ref": "S3BackupLocation"
      },
      "input": {
        "ref": "DDBSourceTable"
      },
      "maximumRetries": "2",
      "name": "TableBackupActivity",
      "step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}",
      "id": "TableBackupActivity",
      "runsOn": {
        "ref": "EmrClusterForBackup"
      },
      "type": "EmrActivity",
      "resizeClusterBeforeRunning": "true"
    },
    {
      "directoryPath": "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "S3BackupLocation",
      "id": "S3BackupLocation",
      "type": "S3DataNode"
    }
  ],
  "parameters": [
    {
      "description": "Output S3 folder",
      "id": "myOutputS3Loc",
      "type": "AWS::S3::ObjectKey"
    },
    {
      "description": "Source DynamoDB table name",
      "id": "myDDBTableName",
      "type": "String"
    },
    {
      "default": "0.25",
      "watermark": "Enter value between 0.1-1.0",
      "description": "DynamoDB read throughput ratio",
      "id": "myDDBReadThroughputRatio",
      "type": "Double"
    },
    {
      "default": "us-east-1",
      "watermark": "us-east-1",
      "description": "Region of the DynamoDB table",
      "id": "myDDBRegion",
      "type": "String"
    }
  ],
  "values": {
    "myDDBRegion": "us-east-1",
    "myDDBTableName": "LIVE_Invoices",
    "myDDBReadThroughputRatio": "0.25",
    "myOutputS3Loc": "s3://company-live-extracts/"
  }
}

Here is my updated attempt, which fails:

{
  "objects": [
    {
      "name": "DDBSourceTable",
      "id": "DDBSourceTable",
      "type": "DynamoDBDataNode",
      "tableName": "#{myDDBTableName}"
    },
    {
      "name": "EmrClusterForBackup",
      "coreInstanceCount": "1",
      "coreInstanceType": "m3.xlarge",
      "releaseLabel": "emr-5.13.0",
      "masterInstanceType": "m3.xlarge",
      "id": "EmrClusterForBackup",
      "region": "#{myDDBRegion}",
      "type": "EmrCluster"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "output": {
        "ref": "S3BackupLocation"
      },
      "input": {
        "ref": "DDBSourceTable"
      },
      "maximumRetries": "2",
      "name": "TableBackupActivity",
      "step": "s3://dynamodb-emr-#{myDDBRegion}/emr-ddb-storage-handler/2.1.0/emr-ddb-2.1.0.jar,org.apache.hadoop.dynamodb.tools.DynamoDbExport,#{output.directoryPath},#{input.tableName}",
      "id": "TableBackupActivity",
      "runsOn": {
        "ref": "EmrClusterForBackup"
      },
      "type": "EmrActivity",
      "resizeClusterBeforeRunning": "true"
    },
    {
      "directoryPath": "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
      "name": "S3BackupLocation",
      "id": "S3BackupLocation",
      "type": "S3DataNode"
    }
  ],
  "parameters": [
    {
      "description": "Output S3 folder",
      "id": "myOutputS3Loc",
      "type": "AWS::S3::ObjectKey"
    },
    {
      "description": "Source DynamoDB table name",
      "id": "myDDBTableName",
      "type": "String"
    },
    {
      "default": "us-east-1",
      "watermark": "us-east-1",
      "description": "Region of the DynamoDB table",
      "id": "myDDBRegion",
      "type": "String"
    }
  ],
  "values": {
    "myDDBRegion": "us-east-1",
    "myDDBTableName": "LIVE_Invoices",
    "myOutputS3Loc": "s3://company-live-extracts/"
  }
}

Here is the error from the Data Pipeline execution:

at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:322) at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:198) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1341) at org.apache.hadoop.mapreduce.Job$11.run(Job.java:1338) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836) at org.apache.hadoop.mapreduce.Job.submit(Job.java:1338) at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:575) at org.apache.hadoop.mapred.JobClient$1.run(JobClient.java:570) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:570) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java

【Discussion】:

Tags: amazon-dynamodb amazon-data-pipeline


【Solution 1】:

I opened a support ticket with AWS for this. Their response was quite comprehensive. I'll paste it below:


Thank you for contacting us about this issue.

Unfortunately, Data Pipeline export/import jobs for DynamoDB do not support DynamoDB's new On-Demand mode [1].

Tables using On-Demand capacity do not have defined capacities for read and write units. Data Pipeline relies on this defined capacity when calculating the pipeline's throughput.

For example, if you have 100 RCUs (Read Capacity Units) and a pipeline throughput ratio of 0.25 (25%), the effective pipeline throughput would be 25 read units per second (100 * 0.25). With On-Demand capacity, however, the RCUs and WCUs (Write Capacity Units) are reported as 0, so regardless of the pipeline throughput ratio, the calculated effective throughput is 0.

The pipeline does not execute when the effective throughput is less than 1.
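The calculation described in the support response can be sketched as follows (a minimal illustration of the described behavior, not AWS's actual code; the function name is my own):

```python
def effective_pipeline_throughput(rcu: int, throughput_ratio: float) -> float:
    """Effective read units/sec available to the pipeline: table RCU * ratio."""
    return rcu * throughput_ratio

# Provisioned table: 100 RCU at a 0.25 ratio -> 25 reads/sec, pipeline runs.
print(effective_pipeline_throughput(100, 0.25))  # 25.0

# On-demand table: RCU is reported as 0 -> effective throughput is 0,
# which is below the minimum of 1, so the pipeline refuses to execute.
print(effective_pipeline_throughput(0, 0.25))    # 0.0
```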

Do you need to export your DynamoDB tables to S3?

If you are exporting these tables purely for backup purposes, I recommend using DynamoDB's On-Demand Backup and Restore feature (confusingly similar in name to On-Demand capacity) [2].

Note that On-Demand backups do not affect table throughput and complete within seconds. You only pay the S3 storage costs associated with the backup. However, these table backups are not directly accessible to customers and can only be restored to the source table. This backup method is not suitable if you want to run analytics on the backup data, or import the data into other systems, accounts, or tables.

If you need to use Data Pipeline to export your DynamoDB data, the only way is to set the table to Provisioned capacity mode.

You could do this manually, or include it as an activity in the pipeline itself, using the AWS CLI command [3].

For example (On-Demand is also known as Pay-Per-Request mode):

    $ aws dynamodb update-table --table-name myTable --billing-mode PROVISIONED --provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=100
    

and to switch back to On-Demand afterwards:

    $ aws dynamodb update-table --table-name myTable --billing-mode PAY_PER_REQUEST
    

Please note that once you disable On-Demand capacity mode, you need to wait 24 hours before you can enable it again.
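To automate the mode switch inside the pipeline itself, one option is a ShellCommandActivity that flips the table to provisioned mode before the EMR step runs. A sketch only: the activity name, the 100/100 capacity values, and the use of `aws dynamodb wait table-exists` to wait for the table to become ACTIVE are my own assumptions, and a matching follow-up activity would be needed to switch back to PAY_PER_REQUEST after the export:

```json
{
  "name": "SetProvisionedMode",
  "id": "SetProvisionedMode",
  "type": "ShellCommandActivity",
  "command": "aws dynamodb update-table --table-name #{myDDBTableName} --billing-mode PROVISIONED --provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=100 && aws dynamodb wait table-exists --table-name #{myDDBTableName}",
  "runsOn": {
    "ref": "EmrClusterForBackup"
  }
}
```

TableBackupActivity would then declare `"dependsOn": { "ref": "SetProvisionedMode" }` so the export only starts once the table is back in provisioned mode.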

=== Reference links ===

[1] DynamoDB On-Demand capacity (see also the note about unsupported services/tools): https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadWriteCapacityMode.html#HowItWorks.OnDemand

[2] DynamoDB On-Demand Backup and Restore: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/BackupRestore.html

[3] AWS CLI reference for DynamoDB "update-table": https://docs.aws.amazon.com/cli/latest/reference/dynamodb/update-table.html

【Discussion】:

【Solution 2】:

Support for on-demand tables was added to the DDB export tool earlier this year: GitHub commit

I was able to put an updated build of the tool on S3 and change a few things in the pipeline to get it working:

      {
        "objects": [
          {
            "output": {
              "ref": "S3BackupLocation"
            },
            "input": {
              "ref": "DDBSourceTable"
            },
            "maximumRetries": "2",
            "name": "TableBackupActivity",
            "step": "s3://<your-tools-bucket>/emr-dynamodb-tools-4.11.0-SNAPSHOT.jar,org.apache.hadoop.dynamodb.tools.DynamoDBExport,#{output.directoryPath},#{input.tableName},#{input.readThroughputPercent}",
            "id": "TableBackupActivity",
            "runsOn": {
              "ref": "EmrClusterForBackup"
            },
            "type": "EmrActivity",
            "resizeClusterBeforeRunning": "true"
          },
          {
            "failureAndRerunMode": "CASCADE",
            "resourceRole": "DataPipelineDefaultResourceRole",
            "role": "DataPipelineDefaultRole",
            "pipelineLogUri": "s3://<your-log-bucket>/",
            "scheduleType": "ONDEMAND",
            "name": "Default",
            "id": "Default"
          },
          {
            "readThroughputPercent": "#{myDDBReadThroughputRatio}",
            "name": "DDBSourceTable",
            "id": "DDBSourceTable",
            "type": "DynamoDBDataNode",
            "tableName": "#{myDDBTableName}"
          },
          {
            "directoryPath": "#{myOutputS3Loc}/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}",
            "name": "S3BackupLocation",
            "id": "S3BackupLocation",
            "type": "S3DataNode"
          },
          {
            "name": "EmrClusterForBackup",
            "coreInstanceCount": "1",
            "coreInstanceType": "m3.xlarge",
            "releaseLabel": "emr-5.26.0",
            "masterInstanceType": "m3.xlarge",
            "id": "EmrClusterForBackup",
            "region": "#{myDDBRegion}",
            "type": "EmrCluster",
            "terminateAfter": "1 Hour"
          }
        ],
        "parameters": [
          {
            "description": "Output S3 folder",
            "id": "myOutputS3Loc",
            "type": "AWS::S3::ObjectKey"
          },
          {
            "description": "Source DynamoDB table name",
            "id": "myDDBTableName",
            "type": "String"
          },
          {
            "default": "0.25",
            "watermark": "Enter value between 0.1-1.0",
            "description": "DynamoDB read throughput ratio",
            "id": "myDDBReadThroughputRatio",
            "type": "Double"
          },
          {
            "default": "us-east-1",
            "watermark": "us-east-1",
            "description": "Region of the DynamoDB table",
            "id": "myDDBRegion",
            "type": "String"
          }
        ],
        "values": {
          "myDDBRegion": "us-west-2",
          "myDDBTableName": "<your table name>",
          "myDDBReadThroughputRatio": "0.5",
          "myOutputS3Loc": "s3://<your-output-bucket>/"
        }
      }
      

Key changes:

• Updated the releaseLabel of EmrClusterForBackup to "emr-5.26.0". This is required to pick up the AWS SDK for Java v1.11.x and DynamoDB connector v4.11.0 (see the release matrix here: AWS docs)
• Updated the step of TableBackupActivity as above: point it at the *.jar you built, and change the tool's class name from DynamoDbExport to DynamoDBExport
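For reference, producing that updated jar might look roughly like the outline below (a sketch under my own assumptions: that the tool lives in the awslabs/emr-dynamodb-connector repository, that it builds with a standard Maven invocation, and that the jar lands under the `emr-dynamodb-tools` module's `target/` directory; `<your-tools-bucket>` is a placeholder):

```shell
# Clone the connector repo that contains the export tool, and build it.
git clone https://github.com/awslabs/emr-dynamodb-connector.git
cd emr-dynamodb-connector
mvn clean package -DskipTests

# Upload the tools jar to your own bucket so the pipeline step can reference it.
aws s3 cp emr-dynamodb-tools/target/emr-dynamodb-tools-4.11.0-SNAPSHOT.jar \
  s3://<your-tools-bucket>/
```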

Hopefully the default template gets updated as well, so that it works out of the box.

【Discussion】:

• Nice. I'll check it out next time I need to do an export, thanks.