【问题标题】:Using GroupBy while copying from HDFS to S3 to merge files within a folder从 HDFS 复制到 S3 时使用 GroupBy 合并文件夹中的文件
【发布时间】:2018-03-31 16:44:42
【问题描述】:

我在 HDFS 中有以下文件夹:

hdfs://x.x.x.x:8020/Air/BOOK/AE/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/AE/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/BH/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/IN/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/IN/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/KW/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/KW/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/ME/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/OM/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/Others/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/QA/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/QA/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/SA/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/BOOK/SA/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/AE/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/AE/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/BH/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/BH/INT/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/IN/DOM/20171001/2017100101
hdfs://x.x.x.x:8020/Air/SEARCH/IN/INT/20171001/2017100101

每个文件夹中有近 50 个文件。我的目的是合并文件夹中的所有文件以获取单个文件,同时将其从 HDFS 复制到 S3 上。我遇到的问题是带有 groupBy 选项的正则表达式。我试过了,但这似乎不起作用:

s3-dist-cp --src hdfs:///Air/ --dest s3a://HadoopSplit/Air-merged/  --groupBy '.*/(\w+)/(\w+)/(\w+)/.*' --outputCodec lzo

该命令本身有效,但我没有将每个文件夹中的文件合并到一个文件中,这让我相信问题出在我的正则表达式上。

【问题讨论】:

    标签: hadoop amazon-s3 amazon-emr distcp s3distcp


    【解决方案1】:

    这只是我自己想出来的。正确的正则表达式是

    .*/Air/(\w+)/(\w+)/(\w+)/.*/.*/.*
    

    合并和复制的命令是:

    s3-dist-cp --src hdfs:///Air/ --dest s3a://HadoopSplit/Air-merged/  --groupBy '.*/Air/(\w+)/(\w+)/(\w+)/.*/.*/.*' --outputCodec lzo
    

    【讨论】:

    猜你喜欢
    • 2014-07-08
    • 1970-01-01
    • 1970-01-01
    • 2011-11-21
    • 1970-01-01
    • 2023-03-11
    • 2019-11-14
    • 1970-01-01
    • 2017-05-31
    相关资源
    最近更新 更多