【Question Title】: Conditional sum over data using Apache Pig Latin
【Posted】: 2015-09-01 18:06:40
【Problem Description】:

I'm trying to do some log processing with Apache Pig Latin, and I'm wondering whether there's a simpler way to do it:

filtered_logs = FOREACH logs GENERATE numDay, reqSize, optimizedSize, origSize, compressionPct, cacheStatus;

grouped_logs = GROUP filtered_logs BY numDay;

results = FOREACH grouped_logs GENERATE group,
(SUM(filtered_logs.reqSize) + SUM(filtered_logs.optimizedSize)) / 1048576.00 AS ClientThroughputMB,
(SUM(filtered_logs.reqSize) + SUM(filtered_logs.origSize)) / 1048576.00 AS ServerThroughputMB,
SUM(filtered_logs.origSize) / 1048576.00 AS OrigMB,
SUM(filtered_logs.optimizedSize) / 1048576.00 AS OptMB,
SUM(filtered_logs.reqSize) / 1048576.00 AS SentMB,
AVG(filtered_logs.compressionPct) AS CompressionAvg,
COUNT(filtered_logs) AS NumLogs;

cache_hit_logs = FILTER filtered_logs BY cacheStatus MATCHES '.*HIT.*';

grouped_cache_hit_logs = GROUP cache_hit_logs BY numDay;

cache_hits = FOREACH grouped_cache_hit_logs GENERATE group,
COUNT(cache_hit_logs) AS cnt;

final_results = JOIN results BY group, cache_hits BY group;
DUMP final_results;

(logs is defined elsewhere; it basically reads pipe-delimited log files and assigns the fields)

What I'm trying to do here is count the number of records whose cacheStatus field contains "HIT", while also computing the other aggregates such as OrigMB, CompressionAvg, NumLogs, and so on. The current code works, but it seems to carry a huge performance overhead. Is there a way in Pig Latin to do something along these lines (as in MSSQL)?

SUM(CASE CacheStatus WHEN 'HIT' THEN 1 else 0 END) as CacheHit

(Basically, I don't want to process the logs multiple times; I'd rather do it in a single pass.)

Sorry if my question is confusingly worded; I'm still quite new to Pig Latin.

【Question Discussion】:

    Tags: hadoop logging apache-pig


    【Solution 1】:

    Never mind, I found my own solution (silly me, I forgot I could wrap the statements in curly braces):

    results = FOREACH grouped_logs 
    {
        cache_hits = FILTER filtered_logs BY cacheStatus MATCHES '.*HIT.*';
    
        GENERATE group,
        (SUM(filtered_logs.reqSize) + SUM(filtered_logs.optimizedSize)) / 1048576.00 AS ClientThroughputMB,
        (SUM(filtered_logs.reqSize) + SUM(filtered_logs.origSize)) / 1048576.00 AS ServerThroughputMB,
        SUM(filtered_logs.origSize) / 1048576.00 AS OrigMB,
        SUM(filtered_logs.optimizedSize) / 1048576.00 AS OptMB,
        SUM(filtered_logs.reqSize) / 1048576.00 AS SentMB,
        AVG(filtered_logs.compressionPct) AS CompressionAvg,
        COUNT(filtered_logs) AS NumLogs,
        COUNT(cache_hits) AS CacheHit;
    }
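
    As an aside, the SQL-style conditional sum from the question can also be written with Pig's bincond (ternary) operator inside the nested FOREACH, which avoids building the intermediate FILTER relation. The sketch below shows only the hit-count columns and is untested; the remaining aggregates from the solution above would be listed alongside them in the same GENERATE:

        results = FOREACH grouped_logs
        {
            -- bincond: emit 1 for a cache hit, 0 otherwise
            -- (mirrors SUM(CASE CacheStatus WHEN 'HIT' THEN 1 ELSE 0 END))
            hit_flags = FOREACH filtered_logs GENERATE
                (cacheStatus MATCHES '.*HIT.*' ? 1 : 0) AS isHit;

            GENERATE group,
                COUNT(filtered_logs) AS NumLogs,
                SUM(hit_flags.isHit) AS CacheHit;
        }

    Either form stays within a single pass over grouped_logs; the FILTER/COUNT version in the accepted solution is arguably more readable, while the bincond version maps one-to-one onto the SQL CASE expression.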
    

    【Discussion】:
