【问题标题】:Can I generate nested bags using nested FOREACH statements in Pig Latin?我可以使用 Pig Latin 中的嵌套 FOREACH 语句生成嵌套包吗?
【发布时间】:2011-02-08 11:53:49
【问题描述】:

假设我有一个餐厅评论数据集:

User,City,Restaurant,Rating
Jim,New York,Mecurials,3
Jim,New York,Whapme,4.5
Jim,London,Pint Size,2
Lisa,London,Pint Size,4
Lisa,London,Rabbit Whole,3.5

我想按用户和城市的平均评论生成一个列表。 IE。输出:

User,City,AverageRating
Jim,New York,3.75
Jim,London,2
Lisa,London,3.75

我可以编写一个 Pig 脚本如下:

Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:charray, rating:float
);

PerUserCity = GROUP Data BY (user, city);

ResultSet = FOREACH PerUserCity {
    GENERATE group.user, group.city, AVG(Data.rating);
}

但是我很好奇我是否可以先将更高级别的组(用户)分组,然后再将下一个级别(城市)分组:即

PerUser = GROUP Data BY user;

Intermediate = FOREACH PerUser {
    B = GROUP Data BY city;
    GENERATE group AS user, B;
}

我明白了:

Error during parsing.
Invalid alias: GROUP in {
  group: chararray,
  Data: {
    user: chararray,
    city: chararray,
    restaurant: chararray,
    rating: float
  }
}

有没有人成功地尝试过这个?是否根本不可能在 FOREACH 中进行 GROUP?

我的目标是做类似的事情:

ResultSet = FOREACH PerUser {
    FOREACH City {
        GENERATE user, city, AVG(City.rating)
    }
}

【问题讨论】:

    标签: apache-pig


    【解决方案1】:

    目前允许的操作是DISTINCTFILTERLIMITORDER BYFOREACH 内。

    现在直接按(用户,城市)分组是按照你说的做的好方法。

    【讨论】:

      【解决方案2】:

      Pig 0.10 版的发行说明表明嵌套的 FOREACH 操作是 now supported

      【讨论】:

      • 谢谢。为什么内部块需要两个 GENERATE?
      • 撤回我的建议。发行说明表明可以做到这一点,但我无法让它工作。
      【解决方案3】:

      试试这个:

      Records = load 'data_rating.txt' using PigStorage(',') as (user:chararray, city:chararray, restaurant:chararray, rating:float);
      grpRecs = group Records By (user,city);
      avgRating_Byuser_perCity = foreach grpRecs generate AVG(Records.rating) as average; 
      Result = foreach avgRating_Byuser_perCity generate flatten(group), average;
      

      【讨论】:

      • 您应该添加说明此代码完成了什么以及它是如何做到的。
      • 这是错误的...应该是 Records = load 'data_rating.txt' using PigStorage(',') as (user:chararray, city:chararray, restaurant:chararray, rating:float) ; grpRecs = 按(用户,城市)分组记录; avgRating_Byuser_perCity = foreach grpRecs 生成 flatten(group), AVG(Records.rating) 作为平均值;结果 = 转储 avgRating_Byuser_perCity ;
      【解决方案4】:
      awdata = load 'data' using PigStorage(',') as (user:chararray , city:chararray , restaurant:chararray , rating:float);
      data = filter rawdata by user != 'User';
      
      groupbyusercity = group data by (user,city);
      
      --describe groupbyusercity;
      --groupbyusercity: {group: (user: chararray,city: chararray),data: {(user: chararray,city: chararray,restaurant: chararray,rating: float)}}
      
      average = foreach groupbyusercity {
          generate group.user,group.city,AVG(data.rating);
      }
      
      dump average;
      

      【讨论】:

        【解决方案5】:

        按两个键分组,然后展平结构会导致相同的结果:

        像你一样加载数据

        Data = LOAD 'data.txt' USING PigStorage(',') AS (
            user:chararray, city:chararray, restaurant:charray, rating:float);
        

        按用户和城市分组

         ByUserByCity = GROUP Data BY (user, city);
        

        添加组的平均评分(您可以添加更多,例如 COUNT(Data) as count_res) 然后将组结构展平为原始结构。

        ByUserByCityAvg = FOREACH ByUserByCity GENERATE
        FLATTEN(group) AS (user, city),
        AVG(Data.rating) as user_city_avg;
        

        结果:

        Jim,London,2.0
        Jim,New York,3.75
        Lisa,London,3.75
        User,City,
        

        【讨论】:

        • 我猜,这不能回答问题
        猜你喜欢
        • 1970-01-01
        • 1970-01-01
        • 2021-07-22
        • 1970-01-01
        • 2023-03-16
        • 2013-11-01
        • 1970-01-01
        • 2013-08-18
        • 1970-01-01
        相关资源
        最近更新 更多