【Question title】: Hadoop Pig correlation usage
【Posted】: 2011-12-06 23:56:21
【Question】:

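I have a list of vectors, and I want to run a correlation between each of them and an input vector (of numbers). How should I store my list of vectors, and how do I pass in my input vector and hand it to Pig's COR() function?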

-- SET command?  what is it used for? this doesn't work
SET input_nums {0,2,0,1,2,0,0,0,0} AS bag{}

-- storing vectors in this format doesn't seem to work 
-- import via: data = LOAD mynums AS (id:long, nums:bag{});
1\t{1,3,3,4,5}
2\t{3,4,5,6,6}

-- this seems to work, but adds overhead on storage
-- import via: data = LOAD mynums AS (id:long, nums:bag{t:(x:long)});
1\t{(1),(3),(3),(4),(5)}
2\t{(3),(4),(5),(6),(6)}

-- assuming "data" and "input_nums" are set, no idea how to use though:
results = COR(data, input_nums) -- nope
results = FOREACH data GENERATE id, COR(nums, input_nums) -- nope

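A less important side question: I have seen Pig scripts that take parameters. Can I pass in my input_nums through those parameters (i.e. as a string parameter that Pig then puts into a bag)?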


    Tags: hadoop apache-pig


    【Solution 1】:

    The only requirement for running COR in Pig is that the input arguments are bags of doubles. Also, make sure you are running Pig version 0.9.1 or later (see JIRA: PIG-2286).

    Input data (cor.txt, tab-separated):
    1<tab>10
    2<tab>12
    3<tab>13
    4<tab>14

    Script:
    data = LOAD 'cor.txt' AS (series1:double, series2:double);
    rel = GROUP data ALL;
    corop = FOREACH rel GENERATE COR(data.series1, data.series2);
    dump corop;

    Output:
    ({(var0,var1,0.9827076298239908)})
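
As a sanity check on the output above, here is a minimal sketch in plain Python (not Pig) that computes the Pearson correlation coefficient for the same two columns. It assumes COR reports the standard Pearson coefficient, which is consistent with the value shown. The function name `pearson` and the variable names are made up for illustration and are not part of Pig.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Hypothetical helper for illustration only; it does not guard against
    a constant series (zero variance), which would divide by zero.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# The two columns of cor.txt from the answer above.
series1 = [1.0, 2.0, 3.0, 4.0]
series2 = [10.0, 12.0, 13.0, 14.0]

r = pearson(series1, series2)
print(round(r, 6))  # 0.982708
```

The result agrees with the 0.9827... value that Pig printed, so running a small sample through a check like this is a quick way to confirm that the columns were loaded as doubles and paired in the intended order.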
