(Hortonworks Sandbox) Pig Join operation duplicate primary key columns: answers

【Question Title】: (Hortonworks Sandbox) Pig Join operation duplicate primary key columns
【Posted】: 2018-09-19 07:52:24
【Question Description】:

I have two tables that I want to join. table1 has the columns id and value.
table2 has the columns id and color.

final = join table1 by id, table2 by id;
dump final;

The result has the columns id, value, id, color, but I want a table with only the columns id, value, and color. How can I remove the duplicate id column from the result?

【Question Discussion】:

Tags: join duplicates apache-pig hortonworks-sandbox

【Solution 1】:

If you run DESCRIBE final;, you will see that the schema looks something like this:

final: {table1::id: chararray,table1::value: chararray,table2::id: chararray,table2::color: chararray}

To tell the two id columns apart, refer to them as table1::id or table2::id. So, to drop one of the duplicate columns, you can do this:

A = FOREACH final GENERATE 
    table1::id AS id,
    table1::value AS value,
    table2::color AS color;

(I also renamed the fields to drop the table1:: and table2:: prefixes, since they are no longer needed.)

I could also do this:

A = FOREACH final GENERATE 
    table1::id AS id,
    value AS value,
    color AS color;

This does not raise an error, because value and color are unambiguous names: each one appears in only one of the joined relations.
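To illustrate the ambiguity rule, here is a minimal sketch that assumes final is the joined relation from the question. The relation name B is hypothetical, and the commented-out statement is the one Pig is expected to reject, since an unqualified id matches both table1::id and table2::id.

```pig
-- Assumption: `final` is the result of joining table1 and table2 on id,
-- with the schema shown by DESCRIBE above.

-- Expected to fail: unqualified `id` matches two fields in the joined schema.
-- B = FOREACH final GENERATE id, value, color;

-- Qualifying only the ambiguous field is enough; value and color each
-- match exactly one field, so they resolve without a prefix.
B = FOREACH final GENERATE
    table1::id AS id,
    value AS value,
    color AS color;

-- Check that the prefixes are gone from the resulting schema.
DESCRIBE B;
```

Running DESCRIBE after the projection is a cheap way to confirm the field names before using the relation downstream.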

【Discussion】:

【Solution 2】:

The complete Pig script:

grunt> table1 = LOAD 'table1_input_path' USING PigStorage(',') AS (id:int, value:int);
grunt> table2 = LOAD 'table2_input_path' USING PigStorage(',') AS (id:int, color:chararray);
grunt> joinlevel = JOIN table1 BY id, table2 BY id;
grunt> final = FOREACH joinlevel GENERATE table1::id AS id, table1::value AS value, table2::color AS color;
grunt> DUMP final;
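As an alternative sketch, Pig's positional notation ($0, $1, ...) can select the same fields without the relation prefixes. This assumes joinlevel is the joined relation from the script above and that its fields appear in the order table1::id, table1::value, table2::id, table2::color; if the load schemas change, the positions change with them.

```pig
-- Assumption: joinlevel has four fields in this order:
--   $0 = table1::id, $1 = table1::value, $2 = table2::id, $3 = table2::color
-- Skipping $2 drops the duplicate id column.
final = FOREACH joinlevel GENERATE
    $0 AS id,
    $1 AS value,
    $3 AS color;

-- Inspect the schema to confirm the three renamed fields.
DESCRIBE final;
```

Named references such as table1::id are usually easier to read and survive column reordering, so the positional form is best kept for quick interactive work.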

【Discussion】: