【Posted】: 2023-04-05 22:06:01
【Question】:
Hello, I'm fairly new to Pig programming and have run into a problem I'm struggling to solve:
I have 2 datasets
A: (accountId:chararray, title:chararray, genre:chararray)
("A123", "Harry Potter", "Action/Adventure")
("A123", "Sherlock Holmes", "Mystery")
("B456", "James Bond", "Action")
("B456", "Hamlet", "Drama")
B: (accountId:chararray, title:chararray, genre:chararray)
("B456", "Percy Jackson", "Action/Adventure")
("B456", "Elementary", "Mystery")
("A123", "Divergent", "Action")
("A123", "Downton Abbey", "Drama")
The result I want should be
(accountId:chararray, {(), (), ...})
(A123, {("A123", "Harry Potter", "Action/Adventure"),
("A123", "Sherlock Holmes", "Mystery"),
("A123", "Divergent", "Action"),
("A123", "Downton Abbey", "Drama")
})
(B456, {("B456", "James Bond", "Action"),
("B456", "Hamlet", "Drama"),
("B456", "Percy Jackson", "Action/Adventure"),
("B456", "Elementary", "Mystery")
})
What I'm currently doing:
ANS = JOIN A BY accountId, B BY accountId;
But the result looks like
SCHEMA: (accountId:chararray, {(accountId:chararray, title:chararray, genre:chararray), ...})
(B456, {("B456", "James Bond", "Action"),
("B456", "Hamlet", "Drama")}
"B456", {
("B456", "Percy Jackson", "Action/Adventure"),
("B456", "Elementary", "Mystery")
})
Any idea what I might be doing wrong?
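For reference, the desired output stacks the rows of A and B and then collects them per account, which is closer to a UNION followed by a GROUP than to a JOIN (a JOIN pairs each matching row of A with each matching row of B side by side). A minimal sketch, assuming both relations share the three-field schema shown above; the file names `a.csv` and `b.csv` and the comma delimiter are hypothetical placeholders, not part of the original setup:

```pig
-- Hypothetical input paths and delimiter; replace with the real sources.
A = LOAD 'a.csv' USING PigStorage(',')
    AS (accountId:chararray, title:chararray, genre:chararray);
B = LOAD 'b.csv' USING PigStorage(',')
    AS (accountId:chararray, title:chararray, genre:chararray);

-- Stack the rows of both relations; UNION does not match on keys.
AB = UNION A, B;

-- Collect every row that shares an accountId into one bag per account.
ANS = GROUP AB BY accountId;

DESCRIBE ANS;
DUMP ANS;
```

With this approach the grouped relation should have two fields: `group` (the accountId) and a bag named `AB` holding the full tuples from both inputs. If the two sources need to stay distinguishable, `COGROUP A BY accountId, B BY accountId` would be an alternative, at the cost of producing two separate bags per account.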
【Discussion】:
Tags: hadoop join mapreduce tuples apache-pig