【Question Title】: NOT IN clause in Pig
【Posted】: 2017-02-02 09:43:10
【Question】:

I am trying to write the Pig equivalent of this SQL query:

select * from A where A.ID NOT IN (select id from B)

sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
c= FOREACH destnew GENERATE ID;
D=FILTER sourcenew BY NOT ID (c.ID);
 org.apache.pig.tools.pigscript.parser.ParseException: Encountered " <PATH> "D=FILTER "" at line 1, column 1.
Was expecting one of:
<EOF> 
"cat" ...
"clear" ...<EOF>

I get this error when executing the last line. Any help resolving it would be appreciated.

【Question Comments】:

  • Consider grouping the two relations by ID and filtering out the IDs that have no match.
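
The grouping approach suggested in this comment could look roughly like the sketch below. It assumes sourcenew and destnew are already loaded as in the question; the aliases grouped, unmatched, and result are made up for illustration. COGROUP and the built-in IsEmpty are standard Pig, but treat this as an untested outline rather than the commenter's exact intent.

```pig
-- Assumes sourcenew and destnew were loaded as shown in the question.
-- Group both relations by ID: each output row holds the key plus one bag per relation.
grouped = COGROUP sourcenew BY ID, destnew BY ID;

-- Keep only the keys whose destnew bag is empty, i.e. IDs that never occur in destnew.
unmatched = FILTER grouped BY IsEmpty(destnew);

-- Turn the surviving sourcenew bags back into ordinary rows.
result = FOREACH unmatched GENERATE FLATTEN(sourcenew);
DUMP result;
```

Keys that exist only in destnew produce an empty sourcenew bag, and FLATTEN on an empty bag emits no rows, so they drop out of the result without an extra filter.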

Tags: hadoop mapreduce apache-pig


【Solution 1】:

Use a left outer join and keep only the rows where the right-hand side is null.

sourcenew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Source.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
destnew = LOAD 'hdfs://HADOOPMASTER:54310/DVTTest/Destination.txt' USING PigStorage(',') as (ID:int,Name:chararray,FirstName:chararray ,LastName:chararray,Vertical_Name:chararray ,Vertical_ID:chararray,Gender:chararray,DOB:chararray,Degree_Percentage:chararray ,Salary:chararray,StateName:chararray);
c = FOREACH destnew GENERATE ID;
d = JOIN sourcenew BY ID LEFT OUTER,destnew by ID;
e = FILTER d BY destnew::ID IS NULL;  -- after a JOIN, refer to fields as relation::field, not relation.field

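Note: I wrote a sample script with a couple of test files, and the working solution is shown below. In your case, check that the data is being loaded from the files correctly.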

test1.txt

1   abc
2   def
3   ghi
4   jkl
5   mno
6   pqr
7   stu
8   vwx
1   abc
2   def
3   ghi
4   jkl
1   abc
2   def
3   ghi
1   abc
2   def

test2.txt

1
2
3
4

Script

A = LOAD 'test1.txt' USING PigStorage('\t') AS (aid:int,name:chararray);
B = LOAD 'test2.txt' USING PigStorage('\t') AS (bid:int);
C = JOIN A BY aid LEFT OUTER,B BY bid;
D = FILTER C BY bid is null;
DUMP D;

In the example above, records 5, 6, 7, and 8 should appear in the result, because those IDs are not in test2.txt.
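
Because D comes from a join, each surviving row still carries the (null) bid column from B. If only the columns of A are wanted, a projection along these lines should work. This is a sketch that continues the script above; the alias E is made up for illustration.

```pig
-- Continues the script above: D holds the rows of A that have no match in B,
-- each with a trailing null bid column from the left outer join.
-- Project only A's columns, using the relation::field names produced by the JOIN.
E = FOREACH D GENERATE A::aid AS aid, A::name AS name;
DUMP E;
```

The AS clauses rename the fields back to plain aid and name, so later statements do not need the A:: prefix.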

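【Comments】:

  • ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1066: Unable to open iterator for alias d. Backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st: (1), 2nd: (2) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar") @inquisitive_mind
  • I even tried d = FILTER sourcenew BY NOT (sourcenew.ID == c.ID);
  • @Vickyster, I have edited the answer and included an example as well. Hope that helps.
  • @inquistive_mind It worked like a charm, thanks a ton. I had solved it using three intermediate steps :D If you don't mind, I could use your help with one more thing!
  • @inquistive_mind Thank you very much for your effort in helping to solve this. I will mark it as the answer as soon as I have enough reputation to vote. Thanks again!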