【发布时间】:2015-05-27 03:57:14
【问题描述】:
我正在尝试在使用 python udf 时将文件加载到 pig,我尝试了两种方法:
• (myudf1, sample1.pig):尝试从 python 读取文件,该文件位于我的客户端服务器上。
• (myudf2, sample2.pig):首先从 hdfs 加载文件到 grunt shell,然后将其作为参数传递给 python udf。
myudf1.py
from __future__ import with_statement
def get_words(dir):
stopwords=set()
with open(dir) as f1:
for line1 in f1:
stopwords.update([line1.decode('ascii','ignore').split("\n")[0]])
return stopwords
stopwords=get_words("/home/zhge/uwc/mappings/english_stop.txt")
@outputSchema("findit: int")
def findit(stp):
stp=str(stp)
if stp in stopwords:
return 1
else:
return 0
sample1.pig:
REGISTER '/home/zhge/uwc/scripts/myudf1.py' USING jython as pyudf;
item_title = load '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = limit item_title 1;
S = FOREACH T GENERATE pyudf.findit(title);
DUMP S
我得到:IOError: (2, 'No such file or directory', '/home/zhge/uwc/mappings/english_stop.txt')
对于解决方案 2:
myudf2:
def get_wordlists(wordbag):
stopwords=set()
for t in wordbag:
stopwords.update(t.decode('ascii','ignore'))
return stopwords
@outputSchema("findit: int")
def findit(stopwordbag, stp):
stopwords=get_wordlists(stopwordbag)
stp=str(stp)
if stp in stopwords:
return 1
else:
return 0
Sample2.pig
REGISTER '/home/zhge/uwc/scripts/myudf2.py' USING jython as pyudf;
stops = load '/user/zhge/uwc/mappings/stopwords.txt' AS (stop_w:chararray);
-- this step works fine and i can see the "stops" obejct is loaded to pig
item_title = load '/user/zhge/data/item_title_sample/000000_0' USING PigStorage(',') AS (title:chararray);
T = limit item_title 1;
S = FOREACH T GENERATE pyudf.findit(stops.stop_w, title);
DUMP S;
然后我得到: ERROR org.apache.pig.tools.grunt.Grunt -ERROR 1066: Unable to open iterator for alias S. Backend error : Scalar 在输出中有不止一行。第一个:(a),第二个:(作为
【问题讨论】:
-
对于在寻找ERROR 1066: Unable to open iterator for alias 时发现此帖子的人,这里是generic solution。
标签: python apache-pig user-defined-functions