【发布时间】:2012-06-01 00:50:11
【问题描述】:
我是 PIG 初学者(使用 pig 0.10.0),我有一些简单的 JSON,如下所示:
test.json:
{
"from": "1234567890",
.....
"profile": {
"email": "me@domain.com"
.....
}
}
我在猪中进行了一些分组/计数:
>pig -x local
使用以下 PIG 脚本:
REGISTER /pig-udfs/oink.jar;
REGISTER /pig-udfs/json-simple-1.1.jar;
REGISTER /pig-udfs/guava-12.0.jar;
REGISTER /pig-udfs/elephant-bird-2.2.3.jar;
users = LOAD 'test.json' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad=true') as (json:map[]);
domain_user = FOREACH users GENERATE oink.EmailDomainFilter(json#'profile'#'email') as email, json#'from' as user_id;
DUMP domain_user; /* Outputs: (domain.com,1234567890) */
grouped_domain_user = GROUP domain_user BY email;
DUMP grouped_domain_user; /* Outputs: =stuck here= */
基本上,当我尝试转储 grouped_domain_user 时,pig 卡住了,似乎在等待地图输出完成:
2012-05-31 17:45:22,111 [Thread-15] INFO org.apache.hadoop.mapred.Task - Task 'attempt_local_0002_m_000000_0' done.
2012-05-31 17:45:22,119 [Thread-15] INFO org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : null
2012-05-31 17:45:22,123 [Thread-15] INFO org.apache.hadoop.mapred.ReduceTask - ShuffleRamManager: MemoryLimit=724828160, MaxSingleShuffleLimit=181207040
2012-05-31 17:45:22,125 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,125 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,125 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,126 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,126 [Thread-15] INFO org.apache.hadoop.io.compress.CodecPool - Got brand-new decompressor
2012-05-31 17:45:22,128 [Thread for merging on-disk files] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for merging on-disk files
2012-05-31 17:45:22,128 [Thread for merging in memory files] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for merging in memory files
2012-05-31 17:45:22,128 [Thread for merging on-disk files] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread waiting: Thread for merging on-disk files
2012-05-31 17:45:22,129 [Thread-15] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Need another 1 map output(s) where 0 is already in progress
2012-05-31 17:45:22,129 [Thread for polling Map Completion Events] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Thread started: Thread for polling Map Completion Events
2012-05-31 17:45:22,129 [Thread-15] INFO org.apache.hadoop.mapred.ReduceTask - attempt_local_0002_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
2012-05-31 17:45:28,118 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:31,122 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:37,123 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:43,124 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:46,124 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:52,126 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:45:58,127 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
2012-05-31 17:46:01,128 [communication thread] INFO org.apache.hadoop.mapred.LocalJobRunner - reduce > copy >
.... repeats ....
欢迎就为什么会发生这种情况提出建议。
谢谢!
更新
克里斯为我解决了这个问题。我将fs.default.name 等设置为更正pig.properties 中的值,但是我也将HADOOP_CONF_DIR 环境变量设置为指向我的本地Hadoop 安装,并使用<final>true</final> 设置这些相同的值。
很棒的发现,非常感谢。
【问题讨论】:
-
问题 - 就像最近的一篇文章一样,您没有将
fs.default.name和mapred.job.tracker配置属性标记为 final 吗? - stackoverflow.com/questions/10720132/… -
我实际上只是将它们设置在 pig.properties 文件中。我会检查以确保我没有任何 hadoop 版本在路径中徘徊。
-
作为额外的仅供参考,在实时集群上运行的相同脚本运行良好。
-
伟大的克里斯!这就是问题所在。我猜我的自制 hadoop 安装 conf 文件正在被读取(那些参数被设置为 final)。
标签: hadoop apache-pig