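One possible solution is to load the file with the ordinary Pig loader and then pass each line through a UDF to extract the columns. I'll try to put some code together and post it tonight. As promised: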
]$ more cdr.txt
068373748102208100167682477351905149071PLAN1MOCCUST10612287077212:07:1201/01/2012
068373748102208100167682477351905149071PLAN1MTCCUST20600000001312:15:0901/01/2012
068373748102208100167682477351905149071PLAN1SMSCUST10613637193012:18:1801/01/2012
068373748102208100167682477351905149071PLAN1SMSCUST10612899062012:21:0701/01/2012
]$ more cdr.py
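# Jython UDF: returns the slice input[start:nc].
# Note: nc is an end index (exclusive), not a character count.
def mysubstr(input, start, nc):
    return input[start:nc]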
]$ more cdr.pig
REGISTER 'cdr.py' using jython as mysubstr;
A = LOAD 'cdr.txt' AS (inp:chararray);
B = FOREACH A GENERATE
    inp,
    mysubstr.mysubstr(inp, 0, 13),
    mysubstr.mysubstr(inp, 14, 29),
    mysubstr.mysubstr(inp, 30, 42);
DUMP B;
Output:
(068373748102208100167682477351905149071PLAN1MOCCUST10612287077212:07:1201/01/2012,0683737481022,810016768247735,905149071PLA)
(068373748102208100167682477351905149071PLAN1MTCCUST20600000001312:15:0901/01/2012,0683737481022,810016768247735,905149071PLA)
(068373748102208100167682477351905149071PLAN1SMSCUST10613637193012:18:1801/01/2012,0683737481022,810016768247735,905149071PLA)
(068373748102208100167682477351905149071PLAN1SMSCUST10612899062012:21:0701/01/2012,0683737481022,810016768247735,905149071PLA)
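
If you want to check the offsets before running the Pig job, the same slicing can be tried in plain Python first. This is only a rough sketch: `split_fixed_width` and `OFFSETS` are names I made up for illustration, and the offsets are simply the ones passed to `mysubstr` in cdr.pig, not a confirmed record layout.

```python
# Sample record copied from the first line of cdr.txt.
LINE = "068373748102208100167682477351905149071PLAN1MOCCUST10612287077212:07:1201/01/2012"

# (start, end) pairs taken from cdr.pig; end is exclusive, as in a Python slice.
# Assumption: these are illustrative offsets, not the real CDR field layout.
OFFSETS = [(0, 13), (14, 29), (30, 42)]


def mysubstr(text, start, end):
    """Same slice as the UDF in cdr.py."""
    return text[start:end]


def split_fixed_width(text, offsets):
    """Hypothetical helper: apply every (start, end) pair to one line."""
    return [mysubstr(text, start, end) for start, end in offsets]


if __name__ == "__main__":
    print(split_fixed_width(LINE, OFFSETS))
```

With these offsets the characters at index 13 and index 29 do not land in any field, because each slice stops just before its end index. If the real columns are contiguous, the next start should equal the previous end.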