[Posted]: 2021-07-18 06:00:25
[Question]:
I receive data in a file1.dat file, with fields separated by the | character.
109|LK98765|2|18.07.2021|01|abc1|01|abc2|01|abc3
110|LK67665|2|10.10.1987|02|abc1|01|abc2|01|abc3
111|LK43465|2|23.07.2005|03|abc1|01|abc2|01|abc3
112|LK23265|2|13.02.2012|04|abc1|01|abc2|01|abc3
My requirement is to add a header row to the file and convert it to .csv, with , as the field separator.
To achieve this, I wrote the Python code below.
Adding the header:
import csv

def fn_add_header(file_name):
    with open(file_name) as f:
        # the source fields are |-separated, so the reader needs delimiter='|'
        r = csv.reader(f, delimiter='|')
        data = [line for line in r]
    with open(file_name, 'w', newline='') as f:
        w = csv.writer(f, delimiter='|')
        w.writerow(['ID','SEC_NO','SEC_CD','SEC_DATE','SEC_ID1','SEC_DESC1','SEC_ID2','SEC_DESC2','SEC_ID3','SEC_DESC3'])
        w.writerows(data)
To convert the file to csv:
import os
import fnmatch
import shutil

def fn_replace(filename, directory):
    final_file = os.path.join(directory, "file1.csv")
    for file in os.listdir(directory):
        if fnmatch.fnmatch(file.lower(), filename.lower()):
            shutil.copyfile(file, final_file)
            cmd = ["sed", "-i", "-e", 's/|/,/g', final_file]
            ret2, out2, err2 = fn_run_cmd(cmd)
The code above works fine, and I get the converted file:
ID,SEC_NO,SEC_CD,SEC_DATE,SEC_ID1,SEC_DESC1,SEC_ID2,SEC_DESC2,SEC_ID3,SEC_DESC3
109,LK98765,2,18.07.2021,01,abc1,01,abc2,01,abc3
110,LK67665,2,10.10.1987,02,abc1,01,abc2,01,abc3
111,LK43465,2,23.07.2005,03,abc1,01,abc2,01,abc3
112,LK23265,2,13.02.2012,04,abc1,01,abc2,01,abc3
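For reference, the header-adding and delimiter-conversion steps above can also be done in a single pass with Python's csv module, avoiding the external sed call (a minimal sketch; the function name and file paths are illustrative, not part of the original code):

```python
import csv

HEADER = ['ID', 'SEC_NO', 'SEC_CD', 'SEC_DATE', 'SEC_ID1', 'SEC_DESC1',
          'SEC_ID2', 'SEC_DESC2', 'SEC_ID3', 'SEC_DESC3']

def dat_to_csv(src, dst):
    """Read a |-delimited .dat file and write a comma-delimited .csv with a header row."""
    with open(src, newline='') as fin, open(dst, 'w', newline='') as fout:
        reader = csv.reader(fin, delimiter='|')
        writer = csv.writer(fout)  # csv.writer defaults to ',' as the delimiter
        writer.writerow(HEADER)
        writer.writerows(reader)
```

Because the reading and writing both go through the csv module, any field that happened to contain a comma would be quoted correctly, which a plain sed substitution cannot guarantee.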
The problem occurs when reading the converted file1.csv via yml. To read the file I am using the following configuration:
frameworkComponents:
  today_file:
    inputDirectoryPath: <path of the file>
    componentName: today_file
    componentType: inputLoader
    hadoopfileFormat: csv
    csvSep: ','
  selectstmt:
    componentName: selectstmt
    componentType: executeSparlSQL
    sql: |-
      select ID,SEC_NO,
      SEC_CD,SEC_DATE,
      SEC_ID1,SEC_DESC1,
      SEC_ID2,SEC_DESC2,
      SEC_ID3,SEC_DESC3
      from today_file
  write_file:
    componentName: write_file
    componentType: outputWriter
    hadoopfileFormat: avro
    numberofPartition: 1
    outputDirectoryPath: <path of the file>
precedence:
  selectstmt:
    dependsOn:
      today_file: today_file
  write_file:
    dependsOn:
      selectstmt: selectstmt
When I run the yml, I get the following error: Unable to infer schema for CSV. It must be specified manually.
[Comments]:
- When creating the csv.reader, you need to specify the delimiter='|' keyword argument, because the default field separator is a comma.
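The difference the comment describes can be seen directly (a small sketch on an in-memory sample; the variable names are illustrative):

```python
import csv
import io

sample = "109|LK98765|2|18.07.2021\n110|LK67665|2|10.10.1987\n"

# Default csv.reader splits on commas, so each |-delimited line
# comes back as a single field.
rows_default = list(csv.reader(io.StringIO(sample)))

# With delimiter='|', the fields are split as intended.
rows_pipe = list(csv.reader(io.StringIO(sample), delimiter='|'))

print(len(rows_default[0]))  # 1
print(len(rows_pipe[0]))     # 4
```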
Tags: python apache-spark apache-spark-sql