【Posted on】: 2021-06-16 04:24:15
【Problem Description】:
I have a fixed-width file that has to be split according to the positions and data types given in a schema file. How do I change the data types? I could cast each column by hand, but my requirement is to cast them dynamically with PySpark.
** Text file **
"00120181120xyz1234"
"00220180203abc56792"
"00320181203pqr25483"
** Schema file **
{"Column":"id","From":"1","To":"3","dtype":"int"}
{"Column":"date","From":"4","To":"8","dtype":"date"}
{"Column":"name","From":"12","To":"3","dtype":"String"}
{"Column":"salary","From":"15","To":"5","dtype":"int"}
from pyspark.sql.functions import substring

# read.text keeps each line in a single column named "value",
# which is what the substring() call below refers to
datafile = spark.read.text("text.dat")
SchemaFile = spark.read.json("text.json")
sfDict = [row.asDict() for row in SchemaFile.collect()]  # a list, so it can be iterated more than once
** finaldf: split by position **
finaldf = datafile.select(*[
    # "From" is the start position and "To" is the field length, matching
    # substring's pos/len arguments; the key casing must match the JSON keys
    substring(str='value', pos=int(row['From']), len=int(row['To'])).alias(row['Column'])
    for row in sfDict
])
finaldf.printSchema()
root
 |-- id: string (nullable = true)
 |-- date: string (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: string (nullable = true)
I expect the date column to come out as Date and salary as Int. Can we do this dynamically?
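A minimal sketch of one way to do the dynamic cast, assuming the dtype strings in the schema file map onto Spark SQL type names; the dtype_map dictionary and the yyyyMMdd date format below are assumptions, not from the original post:

from pyspark.sql.functions import col, to_date

# Assumed mapping from the schema file's dtype strings to Spark SQL cast targets
dtype_map = {"int": "int", "String": "string", "date": "date"}

casted = finaldf.select(*[
    # date columns need an explicit parse format (yyyyMMdd is assumed
    # from the sample rows); everything else is a plain cast
    to_date(col(row['Column']), 'yyyyMMdd').alias(row['Column'])
    if row['dtype'] == 'date'
    else col(row['Column']).cast(dtype_map[row['dtype']]).alias(row['Column'])
    for row in sfDict
])
casted.printSchema()

Since the same sfDict list that drove the positional split also drives the casts, new columns added to the schema file would be picked up automatically.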
【Question Discussion】:
Tags: python apache-spark pyspark apache-spark-sql