【Question Title】: How to change column data type dynamically in PySpark
【Posted】: 2021-06-16 04:24:15
【Question Description】:

I have a fixed-width file that needs to be split by position and data type according to a schema file. How do I change the data types? I could cast each column one by one, but my requirement is to cast them dynamically with PySpark.

**Text file**

"00120181120xyz1234"
"00220180203abc56792"
"00320181203pqr25483"

**Schema file**

{"Column":"id","From":"1","To":"3","dtype":"int"}      
{"Column":"date","From":"4","To":"8","dtype":"date"}    
{"Column":"name","From":"12","To":"3","dtype":"String"}    
{"Column":"salary","From":"15","To":"5","dtype":"int"}
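Reading the schema fields together with `substring(str, pos, len)` below, `From` appears to be a 1-based start position and `To` a field *length*, not an end position. A minimal pure-Python sketch of that slicing (no Spark needed; the `split_line` helper is illustrative, not part of the question's code):

```python
# Schema rows as plain dicts, mirroring the question's text.json
schema = [
    {"Column": "id",     "From": 1,  "To": 3, "dtype": "int"},
    {"Column": "date",   "From": 4,  "To": 8, "dtype": "date"},
    {"Column": "name",   "From": 12, "To": 3, "dtype": "String"},
    {"Column": "salary", "From": 15, "To": 5, "dtype": "int"},
]

def split_line(line, schema):
    # From is 1-based and To is a length, matching Spark's substring(str, pos, len)
    return {f["Column"]: line[f["From"] - 1 : f["From"] - 1 + f["To"]] for f in schema}

print(split_line("00220180203abc56792", schema))
# → {'id': '002', 'date': '20180203', 'name': 'abc', 'salary': '56792'}
```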
 
inputfiledf = spark.read.text("text.dat")   # single 'value' column, one line per row
SchemaFile = spark.read.json("text.json")
sfDict = [row.asDict() for row in SchemaFile.collect()]

**finaldf, split by position**

from pyspark.sql.functions import substring

finaldf = inputfiledf.select(
    *[
        substring(str="value", pos=int(row["From"]), len=int(row["To"])).alias(row["Column"])
        for row in sfDict
    ]
)
finaldf.printSchema()
root
 |-- id: string (nullable = true)
 |-- date: string (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: string (nullable = true)

The expectation is that `date` has type Date and `salary` has type Int. Can we do this dynamically?

【Question Discussion】:

    Tags: python apache-spark pyspark apache-spark-sql


    【Solution 1】:

    You almost have the solution. You only need to add a .cast() inside the list comprehension:

    finaldf = inputfiledf.select(
        *[
            substring(str="value", pos=int(row["From"]), len=int(row["To"]))
            .alias(row["Column"])
            .cast(row["dtype"])
            for row in sfDict
        ]
    )
    

    It might be better to replace sfDict with the following, reading the schema file directly instead of collecting it through Spark:

    import json

    def schema():
        with open("text.json", "r") as f:
            for line in f:
                yield json.loads(line)


    sfDict = schema()
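One thing to keep in mind with this approach: a generator is exhausted after a single pass, so re-create it if you need to iterate over the schema twice. A quick self-contained usage sketch (the temp-file path and the two-line schema are illustrative only):

```python
import json
import os
import tempfile

# Write a hypothetical two-line schema file like the question's text.json
lines = [
    '{"Column":"id","From":"1","To":"3","dtype":"int"}',
    '{"Column":"date","From":"4","To":"8","dtype":"date"}',
]
path = os.path.join(tempfile.mkdtemp(), "text.json")
with open(path, "w") as f:
    f.write("\n".join(lines))

def schema(path):
    # Yield one parsed JSON object per line (JSON Lines format)
    with open(path, "r") as f:
        for line in f:
            yield json.loads(line)

cols = [row["Column"] for row in schema(path)]
print(cols)  # → ['id', 'date']
```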
    

    【Discussion】:

    • After the cast, the date column is null:

      finaldf.show()
      +---+-----+----+------+
      | id|date1|name|salary|
      +---+-----+----+------+
      |001| null| xyz| 12341|
      |002| null| abc| 56792|
      |003| null| pqr| 25483|
      +---+-----+----+------+

      finaldf.printSchema()
      root
       |-- id: string (nullable = true)
       |-- date1: date (nullable = true)
       |-- name: string (nullable = true)
       |-- salary: string (nullable = true)
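    The nulls reported above come from the format mismatch: Spark's cast to date expects ISO `yyyy-MM-dd`, while the sliced field is `yyyyMMdd`. The same mismatch can be seen with plain Python date parsing (`datetime.strptime` is used here purely for illustration):

```python
from datetime import datetime

raw = "20181120"  # what substring() extracts from "00120181120xyz1234"

# ISO-style parsing fails, mirroring Spark's .cast("date") returning null
try:
    datetime.strptime(raw, "%Y-%m-%d")
except ValueError:
    print("not ISO yyyy-MM-dd")

# An explicit format succeeds, mirroring to_date(col, "yyyyMMdd")
print(datetime.strptime(raw, "%Y%m%d").date())  # → 2018-11-20
```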