【Question Title】: How to change column data type dynamically in PySpark
【Posted】: 2021-06-16 04:24:15
【Question Description】:

I have a fixed-width file that needs to be split by position and data type according to a schema file. How do I change the data types? I could cast each column one by one, but my requirement is to cast them dynamically with PySpark.

**Text file**

"00120181120xyz1234"
"00220180203abc56792"
"00320181203pqr25483"

**Schema file**

{"Column":"id","From":"1","To":"3","dtype":"int"}      
{"Column":"date","From":"4","To":"8","dtype":"date"}    
{"Column":"name","From":"12","To":"3","dtype":"String"}    
{"Column":"salary","From":"15","To":"5","dtype":"int"}
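Reading the schema fields together with `substring(str, pos, len)` below, `From` appears to be a 1-based start position and `To` a field *length*, not an end position. A minimal pure-Python sketch of that slicing (no Spark needed; the `split_line` helper is illustrative, not part of the question's code):

```python
# Schema rows as plain dicts, mirroring the question's text.json
schema = [
    {"Column": "id",     "From": 1,  "To": 3, "dtype": "int"},
    {"Column": "date",   "From": 4,  "To": 8, "dtype": "date"},
    {"Column": "name",   "From": 12, "To": 3, "dtype": "String"},
    {"Column": "salary", "From": 15, "To": 5, "dtype": "int"},
]

def split_line(line, schema):
    # From is 1-based and To is a length, matching Spark's substring(str, pos, len)
    return {f["Column"]: line[f["From"] - 1 : f["From"] - 1 + f["To"]] for f in schema}

print(split_line("00220180203abc56792", schema))
# → {'id': '002', 'date': '20180203', 'name': 'abc', 'salary': '56792'}
```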
 
inputfiledf = spark.read.text("text.dat")   # single 'value' column, one line per row
SchemaFile = spark.read.json("text.json")
sfDict = [row.asDict() for row in SchemaFile.collect()]

**finaldf, split by position**

from pyspark.sql.functions import substring

finaldf = inputfiledf.select(
    *[
        substring(str="value", pos=int(row["From"]), len=int(row["To"])).alias(row["Column"])
        for row in sfDict
    ]
)
finaldf.printSchema()
root
 |-- id: string (nullable = true)
 |-- date: string (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: string (nullable = true)

The expectation is that `date` has type Date and `salary` has type Int. Can we do this dynamically?

【Question Discussion】:

    Tags: python apache-spark pyspark apache-spark-sql


    【Solution 1】:

    You almost have the solution. You only need to add a .cast() inside the list comprehension:

    finaldf = inputfiledf.select(
        *[
            substring(str="value", pos=int(row["From"]), len=int(row["To"]))
            .alias(row["Column"])
            .cast(row["dtype"])
            for row in sfDict
        ]
    )
    

    It might be better to replace sfDict with the following, reading the schema file directly instead of collecting it through Spark:

    import json

    def schema():
        with open("text.json", "r") as f:
            for line in f:
                yield json.loads(line)


    sfDict = schema()
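One thing to keep in mind with this approach: a generator is exhausted after a single pass, so re-create it if you need to iterate over the schema twice. A quick self-contained usage sketch (the temp-file path and the two-line schema are illustrative only):

```python
import json
import os
import tempfile

# Write a hypothetical two-line schema file like the question's text.json
lines = [
    '{"Column":"id","From":"1","To":"3","dtype":"int"}',
    '{"Column":"date","From":"4","To":"8","dtype":"date"}',
]
path = os.path.join(tempfile.mkdtemp(), "text.json")
with open(path, "w") as f:
    f.write("\n".join(lines))

def schema(path):
    # Yield one parsed JSON object per line (JSON Lines format)
    with open(path, "r") as f:
        for line in f:
            yield json.loads(line)

cols = [row["Column"] for row in schema(path)]
print(cols)  # → ['id', 'date']
```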
    

    【Discussion】:

    • After the cast, the date column is null:

      finaldf.show()
      +---+-----+----+------+
      | id|date1|name|salary|
      +---+-----+----+------+
      |001| null| xyz| 12341|
      |002| null| abc| 56792|
      |003| null| pqr| 25483|
      +---+-----+----+------+

      finaldf.printSchema()
      root
       |-- id: string (nullable = true)
       |-- date1: date (nullable = true)
       |-- name: string (nullable = true)
       |-- salary: string (nullable = true)
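    The nulls reported above come from the format mismatch: Spark's cast to date expects ISO `yyyy-MM-dd`, while the sliced field is `yyyyMMdd`. The same mismatch can be seen with plain Python date parsing (`datetime.strptime` is used here purely for illustration):

```python
from datetime import datetime

raw = "20181120"  # what substring() extracts from "00120181120xyz1234"

# ISO-style parsing fails, mirroring Spark's .cast("date") returning null
try:
    datetime.strptime(raw, "%Y-%m-%d")
except ValueError:
    print("not ISO yyyy-MM-dd")

# An explicit format succeeds, mirroring to_date(col, "yyyyMMdd")
print(datetime.strptime(raw, "%Y%m%d").date())  # → 2018-11-20
```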