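【Question title】: Cast a date to integer in PySpark
【Posted】: 2021-02-28 00:21:56
【Question】:

Is it possible to convert a date column to an integer column in a PySpark DataFrame? I tried two different approaches, but each attempt returns a column full of nulls. What am I missing?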

from pyspark.sql.types import *

# DUMMY DATA
simpleData = [("James",34,"2006-01-01","true","M",3000.60),
    ("Michael",33,"1980-01-10","true","F",3300.80),
    ("Robert",37,"1992-07-01","false","M",5000.50)
  ]

columns = ["firstname","age","jobStartDate","isGraduated","gender","salary"]
df = spark.createDataFrame(data = simpleData, schema = columns)
df=df.withColumn("jobStartDate", df['jobStartDate'].cast(DateType()))

# ATTEMPT 1 with cast()

df=df.withColumn("jobStartDateAsInteger1", df['jobStartDate'].cast(IntegerType()))

# ATTEMPT 2 with selectExpr()

df=df.selectExpr("*","CAST(jobStartDate as int) as jobStartDateAsInteger2")
df.show()

【Comments】:

    Tags: dataframe date apache-spark pyspark casting


    【Answer 1】:

    You can try converting the column to a UNIX timestamp with F.unix_timestamp():

    from pyspark.sql.types import *
    import pyspark.sql.functions as F
    
    # DUMMY DATA
    simpleData = [("James",34,"2006-01-01","true","M",3000.60),
        ("Michael",33,"1980-01-10","true","F",3300.80),
        ("Robert",37,"1992-07-01","false","M",5000.50)
      ]
    
    columns = ["firstname","age","jobStartDate","isGraduated","gender","salary"]
    df = spark.createDataFrame(data = simpleData, schema = columns)
    df=df.withColumn("jobStartDate", df['jobStartDate'].cast(DateType()))
    
    df=df.withColumn("jobStartDateAsInteger1", F.unix_timestamp(df['jobStartDate']))
    df.show()
    
    +---------+---+------------+-----------+------+------+----------------------+
    |firstname|age|jobStartDate|isGraduated|gender|salary|jobStartDateAsInteger1|
    +---------+---+------------+-----------+------+------+----------------------+
    |    James| 34|  2006-01-01|       true|     M|3000.6|            1136073600|
    |  Michael| 33|  1980-01-10|       true|     F|3300.8|             316310400|
    |   Robert| 37|  1992-07-01|      false|     M|5000.5|             709948800|
    +---------+---+------------+-----------+------+------+----------------------+
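
As a sanity check on the values in the table above, here is a minimal sketch in plain Python (no Spark required) that computes the number of seconds from 1970-01-01 to midnight of each sample date. It assumes the timestamps are taken at midnight UTC, which is consistent with the output shown; in Spark the result of F.unix_timestamp() on a date may depend on the session time zone. The helper name epoch_seconds_utc is hypothetical and used only for illustration.

```python
from datetime import date, datetime, timezone


def epoch_seconds_utc(d: date) -> int:
    """Seconds from 1970-01-01T00:00:00Z to midnight UTC on the given date."""
    midnight = datetime(d.year, d.month, d.day, tzinfo=timezone.utc)
    return int(midnight.timestamp())


# The three jobStartDate values from the dummy data.
for text in ("2006-01-01", "1980-01-10", "1992-07-01"):
    print(text, epoch_seconds_utc(date.fromisoformat(text)))
```

Under the UTC assumption, the printed numbers should match the jobStartDateAsInteger1 column above.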
    

    【Comments】:

    • Perfect. I only added a step to get the number of days since 1970-01-01 instead of seconds, but this is exactly what I needed. Thanks! df=df.withColumn("jobStartDateAsInteger1", F.unix_timestamp(df['jobStartDate'])/(24*60*60)); df=df.withColumn("jobStartDateAsInteger1", df['jobStartDateAsInteger1'].cast(IntegerType()))
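
The seconds-to-days step in the comment above can be sketched in plain Python to show the arithmetic and one possible pitfall. This is only an illustration under stated assumptions: the helper names days_from_seconds and days_from_dates are hypothetical, and the utc_offset_hours parameter stands in for a Spark session time zone other than UTC. If midnight is taken in a zone ahead of UTC, dividing by 24*60*60 and truncating to an integer can land one day early, whereas plain calendar subtraction does not depend on the time zone.

```python
from datetime import date, datetime, timedelta, timezone

SECONDS_PER_DAY = 24 * 60 * 60
EPOCH = date(1970, 1, 1)


def days_from_seconds(d: date, utc_offset_hours: int = 0) -> int:
    """Epoch seconds at local midnight, divided by 86400 and truncated."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    midnight = datetime(d.year, d.month, d.day, tzinfo=tz)
    return int(midnight.timestamp() / SECONDS_PER_DAY)


def days_from_dates(d: date) -> int:
    """Calendar days between 1970-01-01 and d, independent of time zone."""
    return (d - EPOCH).days


sample = date(2006, 1, 1)
print(days_from_seconds(sample))      # midnight taken in UTC
print(days_from_seconds(sample, 1))   # midnight taken at UTC+01:00
print(days_from_dates(sample))        # calendar difference
```

In PySpark, F.datediff(end, start) computes a calendar-day difference between two date columns, so it may be worth considering when the session time zone is not guaranteed to be UTC.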