How to change a dataframe column from String type to Double type in PySpark? Answers

【Question title】: How to change a dataframe column from String type to Double type in PySpark?
【Posted】: 2015-11-23 21:59:30
【Question description】:

I have a dataframe with a column of type String. I want to change the column type to Double in PySpark.

This is how I did it:

toDoublefunc = UserDefinedFunction(lambda x: x,DoubleType())
changedTypedf = joindf.withColumn("label",toDoublefunc(joindf['show']))

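I just wanted to know whether this is the right way to do it. When I run the data through logistic regression I get an error, so I wonder whether this is the cause of the trouble.

【Question discussion】:

Tags: python apache-spark dataframe pyspark apache-spark-sql

【Solution 1】:

There is no need for a UDF here. Column already provides a cast method that takes a DataType instance: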

from pyspark.sql.types import DoubleType

changedTypedf = joindf.withColumn("label", joindf["show"].cast(DoubleType()))

or a short string:

changedTypedf = joindf.withColumn("label", joindf["show"].cast("double"))

where the canonical string names (other variations may be supported as well) correspond to the simpleString value. So for atomic types:

from pyspark.sql import types 

for t in ['BinaryType', 'BooleanType', 'ByteType', 'DateType',
          'DecimalType', 'DoubleType', 'FloatType', 'IntegerType',
          'LongType', 'ShortType', 'StringType', 'TimestampType']:
    print(f"{t}: {getattr(types, t)().simpleString()}")

BinaryType: binary
BooleanType: boolean
ByteType: tinyint
DateType: date
DecimalType: decimal(10,0)
DoubleType: double
FloatType: float
IntegerType: int
LongType: bigint
ShortType: smallint
StringType: string
TimestampType: timestamp

and, for example, for complex types:

types.ArrayType(types.IntegerType()).simpleString()

'array<int>'

types.MapType(types.StringType(), types.IntegerType()).simpleString()

'map<string,int>'

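【Discussion】:

Using the col function also works: from pyspark.sql.functions import col; changedTypedf = joindf.withColumn("label", col("show").cast(DoubleType()))
What are the possible values for the cast() argument (the "string" syntax)?
I can't believe how terse the Spark documentation is about the valid strings for data types. The closest reference I could find is: docs.tibco.com/pub/sfire-analyst/7.7.1/doc/html/en-US/….
How do I cast multiple columns at once?
How do I change nullable to false?

【Solution 2】:

Keep the column's name and avoid adding an extra column by using the same name as the input column: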

from pyspark.sql.types import DoubleType
changedTypedf = joindf.withColumn("show", joindf["show"].cast(DoubleType()))

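【Discussion】:

Thanks, I was looking for how to keep the original column name.
Is there a list somewhere of the short string data types that Spark will recognize?
This solution also works nicely in a loop, e.g. from pyspark.sql.types import IntegerType for ftr in ftr_list: df = df.withColumn(f, df[f].cast(IntegerType()))
@Quetzalcoatl Your code is wrong. What is f? Where do you use ftr?
Yes, thanks: 'f' should be 'ftr'. Others probably figured that out.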

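【Solution 3】:

The given answers are enough to solve the problem, but I want to share another way, which may have been introduced in a newer version of Spark (I am not sure about that), so the given answers did not cover it.

We can refer to a column in a Spark statement with the col("column_name") function: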

from pyspark.sql.functions import col
changedTypedf = joindf.withColumn("show", col("show").cast("double"))

【Discussion】:

Thanks! Using 'double' is more elegant than DoubleType(), which may also need to be imported.

【Solution 4】:

PySpark version:

df = <source data>
df.printSchema()

from pyspark.sql.types import IntegerType

# Change column type
df_new = df.withColumn("myColumn", df["myColumn"].cast(IntegerType()))
df_new.printSchema()
df_new.select("myColumn").show()

【Discussion】:

【Solution 5】:

The solution is simple:

toDoublefunc = UserDefinedFunction(lambda x: float(x),DoubleType())
changedTypedf = joindf.withColumn("label",toDoublefunc(joindf['show']))

【Discussion】: