【Question title】: PySpark: cast nullType field as string under struct type column
【Posted】: 2020-09-15 16:50:37
【Question】:

I have a DataFrame with the following schema. The translation_version field under translations --> languages (no, pt, ...) has type null. I want to cast every translation_version to string. There are 17 languages under translations.

root
|-- translations: struct (nullable = true)
|    |-- no: struct (nullable = true)
|    |    |-- Description: string (nullable = true)
|    |    |-- class: string (nullable = true)
|    |    |-- description: string (nullable = true)
|    |    |-- translation_version: null (nullable = true) // Want to cast as string
|    |-- pt: struct (nullable = true)
|    |    |-- Description: string (nullable = true)
|    |    |-- class: string (nullable = true)
|    |    |-- description: string (nullable = true)
|    |    |-- translation_version: null (nullable = true)
|    |-- fr: struct (nullable = true)
|    |    |-- Description: string (nullable = true)
|    |    |-- class: string (nullable = true)
|    |    |-- description: string (nullable = true)
|    |    |-- translation_version: null (nullable = true)

I tried df = df.na.fill('null'), but it did not change anything. I also tried casting with the following code:

df = df.withColumn("translations", F.col("translations").cast("struct<struct<translation_version: string>>"))

But this returned the following error:

pyspark.sql.utils.ParseException: u"\nmismatched input '<' expecting ':'(line 1, pos 13)\n\n== SQL ==\nstruct<struct<translation_version: string>>\n-------------^^^\n"

Any idea how to cast translation_version to string for every language?

【Question comments】:

    Tags: apache-spark pyspark aws-glue


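    【Solution 1】:

    This should do the trick: cast each language struct to a schema in which translation_version is a string.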

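    from pyspark.sql.functions import col, struct
    from pyspark.sql.types import StructType, StructField, StringType
    
    # Target schema for one language struct: same fields, in the same order,
    # but with translation_version typed as string instead of null.
    schema = StructType([StructField("Description", StringType(), True),
                         StructField("class", StringType(), True),
                         StructField("description", StringType(), True),
                         StructField("translation_version", StringType(), True)
                        ]
                       )
    
    df_1 = (
        df
        # Promote each language struct (fr, pt, no, ...) to a top-level column.
        .select("translations.*")
        # Cast every language struct to the target schema and rebuild translations.
        .withColumn("translations", struct(
            col("fr").cast(schema).alias("fr"),
            col("pt").cast(schema).alias("pt"),
            col("no").cast(schema).alias("no")
                   )
                   )
        # Remove the temporary top-level language columns.
        .drop("fr", "pt", "no")
    )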
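
With 17 languages, listing every cast by hand gets tedious. One possible alternative, sketched here under assumptions, is to take the JSON form of the existing schema, retype every translation_version field, and cast the whole translations column once. The helper name retype_field is hypothetical and not part of PySpark. It only walks nested structs (not arrays or maps) and matches fields by name. The commented usage assumes `df` is the DataFrame from the question and relies on StructType.jsonValue() and StructType.fromJson().

```python
import copy


def retype_field(schema_json, field_name="translation_version", new_type="string"):
    """Return a copy of a StructType JSON dict in which every field called
    `field_name`, at any struct nesting depth, has its type set to `new_type`.

    Hypothetical helper for illustration; the input is left unmodified.
    """
    patched = copy.deepcopy(schema_json)

    def walk(node):
        # Only struct nodes are dicts with a "fields" list; primitive types
        # appear as plain strings and are skipped.
        if isinstance(node, dict) and node.get("type") == "struct":
            for field in node["fields"]:
                if field["name"] == field_name:
                    field["type"] = new_type
                else:
                    walk(field["type"])

    walk(patched)
    return patched


# Possible usage with PySpark (assumes `df` from the question; not run here):
#
#   from pyspark.sql.functions import col
#   from pyspark.sql.types import StructType
#
#   current = df.schema["translations"].dataType  # StructType of the column
#   target = StructType.fromJson(retype_field(current.jsonValue()))
#   df = df.withColumn("translations", col("translations").cast(target))
```

Because the target schema is derived from the existing one, the field order and names stay as they are, which matters since a struct-to-struct cast matches fields by position.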
    

    【Comments】:
