【Question Title】: Apply withColumn on a PySpark array
【Posted】: 2020-06-05 14:14:28
【Question Description】:

Here is my code:

from pyspark.sql import *

department1 = Row(id='123456', name='Computer Science')
department2 = Row(id='789012', name='Mechanical Engineering')

Employee = Row("firstName", "lastName", "email", "salary")
employee1 = Employee('michael', 'armbrust', 'no-reply@berkeley.edu', 100000)
employee2 = Employee('xiangrui', 'meng', 'no-reply@stanford.edu', 120000)


departmentWithEmployees1 = Row(department=department1, employees=[employee1, employee2])
departmentWithEmployees2 = Row(department=department2, employees=[employee1, employee2])


departmentsWithEmployeesSeq1 = [departmentWithEmployees1, departmentWithEmployees2]
df1 = spark.createDataFrame(departmentsWithEmployeesSeq1)

I want to concatenate firstName and lastName for each element inside the array.

from pyspark.sql import functions as sf
df2 = df1.withColumn("employees.FullName", sf.concat(sf.col('employees.firstName'), sf.col('employees.lastName')))
df2.printSchema()

root
 |-- department: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
 |    |    |-- email: string (nullable = true)
 |    |    |-- salary: long (nullable = true)
 |-- employees.FullName: array (nullable = true)
 |    |-- element: string (containsNull = true)

My new column FullName ends up at the parent level, because withColumn treats "employees.FullName" as a literal top-level column name. How can I get FullName inside each struct of the array, like this:

root
 |-- department: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- name: string (nullable = true)
 |-- employees: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- firstName: string (nullable = true)
 |    |    |-- lastName: string (nullable = true)
 |    |    |-- email: string (nullable = true)
 |    |    |-- salary: long (nullable = true)
 |    |    |-- FullName: string (nullable = true)

【Question Comments】:

    标签: apache-spark pyspark apache-spark-sql pyspark-dataframes


    【Solution 1】:

    One way to do this is to explode your array of structs with inline_outer, build the full name with concat_ws, and then reassemble the structs using array and struct:

    from pyspark.sql import functions as F

    df1.selectExpr("department", "inline_outer(employees)")\
       .withColumn("FullName", F.concat_ws(" ", "firstName", "lastName"))\
       .select("department",
               F.array(F.struct(*[F.col(x) for x in
                                  ['firstName', 'lastName', 'email', 'salary', 'FullName']]))
                .alias("employees")).printSchema()
    
    #root
     #|-- department: struct (nullable = true)
     #|    |-- id: string (nullable = true)
     #|    |-- name: string (nullable = true)
     #|-- employees: array (nullable = false)
     #|    |-- element: struct (containsNull = false)
     #|    |    |-- firstName: string (nullable = true)
     #|    |    |-- lastName: string (nullable = true)
     #|    |    |-- email: string (nullable = true)
     #|    |    |-- salary: long (nullable = true)
     #|    |    |-- FullName: string (nullable = false)
    
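    Note that exploding with inline_outer produces one row per employee, each wrapped back into a single-element array. On Spark 2.4+ a higher-order function can instead rewrite the array in place, keeping the original per-department grouping. A minimal, self-contained sketch (the local SparkSession setup and the single-employee sample data are mine, for illustration):

    ```python
    from pyspark.sql import SparkSession, Row, functions as F

    spark = SparkSession.builder.master("local[1]").appName("fullname").getOrCreate()

    Employee = Row("firstName", "lastName", "email", "salary")
    df1 = spark.createDataFrame([
        Row(department=Row(id='123456', name='Computer Science'),
            employees=[Employee('michael', 'armbrust', 'no-reply@berkeley.edu', 100000)])
    ])

    # transform() maps over each array element without exploding,
    # so each department keeps its full employees array.
    df2 = df1.withColumn(
        "employees",
        F.expr("""
            transform(employees, e -> struct(
                e.firstName, e.lastName, e.email, e.salary,
                concat_ws(' ', e.firstName, e.lastName) as FullName))
        """))
    df2.printSchema()
    ```

    This keeps FullName nested inside each struct of the employees array, matching the desired schema in the question.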

    【Comments】:

    • Thanks for the quick reply @murtihash. Can you recommend some good resources for building and practicing this kind of complex query?
    • @PrakashKumar Other than daily practice, there is no real short and simple resource for mastering this. The way I did it was by answering questions here; you can use a free Databricks Community Edition cluster to prototype your code and practice, and also follow the main PySpark contributors and their answers.