[Question Title]: How to create columns from list values in a PySpark dataframe
[Posted]: 2019-03-21 13:02:17
[Question]:

I have a PySpark dataframe that looks like this:

Subscription_id Subscription parameters
5516            ["'catchupNotificationsEnabled': True","'newsNotificationsEnabled': True","'autoDownloadsEnabled': False"]

I need the output dataframe to be:

Subscription_id catchupNotificationsEnabled newsNotificationsEnabled    autoDownloadsEnabled
5516    True    True    False

How can I achieve this in PySpark? I tried several options using UDFs, but none of them worked.

Any help is much appreciated.

[Comments]:

  • Do you know the keys in advance?
  • @pault Yes, there are only these 3 parameters: catchupNotificationsEnabled, newsNotificationsEnabled and autoDownloadsEnabled; the True/False values vary from record to record.
  • Could you provide the DataFrame's schema? Is the type of "Subscription parameters" StructType() or ArrayType()? (or something else)

Tags: apache-spark dataframe pyspark


[Solution 1]:

You can use something like the following:

>>> df.show()
+---------------+-----------------------+
|Subscription_id|Subscription_parameters|
+---------------+-----------------------+
|           5516|   ["'catchupNotific...|
+---------------+-----------------------+

>>> 
>>> df1 = df.select('Subscription_id')
>>> 
>>> data = df.select('Subscription_parameters').rdd.map(list).collect()
>>> data = [i[0][1:-1].split(',') for i in data]
>>> data = {i.split(':')[0][2:-1]:i.split(':')[1].strip()[:-1] for i in data[0]}
>>> 
>>> df2 = spark.createDataFrame(sc.parallelize([data]))
>>> 
>>> df3 = df1.crossJoin(df2)
>>> 
>>> df3.show()
+---------------+--------------------+---------------------------+------------------------+
|Subscription_id|autoDownloadsEnabled|catchupNotificationsEnabled|newsNotificationsEnabled|
+---------------+--------------------+---------------------------+------------------------+
|           5516|               False|                       True|                    True|
+---------------+--------------------+---------------------------+------------------------+
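Note that the dict comprehension above parses only the first collected row (`data[0]`), so it assumes a single-row DataFrame. The same parsing logic can be pulled into a plain Python helper (a sketch; the name `parse_params` is mine, not from the answer), which could then be applied per row, e.g. inside a UDF:

```python
def parse_params(raw):
    """Parse a string like '["\'k1\': True","\'k2\': False"]'
    into {"k1": "True", "k2": "False"}, using the same slicing
    as the answer above."""
    result = {}
    for entry in raw[1:-1].split(','):      # drop [ ] and split the entries
        key, _, value = entry.partition(':')
        # strip the surrounding quote characters from key and value
        result[key.strip()[2:-1]] = value.strip()[:-1]
    return result

raw = ('["\'catchupNotificationsEnabled\': True",'
       '"\'newsNotificationsEnabled\': True",'
       '"\'autoDownloadsEnabled\': False"]')
print(parse_params(raw))
# {'catchupNotificationsEnabled': 'True', 'newsNotificationsEnabled': 'True', 'autoDownloadsEnabled': 'False'}
```

This assumes no entry contains a comma, which holds for the example data.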

[Discussion]:

  • Thanks for the help, everyone. Both solutions work for me!
[Solution 2]:

This assumes your "Subscription parameters" column is of ArrayType().

from pyspark.sql import functions as F
from pyspark.sql import Row
from pyspark.sql import SparkSession

# Get or create a SparkSession (the entry point for DataFrame creation)
spark = SparkSession.builder.getOrCreate()

First, create the DataFrame:

df = spark.createDataFrame([Row(Subscription_id=5516,
                            Subscription_parameters=["'catchupNotificationsEnabled': True",
                                                     "'newsNotificationsEnabled': True",
                                                     "'autoDownloadsEnabled': False"])])

Split this array into three columns by simple indexing:

df = df.select("Subscription_id", 
      F.col("Subscription_parameters")[0].alias("catchupNotificationsEnabled"),
      F.col("Subscription_parameters")[1].alias("newsNotificationsEnabled"),
      F.col("Subscription_parameters")[2].alias("autoDownloadsEnabled"))

Your DataFrame is now split properly; each new column contains a string such as "'catchupNotificationsEnabled': True":

+---------------+---------------------------+------------------------+--------------------+
|Subscription_id|catchupNotificationsEnabled|newsNotificationsEnabled|autoDownloadsEnabled|
+---------------+---------------------------+------------------------+--------------------+
|           5516|       'catchupNotificat...|    'newsNotification...|'autoDownloadsEna...|
+---------------+---------------------------+------------------------+--------------------+

Then I suggest updating the column values by checking whether they contain "True":

df = df.withColumn('catchupNotificationsEnabled',
                  F.when(F.col("catchupNotificationsEnabled").contains("True"), True).otherwise(False))\
        .withColumn('newsNotificationsEnabled',
                   F.when(F.col("newsNotificationsEnabled").contains("True"), True).otherwise(False))\
        .withColumn('autoDownloadsEnabled',
                   F.when(F.col("autoDownloadsEnabled").contains("True"), True).otherwise(False))
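The three withColumn calls apply the same pattern, so with known column names they could also be generated in a loop. The per-value logic reduces to checking whether the string contains "True" (a minimal sketch; `contains_true` is a name I made up):

```python
def contains_true(s):
    """Pure-Python mirror of F.col(c).contains("True")."""
    return "True" in s

# With pyspark available, the equivalent loop would look like (sketch):
# for c in ["catchupNotificationsEnabled", "newsNotificationsEnabled",
#           "autoDownloadsEnabled"]:
#     df = df.withColumn(c, F.when(F.col(c).contains("True"), True)
#                            .otherwise(False))

print(contains_true("'catchupNotificationsEnabled': True"))   # True
print(contains_true("'autoDownloadsEnabled': False"))         # False
```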

The resulting DataFrame is as expected:

+---------------+---------------------------+------------------------+--------------------+
|Subscription_id|catchupNotificationsEnabled|newsNotificationsEnabled|autoDownloadsEnabled|
+---------------+---------------------------+------------------------+--------------------+
|           5516|                       true|                    true|               false|
+---------------+---------------------------+------------------------+--------------------+

PS: If the column is not of ArrayType(), you may need to modify this code slightly. See this question for example.
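If "Subscription parameters" arrives as a single string rather than an array, the bracketed string must first be split into entries before the indexing above works. A pure-Python sketch of that split (`split_params` is a hypothetical name; in Spark itself, F.split / F.regexp_replace on the column would play the same role):

```python
def split_params(raw):
    """Turn '["\'a\': True","\'b\': False"]' into ["'a': True", "'b': False"]."""
    # drop the surrounding [ ], split on commas, strip the double quotes
    return [e.strip().strip('"') for e in raw[1:-1].split(',')]

# With pyspark, the same idea as a column expression might look like (sketch):
# df = df.withColumn(
#     "Subscription_parameters",
#     F.split(F.regexp_replace("Subscription_parameters", r'[\[\]"]', ''), ','))

print(split_params('["\'a\': True","\'b\': False"]'))
# ["'a': True", "'b': False"]
```

This again assumes no entry contains an embedded comma.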

[Discussion]:
