【Question Title】: How to explode all columns of a DataFrame in PySpark?
【Posted】: 2016-08-09 00:32:24
【Question】:

I have a DataFrame like this:

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|               brand|              diesel|                 e10|                  e5|         houseNumber|                  id|              isOpen|                 lat|                 lng|                name|               place|            postCode|              street|               Datum|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|[TOTAL, ARAL, She...|[1.049, 1.029, 1....|[1.249, 1.209, 1....|[1.269, 1.229, 1....|[49, 12-14, ,  , ...|[4409a024-b190-4b...|[true, true, true...|[50.93128, 50.952...|[6.962356, 6.9616...|[TOTAL KOELN, Ara...|[KOELN, Köln, KOE...|[50676, 50668, 50...|[HOLZMARKT, Riehl...|2016-08-01 10:50:...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+

Basically, all of the columns are arrays; the data comes from nested JSON. I tried to explode it, but that only works for a single column in a select statement. How can I explode all of the columns at once in PySpark while keeping the values at the same index associated with each other?

【Comments】:

    Tags: apache-spark dataframe pyspark


    【Solution 1】:

    You can do this if you first merge everything into a single column, then explode it, and finally split the result back apart.

    from pyspark.sql import functions as F
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, IntegerType

    df = sc.parallelize([[[1, 2, 3, 4], [10, 20, 30, 40]]]).toDF()

    # Zip the two array columns element-wise into one array of pairs.
    zip_arrays = udf(lambda x, y: [[x[i], y[i]] for i in range(len(x))],
                     ArrayType(ArrayType(IntegerType())))

    df.withColumn('merge', zip_arrays(df._1, df._2)) \
      .select('merge', F.explode(col('merge')))
    

    First, df looks like this:

    +------------+----------------+
    |          _1|              _2|
    +------------+----------------+
    |[1, 2, 3, 4]|[10, 20, 30, 40]|
    +------------+----------------+
    

    The last column of the result is then the one you need to split apart:

    +--------------------+-------+
    |               merge|    _c0|
    +--------------------+-------+
    |[WrappedArray(1, ...|[1, 10]|
    |[WrappedArray(1, ...|[2, 20]|
    |[WrappedArray(1, ...|[3, 30]|
    |[WrappedArray(1, ...|[4, 40]|
    +--------------------+-------+
    

    【Discussion】:
