【问题标题】:Pyspark - Sort dataframe column that contains list of listPyspark - 对包含列表列表的数据框列进行排序
【发布时间】:2017-09-08 00:43:00
【问题描述】:

我有一个如下所示的数据框 -

+----------+-------+-------------------------------------------------+
| WindowID | State |                                         Details |
+----------+-------+-------------------------------------------------+
|        6 |    SD | [[29916,3], [156570,4], [245934,1], [329748,8]] |
|        3 |    CO |              [[524586,2], [1548,3], [527220,1]] |
+----------+-------+-------------------------------------------------+

现在,我想根据列表的第二个元素按降序对Details 列的每一行进行排序。结果应该是 -

+----------+-------+-------------------------------------------------+
| WindowID | State |                                         Details |
+----------+-------+-------------------------------------------------+
|        6 |    SD | [[329748,8], [156570,4], [29916,3], [245934,1]] |
|        3 |    CO |              [[1548,3], [524586,2], [527220,1]] |
+----------+-------+-------------------------------------------------+

我如何在 pyspark 中做到这一点?提前谢谢你。

【问题讨论】:

    标签: python-3.x apache-spark dataframe


    【解决方案1】:

    我不知道您尝试了什么,但请查看以下解决方案,这对您有用。

    from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType
    from pyspark.sql.functions import udf
    
    dfSchema = StructType([StructField('WindowID', IntegerType(), True),
                         StructField('State', StringType(), True),
                         StructField('Details', ArrayType(ArrayType(IntegerType())), True)])
    #["WindowID", "State", "Details"]
    mydf = sqlContext.createDataFrame([[6, 'SD', [[29916,3], [156570,4], [245934,1], [329748,8]]],
    [3, 'CO', [[524586,2], [1548,3], [527220,1]]]], dfSchema)
    mydf.show(truncate = False)
    
    +--------+-----+---------------------------------------------------------------------------------------------------+
    |WindowID|State|Details                                                                                            |
    +--------+-----+---------------------------------------------------------------------------------------------------+
    |6       |SD   |[WrappedArray(29916, 3), WrappedArray(156570, 4), WrappedArray(245934, 1), WrappedArray(329748, 8)]|
    |3       |CO   |[WrappedArray(524586, 2), WrappedArray(1548, 3), WrappedArray(527220, 1)]                          |
    +--------+-----+---------------------------------------------------------------------------------------------------+
    
    def def_sort(x):
            return sorted(x, key=lambda x:x[1], reverse=True)
    
    udf_sort = udf(def_sort, ArrayType(ArrayType(IntegerType())))
    mydf.select("windowID", "State", udf_sort("Details")).show(truncate = False)
    
    
    +--------+-----+---------------------------------------------------------------------------------------------------+
    |windowID|State|PythonUDF#def_sort(Details)                                                                        |
    +--------+-----+---------------------------------------------------------------------------------------------------+
    |6       |SD   |[WrappedArray(329748, 8), WrappedArray(156570, 4), WrappedArray(29916, 3), WrappedArray(245934, 1)]|
    |3       |CO   |[WrappedArray(1548, 3), WrappedArray(524586, 2), WrappedArray(527220, 1)]                          |
    +--------+-----+---------------------------------------------------------------------------------------------------+
    

    【讨论】:

      【解决方案2】:

      我找到了一个解决这个问题的简单技巧 -

      import operator
      
      mydf = sqlContext.createDataFrame([[6, 'SD', [[29916,3], [156570,4], [245934,1], [329748,8]]],
                 [3, 'CO', [[524586,2], [1548,3], [527220,1]]]],
                 ['WindowID', 'State', 'Details']).show(truncate=False)
      
      +----------+-------+-------------------------------------------------+
      | WindowID | State |                                         Details |
      +----------+-------+-------------------------------------------------+
      |        6 |    SD | [[29916,3], [156570,4], [245934,1], [329748,8]] |
      |        3 |    CO |              [[524586,2], [1548,3], [527220,1]] |
      +----------+-------+-------------------------------------------------+
      
      sorted_df = mydf.rdd.map(lambda x: [x[0], x[1], sorted(x[2], \ 
                key=operator.itemgetter(1), reverse=True)]) \
                .toDF(['WindowID', 'State', 'Details']) \
                .show(truncate=False)
      
      +----------+-------+-------------------------------------------------+
      | WindowID | State |                                         Details |
      +----------+-------+-------------------------------------------------+
      |        6 |    SD | [[329748,8], [156570,4], [29916,3], [245934,1]] |
      |        3 |    CO |              [[1548,3], [524586,2], [527220,1]] |
      +----------+-------+-------------------------------------------------+
      

      【讨论】:

        猜你喜欢
        • 2012-07-07
        • 2022-01-14
        • 1970-01-01
        • 2022-01-09
        • 1970-01-01
        • 1970-01-01
        • 2015-04-06
        • 1970-01-01
        • 2021-10-31
        相关资源
        最近更新 更多