【Title】: Dataframe Struct fieldType to Array of fields except last field on Pyspark
【Posted】: 2021-12-17 04:35:16
【Description】:

I have a Spark dataframe with the following schema:

 root
  |-- stat_chiamate: struct (nullable = true)
  |    |-- chiamate_ricevute: struct (nullable = true)
  |    |    |-- h_0: string (nullable = true)
  |    |    |-- h_1: string (nullable = true)
  |    |    |-- h_10: string (nullable = true)
  |    |    |-- h_11: string (nullable = true)
  |    |    |-- h_12: string (nullable = true)
  |    |    |-- h_13: string (nullable = true)
  |    |    |-- h_14: string (nullable = true)
  |    |    |-- h_15: string (nullable = true)
  |    |    |-- h_16: string (nullable = true)
  |    |    |-- h_17: string (nullable = true)
  |    |    |-- h_18: string (nullable = true)
  |    |    |-- h_19: string (nullable = true)
  |    |    |-- h_2: string (nullable = true)
  |    |    |-- h_20: string (nullable = true)
  |    |    |-- h_21: string (nullable = true)
  |    |    |-- h_22: string (nullable = true)
  |    |    |-- h_23: string (nullable = true)
  |    |    |-- h_3: string (nullable = true)
  |    |    |-- h_4: string (nullable = true)
  |    |    |-- h_5: string (nullable = true)
  |    |    |-- h_6: string (nullable = true)
  |    |    |-- h_7: string (nullable = true)
  |    |    |-- h_8: string (nullable = true)
  |    |    |-- h_9: string (nullable = true)
  |    |    |-- n_totale: string (nullable = true)

I would like a dataframe like this:

   stat_chiamate: struct (nullable = true)
     |
    chiamate_ricevute: Array
         |-- element(String)

where chiamate_ricevute is a list of the field values, for example:

h_0= 0
h_1= 1
h_2= 2
.
.
.
h_23=23
n_totale=412

I want:

[0,1,2....,23]  <-- I don't want n_totale values

In my code I use df.select("stat_chiamate.chiamate_ricevute.*").schema.fieldNames()[:-1], but that only gives me the field names; how can I actually use them?

    df = df.select(
        F.array(*[field for field in df.select("stat_chiamate.chiamate_ricevute.*").schema.fieldNames()
                  if field.startswith("h_")]).alias("CIRCO")
    )

【Comments】:

    Tags: dataframe apache-spark pyspark struct


    【Solution 1】:

    You can use the dataframe's schema, specifically the schema of the struct, to extract all the field names except n_totale and then wrap them into a single array.

    from pyspark.sql import functions as f

    # Collect the field names of the struct (the first column's type), skipping n_totale
    fields = ['chiamate_ricevute.' + field.name
              for field in df.schema[0].dataType
              if field.name != 'n_totale']

    # Wrap the selected fields into a single array column
    result = df.select(f.array(fields).alias("chiamate_ricevute"))
    
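    If chiamate_ricevute is nested inside a top-level stat_chiamate struct, as the schema printed in the question suggests, the same idea should work with the full field path; a sketch under that assumption:

    from pyspark.sql import functions as f

    # Schema of the inner chiamate_ricevute struct, reached through the
    # top-level stat_chiamate struct column (assumed from the question's schema)
    inner_struct = df.schema['stat_chiamate'].dataType['chiamate_ricevute'].dataType

    fields = ['stat_chiamate.chiamate_ricevute.' + field.name
              for field in inner_struct
              if field.name != 'n_totale']

    result = df.select(f.array(fields).alias("chiamate_ricevute"))
    result.show(truncate=False)

    Note that the array keeps the order in which the fields appear in the struct (alphabetical in the schema above: h_0, h_1, h_10, ...), not numeric hour order.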

    【Discussion】:
