【Question title】: Flatten a data frame with one parent and one child column
【Posted】: 2021-07-30 22:21:09
【Question】:

I have the following data frame as input.

+--------+---------+
|   child|   parent|
+--------+---------+
|street_1|   city_1|
|street_2|   city_2|
|street_3|   city_1|
|street_4|   city_2|
|  city_2|  state_2|
|  city_1|  state_1|
| state_1|country_1|
| state_2|country_2|
+--------+---------+

The output should look like this:

+---------+-------+------+--------+
|  country|  state|  city|  street|
+---------+-------+------+--------+
|country_1|state_1|city_1|street_3|
|country_1|state_1|city_1|street_1|
|country_2|state_2|city_2|street_4|
|country_2|state_2|city_2|street_2|
+---------+-------+------+--------+

I tried the following approach, but it seems like overkill. Please suggest a better way if there is one.

d1.show()
+--------+---------+
|   child|   parent|
+--------+---------+
|street_1|   city_1|
|street_2|   city_2|
|street_3|   city_1|
|street_4|   city_2|
|  city_2|  state_2|
|  city_1|  state_1|
| state_1|country_1|
| state_2|country_2|
+--------+---------+

d2.show()
+--------+---------+
| child_2| parent_2|
+--------+---------+
|street_1|   city_1|
|street_2|   city_2|
|street_3|   city_1|
|street_4|   city_2|
|  city_2|  state_2|
|  city_1|  state_1|
| state_1|country_1|
| state_2|country_2|
+--------+---------+


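# Top level: rows of d1 whose parent never appears as a child (no match in d2).
df_with_state = d1.join(d2, d1['parent'] == d2['child_2'], 'left').where(d2['child_2'].isNull()).select(d1['parent'].alias('country'), d1['child'].alias('state'))

# Attach cities: rows of d1 whose parent is one of the states found above.
df_with_city = d1.join(df_with_state, df_with_state['state'] == d1['parent'], 'inner').select(*df_with_state.columns, d1['child'].alias('city'))

# Attach streets: rows of d1 whose parent is one of the cities found above.
df_with_street = d1.join(df_with_city, df_with_city['city'] == d1['parent'], 'inner').select(*df_with_city.columns, d1['child'].alias('street'))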

df_with_street.show()

+---------+-------+------+--------+
|  country|  state|  city|  street|
+---------+-------+------+--------+
|country_1|state_1|city_1|street_3|
|country_1|state_1|city_1|street_1|
|country_2|state_2|city_2|street_4|
|country_2|state_2|city_2|street_2|
+---------+-------+------+--------+
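The three joins above can be read as: find the top-level parents (values that never appear as a child), then attach one level of children at a time. Here is a minimal plain-Python sketch of that logic, written without Spark so it can be run anywhere. It assumes the rows are available as `(child, parent)` tuples and that the hierarchy is exactly four levels deep, as in the example; `flatten_from_roots` is a hypothetical helper name, not a Spark API.

```python
rows = [
    ("street_1", "city_1"),
    ("street_2", "city_2"),
    ("street_3", "city_1"),
    ("street_4", "city_2"),
    ("city_2", "state_2"),
    ("city_1", "state_1"),
    ("state_1", "country_1"),
    ("state_2", "country_2"),
]


def flatten_from_roots(rows, depth):
    """Return root-to-leaf paths of length `depth` from (child, parent) pairs."""
    children = {child for child, _ in rows}
    # Roots are parents that never occur as a child
    # (the role of the left join plus isNull filter above).
    roots = sorted({parent for _, parent in rows if parent not in children})
    paths = [(root,) for root in roots]
    # Each pass plays the role of one inner join: extend every path
    # by the children of its last element; paths with no child drop out.
    for _ in range(depth - 1):
        paths = [
            path + (child,)
            for path in paths
            for child, parent in rows
            if parent == path[-1]
        ]
    return paths


for path in flatten_from_roots(rows, depth=4):
    print(path)
```

Each printed tuple is in `(country, state, city, street)` order. The nested loops are quadratic per level, so this is only meant to illustrate the logic on small data, not to replace the joins.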


    Tags: apache-spark pyspark apache-spark-sql


    【Solution 1】:

    Your solution looks reasonable to me. If you want something more readable, you can use the following code:

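    # `df` is the input data frame with columns (child, parent), i.e. d1 in the question.
    # Rename the same two columns once per pair of adjacent levels.
    country_df = df.toDF('state', 'country')
    state_df = df.toDF('city', 'state')
    city_df = df.toDF('street', 'city')
    
    # Inner joins keep only complete country -> state -> city -> street chains.
    final_df = country_df.join(state_df, on='state')
    final_df = final_df.join(city_df, on='city')
    
    final_df = final_df.select('country', 'state', 'city', 'street')
    final_df.sort('country', 'state', 'city', 'street').show()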
    # +---------+-------+------+--------+
    # |  country|  state|  city|  street|
    # +---------+-------+------+--------+
    # |country_1|state_1|city_1|street_1|
    # |country_1|state_1|city_1|street_3|
    # |country_2|state_2|city_2|street_2|
    # |country_2|state_2|city_2|street_4|
    # +---------+-------+------+--------+
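The rename-and-join idea above extends to any fixed list of level names. The sketch below imitates it in plain Python rather than Spark so that it runs without a session. Each "rename" turns the `(child, parent)` pairs into dicts keyed by two adjacent level names, and `inner_join` (a hypothetical helper, not a PySpark function) matches rows on the shared key. It relies on the hierarchy being exactly `len(levels)` deep, so that only complete chains survive all the joins.

```python
rows = [
    ("street_1", "city_1"),
    ("street_2", "city_2"),
    ("street_3", "city_1"),
    ("street_4", "city_2"),
    ("city_2", "state_2"),
    ("city_1", "state_1"),
    ("state_1", "country_1"),
    ("state_2", "country_2"),
]

levels = ["country", "state", "city", "street"]  # top to bottom


def inner_join(left, right, key):
    """Inner-join two lists of dicts on `key` (illustrative stand-in for join(on=key))."""
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in right
        if left_row[key] == right_row[key]
    ]


def flatten(rows, levels):
    # One renamed copy of the edge list per adjacent pair of levels,
    # e.g. {'state': child, 'country': parent} for the top pair.
    renamed = [
        [{lower: child, upper: parent} for child, parent in rows]
        for upper, lower in zip(levels, levels[1:])
    ]
    result = renamed[0]
    for key, table in zip(levels[1:], renamed[1:]):
        # Join on the level the two tables share; unmatched rows drop out.
        result = inner_join(result, table, key)
    return [tuple(row[name] for name in levels) for row in result]


for record in sorted(flatten(rows, levels)):
    print(record)
```

As in the Spark version, intermediate joins temporarily contain mislabeled partial chains (for example a city sitting in the `state` column); these disappear at the last join because a street never appears as a parent.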
    
