Answers: Flatten a data frame with one parent and one child column

[Question title]: Flatten a data frame with one parent and one child column
[Posted]: 2021-07-30 22:21:09
[Question description]:

I have the following data frame as input.

+--------+---------+
|   child|   parent|
+--------+---------+
|street_1|   city_1|
|street_2|   city_2|
|street_3|   city_1|
|street_4|   city_2|
|  city_2|  state_2|
|  city_1|  state_1|
| state_1|country_1|
| state_2|country_2|
+--------+---------+

The output should be the following.

+---------+-------+------+--------+
|  country|  state|  city|  street|
+---------+-------+------+--------+
|country_1|state_1|city_1|street_3|
|country_1|state_1|city_1|street_1|
|country_2|state_2|city_2|street_4|
|country_2|state_2|city_2|street_2|
+---------+-------+------+--------+

I tried the following approach, but it seems like overkill. Please suggest whether there is a better way.

d1.show()
+--------+---------+
|   child|   parent|
+--------+---------+
|street_1|   city_1|
|street_2|   city_2|
|street_3|   city_1|
|street_4|   city_2|
|  city_2|  state_2|
|  city_1|  state_1|
| state_1|country_1|
| state_2|country_2|
+--------+---------+

d2.show()
+--------+---------+
| child_2| parent_2|
+--------+---------+
|street_1|   city_1|
|street_2|   city_2|
|street_3|   city_1|
|street_4|   city_2|
|  city_2|  state_2|
|  city_1|  state_1|
| state_1|country_1|
| state_2|country_2|
+--------+---------+


# Root rows: a left join keeps rows whose parent never appears as a child (the countries)
df_with_state = (d1.join(d2, d1['parent'] == d2['child_2'], 'left')
                   .where(d2['child_2'].isNull())
                   .select(d1['parent'].alias('country'), d1['child'].alias('state')))

df_with_city = (d1.join(df_with_state, df_with_state['state'] == d1['parent'], 'inner')
                  .select(*df_with_state.columns, d1['child'].alias('city')))

df_with_street = (d1.join(df_with_city, df_with_city['city'] == d1['parent'], 'inner')
                    .select(*df_with_city.columns, d1['child'].alias('street')))

df_with_street.show()

+---------+-------+------+--------+
|  country|  state|  city|  street|
+---------+-------+------+--------+
|country_1|state_1|city_1|street_3|
|country_1|state_1|city_1|street_1|
|country_2|state_2|city_2|street_4|
|country_2|state_2|city_2|street_2|
+---------+-------+------+--------+
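
The logic of the three joins above can be sketched without Spark. The following is a minimal plain-Python illustration, not PySpark code: it finds the roots (parents that never appear as a child) and then extends each path one level at a time. The names `edges`, `roots`, `expand`, and `flattened` are made up for this sketch, and it assumes the data is small enough to hold in memory.

```python
# Plain-Python illustration (not Spark) of the top-down expansion.
# The edge list mirrors the (child, parent) rows of the input data frame.
edges = [
    ("street_1", "city_1"),
    ("street_2", "city_2"),
    ("street_3", "city_1"),
    ("street_4", "city_2"),
    ("city_2", "state_2"),
    ("city_1", "state_1"),
    ("state_1", "country_1"),
    ("state_2", "country_2"),
]

children = {child for child, _ in edges}
# Roots are parents that never appear as a child (the countries here).
roots = sorted({parent for _, parent in edges} - children)


def expand(paths):
    """Extend every path by one level with each child of its last node."""
    return [
        path + (child,)
        for path in paths
        for child, parent in edges
        if parent == path[-1]
    ]


paths = [(root,) for root in roots]
for _ in range(3):  # three levels below country: state, city, street
    paths = expand(paths)

flattened = sorted(paths)
for row in flattened:
    print(row)
```

Each call to `expand` plays the role of one inner join on `parent`, which is why the Spark version needs one join per level below the root.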

[Comments on the question]:

Tags: apache-spark pyspark apache-spark-sql

[Solution 1]:

Your solution seems reasonable to me. If you want something more readable, you could use this code:

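# df is the input data frame with columns (child, parent);
# toDF renames the columns positionally, giving three views of the same rows
country_df = df.toDF('state', 'country')
state_df = df.toDF('city', 'state')
city_df = df.toDF('street', 'city')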

final_df = country_df.join(state_df, on='state')
final_df = final_df.join(city_df, on='city')

final_df = final_df.select('country', 'state', 'city', 'street')
final_df.sort('country', 'state', 'city', 'street').show()
# +---------+-------+------+--------+
# |  country|  state|  city|  street|
# +---------+-------+------+--------+
# |country_1|state_1|city_1|street_1|
# |country_1|state_1|city_1|street_3|
# |country_2|state_2|city_2|street_2|
# |country_2|state_2|city_2|street_4|
# +---------+-------+------+--------+
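
It may not be obvious why joining three renamed copies of the same data frame yields only complete rows. The plain-Python sketch below (not PySpark; `edges` and `flattened` are names invented for this illustration) mimics the two inner joins as nested loops over the same edge list. A row survives only if it is part of a full three-edge chain street -> city -> state -> country, so partial chains such as (street_1, city_1) used as a "state" row drop out because nothing has `street_1` as its parent.

```python
# Plain-Python illustration (not Spark) of the chained inner self-join.
# The edge list mirrors the (child, parent) rows of the input data frame.
edges = [
    ("street_1", "city_1"),
    ("street_2", "city_2"),
    ("street_3", "city_1"),
    ("street_4", "city_2"),
    ("city_2", "state_2"),
    ("city_1", "state_1"),
    ("state_1", "country_1"),
    ("state_2", "country_2"),
]

flattened = sorted(
    (country, state, city, street)
    for state, country in edges        # country_df view: (state, country)
    for city, city_parent in edges     # state_df view: (city, state)
    if city_parent == state            # join on 'state'
    for street, street_parent in edges  # city_df view: (street, city)
    if street_parent == city           # join on 'city'
)

for row in flattened:
    print(row)
```

This relies on the hierarchy having exactly four levels. With deeper or uneven hierarchies, the same pattern would need one more join per level, or an iterative approach.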

[Comments]: