【Posted on】: 2021-07-30 22:21:09
【Problem description】:
I have the following dataframe as input.
+--------+---------+
| child| parent|
+--------+---------+
|street_1| city_1|
|street_2| city_2|
|street_3| city_1|
|street_4| city_2|
| city_2| state_2|
| city_1| state_1|
| state_1|country_1|
| state_2|country_2|
+--------+---------+
The output should look like this:
+---------+-------+------+--------+
| country| state| city| street|
+---------+-------+------+--------+
|country_1|state_1|city_1|street_3|
|country_1|state_1|city_1|street_1|
|country_2|state_2|city_2|street_4|
|country_2|state_2|city_2|street_2|
+---------+-------+------+--------+
I tried the approach below, but it seems like overkill. Please suggest a better way if there is one.
d1.show()
+--------+---------+
| child| parent|
+--------+---------+
|street_1| city_1|
|street_2| city_2|
|street_3| city_1|
|street_4| city_2|
| city_2| state_2|
| city_1| state_1|
| state_1|country_1|
| state_2|country_2|
+--------+---------+
d2.show()
+--------+---------+
| child_2| parent_2|
+--------+---------+
|street_1| city_1|
|street_2| city_2|
|street_3| city_1|
|street_4| city_2|
| city_2| state_2|
| city_1| state_1|
| state_1|country_1|
| state_2|country_2|
+--------+---------+
# Left join d1 against a copy of itself (d2) and keep only rows whose parent
# never appears as a child: those parents are the roots (countries).
df_with_state = d1.join(d2, d1['parent'] == d2['child_2'], 'left') \
    .where(d2['child_2'].isNull()) \
    .select(d1['parent'].alias('country'), d1['child'].alias('state'))

# Walk one level down: states -> cities.
df_with_city = d1.join(df_with_state, df_with_state['state'] == d1['parent'], 'inner') \
    .select(*df_with_state.columns, d1['child'].alias('city'))

# And one more level: cities -> streets.
df_with_street = d1.join(df_with_city, df_with_city['city'] == d1['parent'], 'inner') \
    .select(*df_with_city.columns, d1['child'].alias('street'))
df_with_street.show()
+---------+-------+------+--------+
| country| state| city| street|
+---------+-------+------+--------+
|country_1|state_1|city_1|street_3|
|country_1|state_1|city_1|street_1|
|country_2|state_2|city_2|street_4|
|country_2|state_2|city_2|street_2|
+---------+-------+------+--------+
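For reference, the three chained joins amount to walking each leaf (street) up through its ancestors to the root (country). A minimal plain-Python sketch of that same logic, assuming the sample data above and a fixed depth of four levels:

```python
# child -> parent edges from the question's input dataframe
edges = [
    ("street_1", "city_1"), ("street_2", "city_2"),
    ("street_3", "city_1"), ("street_4", "city_2"),
    ("city_2", "state_2"), ("city_1", "state_1"),
    ("state_1", "country_1"), ("state_2", "country_2"),
]

parent_of = dict(edges)               # lookup table: child -> parent
parents = set(parent_of.values())     # nodes that appear as a parent

# Leaves (streets) are children that never appear as a parent.
leaves = [c for c in parent_of if c not in parents]

rows = []
for street in sorted(leaves):
    city = parent_of[street]          # street -> city
    state = parent_of[city]           # city   -> state
    country = parent_of[state]        # state  -> country
    rows.append((country, state, city, street))

for row in sorted(rows):
    print(row)
```

This confirms the logic outside Spark, but it assumes the hierarchy is exactly four levels deep; the PySpark version above makes the same assumption, since it hard-codes one join per level.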
【Discussion】:
Tags: apache-spark pyspark apache-spark-sql