【Posted】: 2020-06-09 15:16:02
【Question】:
I am trying to join two DataFrames in PySpark, shown below:
df1:
+----------+----------+--------------------+-----+
|FIRST_NAME| LAST_NAME|        COMPANY_NAME|CCODE|
+----------+----------+--------------------+-----+
|  Rebbecca|     Didio|Brandt, Jonathan ...|   AU|
|    Stevie|     Hallo|Landrum Temporary...|   US|
|    Mariko|    Stayer| Inabinet, Macre Esq|   BR|
|   Gerardo|    Woodka|Morris Downing & ...|   US|
|     Mayra|      Bena|  Buelt, David L Esq|   CN|
|    Idella|  Scotland|Artesian Ice & Co...|   UK|
|   Sherill|      Klar|        Midway Hotel|   CA|
+----------+----------+--------------------+-----+
df2:
+--------------------+-----------+
|             COUNTRY|COUNTRYCODE|
+--------------------+-----------+
|      United Kingdom|         UK|
|       United States|         US|
|United Arab Emirates|         AE|
|              Canada|         CA|
|              Brazil|         BR|
|               India|         IN|
+--------------------+-----------+
I want to join the two DataFrames on df1.CCODE == df2.COUNTRYCODE, with the column names held in variables, but it does not work:
df1 = df1.alias('df1')
df2 = df2.alias('df2')
tgt_tbl_col='COUNTRYCODE'
src_tbl_col='CCODE'
join_type = 'INNER'
merge_df = df1.join(df2, df2.tgt_tbl_col == df1.src_tbl_col, how=join_type)
Error:
AttributeError: 'DataFrame' object has no attribute 'tgt_tbl_col'
/databricks/spark/python/pyspark/sql/dataframe.py in __getattr__(self, name)
   1332         if name not in self.columns:
   1333             raise AttributeError(
-> 1334                 "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
   1335         jc = self._jdf.apply(name)
   1336         return Column(jc)
However, when I give both columns the same name and run the following command, the join works:
merge_df = df1.join(df2, on=[tgt_tbl_col], how=join_type)
I would appreciate any suggestions on this.
Versions: Apache Spark 2.4.5, Scala 2.11, Python 3.8
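The traceback suggests that `df2.tgt_tbl_col` is looked up as a column literally named `tgt_tbl_col`, rather than as the value stored in the variable. The sketch below reproduces that behaviour without Spark. `FakeFrame` is a hypothetical stand-in written for illustration, not the real `pyspark.sql.DataFrame`; only its `__getattr__` check is modelled on the traceback above, and the `"col:..."` strings are placeholders for real `Column` objects.

```python
class FakeFrame:
    """Hypothetical stand-in for a PySpark DataFrame (illustration only)."""

    def __init__(self, columns):
        self.columns = list(columns)

    def __getattr__(self, name):
        # Same check as in the traceback: the attribute name itself
        # must be one of the column names.
        if name not in self.columns:
            raise AttributeError(
                "'%s' object has no attribute '%s'" % (self.__class__.__name__, name))
        return "col:" + name  # placeholder for a Column object

    def __getitem__(self, name):
        # Bracket access receives the *value* of the variable passed in.
        if name not in self.columns:
            raise KeyError(name)
        return "col:" + name


df2 = FakeFrame(["COUNTRY", "COUNTRYCODE"])
tgt_tbl_col = "COUNTRYCODE"

# Dot access uses the literal name "tgt_tbl_col", which is not a column.
try:
    df2.tgt_tbl_col
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Both of these resolve the variable to "COUNTRYCODE" first.
by_item = df2[tgt_tbl_col]
by_getattr = getattr(df2, tgt_tbl_col)

print(error_message)
print(by_item, by_getattr)
```

Assuming the same reasoning carries over to the real DataFrames, the join condition could be written with bracket access, for example `merge_df = df1.join(df2, df1[src_tbl_col] == df2[tgt_tbl_col], how=join_type)`. That line was not run here and is offered only as a likely fix.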
【Discussion】:
Tags: python pyspark apache-spark-sql