[Question Title]: pyspark: AnalysisException when joining two data frames
[Posted]: 2017-06-21 19:18:04
[Question Description]:

I have two data frames created from sparkSQL:

df1 = sqlContext.sql(""" ...""")
df2 = sqlContext.sql(""" ...""")

I tried to join the two data frames on the my_id column, like this:

from pyspark.sql.functions import col

combined_df = df1.join(df2, col("df1.my_id") == col("df2.my_id"), 'inner')

Then I got the error below. Any idea what I'm missing? Thanks!

AnalysisException                         Traceback (most recent call last)
<ipython-input-11-45f5313387cc> in <module>()
      3 from pyspark.sql.functions import col
      4 
----> 5 combined_df = df1.join(df2, col("df1.my_id") == col("df2.my_id"), 'inner')
      6 combined_df.take(10)

/usr/local/spark-latest/python/pyspark/sql/dataframe.py in join(self, other, on, how)
    770                 how = "inner"
    771             assert isinstance(how, basestring), "how should be basestring"
--> 772             jdf = self._jdf.join(other._jdf, on, how)
    773         return DataFrame(jdf, self.sql_ctx)
    774 

/usr/local/spark-latest/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135         for temp_arg in temp_args:

/usr/local/spark-latest/python/pyspark/sql/utils.py in deco(*a, **kw)
     67                                              e.java_exception.getStackTrace()))
     68             if s.startswith('org.apache.spark.sql.AnalysisException: '):
---> 69                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)
     70             if s.startswith('org.apache.spark.sql.catalyst.analysis'):
     71                 raise AnalysisException(s.split(': ', 1)[1], stackTrace)

AnalysisException: "cannot resolve '`df1.my_id`' given input columns: [...

[Question Discussion]:

    Tags: pyspark apache-spark-sql spark-dataframe


    [Solution 1]:

    I think the problem with your code is that you pass "df1.my_id" as a literal column name instead of col('my_id'). Since neither DataFrame has been given the alias df1 or df2, Spark looks for a single column literally named df1.my_id and fails to find one. That's why the error says cannot resolve df1.my_id given input columns.

    You can do the join without importing col at all:

    combined_df = df1.join(df2, df1.my_id == df2.my_id, 'inner')
    

    [Discussion]:

      [Solution 2]:

      Not sure about pyspark, but this should work if the field name is the same in both dataframes:

      combineDf = df1.join(df2, 'my_id', 'outer')
      

      Hope this helps!

      [Discussion]:
