【Question Title】: Concatenate two dataframes in PySpark
【Posted】: 2017-11-02 10:52:49
【Question Description】:

I am trying to concatenate two dataframes side by side, like this:

df1:

+---+---+
|  a|  b|
+---+---+
|  a|  b|
|  1|  2|
+---+---+
only showing top 2 rows

df2:

+---+---+
|  c|  d|
+---+---+
|  c|  d|
|  7|  8|
+---+---+
only showing top 2 rows

They have the same number of rows, and I want to do something like:

+---+---+---+---+                
|  a|  b|  c|  d|            
+---+---+---+---+           
|  a|  b|  c|  d|          
|  1|  2|  7|  8|    
+---+---+---+---+

I tried:

df1=df1.withColumn('c', df2.c).collect()

df1=df1.withColumn('d', df2.d).collect()

But it didn't work; it gives me this error:

Traceback (most recent call last):
  File "/usr/hdp/current/spark-client/python/pyspark/sql/utils.py", line 45, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark-client/python/lib/py4j-0.9-src.zip/py4j/protocol.py", line 308, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o2804.withColumn.

Is there a way to do this?

Thanks

【Question Discussion】:

  • rowNumber() would be the way to join.
  • I'm new to pyspark and I don't know how to do that.
  • Have you tried this?

Tags: dataframe pyspark concatenation


【Solution 1】:

Here is an example of @Suresh's proposal: add a row_number column to each dataframe and join on it.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df1 = sqlctx.createDataFrame([('a', 'b'), ('1', '2')], ['a', 'b']) \
    .withColumn("row_number", F.row_number().over(Window.partitionBy().orderBy("a")))
df2 = sqlctx.createDataFrame([('c', 'd'), ('7', '8')], ['c', 'd']) \
    .withColumn("row_number", F.row_number().over(Window.partitionBy().orderBy("c")))

df3 = df1.join(df2, df1.row_number == df2.row_number, 'inner') \
         .select(df1.a, df1.b, df2.c, df2.d)
df3.show()

【Discussion】:

  • Is there a way to do this without changing the row order?