Join two rows iteratively to create a new table in Spark with one row for each two rows: answers

【Question title】: Join two rows iteratively to create a new table in Spark with one row for each two rows
【Posted】: 2022-12-01 14:41:09
【Question description】:

I have a table that I want to walk through two rows at a time:

id  | col b | message
1   | abc   | hello
2   | abc   | world
3   | abc 1 | morning
4   | abc   | night
... | ...   | ...
100 | abc1  | Monday
101 | abc1  | Tuesday

How do I create the table below in Spark? It should step through the rows two at a time and show the first id together with col b and message from the second row.

Final table will look like this.

id | full message 
1  | 01:02,abc,world
3  | 03:04,abc,night
.. |................
100| 100:101,abc1,Tuesday

Tags: python pandas dataframe apache-spark pyspark

【Solution 1】:

With pandas, you can use:

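import numpy as np

# label rows pairwise by position: 1, 1, 3, 3, 5, 5, ...
group = np.arange(len(df))//2*2+1

(df.astype({'id': 'str'})
   .groupby(group)
   .agg(**{'id': ('id', ':'.join),       # both ids of the pair, e.g. "1:2"
           'first': ('col b', 'first'),  # col b of the first row (use 'last' for the second row)
           'last': ('message', 'last'),  # message of the second row
          })
   .agg(','.join, axis=1)                # glue the three pieces into one string
   .rename_axis('id')                    # name the group label so reset_index yields an 'id' column
   .reset_index(name='full message')
)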

Output:

   id          full message
0   1         1:2,abc,world
1   3       3:4,abc 1,night
2   5  100:101,abc1,Tuesday

【Discussion】:

Can I ask what np.arange(len(df))//2*2+1 does, in terms of letting pandas know to pick the last message to show?
It just generates group labels of the form 1, 3, 5, etc., so that each consecutive pair of rows shares one label.
Thanks. So it aggregates by group and can then show the last message of each group, is that right?
I believe it does what you want.
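
To make the grouping key from the discussion above concrete, here is a minimal pure-Python sketch of the same arithmetic, without numpy or pandas. The sample rows are the ones visible in the question; the helper name `pair_labels` is made up for this illustration and is not part of the answer.

```python
def pair_labels(n):
    """Label positions 0..n-1 so that each consecutive pair shares a label."""
    # i // 2 maps positions 0,1 -> 0 and 2,3 -> 1, ...; then *2+1 gives 1, 3, 5, ...
    return [i // 2 * 2 + 1 for i in range(n)]

# (id, col b, message) rows taken from the question
rows = [
    (1, "abc", "hello"),
    (2, "abc", "world"),
    (3, "abc 1", "morning"),
    (4, "abc", "night"),
    (100, "abc1", "Monday"),
    (101, "abc1", "Tuesday"),
]

labels = pair_labels(len(rows))
print(labels)  # [1, 1, 3, 3, 5, 5]

# Collect the rows under each label, mimicking groupby(group)
groups = {}
for label, row in zip(labels, rows):
    groups.setdefault(label, []).append(row)

# Per group: join the ids, take the first col b and the last message
result = {
    label: ",".join([":".join(str(r[0]) for r in grp), grp[0][1], grp[-1][2]])
    for label, grp in groups.items()
}
print(result)
```

The labels depend only on row position, not on the id values, which is why the last pair (ids 100 and 101) ends up under label 5.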

【Solution 2】:

In PySpark you can use a Window, for example:

from pyspark.sql import Window, functions as F

# frame = the current row plus the row after it, in id order
window = Window.orderBy('id').rowsBetween(Window.currentRow, 1)

(df
.withColumn('ids', F.concat_ws(':', F.first('id').over(window), F.last('id').over(window)))
.withColumn('messages', F.concat_ws(',', F.first('col b').over(window), F.last('message').over(window)))
.withColumn('full_message', F.concat_ws(',', 'ids', 'messages'))
# select only the first entry of each pair, regardless of the id values
.withColumn('seq_id', F.row_number().over(Window.orderBy('id')))
.filter(F.col('seq_id') % 2 != 0)
.select('id', 'full_message')
)

Output:

id  full_message
1   1:2,abc,world
3   3:4,abc 1,night
100 100:101,abc1,Tuesday
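
As a rough illustration of what the window frame and the row-number filter do, the following pure-Python sketch mimics the same steps on a plain list. It is not Spark code: the function name `full_messages` and the tuple layout are assumptions made for this example, and the rows are assumed to be already sorted by id.

```python
# (id, col b, message) rows taken from the question, sorted by id
rows = [
    (1, "abc", "hello"),
    (2, "abc", "world"),
    (3, "abc 1", "morning"),
    (4, "abc", "night"),
    (100, "abc1", "Monday"),
    (101, "abc1", "Tuesday"),
]

def full_messages(rows):
    out = []
    for pos, row in enumerate(rows):
        seq_id = pos + 1             # plays the role of row_number(), which is 1-based
        if seq_id % 2 == 0:
            continue                 # keep only the first row of each pair
        frame = rows[pos:pos + 2]    # current row and the following row, if there is one
        first, last = frame[0], frame[-1]
        ids = f"{first[0]}:{last[0]}"
        out.append((row[0], f"{ids},{first[1]},{last[2]}"))
    return out

for row_id, message in full_messages(rows):
    print(row_id, message)
```

With an odd number of rows, the final frame holds a single row, so in this sketch the last entry pairs that row with itself (for example `3:3`).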
