【发布时间】:2020-08-19 21:27:35
【问题描述】:
我在 pyspark (v2.4.5) 数据框中有一个句子列表,其中包含一组匹配的分数。句子和分数采用列表形式。
df=spark.createDataFrame(
[
(1, ['foo1','foo2','foo3'],[0.1,0.5,0.6]), # create your data here, be consistent in the types.
(2, ['bar1','bar2','bar3'],[0.5,0.7,0.7]),
(3, ['baz1','baz2','baz3'],[0.1,0.2,0.3]),
],
['id', 'txt','score'] # add your columns label here
)
df.show()
+---+------------------+---------------+
| id| txt| score|
+---+------------------+---------------+
| 1|[foo1, foo2, foo3]|[0.1, 0.5, 0.6]|
| 2|[bar1, bar2, bar3]|[0.5, 0.7, 0.7]|
| 3|[baz1, baz2, baz3]|[0.1, 0.2, 0.3]|
+---+------------------+---------------+
我想过滤并只返回那些得分 >=0.5 的句子。
+---+------------------+---------------+
| id| txt| score|
+---+------------------+---------------+
| 1| [foo2, foo3]| [0.5, 0.6]|
| 2|[bar1, bar2, bar3]|[0.5, 0.7, 0.7]|
+---+------------------+---------------+
有什么建议吗?
我尝试了pyspark dataframe filter or include based on list,但无法让它在我的实例中运行
【问题讨论】: