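【Published】: 2018-12-22 19:28:58
【Problem description】:
I want to replace multiple strings in a pyspark rdd. The strings should be replaced in order of length, from longest to shortest. The operation will ultimately run over a large amount of text, so performance matters.
Example of the problem:
In the example below, I want to replace the strings:
replace, text, is
with the following replacements, in the same order (longest to shortest):
replacement1, replacement2, replacement3
That is, if the string replace is found, it should be replaced with replacement1, and in this example it is the first string to be searched for and replaced.
The strings are also stored as a pyspark rdd, as follows: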
+---------+------------------+
| string  | replacement_term |
+---------+------------------+
| replace | replacement1     |
+---------+------------------+
| text    | replacement2     |
+---------+------------------+
| is      | replacement3     |
+---------+------------------+
Here is a sample of the rdd whose text needs the replacements above applied:
+----+-----------------------------------------+
| id | text                                    |
+----+-----------------------------------------+
| 1  | here is some text to replace with terms |
+----+-----------------------------------------+
| 2  | text to replace with terms              |
+----+-----------------------------------------+
| 3  | text                                    |
+----+-----------------------------------------+
| 4  | here is some text to replace            |
+----+-----------------------------------------+
| 5  | text to replace                         |
+----+-----------------------------------------+
After the replacement, I want to produce an rdd like this:
+----+----------------------------------------------------------------+
| id | text                                                           |
+----+----------------------------------------------------------------+
| 1  | here replacement3 some replacement2 to replacement1 with terms |
+----+----------------------------------------------------------------+
| 2  | replacement2 to replacement1 with terms                        |
+----+----------------------------------------------------------------+
| 3  | replacement2                                                   |
+----+----------------------------------------------------------------+
| 4  | here replacement3 some replacement2 to replacement1            |
+----+----------------------------------------------------------------+
| 5  | replacement2 to replacement1                                   |
+----+----------------------------------------------------------------+
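For illustration, the desired transformation can be sketched in plain Python, so the logic can be checked without a cluster. This is only one possible approach, not a confirmed solution: it builds a single regular expression from the search strings, ordered longest first, and replaces everything in one pass over each text. The helper name `build_replacer` is hypothetical, and the sketch assumes plain substring matching (no word boundaries) and a term list small enough to hold in memory.

```python
import re


def build_replacer(pairs):
    """Return a function that applies all replacements in a single pass.

    `pairs` is an iterable of (search_string, replacement) tuples.
    """
    mapping = dict(pairs)
    if not mapping:
        # Nothing to replace: return the text unchanged.
        return lambda text: text
    # Longest search strings first, so a longer term is tried before a
    # shorter term that is contained in it.
    ordered = sorted(mapping, key=len, reverse=True)
    # re.escape keeps search strings literal even if they contain regex syntax.
    pattern = re.compile("|".join(re.escape(term) for term in ordered))
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)


# Sample data from the question.
terms = [
    ("replace", "replacement1"),
    ("text", "replacement2"),
    ("is", "replacement3"),
]
rows = [
    (1, "here is some text to replace with terms"),
    (2, "text to replace with terms"),
    (3, "text"),
    (4, "here is some text to replace"),
    (5, "text to replace"),
]

replace_terms = build_replacer(terms)
result = [(row_id, replace_terms(text)) for row_id, text in rows]
```

In Spark, one possible arrangement is to `collect()` the term rdd on the driver, share the list with `sc.broadcast`, and apply the returned function with `mapValues` on a pair rdd of (id, text). That assumes the term list fits comfortably in driver memory. Because the text is scanned once, already-inserted replacements (such as replacement1, which itself contains replace) are not matched again.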
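Thanks for your help.
【Comments】:
-
So if you have an rdd with the replacement terms predefined, what does "replace these strings in order of length" mean? Can't they all be replaced in one go?
-
Two of the strings in my problem can conflict. For example, "is" and "is not" both contain is. In my use case I want the longer string to take precedence. Hope that makes sense.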
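The conflict described in this comment can be reproduced with a small Python sketch. In Python's `re` module, alternatives separated by `|` are tried from left to right and the first one that matches is used, so the order of the search strings decides which term wins. The replacement values `IS` and `IS_NOT` and the sample sentence are made up for illustration.

```python
import re

# Hypothetical replacements for the two conflicting search strings.
mapping = {"is": "IS", "is not": "IS_NOT"}
text = "it is not ready, it is late"


def substitute(pattern, text):
    # Replace each match with the replacement for the exact string matched.
    return pattern.sub(lambda m: mapping[m.group(0)], text)


# Shorter term listed first: "is" matches before "is not" is ever tried.
short_first = re.compile("is|is not")
# Longer term listed first: "is not" takes precedence where it occurs.
long_first = re.compile("is not|is")

short_result = substitute(short_first, text)
long_result = substitute(long_first, text)

print(short_result)  # it IS not ready, it IS late
print(long_result)   # it IS_NOT ready, it IS late
```

Sorting the search strings by length, longest first, before joining them into one pattern gives the second behaviour for any term list.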
Tags: python-3.x apache-spark pyspark